SOTAVerified

Benchmarking

Papers

Showing 1101–1150 of 5548 papers

Title | Status | Hype
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations | Code | 1
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation | Code | 1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1
Benchmarking LLMs' Swarm intelligence | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Code | 1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
Benchmarking Low-Shot Robustness to Natural Distribution Shifts | Code | 1
Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks | Code | 1
From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Code | 1
Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Code | 1
Deep Learning-Based Synchronization for Uplink NB-IoT | Code | 1
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
Benchmarking machine learning models on multi-centre eICU critical care dataset | Code | 1
3D Common Corruptions and Data Augmentation | Code | 1
Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks | Code | 1
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Code | 1
Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
Benchmarking Meaning Representations in Neural Semantic Parsing | Code | 1
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Code | 1
Benchmarking Meta-embeddings: What Works and What Does Not | Code | 1
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1
Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Code | 1
DFGC 2022: The Second DeepFake Game Competition | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified