SOTAVerified

Benchmarking

Papers

Showing 1801–1850 of 5548 papers

Title | Status | Hype
Benchmarking MOEAs for solving continuous multi-objective RL problems | Code | 0
LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0
HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity | Code | 0
CompBench: Benchmarking Complex Instruction-guided Image Editing | | 0
OSS-Bench: Benchmark Generator for Coding LLMs | Code | 0
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models | | 0
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | | 0
Disambiguation in Conversational Question Answering in the Era of LLM: A Survey | | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0
Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | | 0
SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | Code | 0
GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | | 0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems | | 0
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents | | 0
Benchmarking CFAR and CNN-based Peak Detection Algorithms in ISAC under Hardware Impairments | | 0
Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models | | 0
Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | | 0
VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks | Code | 0
Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale | | 0
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking | Code | 0
CleanPatrick: A Benchmark for Image Data Cleaning | Code | 0
Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark | | 0
Relation Extraction Across Entire Books to Reconstruct Community Networks: The AffilKG Datasets | | 0
Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities | | 0
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0
Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges | Code | 0
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs | Code | 0
ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems | | 0
Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding | | 0
PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language | Code | 0
JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation | | 0
What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs | | 0
Real-World fNIRS-Based Brain-Computer Interfaces: Benchmarking Deep Learning and Classical Models in Interactive Gaming | | 0
DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs | | 0
Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization | | 0
GNN-Suite: a Graph Neural Network Benchmarking Framework for Biomedical Informatics | Code | 0
On the Evaluation of Engineering Artificial General Intelligence | | 0
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | Code | 0
WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models | | 0
VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts | | 0
RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo | | 0
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0
BioVFM-21M: Benchmarking and Scaling Self-Supervised Vision Foundation Models for Biomedical Image Analysis | Code | 0
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | | 0
TARGET: Benchmarking Table Retrieval for Generative Tasks | | 0
A Standardized Benchmark Set of Clustering Problem Instances for Comparing Black-Box Optimizers | | 0
How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference | | 0
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0
ExEBench: Benchmarking Foundation Models on Extreme Earth Events | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified