SOTAVerified

Benchmarking

Papers

Showing 376–400 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | | 0 |
| Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | Code | 1 |
| Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | Code | 0 |
| TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | Code | 1 |
| Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings | | 0 |
| What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | Code | 1 |
| OSS-Bench: Benchmark Generator for Coding LLMs | Code | 0 |
| Disambiguation in Conversational Question Answering in the Era of LLM: A Survey | | 0 |
| ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models | | 0 |
| CompBench: Benchmarking Complex Instruction-guided Image Editing | | 0 |
| MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | Code | 1 |
| Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | | 0 |
| GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | Code | 2 |
| Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0 |
| LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | Code | 1 |
| GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | | 0 |
| SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | Code | 0 |
| Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges | Code | 0 |
| Benchmarking CFAR and CNN-based Peak Detection Algorithms in ISAC under Hardware Impairments | | 0 |
| Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale | | 0 |
| ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems | | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0 |
| MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |