Benchmarking

Papers

Showing 376–400 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning	May 19, 2025	Benchmarking	Unverified	0
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities	May 19, 2025	Automated Theorem Proving, Benchmarking	Code Available	1
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning	May 19, 2025	Benchmarking	Code Available	0
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents	May 19, 2025	AI Agent, Benchmarking	Code Available	1
Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings	May 19, 2025	Benchmarking, Combinatorial Optimization	Unverified	0
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization	May 18, 2025	Benchmarking	Code Available	1
OSS-Bench: Benchmark Generator for Coding LLMs	May 18, 2025	Benchmarking	Code Available	0
Disambiguation in Conversational Question Answering in the Era of LLM: A Survey	May 18, 2025	Benchmarking, Conversational Question Answering	Unverified	0
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models	May 18, 2025	Articles, Benchmarking	Unverified	0
CompBench: Benchmarking Complex Instruction-guided Image Editing	May 18, 2025	Benchmarking, Instruction Following	Unverified	0
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks	May 18, 2025	Benchmarking, Medical Visual Question Answering	Code Available	1
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind	May 18, 2025	Benchmarking, Scene Understanding	Unverified	0
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification	May 18, 2025	Benchmarking	Code Available	2
Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025)	May 17, 2025	Benchmarking, Diagnostic	Unverified	0
GenderBench: Evaluation Suite for Gender Biases in LLMs	May 17, 2025	Benchmarking	Code Available	0
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation	May 17, 2025	Benchmarking, Question Answering	Code Available	1
GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation	May 17, 2025	Benchmarking	Unverified	0
SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds	May 17, 2025	Benchmarking, Binary Classification	Code Available	0
Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges	May 16, 2025	Benchmarking, State Estimation	Code Available	0
Benchmarking CFAR and CNN-based Peak Detection Algorithms in ISAC under Hardware Impairments	May 16, 2025	Benchmarking, Integrated sensing and communication	Unverified	0
Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale	May 16, 2025	Benchmarking, TAG	Unverified	0
ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems	May 16, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models	May 16, 2025	Benchmarking, Decision Making	Unverified	0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation	May 16, 2025	Benchmarking, Ethics	Code Available	0
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems	May 16, 2025	Benchmarking, Mixture-of-Experts	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified