Benchmarking

Papers

Showing 1601–1625 of 5548 papers

Title	Date	Tasks	Status
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents	Jun 9, 2025	Benchmarking, Synthetic Data Generation	Unverified
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning	Jun 9, 2025	Active Learning, Benchmarking	Code Available
REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models	Jun 9, 2025	Benchmarking, Decision Making	Unverified
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis	Jun 9, 2025	Action Classification, Benchmarking	Unverified
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra	Jun 9, 2025	3D Reconstruction, Benchmarking	Unverified
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems	Jun 9, 2025	Attribute, Benchmarking	Code Available
Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding	Jun 9, 2025	Benchmarking, Video Compression	Unverified
HuSc3D: Human Sculpture dataset for 3D object reconstruction	Jun 9, 2025	3D Object Reconstruction, 3D Reconstruction	Code Available
How Far Are We from Optimal Reasoning Efficiency?	Jun 8, 2025	16k, Benchmarking	Code Available
LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping	Jun 7, 2025	Benchmarking, Simultaneous Localization and Mapping	Code Available
MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks	Jun 6, 2025	Benchmarking	Code Available
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures	Jun 6, 2025	Benchmarking, CPU	Unverified
DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection	Jun 6, 2025	Benchmarking, DeepFake Detection	Unverified
Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions	Jun 6, 2025	Benchmarking, State Space Models	Unverified
Benchmarking Misuse Mitigation Against Covert Adversaries	Jun 6, 2025	Benchmarking, Language Modeling	Code Available
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques	Jun 6, 2025	Benchmarking, Model Selection	Unverified
EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition	Jun 5, 2025	Benchmarking, Emotion Recognition	Unverified
Design of intelligent proofreading system for English translation based on CNN and BERT	Jun 5, 2025	Benchmarking, Machine Translation	Unverified
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model	Jun 5, 2025	Benchmarking, Language Modeling	Unverified
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems	Jun 5, 2025	Benchmarking, RAG	Unverified
BSBench: will your LLM find the largest prime number?	Jun 5, 2025	Benchmarking	Code Available
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs	Jun 5, 2025	Benchmarking, Video Understanding	Unverified
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx	Jun 5, 2025	2D Pose Estimation, Benchmarking	Unverified
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models	Jun 5, 2025	Benchmarking, Diversity	Unverified
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values	Jun 5, 2025	Benchmarking	Unverified


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified