Benchmarking

Papers

Showing 1601–1650 of 5548 papers

Title	Date	Tasks	Status
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis	Jun 9, 2025	Action Classification, Benchmarking	Unverified
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments	Jun 9, 2025	Benchmarking, Navigate	Unverified
HuSc3D: Human Sculpture dataset for 3D object reconstruction	Jun 9, 2025	3D Object Reconstruction, 3D Reconstruction	Code Available
REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models	Jun 9, 2025	Benchmarking, Decision Making	Unverified
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning	Jun 9, 2025	Active Learning, Benchmarking	Code Available
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra	Jun 9, 2025	3D Reconstruction, Benchmarking	Unverified
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems	Jun 9, 2025	Attribute, Benchmarking	Code Available
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents	Jun 9, 2025	Benchmarking, Synthetic Data Generation	Unverified
How Far Are We from Optimal Reasoning Efficiency?	Jun 8, 2025	16k, Benchmarking	Code Available
LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping	Jun 7, 2025	Benchmarking, Simultaneous Localization and Mapping	Code Available
Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions	Jun 6, 2025	Benchmarking, State Space Models	Unverified
MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks	Jun 6, 2025	Benchmarking	Code Available
DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection	Jun 6, 2025	Benchmarking, DeepFake Detection	Unverified
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques	Jun 6, 2025	Benchmarking, Model Selection	Unverified
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures	Jun 6, 2025	Benchmarking, CPU	Unverified
Benchmarking Misuse Mitigation Against Covert Adversaries	Jun 6, 2025	Benchmarking, Language Modeling	Code Available
EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition	Jun 5, 2025	Benchmarking, Emotion Recognition	Unverified
FRED: The Florence RGB-Event Drone Dataset	Jun 5, 2025	Benchmarking, Trajectory Forecasting	Unverified
Urania: Differentially Private Insights into AI Use	Jun 5, 2025	Benchmarking, Chatbot	Unverified
BSBench: will your LLM find the largest prime number?	Jun 5, 2025	Benchmarking	Code Available
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems	Jun 5, 2025	Benchmarking, RAG	Unverified
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs	Jun 5, 2025	Benchmarking, Video Understanding	Unverified
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos	Jun 5, 2025	Benchmarking, Mathematical Reasoning	Unverified
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values	Jun 5, 2025	Benchmarking	Unverified
Design of intelligent proofreading system for English translation based on CNN and BERT	Jun 5, 2025	Benchmarking, Machine Translation	Unverified
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model	Jun 5, 2025	Benchmarking, Language Modeling	Unverified
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation	Jun 5, 2025	Benchmarking	Code Available
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx	Jun 5, 2025	2D Pose Estimation, Benchmarking	Unverified
Refer to Anything with Vision-Language Prompts	Jun 5, 2025	Benchmarking, Generalized Referring Expression Segmentation	Unverified
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models	Jun 5, 2025	Benchmarking, Diversity	Unverified
Benchmarking Large Language Models on Homework Assessment in Circuit Analysis	Jun 5, 2025	Benchmarking	Unverified
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models	Jun 4, 2025	Benchmarking, General Knowledge	Code Available
Knowledge-guided Contextual Gene Set Analysis Using Large Language Models	Jun 4, 2025	Benchmarking	Unverified
MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP	Jun 4, 2025	Benchmarking, Language Modelling	Unverified
N^2: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion	Jun 4, 2025	Benchmarking, Causal Inference	Unverified
Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems	Jun 4, 2025	Benchmarking, Code Generation	Unverified
CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking	Jun 4, 2025	Benchmarking, Code Generation	Unverified
MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale	Jun 4, 2025	Benchmarking, Language Modeling	Unverified
Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence	Jun 4, 2025	Benchmarking	Unverified
A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series	Jun 4, 2025	Benchmarking, Irregular Time Series	Code Available
Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset	Jun 4, 2025	3D geometry, Benchmarking	Unverified
FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models	Jun 3, 2025	Benchmarking, Domain Adaptation	Unverified
SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation	Jun 3, 2025	Benchmarking, Style Transfer	Unverified
Tactile MNIST: Benchmarking Active Tactile Perception	Jun 3, 2025	Benchmarking, Scene Understanding	Unverified
AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering	Jun 3, 2025	Benchmarking	Unverified
FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes	Jun 3, 2025	Benchmarking, Feature Engineering	Code Available
CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models	Jun 2, 2025	Benchmarking	Code Available
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists	Jun 2, 2025	Benchmarking, Form	Unverified
FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents	Jun 2, 2025	Benchmarking, Form	Unverified
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code	Jun 2, 2025	Benchmarking, Code Generation	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified