SOTAVerified

Benchmarking

Papers

Showing 1601–1650 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | | 0 |
| EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | | 0 |
| HuSc3D: Human Sculpture dataset for 3D object reconstruction | Code | 0 |
| REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models | | 0 |
| The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Code | 0 |
| GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | | 0 |
| CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Code | 0 |
| SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents | | 0 |
| How Far Are We from Optimal Reasoning Efficiency? | Code | 0 |
| LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Code | 0 |
| Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions | | 0 |
| MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | Code | 0 |
| DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection | | 0 |
| Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | | 0 |
| BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures | | 0 |
| Benchmarking Misuse Mitigation Against Covert Adversaries | Code | 0 |
| EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | | 0 |
| FRED: The Florence RGB-Event Drone Dataset | | 0 |
| Urania: Differentially Private Insights into AI Use | | 0 |
| BSBench: will your LLM find the largest prime number? | Code | 0 |
| From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems | | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | | 0 |
| VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos | | 0 |
| A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values | | 0 |
| Design of intelligent proofreading system for English translation based on CNN and BERT | | 0 |
| HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | | 0 |
| Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation | Code | 0 |
| CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx | | 0 |
| Refer to Anything with Vision-Language Prompts | | 0 |
| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | | 0 |
| Benchmarking Large Language Models on Homework Assessment in Circuit Analysis | | 0 |
| HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models | Code | 0 |
| Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | | 0 |
| N^2: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion | | 0 |
| Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems | | 0 |
| CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking | | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | | 0 |
| Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence | | 0 |
| A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Code | 0 |
| Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | | 0 |
| FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models | | 0 |
| SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation | | 0 |
| Tactile MNIST: Benchmarking Active Tactile Perception | | 0 |
| AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering | | 0 |
| FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes | Code | 0 |
| CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | Code | 0 |
| ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists | | 0 |
| FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents | | 0 |
| ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |