SOTAVerified

Benchmarking

Papers

Showing 151–175 of 5548 papers

Title | Status | Hype
Refer to Anything with Vision-Language Prompts | | 0
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | | 0
BSBench: will your LLM find the largest prime number? | Code | 0
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation | Code | 0
Design of intelligent proofreading system for English translation based on CNN and BERT | | 0
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems | | 0
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | | 0
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories | Code | 2
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx | | 0
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values | | 0
Urania: Differentially Private Insights into AI Use | | 0
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | | 0
Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | | 0
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models | Code | 0
Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | | 0
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents | Code | 1
MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | | 0
MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | | 0
AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Code | 5
A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Code | 0
CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking | | 0
Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems | | 0
Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence | | 0
N^2: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion | Code | 0
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified