SOTAVerified

Benchmarking

Papers

Showing 1151–1175 of 5548 papers

Title | Status | Hype
An OpenMind for 3D medical vision self-supervised learning | Code | 2
First-frame Supervised Video Polyp Segmentation via Propagative and Semantic Dual-teacher Network | Code | 0
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios | Code | 0
Patherea: Cell Detection and Classification for the 2020s | - | 0
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Code | 0
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage | - | 0
Enriching Social Science Research via Survey Item Linking | Code | 0
Benchmarking LLMs and SLMs for patient reported outcomes | - | 0
Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts | Code | 0
AI-generated Image Quality Assessment in Visual Communication | Code | 0
XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation | Code | 2
TelcoLM: collecting data, adapting, and benchmarking language models for the telecommunication domain | - | 0
Generative CKM Construction using Partially Observed Data with Diffusion Model | Code | 1
TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation | Code | 1
Pitfalls of topology-aware image segmentation | - | 0
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning | - | 0
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge | Code | 0
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning | Code | 1
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Code | 1
Open Universal Arabic ASR Leaderboard | Code | 2
Generation of Large District Heating System Models Using Open-Source Data and Tools: An Exemplary Workflow | - | 0
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment | Code | 1
DateLogicQA: Benchmarking Temporal Biases in Large Language Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified