SOTAVerified

Benchmarking

Papers

Showing 676–700 of 5548 papers

Title | Status | Hype
FineSurE: Fine-grained Summarization Evaluation using LLMs | Code | 1
AI Agents That Matter | Code | 1
Overcoming Common Flaws in the Evaluation of Selective Classification Systems | Code | 1
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents | Code | 1
GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Code | 1
iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities | Code | 1
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Code | 1
SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Code | 1
MatText: Do Language Models Need More than Text & Scale for Materials Modeling? | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Code | 1
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track | Code | 1
A Closer Look at Mortality Risk Prediction from Electrocardiograms | Code | 1
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models | Code | 1
A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
BeHonest: Benchmarking Honesty in Large Language Models | Code | 1
Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Code | 1
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Code | 1
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Code | 1
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data | Code | 1
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Code | 1
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified