SOTAVerified

Benchmarking

Papers

Showing 651–700 of 5548 papers

Title | Status | Hype
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Code | 1
Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations | Code | 1
Restore Anything Model via Efficient Degradation Adaptation | Code | 1
SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities | Code | 1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Code | 1
When Heterophily Meets Heterogeneity: Challenges and a New Large-Scale Graph Benchmark | Code | 1
Separable Operator Networks | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Code | 1
Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Code | 1
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1
Training on the Test Task Confounds Evaluation and Emergence | Code | 1
OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Replication in Visual Diffusion Models: A Survey and Outlook | Code | 1
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Code | 1
Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Code | 1
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Code | 1
Occlusion-Aware Seamless Segmentation | Code | 1
FineSurE: Fine-grained Summarization Evaluation using LLMs | Code | 1
AI Agents That Matter | Code | 1
Overcoming Common Flaws in the Evaluation of Selective Classification Systems | Code | 1
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents | Code | 1
GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Code | 1
iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities | Code | 1
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Code | 1
SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Code | 1
MatText: Do Language Models Need More than Text & Scale for Materials Modeling? | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Code | 1
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track | Code | 1
A Closer Look at Mortality Risk Prediction from Electrocardiograms | Code | 1
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models | Code | 1
A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
BeHonest: Benchmarking Honesty in Large Language Models | Code | 1
Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Code | 1
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Code | 1
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Code | 1
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data | Code | 1
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Code | 1
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified