SOTAVerified

Benchmarking

Papers

Showing 551-575 of 5548 papers

Title | Status | Hype
Generative CKM Construction using Partially Observed Data with Diffusion Model | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning | Code | 1
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment | Code | 1
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Code | 1
MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems | Code | 1
Multi-Behavior Recommendation with Personalized Directed Acyclic Behavior Graphs | Code | 1
Grounding Descriptions in Images informs Zero-Shot Visual Recognition | Code | 1
Does your model understand genes? A benchmark of gene properties for biological and text models | Code | 1
Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" | Code | 1
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs | Code | 1
Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Code | 1
Multi-Agent Environments for Vehicle Routing Problems | Code | 1
StackEval: Benchmarking LLMs in Coding Assistance | Code | 1
DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models | Code | 1
Introducing Milabench: Benchmarking Accelerators for AI | Code | 1
FM-TS: Flow Matching for Time Series Generation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified