SOTAVerified

Benchmarking

Papers

Showing 551-600 of 5548 papers

Title | Status | Hype
Generative CKM Construction using Partially Observed Data with Diffusion Model | Code | 1
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning | Code | 1
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Code | 1
MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
Multi-Behavior Recommendation with Personalized Directed Acyclic Behavior Graphs | Code | 1
PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems | Code | 1
Does your model understand genes? A benchmark of gene properties for biological and text models | Code | 1
Grounding Descriptions in Images informs Zero-Shot Visual Recognition | Code | 1
Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" | Code | 1
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs | Code | 1
Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Code | 1
StackEval: Benchmarking LLMs in Coding Assistance | Code | 1
Multi-Agent Environments for Vehicle Routing Problems | Code | 1
DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models | Code | 1
Introducing Milabench: Benchmarking Accelerators for AI | Code | 1
FM-TS: Flow Matching for Time Series Generation | Code | 1
Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Code | 1
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Code | 1
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Code | 1
LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Code | 1
ROAD-Waymo: Action Awareness at Scale for Autonomous Driving | Code | 1
MIRFLEX: Music Information Retrieval Feature Library for Extraction | Code | 1
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models | Code | 1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking | Code | 1
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Code | 1
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction | Code | 1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Code | 1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1
Survey of Cultural Awareness in Language Models: Text and Beyond | Code | 1
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Code | 1
SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance | Code | 1
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1
Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Code | 1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1
Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Code | 1
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1
RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Code | 1
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Code | 1
Page 12 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified