SOTAVerified

Benchmarking

Papers

Showing 576-600 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Code | 1 |
| Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Code | 1 |
| Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Code | 1 |
| LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Code | 1 |
| ROAD-Waymo: Action Awareness at Scale for Autonomous Driving | Code | 1 |
| MIRFLEX: Music Information Retrieval Feature Library for Extraction | Code | 1 |
| LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models | Code | 1 |
| AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1 |
| Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking | Code | 1 |
| DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Code | 1 |
| LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction | Code | 1 |
| EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Code | 1 |
| DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1 |
| Survey of Cultural Awareness in Language Models: Text and Beyond | Code | 1 |
| LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Code | 1 |
| SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance | Code | 1 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Code | 1 |
| Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1 |
| Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Code | 1 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1 |
| RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Code | 1 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |