SOTAVerified

Benchmarking

Papers

Showing 2076-2100 of 5548 papers

Title | Status | Hype
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Code | 1
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents | - | 0
Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1
It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives | - | 0
How well it works: Benchmarking performance of GPT models on medical natural language processing tasks | - | 0
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition | - | 0
A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection | - | 0
Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing | - | 0
Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images | - | 0
RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection | Code | 1
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Code | 0
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models | - | 0
AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Code | 1
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Code | 0
Data-driven Power Flow Linearization: Simulation | - | 0
Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture | Code | 0
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Code | 0
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Code | 3
Can Language Models Serve as Text-Based World Simulators? | - | 0
Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking | - | 0
TopoBench: A Framework for Benchmarking Topological Deep Learning | Code | 3
Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking | Code | 1
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Code | 1
ICU-Sepsis: A Benchmark MDP Built from Real Medical Data | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified