SOTAVerified

Benchmarking

Papers

Showing 3076-3100 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing | | 0 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Code | 0 |
| DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition | | 0 |
| Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images | | 0 |
| MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models | | 0 |
| INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Code | 0 |
| Can Language Models Serve as Text-Based World Simulators? | | 0 |
| Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking | | 0 |
| Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture | Code | 0 |
| JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Code | 0 |
| Data-driven Power Flow Linearization: Simulation | | 0 |
| Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications | | 0 |
| 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | | 0 |
| GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models | | 0 |
| Deep Jansen-Rit Parameter Inference for Model-Driven Analysis of Brain Activity | Code | 0 |
| Scenarios and Approaches for Situated Natural Language Explanations | | 0 |
| Behavior Structformer: Learning Players Representations with Structured Tokenization | | 0 |
| VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric | Code | 0 |
| Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation | | 0 |
| Better Late Than Never: Formulating and Benchmarking Recommendation Editing | Code | 0 |
| Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation | | 0 |
| Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As | | 0 |
| NATURAL PLAN: Benchmarking LLMs on Natural Language Planning | | 0 |
| BEADs: Bias Evaluation Across Domains | | 0 |
| Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices | | 0 |
Page 124 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |