SOTAVerified

Benchmarking

Papers

Showing 2051-2100 of 5548 papers

Title | Status | Hype
TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs | Code | 3
Beyond Slow Signs in High-fidelity Model Extraction | Code | 0
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
On the Evaluation of Speech Foundation Models for Spoken Language Understanding | - | 0
CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking | - | 0
ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents | - | 0
Decoding the Diversity: A Review of the Indic AI Research Landscape | - | 0
ECBD: Evidence-Centered Benchmark Design for NLP | Code | 0
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Code | 2
DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks | Code | 3
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models | Code | 1
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition | - | 0
A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions | - | 0
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents | Code | 2
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | - | 0
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Code | 2
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Code | 1
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases | - | 0
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | - | 0
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks | Code | 3
Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations | Code | 0
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Code | 1
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents | - | 0
Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1
It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives | - | 0
How well it works: Benchmarking performance of GPT models on medical natural language processing tasks | - | 0
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition | - | 0
A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection | - | 0
Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing | - | 0
Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images | - | 0
RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection | Code | 1
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Code | 0
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models | - | 0
AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Code | 1
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Code | 0
Data-driven Power Flow Linearization: Simulation | - | 0
Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture | Code | 0
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Code | 0
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Code | 3
Can Language Models Serve as Text-Based World Simulators? | - | 0
Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking | - | 0
TopoBench: A Framework for Benchmarking Topological Deep Learning | Code | 3
Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking | Code | 1
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Code | 1
ICU-Sepsis: A Benchmark MDP Built from Real Medical Data | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified