SOTAVerified

Benchmarking

Papers

Showing 2051-2075 of 5548 papers

Title | Status | Hype
On the Evaluation of Speech Foundation Models for Spoken Language Understanding | | 0
Beyond Slow Signs in High-fidelity Model Extraction | Code | 0
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs | Code | 3
CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking | | 0
ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents | | 0
Decoding the Diversity: A Review of the Indic AI Research Landscape | | 0
DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks | Code | 3
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Code | 2
ECBD: Evidence-Centered Benchmark Design for NLP | Code | 0
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents | Code | 2
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition | | 0
A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions | | 0
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models | Code | 1
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | | 0
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Code | 2
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Code | 1
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases | | 0
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | | 0
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks | Code | 3
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Code | 1
Page 83 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified