SOTAVerified

Benchmarking

Papers

Showing 2026–2050 of 5548 papers

Title | Status | Hype
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models | Code | 2
The Liouville Generator for Producing Integrable Expressions | — | 0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Code | 0
Standardizing Structural Causal Models | Code | 0
Benchmarking of LLM Detection: Comparing Two Competing Approaches | — | 0
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Code | 1
Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | — | 0
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Code | 0
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | — | 0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | — | 0
Evaluating the Performance of Large Language Models via Debates | — | 0
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | — | 0
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models | Code | 2
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | — | 0
GANmut: Generating and Modifying Facial Expressions | — | 0
Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0
Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models | Code | 0
Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | — | 0
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Code | 1
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework | — | 0
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading | Code | 0
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures | Code | 0
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Code | 1
Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming | — | 0
Page 82 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | — | Unverified