SOTAVerified

Benchmarking

Papers

Showing 851-900 of 5548 papers

Title | Status | Hype
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors | - | 0
Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Code | 2
Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10 | - | 0
Agentic Mixture-of-Workflows for Multi-Modal Chemical Search | - | 0
Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review | - | 0
Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Code | 1
Modelling Regional Solar Photovoltaic Capacity in Great Britain | - | 0
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation | Code | 1
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | - | 0
BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction | Code | 3
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | - | 0
Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs | Code | 1
Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers | - | 0
CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs | - | 0
OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation | - | 0
Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning | Code | 0
A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization | - | 0
Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking | - | 0
Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement | - | 0
SynthRAD2025 Grand Challenge dataset: generating synthetic CTs for radiotherapy | - | 0
Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties | Code | 0
MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering | Code | 0
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Code | 2
On Neural Inertial Classification Networks for Pedestrian Activity Recognition | - | 0
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation | Code | 4
Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications | - | 0
VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs | - | 0
BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Code | 1
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Code | 0
An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Code | 0
Unmasking Societal Biases in Respiratory Support for ICU Patients through Social Determinants of Health | Code | 0
Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Code | 0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | - | 0
Methods and Trends in Detecting Generated Images: A Comprehensive Review | - | 0
Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation | - | 0
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Code | 0
Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis | - | 0
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Code | 2
Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | - | 0
Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide | - | 0
Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Code | 0
Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | - | 0
Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | - | 0
Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | - | 0
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | - | 0
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2
PredictaBoard: Benchmarking LLM Score Predictability | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified