SOTAVerified

Benchmarking

Papers

Showing 1751–1800 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | | 0 |
| Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance | | 0 |
| DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes | | 0 |
| BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research | | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | | 0 |
| Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE | Code | 0 |
| Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control | Code | 0 |
| KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models | | 0 |
| MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks | | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | | 0 |
| Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | | 0 |
| Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI | | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | | 0 |
| Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets | | 0 |
| UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | Code | 0 |
| Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Code | 0 |
| Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | Code | 0 |
| SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation | | 0 |
| InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | | 0 |
| VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | | 0 |
| Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0 |
| Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | Code | 0 |
| VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | Code | 0 |
| AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | | 0 |
| A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | | 0 |
| Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | | 0 |
| NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | | 0 |
| NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation | | 0 |
| ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | | 0 |
| Benchmarking data encoding methods in Quantum Machine Learning | | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | | 0 |
| DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | | 0 |
| Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach | | 0 |
| TransBench: Benchmarking Machine Translation for Industrial-Scale Applications | | 0 |
| A Data-Driven Method to Identify IBRs with Dominant Participation in Sub-Synchronous Oscillations | | 0 |
| SlangDIT: Benchmarking LLMs in Interpretative Slang Translation | | 0 |
| LLM-based Evaluation Policy Extraction for Ecological Modeling | | 0 |
| NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | | 0 |
| SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis | Code | 0 |
| SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas | | 0 |
| Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | Code | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0 |
| CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | | 0 |
| Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings | | 0 |
| Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference | | 0 |
| SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference | | 0 |
| Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | | 0 |
| A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design | | 0 |
Page 36 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |