SOTAVerified

Benchmarking

Papers

Showing 301-350 of 5548 papers

Title | Status | Hype
Is Single-View Mesh Reconstruction Ready for Robotics? | - | 0
Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | Code | 1
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Code | 0
Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts | - | 0
SEvoBench : A C++ Framework For Evolutionary Single-Objective Optimization Benchmarking | - | 0
Semantic Correspondence: Unified Benchmarking and a Strong Baseline | Code | 1
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1
Wildfire spread forecasting with Deep Learning | Code | 0
DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes | - | 0
Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE | Code | 0
Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms | - | 0
Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | - | 0
BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | - | 0
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models | - | 0
CUB: Benchmarking Context Utilisation Techniques for Language Models | - | 0
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | Code | 3
Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | - | 0
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks | - | 0
Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing | - | 0
Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | Code | 1
Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance | - | 0
BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research | - | 0
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | Code | 1
REOBench: Benchmarking Robustness of Earth Observation Foundation Models | Code | 1
MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | - | 0
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1
Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control | Code | 0
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | - | 0
Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | - | 0
Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | - | 0
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | Code | 3
SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation | - | 0
NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | - | 0
Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets | - | 0
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | - | 0
VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | Code | 0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0
AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | - | 0
UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | Code | 0
UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | - | 0
Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | - | 0
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Code | 0
Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI | - | 0
Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | - | 0
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | - | 0
Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | Code | 0
Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | Code | 0
A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | - | 0
Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | - | 0
DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified