SOTAVerified

Benchmarking

Papers

Showing 1751–1775 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | | 0 |
| Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing | | 0 |
| Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | | 0 |
| Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE | Code | 0 |
| Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance | | 0 |
| Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms | | 0 |
| BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research | | 0 |
| Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | | 0 |
| Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control | Code | 0 |
| Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI | | 0 |
| Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | Code | 0 |
| Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets | | 0 |
| Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | | 0 |
| SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation | | 0 |
| VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | Code | 0 |
| Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | Code | 0 |
| Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | | 0 |
| NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | | 0 |
| UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | Code | 0 |
| AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | | 0 |
| VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |