SOTAVerified

Benchmarking

Papers

Showing 326–350 of 5548 papers

Title | Status | Hype
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1
Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control | Code | 0
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | | 0
Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | | 0
Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | | 0
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | Code | 3
SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation | | 0
NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | | 0
Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets | | 0
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | | 0
VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | Code | 0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0
AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | | 0
UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | Code | 0
UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | | 0
Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | | 0
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Code | 0
Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI | | 0
Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | | 0
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | | 0
Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | Code | 0
Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | Code | 0
A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | | 0
Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | | 0
DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified