SOTAVerified

Benchmarking

Papers

Showing 801-825 of 5548 papers

Title | Status | Hype
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Code | 1
Removing Geometric Bias in One-Class Anomaly Detection with Adaptive Feature Perturbation | Code | 0
Understanding the Limits of Lifelong Knowledge Editing in LLMs | - | 0
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol | - | 0
FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | - | 0
Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms | - | 0
Benchmarking Reasoning Robustness in Large Language Models | - | 0
Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets | - | 0
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Code | 0
CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models | Code | 0
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases | Code | 0
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions | Code | 0
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | - | 0
InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference | - | 0
Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | - | 0
Eventprop training for efficient neuromorphic applications | - | 0
Towards Universal Learning-based Model for Cardiac Image Reconstruction: Summary of the CMRxRecon2024 Challenge | - | 0
UnPuzzle: A Unified Framework for Pathology Image Analysis | Code | 1
GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0
AttackSeqBench: Benchmarking Large Language Models' Understanding of Sequential Patterns in Cyber Attacks | Code | 0
Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems | Code | 0
Technical report of a DMD-based Characterization Method for Vision Sensors | - | 0
Optimizing open-domain question answering with graph-based retrieval augmented generation | - | 0
A2Perf: Real-World Autonomous Agents Benchmark | - | 0
Evaluation of Architectural Synthesis Using Generative AI | - | 0
Page 33 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified