SOTAVerified

Benchmarking

Papers

Showing 801-810 of 5548 papers

Title | Status | Hype
Understanding the Limits of Lifelong Knowledge Editing in LLMs | - | 0
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Code | 1
Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders | - | 0
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol | - | 0
FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | - | 0
Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms | - | 0
Benchmarking Reasoning Robustness in Large Language Models | - | 0
Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets | - | 0
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Code | 0
CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified