SOTAVerified

Benchmarking

Papers

Showing 201-225 of 5548 papers

Title | Status | Hype
Progressive Class-level Distillation | | 0
GenSpace: Benchmarking Spatially-Aware Image Generation | | 0
Segmenting France Across Four Centuries | Code | 0
ByzFL: Research Framework for Robust Federated Learning | Code | 1
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | Code | 2
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | | 0
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | | 0
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Code | 1
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | Code | 0
Benchmarking Foundation Models for Zero-Shot Biometric Tasks | | 0
Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | | 0
Bench4KE: Benchmarking Automated Competency Question Generation | Code | 1
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | | 0
Automated Structured Radiology Report Generation | | 0
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0
MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge | | 0
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | | 0
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | | 0
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | | 0
Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems | | 0
Toward Memory-Aided World Models: Benchmarking via Spatial Consistency | Code | 1
VERINA: Benchmarking Verifiable Code Generation | Code | 2
LLM Performance for Code Generation on Noisy Tasks | Code | 0
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified