SOTAVerified

Benchmarking

Papers

Showing 281–290 of 5548 papers

Title | Status | Hype
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Code | 2
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Code | 2
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis | Code | 2
REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Code | 2
Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks | Code | 2
ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
Event-Based Motion Magnification | Code | 2
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified