SOTAVerified

Benchmarking

Papers

Showing 251-300 of 5548 papers

Title | Status | Hype
How far are today's time-series models from real-world weather forecasting applications? | Code | 2
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Code | 2
A large-scale multicenter breast cancer DCE-MRI benchmark dataset with expert segmentations | Code | 2
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Code | 2
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models | Code | 2
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models | Code | 2
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Code | 2
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Code | 2
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents | Code | 2
TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Code | 2
Benchmarking and Improving Detail Image Caption | Code | 2
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models | Code | 2
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning | Code | 2
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | Code | 2
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Code | 2
OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs | Code | 2
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | Code | 2
Benchmarking Representations for Speech, Music, and Acoustic Events | Code | 2
HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond | Code | 2
SIDBench: A Python Framework for Reliably Assessing Synthetic Image Detection Methods | Code | 2
Benchmarking Benchmark Leakage in Large Language Models | Code | 2
LongEmbed: Extending Embedding Models for Long Context Retrieval | Code | 2
VBR: A Vision Benchmark in Rome | Code | 2
Revealing data leakage in protein interaction benchmarks | Code | 2
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | Code | 2
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Code | 2
Are large language models superhuman chemists? | Code | 2
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Code | 2
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Code | 2
REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Code | 2
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis | Code | 2
Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks | Code | 2
ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2
Event-Based Motion Magnification | Code | 2
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Code | 2
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Code | 2
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Code | 2
MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models | Code | 2
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Code | 2
InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior | Code | 2
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K | Code | 2
LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology | Code | 2
EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified