SOTAVerified

Benchmarking

Papers

Showing 476-500 of 5548 papers

Title | Status | Hype
False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims | Code | 0
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | Code | 0
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
RGB-Event Fusion with Self-Attention for Collision Prediction | Code | 1
Advancing and Benchmarking Personalized Tool Invocation for LLMs | Code | 0
Benchmarking LLMs' Swarm intelligence | Code | 1
Alpha Excel Benchmark |  | 0
Call for Action: towards the next generation of symbolic regression benchmark |  | 0
Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models | Code | 0
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach | Code | 0
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking |  | 0
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2
NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities | Code | 0
Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning |  | 0
NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks | Code | 0
Meta-Black-Box-Optimization through Offline Q-function Learning | Code | 0
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | Code | 0
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | Code | 1
Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking | Code | 0
Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing |  | 0
CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture |  | 0
Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking |  | 0
PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified