SOTAVerified

Benchmarking

Papers

Showing 3176–3200 of 5548 papers

Title | Status | Hype
Fluorescent Neuronal Cells v2: Multi-Task, Multi-Format Annotations for Deep Learning in Microscopy | | 0
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | | 0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | | 0
ForamViT-GAN: Exploring New Paradigms in Deep Learning for Micropaleontological Image Analysis | | 0
Forecasting Lithium-Ion Battery Longevity with Limited Data Availability: Benchmarking Different Machine Learning Algorithms | | 0
Forecasting NIFTY 50 benchmark Index using Seasonal ARIMA time series models | | 0
FOR-instance: a UAV laser scanning benchmark dataset for semantic and instance segmentation of individual trees | | 0
FORLAPS: An Innovative Data-Driven Reinforcement Learning Approach for Prescriptive Process Monitoring | | 0
Formal Covariate Benchmarking to Bound Omitted Variable Bias | | 0
FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents | | 0
Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization | | 0
Foundations for learning from noisy quantum experiments | | 0
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate | | 0
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | | 0
Framework and Benchmarks for Combinatorial and Mixed-variable Bayesian Optimization | | 0
FRED: The Florence RGB-Event Drone Dataset | | 0
Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification | | 0
From 2D to 3D: Re-thinking Benchmarking of Monocular Depth Prediction | | 0
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano | | 0
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | | 0
From Code to Play: Benchmarking Program Search for Games Using Large Language Models | | 0
From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks | | 0
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0
From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation | | 0
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | | 0
Page 128 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified