SOTAVerified

Benchmarking

Papers

Showing 26-50 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking | Code | 0 |
| LANTERN: A Machine Learning Framework for Lipid Nanoparticle Transfection Efficiency Prediction | Code | 0 |
| Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data | Code | 1 |
| CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks | | 0 |
| TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation | | 0 |
| State and Memory is All You Need for Robust and Reliable AI Agents | | 0 |
| Point Cloud Compression and Objective Quality Assessment: A Survey | | 0 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | | 0 |
| mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale | Code | 0 |
| FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation | | 0 |
| Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation | Code | 0 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1 |
| scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection | | 0 |
| MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans | | 0 |
| FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization | | 0 |
| AI-Driven MRI-based Brain Tumour Segmentation Benchmarking | | 0 |
| inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences | Code | 0 |
| HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Code | 0 |
| Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision | | 0 |
| Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series | Code | 0 |
| A Survey of Predictive Maintenance Methods: An Analysis of Prognostics via Classification and Regression | | 0 |
| BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos | | 0 |
| WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads | Code | 1 |
| Quantitative Benchmarking of Anomaly Detection Methods in Digital Pathology | | 0 |
| MDR-DeePC: Model-Inspired Distributionally Robust Data-Enabled Predictive Control | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |