SOTAVerified

Benchmarking

Papers

Showing 2051-2100 of 5548 papers

Title | Status | Hype
LIM: Large Interpolator Model for Dynamic Reconstruction | - | 0
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition | - | 0
Benchmarking Deep Learning-Based Methods for Irradiance Nowcasting with Sky Images | - | 0
CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? | Code | 0
Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance | - | 0
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics | - | 0
CSPO: Cross-Market Synergistic Stock Price Movement Forecasting with Pseudo-volatility Optimization | - | 0
RxRx3-core: Benchmarking drug-target interactions in High-Content Microscopy | - | 0
Benchmarking and optimizing organism wide single-cell RNA alignment methods | Code | 0
Can geometric combinatorics improve RNA branching predictions? | Code | 0
Benchmarking Machine Learning Methods for Distributed Acoustic Sensing | - | 0
Reservoir Computing with a Single Oscillating Gas Bubble: Emphasizing the Chaotic Regime | - | 0
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy | - | 0
Writing as a testbed for open ended agents | - | 0
LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages | Code | 0
Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis | - | 0
EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation | - | 0
Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages | - | 0
Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition | - | 0
Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling | Code | 0
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | - | 0
Regularization of ML models for Earth systems by using longer model timesteps | - | 0
A Study on Neuro-Symbolic Artificial Intelligence: Healthcare Perspectives | - | 0
Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning | Code | 0
CardioTabNet: A Novel Hybrid Transformer Model for Heart Disease Prediction using Tabular Medical Data | - | 0
Benchmark Dataset for Pore-Scale CO2-Water Interaction | - | 0
IceBench: A Benchmark for Deep Learning based Sea Ice Type Classification | Code | 0
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding | Code | 0
CausalRivers -- Scaling up benchmarking of causal discovery for real-world time-series | - | 0
ContextGNN goes to Elliot: Towards Benchmarking Relational Deep Learning for Static Link Prediction (aka Personalized Item Recommendation) | Code | 0
ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph | - | 0
A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough? | - | 0
Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI | - | 0
DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | - | 0
ImputeGAP: A Comprehensive Library for Time Series Imputation | - | 0
Kolmogorov-Arnold Network for Transistor Compact Modeling | - | 0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | - | 0
SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes | - | 0
Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks | - | 0
Language-based Image Colorization: A Benchmark and Beyond | Code | 0
Benchmarking Large Language Models for Handwritten Text Recognition | - | 0
Benchmarking Failures in Tool-Augmented Language Models | Code | 0
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models | Code | 0
COPA: Comparing the Incomparable to Explore the Pareto Front | - | 0
ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models | - | 0
Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis | Code | 0
Stable Virtual Camera: Generative View Synthesis with Diffusion Models | - | 0
HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard | - | 0
Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering | - | 0
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified