SOTAVerified

Benchmarking

Papers

Showing 651–700 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | | 0 |
| When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | | 0 |
| Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions | | 0 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Code | 1 |
| FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking | | 0 |
| Horizon Scans can be accelerated using novel information retrieval and artificial intelligence tools | | 0 |
| Accelerating IoV Intrusion Detection: Benchmarking GPU-Accelerated vs CPU-Based ML Libraries | | 0 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Code | 2 |
| Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models | | 0 |
| TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images | Code | 0 |
| Scaling Up Resonate-and-Fire Networks for Fast Deep Learning | Code | 0 |
| Benchmarking Federated Machine Unlearning methods for Tabular Data | | 0 |
| Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models | | 0 |
| Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench | Code | 0 |
| LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions | Code | 0 |
| On Benchmarking Code LLMs for Android Malware Analysis | | 0 |
| SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers | Code | 1 |
| Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers | | 0 |
| Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios | | 0 |
| Simple Feedfoward Neural Networks are Almost All You Need for Time Series Forecasting | | 0 |
| Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models | | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | | 0 |
| Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains | Code | 0 |
| RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations | | 0 |
| CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis | | 0 |
| SimBank: from Simulation to Solution in Prescriptive Process Monitoring | | 0 |
| Generalization Bias in Large Language Model Summarization of Scientific Research | | 0 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Code | 1 |
| Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors | Code | 0 |
| Benchmarking Ultra-Low-Power μNPUs | | 0 |
| An Advanced Ensemble Deep Learning Framework for Stock Price Prediction Using VAE, Transformer, and LSTM Model | | 0 |
| LIM: Large Interpolator Model for Dynamic Reconstruction | | 0 |
| Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery | | 0 |
| Benchmarking Deep Learning-Based Methods for Irradiance Nowcasting with Sky Images | | 0 |
| CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? | Code | 0 |
| Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance | | 0 |
| ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition | | 0 |
| GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics | | 0 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1 |
| A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1 |
| CSPO: Cross-Market Synergistic Stock Price Movement Forecasting with Pseudo-volatility Optimization | | 0 |
| Can geometric combinatorics improve RNA branching predictions? | Code | 0 |
| RxRx3-core: Benchmarking drug-target interactions in High-Content Microscopy | | 0 |
| StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs | Code | 3 |
| Benchmarking and optimizing organism wide single-cell RNA alignment methods | Code | 0 |
| TerraTorch: The Geospatial Foundation Models Toolkit | Code | 4 |
| Benchmarking Machine Learning Methods for Distributed Acoustic Sensing | | 0 |
| Reservoir Computing with a Single Oscillating Gas Bubble: Emphasizing the Chaotic Regime | | 0 |
| Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy | | 0 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |