SOTAVerified

Benchmarking

Papers

Showing 2026–2050 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | | 0 |
| Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions | | 0 |
| Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms | | 0 |
| Benchmarking Federated Machine Unlearning methods for Tabular Data | | 0 |
| TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images | Code | 0 |
| Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models | | 0 |
| LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions | Code | 0 |
| Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench | Code | 0 |
| Scaling Up Resonate-and-Fire Networks for Fast Deep Learning | Code | 0 |
| On Benchmarking Code LLMs for Android Malware Analysis | | 0 |
| Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models | | 0 |
| Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers | | 0 |
| Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios | | 0 |
| Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models | | 0 |
| Simple Feedfoward Neural Networks are Almost All You Need for Time Series Forecasting | | 0 |
| RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations | | 0 |
| Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains | Code | 0 |
| CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis | | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | | 0 |
| Generalization Bias in Large Language Model Summarization of Scientific Research | | 0 |
| An Advanced Ensemble Deep Learning Framework for Stock Price Prediction Using VAE, Transformer, and LSTM Model | | 0 |
| LIM: Large Interpolator Model for Dynamic Reconstruction | | 0 |
| Benchmarking Ultra-Low-Power μNPUs | | 0 |
| Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery | | 0 |
| Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors | Code | 0 |
Page 82 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |