SOTAVerified

Benchmarking

Papers

Showing 2801-2850 of 5548 papers

Title | Status | Hype
Benchmarking and Validation of Sub-mW 30GHz VG-LNAs in 22nm FDSOI CMOS for 5G/6G Phased-Array Receivers | - | 0
Mahalanobis k-NN: A Statistical Lens for Robust Point-Cloud Registrations | Code | 0
VoiceWukong: Benchmarking Deepfake Voice Detection | - | 0
Benchmarking Sub-Genre Classification For Mainstage Dance Music | - | 0
Ransomware Detection Using Machine Learning in the Linux Kernel | - | 0
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding | Code | 0
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs | - | 0
Selecting Differential Splicing Methods: Practical Considerations | - | 0
Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 | - | 0
RBoard: A Unified Platform for Reproducible and Reusable Recommender System Benchmarks | - | 0
NeIn: Telling What You Don't Want | - | 0
DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection | - | 0
A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0
Quantum Kernel Methods under Scrutiny: A Benchmarking Study | - | 0
Absolute Ranking: An Essential Normalization for Benchmarking Optimization Algorithms | - | 0
Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm | - | 0
Question-Answering Dense Video Events | Code | 0
Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression | - | 0
LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts | Code | 0
InfraLib: Enabling Reinforcement Learning and Decision-Making for Large-Scale Infrastructure Management | - | 0
Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift | - | 0
Benchmarking Spurious Bias in Few-Shot Image Classifiers | Code | 0
PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation | - | 0
NUMOSIM: A Synthetic Mobility Dataset with Anomaly Detection Benchmarks | - | 0
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision | - | 0
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study | Code | 0
Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture | - | 0
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | - | 0
Revisiting Safe Exploration in Safe Reinforcement learning | - | 0
Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | - | 0
A practical generalization metric for deep networks benchmarking | - | 0
Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | - | 0
Accelerating the discovery of steady-states of planetary interior dynamics with machine learning | - | 0
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Code | 0
Understanding the User: An Intent-Based Ranking Dataset | - | 0
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | - | 0
Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Code | 0
Benchmarking foundation models as feature extractors for weakly-supervised computational pathology | - | 0
Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games | - | 0
VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities | Code | 0
Applications in CityLearn Gym Environment for Multi-Objective Control Benchmarking in Grid-Interactive Buildings and Districts | - | 0
Cross-subject Brain Functional Connectivity Analysis for Multi-task Cognitive State Evaluation | - | 0
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | - | 0
Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper | - | 0
BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization | - | 0
FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting | Code | 0
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences | - | 0
Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study | - | 0
Comparative Analysis: Violence Recognition from Videos using Transfer Learning | Code | 0
DHP Benchmark: Are LLMs Good NLG Evaluators? | - | 0
Page 57 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified