SOTAVerified

Benchmarking

Papers

Showing 1651-1700 of 5548 papers

Title | Status | Hype
Benchmarking VLMs' Reasoning About Persuasive Atypical Images | - | 0
Benchmarking Large Language Model Uncertainty for Prompt Optimization | Code | 0
Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data | - | 0
Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering | - | 0
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study | - | 0
Text-To-Speech Synthesis In The Wild | - | 0
ODAQ: Open Dataset of Audio Quality - Benchmark on GitHub | Code | 1
Introducing CausalBench: A Flexible Benchmark Framework for Causal Analysis and Machine Learning | - | 0
Linear energy storage and flexibility model with ramp rate, ramping, deadline and capacity constraints | Code | 0
Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification | - | 0
The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal | - | 0
The JPEG Pleno Learning-based Point Cloud Coding Standard: Serving Man and Machine | - | 0
Improve Machine Learning carbon footprint using Nvidia GPU and Mixed Precision training for classification models -- Part I | Code | 0
Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG | - | 0
Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots | - | 0
Benchmarking and Validation of Sub-mW 30GHz VG-LNAs in 22nm FDSOI CMOS for 5G/6G Phased-Array Receivers | - | 0
Understanding Foundation Models: Are We Back in 1924? | - | 0
Unsupervised Novelty Detection Methods Benchmarking with Wavelet Decomposition | Code | 0
Benchmarking 2D Egocentric Hand Pose Datasets | - | 0
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding | Code | 0
Ransomware Detection Using Machine Learning in the Linux Kernel | - | 0
Benchmarking Sub-Genre Classification For Mainstage Dance Music | - | 0
Mahalanobis k-NN: A Statistical Lens for Robust Point-Cloud Registrations | Code | 0
VoiceWukong: Benchmarking Deepfake Voice Detection | - | 0
Selecting Differential Splicing Methods: Practical Considerations | - | 0
RBoard: A Unified Platform for Reproducible and Reusable Recommender System Benchmarks | - | 0
NeIn: Telling What You Don't Want | - | 0
Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 | - | 0
Assessing SPARQL capabilities of Large Language Models | Code | 2
DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection | - | 0
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs | - | 0
A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0
Insights from Benchmarking Frontier Language Models on Web App Code Generation | Code | 1
Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm | - | 0
Absolute Ranking: An Essential Normalization for Benchmarking Optimization Algorithms | - | 0
PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation | Code | 2
Quantum Kernel Methods under Scrutiny: A Benchmarking Study | - | 0
Question-Answering Dense Video Events | Code | 0
Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression | - | 0
Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift | - | 0
LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts | Code | 0
InfraLib: Enabling Reinforcement Learning and Decision-Making for Large-Scale Infrastructure Management | - | 0
RTLRewriter: Methodologies for Large Models aided RTL Code Optimization | Code | 1
PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation | - | 0
NUMOSIM: A Synthetic Mobility Dataset with Anomaly Detection Benchmarks | - | 0
Benchmarking Spurious Bias in Few-Shot Image Classifiers | Code | 0
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study | Code | 0
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs | Code | 1
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision | - | 0
Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified