SOTAVerified

Benchmarking

Papers

Showing 2351-2400 of 5548 papers

Title | Status | Hype
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions | Code | 1
IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context | Code | 0
TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods | Code | 5
Are Large Language Models Good at Utility Judgments? | Code | 0
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers | Code | 1
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object | Code | 1
Benchmarking Object Detectors with COCO: A New Path Forward | Code | 1
Towards Image Ambient Lighting Normalization | Code | 1
Benchmarking Image Transformers for Prostate Cancer Detection from Ultrasound Data | | 0
GPTs and Language Barrier: A Cross-Lingual Legal QA Examination | | 0
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
Benchmarking Video Frame Interpolation | | 0
DISL: Fueling Research with A Large Dataset of Solidity Smart Contracts | | 0
NSINA: A News Corpus for Sinhala | Code | 0
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark | Code | 1
On the Fragility of Active Learners for Text Classification | Code | 0
TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring | Code | 0
Unifying Large Language Model and Deep Reinforcement Learning for Human-in-Loop Interactive Socially-aware Navigation | | 0
Transactive Local Energy Markets Enable Community-Level Resource Coordination Using Individual Rewards | | 0
Broadening the Scope of Neural Network Potentials through Direct Inclusion of Additional Molecular Attributes | | 0
Subjective Quality Assessment of Compressed Tone-Mapped High Dynamic Range Videos | | 0
Can 3D Vision-Language Models Truly Understand Natural Language? | Code | 1
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models | Code | 1
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1
ChatGPT Alternative Solutions: Large Language Models Survey | | 0
DomainLab: A modular Python package for domain generalization in deep learning | Code | 1
Practical End-to-End Optical Music Recognition for Pianoform Music | Code | 1
MARTA: a model for the automatic phonemic grouping of the parkinsonian speech | Code | 0
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Code | 2
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection | Code | 3
MELTing point: Mobile Evaluation of Language Transformers | Code | 1
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework | Code | 3
ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems | Code | 1
Embarrassingly Simple Scribble Supervision for 3D Medical Segmentation | | 0
Benchmarking Badminton Action Recognition with a New Fine-Grained Dataset | | 0
OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety | | 0
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
Leveraging Spatial and Semantic Feature Extraction for Skin Cancer Diagnosis with Capsule Networks and Graph Neural Networks | | 0
Benchmarking the Robustness of UAV Tracking Against Common Corruptions | Code | 0
A Sober Look at the Robustness of CLIPs to Spurious Features | | 0
FlowMind: Automatic Workflow Generation with LLMs | | 0
Granular Change Accuracy: A More Accurate Performance Metric for Dialogue State Tracking | | 0
Depression Detection on Social Media with Large Language Models | | 0
An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Code | 1
Benchmarking Adversarial Robustness of Image Shadow Removal with Shadow-adaptive Attacks | | 0
Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Code | 1
Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified