SOTAVerified

Benchmarking

Papers

Showing 2376–2400 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1 |
| ChatGPT Alternative Solutions: Large Language Models Survey | | 0 |
| DomainLab: A modular Python package for domain generalization in deep learning | Code | 1 |
| Practical End-to-End Optical Music Recognition for Pianoform Music | Code | 1 |
| MARTA: a model for the automatic phonemic grouping of the parkinsonian speech | Code | 0 |
| VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Code | 2 |
| Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection | Code | 3 |
| MELTing point: Mobile Evaluation of Language Transformers | Code | 1 |
| AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework | Code | 3 |
| ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems | Code | 1 |
| Embarrassingly Simple Scribble Supervision for 3D Medical Segmentation | | 0 |
| Benchmarking Badminton Action Recognition with a New Fine-Grained Dataset | | 0 |
| OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety | | 0 |
| NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens | Code | 1 |
| Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1 |
| Leveraging Spatial and Semantic Feature Extraction for Skin Cancer Diagnosis with Capsule Networks and Graph Neural Networks | | 0 |
| Benchmarking the Robustness of UAV Tracking Against Common Corruptions | Code | 0 |
| A Sober Look at the Robustness of CLIPs to Spurious Features | | 0 |
| FlowMind: Automatic Workflow Generation with LLMs | | 0 |
| Granular Change Accuracy: A More Accurate Performance Metric for Dialogue State Tracking | | 0 |
| Depression Detection on Social Media with Large Language Models | | 0 |
| An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Code | 1 |
| Benchmarking Adversarial Robustness of Image Shadow Removal with Shadow-adaptive Attacks | | 0 |
| Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Code | 1 |
| Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |