SOTAVerified

Benchmarking

Papers

Showing 701-750 of 5548 papers

Title | Status | Hype
A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition | Code | 1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection | Code | 1
Automatic sleep stage classification with deep residual networks in a mixed-cohort setting | Code | 1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets | Code | 1
DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation | Code | 1
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Autonomous Reinforcement Learning: Formalism and Benchmarking | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
Deep learning model solves change point detection for multiple change types | Code | 1
Benchmarking Algorithms for Federated Domain Generalization | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
Benchmarking Algorithms for Submodular Optimization Problems Using IOHProfiler | Code | 1
A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking | Code | 1
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models | Code | 1
DependEval: Benchmarking LLMs for Repository Dependency Understanding | Code | 1
ALTO: A Large-Scale Dataset for UAV Visual Place Recognition and Localization | Code | 1
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers | Code | 1
Detecting beats in the photoplethysmogram: benchmarking open-source algorithms | Code | 1
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
Bag of Tricks for Adversarial Training | Code | 1
DFGC 2021: A DeepFake Game Competition | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Code | 1
A Ladder of Causal Distances | Code | 1
Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones | Code | 1
Disentangled Feature Representation for Few-shot Image Classification | Code | 1
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Code | 1
ATOMMIC: An Advanced Toolbox for Multitask Medical Imaging Consistency to facilitate Artificial Intelligence applications from acquisition to analysis in Magnetic Resonance Imaging | Code | 1
Atom-Level Optical Chemical Structure Recognition with Limited Supervision | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
Active-Passive SimStereo -- Benchmarking the Cross-Generalization Capabilities of Deep Learning-based Stereo Methods | Code | 1
Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? | Code | 1
CIDEr: Consensus-based Image Description Evaluation | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
Benchmarking AI scientists in omics data-driven biological research | Code | 1
Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Code | 1
CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1
A Japanese Dataset for Subjective and Objective Sentiment Polarity Classification in Micro Blog Domain | Code | 1
On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified