SOTAVerified

Benchmarking

Papers

Showing 751-800 of 5548 papers

Title | Status | Hype
Towards Sim-to-Real Industrial Parts Classification with Synthetic Dataset | Code | 1
Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model | Code | 1
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model | Code | 1
Outlier-Efficient Hopfield Layers for Large Transformer-Based Models | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
Atom-Level Optical Chemical Structure Recognition with Limited Supervision | Code | 1
PREGO: online mistake detection in PRocedural EGOcentric videos | Code | 1
Benchmarking Counterfactual Image Generation | Code | 1
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object | Code | 1
RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers | Code | 1
Towards Image Ambient Lighting Normalization | Code | 1
Benchmarking Object Detectors with COCO: A New Path Forward | Code | 1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark | Code | 1
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models | Code | 1
DomainLab: A modular Python package for domain generalization in deep learning | Code | 1
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1
Can 3D Vision-Language Models Truly Understand Natural Language? | Code | 1
Practical End-to-End Optical Music Recognition for Pianoform Music | Code | 1
ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems | Code | 1
MELTing point: Mobile Evaluation of Language Transformers | Code | 1
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Code | 1
Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Code | 1
Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology | Code | 1
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Code | 1
Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Code | 1
Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents | Code | 1
Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Code | 1
R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations | Code | 1
Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Code | 1
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs | Code | 1
Efficient Lifelong Model Evaluation in an Era of Rapid Progress | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Code | 1
Benchmarking Data Science Agents | Code | 1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
PST-Bench: Tracing and Benchmarking the Source of Publications | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1
The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
Page 16 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified