SOTAVerified

Benchmarking

Papers

Showing 776–800 of 5548 papers

Title | Status | Hype
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Code | 1
Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Code | 1
Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology | Code | 1
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Code | 1
Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Code | 1
Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents | Code | 1
Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Code | 1
R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations | Code | 1
Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Code | 1
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs | Code | 1
Efficient Lifelong Model Evaluation in an Era of Rapid Progress | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Code | 1
Benchmarking Data Science Agents | Code | 1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
PST-Bench: Tracing and Benchmarking the Source of Publications | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1
The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified