SOTAVerified

Benchmarking

Papers

Showing 1401–1450 of 5548 papers

Title | Status | Hype
Benchmarking Image Retrieval for Visual Localization | Code | 1
Quantitative Certification of Bias in Large Language Models | Code | 1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning | Code | 1
Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Code | 1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1
Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems | Code | 1
Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset | Code | 1
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Code | 1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration | Code | 1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Code | 1
KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi | Code | 1
Beyond neural scaling laws: beating power law scaling via data pruning | Code | 1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models | Code | 1
Beyond Normal: On the Evaluation of Mutual Information Estimators | Code | 1
Enhancing Biomedical Relation Extraction with Directionality | Code | 1
Enhancing Ligand Pose Sampling for Molecular Docking | Code | 1
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Code | 1
Just Rank: Rethinking Evaluation with Word and Sentence Similarities | Code | 1
Autonomous Reinforcement Learning: Formalism and Benchmarking | Code | 1
Labelling unlabelled videos from scratch with multi-modal self-supervision | Code | 1
JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching | Code | 1
ENRICH: Multi-purposE dataset for beNchmaRking In Computer vision and pHotogrammetry | Code | 1
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Code | 1
Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Code | 1
SHARP: Environment and Person Independent Activity Recognition with Commodity IEEE 802.11 Access Points | Code | 1
JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning | Code | 1
A Critical Assessment of State-of-the-Art in Entity Alignment | Code | 1
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Code | 1
ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems | Code | 1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Code | 1
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | Code | 1
Jojajovai: A Parallel Guarani-Spanish Corpus for MT Benchmarking | Code | 1
Best practices for constructing, preparing, and evaluating protein-ligand binding affinity benchmarks | Code | 1
AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Code | 1
Evaluating histopathology transfer learning with ChampKit | Code | 1
Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking | Code | 1
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models | Code | 1
Evaluating Multimodal Representations on Visual Semantic Textual Similarity | Code | 1
ISSAFE: Improving Semantic Segmentation in Accidents by Fusing Event-based Data | Code | 1
Rethinking Machine Unlearning in Image Generation Models | Code | 1
JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds | Code | 1
Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Code | 1
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1
Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified