SOTAVerified

Benchmarking

Papers

Showing 1301–1350 of 5548 papers

Title | Status | Hype
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets | Code | 1
Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
Benchmarking Robustness of Text-Image Composed Retrieval | Code | 1
Benchmarking Robustness to Adversarial Image Obfuscations | Code | 1
Benchmarking the Generation of Fact Checking Explanations | Code | 1
A framework for benchmarking class-out-of-distribution detection and its application to ImageNet | Code | 1
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1
Is Multi-Hop Reasoning Really Explainable? Towards Benchmarking Reasoning Interpretability | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
BiBench: Benchmarking and Analyzing Network Binarization | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
A Ladder of Causal Distances | Code | 1
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1
A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking | Code | 1
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Code | 1
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Code | 1
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
KO codes: Inventing Nonlinear Encoding and Decoding for Reliable Wireless Communication via Deep-learning | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Benchmarking Simulation-Based Inference | Code | 1
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning | Code | 1
Labelling unlabelled videos from scratch with multi-modal self-supervision | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
Benchmarking Spatial Relationships in Text-to-Image Generation | Code | 1
Quantum machine learning of large datasets using randomized measurements | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Code | 1
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
LEAF: A Benchmark for Federated Settings | Code | 1
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Code | 1
LEMUR Neural Network Dataset: Towards Seamless AutoML | Code | 1
Less Is More: A Comparison of Active Learning Strategies for 3D Medical Image Segmentation | Code | 1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified