SOTAVerified

Benchmarking

Papers

Showing 701–750 of 5548 papers

Title | Status | Hype
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Code | 1
Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Code | 1
RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection | Code | 1
AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Code | 1
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models | Code | 1
Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking | Code | 1
ICU-Sepsis: A Benchmark MDP Built from Real Medical Data | Code | 1
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Code | 1
TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders | Code | 1
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Code | 1
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Code | 1
LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | Code | 1
SECURE: Benchmarking Large Language Models for Cybersecurity | Code | 1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1
Quantitative Certification of Bias in Large Language Models | Code | 1
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | Code | 1
DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime | Code | 1
Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | Code | 1
Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling | Code | 1
GCondenser: Benchmarking Graph Condensation | Code | 1
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | Code | 1
DocuMint: Docstring Generation for Python using Small Language Models | Code | 1
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | Code | 1
Benchmarking Classical and Learning-Based Multibeam Point Cloud Registration | Code | 1
AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets | Code | 1
Position: Quo Vadis, Unsupervised Time Series Anomaly Detection? | Code | 1
ATOMMIC: An Advanced Toolbox for Multitask Medical Imaging Consistency to facilitate Artificial Intelligence applications from acquisition to analysis in Magnetic Resonance Imaging | Code | 1
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? | Code | 1
4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs | Code | 1
Multi-Stream Cellular Test-Time Adaptation of Real-Time Models Evolving in Dynamic Environments | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | Code | 1
SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data | Code | 1
TAVGBench: Benchmarking Text to Audible-Video Generation | Code | 1
Experimental Validation of Ultrasound Beamforming with End-to-End Deep Learning for Single Plane Wave Imaging | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking | Code | 1
How to Benchmark Vision Foundation Models for Semantic Segmentation? | Code | 1
Second Edition FRCSyn Challenge at CVPR 2024: Face Recognition Challenge in the Era of Synthetic Data | Code | 1
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Code | 1
A Review and Efficient Implementation of Scene Graph Generation Metrics | Code | 1
MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Code | 1
nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation | Code | 1
RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified