SOTAVerified

Benchmarking

Papers

Showing 651–700 of 5548 papers

Title | Status | Hype
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
A Computed Tomography Vertebral Segmentation Dataset with Anatomical Variations and Multi-Vendor Scanner Data | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Code | 1
On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1
CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1
Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Code | 1
A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition | Code | 1
Chaos as an interpretable benchmark for forecasting and data-driven modelling | Code | 1
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1
CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization | Code | 1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified