SOTAVerified

Benchmarking

Papers

Showing 1201–1250 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding | Code | 1 |
| Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1 |
| Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Code | 1 |
| Benchmarking Object Detectors with COCO: A New Path Forward | Code | 1 |
| Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Code | 1 |
| CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms | Code | 1 |
| Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Code | 1 |
| Benchmarking the Generation of Fact Checking Explanations | Code | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1 |
| Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1 |
| Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1 |
| CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs | Code | 1 |
| How to Train Neural Field Representations: A Comprehensive Study and Benchmark | Code | 1 |
| Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1 |
| Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Code | 1 |
| 3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1 |
| HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims | Code | 1 |
| CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1 |
| On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1 |
| Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1 |
| Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1 |
| LEMUR Neural Network Dataset: Towards Seamless AutoML | Code | 1 |
| Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1 |
| CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1 |
| Benchmarking Omni-Vision Representation through the Lens of Visual Realms | Code | 1 |
| CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1 |
| HINT3: Raising the bar for Intent Detection in the Wild | Code | 1 |
| CIDEr: Consensus-based Image Description Evaluation | Code | 1 |
| Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Code | 1 |
| Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1 |
| Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1 |
| Hierarchical graph neural nets can capture long-range interactions | Code | 1 |
| Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection | Code | 1 |
| A Reinforcement Learning Environment for Multi-Service UAV-enabled Wireless Systems | Code | 1 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1 |
| Benchmarking Language Models for Code Syntax Understanding | Code | 1 |
| LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | Code | 1 |
| TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Code | 1 |
| AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1 |
| ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1 |
| A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care | Code | 1 |
| HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis | Code | 1 |
| Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1 |
| CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1 |
| Clinical Prompt Learning with Frozen Language Models | Code | 1 |
| Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Code | 1 |
| HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing | Code | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1 |
| Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1 |
| HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |