SOTAVerified

Benchmarking

Papers

Showing 4101–4150 of 5548 papers

Title | Status | Hype
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests | | 0
The ACL RD-TEC: A Dataset for Benchmarking Terminology Extraction and Classification in Computational Linguistics | | 0
The Adversarial AI-Art: Understanding, Generation, Detection, and Benchmarking | | 0
The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artificial Intelligence | | 0
Language Models as a Service: Overview of a New Paradigm and its Challenges | | 0
The Benchmark Lottery | | 0
The Brain Tumor Segmentation (BraTS-METS) Challenge 2023: Brain Metastasis Segmentation on Pre-treatment MRI | | 0
The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal | | 0
The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach | | 0
The Curious Case of Integrator Reach Sets, Part I: Basic Theory | | 0
The Design and Implementation of a Scalable DL Benchmarking Platform | | 0
The Disagreement Problem in Faithfulness Metrics | | 0
The DLV System for Knowledge Representation and Reasoning | | 0
The Dota 2 Bot Competition | | 0
The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation | | 0
The EuroCity Persons Dataset: A Novel Benchmark for Object Detection | | 0
The Evolutionary Computation Methods No One Should Use | | 0
The Expressive Power of Word Embeddings | | 0
The Extractive-Abstractive Axis: Measuring Content "Borrowing" in Generative Language Models | | 0
The FaceChannelS: Strike of the Sequences for the AffWild 2 Challenge | | 0
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input | | 0
The Forchheim Image Database for Camera Identification in the Wild | | 0
The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech | | 0
The Impact of Genomic Variation on Function (IGVF) Consortium | | 0
The iNaturalist Sounds Dataset | | 0
The Interactive Effects of Operators and Parameters to GA Performance Under Different Problem Sizes | | 0
The JPEG Pleno Learning-based Point Cloud Coding Standard: Serving Man and Machine | | 0
The Jungle of Generative Drug Discovery: Traps, Treasures, and Ways Out | | 0
The Karp Dataset | | 0
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs | | 0
The Leaderboard Illusion | | 0
The Liouville Generator for Producing Integrable Expressions | | 0
The Low Emission Oil&Gas Open (LEOGO) Reference Platform of an Off-Grid Energy System for Renewable Integration Studies | | 0
The Moral Mind(s) of Large Language Models | | 0
The Multi-speaker Multi-style Voice Cloning Challenge 2021 | | 0
The Neural Painter: Multi-Turn Image Generation | | 0
The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects | | 0
Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests | | 0
The Oxford Spires Dataset: Benchmarking Large-Scale LiDAR-Visual Localisation, Reconstruction and Radiance Field Methods | | 0
The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways | | 0
The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering | | 0
The Partial Response Network: a neural network nomogram | | 0
The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong | | 0
The Protein Engineering Tournament: An Open Science Benchmark for Protein Modeling and Design | | 0
Thermal Image-based Fault Diagnosis in Induction Machines via Self-Organized Operational Neural Networks | | 0
The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search | | 0
The Russian practice of applying cluster approach in regional development | | 0
The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection | | 0
The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks | | 0
The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified