SOTAVerified

Benchmarking

Papers

Showing 4301–4350 of 5548 papers

Title | Status | Hype
Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI | | 0
Quantifying Social Biases Using Templates is Unreliable | | 0
Quantifying the Complexity of Standard Benchmarking Datasets for Long-Term Human Trajectory Prediction | | 0
Quantifying the Impact of Boundary Constraint Handling Methods on Differential Evolution | | 0
A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification | | 0
Quantitative Benchmarking of Anomaly Detection Methods in Digital Pathology | | 0
A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking | | 0
Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception | | 0
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values | | 0
Understanding Foundation Models: Are We Back in 1924? | | 0
Quantitative Metrics for Benchmarking Medical Image Harmonization | | 0
Benchmarking Bayesian neural networks and evaluation metrics for regression tasks | | 0
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | | 0
Quantum-Assisted Learning of Hardware-Embedded Probabilistic Graphical Models | | 0
Understanding or Manipulation: Rethinking Online Performance Gains of Modern Recommender Systems | | 0
Quantum classification of the MNIST dataset with Slow Feature Analysis | | 0
Quantum Cognitively Motivated Decision Fusion for Video Sentiment Analysis | | 0
A Comparison of Directional Distances for Hand Pose Estimation | | 0
Quantum Kernel Methods under Scrutiny: A Benchmarking Study | | 0
Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting | | 0
Quantum Kernel Learning for Small Dataset Modeling in Semiconductor Fabrication: Application to Ohmic Contact | | 0
Quantum-tunnelling deep neural network for optical illusion recognition | | 0
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture | | 0
Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach | | 0
Understanding Recurrent Neural Architectures by Analyzing and Synthesizing Long Distance Dependencies in Benchmark Sequential Datasets | | 0
Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease | | 0
Understanding the Limits of Lifelong Knowledge Editing in LLMs | | 0
Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice | | 0
Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective | | 0
Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture | | 0
A Two-Step Framework for Multi-Material Decomposition of Dual Energy Computed Tomography from Projection Domain | | 0
R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | | 0
R2H: Building Multimodal Navigation Helpers that Respond to Help Requests | | 0
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | | 0
R3L: Connecting Deep Reinforcement Learning to Recurrent Neural Networks for Image Denoising via Residual Recovery | | 0
A Two-Stage Neural-Filter Pareto Front Extractor and the need for Benchmarking | | 0
RadFusion: Benchmarking Performance and Fairness for Multimodal Pulmonary Embolism Detection from CT and EHR | | 0
A tutorial on multi-view autoencoders using the multi-view-AE library | | 0
Understanding the User: An Intent-Based Ranking Dataset | | 0
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems | | 0
Attention versus Contrastive Learning of Tabular Data -- A Data-centric Benchmarking | | 0
A Theory of Dynamic Benchmarks | | 0
RAG-Reward: Optimizing RAG with Reward Modeling and RLHF | | 0
Rail-5k: a Real-World Dataset for Rail Surface Defects Detection | | 0
On the Evaluation of Engineering Artificial General Intelligence | | 0
A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality | | 0
RAN-GNNs: breaking the capacity limits of graph neural networks | | 0
ATG: Benchmarking Automated Theorem Generation for Generative Language Models | | 0
A Comparison of Cryptocurrency Volatility-benchmarking New and Mature Asset Classes | | 0
Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified