SOTAVerified

Benchmarking

Papers

Showing 3451–3500 of 5548 papers

Title | Status | Hype
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection | - | 0
Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems | - | 0
MA-BBOB: A Problem Generator for Black-Box Optimization Using Affine Combinations and Shifts | - | 0
MA-BBOB: Many-Affine Combinations of BBOB Functions for Evaluating AutoML Approaches in Noiseless Numerical Black-Box Optimization Contexts | - | 0
Towards an AI Accountability Policy | - | 0
Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance | - | 0
Towards an Automated SOAP Note: Classifying Utterances from Medical Conversations | - | 0
A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video | - | 0
Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | - | 0
Towards a Taxonomy of Graph Learning Datasets | - | 0
Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices | - | 0
Machine learning for modelling unstructured grid data in computational physics: a review | - | 0
Towards a Theory-Guided Benchmarking Suite for Discrete Black-Box Optimization Heuristics: Profiling (1+λ) EA Variants on OneMax and LeadingOnes | - | 0
Machine Learning for Ranking f-wave Extraction Methods in Single-Lead ECGs | - | 0
Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving | - | 0
Uncertainty estimation of machine learning spatial precipitation predictions from satellite data | - | 0
Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction | - | 0
Benchmarking LLMs and SLMs for patient reported outcomes | - | 0
Benchmarking LLM powered Chatbots: Methods and Metrics | - | 0
Machine Vision based Sample-Tube Localization for Mars Sample Return | - | 0
Benchmarking LLM Guardrails in Handling Multilingual Toxicity | - | 0
Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3 | - | 0
Towards a Unified Framework for Determining Conformational Ensembles of Disordered Proteins | - | 0
Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios | - | 0
Making Sense of Data in the Wild: Data Analysis Automation at Scale | - | 0
OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User | - | 0
A Deep Q-Learning Method for Downlink Power Allocation in Multi-Cell Networks | - | 0
Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | - | 0
Benchmarking LiDAR Sensors for Development and Evaluation of Automotive Perception | - | 0
Towards Benchmarking and Evaluating Deepfake Detection | - | 0
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | - | 0
MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects | - | 0
Deep Patent Landscaping Model Using Transformer and Graph Embedding | - | 0
Manual Verbalizer Enrichment for Few-Shot Text Classification | - | 0
Towards Benchmarking Explainable Artificial Intelligence Methods | - | 0
Mapping global dynamics of benchmark creation and saturation in artificial intelligence | - | 0
Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions | - | 0
Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR | - | 0
Towards Benchmarking Scene Background Initialization | - | 0
MarineGym: A High-Performance Reinforcement Learning Platform for Underwater Robotics | - | 0
Benchmarking Lexical Simplification Systems | - | 0
Towards Benchmarking the Utility of Explanations for Model Debugging | - | 0
WER We Stand: Benchmarking Urdu ASR Models | - | 0
Benchmarking Learnt Radio Localisation under Distribution Shift | - | 0
Benchmarking learned non-Cartesian k-space trajectories and reconstruction networks | - | 0
Match Stereo Videos via Bidirectional Alignment | - | 0
MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities | - | 0
PINNs for Medical Image Analysis: A Survey | - | 0
(N,K)-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language Model | - | 0
Benchmarking learned algorithms for computed tomography image reconstruction tasks | - | 0
Page 70 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified