SOTAVerified

Benchmarking

Papers

Showing 2501–2550 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts | Code | 0 |
| VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning | | 0 |
| Benchmarking symbolic regression constant optimization schemes | | 0 |
| Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data | | 0 |
| OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations | | 0 |
| Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods | Code | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | | 0 |
| AI Benchmarks and Datasets for LLM Evaluation | | 0 |
| Agentic-HLS: An agentic reasoning based high-level synthesis system using large language models (AI for EDA workshop 2024) | Code | 0 |
| Understanding the World's Museums through Vision-Language Reasoning | Code | 0 |
| TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences | Code | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | | 0 |
| One-Shot Real-to-Sim via End-to-End Differentiable Simulation and Rendering | | 0 |
| HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos | | 0 |
| Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks | | 0 |
| λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics | | 0 |
| Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems | | 0 |
| Benchmarking Agility and Reconfigurability in Satellite Systems for Tropical Cyclone Monitoring | | 0 |
| Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches | | 0 |
| Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals | | 0 |
| Abnormality-Driven Representation Learning for Radiology Imaging | | 0 |
| Performance Benchmarking of Psychomotor Skills Using Wearable Devices: An Application in Sport | | 0 |
| A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation | | 0 |
| Benchmarking Active Learning for NILM | | 0 |
| ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain | Code | 0 |
| Reassessing Layer Pruning in LLMs: New Insights and Methods | Code | 0 |
| AdamZ: An Enhanced Optimisation Method for Neural Network Training | Code | 0 |
| Benchmarking the Robustness of Optical Flow Estimation to Corruptions | Code | 0 |
| Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | | 0 |
| Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels | Code | 0 |
| PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series | Code | 0 |
| Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling | Code | 0 |
| Benchmarking a wide range of optimisers for solving the Fermi-Hubbard model using the variational quantum eigensolver | | 0 |
| BelHouse3D: A Benchmark Dataset for Assessing Occlusion Robustness in 3D Point Cloud Semantic Segmentation | | 0 |
| Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking | | 0 |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | | 0 |
| Delta-Influence: Unlearning Poisons via Influence Functions | Code | 0 |
| Integrating Dynamic Correlation Shifts and Weighted Benchmarking in Extreme Value Analysis | | 0 |
| Benchmarking Positional Encodings for GNNs and Graph Transformers | Code | 0 |
| The Moral Mind(s) of Large Language Models | | 0 |
| Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Code | 0 |
| Benchmarking pre-trained text embedding models in aligning built asset information | Code | 0 |
| Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies | | 0 |
| FastDraft: How to Train Your Draft | | 0 |
| Reinforcing Competitive Multi-Agents for Playing So Long Sucker | | 0 |
| Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML | | 0 |
| Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections | Code | 0 |
| Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | | 0 |
| The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering | | 0 |
| Automated Coding of Communications in Collaborative Problem-solving Tasks Using ChatGPT | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |