SOTAVerified

Benchmarking

Papers

Showing 2526–2550 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Reassessing Layer Pruning in LLMs: New Insights and Methods | Code | 0 |
| AdamZ: An Enhanced Optimisation Method for Neural Network Training | Code | 0 |
| Benchmarking the Robustness of Optical Flow Estimation to Corruptions | Code | 0 |
| Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | | 0 |
| Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels | Code | 0 |
| PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series | Code | 0 |
| Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling | Code | 0 |
| Benchmarking a wide range of optimisers for solving the Fermi-Hubbard model using the variational quantum eigensolver | | 0 |
| BelHouse3D: A Benchmark Dataset for Assessing Occlusion Robustness in 3D Point Cloud Semantic Segmentation | | 0 |
| Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking | | 0 |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | | 0 |
| Delta-Influence: Unlearning Poisons via Influence Functions | Code | 0 |
| Integrating Dynamic Correlation Shifts and Weighted Benchmarking in Extreme Value Analysis | | 0 |
| Benchmarking Positional Encodings for GNNs and Graph Transformers | Code | 0 |
| The Moral Mind(s) of Large Language Models | | 0 |
| Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Code | 0 |
| Benchmarking pre-trained text embedding models in aligning built asset information | Code | 0 |
| Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies | | 0 |
| FastDraft: How to Train Your Draft | | 0 |
| Reinforcing Competitive Multi-Agents for Playing So Long Sucker | | 0 |
| Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML | | 0 |
| Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections | Code | 0 |
| Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | | 0 |
| The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering | | 0 |
| Automated Coding of Communications in Collaborative Problem-solving Tasks Using ChatGPT | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |