SOTAVerified

Benchmarking

Papers

Showing 1451–1500 of 5548 papers

Title | Status | Hype
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Code | 2
Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios | Code | 0
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | | 0
A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data | | 0
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence | | 0
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Code | 2
FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning | Code | 0
Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review | | 0
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | | 0
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Code | 1
Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Code | 1
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Code | 0
Sum Secrecy Rate Maximization for Full Duplex ISAC Systems | | 0
Trust but Verify: Programmatic VLM Evaluation in the Wild | | 0
Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p | Code | 0
debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Code | 0
ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization | Code | 0
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs | Code | 0
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Code | 0
Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | | 0
AERO: Softmax-Only LLMs for Efficient Private Inference | | 0
Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | | 0
Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | | 0
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Code | 4
Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Code | 0
Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos | | 0
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | | 0
RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Code | 1
The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | | 0
Personalised Feedback Framework for Online Education Programmes Using Generative AI | | 0
ChakmaNMT: A Low-resource Machine Translation On Chakma Language | | 0
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | Code | 3
Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective | Code | 0
Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | | 0
SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | Code | 0
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Code | 1
Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | | 0
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment | Code | 1
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Code | 1
Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Code | 0
LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Code | 0
FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Code | 0
A Comparative Analysis on Ethical Benchmarking in Large Language Models | | 0
Enterprise Benchmarks for Large Language Model Evaluation | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified