SOTAVerified

Benchmarking

Papers

Showing 2651-2675 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Trust but Verify: Programmatic VLM Evaluation in the Wild | | 0 |
| Sum Secrecy Rate Maximization for Full Duplex ISAC Systems | | 0 |
| Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Code | 0 |
| Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | | 0 |
| Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | | 0 |
| Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | | 0 |
| AERO: Softmax-Only LLMs for Efficient Private Inference | | 0 |
| Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Code | 0 |
| Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos | | 0 |
| FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | | 0 |
| The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | | 0 |
| ChakmaNMT: A Low-resource Machine Translation On Chakma Language | | 0 |
| Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | | 0 |
| Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | | 0 |
| SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | Code | 0 |
| Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective | Code | 0 |
| LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Code | 0 |
| Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Code | 0 |
| FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Code | 0 |
| Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | | 0 |
| Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | | 0 |
| ∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | | 0 |
| Enterprise Benchmarks for Large Language Model Evaluation | Code | 0 |
| A Comparative Analysis on Ethical Benchmarking in Large Language Models | | 0 |
Page 107 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |