SOTAVerified

Benchmarking

Papers

Showing 1476-1500 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| AERO: Softmax-Only LLMs for Efficient Private Inference | | 0 |
| Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | | 0 |
| Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | | 0 |
| MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Code | 4 |
| Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Code | 0 |
| Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos | | 0 |
| FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | | 0 |
| RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Code | 1 |
| The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | | 0 |
| ChakmaNMT: A Low-resource Machine Translation On Chakma Language | | 0 |
| LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | Code | 3 |
| Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective | Code | 0 |
| Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | | 0 |
| SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | Code | 0 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Code | 1 |
| Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | | 0 |
| RMB: Comprehensively Benchmarking Reward Models in LLM Alignment | Code | 1 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2 |
| LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Code | 1 |
| Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Code | 0 |
| LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Code | 0 |
| FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Code | 0 |
| A Comparative Analysis on Ethical Benchmarking in Large Language Models | | 0 |
| Enterprise Benchmarks for Large Language Model Evaluation | Code | 0 |
Page 60 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |