SOTAVerified

Benchmarking

Papers

Showing 2651–2700 of 5548 papers

Title | Status | Hype
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Code | 0
debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Code | 0
Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | | 0
Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | | 0
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Code | 0
Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | | 0
AERO: Softmax-Only LLMs for Efficient Private Inference | | 0
Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Code | 0
Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos | | 0
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | | 0
Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | | 0
ChakmaNMT: A Low-resource Machine Translation On Chakma Language | | 0
Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | | 0
The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | | 0
Personalised Feedback Framework for Online Education Programmes Using Generative AI | | 0
SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | Code | 0
Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective | Code | 0
LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Code | 0
FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Code | 0
Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Code | 0
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation | | 0
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | | 0
Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | | 0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | | 0
Enterprise Benchmarks for Large Language Model Evaluation | Code | 0
A Comparative Analysis on Ethical Benchmarking in Large Language Models | | 0
Identifying Money Laundering Subgraphs on the Blockchain | Code | 0
Audio Explanation Synthesis with Generative Foundation Models | Code | 0
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Code | 0
Advocating Character Error Rate for Multilingual ASR Evaluation | | 0
InAttention: Linear Context Scaling for Transformers | | 0
Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning | Code | 0
TuringQ: Benchmarking AI Comprehension in Theory of Computation | Code | 0
Analysis of different disparity estimation techniques on aerial stereo image datasets | | 0
OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB | | 0
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding | | 0
M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes | | 0
Active Evaluation Acquisition for Efficient LLM Benchmarking | | 0
Manual Verbalizer Enrichment for Few-Shot Text Classification | | 0
Benchmarking of a new data splitting method on volcanic eruption data | | 0
QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers | Code | 0
Named Clinical Entity Recognition Benchmark | Code | 0
Precise Model Benchmarking with Only a Few Observations | | 0
Rule-based Data Selection for Large Language Models | | 0
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models | Code | 0
Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems | | 0
Adjusting Pretrained Backbones for Performativity | Code | 0
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | | 0
Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels | | 0
Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified