SOTAVerified

Benchmarking

Papers

Showing 3551-3600 of 5548 papers

Title | Status | Hype
Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests | | 0
A Metadata-Driven Approach to Understand Graph Neural Networks | | 0
Domain Generalization in Computational Pathology: Survey and Guidelines | | 0
LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection | | 0
Evaluating LLP Methods: Challenges and Approaches | Code | 0
Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness | Code | 0
On General Language Understanding | | 0
OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Code | 0
OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User | | 0
RDBench: ML Benchmark for Relational Databases | | 0
ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair | | 0
XFEVER: Exploring Fact Verification across Languages | Code | 0
Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting | | 0
BLESS: Benchmarking Large Language Models on Sentence Simplification | Code | 0
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic | | 0
XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification | Code | 0
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Code | 0
A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video | | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0
Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Code | 0
Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models | Code | 0
Standardised workflow for mass spectrometry-based single-cell proteomics data processing and analysis using the scp package | | 0
Almost Equivariance via Lie Algebra Convolutions | | 0
Benchmarking GPUs on SVBRDF Extractor Model | | 0
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0
Alexpaca: Learning Factual Clarification Question Generation Without Examples | | 0
BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali | | 0
A Novel Benchmarking Paradigm and a Scale- and Motion-Aware Model for Egocentric Pedestrian Trajectory Prediction | | 0
An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition | | 0
TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models | Code | 0
Assessing Encoder-Decoder Architectures for Robust Coronary Artery Segmentation | | 0
Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning | | 0
Prompting Scientific Names for Zero-Shot Species Recognition | | 0
Benchmarking the Sim-to-Real Gap in Cloth Manipulation | | 0
Randomized Benchmarking of Local Zeroth-Order Optimizers for Variational Quantum Systems | Code | 0
Mirage: Model-Agnostic Graph Distillation for Graph Classification | Code | 0
BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts | Code | 0
A Benchmarking Protocol for SAR Colorization: From Regression to Deep Learning Approaches | | 0
Who Said That? Benchmarking Social Media AI Detection | | 0
Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images | | 0
Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms | | 0
Deep Reinforcement Learning for Autonomous Cyber Defence: A Survey | | 0
Risk Aware Benchmarking of Large Language Models | | 0
Transformers for Green Semantic Communication: Less Energy, More Semantics | Code | 0
FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | | 0
Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design | | 0
BeSt-LeS: Benchmarking Stroke Lesion Segmentation using Deep Supervision | Code | 0
On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets | | 0
CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods | | 0
Distributed Evolution Strategies with Multi-Level Learning for Large-Scale Black-Box Optimization | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified