SOTAVerified

Benchmarking

Papers

Showing 2876–2900 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving | Code | 3 |
| Deep Reinforcement Learning for Autonomous Cyber Defence: A Survey | | 0 |
| FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | | 0 |
| Transformers for Green Semantic Communication: Less Energy, More Semantics | Code | 0 |
| Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design | | 0 |
| Risk Aware Benchmarking of Large Language Models | | 0 |
| Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms | | 0 |
| ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Code | 2 |
| BeSt-LeS: Benchmarking Stroke Lesion Segmentation using Deep Supervision | Code | 0 |
| CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods | | 0 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Code | 1 |
| Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Code | 1 |
| On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets | | 0 |
| Distributed Evolution Strategies with Multi-Level Learning for Large-Scale Black-Box Optimization | | 0 |
| Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Code | 3 |
| Transcending the Attention Paradigm: Representation Learning from Geospatial Social Media Data | Code | 0 |
| Simple GNNs with Low Rank Non-parametric Aggregators | Code | 0 |
| Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus | Code | 0 |
| Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Code | 0 |
| Benchmarking Large Language Models with Augmented Instructions for Fine-grained Information Extraction | | 0 |
| FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets | | 0 |
| Beyond Text: A Deep Dive into Large Language Models' Ability on Understanding Graph Data | | 0 |
| AKFruitYield: Modular benchmarking and video analysis software for Azure Kinect cameras for fruit size and fruit yield estimation in apple orchards | Code | 0 |
| Full-scale modal testing of a Hawk T1A aircraft for benchmarking vibration-based methods | | 0 |
| LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation | | 0 |
Page 116 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |