SOTAVerified

Benchmarking

Papers

Showing 2851–2900 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Alexpaca: Learning Factual Clarification Question Generation Without Examples | | 0 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Code | 1 |
| DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Code | 1 |
| BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali | | 0 |
| An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition | | 0 |
| Assessing Encoder-Decoder Architectures for Robust Coronary Artery Segmentation | | 0 |
| 3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1 |
| TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models | Code | 0 |
| A Novel Benchmarking Paradigm and a Scale- and Motion-Aware Model for Egocentric Pedestrian Trajectory Prediction | | 0 |
| Prompting Scientific Names for Zero-Shot Species Recognition | | 0 |
| Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning | | 0 |
| Randomized Benchmarking of Local Zeroth-Order Optimizers for Variational Quantum Systems | Code | 0 |
| Benchmarking the Sim-to-Real Gap in Cloth Manipulation | | 0 |
| Mirage: Model-Agnostic Graph Distillation for Graph Classification | Code | 0 |
| "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Code | 1 |
| pose-format: Library for Viewing, Augmenting, and Handling .pose Files | Code | 1 |
| BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts | Code | 0 |
| Welfare Diplomacy: Benchmarking Language Model Cooperation | Code | 1 |
| MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning | Code | 1 |
| GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Code | 1 |
| A Benchmarking Protocol for SAR Colorization: From Regression to Deep Learning Approaches | | 0 |
| Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images | | 0 |
| Who Said That? Benchmarking Social Media AI Detection | | 0 |
| Towards Evaluating Generalist Agents: An Automated Benchmark in Open World | Code | 1 |
| Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Code | 2 |
| CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving | Code | 3 |
| Deep Reinforcement Learning for Autonomous Cyber Defence: A Survey | | 0 |
| FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | | 0 |
| Transformers for Green Semantic Communication: Less Energy, More Semantics | Code | 0 |
| Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design | | 0 |
| Risk Aware Benchmarking of Large Language Models | | 0 |
| Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms | | 0 |
| ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Code | 2 |
| BeSt-LeS: Benchmarking Stroke Lesion Segmentation using Deep Supervision | Code | 0 |
| CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods | | 0 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Code | 1 |
| Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Code | 1 |
| On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets | | 0 |
| Distributed Evolution Strategies with Multi-Level Learning for Large-Scale Black-Box Optimization | | 0 |
| Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Code | 3 |
| Transcending the Attention Paradigm: Representation Learning from Geospatial Social Media Data | Code | 0 |
| Simple GNNs with Low Rank Non-parametric Aggregators | Code | 0 |
| Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus | Code | 0 |
| Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Code | 0 |
| Benchmarking Large Language Models with Augmented Instructions for Fine-grained Information Extraction | | 0 |
| FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets | | 0 |
| Beyond Text: A Deep Dive into Large Language Models' Ability on Understanding Graph Data | | 0 |
| AKFruitYield: Modular benchmarking and video analysis software for Azure Kinect cameras for fruit size and fruit yield estimation in apple orchards | Code | 0 |
| Full-scale modal testing of a Hawk T1A aircraft for benchmarking vibration-based methods | | 0 |
| LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |