SOTAVerified

Benchmarking

Papers

Showing 2251–2300 of 5548 papers

Title | Status | Hype
Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing | | 0
A CUDA-Based Real Parameter Optimization Benchmark | | 0
Beyond Text: A Deep Dive into Large Language Models' Ability on Understanding Graph Data | | 0
BEADs: Bias Evaluation Across Domains | | 0
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | | 0
FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets | | 0
FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | | 0
Energy Models for Better Pseudo-Labels: Improving Semi-Supervised Classification with the 1-Laplacian Graph Energy | | 0
Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages | | 0
Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages | | 0
BEACON: A Benchmark for Efficient and Accurate Counting of Subgraphs | | 0
FIMP: Foundation Model-Informed Message Passing for Graph Neural Networks | | 0
FineText: Text Classification via Attention-based Language Model Fine-tuning | | 0
Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms | | 0
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | | 0
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities | | 0
BBOB Instance Analysis: Landscape Properties and Algorithm Performance across Problem Instances | | 0
A Benchmark for Multi-speaker Anonymization | | 0
FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking | | 0
A Modular Framework for Centrality and Clustering in Complex Networks | | 0
Beyond Monocular Deraining: Stereo Image Deraining via Semantic Understanding | | 0
Beyond Monocular Deraining: Parallel Stereo Deraining Network Via Semantic Prior | | 0
Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study | | 0
SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems | | 0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | | 0
Bayesian Multi-type Mean Field Multi-agent Imitation Learning | | 0
A Bayesian Model for Bivariate Causal Inference | | 0
Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding | | 0
Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding | | 0
Financial Numeric Extreme Labelling: A Dataset and Benchmarking for XBRL Tagging | | 0
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada | | 0
AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving | | 0
Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models | | 0
Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems | | 0
Finance Language Model Evaluation (FLaME) | | 0
Beyond Benchmarks: On The False Promise of AI Regulation | | 0
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models | | 0
Active Learning for Community Detection in Stochastic Block Models | | 0
Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark | | 0
Fine-Grained Classification of Pedestrians in Video: Benchmark and State of the Art | | 0
FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures | | 0
Better Practices for Domain Adaptation | | 0
Barkour: Benchmarking Animal-level Agility with Quadruped Robots | | 0
Active Evaluation Acquisition for Efficient LLM Benchmarking | | 0
AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering | | 0
FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding | | 0
Few-Shot Defect Segmentation Leveraging Abundant Normal Training Samples Through Normal Background Regularization and Crop-and-Paste Operation | | 0
Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | | 0
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures | | 0
BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali | | 0
Page 46 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified