Benchmarking

Papers

Showing 2251–2300 of 5548 papers

Title	Date	Tasks	Status
Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing	Jan 20, 2025	Benchmarking, Evolutionary Algorithms	Unverified
A CUDA-Based Real Parameter Optimization Benchmark	Jul 29, 2014	Benchmarking, CPU	Unverified
Beyond Text: A Deep Dive into Large Language Models' Ability on Understanding Graph Data	Oct 7, 2023	Benchmarking	Unverified
BEADs: Bias Evaluation Across Domains	Jun 6, 2024	Benchmarking, Bias Detection	Unverified
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency	Jan 30, 2025	Benchmarking, Language Modeling	Unverified
FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets	Oct 7, 2023	Benchmarking, named-entity-recognition	Unverified
FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance	Mar 7, 2025	Articles, Benchmarking	Unverified
Energy Models for Better Pseudo-Labels: Improving Semi-Supervised Classification with the 1-Laplacian Graph Energy	Jun 20, 2019	Benchmarking, Multi-class Classification	Unverified
Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages	May 12, 2022	Benchmarking, Diversity	Unverified
Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages	May 26, 2025	Benchmarking, Transliteration	Unverified
BEACON: A Benchmark for Efficient and Accurate Counting of Subgraphs	Apr 15, 2025	Benchmarking, Subgraph Counting	Unverified
FIMP: Foundation Model-Informed Message Passing for Graph Neural Networks	Oct 17, 2022	Benchmarking, Graph Neural Network	Unverified
FineText: Text Classification via Attention-based Language Model Fine-tuning	Oct 25, 2019	Benchmarking, Classification	Unverified
Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms	Mar 1, 2024	Benchmarking, Stochastic Optimization	Unverified
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems	Feb 20, 2025	Benchmarking, Decision Making	Unverified
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities	Oct 4, 2024	Benchmarking, counterfactual	Unverified
BBOB Instance Analysis: Landscape Properties and Algorithm Performance across Problem Instances	Nov 29, 2022	Benchmarking	Unverified
A Benchmark for Multi-speaker Anonymization	Jul 8, 2024	Benchmarking, Disentanglement	Unverified
FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking	Apr 2, 2025	3D Scene Reconstruction, Benchmarking	Unverified
A Modular Framework for Centrality and Clustering in Complex Networks	Nov 23, 2021	Benchmarking, Clustering	Unverified
Beyond Monocular Deraining: Stereo Image Deraining via Semantic Understanding	Aug 1, 2020	Benchmarking, Rain Removal	Unverified
Beyond Monocular Deraining: Parallel Stereo Deraining Network Via Semantic Prior	May 9, 2021	Benchmarking, Rain Removal	Unverified
Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study	May 23, 2020	Benchmarking, Network Pruning	Unverified
SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems	Aug 4, 2024	Benchmarking, Computational Efficiency	Unverified
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks	Jul 29, 2024	Benchmarking, Language Model Evaluation	Unverified
Bayesian Multi-type Mean Field Multi-agent Imitation Learning	Dec 1, 2020	Benchmarking, Imitation Learning	Unverified
A Bayesian Model for Bivariate Causal Inference	Dec 24, 2018	Benchmarking, Causal Inference	Unverified
Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding	Jul 1, 2022	Benchmarking	Unverified
Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding	Jan 16, 2022	Benchmarking	Unverified
Financial Numeric Extreme Labelling: A Dataset and Benchmarking for XBRL Tagging	Jun 6, 2023	Benchmarking, Sentence	Unverified
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada	Apr 1, 2021	Benchmarking, Language Identification	Unverified
AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving	Sep 12, 2023	Autonomous Driving, Benchmarking	Unverified
Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models	Apr 14, 2025	Benchmarking, Descriptive	Unverified
Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems	Mar 9, 2025	Benchmarking	Unverified
Finance Language Model Evaluation (FLaME)	Jun 18, 2025	Benchmarking, Language Model Evaluation	Unverified
Beyond Benchmarks: On The False Promise of AI Regulation	Jan 26, 2025	Benchmarking	Unverified
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models	Jul 10, 2024	Benchmarking	Unverified
Active Learning for Community Detection in Stochastic Block Models	May 8, 2016	Active Learning, Benchmarking	Unverified
Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark	Nov 23, 2021	Benchmarking, BIG-bench Machine Learning	Unverified
Fine-Grained Classification of Pedestrians in Video: Benchmark and State of the Art	May 20, 2016	Benchmarking, General Classification	Unverified
FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures	Jan 1, 2024	Benchmarking, Instance Segmentation	Unverified
Better Practices for Domain Adaptation	Sep 7, 2023	Benchmarking, Domain Adaptation	Unverified
Barkour: Benchmarking Animal-level Agility with Quadruped Robots	May 24, 2023	Benchmarking, Navigate	Unverified
Active Evaluation Acquisition for Efficient LLM Benchmarking	Oct 8, 2024	Benchmarking	Unverified
AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering	Jun 3, 2025	Benchmarking	Unverified
FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding	Nov 16, 2021	Benchmarking, Natural Language Understanding	Unverified
Few-Shot Defect Segmentation Leveraging Abundant Normal Training Samples Through Normal Background Regularization and Crop-and-Paste Operation	Jul 18, 2020	Anomaly Detection, Benchmarking	Unverified
Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers	Apr 2, 2025	Benchmarking, Management	Unverified
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures	Jun 6, 2025	Benchmarking, CPU	Unverified
BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali	Oct 16, 2023	Benchmarking, Data Augmentation	Unverified


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified