Benchmarking

Papers

Title	Date	Tasks	Status	Hype
HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images	Nov 7, 2024	Anatomy, Benchmarking	Unverified	0
Perspective on recent developments and challenges in regulatory and systems genomics	Nov 7, 2024	Benchmarking	Unverified	0
HourVideo: 1-Hour Video-Language Understanding	Nov 7, 2024	Benchmarking, counterfactual	Code Available	2
Learn to Solve Vehicle Routing Problems ASAP: A Neural Optimization Approach for Time-Constrained Vehicle Routing Problems with Finite Vehicle Fleet	Nov 7, 2024	Benchmarking, Combinatorial Optimization	Unverified	0
Benchmarking Large Language Models with Integer Sequence Generation Tasks	Nov 7, 2024	Benchmarking, Computational Efficiency	Unverified	0
Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking	Nov 6, 2024	Benchmarking	Unverified	0
Beemo: Benchmark of Expert-edited Machine-generated Outputs	Nov 6, 2024	Benchmarking	Code Available	0
SPINEX_ Symbolic Regression: Similarity-based Symbolic Regression with Explainable Neighbors Exploration	Nov 5, 2024	Benchmarking, regression	Unverified	0
TDDBench: A Benchmark for Training data detection	Nov 5, 2024	Benchmarking, Computational Efficiency	Unverified	0
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset	Nov 5, 2024	Benchmarking, Language Modeling	Code Available	1
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level	Nov 5, 2024	Bayesian Optimisation, Benchmarking	Unverified	0
On the Loss of Context-awareness in General Instruction Fine-tuning	Nov 5, 2024	Benchmarking, Instruction Following	Code Available	0
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent	Nov 5, 2024	Benchmarking, Hallucination	Code Available	3
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping	Nov 5, 2024	Benchmarking, Code Generation	Code Available	2
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks	Nov 4, 2024	Action Generation, Benchmarking	Code Available	1
Imagining and building wise machines: The centrality of AI metacognition	Nov 4, 2024	Benchmarking, Navigate	Unverified	0
Benchmarking XAI Explanations with Human-Aligned Evaluations	Nov 4, 2024	Benchmarking	Unverified	0
LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation	Nov 4, 2024	Benchmarking, Graph Generation	Code Available	1
TableGPT2: A Large Multimodal Model with Tabular Data Integration	Nov 4, 2024	Benchmarking, Data Integration	Code Available	4
ROAD-Waymo: Action Awareness at Scale for Autonomous Driving	Nov 3, 2024	Autonomous Driving, Benchmarking	Code Available	1
SinaTools: Open Source Toolkit for Arabic Natural Language Processing	Nov 3, 2024	Benchmarking, Lemmatization	Unverified	0
FEET: A Framework for Evaluating Embedding Techniques	Nov 2, 2024	Benchmarking, Representation Learning	Code Available	0
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models	Nov 2, 2024	Benchmarking	Unverified	0
Artificial Intelligence for Microbiology and Microbiome Research	Nov 2, 2024	Benchmarking, Deep Learning	Unverified	0
A Review of Reinforcement Learning in Financial Applications	Nov 1, 2024	Benchmarking, Decision Making	Unverified	0
Modern, Efficient, and Differentiable Transport Equation Models using JAX: Applications to Population Balance Equations	Nov 1, 2024	Benchmarking, Computational Efficiency	Unverified	0
Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model	Nov 1, 2024	Benchmarking, Cross-Domain Named Entity Recognition	Unverified	0
MIRFLEX: Music Information Retrieval Feature Library for Extraction	Nov 1, 2024	Benchmarking, Information Retrieval	Code Available	1
Benchmarking Bias in Large Language Models during Role-Playing	Nov 1, 2024	Benchmarking, Fairness	Unverified	0
Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing	Nov 1, 2024	Benchmarking, Semantic Segmentation	Code Available	0
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models	Nov 1, 2024	Benchmarking, Mixture-of-Experts	Code Available	1
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators	Oct 31, 2024	Benchmarking, Text Generation	Code Available	2
IdeaBench: Benchmarking Large Language Models for Research Idea Generation	Oct 31, 2024	Benchmarking, scientific discovery	Code Available	0
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction	Oct 31, 2024	Benchmarking, Prediction	Code Available	1
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking	Oct 31, 2024	Benchmarking, Imputation	Code Available	1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography	Oct 31, 2024	Benchmarking, Electromyography (EMG)	Code Available	1
Benchmark Data Repositories for Better Benchmarking	Oct 31, 2024	Benchmarking	Unverified	0
XRDSLAM: A Flexible and Modular Framework for Deep Learning based SLAM	Oct 31, 2024	3DGS, Benchmarking	Code Available	3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	Oct 31, 2024	Benchmarking	Code Available	3
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios	Oct 31, 2024	Benchmarking, LLM-generated Text Detection	Code Available	1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery	Oct 31, 2024	Benchmarking, Cloud Removal	Code Available	1
CALE: Continuous Arcade Learning Environment	Oct 31, 2024	Atari Games, Benchmarking	Code Available	7
Low-Density 3D Point Cloud Classification	Oct 30, 2024	3D Point Cloud Classification, Autonomous Driving	Unverified	0
Survey of Cultural Awareness in Language Models: Text and Beyond	Oct 30, 2024	Benchmarking	Code Available	1
NCAdapt: Dynamic adaptation with domain-specific Neural Cellular Automata for continual hippocampus segmentation	Oct 30, 2024	Benchmarking, Continual Learning	Code Available	0
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning	Oct 30, 2024	Benchmarking, Hallucination	Unverified	0
DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes	Oct 30, 2024	Benchmarking	Unverified	0
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models	Oct 30, 2024	Benchmarking	Code Available	2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation	Oct 30, 2024	Benchmarking, Passage Retrieval	Code Available	2
Evaluating Cultural and Social Awareness of LLM Web Agents	Oct 30, 2024	Benchmarking, Navigate	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified