Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 551–600 of 5548 papers

Title	Date	Tasks	Status	Hype
Generative CKM Construction using Partially Observed Data with Diffusion Model	Dec 19, 2024	Benchmarking	CodeCode Available	1
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning	Dec 18, 2024	BenchmarkingGraph Learning	CodeCode Available	1
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment	Dec 18, 2024	BenchmarkingRAG	CodeCode Available	1
Autonomous Microscopy Experiments through Large Language Model Agents	Dec 18, 2024	BenchmarkingExperimental Design	CodeCode Available	1
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks	Dec 18, 2024	Benchmarking	CodeCode Available	1
MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation	Dec 16, 2024	AllBenchmarking	CodeCode Available	1
CharacterBench: Benchmarking Character Customization of Large Language Models	Dec 16, 2024	Benchmarking	CodeCode Available	1
AD-LLM: Benchmarking Large Language Models for Anomaly Detection	Dec 15, 2024	Anomaly DetectionBenchmarking	CodeCode Available	1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning	Dec 11, 2024	AttributeBenchmarking	CodeCode Available	1
Multi-Behavior Recommendation with Personalized Directed Acyclic Behavior Graphs	Dec 9, 2024	BenchmarkingComputational Efficiency	CodeCode Available	1
PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems	Dec 9, 2024	BenchmarkingPrediction	CodeCode Available	1
Does your model understand genes? A benchmark of gene properties for biological and text models	Dec 5, 2024	BenchmarkingMulti-class Classification	CodeCode Available	1
Grounding Descriptions in Images informs Zero-Shot Visual Recognition	Dec 5, 2024	AttributeBenchmarking	CodeCode Available	1
Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"	Dec 2, 2024	BenchmarkingRepresentation Learning	CodeCode Available	1
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis	Nov 29, 2024	BenchmarkingClaim Verification	CodeCode Available	1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning	Nov 29, 2024	BenchmarkingDeepFake Detection	CodeCode Available	1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models	Nov 27, 2024	BenchmarkingEarth Observation	CodeCode Available	1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM	Nov 26, 2024	BenchmarkingText-to-Video Generation	CodeCode Available	1
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs	Nov 25, 2024	BenchmarkingHallucination	CodeCode Available	1
Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks	Nov 25, 2024	Benchmarkingobject-detection	CodeCode Available	1
StackEval: Benchmarking LLMs in Coding Assistance	Nov 21, 2024	Benchmarking	CodeCode Available	1
Multi-Agent Environments for Vehicle Routing Problems	Nov 21, 2024	Benchmarkingreinforcement-learning	CodeCode Available	1
DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models	Nov 19, 2024	BenchmarkingDeep Learning	CodeCode Available	1
Introducing Milabench: Benchmarking Accelerators for AI	Nov 18, 2024	BenchmarkingDeep Learning	CodeCode Available	1
FM-TS: Flow Matching for Time Series Generation	Nov 12, 2024	BenchmarkingImputation	CodeCode Available	1
Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification	Nov 11, 2024	BenchmarkingImage Segmentation	CodeCode Available	1
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset	Nov 5, 2024	BenchmarkingLanguage Modeling	CodeCode Available	1
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks	Nov 4, 2024	Action GenerationBenchmarking	CodeCode Available	1
LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation	Nov 4, 2024	BenchmarkingGraph Generation	CodeCode Available	1
ROAD-Waymo: Action Awareness at Scale for Autonomous Driving	Nov 3, 2024	Autonomous DrivingBenchmarking	CodeCode Available	1
MIRFLEX: Music Information Retrieval Feature Library for Extraction	Nov 1, 2024	BenchmarkingInformation Retrieval	CodeCode Available	1
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models	Nov 1, 2024	BenchmarkingMixture-of-Experts	CodeCode Available	1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery	Oct 31, 2024	BenchmarkingCloud Removal	CodeCode Available	1
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking	Oct 31, 2024	BenchmarkingImputation	CodeCode Available	1
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios	Oct 31, 2024	BenchmarkingLLM-generated Text Detection	CodeCode Available	1
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction	Oct 31, 2024	BenchmarkingPrediction	CodeCode Available	1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography	Oct 31, 2024	BenchmarkingElectromyography (EMG)	CodeCode Available	1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems	Oct 30, 2024	BenchmarkingManagement	CodeCode Available	1
Survey of Cultural Awareness in Language Models: Text and Beyond	Oct 30, 2024	Benchmarking	CodeCode Available	1
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment	Oct 28, 2024	BenchmarkingLanguage Modeling	CodeCode Available	1
SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance	Oct 27, 2024	BenchmarkingCode Generation	CodeCode Available	1
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios	Oct 25, 2024	BenchmarkingDiversity	CodeCode Available	1
Benchmarking Multi-Scene Fire and Smoke Detection	Oct 22, 2024	Benchmarking	CodeCode Available	1
Comprehensive benchmarking of large language models for RNA secondary structure prediction	Oct 21, 2024	Benchmarking	CodeCode Available	1
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems	Oct 18, 2024	BenchmarkingQuestion Answering	CodeCode Available	1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments	Oct 18, 2024	Autonomous NavigationBenchmarking	CodeCode Available	1
Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all	Oct 17, 2024	AllBenchmarking	CodeCode Available	1
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation	Oct 16, 2024	BenchmarkingFairness	CodeCode Available	1
RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation	Oct 15, 2024	BenchmarkingInteractive Segmentation	CodeCode Available	1
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	Oct 14, 2024	2kBenchmarking	CodeCode Available	1

Show:10 25 50

← PrevPage 12 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified