Benchmarking

Papers

Showing 1501–1550 of 5548 papers

Title	Date	Tasks	Status	Hype
When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning	Oct 11, 2024	Attribute, Benchmarking	Code Available	1
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation	Oct 11, 2024	Benchmarking, Sentence	Unverified	0
Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example	Oct 11, 2024	Benchmarking, Code Generation	Unverified	0
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment	Oct 11, 2024	Benchmarking, Reinforcement Learning (RL)	Unverified	0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks	Oct 11, 2024	Benchmarking, Language Modeling	Unverified	0
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation	Oct 11, 2024	Benchmarking, Image Segmentation	Code Available	1
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations	Oct 10, 2024	Benchmarking, Decision Making	Code Available	0
Identifying Money Laundering Subgraphs on the Blockchain	Oct 10, 2024	Benchmarking	Code Available	0
Benchmarking Agentic Workflow Generation	Oct 10, 2024	Benchmarking	Code Available	2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act	Oct 10, 2024	Benchmarking, Fairness	Code Available	2
Audio Explanation Synthesis with Generative Foundation Models	Oct 10, 2024	Benchmarking, Decision Making	Code Available	0
Advocating Character Error Rate for Multilingual ASR Evaluation	Oct 9, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning	Oct 9, 2024	Benchmarking, Fairness	Code Available	0
Towards Generalisable Time Series Understanding Across Domains	Oct 9, 2024	Benchmarking, Time Series	Code Available	1
Analysis of different disparity estimation techniques on aerial stereo image datasets	Oct 9, 2024	Benchmarking, Depth Estimation	Unverified	0
InAttention: Linear Context Scaling for Transformers	Oct 9, 2024	Benchmarking, Decoder	Unverified	0
OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB	Oct 9, 2024	Benchmarking, Diversity	Unverified	0
TuringQ: Benchmarking AI Comprehension in Theory of Computation	Oct 9, 2024	Benchmarking	Code Available	0
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding	Oct 9, 2024	Benchmarking, Instruction Following	Unverified	0
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond	Oct 9, 2024	Benchmarking	Code Available	2
M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes	Oct 9, 2024	Benchmarking, Motion Generation	Unverified	0
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making	Oct 9, 2024	Benchmarking, Decision Making	Code Available	3
Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective	Oct 8, 2024	Attribute, Benchmarking	Code Available	1
QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers	Oct 8, 2024	Benchmarking	Code Available	0
FedGraph: A Research Library and Benchmark for Federated Graph Learning	Oct 8, 2024	Benchmarking, Federated Learning	Code Available	2
Active Evaluation Acquisition for Efficient LLM Benchmarking	Oct 8, 2024	Benchmarking	Unverified	0
Manual Verbalizer Enrichment for Few-Shot Text Classification	Oct 8, 2024	Benchmarking, Classification	Unverified	0
Benchmarking of a new data splitting method on volcanic eruption data	Oct 8, 2024	Benchmarking	Unverified	0
Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems	Oct 7, 2024	Benchmarking, Machine Translation	Unverified	0
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild	Oct 7, 2024	Benchmarking, Mixture-of-Experts	Code Available	1
Rule-based Data Selection for Large Language Models	Oct 7, 2024	Benchmarking, Math	Unverified	0
Precise Model Benchmarking with Only a Few Observations	Oct 7, 2024	Benchmarking, model	Unverified	0
MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense	Oct 7, 2024	Adversarial Robustness, Benchmarking	Code Available	2
Named Clinical Entity Recognition Benchmark	Oct 7, 2024	Benchmarking, Decoder	Code Available	0
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models	Oct 7, 2024	Benchmarking, Segmentation	Code Available	0
Large Scale MRI Collection and Segmentation of Cirrhotic Liver	Oct 6, 2024	Benchmarking, Diagnostic	Code Available	1
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection	Oct 6, 2024	Benchmarking, Mathematical Reasoning	Unverified	0
dattri: A Library for Efficient Data Attribution	Oct 6, 2024	Benchmarking	Code Available	2
Adjusting Pretrained Backbones for Performativity	Oct 6, 2024	Benchmarking, Deep Learning	Code Available	0
Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends	Oct 5, 2024	Benchmarking, Chart Understanding	Unverified	0
PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms	Oct 5, 2024	Benchmarking, GPU	Unverified	0
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning	Oct 5, 2024	Benchmarking, Drug Design	Code Available	1
Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels	Oct 5, 2024	Benchmarking	Unverified	0
TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions	Oct 5, 2024	Benchmarking, Hallucination	Code Available	0
How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension	Oct 4, 2024	Benchmarking, Computational chemistry	Unverified	0
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities	Oct 4, 2024	Benchmarking, counterfactual	Unverified	0
Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices	Oct 4, 2024	Benchmarking, Language Modeling	Unverified	0
Benchmarking the Fidelity and Utility of Synthetic Relational Data	Oct 4, 2024	Benchmarking, Feature Importance	Unverified	0
PersoBench: Benchmarking Personalized Response Generation in Large Language Models	Oct 4, 2024	Benchmarking, Dialogue Generation	Code Available	0
Ward: Provable RAG Dataset Inference via LLM Watermarks	Oct 4, 2024	Benchmarking, RAG	Unverified	0


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified