Benchmarking

Papers

Showing 1501–1525 of 5548 papers

Title	Date	Tasks	Status	Hype
When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning	Oct 11, 2024	Attribute, Benchmarking	Code Available	1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation	Oct 11, 2024	Benchmarking, Image Segmentation	Code Available	1
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks	Oct 11, 2024	Benchmarking, Language Modeling	Unverified	0
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment	Oct 11, 2024	Benchmarking, Reinforcement Learning (RL)	Unverified	0
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation	Oct 11, 2024	Benchmarking, Sentence	Unverified	0
Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example	Oct 11, 2024	Benchmarking, Code Generation	Unverified	0
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations	Oct 10, 2024	Benchmarking, Decision Making	Code Available	0
Identifying Money Laundering Subgraphs on the Blockchain	Oct 10, 2024	Benchmarking	Code Available	0
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act	Oct 10, 2024	Benchmarking, Fairness	Code Available	2
Benchmarking Agentic Workflow Generation	Oct 10, 2024	Benchmarking	Code Available	2
Audio Explanation Synthesis with Generative Foundation Models	Oct 10, 2024	Benchmarking, Decision Making	Code Available	0
Advocating Character Error Rate for Multilingual ASR Evaluation	Oct 9, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning	Oct 9, 2024	Benchmarking, Fairness	Code Available	0
Towards Generalisable Time Series Understanding Across Domains	Oct 9, 2024	Benchmarking, Time Series	Code Available	1
Analysis of different disparity estimation techniques on aerial stereo image datasets	Oct 9, 2024	Benchmarking, Depth Estimation	Unverified	0
OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB	Oct 9, 2024	Benchmarking, Diversity	Unverified	0
TuringQ: Benchmarking AI Comprehension in Theory of Computation	Oct 9, 2024	Benchmarking	Code Available	0
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding	Oct 9, 2024	Benchmarking, Instruction Following	Unverified	0
InAttention: Linear Context Scaling for Transformers	Oct 9, 2024	Benchmarking, Decoder	Unverified	0
M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes	Oct 9, 2024	Benchmarking, Motion Generation	Unverified	0
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond	Oct 9, 2024	Benchmarking	Code Available	2
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making	Oct 9, 2024	Benchmarking, Decision Making	Code Available	3
FedGraph: A Research Library and Benchmark for Federated Graph Learning	Oct 8, 2024	Benchmarking, Federated Learning	Code Available	2
Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective	Oct 8, 2024	Attribute, Benchmarking	Code Available	1
Manual Verbalizer Enrichment for Few-Shot Text Classification	Oct 8, 2024	Benchmarking, Classification	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified