Benchmarking

Papers

Showing 1901–1925 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking GNNs Using Lightning Network Data	Jul 5, 2024	Benchmarking	Unverified	0
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters	Jul 5, 2024	Benchmarking, valid	Code Available	1
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano	Jul 5, 2024	Attribute, Benchmarking	Unverified	0
Towards Stable 3D Object Detection	Jul 5, 2024	3D Object Detection, Autonomous Driving	Unverified	0
SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry	Jul 5, 2024	Benchmarking, object-detection	Code Available	2
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation	Jul 4, 2024	Benchmarking, Chatbot	Unverified	0
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments	Jul 4, 2024	Benchmarking, Minecraft	Code Available	2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition	Jul 4, 2024	Benchmarking, Instruction Following	Code Available	2
Benchmark on Drug Target Interaction Modeling from a Structure Perspective	Jul 4, 2024	Benchmarking, Drug Discovery	Code Available	1
Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms	Jul 3, 2024	Benchmarking, CPU	Unverified	0
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking	Jul 3, 2024	Benchmarking, Object	Code Available	1
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias	Jul 3, 2024	Benchmarking, Bias Detection	Code Available	0
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models	Jul 3, 2024	Benchmarking, Code Search	Code Available	2
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models	Jul 3, 2024	Benchmarking	Code Available	1
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset	Jul 3, 2024	Benchmarking, Diversity	Code Available	1
TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations	Jul 2, 2024	Benchmarking, text-to-speech	Unverified	0
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks	Jul 2, 2024	Activity Prediction, Anomaly Detection	Code Available	0
Open foundation models for Azerbaijani language	Jul 2, 2024	Benchmarking	Unverified	0
Occlusion-Aware Seamless Segmentation	Jul 2, 2024	Benchmarking, Domain Adaptation	Code Available	1
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents	Jul 1, 2024	Benchmarking	Code Available	1
Modified CMA-ES Algorithm for Multi-Modal Optimization: Incorporating Niching Strategies and Dynamic Adaptation Mechanism	Jul 1, 2024	Benchmarking, Diversity	Unverified	0
Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy	Jul 1, 2024	Benchmarking	Unverified	0
MIRAI: Evaluating LLM Agents for Event Forecasting	Jul 1, 2024	Articles, Benchmarking	Unverified	0
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation	Jul 1, 2024	Benchmarking, RAG	Code Available	3
ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions	Jul 1, 2024	Benchmarking, Question Generation	Unverified	0


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified