Benchmarking

Papers


Showing 1901–1950 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking GNNs Using Lightning Network Data	Jul 5, 2024	Benchmarking	Unverified	0
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano	Jul 5, 2024	Attribute, Benchmarking	Unverified	0
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters	Jul 5, 2024	Benchmarking, valid	Code Available	1
Towards Stable 3D Object Detection	Jul 5, 2024	3D Object Detection, Autonomous Driving	Unverified	0
SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry	Jul 5, 2024	Benchmarking, object-detection	Code Available	2
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation	Jul 4, 2024	Benchmarking, Chatbot	Unverified	0
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments	Jul 4, 2024	Benchmarking, Minecraft	Code Available	2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition	Jul 4, 2024	Benchmarking, Instruction Following	Code Available	2
Benchmark on Drug Target Interaction Modeling from a Structure Perspective	Jul 4, 2024	Benchmarking, Drug Discovery	Code Available	1
Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms	Jul 3, 2024	Benchmarking, CPU	Unverified	0
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking	Jul 3, 2024	Benchmarking, Object	Code Available	1
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias	Jul 3, 2024	Benchmarking, Bias Detection	Code Available	0
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models	Jul 3, 2024	Benchmarking, Code Search	Code Available	2
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset	Jul 3, 2024	Benchmarking, Diversity	Code Available	1
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models	Jul 3, 2024	Benchmarking	Code Available	1
TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations	Jul 2, 2024	Benchmarking, text-to-speech	Unverified	0
Open foundation models for Azerbaijani language	Jul 2, 2024	Benchmarking	Unverified	0
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks	Jul 2, 2024	Activity Prediction, Anomaly Detection	Code Available	0
Occlusion-Aware Seamless Segmentation	Jul 2, 2024	Benchmarking, Domain Adaptation	Code Available	1
Modified CMA-ES Algorithm for Multi-Modal Optimization: Incorporating Niching Strategies and Dynamic Adaptation Mechanism	Jul 1, 2024	Benchmarking, Diversity	Unverified	0
MIRAI: Evaluating LLM Agents for Event Forecasting	Jul 1, 2024	Articles, Benchmarking	Unverified	0
Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy	Jul 1, 2024	Benchmarking	Unverified	0
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation	Jul 1, 2024	Benchmarking, RAG	Code Available	3
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents	Jul 1, 2024	Benchmarking	Code Available	1
ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions	Jul 1, 2024	Benchmarking, Question Generation	Unverified	0
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models	Jul 1, 2024	Benchmarking, Fairness	Code Available	2
EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting	Jul 1, 2024	3D Reconstruction, Benchmarking	Unverified	0
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations	Jul 1, 2024	Benchmarking, document understanding	Code Available	2
FineSurE: Fine-grained Summarization Evaluation using LLMs	Jul 1, 2024	Benchmarking, Hallucination	Code Available	1
Reinvestigating the R2 Indicator: Achieving Pareto Compliance by Integration	Jul 1, 2024	Benchmarking	Code Available	0
Benchmarking Predictive Coding Networks -- Made Simple	Jul 1, 2024	Benchmarking	Code Available	2
AI Agents That Matter	Jul 1, 2024	Benchmarking	Code Available	1
Overcoming Common Flaws in the Evaluation of Selective Classification Systems	Jul 1, 2024	Benchmarking, Classification	Code Available	1
Commute Graph Neural Networks	Jun 30, 2024	Benchmarking	Unverified	0
GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing	Jun 30, 2024	Benchmarking, counterfactual	Unverified	0
PerSEval: Assessing Personalization in Text Summarizers	Jun 29, 2024	Benchmarking, Human Judgment Correlation	Unverified	0
GraphArena: Benchmarking Large Language Models on Graph Computational Problems	Jun 29, 2024	Benchmarking, Hallucination	Code Available	1
iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities	Jun 27, 2024	Benchmarking	Code Available	1
Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges	Jun 27, 2024	Benchmarking, Clinical Knowledge	Unverified	0
Benchmarking M6 Competitors: An Analysis of Financial Metrics and Discussion of Incentives	Jun 27, 2024	Benchmarking	Unverified	0
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models	Jun 27, 2024	Attribute, Benchmarking	Code Available	2
Quantum-tunnelling deep neural network for optical illusion recognition	Jun 26, 2024	Autonomous Vehicles, Benchmarking	Unverified	0
Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI	Jun 26, 2024	Benchmarking, Crop Type Mapping	Unverified	0
XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis	Jun 26, 2024	Autonomous Driving, Benchmarking	Unverified	0
GenRL: Multimodal-foundation world models for generalization in embodied agents	Jun 26, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data	Jun 26, 2024	Benchmarking, Math	Code Available	2
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems	Jun 25, 2024	Benchmarking, RAG	Unverified	0
Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making	Jun 25, 2024	Benchmarking, Decision Making	Unverified	0
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection	Jun 25, 2024	Benchmarking, Prompt Learning	Code Available	1
SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It)	Jun 25, 2024	Benchmarking, Experimental Design	Code Available	1



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified