Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 3726–3750 of 5548 papers

Title	Date	Tasks	Status
Progressive Multi-view Human Mesh Recovery with Self-Supervision	Dec 10, 2022	BenchmarkingDiversity	—Unverified
Progressive with Purpose: Guiding Progressive Inpainting DNNs through Context and Structure	Sep 21, 2022	BenchmarkingImage Inpainting	—Unverified
Projective simulation applied to the grid-world and the mountain-car problem	May 21, 2014	Benchmarkingreinforcement-learning	—Unverified
Project MPG: towards a generalized performance benchmark for LLM capabilities	Oct 28, 2024	BenchmarkingChatbot	—Unverified
Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study	Jan 25, 2025	Benchmarking	—Unverified
Prompting Scientific Names for Zero-Shot Species Recognition	Oct 15, 2023	BenchmarkingZero-Shot Learning	—Unverified
Prompt Sketching for Large Language Models	Nov 8, 2023	Arithmetic ReasoningBenchmarking	—Unverified
Proof of Humanity: A Multi-Layer Network Framework for Certifying Human-Originated Content in an AI-Dominated Internet	Apr 2, 2025	Benchmarking	—Unverified
Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning	Sep 25, 2024	BenchmarkingFormal Logic	—Unverified
ProtIR: Iterative Refinement between Retrievers and Predictors for Protein Function Annotation	Feb 10, 2024	BenchmarkingLanguage Modeling	—Unverified
Protocol for Executing and Benchmarking Eight Computational Doublet-Detection Methods in Single-Cell RNA Sequencing Data Analysis	Jan 21, 2021	Benchmarking	—Unverified
Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking	May 13, 2022	Benchmarkingreinforcement-learning	—Unverified
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding	Nov 7, 2024	BenchmarkingMultiple-choice	—Unverified
PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice	Feb 28, 2025	BenchmarkingDiagnostic	—Unverified
PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents	Jan 3, 2025	Benchmarking	—Unverified
Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms	Oct 11, 2023	BenchmarkingDenoising	—Unverified
Share, Collaborate, Benchmark: Advancing Travel Demand Research through rigorous open-source collaboration	Jun 9, 2023	BenchmarkingTime Series	—Unverified
PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation	Sep 4, 2024	Benchmarking	—Unverified
Pulse Shape-Aided Multipath Delay Estimation for Fine-Grained WiFi Sensing	Jun 27, 2023	Benchmarking	—Unverified
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension	Dec 16, 2024	BenchmarkingImage Captioning	—Unverified
Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models	Dec 30, 2023	Benchmarkingimage-classification	—Unverified
Pushing the Frontiers of Unconstrained Face Detection and Recognition: IARPA Janus Benchmark A	Jun 1, 2015	BenchmarkingFace Detection	—Unverified
PySTACHIO: Python Single-molecule TrAcking stoiCHiometry Intensity and simulatiOn, a flexible, extensible, beginner-friendly and optimized program for analysis of single-molecule microscopy	Mar 18, 2021	Art AnalysisBenchmarking	—Unverified
Pythae: Unifying Generative Autoencoders in Python -- A Benchmarking Use Case	Jun 16, 2022	BenchmarkingDensity Estimation	—Unverified
Python Random Graph Generator	Sep 20, 2017	BenchmarkingGraph Generation	—Unverified

Show:10 25 50

← PrevPage 150 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified