Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 926–950 of 5548 papers

Title	Date	Tasks	Status	Hype
HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims	Feb 17, 2025	BenchmarkingFact Checking	CodeCode Available	1
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance	Feb 17, 2025	BenchmarkingDependency Parsing	—Unverified	0
Knowledge-aware contrastive heterogeneous molecular graph learning	Feb 17, 2025	BenchmarkingContrastive Learning	—Unverified	0
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models	Feb 17, 2025	Benchmarking	—Unverified	0
Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment	Feb 17, 2025	BenchmarkingCommon Sense Reasoning	—Unverified	0
Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption	Feb 17, 2025	BenchmarkingCode Summarization	—Unverified	0
Ansatz-free Hamiltonian learning with Heisenberg-limited scaling	Feb 17, 2025	Benchmarking	—Unverified	0
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards	Feb 16, 2025	BenchmarkingGPU	CodeCode Available	0
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking	Feb 16, 2025	Benchmarking	—Unverified	0
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs	Feb 16, 2025	Benchmarking	—Unverified	0
Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support	Feb 15, 2025	BenchmarkingEpidemiology	—Unverified	0
User Profile with Large Language Models: Construction, Updating, and Benchmarking	Feb 15, 2025	BenchmarkingProfile Generation	—Unverified	0
Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow	Feb 14, 2025	Benchmarking	—Unverified	0
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing	Feb 14, 2025	BenchmarkingRAG	CodeCode Available	0
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?	Feb 14, 2025	BenchmarkingIn-Context Learning	—Unverified	0
Benchmarking the rationality of AI decision making using the transitivity axiom	Feb 14, 2025	BenchmarkingDecision Making	—Unverified	0
Forecasting time series with constraints	Feb 14, 2025	Additive modelsBenchmarking	CodeCode Available	0
A Survey on LLM-based News Recommender Systems	Feb 13, 2025	BenchmarkingFairness	—Unverified	0
AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit	Feb 13, 2025	BenchmarkingEdge-computing	—Unverified	0
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency	Feb 13, 2025	BenchmarkingMath	—Unverified	0
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs	Feb 13, 2025	BenchmarkingRetrieval	CodeCode Available	1
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis	Feb 13, 2025	Benchmarking	—Unverified	0
Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation	Feb 13, 2025	Benchmarking	—Unverified	0
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents	Feb 13, 2025	Benchmarking	—Unverified	0
Zero-shot generation of synthetic neurosurgical data with large language models	Feb 13, 2025	BenchmarkingDe-identification	CodeCode Available	0

Show:10 25 50

← PrevPage 38 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified