Benchmarking

Papers


Showing 2226–2250 of 5548 papers

Title	Date	Tasks	Status
Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics	Feb 18, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation	Feb 18, 2025	Benchmarking	Unverified
A new pathway to generative artificial intelligence by minimizing the maximum entropy	Feb 18, 2025	Benchmarking	Unverified
Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis	Feb 18, 2025	Benchmarking, Mamba	Code Available
Multilingual European Language Models: Benchmarking Approaches and Challenges	Feb 18, 2025	Benchmarking, Question Answering	Unverified
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models	Feb 18, 2025	Benchmarking, Large Language Model	Unverified
Benchmarking MedMNIST dataset on real quantum hardware	Feb 18, 2025	Benchmarking, image-classification	Unverified
LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation	Feb 18, 2025	Benchmarking, Text Generation	Unverified
Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption	Feb 17, 2025	Benchmarking, Code Summarization	Unverified
Ansatz-free Hamiltonian learning with Heisenberg-limited scaling	Feb 17, 2025	Benchmarking	Unverified
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models	Feb 17, 2025	Benchmarking	Unverified
Knowledge-aware contrastive heterogeneous molecular graph learning	Feb 17, 2025	Benchmarking, Contrastive Learning	Unverified
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance	Feb 17, 2025	Benchmarking, Dependency Parsing	Unverified
Integrating Expert Knowledge into Logical Programs via LLMs	Feb 17, 2025	Benchmarking, Logical Reasoning	Code Available
Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment	Feb 17, 2025	Benchmarking, Common Sense Reasoning	Unverified
Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics	Feb 17, 2025	Benchmarking, Diagnostic	Unverified
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards	Feb 16, 2025	Benchmarking, GPU	Code Available
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs	Feb 16, 2025	Benchmarking	Unverified
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking	Feb 16, 2025	Benchmarking	Unverified
User Profile with Large Language Models: Construction, Updating, and Benchmarking	Feb 15, 2025	Benchmarking, Profile Generation	Unverified
Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support	Feb 15, 2025	Benchmarking, Epidemiology	Unverified
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing	Feb 14, 2025	Benchmarking, RAG	Code Available
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?	Feb 14, 2025	Benchmarking, In-Context Learning	Unverified
Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow	Feb 14, 2025	Benchmarking	Unverified
Benchmarking the rationality of AI decision making using the transitivity axiom	Feb 14, 2025	Benchmarking, Decision Making	Unverified



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified