Benchmarking

Papers


Showing 901–950 of 5548 papers

Title	Date	Tasks	Status	Hype
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis	Feb 20, 2025	Age Estimation, Benchmarking	Code Available	2
PredictaBoard: Benchmarking LLM Score Predictability	Feb 20, 2025	Benchmarking, Common Sense Reasoning	Code Available	0
Synthetic Porous Microstructures: Automatic Design, Simulation, and Permeability Analysis	Feb 20, 2025	Benchmarking	Code Available	0
Position: There are no Champions in Long-Term Time Series Forecasting	Feb 19, 2025	Benchmarking, Position	Unverified	0
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking	Feb 19, 2025	Benchmarking	Unverified	0
Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification	Feb 19, 2025	Benchmarking	Unverified	0
A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior	Feb 19, 2025	Benchmarking, Misinformation	Unverified	0
Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction	Feb 19, 2025	Benchmarking, MRI Reconstruction	Code Available	0
Benchmarking LLMs for Political Science: A United Nations Perspective	Feb 19, 2025	Benchmarking, Decision Making	Code Available	1
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare	Feb 19, 2025	Benchmarking, Diversity	Unverified	0
Multilingual European Language Models: Benchmarking Approaches and Challenges	Feb 18, 2025	Benchmarking, Question Answering	Unverified	0
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models	Feb 18, 2025	Benchmarking, Large Language Model	Unverified	0
A deep learning framework for efficient pathology image analysis	Feb 18, 2025	Benchmarking, Deep Learning	Code Available	4
Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics	Feb 18, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation	Feb 18, 2025	Benchmarking	Unverified	0
LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation	Feb 18, 2025	Benchmarking, Text Generation	Unverified	0
EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking	Feb 18, 2025	Benchmarking, Binary Classification	Unverified	0
Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope?	Feb 18, 2025	Benchmarking, Blocking	Code Available	1
Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis	Feb 18, 2025	Benchmarking, Mamba	Code Available	0
A new pathway to generative artificial intelligence by minimizing the maximum entropy	Feb 18, 2025	Benchmarking	Unverified	0
Benchmarking MedMNIST dataset on real quantum hardware	Feb 18, 2025	Benchmarking, image-classification	Unverified	0
Positional Encoding in Transformer-Based Time Series Models: A Survey	Feb 17, 2025	Anomaly Detection, Benchmarking	Code Available	1
Integrating Expert Knowledge into Logical Programs via LLMs	Feb 17, 2025	Benchmarking, Logical Reasoning	Code Available	0
Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics	Feb 17, 2025	Benchmarking, Diagnostic	Unverified	0
ILIAS: Instance-Level Image retrieval At Scale	Feb 17, 2025	Benchmarking, Image Retrieval	Code Available	1
HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims	Feb 17, 2025	Benchmarking, Fact Checking	Code Available	1
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance	Feb 17, 2025	Benchmarking, Dependency Parsing	Unverified	0
Knowledge-aware contrastive heterogeneous molecular graph learning	Feb 17, 2025	Benchmarking, Contrastive Learning	Unverified	0
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models	Feb 17, 2025	Benchmarking	Unverified	0
Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment	Feb 17, 2025	Benchmarking, Common Sense Reasoning	Unverified	0
Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption	Feb 17, 2025	Benchmarking, Code Summarization	Unverified	0
Ansatz-free Hamiltonian learning with Heisenberg-limited scaling	Feb 17, 2025	Benchmarking	Unverified	0
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards	Feb 16, 2025	Benchmarking, GPU	Code Available	0
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking	Feb 16, 2025	Benchmarking	Unverified	0
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs	Feb 16, 2025	Benchmarking	Unverified	0
Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support	Feb 15, 2025	Benchmarking, Epidemiology	Unverified	0
User Profile with Large Language Models: Construction, Updating, and Benchmarking	Feb 15, 2025	Benchmarking, Profile Generation	Unverified	0
Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow	Feb 14, 2025	Benchmarking	Unverified	0
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing	Feb 14, 2025	Benchmarking, RAG	Code Available	0
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?	Feb 14, 2025	Benchmarking, In-Context Learning	Unverified	0
Benchmarking the rationality of AI decision making using the transitivity axiom	Feb 14, 2025	Benchmarking, Decision Making	Unverified	0
Forecasting time series with constraints	Feb 14, 2025	Additive models, Benchmarking	Code Available	0
A Survey on LLM-based News Recommender Systems	Feb 13, 2025	Benchmarking, Fairness	Unverified	0
AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit	Feb 13, 2025	Benchmarking, Edge-computing	Unverified	0
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency	Feb 13, 2025	Benchmarking, Math	Unverified	0
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs	Feb 13, 2025	Benchmarking, Retrieval	Code Available	1
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis	Feb 13, 2025	Benchmarking	Unverified	0
Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation	Feb 13, 2025	Benchmarking	Unverified	0
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents	Feb 13, 2025	Benchmarking	Unverified	0
Zero-shot generation of synthetic neurosurgical data with large language models	Feb 13, 2025	Benchmarking, De-identification	Code Available	0



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified