SOTAVerified

Benchmarking

Papers

Showing 901-950 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2 |
| PredictaBoard: Benchmarking LLM Score Predictability | Code | 0 |
| Synthetic Porous Microstructures: Automatic Design, Simulation, and Permeability Analysis | Code | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | | 0 |
| GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | | 0 |
| Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification | | 0 |
| A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior | | 0 |
| Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction | Code | 0 |
| Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | | 0 |
| Multilingual European Language Models: Benchmarking Approaches and Challenges | | 0 |
| STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models | | 0 |
| A deep learning framework for efficient pathology image analysis | Code | 4 |
| Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics | | 0 |
| Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | | 0 |
| LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation | | 0 |
| EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking | | 0 |
| Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope? | Code | 1 |
| Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis | Code | 0 |
| A new pathway to generative artificial intelligence by minimizing the maximum entropy | | 0 |
| Benchmarking MedMNIST dataset on real quantum hardware | | 0 |
| Positional Encoding in Transformer-Based Time Series Models: A Survey | Code | 1 |
| Integrating Expert Knowledge into Logical Programs via LLMs | Code | 0 |
| Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics | | 0 |
| ILIAS: Instance-Level Image retrieval At Scale | Code | 1 |
| HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims | Code | 1 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | | 0 |
| Knowledge-aware contrastive heterogeneous molecular graph learning | | 0 |
| Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models | | 0 |
| Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | | 0 |
| Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption | | 0 |
| Ansatz-free Hamiltonian learning with Heisenberg-limited scaling | | 0 |
| JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Code | 0 |
| TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking | | 0 |
| Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | | 0 |
| User Profile with Large Language Models: Construction, Updating, and Benchmarking | | 0 |
| Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow | | 0 |
| LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing | Code | 0 |
| MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | | 0 |
| Benchmarking the rationality of AI decision making using the transitivity axiom | | 0 |
| Forecasting time series with constraints | Code | 0 |
| A Survey on LLM-based News Recommender Systems | | 0 |
| AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit | | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | | 0 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Code | 1 |
| Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis | | 0 |
| Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation | | 0 |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | | 0 |
| Zero-shot generation of synthetic neurosurgical data with large language models | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |