SOTAVerified

Benchmarking

Papers

Showing 926–950 of 5548 papers

Title | Status | Hype
HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims | Code | 1
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | - | 0
Knowledge-aware contrastive heterogeneous molecular graph learning | - | 0
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models | - | 0
Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | - | 0
Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption | - | 0
Ansatz-free Hamiltonian learning with Heisenberg-limited scaling | - | 0
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Code | 0
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking | - | 0
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | - | 0
Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | - | 0
User Profile with Large Language Models: Construction, Updating, and Benchmarking | - | 0
Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow | - | 0
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing | Code | 0
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | - | 0
Benchmarking the rationality of AI decision making using the transitivity axiom | - | 0
Forecasting time series with constraints | Code | 0
A Survey on LLM-based News Recommender Systems | - | 0
AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit | - | 0
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | - | 0
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Code | 1
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis | - | 0
Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation | - | 0
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | - | 0
Zero-shot generation of synthetic neurosurgical data with large language models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified