SOTAVerified

Benchmarking

Papers

Showing 2201-2250 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries | Code | 0 |
| Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications | | 0 |
| On Neural Inertial Classification Networks for Pedestrian Activity Recognition | | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | | 0 |
| Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation | | 0 |
| Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis | | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | | 0 |
| Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Code | 0 |
| PredictaBoard: Benchmarking LLM Score Predictability | Code | 0 |
| Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | | 0 |
| Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | | 0 |
| Synthetic Porous Microstructures: Automatic Design, Simulation, and Permeability Analysis | Code | 0 |
| Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | | 0 |
| Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide | | 0 |
| Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | | 0 |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Code | 0 |
| Reinforcement Learning with Graph Attention for Routing and Wavelength Assignment with Lightpath Reuse | | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | | 0 |
| A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior | | 0 |
| Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction | Code | 0 |
| Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification | | 0 |
| GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | | 0 |
| EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking | | 0 |
| Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics | | 0 |
| Text2World: Benchmarking Large Language Models for Symbolic World Model Generation | | 0 |
| A new pathway to generative artificial intelligence by minimizing the maximum entropy | | 0 |
| Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis | Code | 0 |
| Multilingual European Language Models: Benchmarking Approaches and Challenges | | 0 |
| STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models | | 0 |
| Benchmarking MedMNIST dataset on real quantum hardware | | 0 |
| LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation | | 0 |
| Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption | | 0 |
| Ansatz-free Hamiltonian learning with Heisenberg-limited scaling | | 0 |
| Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models | | 0 |
| Knowledge-aware contrastive heterogeneous molecular graph learning | | 0 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | | 0 |
| Integrating Expert Knowledge into Logical Programs via LLMs | Code | 0 |
| Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | | 0 |
| Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics | | 0 |
| JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Code | 0 |
| Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | | 0 |
| TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking | | 0 |
| User Profile with Large Language Models: Construction, Updating, and Benchmarking | | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | | 0 |
| LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing | Code | 0 |
| MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | | 0 |
| Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow | | 0 |
| Benchmarking the rationality of AI decision making using the transitivity axiom | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |