SOTAVerified

Benchmarking

Papers

Showing 5401-5450 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | | 0 |
| Exploration of TPUs for AI Applications | | 0 |
| Exploring and Benchmarking the Planning Capabilities of Large Language Models | | 0 |
| Exploring Capabilities of Time Series Foundation Models in Building Analytics | | 0 |
| A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management | | 0 |
| Exploring Continual Learning of Diffusion Models | | 0 |
| Capsa: A Unified Framework for Quantifying Risk in Deep Neural Networks | | 0 |
| CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era | | 0 |
| Airport Capacity and Performance in Europe -- A study of transport economics, service quality and sustainability | | 0 |
| Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | | 0 |
| Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | | 0 |
| Exploring the Adversarial Frontier: Quantifying Robustness via Adversarial Hypervolume | | 0 |
| Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | | 0 |
| Visual Attention on the Sun: What Do Existing Models Actually Predict? | | 0 |
| Exploring Thermography Technology: A Comprehensive Facial Dataset for Face Detection, Recognition, and Emotion | | 0 |
| Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | | 0 |
| Can time series forecasting be automated? A benchmark and analysis | | 0 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | | 0 |
| Can Machines “Learn” Halide Perovskite Crystal Formation without Accurate Physicochemical Features? | | 0 |
| Extended Labeled Faces in-the-Wild (ELFW): Augmenting Classes for Face Segmentation | | 0 |
| Extensible Logging and Empirical Attainment Function for IOHexperimenter | | 0 |
| Extraction of clinical information from the non-invasive fetal electrocardiogram | | 0 |
| Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis | | 0 |
| ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content | | 0 |
| Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning | | 0 |
| Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates | | 0 |
| Face Detection on Surveillance Images | | 0 |
| Face Morphing Attack Generation & Detection: A Comprehensive Survey | | 0 |
| FACT: Learning Governing Abstractions Behind Integer Sequences | | 0 |
| FactLens: Benchmarking Fine-Grained Fact Verification | | 0 |
| Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations | | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | | 0 |
| TDDBench: A Benchmark for Training data detection | | 0 |
| A Normative Framework for Benchmarking Consumer Fairness in Large Language Model Recommender System | | 0 |
| FAIRification of MLC data | | 0 |
| Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | | 0 |
| FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs | | 0 |
| Fairness-Aware Graph Neural Networks: A Survey | | 0 |
| Fairness Index Measures to Evaluate Bias in Biometric Recognition | | 0 |
| TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs | | 0 |
| FakeWatch ElectionShield: A Benchmarking Framework to Detect Fake News for Credible US Elections | | 0 |
| TeamTrack: A Dataset for Multi-Sport Multi-Object Tracking in Full-pitch Videos | | 0 |
| Teaspoon: A comprehensive python package for topological signal processing | | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | | 0 |
| Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension | | 0 |
| Can Language Models Serve as Text-Based World Simulators? | | 0 |
| Fantastic Questions and Where to Find Them: FairytaleQA – An Authentic Dataset for Narrative Comprehension | | 0 |
| FarsBase-KBP: A Knowledge Base Population System for the Persian Knowledge Graph | | 0 |
| Can humans help BERT gain "confidence"? | | 0 |
| Technical report of a DMD-based Characterization Method for Vision Sensors | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |