SOTAVerified

Benchmarking

Papers

Showing 2351-2400 of 5548 papers

Title | Status | Hype
Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings |  | 0
Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features |  | 0
Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification | Code | 0
The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways |  | 0
Lessons From Red Teaming 100 Generative AI Products |  | 0
Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks | Code | 0
Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles |  | 0
Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI |  | 0
Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure |  | 0
Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks | Code | 0
Benchmarking Rotary Position Embeddings for Automatic Speech Recognition |  | 0
Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models |  | 0
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning |  | 0
CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing |  | 0
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation |  | 0
AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI |  | 0
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0
Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization |  | 0
An Analysis of Model Robustness across Concurrent Distribution Shifts |  | 0
Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks |  | 0
Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices |  | 0
Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding |  | 0
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input |  | 0
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models |  | 0
Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks | Code | 0
ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation | Code | 0
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture |  | 0
AI-Powered Cow Detection in Complex Farm Environments |  | 0
PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents |  | 0
Benchmarking Constraint-Based Bayesian Structure Learning Algorithms: Role of Network Topology |  | 0
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings |  | 0
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery | Code | 0
TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer |  | 0
MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception |  | 0
State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects |  | 0
RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions | Code | 0
CroCoDL: Cross-device Collaborative Dataset for Localization |  | 0
Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models |  | 0
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools |  | 0
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices |  | 0
Segmenting Maxillofacial Structures in CBCT Volumes |  | 0
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback |  | 0
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation |  | 0
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation |  | 0
On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds |  | 0
Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries |  | 0
A review of faithfulness metrics for hallucination assessment in Large Language Models |  | 0
AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects |  | 0
Measuring Large Language Models Capacity to Annotate Journalistic Sourcing |  | 0
SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified