SOTAVerified

Benchmarking

Papers

Showing 2251–2300 of 5548 papers

Title | Status | Hype
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | Code | 3
Benchmarking Mobile Device Control Agents across Diverse Configurations | | 0
ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees | Code | 0
Empirical Analysis of the Dynamic Binary Value Problem with IOHprofiler | | 0
SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data | Code | 1
ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | Code | 1
DPO: A Differential and Pointwise Control Approach to Reinforcement Learning | | 0
Importance of Disjoint Sampling in Conventional and Transformer Models for Hyperspectral Image Classification | Code | 0
Experimental Validation of Ultrasound Beamforming with End-to-End Deep Learning for Single Plane Wave Imaging | Code | 1
Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches | | 0
The Adversarial AI-Art: Understanding, Generation, Detection, and Benchmarking | | 0
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
EnzChemRED, a rich enzyme chemistry relation extraction dataset | | 0
Open Datasets for Satellite Radio Resource Control | | 0
TAVGBench: Benchmarking Text to Audible-Video Generation | Code | 1
TeamTrack: A Dataset for Multi-Sport Multi-Object Tracking in Full-pitch Videos | | 0
In-situ process monitoring and adaptive quality enhancement in laser additive manufacturing: a critical review | | 0
Authentic Emotion Mapping: Benchmarking Facial Expressions in Real News | Code | 0
Bridging the Gap Between Theory and Practice: Benchmarking Transfer Evolutionary Optimization | | 0
DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection | Code | 3
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases | Code | 3
Integrated Sensing and Communication enabled Multiple Base Stations Cooperative UAV Detection | | 0
REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking | Code | 1
Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning | | 0
Environment-aware UAV Communications: CKM Construction and Predictive Beamforming | | 0
How to Benchmark Vision Foundation Models for Semantic Segmentation? | Code | 1
LongEmbed: Extending Embedding Models for Long Context Retrieval | Code | 2
Neural Network Approach for Non-Markovian Dissipative Dynamics of Many-Body Open Quantum Systems | | 0
Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions | | 0
VBR: A Vision Benchmark in Rome | Code | 2
Benchmarking changepoint detection algorithms on cardiac time series | | 0
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | | 0
Data Collection of Real-Life Knowledge Work in Context: The RLKWiC Dataset | | 0
Iterated Invariant Extended Kalman Filter (IterIEKF) | | 0
Neuromorphic Vision-based Motion Segmentation with Graph Transformer Neural Network | | 0
Revealing data leakage in protein interaction benchmarks | Code | 2
Second Edition FRCSyn Challenge at CVPR 2024: Face Recognition Challenge in the Era of Synthetic Data | Code | 1
LLM Evaluators Recognize and Favor Their Own Generations | | 0
Feature selection in linear SVMs via a hard cardinality constraint: a scalable SDP decomposition approach | | 0
A Universal Protocol to Benchmark Camera Calibration for Sports | | 0
A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting | Code | 0
nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation | Code | 1
A Large-Scale Evaluation of Speech Foundation Models | | 0
MMInA: Benchmarking Multihop Multimodal Internet Agents | | 0
MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Code | 1
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Code | 1
A Review and Efficient Implementation of Scene Graph Generation Metrics | Code | 1
AMPCliff: quantitative definition and benchmarking of activity cliffs in antimicrobial peptides | Code | 0
RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified