SOTAVerified

Benchmarking

Papers

Showing 5251–5300 of 5548 papers

Title | Status | Hype
PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding | Code | 0
CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | Code | 0
Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2022 | Code | 0
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics | Code | 0
Aggregated Attributions for Explanatory Analysis of 3D Segmentation Models | Code | 0
A Position Paper on the Automatic Generation of Machine Learning Leaderboards | Code | 0
Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification | Code | 0
ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees | Code | 0
PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | Code | 0
Attribution of Predictive Uncertainties in Classification Models | Code | 0
Conformal Prediction: A Theoretical Note and Benchmarking Transductive Node Classification in Graphs | Code | 0
Agentic-HLS: An agentic reasoning based high-level synthesis system using large language models (AI for EDA workshop 2024) | Code | 0
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level | Code | 0
Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA | Code | 0
Custom Dual Transportation Mode Detection by Smartphone Devices Exploiting Sensor Diversity | Code | 0
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Code | 0
PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models | Code | 0
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants | Code | 0
CUDA-GHR: Controllable Unsupervised Domain Adaptation for Gaze and Head Redirection | Code | 0
Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels | Code | 0
Ants can orienteer a thief in their robbery | Code | 0
3DOS: Towards 3D Open Set Learning -- Benchmarking and Understanding Semantic Novelty Detection on Point Clouds | Code | 0
Benchmarking Generative Latent Variable Models for Speech | Code | 0
Benchmarking Generative AI Models for Deep Learning Test Input Generation | Code | 0
Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding | Code | 0
C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation | Code | 0
Performance Evaluation of Real-Time Object Detection for Electric Scooters | Code | 0
Benchmarking Framework for Performance-Evaluation of Causal Inference Analysis | Code | 0
A General Benchmarking Framework for Text Generation | Code | 0
Performance Modeling of Data Storage Systems using Generative Models | Code | 0
Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control | Code | 0
Vector-Based Data Improves Left-Right Eye-Tracking Classifier Performance After a Covariate Distributional Shift | Code | 0
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge | Code | 0
Periodic Extrapolative Generalisation in Neural Networks | Code | 0
Standardizing Structural Causal Models | Code | 0
Standard Vs Uniform Binary Search and Their Variants in Learned Static Indexing: The Case of the Searching on Sorted Data Benchmarking Software Platform | Code | 0
StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis | Code | 0
Benchmarking framework for machine learning classification from fNIRS data | Code | 0
PersoBench: Benchmarking Personalized Response Generation in Large Language Models | Code | 0
STA: Self-controlled Text Augmentation for Improving Text Classifications | Code | 0
Architecture Analysis and Benchmarking of 3D U-shaped Deep Learning Models for Thoracic Anatomical Segmentation | Code | 0
XCompress: LLM assisted Python-based text compression toolkit | Code | 0
A Framework for Generating Informative Benchmark Instances | Code | 0
What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? | Code | 0
Towards Robust Metrics for Concept Representation Evaluation | Code | 0
Statistical Multicriteria Evaluation of LLM-Generated Text | Code | 0
ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation | Code | 0
Answer Consolidation: Formulation and Benchmarking | Code | 0
A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches | Code | 0
A novel evaluation methodology for supervised Feature Ranking algorithms | Code | 0
Page 106 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified