SOTAVerified

Benchmarking

Papers

Showing 3801–3850 of 5548 papers

Title | Status | Hype
Share, Collaborate, Benchmark: Advancing Travel Demand Research through rigorous open-source collaboration | - | 0
Reference Matters: Benchmarking Factual Error Correction for Dialogue Summarization with Fine-grained Evaluation Framework | Code | 0
FedSecurity: Benchmarking Attacks and Defenses in Federated Learning and Federated LLMs | Code | 0
DynamoRep: Trajectory-Based Population Dynamics for Classification of Black-box Optimization Problems | Code | 0
FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems | - | 0
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models | Code | 0
RD-Suite: A Benchmark for Ranking Distillation | - | 0
Self-Adjusting Weighted Expected Improvement for Bayesian Optimization | Code | 0
Benchmarking Foundation Models with Language-Model-as-an-Examiner | - | 0
ICON^2: Reliably Benchmarking Predictive Inequity in Object Detection | - | 0
Knowing-how & Knowing-that: A New Task for Machine Comprehension of User Manuals | Code | 0
Improved statistical benchmarking of digital pathology models using pairwise frames evaluation | - | 0
Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities | - | 0
Applying Standards to Advance Upstream & Downstream Ethics in Large Language Models | - | 0
Explainable AI using expressive Boolean formulas | - | 0
Financial Numeric Extreme Labelling: A Dataset and Benchmarking for XBRL Tagging | - | 0
Benchmarking Middle-Trained Language Models for Neural Search | - | 0
N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition | - | 0
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning | - | 0
EfficientSRFace: An Efficient Network with Super-Resolution Enhancement for Accurate Face Detection | - | 0
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models | - | 0
ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation | - | 0
Break a Lag: Triple Exponential Moving Average for Enhanced Optimization | - | 0
Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study | - | 0
The Brain Tumor Segmentation (BraTS-METS) Challenge 2023: Brain Metastasis Segmentation on Pre-treatment MRI | - | 0
Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment | Code | 0
Speech Self-Supervised Representation Benchmarking: Are We Doing it Right? | Code | 0
HySpecNet-11k: A Large-Scale Hyperspectral Dataset for Benchmarking Learning-Based Hyperspectral Image Compression Methods | - | 0
The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects | - | 0
Dynamic Neighborhood Construction for Structured Large Discrete Action Spaces | Code | 0
ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning | Code | 0
ShuffleMix: Improving Representations via Channel-Wise Shuffle of Interpolated Hidden States | Code | 0
Design and implementation of intelligent packet filtering in IoT microcontroller-based devices | Code | 0
Large-scale Ridesharing DARP Instances Based on Real Travel Demand | Code | 0
Human Body Shape Classification Based on a Single Image | - | 0
InDL: A New Dataset and Benchmark for In-Diagram Logic Interpretation based on Visual Illusion | Code | 0
Exploring the Practicality of Generative Retrieval on Dynamic Corpora | - | 0
BASED: Benchmarking, Analysis, and Structural Estimation of Deblurring | Code | 0
Benchmarking Diverse-Modal Entity Linking with Generative Models | - | 0
Learning from Integral Losses in Physics Informed Neural Networks | Code | 0
Benchmarking state-of-the-art gradient boosting algorithms for classification | - | 0
CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset | Code | 0
Investigation of UAV Detection in Images with Complex Backgrounds and Rainy Artifacts | Code | 0
Analysis of modular CMA-ES on strict box-constrained problems in the SBOX-COST benchmarking suite | - | 0
GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | Code | 0
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer | - | 0
LAraBench: Benchmarking Arabic AI with Large Language Models | - | 0
Barkour: Benchmarking Animal-level Agility with Quadruped Robots | - | 0
R2H: Building Multimodal Navigation Helpers that Respond to Help Requests | - | 0
When the Music Stops: Tip-of-the-Tongue Retrieval for Music | Code | 0
Page 77 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified