SOTAVerified

Benchmarking

Papers

Showing 1001–1050 of 5548 papers

Title | Status | Hype
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Code | 1
Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | Code | 1
Benchmarking Robustness of 3D Object Detection to Common Corruptions | Code | 1
A Comparison of Image Denoising Methods | Code | 1
Formalizing Multimedia Recommendation through Multimodal Deep Learning | Code | 1
Continual Learning with Foundation Models: An Empirical Study of Latent Replay | Code | 1
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks | Code | 1
AI Agents That Matter | Code | 1
Earnings-22: A Practical Benchmark for Accents in the Wild | Code | 1
FNBench: Benchmarking Robust Federated Learning against Noisy Labels | Code | 1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Code | 1
Benchmarking Reinforcement Learning Techniques for Autonomous Navigation | Code | 1
EBES: Easy Benchmarking for Event Sequences | Code | 1
AI Accelerator Survey and Trends | Code | 1
FM-TS: Flow Matching for Time Series Generation | Code | 1
FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding | Code | 1
EDFace-Celeb-1M: Benchmarking Face Hallucination with a Million-scale Dataset | Code | 1
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | Code | 1
Flames: Benchmarking Value Alignment of LLMs in Chinese | Code | 1
Benchmarking Quantized Neural Networks on FPGAs with FINN | Code | 1
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining | Code | 1
AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Code | 1
FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | Code | 1
ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis | Code | 1
Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Code | 1
A skeletonization algorithm for gradient-based optimization | Code | 1
Benchmarking Visual Localization for Autonomous Navigation | Code | 1
FiFAR: A Fraud Detection Dataset for Learning to Defer | Code | 1
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Code | 1
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Code | 1
A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization | Code | 1
FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding | Code | 1
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Code | 1
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Code | 1
FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1
FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Code | 1
FineSurE: Fine-grained Summarization Evaluation using LLMs | Code | 1
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Code | 1
A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Code | 1
A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Code | 1
A global analysis of metrics used for measuring performance in natural language processing | Code | 1
Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces | Code | 1
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Code | 1
Benchmarking: Past, Present and Future | Code | 1
FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks | Code | 1
A Comparative Attention Framework for Better Few-Shot Object Detection on Aerial Images | Code | 1
ArtFID: Quantitative Evaluation of Neural Style Transfer | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified