SOTAVerified

Benchmarking

Papers

Showing 2976-3000 of 5548 papers

Title | Status | Hype
Open foundation models for Azerbaijani language |  | 0
ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions |  | 0
EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting |  | 0
Reinvestigating the R2 Indicator: Achieving Pareto Compliance by Integration | Code | 0
Modified CMA-ES Algorithm for Multi-Modal Optimization: Incorporating Niching Strategies and Dynamic Adaptation Mechanism |  | 0
MIRAI: Evaluating LLM Agents for Event Forecasting |  | 0
Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy |  | 0
GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing |  | 0
Commute Graph Neural Networks |  | 0
PerSEval: Assessing Personalization in Text Summarizers |  | 0
Benchmarking M6 Competitors: An Analysis of Financial Metrics and Discussion of Incentives |  | 0
Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges |  | 0
Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI |  | 0
Quantum-tunnelling deep neural network for optical illusion recognition |  | 0
XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis |  | 0
Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making |  | 0
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems |  | 0
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models |  | 0
Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language |  | 0
Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation | Code | 0
NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods |  | 0
Towards Efficient and Scalable Training of Differentially Private Deep Learning | Code | 0
A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems | Code | 0
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified