SOTAVerified

Benchmarking

Papers

Showing 2951-3000 of 5548 papers

Title | Status | Hype
Experimental Benchmarking of Energy-saving Sub-Optimal Sliding Mode Control | - | 0
NativQA: Multilingual Culturally-Aligned Natural Query for LLMs | - | 0
Automated detection of gibbon calls from passive acoustic monitoring data using convolutional neural networks in the "torch for R" ecosystem | - | 0
Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment | Code | 0
Evaluating Nuanced Bias in Large Language Model Free Response Answers | - | 0
A Comprehensive Survey on Retrieval Methods in Recommender Systems | - | 0
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models | - | 0
How Aligned are Different Alignment Metrics? | - | 0
HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction | Code | 0
Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability | - | 0
SPINEX-Clustering: Similarity-based Predictions with Explainable Neighbors Exploration for Clustering Problems | - | 0
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation | - | 0
Simulation-based Benchmarking for Causal Structure Learning in Gene Perturbation Experiments | Code | 0
TARGO: Benchmarking Target-driven Object Grasping under Occlusions | - | 0
MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | - | 0
A Benchmark for Multi-speaker Anonymization | - | 0
Rethinking the Effectiveness of Graph Classification Datasets in Benchmarks for Assessing GNNs | Code | 0
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano | - | 0
Benchmarking GNNs Using Lightning Network Data | - | 0
Towards Stable 3D Object Detection | - | 0
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | - | 0
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias | Code | 0
Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms | - | 0
TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations | - | 0
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks | Code | 0
Open foundation models for Azerbaijani language | - | 0
ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions | - | 0
EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting | - | 0
Reinvestigating the R2 Indicator: Achieving Pareto Compliance by Integration | Code | 0
Modified CMA-ES Algorithm for Multi-Modal Optimization: Incorporating Niching Strategies and Dynamic Adaptation Mechanism | - | 0
MIRAI: Evaluating LLM Agents for Event Forecasting | - | 0
Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy | - | 0
GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing | - | 0
Commute Graph Neural Networks | - | 0
PerSEval: Assessing Personalization in Text Summarizers | - | 0
Benchmarking M6 Competitors: An Analysis of Financial Metrics and Discussion of Incentives | - | 0
Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges | - | 0
Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI | - | 0
Quantum-tunnelling deep neural network for optical illusion recognition | - | 0
XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis | - | 0
Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making | - | 0
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems | - | 0
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models | - | 0
Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language | - | 0
Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation | Code | 0
NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods | - | 0
Towards Efficient and Scalable Training of Differentially Private Deep Learning | Code | 0
A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems | Code | 0
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified