SOTAVerified

Benchmarking

Papers

Showing 1901-1950 of 5548 papers

Title | Status | Hype
Benchmarking GNNs Using Lightning Network Data |  | 0
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano |  | 0
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Code | 1
Towards Stable 3D Object Detection |  | 0
SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry | Code | 2
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation |  | 0
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Code | 2
Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Code | 1
Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms |  | 0
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias | Code | 0
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Code | 1
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Code | 1
TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations |  | 0
Open foundation models for Azerbaijani language |  | 0
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks | Code | 0
Occlusion-Aware Seamless Segmentation | Code | 1
Modified CMA-ES Algorithm for Multi-Modal Optimization: Incorporating Niching Strategies and Dynamic Adaptation Mechanism |  | 0
MIRAI: Evaluating LLM Agents for Event Forecasting |  | 0
Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy |  | 0
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation | Code | 3
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents | Code | 1
ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions |  | 0
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Code | 2
EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting |  | 0
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Code | 2
FineSurE: Fine-grained Summarization Evaluation using LLMs | Code | 1
Reinvestigating the R2 Indicator: Achieving Pareto Compliance by Integration | Code | 0
Benchmarking Predictive Coding Networks -- Made Simple | Code | 2
AI Agents That Matter | Code | 1
Overcoming Common Flaws in the Evaluation of Selective Classification Systems | Code | 1
Commute Graph Neural Networks |  | 0
GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing |  | 0
PerSEval: Assessing Personalization in Text Summarizers |  | 0
GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Code | 1
iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities | Code | 1
Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges |  | 0
Benchmarking M6 Competitors: An Analysis of Financial Metrics and Discussion of Incentives |  | 0
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models | Code | 2
Quantum-tunnelling deep neural network for optical illusion recognition |  | 0
Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI |  | 0
XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis |  | 0
GenRL: Multimodal-foundation world models for generalization in embodied agents | Code | 2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems |  | 0
Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making |  | 0
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Code | 1
SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified