SOTAVerified

Benchmarking

Papers

Showing 3226-3250 of 5548 papers

Title | Status | Hype
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level | Code | 0
HOEG: A New Approach for Object-Centric Predictive Process Monitoring | Code | 0
EFSA: Towards Event-Level Financial Sentiment Analysis | Code | 0
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Code | 0
A Comparison of Cryptocurrency Volatility-benchmarking New and Mature Asset Classes | | 0
Multicalibration for Confidence Scoring in LLMs | | 0
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics | Code | 0
SDFR: Synthetic Data for Face Recognition Competition | | 0
Enhancing Video Summarization with Context Awareness | Code | 0
GNNBENCH: Fair and Productive Benchmarking for Single-GPU GNN System | | 0
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) | Code | 0
Dynamic Risk Assessment Methodology with an LDM-based System for Parking Scenarios | | 0
Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation | Code | 0
Benchmarking ChatGPT on Algorithmic Reasoning | Code | 0
Schroedinger's Threshold: When the AUC doesn't predict Accuracy | Code | 0
Benchmarking Parameter Control Methods in Differential Evolution for Mixed-Integer Black-Box Optimization | Code | 0
DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior | | 0
A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking The Privacy-Utility Trade-off | Code | 0
NL2KQL: From Natural Language to Kusto Query | | 0
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics | Code | 0
On the reduction of Linear Parameter-Varying State-Space models | | 0
Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach | | 0
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations | | 0
Diffusion-Driven Domain Adaptation for Generating 3D Molecules | | 0
SpiralMLP: A Lightweight Vision MLP Architecture | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified