SOTAVerified

Benchmarking

Papers

Showing 2301-2350 of 5548 papers

Title | Status | Hype
From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation | Code | 0
Towards Sim-to-Real Industrial Parts Classification with Synthetic Dataset | Code | 1
Practical Guidelines for Cell Segmentation Models Under Optical Aberrations in Microscopy | | 0
Exploring the Decentraland Economy: Multifaceted Parcel Attributes, Key Insights, and Benchmarking | | 0
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Code | 7
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Code | 0
Certifying almost all quantum states with few single-qubit measurements | | 0
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models | | 0
Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model | Code | 1
Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS | Code | 0
From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution | | 0
WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs | | 0
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1
EFSA: Towards Event-Level Financial Sentiment Analysis | Code | 0
MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | | 0
HOEG: A New Approach for Object-Centric Predictive Process Monitoring | Code | 0
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level | Code | 0
A Comparison of Cryptocurrency Volatility-benchmarking New and Mature Asset Classes | | 0
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Code | 0
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics | Code | 0
SDFR: Synthetic Data for Face Recognition Competition | | 0
Multicalibration for Confidence Scoring in LLMs | | 0
Enhancing Video Summarization with Context Awareness | Code | 0
Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation | Code | 0
GNNBENCH: Fair and Productive Benchmarking for Single-GPU GNN System | | 0
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) | Code | 0
Dynamic Risk Assessment Methodology with an LDM-based System for Parking Scenarios | | 0
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | Code | 2
Outlier-Efficient Hopfield Layers for Large Transformer-Based Models | Code | 1
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model | Code | 1
Benchmarking ChatGPT on Algorithmic Reasoning | Code | 0
Benchmarking Parameter Control Methods in Differential Evolution for Mixed-Integer Black-Box Optimization | Code | 0
Schroedinger's Threshold: When the AUC doesn't predict Accuracy | Code | 0
A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking The Privacy-Utility Trade-off | Code | 0
DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior | | 0
NL2KQL: From Natural Language to Kusto Query | | 0
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
Atom-Level Optical Chemical Structure Recognition with Limited Supervision | Code | 1
On the reduction of Linear Parameter-Varying State-Space models | | 0
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics | Code | 0
PREGO: online mistake detection in PRocedural EGOcentric videos | Code | 1
Advancing LLM Reasoning Generalists with Preference Trees | Code | 3
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Code | 2
Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach | | 0
Diffusion-Driven Domain Adaptation for Generating 3D Molecules | | 0
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations | | 0
Are large language models superhuman chemists? | Code | 2
SpiralMLP: A Lightweight Vision MLP Architecture | | 0
Comparing Hyper-optimized Machine Learning Models for Predicting Efficiency Degradation in Organic Solar Cells | | 0
IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context | Code | 0
Page 47 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified