SOTAVerified

Benchmarking

Papers

Showing 2301–2325 of 5548 papers

Title | Status | Hype
From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation | Code | 0
Towards Sim-to-Real Industrial Parts Classification with Synthetic Dataset | Code | 1
Practical Guidelines for Cell Segmentation Models Under Optical Aberrations in Microscopy | - | 0
Exploring the Decentraland Economy: Multifaceted Parcel Attributes, Key Insights, and Benchmarking | - | 0
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Code | 7
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Code | 0
Certifying almost all quantum states with few single-qubit measurements | - | 0
Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model | Code | 1
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models | - | 0
Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS | Code | 0
From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution | - | 0
WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs | - | 0
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1
EFSA: Towards Event-Level Financial Sentiment Analysis | Code | 0
MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | - | 0
HOEG: A New Approach for Object-Centric Predictive Process Monitoring | Code | 0
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level | Code | 0
A Comparison of Cryptocurrency Volatility-benchmarking New and Mature Asset Classes | - | 0
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Code | 0
SDFR: Synthetic Data for Face Recognition Competition | - | 0
Multicalibration for Confidence Scoring in LLMs | - | 0
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics | Code | 0
Enhancing Video Summarization with Context Awareness | Code | 0
Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation | Code | 0
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified