SOTAVerified

Benchmarking

Papers

Showing 1426–1450 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding | | 0 |
| A Survey of Small Language Models | | 0 |
| OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery | | 0 |
| FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs | | 0 |
| An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0 |
| CoqPilot, a plugin for LLM-based generation of proofs | Code | 2 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1 |
| Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Code | 2 |
| Conditional diffusions for amortized neural posterior estimation | Code | 0 |
| Benchmarking Graph Learning for Drug-Drug Interaction Prediction | | 0 |
| From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | | 0 |
| Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Code | 0 |
| Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances | Code | 3 |
| Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Code | 0 |
| Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling | | 0 |
| FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage | | 0 |
| Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0 |
| VoiceBench: Benchmarking LLM-Based Voice Assistants | Code | 3 |
| Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies | | 0 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1 |
| ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0 |
| Safe Load Balancing in Software-Defined-Networking | | 0 |
| Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | | 0 |
| Building Conformal Prediction Intervals with Approximate Message Passing | Code | 0 |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |