SOTAVerified

Benchmarking

Papers

Showing 1401-1450 of 5548 papers

Title | Status | Hype
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
PC-Gym: Benchmark Environments For Process Control Problems | Code | 2
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models | | 0
SS3DM: Benchmarking Street-View Surface Reconstruction with a Synthetic 3D Mesh Dataset | | 0
AI Cyber Risk Benchmark: Automated Exploitation Capabilities | | 0
Benchmarking LLM Guardrails in Handling Multilingual Toxicity | | 0
Benchmarking Human and Automated Prompting in the Segment Anything Model | Code | 0
Exploring Capabilities of Time Series Foundation Models in Building Analytics | | 0
Project MPG: towards a generalized performance benchmark for LLM capabilities | | 0
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Code | 1
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning | Code | 2
NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates | Code | 0
LLM-initialized Differentiable Causal Discovery | | 0
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training | | 0
CODES: Benchmarking Coupled ODE Surrogates | Code | 0
BongLLaMA: LLaMA for Bangla Language | | 0
Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce | | 0
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? | Code | 0
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants | Code | 0
Sequential Large Language Model-Based Hyper-parameter Optimization | Code | 0
SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance | Code | 1
Multi-input Multi-output Loewner Framework for Vibration-based Damage Detection on a Trainer Jet | | 0
AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels | Code | 0
SFTrack: A Robust Scale and Motion Adaptive Algorithm for Tracking Small and Fast Moving Objects | | 0
OGBench: Benchmarking Offline Goal-Conditioned RL | Code | 3
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding | | 0
A Survey of Small Language Models | | 0
OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery | | 0
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs | | 0
An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Code | 2
Conditional diffusions for amortized neural posterior estimation | Code | 0
Benchmarking Graph Learning for Drug-Drug Interaction Prediction | | 0
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | | 0
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Code | 0
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances | Code | 3
Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Code | 0
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling | | 0
FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage | | 0
Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0
VoiceBench: Benchmarking LLM-Based Voice Assistants | Code | 3
Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies | | 0
Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0
Safe Load Balancing in Software-Defined-Networking | | 0
Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | | 0
Building Conformal Prediction Intervals with Approximate Message Passing | Code | 0
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2
Page 29 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified