SOTAVerified

Benchmarking

Papers

Showing 1701-1750 of 5548 papers

Title | Status | Hype
Transformers in Protein: A Survey |  | 0
TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs |  | 0
PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology |  | 0
Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs | Code | 0
Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | Code | 0
FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets |  | 0
AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare | Code | 0
Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages |  | 0
Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights | Code | 0
EuroCon: Benchmarking Parliament Deliberation for Political Consensus Finding |  | 0
Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks | Code | 0
A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking |  | 0
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs |  | 0
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems |  | 0
Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat |  | 0
EnvSDD: Benchmarking Environmental Sound Deepfake Detection |  | 0
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research |  | 0
SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs |  | 0
Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding |  | 0
AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science |  | 0
Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking |  | 0
Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments |  | 0
SPDEBench: An Extensive Benchmark for Learning Regular and Singular Stochastic PDEs | Code | 0
Benchmarking and Rethinking Knowledge Editing for Large Language Models | Code | 0
Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs |  | 0
So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection |  | 0
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset | Code | 0
Benchmarking Poisoning Attacks against Retrieval-Augmented Generation |  | 0
From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation |  | 0
LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | Code | 0
SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models |  | 0
A Position Paper on the Automatic Generation of Machine Learning Leaderboards | Code | 0
SEvoBench : A C++ Framework For Evolutionary Single-Objective Optimization Benchmarking |  | 0
Wildfire spread forecasting with Deep Learning | Code | 0
PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language |  | 0
Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts |  | 0
SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification | Code | 0
Benchmark for Antibody Binding Affinity Maturation and Design |  | 0
MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation |  | 0
U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding |  | 0
3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation | Code | 0
Is Single-View Mesh Reconstruction Ready for Robotics? |  | 0
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Code | 0
PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints |  | 0
Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 |  | 0
Experimental robustness benchmark of quantum neural network on a superconducting quantum processor |  | 0
Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models |  | 0
Edge-First Language Model Inference: Models, Metrics, and Tradeoffs |  | 0
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models |  | 0
Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified