SOTAVerified

Benchmarking

Papers

Showing 651–700 of 5548 papers

Title | Status | Hype
DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Code | 1
DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects | Code | 1
Detecting beats in the photoplethysmogram: benchmarking open-source algorithms | Code | 1
DFGC 2021: A DeepFake Game Competition | Code | 1
DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
A Computed Tomography Vertebral Segmentation Dataset with Anatomical Variations and Multi-Vendor Scanner Data | Code | 1
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Code | 1
Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers | Code | 1
Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones | Code | 1
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? | Code | 1
Benchmarking Language Models for Code Syntax Understanding | Code | 1
AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Code | 1
Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Code | 1
RobFR: Benchmarking Adversarial Robustness on Face Recognition | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
Benchmarking Large Multimodal Models against Common Corruptions | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Code | 1
Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering | Code | 1
Deluca -- A Differentiable Control Library: Environments, Methods, and Benchmarking | Code | 1
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
Benchmarking Meta-embeddings: What Works and What Does Not | Code | 1
Benchmarking LLMs' Swarm intelligence | Code | 1
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
Deep learning model solves change point detection for multiple change types | Code | 1
Deep Learning-Based Synchronization for Uplink NB-IoT | Code | 1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Code | 1
Benchmarking Meaning Representations in Neural Semantic Parsing | Code | 1
DocuMint: Docstring Generation for Python using Small Language Models | Code | 1
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Code | 1
A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified