SOTAVerified

Benchmarking

Papers

Showing 601-650 of 5548 papers

Title | Status | Hype
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers | Code | 1
Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering | Code | 1
DependEval: Benchmarking LLMs for Repository Dependency Understanding | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks | Code | 1
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
Active-Passive SimStereo -- Benchmarking the Cross-Generalization Capabilities of Deep Learning-based Stereo Methods | Code | 1
Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
MatTools: Benchmarking Large Language Models for Materials Science Tools | Code | 1
Deluca -- A Differentiable Control Library: Environments, Methods, and Benchmarking | Code | 1
RobFR: Benchmarking Adversarial Robustness on Face Recognition | Code | 1
DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Code | 1
Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
Deep Learning-Based Synchronization for Uplink NB-IoT | Code | 1
Deep learning model solves change point detection for multiple change types | Code | 1
ALTO: A Large-Scale Dataset for UAV Visual Place Recognition and Localization | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
Benchmarking Language Models for Code Syntax Understanding | Code | 1
A Critical Assessment of State-of-the-Art in Entity Alignment | Code | 1
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Code | 1
DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
Benchmarking LLMs' Swarm intelligence | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets | Code | 1
Decoding the Underlying Meaning of Multimodal Hateful Memes | Code | 1
DFGC 2021: A DeepFake Game Competition | Code | 1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1
Data-Driven Denoising of Stationary Accelerometer Signals | Code | 1
D2S: Document-to-Slide Generation Via Query-Based Text Summarization | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
Benchmarking Graph Neural Networks on Dynamic Link Prediction | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
Page 13 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified