SOTAVerified

Benchmarking

Papers

Showing 1401-1450 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Code | 1 |
| Mukayese: Turkish NLP Strikes Back | Code | 1 |
| Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Code | 1 |
| Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory | Code | 1 |
| Benchmarking Image Retrieval for Visual Localization | Code | 1 |
| AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1 |
| ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1 |
| Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1 |
| Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1 |
| CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1 |
| CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1 |
| Multimodal LLMs Can Reason about Aesthetics in Zero-Shot | Code | 1 |
| MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery | Code | 1 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1 |
| CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1 |
| MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Code | 1 |
| Mutual-Information Based Few-Shot Classification | Code | 1 |
| NAS-Bench-101: Towards Reproducible Neural Architecture Search | Code | 1 |
| BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Code | 1 |
| NAS-Bench-Graph: Benchmarking Graph Neural Architecture Search | Code | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1 |
| NATS-Bench: Benchmarking NAS Algorithms for Architecture Topology and Size | Code | 1 |
| Autonomous Reinforcement Learning: Formalism and Benchmarking | Code | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1 |
| scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Code | 1 |
| CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1 |
| Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1 |
| Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Code | 1 |
| Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1 |
| D2S: Document-to-Slide Generation Via Query-Based Text Summarization | Code | 1 |
| Decoding the Underlying Meaning of Multimodal Hateful Memes | Code | 1 |
| A Critical Assessment of State-of-the-Art in Entity Alignment | Code | 1 |
| Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Code | 1 |
| NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications | Code | 1 |
| COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1 |
| NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Code | 1 |
| NLPBench: Evaluating Large Language Models on Solving NLP Problems | Code | 1 |
| Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1 |
| NTIRE 2020 Challenge on Real-World Image Super-Resolution: Methods and Results | Code | 1 |
| NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation | Code | 1 |
| AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1 |
| Object Shape Error Response Using Bayesian 3-D Convolutional Neural Networks for Assembly Systems With Compliant Parts | Code | 1 |
| CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1 |
| Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms | Code | 1 |
| APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Code | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1 |
| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1 |
| Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |