SOTAVerified

Benchmarking

Papers

Showing 1426-1450 of 5548 papers

Title | Status | Hype
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
D2S: Document-to-Slide Generation Via Query-Based Text Summarization | Code | 1
Decoding the Underlying Meaning of Multimodal Hateful Memes | Code | 1
A Critical Assessment of State-of-the-Art in Entity Alignment | Code | 1
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Code | 1
NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications | Code | 1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1
NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Code | 1
NLPBench: Evaluating Large Language Models on Solving NLP Problems | Code | 1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
NTIRE 2020 Challenge on Real-World Image Super-Resolution: Methods and Results | Code | 1
NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation | Code | 1
AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1
Object Shape Error Response Using Bayesian 3-D Convolutional Neural Networks for Assembly Systems With Compliant Parts | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms | Code | 1
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
Page 58 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified