SOTAVerified

Benchmarking

Papers

Showing 1451-1475 of 5548 papers

Title | Status | Hype
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Code | 1
OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions | Code | 1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Code | 1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
OPF-Learn: An Open-Source Framework for Creating Representative AC Optimal Power Flow Datasets | Code | 1
OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
Benchmarking Graph Neural Networks on Dynamic Link Prediction | Code | 1
Benchmarking Graph Neural Networks for FMRI analysis | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Code | 1
Data-Driven Denoising of Stationary Accelerometer Signals | Code | 1
Page 59 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified