SOTAVerified

Benchmarking

Papers

Showing 1451–1500 of 5548 papers

Title | Status | Hype
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Code | 1
OpenDataVal: a Unified Benchmark for Data Valuation | Code | 1
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Code | 1
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets | Code | 1
Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Code | 1
BiBench: Benchmarking and Analyzing Network Binarization | Code | 1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Code | 1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1
DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models | Code | 1
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Code | 1
D2S: Document-to-Slide Generation Via Query-Based Text Summarization | Code | 1
Optimizing Performance of Federated Person Re-identification: Benchmarking and Analysis | Code | 1
OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication | Code | 1
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification | Code | 1
Data-Driven Denoising of Stationary Accelerometer Signals | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1
Benchmarking Graph Neural Networks on Dynamic Link Prediction | Code | 1
Benchmarking Graph Neural Networks for FMRI analysis | Code | 1
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Code | 1
BiCo-Net: Regress Globally, Match Locally for Robust 6D Pose Estimation | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Code | 1
Performance Evaluation of Deep Transfer Learning on Multiclass Identification of Common Weed Species in Cotton Production Systems | Code | 1
PGDQN: Preference-Guided Deep Q-Network | Code | 1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
Beyond neural scaling laws: beating power law scaling via data pruning | Code | 1
Beyond Normal: On the Evaluation of Mutual Information Estimators | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Code | 1
PLANTAIN: Diffusion-inspired Pose Score Minimization for Fast and Accurate Molecular Docking | Code | 1
Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering | Code | 1
ECRECer: Enzyme Commission Number Recommendation and Benchmarking based on Multiagent Dual-core Learning | Code | 1
Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy | Code | 1
RADIATE: A Radar Dataset for Automotive Perception in Bad Weather | Code | 1
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Positional Encoding in Transformer-Based Time Series Models: A Survey | Code | 1
PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems | Code | 1
Benchmarking Graph Learning for Drug-Drug Interaction Prediction |  | 0
A practical generalization metric for deep networks benchmarking |  | 0
AERF: Adaptive ensemble random fuzzy algorithm for anomaly detection in cloud computing |  | 0
Page 30 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified