SOTAVerified

Benchmarking

Papers

Showing 501-550 of 5548 papers

Title | Status | Hype
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit | Code | 1
Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Code | 1
DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
Application-Oriented Benchmarking of Quantum Generative Learning Using QUARK | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
D2S: Document-to-Slide Generation Via Query-Based Text Summarization | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
An Empirical Study on Google Research Football Multi-agent Scenarios | Code | 1
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark | Code | 1
Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Code | 1
An Empirical Study of GPT-4o Image Generation Capabilities | Code | 1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Code | 1
Benchmarking Graph Neural Networks for FMRI analysis | Code | 1
Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders | Code | 1
A BFS-Tree of Ranking References for Unsupervised Manifold Learning | Code | 1
Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models? | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
Replication in Visual Diffusion Models: A Survey and Outlook | Code | 1
Benchmarking Graph Neural Networks on Dynamic Link Prediction | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified