SOTAVerified

Benchmarking

Papers

Showing 1201-1250 of 5548 papers

Title | Status | Hype
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
Benchmarking Reinforcement Learning Techniques for Autonomous Navigation | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension | Code | 1
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms | Code | 1
FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things | Code | 1
FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks | Code | 1
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
Benchmarking the Generation of Fact Checking Explanations | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
FNBench: Benchmarking Robust Federated Learning against Noisy Labels | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Formalizing Multimedia Recommendation through Multimodal Deep Learning | Code | 1
A Reinforcement Learning Environment for Multi-Service UAV-enabled Wireless Systems | Code | 1
Continual Learning with Foundation Models: An Empirical Study of Latent Replay | Code | 1
Benchmarking Omni-Vision Representation through the Lens of Visual Realms | Code | 1
FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation | Code | 1
fseval: A Benchmarking Framework for Feature Selection and Feature Ranking Algorithms | Code | 1
FTNet: Feature Transverse Network for Thermal Image Semantic Segmentation | Code | 1
BARS-CTR: Open Benchmarking for Click-Through Rate Prediction | Code | 1
G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks | Code | 1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Benchmarking Language Models for Code Syntax Understanding | Code | 1
Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1
Benchmarking: Past, Present and Future | Code | 1
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Code | 1
Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Code | 1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1
Generative Evaluation of Complex Reasoning in Large Language Models | Code | 1
A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care | Code | 1
GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Page 25 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified