SOTAVerified

Benchmarking

Papers

Showing 1251–1300 of 5548 papers

Title | Status | Hype
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Code | 1
4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs | Code | 1
GuacaMol: Benchmarking Models for De Novo Molecular Design | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate | Code | 1
A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges | Code | 1
HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
HINT3: Raising the bar for Intent Detection in the Wild | Code | 1
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
Benchmarking Quantized Neural Networks on FPGAs with FINN | Code | 1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1
How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
HUMAN4D: A Human-Centric Multimodal Dataset for Motions and Immersive Media | Code | 1
4D Panoptic LiDAR Segmentation | Code | 1
A framework for benchmarking clustering algorithms | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Hyperparameter optimization in deep multi-target prediction | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining | Code | 1
Image Colorization: A Survey and Dataset | Code | 1
Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Code | 1
A SWAT-based Reinforcement Learning Framework for Crop Management | Code | 1
AirSim Drone Racing Lab | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
Benchmarking Simulation-Based Inference | Code | 1
A Comprehensive Overview of Large Language Models | Code | 1
A framework for benchmarking class-out-of-distribution detection and its application to ImageNet | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model | Code | 1
A Systematic Benchmarking Analysis of Transfer Learning for Medical Image Analysis | Code | 1
Improving and Benchmarking Offline Reinforcement Learning Algorithms | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Geometric Deep Learning for Structure-Based Drug Design: A Survey | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
Page 26 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified