SOTAVerified

Benchmarking

Papers

Showing 1151–1200 of 5548 papers

Title | Status | Hype
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
DependEval: Benchmarking LLMs for Repository Dependency Understanding | Code | 1
Benchmarking Quantized Neural Networks on FPGAs with FINN | Code | 1
DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime | Code | 1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
A Closer Look at Mortality Risk Prediction from Electrocardiograms | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
A global analysis of metrics used for measuring performance in natural language processing | Code | 1
A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Code | 1
EDFace-Celeb-1M: Benchmarking Face Hallucination with a Million-scale Dataset | Code | 1
A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Code | 1
Benchmarking Multidomain English-Indonesian Machine Translation | Code | 1
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Code | 1
Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
EH-DNAS: End-to-End Hardware-aware Differentiable Neural Architecture Search | Code | 1
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1
A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization | Code | 1
Benchmarking Reinforcement Learning Techniques for Autonomous Navigation | Code | 1
Recent Advances on Neural Network Pruning at Initialization | Code | 1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Code | 1
End-to-end Emotion-Cause Pair Extraction via Learning to Link | Code | 1
Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit | Code | 1
Benchmarking Visual Localization for Autonomous Navigation | Code | 1
A skeletonization algorithm for gradient-based optimization | Code | 1
Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
Benchmarking Omni-Vision Representation through the Lens of Visual Realms | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Code | 1
Evaluating Adversarial Attacks on ImageNet: A Reality Check on Misclassification Classes | Code | 1
Evaluating Attribution for Graph Neural Networks | Code | 1
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
Benchmarking Neural Network Generalization for Grammar Induction | Code | 1
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
EventEA: Benchmarking Entity Alignment for Event-centric Knowledge Graphs | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Page 24 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified