SOTAVerified

Benchmarking

Papers

Showing 851-900 of 5548 papers

Title | Status | Hype
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark | Code | 1
Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images | Code | 1
An Empirical Study on Google Research Football Multi-agent Scenarios | Code | 1
Image Matching across Wide Baselines: From Paper to Practice | Code | 1
ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches | Code | 1
4D Panoptic LiDAR Segmentation | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction | Code | 1
Benchmarking Batch Deep Reinforcement Learning Algorithms | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
Benchmarking of DL Libraries and Models on Mobile Devices | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics | Code | 1
A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges | Code | 1
AIPerf: Automated machine learning as an AI-HPC benchmark | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets | Code | 1
Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets | Code | 1
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
IOHprofiler: A Benchmarking and Profiling Tool for Iterative Optimization Heuristics | Code | 1
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1
A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | Code | 1
Benchmarking Classical and Learning-Based Multibeam Point Cloud Registration | Code | 1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching | Code | 1
An Exploration of Embodied Visual Exploration | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified