SOTAVerified

Benchmarking

Papers

Showing 451–500 of 5548 papers

Title | Status | Hype
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1
AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
An Exploration of Embodied Visual Exploration | Code | 1
AnomalyHop: An SSL-based Image Anomaly Localization Method | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1
CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
CCTV-Gun: Benchmarking Handgun Detection in CCTV Images | Code | 1
Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Code | 1
CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Accelerated and interpretable oblique random survival forests | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
Benchmarking Visual Localization for Autonomous Navigation | Code | 1
CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs | Code | 1
Chaos as an interpretable benchmark for forecasting and data-driven modelling | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Code | 1
AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring | Code | 1
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
Restore Anything Model via Efficient Degradation Adaptation | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1
A Platform for the Biomedical Application of Large Language Models | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk | Code | 1
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Code | 1
Page 10 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified