| Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking | Feb 14, 2024 | Benchmarking, Language Modelling | Code Available | 1 |
| Explainable Global Wildfire Prediction Models using Graph Neural Networks | Feb 11, 2024 | Benchmarking, Community Detection | Code Available | 1 |
| Retrieve, Merge, Predict: Augmenting Tables with Data Lakes | Feb 9, 2024 | AutoML, Benchmarking | Code Available | 1 |
| Improved off-policy training of diffusion samplers | Feb 7, 2024 | Benchmarking | Code Available | 1 |
| JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching | Feb 5, 2024 | Benchmarking, Sentence | Code Available | 1 |
| GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning | Feb 3, 2024 | Benchmarking, DeepFake Detection | Code Available | 1 |
| Benchmarking Transferable Adversarial Attacks | Feb 1, 2024 | Adversarial Attack, Benchmarking | Code Available | 1 |
| We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline | Feb 1, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |
| Explainable Benchmarking for Iterative Optimization Heuristics | Jan 31, 2024 | Benchmarking, Evolutionary Algorithms | Code Available | 1 |
| Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Jan 30, 2024 | Benchmarking, image-classification | Code Available | 1 |
| Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Jan 29, 2024 | Benchmarking, Machine Translation | Code Available | 1 |
| Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Jan 24, 2024 | Benchmarking | Code Available | 1 |
| SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | Jan 24, 2024 | Benchmarking, Image Captioning | Code Available | 1 |
| Benchmarking Large Multimodal Models against Common Corruptions | Jan 22, 2024 | Benchmarking, Image to text | Code Available | 1 |
| CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Jan 21, 2024 | Benchmarking | Code Available | 1 |
| RSUD20K: A Dataset for Road Scene Understanding In Autonomous Driving | Jan 14, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Jan 6, 2024 | Autonomous Vehicles, Benchmarking | Code Available | 1 |
| German Text Embedding Clustering Benchmark | Jan 5, 2024 | Benchmarking, Clustering | Code Available | 1 |
| FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models | Jan 1, 2024 | Benchmarking | Code Available | 1 |
| Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Jan 1, 2024 | Benchmarking, Instruction Following | Code Available | 1 |
| Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Dec 29, 2023 | Anatomy, Benchmarking | Code Available | 1 |
| APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Dec 25, 2023 | Animal Pose Estimation, Benchmarking | Code Available | 1 |
| Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models | Dec 21, 2023 | Benchmarking | Code Available | 1 |
| RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation | Dec 21, 2023 | Benchmarking, Product Recommendation | Code Available | 1 |
| FiFAR: A Fraud Detection Dataset for Learning to Defer | Dec 20, 2023 | Benchmarking, Decision Making | Code Available | 1 |
| TAO-Amodal: A Benchmark for Tracking Any Object Amodally | Dec 19, 2023 | Amodal Tracking, Autonomous Driving | Code Available | 1 |
| How to Train Neural Field Representations: A Comprehensive Study and Benchmark | Dec 16, 2023 | Benchmarking | Code Available | 1 |
| Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models | Dec 15, 2023 | Benchmarking, Code Summarization | Code Available | 1 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly Detection, Autonomous Driving | Code Available | 1 |
| EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning | Dec 11, 2023 | Benchmarking, Human-Object Interaction Detection | Code Available | 1 |
| Benchmarking Distribution Shift in Tabular Data with TableShift | Dec 10, 2023 | Benchmarking, Binary Classification | Code Available | 1 |
| STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers | Dec 9, 2023 | Anatomy, AutoML | Code Available | 1 |
| Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images | Dec 8, 2023 | Benchmarking, Object | Code Available | 1 |
| Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym | Dec 6, 2023 | Benchmarking, Decision Making | Code Available | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 |
| BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks | Dec 5, 2023 | Benchmarking, Minecraft | Code Available | 1 |
| Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions | Dec 5, 2023 | Benchmarking, Conversational Question Answering | Code Available | 1 |
| Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Nov 30, 2023 | Benchmarking, OpenAI Gym | Code Available | 1 |
| Enhancing Ligand Pose Sampling for Molecular Docking | Nov 30, 2023 | Benchmarking, Molecular Docking | Code Available | 1 |
| Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation | Nov 30, 2023 | Benchmarking, counterfactual | Code Available | 1 |
| Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Nov 29, 2023 | Benchmarking | Code Available | 1 |
| UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation | Nov 26, 2023 | Benchmarking, Hallucination | Code Available | 1 |
| Benchmarking Robustness of Text-Image Composed Retrieval | Nov 24, 2023 | Attribute, Benchmarking | Code Available | 1 |
| IMGTB: A Framework for Machine-Generated Text Detection Benchmarking | Nov 21, 2023 | Benchmarking, Text Detection | Code Available | 1 |
| BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Nov 21, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Towards a more inductive world for drug repurposing approaches | Nov 21, 2023 | Benchmarking, Prediction | Code Available | 1 |
| LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly Detector | Nov 20, 2023 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Nov 20, 2023 | Benchmarking, image-classification | Code Available | 1 |
| TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Nov 16, 2023 | Benchmarking, Event Extraction | Code Available | 1 |
| Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Nov 15, 2023 | Benchmarking, Instruction Following | Code Available | 1 |