| Coherent Feed Forward Quantum Neural Network | Feb 1, 2024 | Benchmarking, Diagnostic | Unverified | 0 |
| Benchmarking Transferable Adversarial Attacks | Feb 1, 2024 | Adversarial Attack, Benchmarking | Code Available | 1 |
| We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline | Feb 1, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |
| Benchmarking Sensitivity of Continual Graph Learning for Skeleton-Based Action Recognition | Jan 31, 2024 | Action Recognition, Benchmarking | Unverified | 0 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | Benchmarking, Multiple-choice | Code Available | 4 |
| Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Jan 31, 2024 | Benchmarking, Change Detection | Code Available | 0 |
| Explainable Benchmarking for Iterative Optimization Heuristics | Jan 31, 2024 | Benchmarking, Evolutionary Algorithms | Code Available | 1 |
| Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | Jan 30, 2024 | Benchmarking | Code Available | 2 |
| Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Jan 30, 2024 | Benchmarking, image-classification | Code Available | 1 |
| ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks | Jan 29, 2024 | Benchmarking, Cross-Lingual Transfer | Code Available | 0 |
| Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Jan 29, 2024 | Benchmarking, Machine Translation | Code Available | 1 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 |
| PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models | Jan 28, 2024 | Benchmarking, Code Generation | Code Available | 0 |
| MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Jan 27, 2024 | Benchmarking, RAG | Code Available | 3 |
| SAM-based instance segmentation models for the automation of structural damage detection | Jan 27, 2024 | Benchmarking, Instance Segmentation | Unverified | 0 |
| Benchmarking with MIMIC-IV, an irregular, spare clinical time series dataset | Jan 27, 2024 | Benchmarking, Time Series | Unverified | 0 |
| Biological Valuation Map of Flanders: A Sentinel-2 Imagery Analysis | Jan 26, 2024 | Benchmarking, Semantic Segmentation | Unverified | 0 |
| Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs | Jan 26, 2024 | Benchmarking, Knowledge Graphs | Unverified | 0 |
| Automated legal reasoning with discretion to act using s(LAW) | Jan 25, 2024 | Benchmarking, Legal Reasoning | Unverified | 0 |
| TriSAM: Tri-Plane SAM for zero-shot cortical blood vessel segmentation in VEM images | Jan 25, 2024 | Benchmarking, Segmentation | Unverified | 0 |
| Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Jan 24, 2024 | Benchmarking | Code Available | 1 |
| Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | Jan 24, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | Jan 24, 2024 | Benchmarking, Image Captioning | Code Available | 1 |
| Benchmarking the Fairness of Image Upsampling Methods | Jan 24, 2024 | Benchmarking, Diversity | Code Available | 0 |
| AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents | Jan 24, 2024 | Benchmarking | Code Available | 3 |
| What the Weight?! A Unified Framework for Zero-Shot Knowledge Composition | Jan 23, 2024 | Benchmarking | Code Available | 0 |
| LLpowershap: Logistic Loss-based Automated Shapley Values Feature Selection Method | Jan 23, 2024 | Benchmarking, Fairness | Code Available | 0 |
| Benchmarking LLMs via Uncertainty Quantification | Jan 23, 2024 | Benchmarking, Uncertainty Quantification | Code Available | 3 |
| Deep Neural Network Benchmarks for Selective Classification | Jan 23, 2024 | Benchmarking, Classification | Code Available | 0 |
| Subgroup analysis methods for time-to-event outcomes in heterogeneous randomized controlled trials | Jan 22, 2024 | Benchmarking, Synthetic Data Generation | Code Available | 0 |
| A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation | Jan 22, 2024 | Benchmarking, Diagnostic | Code Available | 3 |
| Benchmarking Large Multimodal Models against Common Corruptions | Jan 22, 2024 | Benchmarking, Image to text | Code Available | 1 |
| CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Jan 21, 2024 | Benchmarking | Code Available | 1 |
| Data-Driven Target Localization: Benchmarking Gradient Descent Using the Cramer-Rao Bound | Jan 20, 2024 | Benchmarking | Unverified | 0 |
| Data Augmentation for Traffic Classification | Jan 19, 2024 | Benchmarking, Classification | Unverified | 0 |
| R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Jan 18, 2024 | Benchmarking | Code Available | 2 |
| WAVES: Benchmarking the Robustness of Image Watermarks | Jan 16, 2024 | Benchmarking | Code Available | 2 |
| NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription | Jan 16, 2024 | Automatic Speech Recognition, Benchmarking | Unverified | 0 |
| Harnessing Orthogonality to Train Low-Rank Neural Networks | Jan 16, 2024 | Benchmarking | Code Available | 0 |
| Large Language Models are Null-Shot Learners | Jan 16, 2024 | Arithmetic Reasoning, Benchmarking | Unverified | 0 |
| TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding | Jan 16, 2024 | Action Recognition, Benchmarking | Unverified | 0 |
| OpenDPD: An Open-Source End-to-End Learning & Benchmarking Framework for Wideband Power Amplifier Modeling and Digital Pre-Distortion | Jan 16, 2024 | Benchmarking | Unverified | 0 |
| Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Jan 15, 2024 | Adversarial Robustness, Benchmarking | Code Available | 2 |
| RSUD20K: A Dataset for Road Scene Understanding In Autonomous Driving | Jan 14, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| A Reinforcement Learning Environment for Directed Quantum Circuit Synthesis | Jan 13, 2024 | Benchmarking, reinforcement-learning | Unverified | 0 |
| Lifelogging As An Extreme Form of Personal Information Management -- What Lessons To Learn | Jan 11, 2024 | Benchmarking, Form | Unverified | 0 |
| InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Jan 10, 2024 | Benchmarking | Code Available | 2 |
| Knowledge Sharing in Manufacturing using Large Language Models: User Evaluation and Model Benchmarking | Jan 10, 2024 | Benchmarking, Information Retrieval | Unverified | 0 |
| Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics | Jan 10, 2024 | Anomaly Segmentation, Autonomous Driving | Unverified | 0 |
| DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | Jan 9, 2024 | Benchmarking, Text Generation | Code Available | 7 |