| Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Jan 29, 2024 | Benchmarking, Machine Translation | Code Available | 1 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 |
| PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models | Jan 28, 2024 | Benchmarking, Code Generation | Code Available | 0 |
| SAM-based instance segmentation models for the automation of structural damage detection | Jan 27, 2024 | Benchmarking, Instance Segmentation | Unverified | 0 |
| Benchmarking with MIMIC-IV, an irregular, sparse clinical time series dataset | Jan 27, 2024 | Benchmarking, Time Series | Unverified | 0 |
| MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Jan 27, 2024 | Benchmarking, RAG | Code Available | 3 |
| Biological Valuation Map of Flanders: A Sentinel-2 Imagery Analysis | Jan 26, 2024 | Benchmarking, Semantic Segmentation | Unverified | 0 |
| Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs | Jan 26, 2024 | Benchmarking, Knowledge Graphs | Unverified | 0 |
| Automated legal reasoning with discretion to act using s(LAW) | Jan 25, 2024 | Benchmarking, Legal Reasoning | Unverified | 0 |
| TriSAM: Tri-Plane SAM for zero-shot cortical blood vessel segmentation in VEM images | Jan 25, 2024 | Benchmarking, Segmentation | Unverified | 0 |