| GANmut: Generating and Modifying Facial Expressions | Jun 16, 2024 | Benchmarking, Diversity | Unverified | 0 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Jun 16, 2024 | Benchmarking, Instance Segmentation | Code Available | 0 |
| Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models | Jun 15, 2024 | Benchmarking, Data Augmentation | Code Available | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | Jun 15, 2024 | Benchmarking, HumanEval | Unverified | 0 |
| A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Jun 15, 2024 | Benchmarking, GPU | Code Available | 1 |
| Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework | Jun 14, 2024 | Benchmarking | Unverified | 0 |
| SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading | Jun 14, 2024 | Benchmarking, Mathematical Proofs | Code Available | 0 |
| ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures | Jun 14, 2024 | Answer Generation, Benchmarking | Code Available | 0 |
| VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Jun 14, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming | Jun 14, 2024 | Benchmarking, General Knowledge | Unverified | 0 |