| Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning | Oct 4, 2024 | Benchmarking, Uncertainty Quantification | Unverified | 0 |
| AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Oct 4, 2024 | Benchmarking | Code Available | 2 |
| Towards a Benchmark for Large Language Models for Business Process Management Tasks | Oct 4, 2024 | Benchmarking, Management | Code Available | 0 |
| EBES: Easy Benchmarking for Event Sequences | Oct 4, 2024 | Benchmarking | Code Available | 1 |
| Repurposing Foundation Model for Generalizable Medical Time Series Classification | Oct 3, 2024 | Benchmarking, Diagnostic | Unverified | 0 |
| DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Oct 3, 2024 | Benchmarking, Imitation Learning | Code Available | 1 |
| Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning | Oct 3, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services | Oct 3, 2024 | Benchmarking, GPU | Code Available | 1 |
| MANTRA: The Manifold Triangulations Assemblage | Oct 3, 2024 | Benchmarking | Code Available | 0 |
| IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models | Oct 3, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | Oct 3, 2024 | Autonomous Driving, Backdoor Attack | Code Available | 3 |
| A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning | Oct 2, 2024 | Benchmarking, Denoising | Unverified | 0 |
| MONICA: Benchmarking on Long-tailed Medical Image Classification | Oct 2, 2024 | Benchmarking, Classification | Code Available | 1 |
| Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description | Oct 2, 2024 | Benchmarking, Facial expression generation | Unverified | 0 |
| CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations | Oct 2, 2024 | Benchmarking, Long Form Question Answering | Unverified | 0 |
| OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models | Oct 2, 2024 | Benchmarking | Code Available | 3 |
| StringLLM: Understanding the String Processing Capability of Large Language Models | Oct 2, 2024 | Benchmarking | Code Available | 1 |
| Deep learning for action spotting in association football videos | Oct 2, 2024 | Action Spotting, Benchmarking | Unverified | 0 |
| Deep Unlearn: Benchmarking Machine Unlearning | Oct 2, 2024 | Benchmarking, Machine Unlearning | Unverified | 0 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Oct 2, 2024 | Benchmarking, Instruction Following | Code Available | 1 |
| The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs | Oct 2, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| shapiq: Shapley Interactions for Machine Learning | Oct 2, 2024 | Benchmarking, Data Valuation | Code Available | 4 |
| ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving | Oct 2, 2024 | Benchmarking, Document Summarization | Unverified | 0 |
| Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents | Oct 1, 2024 | Benchmarking, Conversational Question Answering | Unverified | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | Oct 1, 2024 | Benchmarking, Fairness | Unverified | 0 |