| Title | Date | Tags | Code | |
| --- | --- | --- | --- | --- |
| SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | Oct 29, 2023 | GPU, Mixture-of-Experts | Code Available | 1 |
| Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | Oct 29, 2023 | GPU, Quantization | Code Available | 2 |
| The Synergy of Speculative Decoding and Batching in Serving Large Language Models | Oct 28, 2023 | GPU, Text Generation | Unverified | 0 |
| Punica: Multi-Tenant LoRA Serving | Oct 28, 2023 | GPU | Code Available | 3 |
| OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Oct 27, 2023 | Benchmarking, GPU | Code Available | 0 |
| FP8-LM: Training FP8 Large Language Models | Oct 27, 2023 | GPU | Code Available | 2 |
| LLMSTEP: LLM proofstep suggestions in Lean | Oct 27, 2023 | CPU, GPU | Code Available | 1 |
| Real-Time Neural Materials using Block-Compressed Features | Oct 26, 2023 | Decoder, GPU | Unverified | 0 |
| PockEngine: Sparse and Efficient Fine-tuning in a Pocket | Oct 26, 2023 | CPU, GPU | Unverified | 0 |
| TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs | Oct 25, 2023 | Autonomous Driving, GPU | Code Available | 3 |