| Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression | Apr 10, 2025 | Math, MMLU | Code Available | 1 |
| MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models | Apr 8, 2025 | Math, Multimodal Reasoning | Code Available | 1 |
| Large (Vision) Language Models are Unsupervised In-Context Learners | Apr 3, 2025 | GSM8K, In-Context Learning | Code Available | 1 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Apr 2, 2025 | 3D Reconstruction, Benchmarking | Code Available | 1 |
| Entropy-Based Adaptive Weighting for Self-Training | Mar 31, 2025 | GSM8K, Math | Code Available | 1 |
| QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? | Mar 28, 2025 | Logical Reasoning, Math | Code Available | 1 |
| ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models | Mar 27, 2025 | Math | Code Available | 1 |
| LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation | Mar 25, 2025 | Code Completion, Language Modeling | Code Available | 1 |
| EXAONE Deep: Reasoning Enhanced Language Models | Mar 16, 2025 | Math | Code Available | 1 |
| VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search | Mar 13, 2025 | Image Retrieval, Math | Code Available | 1 |