| SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning | Aug 10, 2024 | HallucinationOptical Character Recognition | CodeCode Available | 11 |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Dec 13, 2024 | Chart UnderstandingMixture-of-Experts | CodeCode Available | 9 |
| Nougat: Neural Optical Understanding for Academic Documents | Aug 25, 2023 | Optical Character RecognitionOptical Character Recognition (OCR) | CodeCode Available | 5 |
| Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | Aug 28, 2024 | Optical Character Recognition | CodeCode Available | 4 |
| OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Dec 31, 2024 | BenchmarkingLogical Reasoning | CodeCode Available | 4 |
| OCR-free Document Understanding Transformer | Nov 30, 2021 | Document Image Classificationdocument understanding | CodeCode Available | 3 |
| Playing Non-Embedded Card-Based Games with Reinforcement Learning | Apr 7, 2025 | Board GamesDecision Making | CodeCode Available | 3 |
| Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | May 23, 2024 | Automatic Speech RecognitionAutomatic Speech Recognition (ASR) | CodeCode Available | 2 |
| IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling | Jan 6, 2023 | Link PredictionOptical Character Recognition | CodeCode Available | 2 |
| MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories | Jun 5, 2025 | BenchmarkingOptical Character Recognition | CodeCode Available | 2 |