| RULER: What's the Real Context Size of Your Long-Context Language Models? | Apr 9, 2024 | Long-Context Understanding | Code Available | 9 | 5 |
| InternLM2 Technical Report | Mar 26, 2024 | 4k; Long-Context Understanding | Code Available | 9 | 5 |
| Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | Jun 9, 2023 | Chatbot; Language Modelling | Code Available | 7 | 5 |
| GLM-130B: An Open Bilingual Pre-trained Model | Oct 5, 2022 | Language Modeling; Language Modelling | Code Available | 6 | 5 |
| GPT-4 Technical Report | Mar 15, 2023 | answerability prediction; Arithmetic Reasoning | Code Available | 6 | 5 |
| Long-context LLMs Struggle with Long In-context Learning | Apr 2, 2024 | 2k; In-Context Learning | Code Available | 5 | 5 |
| Kimi-VL Technical Report | Apr 10, 2025 | Long-Context Understanding; Mathematical Reasoning | Code Available | 5 | 5 |
| CogVLM: Visual Expert for Pretrained Language Models | Nov 6, 2023 | 1 Image, 2*2 Stitching; FS-MEVQA | Code Available | 5 | 5 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Nov 7, 2023 | 1 Image, 2*2 Stitching; Decoder | Code Available | 4 | 5 |
| Gated Delta Networks: Improving Mamba2 with Delta Rule | Dec 9, 2024 | Common Sense Reasoning; Language Modeling | Code Available | 4 | 5 |
| M+: Extending MemoryLLM with Scalable Long-Term Memory | Feb 1, 2025 | 16k; GPU | Code Available | 3 | 5 |
| Retrieval Head Mechanistically Explains Long-Context Factuality | Apr 24, 2024 | Continual Pretraining; Hallucination | Code Available | 3 | 5 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Mar 8, 2024 | 1 Image, 2*2 Stitching; Code Generation | Code Available | 3 | 5 |
| LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding | Aug 28, 2023 | 16k; Code Completion | Code Available | 3 | 5 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Mar 18, 2024 | Long-Context Understanding; TextVQA | Code Available | 3 | 5 |
| Recurrent Context Compression: Efficiently Expanding the Context Window of LLM | Jun 10, 2024 | Long-Context Understanding; Question Answering | Code Available | 2 | 5 |
| What is Wrong with Perplexity for Long-context Language Modeling? | Oct 31, 2024 | Document Summarization; In-Context Learning | Code Available | 2 | 5 |
| LongProLIP: A Probabilistic Vision-Language Model with Long Context Text | Mar 11, 2025 | Language Modeling; Language Modelling | Code Available | 2 | 5 |
| Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks | Apr 9, 2024 | Answer Selection; Long-Context Understanding | Code Available | 2 | 5 |
| Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models | Jun 17, 2024 | Benchmarking | Code Available | 2 | 5 |
| FABLES: Evaluating faithfulness and content selection in book-length summarization | Apr 1, 2024 | Long-Context Understanding | Code Available | 2 | 5 |
| HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models | Sep 24, 2024 | Long-Context Understanding; Text Generation | Code Available | 2 | 5 |
| Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Jun 25, 2024 | Benchmarking; Long-Context Understanding | Code Available | 2 | 5 |
| MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression | Jun 21, 2024 | GPU; Language Modeling | Code Available | 2 | 5 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Feb 12, 2025 | Benchmarking; Long-Context Understanding | Code Available | 2 | 5 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | May 11, 2023 | 1 Image, 2*2 Stitching; Diversity | Code Available | 2 | 5 |
| GATEAU: Selecting Influential Samples for Long Context Alignment | Oct 21, 2024 | Instruction Following; Long-Context Understanding | Code Available | 1 | 5 |
| BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | Feb 11, 2025 | Code Generation; Instruction Following | Code Available | 1 | 5 |
| BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression | Oct 20, 2024 | In-Context Learning; Long-Context Understanding | Code Available | 1 | 5 |
| Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? | Jun 20, 2025 | Book summarization; Long-Context Understanding | Code Available | 1 | 5 |
| Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression | May 26, 2025 | Language Modeling; Language Modelling | Code Available | 1 | 5 |
| CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning | Mar 14, 2025 | Long-Context Understanding | Code Available | 1 | 5 |
| DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration | Jun 6, 2025 | Computational Efficiency; Language Modeling | Code Available | 1 | 5 |
| From Text to Pixel: Advancing Long-Context Understanding in MLLMs | May 23, 2024 | Language Modeling; Language Modelling | Code Available | 1 | 5 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 | 5 |
| Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs | Apr 16, 2024 | Long-Context Understanding; Token Reduction | Code Available | 1 | 5 |
| L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding? | Oct 3, 2024 | 8k; Document Summarization | Code Available | 1 | 5 |
| LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams | Apr 24, 2025 | Long-Context Understanding; Spoken Language Understanding | Code Available | 1 | 5 |
| LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Apr 22, 2025 | Benchmarking; Language Modeling | Code Available | 1 | 5 |
| LooGLE: Can Long-Context Language Models Understand Long Contexts? | Nov 8, 2023 | In-Context Learning; Long-Context Understanding | Code Available | 1 | 5 |
| Marathon: A Race Through the Realm of Long Context with Large Language Models | Dec 15, 2023 | Long-Context Understanding; Multiple-choice | Code Available | 1 | 5 |
| MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models | May 26, 2025 | Data Compression; Long-Context Understanding | Code Available | 1 | 5 |
| Mixture of In-Context Experts Enhance LLMs' Long Context Awareness | Jun 28, 2024 | Long-Context Understanding | Code Available | 1 | 5 |
| RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios | Dec 12, 2024 | Logical Reasoning; Long-Context Understanding | Code Available | 1 | 5 |
| S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models | Oct 23, 2023 | Long-Context Understanding | Code Available | 1 | 5 |
| Self-Taught Agentic Long Context Understanding | Feb 21, 2025 | Long-Context Understanding | Code Available | 1 | 5 |
| Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding | Jun 4, 2024 | Articles; Long-Context Understanding | Code Available | 0 | 5 |
| Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models | Jul 13, 2025 | Attribute; Benchmarking | Code Available | 0 | 5 |
| MesaNet: Sequence Modeling by Locally Optimal Test-Time Training | Jun 5, 2025 | Language Modeling; Language Modelling | Code Available | 0 | 5 |
| SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning | Feb 19, 2025 | Long-Context Understanding | Code Available | 0 | 5 |