SOTAVerified

Long-Context Understanding

Papers

Showing 1-50 of 81 papers

Title | Status | Hype
RULER: What's the Real Context Size of Your Long-Context Language Models? | Code | 9
InternLM2 Technical Report | Code | 9
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | Code | 7
GLM-130B: An Open Bilingual Pre-trained Model | Code | 6
GPT-4 Technical Report | Code | 6
Long-context LLMs Struggle with Long In-context Learning | Code | 5
Kimi-VL Technical Report | Code | 5
CogVLM: Visual Expert for Pretrained Language Models | Code | 5
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Code | 4
Gated Delta Networks: Improving Mamba2 with Delta Rule | Code | 4
M+: Extending MemoryLLM with Scalable Long-Term Memory | Code | 3
Retrieval Head Mechanistically Explains Long-Context Factuality | Code | 3
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Code | 3
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding | Code | 3
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Code | 3
Recurrent Context Compression: Efficiently Expanding the Context Window of LLM | Code | 2
What is Wrong with Perplexity for Long-context Language Modeling? | Code | 2
LongProLIP: A Probabilistic Vision-Language Model with Long Context Text | Code | 2
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks | Code | 2
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models | Code | 2
FABLES: Evaluating faithfulness and content selection in book-length summarization | Code | 2
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models | Code | 2
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Code | 2
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression | Code | 2
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Code | 2
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Code | 2
GATEAU: Selecting Influential Samples for Long Context Alignment | Code | 1
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | Code | 1
BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression | Code | 1
Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? | Code | 1
Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression | Code | 1
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning | Code | 1
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration | Code | 1
From Text to Pixel: Advancing Long-Context Understanding in MLLMs | Code | 1
Gemini: A Family of Highly Capable Multimodal Models | Code | 1
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs | Code | 1
L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding? | Code | 1
LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams | Code | 1
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Code | 1
LooGLE: Can Long-Context Language Models Understand Long Contexts? | Code | 1
Marathon: A Race Through the Realm of Long Context with Large Language Models | Code | 1
MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models | Code | 1
Mixture of In-Context Experts Enhance LLMs' Long Context Awareness | Code | 1
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios | Code | 1
S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models | Code | 1
Self-Taught Agentic Long Context Understanding | Code | 1
Equipping Transformer with Random-Access Reading for Long-Context Understanding | - | 0
State Space Models are Strong Text Rerankers | - | 0
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning | - | 0
The Claude 3 Model Family: Opus, Sonnet, Haiku | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o | 1 Image, 4*4 Stitching, Exact Accuracy | 83 | - | Unverified
2 | GPT-4V | 1 Image, 4*4 Stitching, Exact Accuracy | 54.72 | - | Unverified
3 | Gemini Pro 1.5 | 1 Image, 4*4 Stitching, Exact Accuracy | 39.85 | - | Unverified
4 | Gemini Pro 1.0 | 1 Image, 4*4 Stitching, Exact Accuracy | 24.78 | - | Unverified
5 | LLaVA-Llama-3 | 1 Image, 4*4 Stitching, Exact Accuracy | 17.5 | - | Unverified
6 | Claude 3 Opus | 1 Image, 4*4 Stitching, Exact Accuracy | 12.3 | - | Unverified
7 | IDEFICS2-8B | 1 Image, 4*4 Stitching, Exact Accuracy | 7.8 | - | Unverified
8 | InstructBLIP-Flan-T5-XXL | 1 Image, 4*4 Stitching, Exact Accuracy | 6.2 | - | Unverified
9 | CogVLM2-Llama-3 | 1 Image, 4*4 Stitching, Exact Accuracy | 0.9 | - | Unverified
10 | mPLUG-Owl-v2 | 1 Image, 4*4 Stitching, Exact Accuracy | 0.3 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4-Turbo-1106 | 1k | 74 | - | Unverified
2 | GPT-4-Turbo-0125 | 1k | 73.5 | - | Unverified
3 | Claude-2 | 1k | 65 | - | Unverified
4 | GPT-3.5-Turbo-1106 | 1k | 61.5 | - | Unverified
5 | InternLM2-7b | 1k | 58.6 | - | Unverified
6 | Vicuna-13b-v1.5-16k | 1k | 53.4 | - | Unverified
7 | ChatGLM3-6b-32k | 1k | 39.8 | - | Unverified
8 | Vicuna-7b-v1.5-16k | 1k | 37 | - | Unverified
9 | LongChat-7b-v1.5-32k | 1k | 32.4 | - | Unverified
10 | ChatGLM2-6b-32k | 1k | 31.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4-Turbo-1106 | 2k | 18.5 | - | Unverified
2 | GPT-4-Turbo-0125 | 2k | 15.5 | - | Unverified
3 | Vicuna-13b-v1.5-16k | 2k | 5.4 | - | Unverified
4 | LongChat-7b-v1.5-32k | 2k | 5.3 | - | Unverified
5 | Vicuna-7b-v1.5-16k | 2k | 5.3 | - | Unverified
6 | InternLM2-7b | 2k | 5.1 | - | Unverified
7 | Claude-2 | 2k | 5 | - | Unverified
8 | GPT-3.5-Turbo-1106 | 2k | 4 | - | Unverified
9 | ChatGLM3-6b-32k | 2k | 2.3 | - | Unverified
10 | ChatGLM2-6b-32k | 2k | 0.9 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GALI(Llama3-8b-ins-4k-to-16k) | Average Score | 59.21 | - | Unverified
2 | GALI(Llama3-8b-ins-4k-to-32k) | Average Score | 59.1 | - | Unverified
3 | GALI(Llama3-8b-ins-8k-to-32k) | Average Score | 42.79 | - | Unverified
4 | GALI(Llama3-8b-ins-8k-to-16k) | Average Score | 42.32 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GALI(Llama3-8b-ins-4k-to-16k) | Average Score | 46.22 | - | Unverified
2 | GALI(Llama3-8b-ins-8k-to-32k) | Average Score | 45.38 | - | Unverified
3 | GALI(Llama3-8b-ins-8k-to-16k) | Average Score | 45.17 | - | Unverified