SOTAVerified

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

2025-02-10

Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi

Abstract

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
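The WER and CER metrics named in the abstract follow standard definitions: edit distance between reference and hypothesis, normalized by reference length, computed over words for WER and over characters for CER. A minimal sketch (the paper's exact normalization and preprocessing may differ):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    dp = list(range(len(hyp) + 1))  # dp[j] = distance for ref prefix vs hyp[:j]
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i  # prev holds the previous row's diagonal value
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, or substitution (free if symbols match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[len(hyp)]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

For example, `wer("hello world", "hello there world")` is 0.5 (one inserted word over two reference words), and `cer("abc", "abd")` is 1/3 (one substituted character over three).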

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| VideoDB's OCR Benchmark Public Collection | GPT-4o | Average Accuracy | 76.22 | | Unverified |
| VideoDB's OCR Benchmark Public Collection | Gemini-1.5 Pro | Average Accuracy | 76.13 | | Unverified |
| VideoDB's OCR Benchmark Public Collection | Claude-3 Sonnet | Average Accuracy | 67.71 | | Unverified |
| VideoDB's OCR Benchmark Public Collection | RapidOCR | Average Accuracy | 56.98 | | Unverified |
| VideoDB's OCR Benchmark Public Collection | EasyOCR | Average Accuracy | 49.30 | | Unverified |
