Performance Comparison of Tesseract and Google Document AI in Punjabi Newspapers Digitization

Atul Kumar, Gurpreet Singh Lehal

Abstract

This paper focuses on the digitization of Punjabi newspapers through Optical Character Recognition (OCR) technology and presents a comparative analysis of two OCR solutions, Google Document AI and Tesseract. Punjabi newspapers, with their complex layouts, non-standard fonts, and linguistic nuances, pose challenges for OCR systems. The study evaluates the out-of-the-box performance of both solutions in extracting text from Punjabi newspaper scans, using a benchmarking experiment on a dataset of Punjabi newspaper segments. The methodology involves image enhancement, layout analysis, and OCR execution, followed by qualitative and quantitative analyses of precision and reliability. A Mask R-CNN model, applied through a layout parser, extracts the newspaper layout, and text extraction is performed on the segments obtained from the newspaper images. While Tesseract demonstrates competitive performance, Google Document AI exhibits superior accuracy, highlighting the potential of server-based OCR solutions for handling diverse document types. Specifically, Tesseract achieved an accuracy of 97.20% at the word level and 92.48% at the character level, whereas Google Document AI achieved 98.86% at the word level and 95.62% at the character level. These findings contribute to the advancement of OCR technology in the context of Punjabi newspaper digitization, facilitating broader access to historical Punjabi texts for scholarly research.
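
The abstract reports word-level and character-level accuracy but does not state how these figures were computed. The following is a minimal sketch of one common way to do it, assuming accuracy is defined as 1 minus the edit distance divided by the reference length, with words obtained by whitespace splitting and characters counted as Unicode code points (so Gurmukhi vowel signs and other combining marks count separately). The function names are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def accuracy(ref: Sequence, hyp: Sequence) -> float:
    """Assumed definition: 1 - edit_distance / len(ref), clamped at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(reference: str, ocr_output: str) -> float:
    # Compares the strings code point by code point.
    return accuracy(reference, ocr_output)


def word_accuracy(reference: str, ocr_output: str) -> float:
    # Compares whitespace-separated tokens.
    return accuracy(reference.split(), ocr_output.split())
```

Under this assumed definition, each newspaper segment's OCR output would be scored against its manually verified ground-truth transcription, and the per-segment scores aggregated over the dataset. Other definitions, such as exact-match word counts, can yield different figures for the same output.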
