TEXT EXTRACTION FROM DEGRADED HISTORICAL DOCUMENTS USING ADVANCED IMAGE PROCESSING AND CLUSTERING TECHNIQUES

Shashidhara B, Yogish Naik G R, Vidyasagar K B

Abstract

This study proposes a novel method for extracting text from degraded historical document images by combining image processing with clustering techniques. The pipeline converts each document image to grayscale, enhances contrast via histogram equalization, detects text edges with the Canny algorithm, refines character shapes through skeletonization, identifies candidate text regions using connected component analysis, and then applies geometric filtering and nearest-neighbour clustering to remove noise and group the remaining regions. This integrated approach separates authentic textual content from background noise and extraneous elements. In experiments on degraded historical document datasets, the proposed method consistently attains Precision of 91–94%, Recall of 94–97%, and F1-Scores of 92–95%, and its Error Rates remain the lowest, at 7–10%. The approach improves the quality of digitized text from challenging archival documents, offering an effective solution for libraries, researchers, and historical digitization projects.
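The last three stages of the pipeline (connected component analysis, geometric filtering, nearest-neighbour clustering) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes the earlier stages have already produced a binary image, and the function names, the toy page, and the thresholds `min_area`, `max_aspect` and `max_dist` are hypothetical values chosen only for illustration.

```python
import math
from collections import deque

def connected_components(grid):
    """Find 8-connected foreground regions; return one bounding box per region."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    comps = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            comps.append({"top": min(ys), "left": min(xs),
                          "bottom": max(ys), "right": max(xs),
                          "area": len(pixels)})
    return comps

def geometric_filter(comps, min_area=3, max_aspect=4.0):
    """Drop specks (tiny area) and line-like regions (extreme aspect ratio).
    Both thresholds are illustrative assumptions, not values from the study."""
    kept = []
    for comp in comps:
        h = comp["bottom"] - comp["top"] + 1
        w = comp["right"] - comp["left"] + 1
        if comp["area"] >= min_area and max(h, w) / min(h, w) <= max_aspect:
            kept.append(comp)
    return kept

def nearest_neighbour_clusters(comps, max_dist=6.0):
    """Link each region to its nearest neighbour if their box centres lie
    within max_dist pixels; return groups of indices into comps."""
    centres = [((c["top"] + c["bottom"]) / 2, (c["left"] + c["right"]) / 2)
               for c in comps]
    parent = list(range(len(comps)))

    def find(i):  # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (yi, xi) in enumerate(centres):
        best, best_d = None, max_dist
        for j, (yj, xj) in enumerate(centres):
            d = math.hypot(yi - yj, xi - xj)
            if j != i and d <= best_d:
                best, best_d = j, d
        if best is not None:
            parent[find(i)] = find(best)
    groups = {}
    for i in range(len(comps)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy binary page: three character-like blobs, one speck, one ruled line.
PAGE = [
    "........................",
    ".##..##.......#....##...",
    ".##..##............##...",
    ".##..##............##...",
    "........................",
    "..........######........",
    "........................",
]
grid = [[ch == "#" for ch in row] for row in PAGE]
regions = geometric_filter(connected_components(grid))
print(nearest_neighbour_clusters(regions))
```

On this toy page the speck and the ruled line are discarded by the geometric filter, the two adjacent blobs are grouped together, and the distant blob forms its own group. A practical system would more likely rely on an image processing library for these steps and tune the thresholds to the scan resolution.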
