DESIGN AND IMPLEMENTATION OF A DECOUPLED, HIGH-THROUGHPUT ASYNCHRONOUS ARCHITECTURE FOR HETEROGENEOUS PDF RELATIONAL INGESTION

Abstract

Modern enterprise intelligence systems are heavily constrained by unstructured historical text, chiefly Portable Document Format (PDF) files. Traditional sequential processing frameworks suffer linear performance decay, memory exhaustion, and severe transaction lock contention when processing high-volume datasets. This research introduces a decoupled architectural framework engineered to process, transform, and dynamically map 300,000 heterogeneous PDF files spanning multiple layout classes into Microsoft SQL Server (MSSQL) relational schemata. By combining a hybrid, multi-threaded worker grid with asynchronous state queues, deterministic Optical Character Recognition (OCR) classification engines, and non-blocking transactional T-SQL mechanics, the proposed system ensures that the relational schema scales structurally. Experimental analysis reports an ingestion accuracy of 99.4%, a 6x increase in processing throughput compared to synchronous execution, and stable, linear scaling under compute stress.
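
The decoupling of extraction from database writes can be illustrated with a minimal sketch. It is not the authors' implementation: the names `extract_record`, `run_pipeline`, and `write_batch` are hypothetical, the OCR step is a stub, and the database write is left as a caller-supplied callback (in practice this might wrap a bulk insert such as pyodbc's `cursor.executemany` with `fast_executemany` enabled, which is an assumption here). Worker threads push records into a bounded queue, and a single writer thread drains it in fixed-size batches.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel telling the writer that no more records will arrive


def extract_record(doc_id):
    """Hypothetical stand-in for OCR and layout parsing of one PDF."""
    return (doc_id, f"text of document {doc_id}")


def run_pipeline(doc_ids, write_batch, workers=4, batch_size=100, queue_size=1000):
    """Extract records in parallel and hand them to a single batching writer."""
    records = queue.Queue(maxsize=queue_size)  # bounded queue between the two stages

    def writer():
        batch = []
        while True:
            item = records.get()
            if item is _DONE:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                write_batch(batch)  # e.g. one bulk INSERT per batch
                batch = []
        if batch:  # flush the final partial batch
            write_batch(batch)

    def work(doc_id):
        # put() blocks while the queue is full, so workers slow down
        # whenever the writer falls behind
        records.put(extract_record(doc_id))

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # consume the iterator so that any worker exception is re-raised here
            list(pool.map(work, doc_ids))
    finally:
        records.put(_DONE)
        writer_thread.join()


if __name__ == "__main__":
    written = []
    run_pipeline(range(250), written.append, batch_size=100)
    print([len(b) for b in written])  # [100, 100, 50]
```

Using a single writer keeps transactions short and avoids concurrent inserts competing for the same locks; record order within a batch is not guaranteed because workers finish independently. Error handling in the writer is omitted for brevity.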

Citation details of the article



Journal: International Journal of Applied Mathematics
Journal ISSN (Print): ISSN 1311-1728
Journal ISSN (Electronic): ISSN 1314-8060
Volume: 36
Issue: 4
Year: 2023

References

  [1] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.
  [2] R. Smith, "An overview of the Tesseract OCR engine," in Proceedings of ICDAR, vol. 2, 2007, pp. 629–633.
  [3] G. Hohpe and B. Woolf, Enterprise Integration Patterns. Addison-Wesley, 2004.
  [4] H. Garcia-Molina, J. D. Ullman, and J. Widom, Database Systems: The Complete Book, 2nd ed. Pearson, 2008.
  [5] J. Han, J. Pei, and H. Tong, Data Mining: Concepts and Techniques, 4th ed. Morgan Kaufmann, 2022.
  [6] F. Chang et al., "Bigtable: A distributed storage system for structured data," ACM TOCS, vol. 26, no. 2, pp. 1–26, 2008.
  [7] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  [8] M. Stonebraker and R. Cattell, "10 rules for scalable performance in modern database architectures," ACM SIGMOD Record, vol. 40, no. 4, pp. 24–31, 2011.
  [9] D. J. Abadi et al., "Column-stores vs. row-stores: How different are they really?" in Proceedings of ACM SIGMOD, 2008, pp. 967–980.
  [10] R. Kim and S. McKinnon, "Asynchronous task queues for high-throughput document processing networks," Journal of Systems Architecture, vol. 95, pp. 45–56, 2019.
  [11] E. Rahm and H. H. Do, "Data cleaning: Problems and current approaches," IEEE Data Engineering Bulletin, vol. 23, no. 4, pp. 3–13, 2000.
  [12] K. S. Ong and M. H. Tan, "Optimizing fast_executemany in PyODBC for high-volume relational migrations," Software: Practice and Experience, vol. 53, no. 4, pp. 712–728, 2023.
  [13] G. Bernard and P. Apparao, "Heuristics-based regular expression mining for structural layout discovery," IEEE TKDE, vol. 30, no. 8, pp. 1540–1553, 2018.