Enterprise applications rely on relational databases and structured business processes, so the inputs and outputs carried in business documents such as invoices, purchase orders, and receipts must first be converted into known templates and schemas, a step that is slow and expensive. We propose an LLM agent-based intelligent data extraction, transformation, and load (IntelligentETL) pipeline that ingests PDFs, detects the inputs within them, and extracts both structured and unstructured data using tools built to handle each data type efficiently and securely. We study the efficiency of the proposed pipeline and compare it with enterprise solutions that also use LLMs. We establish the superiority of our approach in timely and accurate extraction and transformation of data from varied sources under nested and/or interlinked input constraints.