The cost and strict input-format requirements of GraphRAG make it inefficient for processing large documents. This paper proposes an alternative approach for constructing a knowledge graph (KG) from a PDF document, with a focus on simplicity and cost-effectiveness. The process involves splitting the document into chunks, extracting concepts from each chunk using a large language model (LLM), and building relationships based on the co-occurrence of concepts within the same chunk. Unlike traditional named entity recognition (NER), which identifies entities such as “Shanghai”, the proposed method identifies concepts, such as “convenient transportation in Shanghai”, which prove more meaningful for KG construction. Each edge in the KG represents a relationship between concepts occurring in the same text chunk. The process is computationally inexpensive, relying on locally hosted tools, namely the Mistral 7B OpenOrca Instruct model served through Ollama, so the entire graph-generation process is cost-free. A method is introduced for assigning weights to relationships, grouping similar concept pairs, and summarizing multiple relationships into a single edge carrying the combined weight and relation details. Additionally, node degrees and communities are computed to drive node sizing and coloring. This approach offers a scalable, cost-effective solution for generating meaningful knowledge graphs from large documents, achieving results comparable to GraphRAG while remaining accessible on personal machines.
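The sketch below illustrates the pipeline outlined above under simplifying assumptions: it uses the Ollama Python client with a local "mistral-openorca" model for concept extraction and networkx for graph construction. The prompt wording, chunking parameters, and function names are illustrative choices, not the authors' exact implementation, and the LLM-based summarization of grouped relationships is reduced here to simple co-occurrence counting.

```python
# Minimal sketch: chunk the text, extract concepts per chunk with a local LLM,
# and connect concepts that co-occur in the same chunk.
import json
from collections import Counter
from itertools import combinations

import networkx as nx
import ollama  # assumes a local Ollama server with "mistral-openorca" pulled


def split_into_chunks(text: str, size: int = 1500, overlap: int = 150) -> list[str]:
    """Naive fixed-size character chunking with overlap (illustrative parameters)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def extract_concepts(chunk: str, model: str = "mistral-openorca") -> list[str]:
    """Ask the local LLM for key concepts in a chunk, returned as a JSON array."""
    prompt = (
        "Extract the key concepts (short descriptive phrases, not bare named "
        "entities) from the text below. Respond with a JSON array of strings "
        "only.\n\n" + chunk
    )
    response = ollama.generate(model=model, prompt=prompt)
    try:
        return [c.strip().lower() for c in json.loads(response["response"])]
    except (json.JSONDecodeError, TypeError):
        return []  # skip chunks whose output cannot be parsed


def build_graph(chunks: list[str]) -> nx.Graph:
    """Edge = co-occurrence of two concepts in a chunk; weight = co-occurrence count."""
    weights: Counter = Counter()
    for chunk in chunks:
        concepts = extract_concepts(chunk)
        for a, b in combinations(sorted(set(concepts)), 2):
            weights[(a, b)] += 1

    graph = nx.Graph()
    for (a, b), w in weights.items():
        graph.add_edge(a, b, weight=w)

    # Node degree for sizing, greedy-modularity communities for coloring.
    nx.set_node_attributes(graph, dict(graph.degree()), "size")
    for i, community in enumerate(nx.community.greedy_modularity_communities(graph)):
        for node in community:
            graph.nodes[node]["group"] = i
    return graph
```

In this sketch, repeated co-occurrences of the same concept pair simply increase the edge weight; the paper's fuller method would additionally merge the relation texts of grouped pairs into a single summarized edge attribute.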