As Large Language Models (LLMs) are integrated more widely into scientific R&D, the associated privacy risks become increasingly critical. Scientific NoSQL repositories, which often store sensitive experimental documentation, must be protected from data leakage and inference attacks. This paper proposes a privacy-preserving architecture that enables LLM-based querying, summarization, and guidance over scientific NoSQL datasets under differential privacy (DP) constraints. The framework comprises local sensitivity analysis, DP-calibrated query transformation, privacy-aware embeddings, and a controlled interface for LLM interactions. Experiments on synthetic and biomedical datasets characterize the trade-off between privacy budget and semantic utility. The work connects secure data infrastructure with intelligent scientific interfaces and supports compliant, interpretable AI deployments in research settings.