This study examines efficient Big Data Engineering and Extract, Transform, Load (ETL) processes in the healthcare sector, using the MIMIC-III Clinical Database as its foundation. We explore several methods for making ETL processes more efficient, with a primary emphasis on reducing processing time and resource utilization. Through experiments on a representative dataset, we show the advantages of incorporating PySpark and Docker containerized applications. Using PySpark for distributed computing within Big Data Engineering workflows yielded significant improvements in time efficiency, process streamlining, and resource optimization. Docker containers, in turn, improved the scalability and reproducibility of the ETL pipeline. This paper summarizes the key insights from these experiments and highlights the practical implications and benefits of adopting PySpark and Docker. By streamlining Big Data Engineering and ETL processes for clinical big data, the study contributes to the ongoing discussion on optimizing data processing efficiency in healthcare applications. The source code is available on request.