Journal of Data Analysis and Information Processing 2024

Optimizing Healthcare Big Data Processing with Containerized PySpark and Parallel Computing: A Study on ETL Pipeline Efficiency

DOI: 10.4236/jdaip.2024.124029, PP. 544-565

Ehsan Soltanmohammadi, Neset Hikmet

Keywords: Big Data Engineering, ETL, Healthcare Sector, Containerized Applications, Distributed Computing, Resource Optimization, Data Processing Efficiency


Abstract:

This study examines efficient Big Data Engineering and Extract, Transform, Load (ETL) processes in the healthcare sector, using the MIMIC-III Clinical Database. We explore several methods for improving ETL efficiency, with a primary emphasis on time and resource utilization. Through experiments on a representative dataset, we show the advantages of incorporating PySpark and Docker containerized applications. Using PySpark for distributed computing in Big Data Engineering workflows yields significant gains in time efficiency, process streamlining, and resource optimization. We also describe how Docker containers improve scalability and reproducibility within the ETL pipeline. The paper summarizes the main insights from these experiments and the practical implications of adopting PySpark and Docker. By streamlining Big Data Engineering and ETL processes for clinical big data, the study contributes to the ongoing discussion on optimizing data processing efficiency in healthcare applications. The source code is available on request.
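The authors' source code is not reproduced here, so the following is only a minimal sketch of the general pattern the abstract describes: extract records, transform them in parallel partitions, and load the result. It deliberately uses only the Python standard library. A thread pool stands in for PySpark's distributed executors and an in-memory SQLite database stands in for the target store. The sample rows, column names, and function names are hypothetical and are not taken from MIMIC-III or from the paper.

```python
import csv
import io
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

# Hypothetical sample rows, loosely shaped like an admissions table (not real MIMIC-III data).
RAW_CSV = """subject_id,admittime,dischtime
1,2020-01-01 00:00:00,2020-01-03 12:00:00
2,2020-02-10 08:00:00,2020-02-10 20:00:00
3,2020-03-05 06:00:00,2020-03-09 06:00:00
4,2020-04-01 00:00:00,2020-04-02 00:00:00
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def partition(rows, n):
    """Split rows into n roughly equal partitions (round-robin)."""
    return [rows[i::n] for i in range(n)]


def transform(rows):
    """Transform: compute length of stay in hours for each row of one partition."""
    out = []
    for row in rows:
        admitted = datetime.strptime(row["admittime"], TIME_FORMAT)
        discharged = datetime.strptime(row["dischtime"], TIME_FORMAT)
        hours = (discharged - admitted).total_seconds() / 3600.0
        out.append((int(row["subject_id"]), hours))
    return out


def load(conn, rows):
    """Load: write the transformed rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stays (subject_id INTEGER PRIMARY KEY, los_hours REAL)"
    )
    conn.executemany("INSERT INTO stays VALUES (?, ?)", rows)
    conn.commit()


def run_etl(text, workers=2):
    """Run extract, partitioned transform, and load; return the open connection."""
    rows = extract(text)
    # The thread pool is a stand-in for distributed executors, not a PySpark API.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(transform, partition(rows, workers)))
    flat = [record for part in parts for record in part]
    conn = sqlite3.connect(":memory:")
    load(conn, flat)
    return conn


conn = run_etl(RAW_CSV)
result = conn.execute(
    "SELECT subject_id, los_hours FROM stays ORDER BY subject_id"
).fetchall()
print(result)
```

In a PySpark setting the partitioning and scheduling would be handled by the framework rather than by hand, and packaging such a script in a container image is one way to pin the runtime and its dependencies so that runs are repeatable across machines.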
