High Dimension Multivariate Data Analysis for Small Group Samples of Chemical Volatile Profiles of African Nightshade Species

DOI: 10.4236/jdaip.2024.122012, pp. 210-231

Keywords: Random Forest, Similarity Percentage, PERMANOVA, ANOSIM, Non-Metric Multi-Dimensional Scaling

Abstract:

Quantitative headspace analysis of volatiles emitted by plants or other living organisms in chemical ecology studies generates large multidimensional data sets that require extensive mining and refining to extract useful information. Often the number of variables (the quantified volatile compounds) exceeds the number of observations or samples, so many traditional statistical methods become inefficient. Here, we employed a machine learning algorithm, random forest (RF), in combination with a distance-based procedure, similarity percentage (SIMPER), as preprocessing steps to reduce the dimensionality of the chemical volatile profiles of three African nightshade plant species before subjecting the data to non-metric multidimensional scaling (NMDS). In addition, two non-parametric methods, permutational multivariate analysis of variance (PERMANOVA) and analysis of similarities (ANOSIM), were applied to test the hypothesis of differences among the African nightshade species based on their volatile profiles and to confirm the patterns revealed by the NMDS plots. Our results revealed significant differences among the African nightshade species when the dimensionality of the data was reduced using RF variable importance and SIMPER. This was supported by NMDS plots showing S. scabrum separated from S. villosum and S. sarrachoides on the basis of the reduced set of variables. The novelty of our work lies in showing the merits of data reduction techniques for revealing differences between groups that might not have been detected had the analysis been performed on the entire original data matrix, which is characterized by small samples. The R code used in the analysis is shared herein so that interested researchers can customise it for their own data of a similar nature.
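To make the ANOSIM step named in the abstract more concrete, the following is a minimal sketch in plain Python of the rank-based R statistic with a label-permutation test on Bray-Curtis dissimilarities. It is not the authors' R code, which is not reproduced on this page. The function names, the choice of Bray-Curtis as the dissimilarity, and the toy abundance values are all assumptions made for illustration only.

```python
import random
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def average_ranks(values):
    """Rank values from 1..len(values), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def anosim_r(ranks, pairs, groups):
    """R = (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    n = len(groups)
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


def anosim(samples, groups, n_perm=999, seed=0):
    """Return the observed R and a permutation p-value from shuffled labels."""
    pairs = list(combinations(range(len(samples)), 2))
    dissimilarities = [bray_curtis(samples[i], samples[j]) for i, j in pairs]
    ranks = average_ranks(dissimilarities)
    observed = anosim_r(ranks, pairs, groups)
    rng = random.Random(seed)
    labels = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if anosim_r(ranks, pairs, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 6 samples x 3 compounds, two invented groups.
samples = [
    [10, 1, 0], [9, 2, 0], [11, 1, 1],
    [0, 1, 10], [1, 2, 9], [0, 1, 12],
]
groups = ["A", "A", "A", "B", "B", "B"]

r_stat, p_value = anosim(samples, groups)
print(f"ANOSIM R = {r_stat:.3f}, permutation p = {p_value:.3f}")
```

An R value near 1 indicates that between-group dissimilarities rank consistently higher than within-group ones, and a value near 0 indicates no separation. With very few samples per group, the number of distinct label arrangements is small, which limits how small the permutation p-value can become. This is one reason the small-sample setting described in the abstract calls for care.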
