PLOS ONE, 2014

How Good Is Crude MDL for Solving the Bias-Variance Dilemma? An Empirical Investigation Based on Bayesian Networks

DOI: 10.1371/journal.pone.0092866


Abstract:

The bias-variance dilemma is a well-known and important problem in Machine Learning that relates the generalization capability (goodness of fit) of a learning method to its corresponding complexity. When enough data are at hand, they can be used to minimize overfitting (the risk of selecting a complex model that generalizes poorly). Unfortunately, in many situations this required amount of data is simply not available, so we need methods that exploit the available data efficiently while avoiding overfitting. Different metrics have been proposed to achieve this goal: the Minimum Description Length principle (MDL), Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC), among others. In this paper, we focus on crude MDL and empirically evaluate its performance in selecting models with a good balance between goodness of fit and complexity: the so-called bias-variance dilemma, decomposition or tradeoff. Although the graphical interaction between these two dimensions (bias and variance) is ubiquitous in the Machine Learning literature, few works present experimental evidence that recovers it. We argue that the graphs resulting from our experiments yield insights that are difficult to unveil otherwise: crude MDL naturally selects models that are balanced in terms of bias and variance, and these need not be the gold-standard ones. We carry out these experiments using a specific model: a Bayesian network. Despite these motivating results, three other components that may significantly affect the final model selection should not be overlooked: the search procedure, the noise rate and the sample size.
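The crude (two-part) MDL score evaluated here trades off fit against complexity in a single quantity: the negative log-likelihood of the data under maximum-likelihood parameters plus a penalty of (k/2) log2 n bits, where k is the number of free parameters and n is the sample size. The sketch below is a rough illustration of that score for a discrete Bayesian network, not the authors' implementation; the function name, the data layout (a list of dicts) and the structure encoding (a mapping from each variable to its parent list) are assumptions made for the example.

import math
from collections import Counter

def crude_mdl(data, structure, cardinality):
    """Crude two-part MDL score, in bits:
    MDL(M, D) = -log2 P(D | M, theta_hat) + (k / 2) * log2(n).
    data: list of dicts (one per sample); structure: {var: [parents]};
    cardinality: {var: number of states}."""
    n = len(data)
    log_lik = 0.0
    k = 0  # number of free parameters
    for var, parents in structure.items():
        # Count joint and parent configurations for maximum-likelihood estimates.
        joint = Counter(tuple(row[p] for p in parents) + (row[var],) for row in data)
        parent_cfg = Counter(tuple(row[p] for p in parents) for row in data)
        for cfg, n_ijk in joint.items():
            n_ij = parent_cfg[cfg[:-1]]
            log_lik += n_ijk * math.log2(n_ijk / n_ij)  # log-likelihood contribution
        # Each parent configuration contributes (r - 1) free parameters.
        q = 1
        for p in parents:
            q *= cardinality[p]
        k += q * (cardinality[var] - 1)
    return -log_lik + 0.5 * k * math.log2(n)

# Example (hypothetical data): two binary variables with an edge A -> B.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
print(crude_mdl(data, {"A": [], "B": ["A"]}, {"A": 2, "B": 2}))  # prints 9.0

On this reading, denser structures lower the likelihood term but raise k, which is exactly the bias-variance balance the paper's experiments probe.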

