A kernel-based neural network (KNN) is proposed as a neuron applicable to online learning with adaptive parameters. This neuron, with an adaptive kernel parameter, can classify data accurately without resorting to a multilayer error-backpropagation neural network. The proposed method, whose core is the kernel least-mean-square algorithm, reduces the memory requirement through a sparsification technique while the kernel spread adapts during learning. Our experiments show that this method is considerably faster and more accurate than previous online learning algorithms.

1. Introduction

The adaptive filter is at the heart of most neural networks [1]. The least-mean-square (LMS) method and its kernel-based variants are online, iterative learning methods that reduce the mean squared error toward the optimum Wiener weights. Owing to its simple implementation [1], LMS became one of the candidates for online kernel-based learning. Kernel-based learning [2] utilizes Mercer kernels to produce nonlinear versions of conventional linear methods. Following the introduction of the kernel into this framework, the kernel least-mean-square (KLMS) algorithm [3, 4] was proposed. KLMS solves the LMS problem in reproducing kernel Hilbert spaces (RKHS) [3] using a stochastic gradient methodology; a minimal sketch of this recursion is given at the end of this section. The KNN combines kernel abilities with LMS features, learns easily over varied patterns, and retains the capabilities of traditional neurons. The experimental results show that, with suitable parameters, this classifier performs better than other online kernel methods.

Two main drawbacks of kernel-based methods are selecting a proper value for the kernel parameters and the series expansion whose size equals the number of training samples, which makes them unsuitable for online applications. This paper concentrates only on the Gaussian kernel (for reasons similar to those discussed in [5]), although the KNN can use other kernels as well. In [6], the role of the kernel width in the smoothness of the performance surface is discussed. Determining the width of the Gaussian kernel is therefore very important in kernel-based methods: controlling the kernel width helps control the learning rate and the tradeoff between overfitting and underfitting. Cross-validation is one of the simplest ways to tune this parameter, but it is costly and cannot be used for datasets with too many classes; in [7], the parameters are therefore chosen using a subset of the data with a small number of classes. In other methods, a genetic algorithm [8] or a grid search [5] is used to determine the proper value of such parameters. However, in all the mentioned methods, the kernel width is chosen as a fixed constant.
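To make the KLMS recursion mentioned above concrete, the following Python fragment is a minimal sketch of plain KLMS with a Gaussian kernel. It assumes a fixed kernel width and applies no sparsification, so the kernel expansion grows with every training sample, which is exactly the memory drawback noted above; the function names and the default step size and width are illustrative choices, not the authors' implementation.

import numpy as np

def gaussian_kernel(x, c, width):
    # Gaussian (RBF) kernel between an input x and a stored centre c.
    diff = x - c
    return np.exp(-np.dot(diff, diff) / (2.0 * width ** 2))

def klms_train(X, d, step_size=0.5, width=1.0):
    # Train plain KLMS on inputs X (n_samples x n_features) and desired outputs d.
    # Returns the stored centres and their coefficients (the kernel expansion).
    centres, alphas = [], []
    for x, target in zip(X, d):
        # The prediction is a kernel expansion over all previously stored samples.
        y = sum(a * gaussian_kernel(x, c, width) for a, c in zip(alphas, centres))
        error = target - y
        # Stochastic-gradient update in the RKHS: store the new sample with a
        # coefficient proportional to the instantaneous error.
        centres.append(x)
        alphas.append(step_size * error)
    return centres, alphas

def klms_predict(x, centres, alphas, width=1.0):
    return sum(a * gaussian_kernel(x, c, width) for a, c in zip(alphas, centres))

Running klms_train on a labeled training set and then calling klms_predict on new inputs reproduces the growing-expansion behaviour that the sparsification and adaptive kernel width discussed in this paper are meant to address.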
References
[1] B. Widrow, “Adaptive filters I: fundamentals,” Tech. Rep. SEL-66-126 (TR-6764-6), Stanford Electronic Laboratories, Stanford, Calif, USA, 1966.
[2] J. Kivinen, A. J. Smola, and R. C. Williamson, Online Learning with Kernels, IEEE, New York, NY, USA, 2004.
[3] P. P. Pokharel, W. Liu, and J. C. Principe, “Kernel LMS,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), pp. III-1421–III-1424, Honolulu, Hawaii, USA, April 2007.
[4] W. Liu, P. P. Pokharel, and J. C. Principe, “The kernel least-mean-square algorithm,” IEEE Transactions on Signal Processing, vol. 56, no. 2, pp. 543–554, 2008.
[5] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, Mass, USA, 2002.
[6] D. Erdogmus and J. C. Principe, “Generalized information potential criterion for adaptive system training,” IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1035–1044, 2002.
[7] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, MIT Press, Cambridge, Mass, USA, 2002.
[8] V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
[9] Q. Chang, Q. Chen, and X. Wang, “Scaling Gaussian RBF kernel width to improve SVM classification,” in Proceedings of the International Conference on Neural Networks and Brain (ICNNB '05), pp. 19–22, Beijing, China, October 2005.
[10] Y. Baram, “Learning by kernel polarization,” Neural Computation, vol. 17, no. 6, pp. 1264–1275, 2005.
[11] H. Xiong, M. N. S. Swamy, and M. O. Ahmad, “Optimizing the kernel in the empirical feature space,” IEEE Transactions on Neural Networks, vol. 16, no. 2, pp. 460–474, 2005.
[12] A. Singh and J. C. Príncipe, “Kernel width adaptation in information theoretic cost functions,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '10), pp. 2062–2065, Dallas, Tex, USA, March 2010.
[13] J. A. K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle, “Weighted least squares support vector machines: robustness and sparse approximation,” Neurocomputing, vol. 48, pp. 85–105, 2002.
[14] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines, World Scientific Publishing, River Edge, NJ, USA, 2002.
[15] B. J. De Kruif and T. J. A. De Vries, “Pruning error minimization in least squares support vector machines,” IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 696–702, 2003.
[16] G. C. Cawley and N. L. C. Talbot, “Improved sparse least-squares support vector machines,” Neurocomputing, vol. 48, pp. 1025–1031, 2002.
[17] L. Hoegaerts, Eigenspace methods and subset selection in kernel based learning [Ph.D. thesis], Katholieke Universiteit Leuven, Leuven, Belgium, 2005.
[18] L. Hoegaerts, J. A. K. Suykens, J. Vandewalle, and B. De Moor, “Subset based least squares subspace regression in RKHS,” Neurocomputing, vol. 63, pp. 293–323, 2005.
[19] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least-squares algorithm,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[20] P. P. Pokharel, W. Liu, and J. C. Principe, “Kernel least mean square algorithm with constrained growth,” Signal Processing, vol. 89, no. 3, pp. 257–265, 2009.
[21] S. Van Vaerenbergh, J. Vía, and I. Santamaría, “A sliding-window kernel RLS algorithm and its application to nonlinear channel identification,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), pp. V-789–V-792, May 2006.
[22] P. Honeine, C. Richard, and J. C. Bermudez, “On-line nonlinear sparse approximation of functions,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT '07), pp. 956–960, Nice, France, June 2007.
[23] C. Richard, J. C. M. Bermudez, and P. Honeine, “Online prediction of time series data with kernels,” IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 1058–1067, 2009.
[24] H. J. Bierens, “Introduction to Hilbert Spaces,” Lecture notes, 2007.
[25] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
[26] B. Schölkopf, S. Mika, C. J. C. Burges et al., “Input space versus feature space in kernel-based methods,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1000–1017, 1999.
[27] Y. Li and P. M. Long, “The relaxed online maximum margin algorithm,” Machine Learning, vol. 46, no. 1–3, pp. 361–387, 2002.
[28] C. Gentile, “A new approximate maximal margin classification algorithm,” Journal of Machine Learning Research, vol. 2, pp. 213–242, 2002.
[29] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” Journal of Machine Learning Research, vol. 7, pp. 551–585, 2006.