Recently, several deep learning models have been proposed and successfully applied to a variety of Natural Language Processing (NLP) tasks. However, most of these models are trained with single-task supervised learning and do not exploit the correlations between related tasks. Motivated by this observation, in this paper we implemented a multi-task learning model that jointly learns two related NLP tasks and conducted experiments to evaluate whether learning these tasks jointly improves system performance compared with learning them individually. In addition, we compare our model with state-of-the-art approaches, including multi-task learning, transfer learning, unsupervised learning, and feature-based traditional machine learning models. This paper aims to 1) demonstrate the advantage of multi-task learning over single-task learning for training related NLP tasks, 2) illustrate the influence of various encoding structures on the proposed single- and multi-task learning models, and 3) compare the performance of multi-task learning with other learning models in the literature on the textual entailment task and the semantic relatedness task.
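To make the joint-learning setup concrete, the following PyTorch-style sketch shows one common way to share an encoder between the two tasks: a single sentence encoder feeds an entailment-classification head and a relatedness-regression head, and the two losses are summed so the shared parameters receive gradients from both tasks. The class name, the BiLSTM encoder with mean pooling, the [u; v; |u-v|; u*v] pair features, and the toy data are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Hypothetical multi-task model: shared BiLSTM sentence encoder,
    two task-specific heads (entailment classification, relatedness regression)."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=150, num_entailment_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Shared encoder: its weights are updated by gradients from both tasks.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        pair_dim = 4 * 2 * hidden_dim  # [u; v; |u-v|; u*v] over BiLSTM sentence vectors
        self.entailment_head = nn.Linear(pair_dim, num_entailment_classes)
        self.relatedness_head = nn.Linear(pair_dim, 1)

    def encode(self, sentence):
        # Mean-pool the BiLSTM hidden states into a fixed-size sentence vector.
        states, _ = self.encoder(self.embedding(sentence))
        return states.mean(dim=1)

    def forward(self, premise, hypothesis):
        u, v = self.encode(premise), self.encode(hypothesis)
        pair = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.entailment_head(pair), self.relatedness_head(pair).squeeze(-1)

# One joint training step: the two task losses are summed so the shared
# encoder learns from both supervision signals at once.
model = SharedEncoderMTL(vocab_size=20000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce_loss, mse_loss = nn.CrossEntropyLoss(), nn.MSELoss()

premise = torch.randint(0, 20000, (8, 12))       # toy batch of token ids
hypothesis = torch.randint(0, 20000, (8, 12))
entail_labels = torch.randint(0, 3, (8,))        # entailment / neutral / contradiction
relatedness_scores = torch.rand(8) * 4 + 1       # SICK-style scores in [1, 5]

entail_logits, relatedness_pred = model(premise, hypothesis)
loss = ce_loss(entail_logits, entail_labels) + mse_loss(relatedness_pred, relatedness_scores)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The equal weighting of the two losses is the simplest choice; weighted or uncertainty-based combinations are possible alternatives.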