Journal of Image and Graphics (中国图象图形学报), 2015

Multimedia Technology Research 2014: Deep Learning and Media Computing

DOI: 10.11834/jig.20151101

Wu Fei (吴飞), Zhu Wenwu (朱文武), Yu Junqing (于俊清)

Keywords: multimedia, massive data, retrieval and annotation, semantic understanding, deep learning

Abstract:

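Objective: The rapid growth of massive data poses profound challenges for multimedia computing. Unlike the traditional media-computing paradigm built around hand-crafted design, data-driven deep learning (feature learning) has become the mainstream approach to media computing. Method: This survey analyzes the latest progress and the open challenges of deep learning in three areas of media computing: retrieval ranking and annotation, multimodal retrieval and semantic understanding, and video analysis and understanding. It also outlines future trends. Result: In retrieval ranking and annotation, deep-learning-based methods such as neural codes have achieved very good results. In multimodal retrieval and semantic understanding, deep learning is used to bridge the "heterogeneity gap" between modalities and the "semantic gap" between low-level features and high-level semantics, and deep-learning-based compositional semantics learning has become a research focus. In video analysis and understanding, deep neural networks are used to learn effective video representations and to recognize actions, again with very good results. However, deep learning is a data-driven method that is sensitive to data noise and is not yet mature for online incremental learning; how to combine deep learning with crowdsourced computing remains a promising open question. Conclusion: Building on an in-depth analysis of existing methods, this survey offers new ideas for addressing the heterogeneity gap and the semantic gap within the deep learning framework.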
