We present a statistical framework based on Hidden Markov Models (HMMs) for skimming feature films. A chain of HMMs is used to model subsequent story units: HMM states represent different visual concepts, transitions model the temporal dependencies in each story unit, and stochastic observations are given by single shots. The skim is generated as an observation sequence in which, to favour more informative segments, shots are assigned a higher observation probability if they have salient features related to specific film genres. The effectiveness of the method is demonstrated by skimming the first thirty minutes of a wide set of action and dramatic movies, to create previews that help users assess whether they would like to see the movie, without revealing its central part or plot details. Results are evaluated and compared through extensive user tests, using metrics that estimate the content representational value of the obtained video skims and their utility for assessing the user's interest in the observed movie.

“I took a speed reading course and read ‘War and Peace’ in 20 minutes. It involves Russia.” Woody Allen.

1. Introduction

In recent years, with the proliferation of digital TV broadcasting, dedicated internet websites, and private recording of home video, a large amount of video information has been made available to end-users. Nevertheless, this massive growth in the availability of digital video has not been accompanied by a parallel increase in its accessibility. In this scenario, video summarization techniques may represent a key component of a practical video content management system. By watching a condensed video, a viewer may be able to assess the relevance of a programme before committing time, thus facilitating typical tasks such as browsing, organizing, and searching video content.
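As a rough illustration of the generative idea outlined in the abstract (states as visual concepts, shots as observations, saliency raising the chance that a shot is observed), the following minimal sketch samples a skim from a single story-unit HMM. It is not the method of the paper: the state names, shot identifiers, saliency scores, transition values, and the choice to make emission probabilities directly proportional to saliency are all assumptions made for this example, and the sketch allows a shot to be drawn more than once.

```python
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def generate_skim(initial, transitions, shots_by_state, saliency, length, seed=0):
    """Sample an observation sequence of shots from one story-unit HMM.

    Assumption for this sketch: within a state, the probability of
    observing a shot is proportional to its (hypothetical) saliency score.
    """
    rng = random.Random(seed)
    states = list(shots_by_state)
    # Draw the first visual concept from the initial distribution.
    state = rng.choices(states, weights=[initial[s] for s in states])[0]
    skim = []
    for _ in range(length):
        shots = shots_by_state[state]
        # Emission step: salient shots are more likely to enter the skim.
        probs = normalize([saliency[shot] for shot in shots])
        skim.append(rng.choices(shots, weights=probs)[0])
        # Transition step: move to the next visual concept.
        state = rng.choices(
            states, weights=[transitions[state][s] for s in states]
        )[0]
    return skim


# Hypothetical story unit with two visual concepts and five shots.
shots_by_state = {
    "dialogue": ["shot_01", "shot_02"],
    "chase": ["shot_03", "shot_04", "shot_05"],
}
saliency = {
    "shot_01": 1.0,
    "shot_02": 3.0,
    "shot_03": 2.0,
    "shot_04": 0.0,  # a shot with no salient features is never observed
    "shot_05": 2.0,
}
initial = {"dialogue": 0.5, "chase": 0.5}
transitions = {
    "dialogue": {"dialogue": 0.6, "chase": 0.4},
    "chase": {"dialogue": 0.3, "chase": 0.7},
}

skim = generate_skim(initial, transitions, shots_by_state, saliency, length=6)
print(skim)
```

In a chain of such models, one HMM per story unit would be sampled in turn and the resulting shot sequences concatenated. How the models are actually trained and linked is outside the scope of this sketch.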
For unscripted-content videos such as sports and home videos, where the events happen spontaneously and not according to a given script, previous work on video summarization has mainly focused on the extraction of highlights. Regarding scripted-content videos, that is, videos produced according to a script such as feature films (e.g., Hollywood movies), news, and cartoons, two types of video abstracts have been investigated so far: static video summarization and video skimming. The first selects a set of salient key-frames to represent the content in a compact form and presents it to the user as a static programme preview. Video skimming