While video content is often stored in rather large files or broadcast in continuous streams, users are often interested in retrieving only a particular passage on a topic of interest to them. It is, therefore, necessary to split video documents or streams into shorter segments corresponding to appropriate retrieval units. We propose here a method for the automatic segmentation of TV news videos into stories, based on multiple descriptors. The selected multimodal features are complementary and give good insight into story boundaries. Once extracted, these features are expanded with a local temporal context and combined by an early fusion process. The story boundaries are then predicted using machine learning techniques. We evaluate the system in experiments using the data and protocol of the TRECVID 2003 story boundary detection task, and we show that the proposed approach outperforms the state-of-the-art methods while requiring a very small amount of manual annotation.

1. Introduction

Progress in storage and communication technologies has made huge amounts of video content accessible to users. However, finding video content that corresponds to a particular user's need is not always easy for a variety of reasons, including poor or incomplete content indexing. Also, while video content is often stored in rather large files or broadcast in continuous streams, users are often interested in retrieving only a particular passage on a topic of interest to them. It is therefore necessary to split video documents or streams into shorter segments corresponding to appropriate retrieval units, for instance, a particular scene in a movie or a particular news story in a TV news program. These retrieval units can be defined hierarchically in order to potentially satisfy user needs at different levels of granularity.
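As an illustration of the processing chain summarized in the abstract (per-modality features, expansion with a local temporal context, then early fusion), the following is a minimal sketch and not the authors' implementation. It assumes that each modality yields one feature vector per candidate boundary point, that the temporal context is a symmetric window of neighbouring candidates, and that early fusion is plain concatenation. The function names, the `radius` parameter, the edge-padding rule, and the toy values are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def expand_with_context(features: Sequence[Vector], radius: int) -> List[List[float]]:
    """Append to each candidate's vector the vectors of its temporal neighbours.

    features[t] is the descriptor vector of candidate point t for one modality.
    Neighbours falling outside the sequence are replaced by the nearest edge
    candidate (an assumed padding rule).
    """
    n = len(features)
    expanded = []
    for t in range(n):
        row: List[float] = []
        for offset in range(-radius, radius + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp at sequence edges
            row.extend(features[idx])
        expanded.append(row)
    return expanded


def early_fusion(*modalities: Sequence[Vector]) -> List[List[float]]:
    """Concatenate the per-modality vectors of each candidate into one vector."""
    n = len(modalities[0])
    if any(len(m) != n for m in modalities):
        raise ValueError("all modalities must describe the same candidates")
    return [[v for m in modalities for v in m[t]] for t in range(n)]


# Toy data (hypothetical): three candidate points, two modalities.
visual = [[0.1], [0.9], [0.2]]                   # e.g. one visual score
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]     # e.g. two audio scores

fused = early_fusion(
    expand_with_context(visual, radius=1),
    expand_with_context(audio, radius=1),
)
```

Each row of `fused` is a single vector describing one candidate together with its neighbourhood. In a complete system, such vectors would be passed to a trained classifier that labels each candidate as a story boundary or not.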
The retrieval units are relevant not only as search result units but also as units for content-based indexing and for further increasing the effectiveness of content-based video retrieval (CBVR) systems. A video can be analyzed at different levels of granularity. For the image track, the lowest level is the individual frame, which is generally used for extracting static visual features such as color, texture, shape, or interest points. Videos can also be decomposed into shots; a shot is a basic video unit showing a sequence of frames captured by a single camera in a single continuous action in time and space. The shot, however, is not a good retrieval unit as it usually lasts only a few seconds. Higher-level techniques are