A semantic-based video scene segmentation using a deep neural network

Hyesung Ji, Danial Hooshyar, Kuekyeng Kim, Heui Seok Lim

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Video scene segmentation is an important research problem in computer vision because it enables efficient storage, indexing and retrieval of videos. Such segmentation cannot be achieved by calculating the similarity of low-level features alone; high-level features must also be considered to reach better performance. Although much research has been conducted on video scene segmentation, most studies have failed to segment a video into scenes semantically. In this study, we therefore propose a Deep-learning Semantic-based Scene-segmentation model (DeepSSS) that uses image captioning to segment a video into scenes semantically. First, DeepSSS performs shot boundary detection by comparing colour histograms and then extracts a keyframe from each shot by maximum entropy. Second, for semantic analysis, it applies deep-learning-based image captioning to generate a semantic text description of each keyframe. Finally, by comparing and analysing the generated texts, it assembles the keyframes into scenes, each grouped under a semantic narrative. In this way, DeepSSS considers both low- and high-level features of videos to achieve a more meaningful scene segmentation. Using the MS COCO data set for caption generation and evaluating the semantic scene-segmentation task on the TRECVid 2016 data sets, we demonstrate quantitatively that DeepSSS outperforms existing scene-segmentation methods based on shot boundary detection and keyframes. Moreover, we compared scenes segmented by humans with those segmented by DeepSSS; the results verify that DeepSSS's segmentation resembles human segmentation. This kind of result is enabled by semantic analysis and is impossible with low-level video features alone.
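
The abstract describes the low-level stage only at this level of detail. As a rough illustration, a minimal Python/OpenCV sketch of that stage might look as follows; the hue/saturation histogram, the correlation measure, the 0.6 boundary threshold and the grey-level entropy formula are all assumptions made for the example, not values taken from the paper.

# Sketch of the low-level stage: shot boundary detection by colour-histogram
# comparison, then one maximum-entropy keyframe per detected shot.
# HIST_BINS, BOUNDARY_THRESH and the entropy formula are illustrative
# assumptions, not values from the paper.
import cv2
import numpy as np

HIST_BINS = 32          # assumed bins per HSV channel
BOUNDARY_THRESH = 0.6   # assumed: correlation below this marks a boundary

def colour_histogram(frame):
    """Normalised 2-D hue/saturation histogram of a frame."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [HIST_BINS, HIST_BINS],
                        [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def frame_entropy(frame):
    """Shannon entropy of the grey-level distribution of a frame."""
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    counts = cv2.calcHist([grey], [0], None, [256], [0, 256]).ravel()
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def detect_shots_and_keyframes(video_path):
    """Return (boundaries, keyframes): frame indices where new shots start,
    and the maximum-entropy frame of each shot as its keyframe."""
    cap = cv2.VideoCapture(video_path)
    boundaries, keyframes = [], []
    prev_hist, best_frame, best_entropy = None, None, -1.0
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = colour_histogram(frame)
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < BOUNDARY_THRESH:         # abrupt colour change: new shot
                boundaries.append(idx)
                keyframes.append(best_frame)  # close the finished shot
                best_frame, best_entropy = None, -1.0
        entropy = frame_entropy(frame)
        if entropy > best_entropy:            # richest frame of current shot
            best_frame, best_entropy = frame, entropy
        prev_hist = hist
        idx += 1
    if best_frame is not None:                # flush the last shot
        keyframes.append(best_frame)
    cap.release()
    return boundaries, keyframes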

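For the high-level stage, the paper generates captions with a deep captioning model trained on MS COCO and assembles keyframes whose descriptions share a narrative. A sketch of that grouping step is below; caption_model stands in for any image-captioning model (the paper's own captioner is assumed, not reproduced), and the bag-of-words cosine similarity and 0.5 merge threshold are assumptions, since the abstract does not specify the text-comparison method.

# Sketch of the high-level stage: caption each keyframe, then merge
# consecutive keyframes with semantically close captions into one scene.
# The similarity measure and SCENE_THRESH are illustrative assumptions.
from collections import Counter
import math

SCENE_THRESH = 0.5  # assumed: similarity above this keeps keyframes together

def caption_similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two captions."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def group_into_scenes(keyframes, caption_model):
    """Return a list of scenes, each a list of keyframe indices, by
    comparing the captions generated for consecutive keyframes."""
    if not keyframes:
        return []
    captions = [caption_model(kf) for kf in keyframes]  # hypothetical captioner
    scenes, current = [], [0]
    for i in range(1, len(captions)):
        if caption_similarity(captions[i - 1], captions[i]) >= SCENE_THRESH:
            current.append(i)       # same narrative: extend the scene
        else:
            scenes.append(current)  # semantic break: start a new scene
            current = [i]
    scenes.append(current)
    return scenes

A real implementation would likely use a stronger sentence-similarity measure (an embedding-based one, for instance), but the scene-assembly logic, merging consecutive keyframes whose captions stay semantically close, is the part the abstract describes.
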
Original language: English
Journal: Journal of Information Science
DOI: 10.1177/0165551518819964
Publication status: Accepted/In press - 1 January 2018

Keywords

  • Deep learning
  • image captioning
  • keyframe extraction
  • shot boundary detection
  • video scene segmentation

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

Ji, Hyesung; Hooshyar, Danial; Kim, Kuekyeng; Lim, Heui Seok. A semantic-based video scene segmentation using a deep neural network. In: Journal of Information Science, 2018.

@article{9037b5d99322401e9a48f757c47e947a,
title = "A semantic-based video scene segmentation using a deep neural network",
author = "Hyesung Ji and Danial Hooshyar and Kuekyeng Kim and Heui Seok Lim",
year = "2018",
month = "1",
day = "1",
doi = "10.1177/0165551518819964",
language = "English",
journal = "Journal of Information Science",
issn = "0165-5515",
publisher = "SAGE Publications Ltd",

}
