HMTV: Hierarchical Multimodal Transformer for Video Highlight Query on Baseball
Academic year 113
Semester 1
Publication date 2024-09-23
Title HMTV: Hierarchical Multimodal Transformer for Video Highlight Query on Baseball
Title (other language)
Authors Qiaoyun Zhang; Chih-Yung Chang; Ming-Yang Su; Hsiang-Chuan Chang; Diptendu Sinha Roy
Affiliation
Publisher
Source (title, volume/issue, pages) Multimedia Systems 30(285), p. 1-18
Abstract With the increasing popularity of watching baseball videos, more and more fans want to enjoy the highlights of these videos. However, extracting highlights from lengthy baseball videos is challenging because it is time-consuming and labor-intensive. To address this challenge, this paper proposes a novel mechanism called Hierarchical Multimodal Transformer for Video query (HMTV). HMTV uses a two-phase approach: Coarse-Grained clipping of candidate videos, followed by Fine-Grained identification of highlights. In the Coarse-Grained phase, a pitching detection model extracts relevant candidate videos from baseball videos, encompassing the features of pitch deliveries and pitching. In the Fine-Grained phase, a Transformer encoder and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model capture relationship features among the frames of candidate videos and among the words of users’ questions, respectively. These relationship features are then fed into the Video Query (VideoQ) model, which is implemented with Text Video Attention (TVA). The VideoQ model identifies the start and end positions, within the candidate videos, of the highlights mentioned in the query. Simulation results demonstrate that the proposed HMTV significantly improves the accuracy of highlight identification in terms of precision, recall, and F1-score.
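Keywords Hierarchical multimodal Transformer; BERT; Highlight query
Language en
ISSN
Journal type International
Indexed in SCI; Scopus
Industry-academia collaboration
Corresponding author
Peer review system
Country USA
Open call for papers
Publication format Electronic
Related links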

Institutional repository link ( http://tkuir.lib.tku.edu.tw:8080/dspace/handle/987654321/126772 )

SDGs Decent Work and Economic Growth; Industry, Innovation and Infrastructure