2025, 11, v.56 179-188
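Cross-modal semantic retrieval of hydropower plant surveillance video based on multi-level denoising
Foundation: National Natural Science Foundation of China (61972169, 62302186); Management Science and Technology Project of State Grid Corporation of China (521531230001)
Email: shiyx@ccnu.edu.cn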
DOI: 10.13928/j.cnki.wrahe.2025.11.014
Abstract:

[Objective] To apply cross-modal retrieval to scenarios such as personnel security, facility protection, and instrument status monitoring in hydropower video surveillance systems, a multimodal data mapping between text and images is constructed to enable flexible semantic content search from textual descriptions. [Methods] A multi-level denoising multimodal fusion technique is proposed to address two problems in existing cross-modal methods: the slow inference of single-stream models and the lack of modality fusion in dual-stream models. Built on a dual-stream pre-trained model, the technique combines the ideas of masked language modeling and fine-grained cross-modal semantic alignment, and designs “add noise first, then denoise” tasks at multiple levels of the neural network to promote fine-grained interaction between images and text. [Results] Extensive experiments under different settings show that, compared with the R@1 of the fine-tuned CLIP baseline, recall on the Flickr30K dataset improves by 4.1% for image retrieval and 2.7% for text retrieval; on the MS-COCO dataset, the corresponding improvements are 4.3% and 3.2%. On a self-collected dataset of hydropower surveillance scenes, retrieval was tested for working conditions such as persons floating in the dam area, equipment operating status, and instrument anomalies, with good results. [Conclusion] The experiments verify the advantage of the multi-level denoising algorithm in cross-modal semantic retrieval tasks and demonstrate its applicability to hydropower plant surveillance video scenarios.
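The R@1 figures quoted in the abstract are recall-at-K scores. As a minimal sketch of how such a score can be computed for a dual-stream retriever, the snippet below ranks gallery items by cosine similarity to each query and counts how often the matching item appears in the top K. The function names (`cosine`, `recall_at_k`), the toy embeddings, and the one-match-per-query pairing are illustrative assumptions, not the authors' evaluation code; benchmark protocols such as Flickr30K pair each image with several captions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matching gallery item is ranked in the top k.

    Assumption: query i matches gallery item i (one ground-truth pair per query).
    """
    hits = 0
    for i, query in enumerate(query_emb):
        sims = [cosine(query, item) for item in gallery_emb]
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_emb)


# Toy vectors standing in for text-encoder and image-encoder outputs.
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_emb = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]

r1 = recall_at_k(text_emb, image_emb, k=1)  # text-to-image (image retrieval)
r3 = recall_at_k(text_emb, image_emb, k=3)
```

Swapping the two arguments gives the image-to-text direction (text retrieval), which is why the abstract reports two separate R@1 improvements per dataset.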

Basic information:

CLC number: TV736

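Citation:

[1] HU Xiaolian, TANG Jiaqing, YANG Zhi, et al. Cross-modal semantic retrieval of hydropower plant surveillance video based on multi-level denoising[J]. Water Resources and Hydropower Engineering, 2025, 56(11): 179-188. DOI: 10.13928/j.cnki.wrahe.2025.11.014.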
