Contributed by: ZHAO Shan, TIAN Kaiwen, SUN Junding | Date: 2024-09-24
ZHAO S, TIAN K W, SUN J D, et al. Rich semantic extractor network for real-time semantic segmentation[J]. Journal of Henan Polytechnic University (Natural Science), 2024, 43(6): 146-155.
Rich semantic extractor network for real-time semantic segmentation
ZHAO Shan1, TIAN Kaiwen1, SUN Junding2
1. School of Software, Henan Polytechnic University, Jiaozuo 454000, Henan, China; 2. School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454000, Henan, China
doi: 10.16186/j.cnki.1673-9787.2023030005
Funding: National Natural Science Foundation of China (62276092)
Received: 2023-03-02
Revised: 2023-05-14
Published: 2024-09-24
Abstract: Objectives Because inference-speed constraints force real-time semantic segmentation networks to be shallow, the semantic feature information they extract is insufficient. The shallow depth also restricts the capability of the feature extraction network, reducing its robustness and adaptability. Methods To address these problems, a rich semantic extractor network (RSENet) for real-time semantic segmentation was proposed. Firstly, to address insufficient semantic feature extraction, a rich semantic extractor (RSE) was introduced, consisting of a multi-scale global semantic extraction module (MGSEM) and a semantic fusion module (SFM). Secondly, MGSEM was used to extract rich multi-scale global semantics and expand the effective receptive field of the network, while SFM efficiently fused multi-scale local semantics with multi-scale global semantics, giving the network more comprehensive and richer semantic information. Finally, a space reconstruction aggregation module (SRAM) was designed according to the characteristics of the detail branch and the semantic branch; it modeled the context information of the detail features and enhanced the feature representation, so that the two branches could be aggregated efficiently. Results Comprehensive experiments were conducted on the Cityscapes and ADE20K datasets. The proposed RSENet achieved an mIoU of 75.6% at 76 frames/s on Cityscapes and an mIoU of 35.7% at 67 frames/s on ADE20K. Conclusions The experimental results showed that the proposed network could thoroughly mine and accurately capture the semantic information in images of complex scenes. It also balanced accuracy and speed well, achieving high-precision semantic segmentation at a fast inference speed. This efficient segmentation capability made the network highly practical and easy to deploy in real-world application scenarios.
Key words: semantic segmentation; multi-scale feature; vision Transformer; feature fusion
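The accuracy figures in the abstract are reported as mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal pure-Python sketch. It is not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the `ignore_index=255` default (a common convention for unlabeled pixels) are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat sequences of integer class labels,
    one per pixel. Pixels whose target equals ignore_index are
    skipped (assumed convention, not taken from the paper).
    """
    # conf[t][p] counts pixels with true class t predicted as class p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                                  # true c, predicted other
        fp = sum(conf[r][c] for r in range(num_classes)) - tp   # predicted c, true other
        union = tp + fp + fn
        if union > 0:                                           # skip classes that never occur
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, last pixel unlabeled.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. Benchmark toolkits usually accumulate the confusion matrix over the whole validation set before averaging, which this sketch would also do if given all pixels at once.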