End-to-End Video Semantic Segmentation in Adverse Weather using Fusion Blocks and Temporal-Spatial Teacher-Student Learning NeurIPS 2024
- Xin Yang NUS
- WenDing Yan Huawei International Pte Ltd
- Michael Bi Mi Huawei International Pte Ltd
- Yuan Yuan Huawei International Pte Ltd
- Robby T. Tan NUS
An illustration of optical flows generated with a pretrained FlowNet2 model. Each optical flow is estimated from the corresponding frame and its previous frame. The left two columns show frames and optical flows under ideal conditions, while the right two columns show frames and optical flows under adverse weather, with nighttime as an illustrative example. Under ideal conditions, the optical flows accurately capture vehicle details, traffic signs, and poles. Under nighttime conditions, however, the optical flows fail significantly, missing the poles in the middle and producing erroneous estimates for the bus.
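For reference, flows like those in the figure can be reproduced with any pretrained two-frame flow estimator. The minimal sketch below uses torchvision's pretrained RAFT model as a stand-in for the FlowNet2 model used in the figure; the frame filenames are placeholders.

```python
# Minimal sketch: estimate optical flow from a frame and its previous frame.
# torchvision's pretrained RAFT is used here as a stand-in for FlowNet2.
import torch
import torchvision.transforms.functional as TF
from torchvision.io import read_image
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights
from torchvision.utils import flow_to_image

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

# prev.png / curr.png are hypothetical consecutive video frames.
prev_frame = read_image("prev.png").unsqueeze(0)   # (1, 3, H, W), uint8
curr_frame = read_image("curr.png").unsqueeze(0)

# RAFT expects spatial dimensions divisible by 8.
prev_frame = TF.resize(prev_frame, size=[520, 960], antialias=False)
curr_frame = TF.resize(curr_frame, size=[520, 960], antialias=False)

prev_batch, curr_batch = preprocess(prev_frame, curr_frame)
with torch.no_grad():
    flow = model(prev_batch, curr_batch)[-1]       # final refinement iteration, (1, 2, H, W)

flow_rgb = flow_to_image(flow)                     # colour-coded flow image, as in the figure
```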
Abstract
Adverse weather conditions can significantly degrade video frames, leading to erroneous predictions by current video semantic segmentation methods. Furthermore, these methods rely on accurate optical flows, which become unreliable under adverse weather. To address this issue, we introduce the first end-to-end, optical-flow-free, domain-adaptive video semantic segmentation method. It works by enforcing the model to actively exploit temporal information from adjacent frames through a fusion block and temporal-spatial teachers. The key idea of our fusion block is to let the model merge information from consecutive frames by matching and merging relevant pixels across those frames. Our temporal-spatial teacher-student learning involves two teachers: one dedicated to exploring temporal information from adjacent frames, and the other harnessing spatial information from the current frame while assisting the temporal teacher. Finally, we apply temporal weather degradation augmentation to consecutive frames to more accurately represent adverse weather degradations. Our method achieves 25.4% and 33.0% mIoU on the adaptation from VIPER and Synthia to MVSS, respectively, an improvement of 4.3% and 5.8% mIoU over the existing state-of-the-art method.
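To make the teacher-student setup concrete, here is a schematic sketch of one target-domain training step under several assumptions: the segmentation networks take a (current, adjacent) frame pair and return per-pixel class logits, the spatial teacher is fed an upsampled crop as a stand-in for a higher-resolution crop of the original image, and the crop handling, consistency losses, and EMA rate are illustrative rather than the paper's exact implementation.

```python
# Schematic sketch of temporal-spatial teacher-student learning (names are illustrative).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Exponential-moving-average update of a teacher's weights from the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def target_step(student, temporal_teacher, spatial_teacher, optimizer,
                curr, adj, crop_box):
    """One unsupervised step on the target (adverse-weather) domain."""
    y0, y1, x0, x1 = crop_box
    curr_crop = curr[..., y0:y1, x0:x1]

    with torch.no_grad():
        # Temporal teacher: pseudo-labels from current + adjacent frame,
        # cropped to the student's region.
        temporal_pl = temporal_teacher(curr, adj).argmax(1)[..., y0:y1, x0:x1]
        # Spatial teacher: the same segment at higher resolution (upsampling is a
        # stand-in here), resized back to the student's prediction size.
        crop_hr = F.interpolate(curr_crop, scale_factor=2, mode="bilinear",
                                align_corners=False)
        spatial_pl = F.interpolate(spatial_teacher(crop_hr, crop_hr),
                                   size=curr_crop.shape[-2:], mode="bilinear",
                                   align_corners=False).argmax(1)

    # Student sees the cropped current frame plus the complete adjacent frame.
    logits = student(curr_crop, adj)

    # Consistency with both teachers' pseudo-labels (cross-entropy as a stand-in).
    loss = F.cross_entropy(logits, temporal_pl) + F.cross_entropy(logits, spatial_pl)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Both teachers track the student via EMA.
    ema_update(temporal_teacher, student)
    ema_update(spatial_teacher, student)
    return loss.item()
```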
Framework
Our network comprises two pipelines: the source and the target. (a) Target Pipeline: The upper (temporal) teacher takes both the current and adjacent frames to create temporal pseudo-labels. The student, in turn, receives a cropped segment of the current frame together with the complete adjacent frame, and a consistency loss enforces that its predictions align with the temporal teacher's pseudo-labels. The lower (spatial) teacher uses the same segment as the student, but taken from the original image at a higher resolution. Similarly, a consistency loss makes the student's predictions consistent with the spatial teacher's pseudo-labels. (b) Source Pipeline: The student model undergoes supervised learning with consecutive frames as inputs. (c) Fusion Block: This component integrates multiple offset layers, which align pixels from adjacent frames to the current frame, and convolutional layers that merge these pixels.
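One plausible reading of the fusion block in (c) is sketched below: a small convolutional head predicts per-pixel offsets from the concatenated current- and adjacent-frame features, a deformable convolution (torchvision's DeformConv2d, used here as a stand-in for the paper's offset layers) samples the adjacent features at those offsets, and convolutional layers merge the aligned features with the current-frame features. Channel counts and layer depths are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of a fusion block: offset-based alignment of adjacent-frame
# features followed by convolutional merging (not the paper's exact implementation).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FusionBlock(nn.Module):
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Predict per-pixel sampling offsets from the concatenated current and
        # adjacent features (2 * k * k values: an (x, y) shift per kernel tap).
        self.offset_head = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        # Deformable convolution aligns adjacent-frame features to the current frame.
        self.align = DeformConv2d(channels, channels, kernel_size, padding=pad)
        # Convolutional layers merge the aligned adjacent features with the
        # current-frame features.
        self.merge = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feat_curr, feat_adj):
        offsets = self.offset_head(torch.cat([feat_curr, feat_adj], dim=1))
        feat_adj_aligned = self.align(feat_adj, offsets)
        return self.merge(torch.cat([feat_curr, feat_adj_aligned], dim=1))

# Usage on backbone features of two consecutive frames (shapes are illustrative).
fusion = FusionBlock(channels=256)
fused = fusion(torch.randn(1, 256, 64, 128), torch.randn(1, 256, 64, 128))
```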
Results
The model's performance on two domain adaptation settings: VIPER to MVSS and Synthia to MVSS.