Semantic Segmentation on Raindrop Degraded Images Using Two-Stage Dual Teacher-Student Learning AAAI 2025
- Xin Yang NUS
- WenDing Yan Huawei International Pte Ltd
- Michael Bi Mi Huawei International Pte Ltd
- Yuan Yuan Huawei International Pte Ltd
- Robby T. Tan NUS
A semantic segmentation method designed for complex raindrop degradation under heavy rain conditions.
Abstract
Adverse weather conditions can significantly degrade video frames, leading to erroneous predictions by current video semantic segmentation methods. Furthermore, these methods rely on accurate optical flows, which become unreliable under adverse weather. To address these issues, we introduce the first end-to-end, optical-flow-free, domain-adaptive video semantic segmentation method. This is accomplished by compelling the model to actively exploit temporal information from adjacent frames through a fusion block and temporal-spatial teachers. The key idea of our fusion block is to give the model a way to merge information from consecutive frames by matching and merging relevant pixels across those frames. The basic idea of our temporal-spatial teachers involves two teachers: one dedicated to exploring temporal information from adjacent frames, the other to harnessing spatial information from the current frame while assisting the temporal teacher. Finally, we apply temporal weather degradation augmentation to consecutive frames to more accurately represent adverse weather degradations. Our method achieves 25.4% and 33.0% mIoU on the adaptation from VIPER and Synthia to MVSS, respectively, an improvement of 4.3% and 5.8% mIoU over the existing state-of-the-art method.
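One simple way the temporal weather degradation augmentation mentioned above could be realized is to synthesize a single rain-streak pattern and drift it across consecutive frames, so every frame in a clip sees a temporally coherent degradation rather than independent noise. The following is a minimal NumPy sketch; the streak length, drift speed, and blending weights are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def temporal_rain_augment(frames, density=0.02, streak_len=9, drift=2, seed=0):
    """Apply one shared synthetic rain-streak pattern to consecutive frames,
    drifting it horizontally per frame so the degradation is temporally coherent.

    frames: list of float images in [0, 1], shape (H, W, C).
    """
    rng = np.random.default_rng(seed)
    h, w = frames[0].shape[:2]
    # Random raindrop seeds, shared by all frames in the clip.
    base = (rng.random((h, w)) < density).astype(np.float32)
    # Smear the drops vertically to form streaks.
    streaks = np.zeros((h, w), dtype=np.float32)
    for k in range(streak_len):
        streaks += np.roll(base, k, axis=0)
    streaks = np.clip(streaks, 0.0, 1.0)
    out = []
    for t, frame in enumerate(frames):
        # Same pattern, shifted a little per frame to mimic rain motion.
        shifted = np.roll(streaks, t * drift, axis=1)[..., None]
        degraded = frame * (1.0 - 0.6 * shifted) + 0.9 * shifted
        out.append(np.clip(degraded, 0.0, 1.0))
    return out
```

Because the pattern is shared and only shifted between frames, the augmented clip preserves the temporal correlation that the fusion block and temporal teacher are meant to exploit.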
Framework
Our network comprises two pipelines: the source and the target. (a) Target Pipeline: The upper teacher (temporal) takes both the current and adjacent frames to create temporal pseudo-labels. The student, on the other hand, receives a cropped segment of the current frame and a complete adjacent frame, with a loss function enforcing that its predictions align with the temporal teacher's pseudo-labels. The lower teacher (spatial) uses the same segment as the student, but taken from the original image and at a higher resolution. Similarly, a consistency loss makes the student's predictions consistent with the spatial teacher's pseudo-labels. (b) Source Pipeline: The student model undergoes supervised learning with consecutive frames as inputs. (c) Fusion Block: This component integrates multiple offset layers, which align pixels from adjacent frames with the current frame, and convolutional layers that merge these pixels.
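The teacher-student consistency described above can be sketched in a few lines: teachers are typically exponential-moving-average (EMA) copies of the student, they emit confidence-thresholded pseudo-labels, and the student is penalized with a cross-entropy consistency loss against those labels. Here is a minimal NumPy sketch; the function names, confidence threshold, and EMA rate are our own illustrative assumptions, not the paper's code:

```python
import numpy as np

def ema_update(teacher_w, student_w, alpha=0.999):
    """EMA weight update: each teacher slowly tracks the student's weights."""
    return {k: alpha * teacher_w[k] + (1.0 - alpha) * student_w[k]
            for k in teacher_w}

def _softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_label(teacher_logits, threshold=0.9):
    """Hard pseudo-labels from a teacher; low-confidence pixels get -1 (ignore)."""
    probs = _softmax(teacher_logits)          # (N, C) per-pixel class probs
    labels = probs.argmax(axis=-1)
    labels[probs.max(axis=-1) < threshold] = -1
    return labels

def consistency_loss(student_logits, labels):
    """Cross-entropy of student predictions against teacher pseudo-labels,
    skipping ignored (-1) pixels."""
    probs = _softmax(student_logits)
    mask = labels >= 0
    if not mask.any():
        return 0.0
    picked = probs[mask, labels[mask]]        # student prob of the pseudo-class
    return float(-np.log(picked + 1e-8).mean())
```

In the target pipeline, the same pattern would be applied twice: once against the temporal teacher's pseudo-labels (current plus adjacent frames) and once against the spatial teacher's (the higher-resolution crop), with both teachers updated by `ema_update` after each student step.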
Results
The model's performance on two domain adaptation settings: VIPER to MVSS and Synthia to MVSS.