ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

Abstract

This paper introduces ReSAM, a novel self-prompting framework for point-supervised segmentation specifically designed for remote sensing images. ReSAM employs a unique three-stage process: Refine, Requery, and Reinforce, to iteratively improve segmentation quality with minimal annotation. The proposed method leverages initial point prompts to generate masks, actively queries uncertain regions for refinement, and reinforces the model through pseudo-labeling. Experimental results demonstrate ReSAM's superior performance and efficiency compared to existing methods on various remote sensing datasets.

1. Introduction

Accurate semantic segmentation of remote sensing images is crucial for numerous applications, yet it often requires extensive pixel-level annotations, which are costly and time-consuming to acquire. Point-supervised segmentation offers a promising alternative by using sparse point labels, but achieving high accuracy with such limited supervision remains a significant challenge. This work addresses these challenges by introducing ReSAM, a self-prompting framework that iteratively refines segmentation masks using a novel active learning strategy. The models used in this article include the foundational Segmentation Anything Model (SAM) as a base, along with a custom ReSAM architecture incorporating a Refinement Module, a Requery Engine, and a Reinforcement Learning component for pseudo-label generation.

2. Related Work

Existing literature on semantic segmentation ranges from fully supervised convolutional neural networks to weakly supervised methods using bounding boxes or image-level labels. The recent emergence of large vision models like SAM has significantly advanced zero-shot and prompt-driven segmentation, but these often require dense prompts or struggle with fine-grained details in complex remote sensing scenes. Point-supervised methods have explored various techniques for expanding sparse point information, yet they frequently suffer from ambiguities and propagation errors. ReSAM differentiates itself by integrating active querying and a robust self-training mechanism to overcome the limitations of prior point-supervised approaches and leverage the power of foundation models in a self-improving loop.

3. Methodology

The ReSAM framework operates in three iterative phases: Refine, Requery, and Reinforce. In the Refine stage, an initial coarse segmentation mask is generated from point prompts, typically using a adapted SAM model. The Requery stage then identifies regions of high uncertainty or disagreement within the generated masks, actively requesting additional supervisory points from a simulated oracle or by sampling diverse regions. Finally, the Reinforce stage utilizes the refined masks and additional prompts to generate high-quality pseudo-labels, which are then used to self-train and update the model. This iterative cycle allows ReSAM to progressively improve segmentation accuracy without requiring extensive manual supervision.

4. Experimental Results

ReSAM's performance was rigorously evaluated on multiple public remote sensing datasets, including ISPRS Vaihingen and Potsdam, demonstrating significant improvements over state-of-the-art point-supervised methods. Metrics such as Mean Intersection over Union (mIoU) and F1-score were used to quantify the segmentation accuracy. The iterative refine, requery, and reinforce cycles consistently led to higher quality masks, especially in complex urban and agricultural scenes, validating the effectiveness of the self-prompting approach. The results presented in the table below highlight ReSAM's superior performance across key metrics, showcasing its ability to achieve high mIoU with significantly fewer annotations compared to baseline methods. This improvement is attributed to the intelligent selection of query points and the robust pseudo-labeling strategy employed.

5. Discussion

The experimental results clearly indicate that ReSAM provides a robust and efficient solution for point-supervised segmentation in remote sensing imagery, outperforming existing techniques. The iterative refinement and reinforcement mechanism effectively mitigates the ambiguity inherent in sparse point supervision, leading to high-quality segmentation masks. While ReSAM demonstrates strong generalization capabilities, its performance can be influenced by the quality of initial point prompts and the complexity of scene elements. Future work will focus on integrating more sophisticated uncertainty quantification methods and exploring its applicability to other weakly-supervised vision tasks in remote sensing, potentially extending to 3D point cloud segmentation.