Background Removal: An Ill-Posed Problem 🔗
In the world of computer vision, background removal is considered a foundational task. However, what seems like a straightforward process quickly reveals itself to be an ill-posed problem.
The distinction between what constitutes a “salient object” and what is considered “background” is subjective and varies greatly from one viewer to another.
In any given image, multiple objects may coexist, each with varying degrees of saliency. The background is not always clearly defined, and identifying it involves determining the boundaries of objects—some highly salient, others more subtle.
At Finegrain, we faced this challenge head-on. Our customers wanted professional-level segmentation with no ambiguity. Unfortunately, common off-the-shelf solutions fell short of these needs:
- Background Removal Solutions: While effective in simpler scenarios, they often fail when multiple objects appear in an image.
- Segment Anything Model (SAM/SAM 2): Although SAM [1] and SAM 2 [2] are promptable and can segment multiple objects, they frequently struggle to produce fine-grained masks, particularly for thin, skeleton-like objects.
In many cases, segmenters are used as a first step in tasks like creating shadows or blending objects into a scene. When both multiple objects and fine-grained masks are required within the same image, no workable off-the-shelf solution was available.
We Built Our Own Box-Promptable Segmenter 🔗
Determined to overcome these limitations, we trained our own segmentation model.
To address the inherent ambiguity of background removal, we designed our system to allow users to provide a prompt—either a text prompt or a bounding box. Foundation models like GroundingDINO [3] or VLMs help transform text into a box, so our model only needs to take that box as input.
We used a naive “prompting by cropping” approach: simply cropping the input image around the region of interest. This “box-guided segmentation” is built on top of the MVANet [4] architecture, which can produce high-resolution (1024×1024) and fine-grained segmentation masks.
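To make the idea concrete, here is a minimal sketch of “prompting by cropping”, assuming a model callable that maps a 1024×1024 crop to a mask. The function and variable names are illustrative, not our actual pipeline.

```python
from PIL import Image

def segment_with_box(image: Image.Image, box: tuple[int, int, int, int], model) -> Image.Image:
    """Crop to the prompt box, segment the crop, and paste the mask back at full size."""
    left, top, right, bottom = box
    crop = image.crop(box)
    # MVANet-style models expect a fixed, high-resolution square input.
    pred = model(crop.resize((1024, 1024)))      # hypothetical: returns a 1024x1024 grayscale mask
    mask_crop = pred.resize(crop.size)
    full_mask = Image.new("L", image.size, 0)    # empty canvas at the original resolution
    full_mask.paste(mask_crop, (left, top))
    return full_mask
```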
The community response was tremendous. We received supportive messages from AI enthusiasts around the globe—from Japan [5] to Pakistan [6]. The segmenter racked up thousands of daily uses and more than 450 likes on our Hugging Face Space [7].
But as more users adopted our solution, new challenges arose.
The Limitation of “Prompting by Cropping” 🔗
Feedback started rolling in via GitHub discussions [8], direct messages, and our Finegrain Object Eraser Space [9]. The main complaint? Objects touching the image’s edges—particularly in portraits or face shots—weren’t being removed properly by the eraser tool.
| Input Image | Result |
|---|---|
| ![]() | ![]() |
The root cause lay in how we repurposed MVANet [4]. We retrained it to handle cropped objects as input using a very specialized dataset [10] and augmentation strategy. While this worked well in most cases, it inherently failed for objects along the image edges.
Adding Channels for Native Prompting 🔗
This feedback spurred us to rethink our approach. We gave ourselves one month to address the issue.
Relying on naive “prompting by cropping” was no longer sufficient. We needed a more robust way to feed prompt information into the model, especially for edge-related scenarios.
We settled on a simpler yet more effective solution: adding channels to the input image. Specifically, we placed box information in one additional channel and a low-resolution mask prompt in another—side by side with the RGB channels. This gives the model a 5-channel input (RGB + box + mask).
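As a rough sketch, assuming PyTorch tensors and a simple binary encoding of the box (an assumption, not our exact implementation), the extra channels can be assembled like this:

```python
import torch

def build_5ch_input(rgb: torch.Tensor,
                    box: tuple[int, int, int, int] | None,
                    mask_prompt: torch.Tensor | None) -> torch.Tensor:
    """rgb: (3, H, W) in [0, 1]; mask_prompt: (H, W) coarse mask already resized to (H, W)."""
    _, h, w = rgb.shape
    box_channel = torch.zeros(1, h, w)
    if box is not None:
        left, top, right, bottom = box
        box_channel[:, top:bottom, left:right] = 1.0   # box rendered as a filled rectangle
    mask_channel = torch.zeros(1, h, w)
    if mask_prompt is not None:
        mask_channel[0] = mask_prompt                  # low-resolution mask as soft guidance
    return torch.cat([rgb, box_channel, mask_channel], dim=0)  # (5, H, W)
```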
Soon after implementing this, the model’s performance took an unexpected turn. It seemed to lose the ability to retain its previously learned knowledge.
We spent days verifying the entire pipeline—code, data augmentations, and training—line by line. Everything appeared correct. Only after adding custom metrics to track the model’s “world knowledge” during training did we confirm our suspicions: the additional channels caused the model to depend too heavily on the prompts, effectively bypassing its own feature extraction. This was a textbook case of catastrophic forgetting.
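For illustration only, a “world knowledge” probe can be as simple as measuring mask quality on a held-out set with both prompt channels zeroed out. The sketch below uses assumed names and an assumed IoU metric, not our actual training metrics.

```python
import torch

@torch.no_grad()
def no_prompt_iou(model, loader, device="cuda") -> float:
    """Mean IoU on a validation set when the box and mask prompt channels are all zeros."""
    model.eval()
    scores = []
    for rgb, target in loader:                    # rgb: (B, 3, H, W), target: (B, 1, H, W) in {0, 1}
        b, _, h, w = rgb.shape
        blank_prompts = torch.zeros(b, 2, h, w)   # no box, no mask prompt
        x = torch.cat([rgb, blank_prompts], dim=1).to(device)
        pred = (model(x).sigmoid() > 0.5).float().cpu()
        inter = (pred * target).sum(dim=(1, 2, 3))
        union = ((pred + target) > 0).float().sum(dim=(1, 2, 3))
        scores.append((inter / union.clamp(min=1)).mean().item())
    return sum(scores) / len(scores)
```

If a score like this collapses while the prompted metrics stay high, the model is leaning on the prompts instead of its own features.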
The Winning Formula: Four Modes of Operation 🔗
After resolving the catastrophic forgetting problem, we arrived at a four-mode strategy that made the model far more versatile and robust:
| Mode | Usage | Box Prompt | Mask Prompt |
|---|---|---|---|
| No Prompt | Background Removal | No | No |
| Mask Prompt | Point Prompting with SAM 2 [2] | No | Yes |
| Box Prompt | Box-Promptable Segmenter (no SAM 2) | Yes | No |
| Box + Mask | Box + SAM 2 Promptable Segmenter | Yes | Yes |
This approach not only expanded segmentation flexibility but also mitigated catastrophic forgetting by forcing the model to maintain its own reasoning in addition to using the prompts.
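In training terms, the four modes can be realized by randomly blanking the prompt channels per example, so the model cannot shortcut through the prompts. The sampling below is an illustrative sketch with made-up (uniform) mode probabilities.

```python
import random
import torch

def sample_prompt_mode(box_channel: torch.Tensor, mask_channel: torch.Tensor):
    """Keep or blank each prompt channel according to a randomly drawn mode."""
    mode = random.choice(["no_prompt", "mask_only", "box_only", "box_and_mask"])
    if mode in ("no_prompt", "mask_only"):
        box_channel = torch.zeros_like(box_channel)    # hide the box prompt
    if mode in ("no_prompt", "box_only"):
        mask_channel = torch.zeros_like(mask_channel)  # hide the mask prompt
    return box_channel, mask_channel, mode
```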
New Prompting is Now Live 🔗
In early December 2024, we deployed the enhanced version of the segmenter featuring these improved prompting capabilities.
| Input Image | Before the Release | After the Release |
|---|---|---|
| ![]() | ![]() | ![]() |
Since this update, we’ve received no further complaints about objects touching the image edges. The model now handles a wide range of real-world scenarios more smoothly and accurately, making background removal and object segmentation more reliable than ever.
We finished this project in five weeks—slightly more than the initial one-month target, but close enough to celebrate as a win.
Pierre from the Finegrain Team