Background Removal: An Ill-Posed Problem 🔗
In the world of computer vision, background removal is considered a foundational task. However, what seems like a straightforward process quickly reveals itself to be an ill-posed problem.
The distinction between what constitutes a “salient object” and what is considered “background” is subjective and varies greatly from one viewer to another.
In any given image, multiple objects may coexist, each with varying degrees of saliency. The background is not always clearly defined, and identifying it involves determining the boundaries of objects—some highly salient, others more subtle.
At Finegrain, we faced this challenge head-on. Our customers wanted professional-level segmentation with no ambiguity. Unfortunately, common off-the-shelf solutions fell short of these needs:
- Background Removal Solutions: While effective in simpler scenarios, they often fail when multiple objects appear in an image.
- Segment Anything Model (SAM/SAM 2): Although SAM [1] and SAM 2 [2] are promptable and can segment multiple objects, they frequently struggle to produce fine-grained masks, particularly for thin, skeleton-like objects.
In many cases, segmenters are used as a first step in tasks like creating shadows or blending objects into a scene. When both multiple objects and fine-grained masks are required within the same image, no workable off-the-shelf solution was available.
We Built Our Own Box-Promptable Segmenter 🔗
Determined to overcome these limitations, we trained our own segmentation model.
To address the inherent ambiguity of background removal, we designed our system to allow users to provide a prompt—either a text prompt or a bounding box. Foundation models like GroundingDINO [3] or VLMs help transform text into a box, so our model only needs to take that box as input.
We used a naive “prompting by cropping” approach: simply cropping the input image around the region of interest. This “box-guided segmentation” is built on top of the MVANet [4] architecture, which can produce high-resolution (1024×1024) and fine-grained segmentation masks.
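To make the idea concrete, here is a minimal sketch of “prompting by cropping”, assuming a model callable that maps a 1024×1024 crop to a mask. The function and variable names are illustrative, not our actual pipeline.

```python
from PIL import Image

def segment_with_box(image: Image.Image, box: tuple[int, int, int, int], model) -> Image.Image:
    """Crop to the prompt box, segment the crop, and paste the mask back at full size."""
    left, top, right, bottom = box
    crop = image.crop(box)
    # MVANet-style models expect a fixed, high-resolution square input.
    pred = model(crop.resize((1024, 1024)))      # hypothetical: returns a 1024x1024 grayscale mask
    mask_crop = pred.resize(crop.size)
    full_mask = Image.new("L", image.size, 0)    # empty canvas at the original resolution
    full_mask.paste(mask_crop, (left, top))
    return full_mask
```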
The community response was tremendous. We received supportive messages from AI enthusiasts around the globe—from Japan [5] to Pakistan [6]. The segmenter racked up thousands of daily uses and more than 450 likes on our Hugging Face Space [7].
But as more users adopted our solution, new challenges arose.
The Limitation of “Prompting by Cropping” 🔗
Feedback started rolling in via GitHub discussions [8], direct messages, and our Finegrain Object Eraser Space [9]. The main complaint? Objects touching the image’s edges—particularly in portraits or face shots—weren’t being removed properly by the eraser tool.
| Input Image | Result |
|---|---|
| ![]() | ![]() |
The root cause lay in how we repurposed MVANet [4]. We retrained it to handle cropped objects as input using a very specialized dataset [10] and augmentation strategy. While this worked well in most cases, it inherently failed for objects along the image edges.
Adding Channels for Native Prompting 🔗
This feedback spurred us to rethink our approach. We gave ourselves one month to address the issue.
Relying on naive “prompting by cropping” was no longer sufficient. We needed a more robust way to feed prompt information into the model, especially for edge-related scenarios.
We settled on a simpler yet more effective solution: adding channels to the input image. Specifically, we placed box information in one additional channel and a low-resolution mask prompt in another—side by side with the RGB channels. This gives the model a 5-channel input (RGB + box + mask).
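As a rough sketch, assuming PyTorch tensors and a simple binary encoding of the box (an assumption, not our exact implementation), the extra channels can be assembled like this:

```python
import torch

def build_5ch_input(rgb: torch.Tensor,
                    box: tuple[int, int, int, int] | None,
                    mask_prompt: torch.Tensor | None) -> torch.Tensor:
    """rgb: (3, H, W) in [0, 1]; mask_prompt: (H, W) coarse mask already resized to (H, W)."""
    _, h, w = rgb.shape
    box_channel = torch.zeros(1, h, w)
    if box is not None:
        left, top, right, bottom = box
        box_channel[:, top:bottom, left:right] = 1.0   # box rendered as a filled rectangle
    mask_channel = torch.zeros(1, h, w)
    if mask_prompt is not None:
        mask_channel[0] = mask_prompt                  # low-resolution mask as soft guidance
    return torch.cat([rgb, box_channel, mask_channel], dim=0)  # (5, H, W)
```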
Soon after implementing this, the model’s performance took an unexpected turn. It seemed to lose the ability to retain its previously learned knowledge.
We spent days verifying the entire pipeline—code, data augmentations, and training—line by line. Everything appeared correct. Only after adding custom metrics to track the model’s “world knowledge” during training did we confirm our suspicions: the additional channels caused the model to depend too heavily on the prompts, effectively bypassing its own feature extraction. This was a textbook case of catastrophic forgetting.
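For illustration only, a “world knowledge” probe can be as simple as measuring mask quality on a held-out set with both prompt channels zeroed out. The sketch below uses assumed names and an assumed IoU metric, not our actual training metrics.

```python
import torch

@torch.no_grad()
def no_prompt_iou(model, loader, device="cuda") -> float:
    """Mean IoU on a validation set when the box and mask prompt channels are all zeros."""
    model.eval()
    scores = []
    for rgb, target in loader:                    # rgb: (B, 3, H, W), target: (B, 1, H, W) in {0, 1}
        b, _, h, w = rgb.shape
        blank_prompts = torch.zeros(b, 2, h, w)   # no box, no mask prompt
        x = torch.cat([rgb, blank_prompts], dim=1).to(device)
        pred = (model(x).sigmoid() > 0.5).float().cpu()
        inter = (pred * target).sum(dim=(1, 2, 3))
        union = ((pred + target) > 0).float().sum(dim=(1, 2, 3))
        scores.append((inter / union.clamp(min=1)).mean().item())
    return sum(scores) / len(scores)
```

If a score like this collapses while the prompted metrics stay high, the model is leaning on the prompts instead of its own features.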
The Winning Formula: Four Modes of Operation 🔗
After resolving the catastrophic forgetting problem, we arrived at a four-mode strategy that made the model far more versatile and robust:
| Mode | Usage | Box Prompt | Mask Prompt |
|---|---|---|---|
| No Prompt | Background Removal | No | No |
| Mask Prompt | Point Prompting with SAM 2 [2] | No | Yes |
| Box Prompt | Box-Promptable Segmenter (no SAM 2) | Yes | No |
| Box + Mask | Box + SAM 2 Promptable Segmenter | Yes | Yes |
This approach not only expanded segmentation flexibility but also mitigated catastrophic forgetting by forcing the model to maintain its own reasoning in addition to using the prompts.
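In training terms, the four modes can be realized by randomly blanking the prompt channels per example, so the model cannot shortcut through the prompts. The sampling below is an illustrative sketch with made-up (uniform) mode probabilities.

```python
import random
import torch

def sample_prompt_mode(box_channel: torch.Tensor, mask_channel: torch.Tensor):
    """Keep or blank each prompt channel according to a randomly drawn mode."""
    mode = random.choice(["no_prompt", "mask_only", "box_only", "box_and_mask"])
    if mode in ("no_prompt", "mask_only"):
        box_channel = torch.zeros_like(box_channel)    # hide the box prompt
    if mode in ("no_prompt", "box_only"):
        mask_channel = torch.zeros_like(mask_channel)  # hide the mask prompt
    return box_channel, mask_channel, mode
```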
New Prompting is Now Live 🔗
In early December 2024, we deployed the enhanced version of the segmenter featuring these improved prompting capabilities.
| Input Image | Before the Release | After the Release |
|---|---|---|
| ![]() | ![]() | ![]() |
Since this update, we’ve received no further complaints about objects touching the image edges. The model now handles a wide range of real-world scenarios more smoothly and accurately, making background removal and object segmentation more reliable than ever.
We finished this project in five weeks—slightly more than the initial one-month target, but close enough to celebrate as a win.
Pierre from the Finegrain Team