Since FLUX Kontext shipped a couple of months ago, the community has been exploring ways to make it support multiple reference images natively - mostly with mixed success.
We decided to give it a try too. Using EditNet, our high-quality image editing dataset, we trained a product placement LoRA on top of FLUX Kontext - and we’re open-sourcing both the experimental model weights and the HF Space.
In the write-up below, we share our learnings, along with training details for those curious about the setup.
## Data
### Minimal Requirements
For this new task, we need training samples made of:
- Reference image: a product shot of the item to be inserted.
- Before image: a scene into which the product will be placed.
- After image (ground truth): the same scene with the product included, often under different lighting or perspective.
Each sample also comes with annotations (see the sketch after this list):
- Bounding box: used for product placement (see below).
- Segmentation mask: used for object erasing (see below).
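Put together, a training sample looks roughly like this (a minimal sketch; the field names are illustrative rather than our actual schema):

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class PlacementSample:
    """One product placement training sample (illustrative field names)."""
    reference: Image.Image           # product shot of the item to insert
    before: Image.Image              # scene without the product
    after: Image.Image               # ground truth: scene with the product placed
    bbox: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) placement box
    mask: Image.Image                # segmentation mask used for object erasing
```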
Collecting thousands of such examples in the wild isn’t scalable. Instead, we can leverage our own high-quality object erasing to generate synthetic “before” images by removing the target product from the scene.
That’s where EditNet comes in: our large, FiftyOne-backed dataset of millions of annotated images across broad product categories. It lets us filter and export just the right dataset for a given task.
We applied the following initial filtering rules:
- Select products with at least two shots, including one in a real-world setup (not just on a plain background).
- Discard samples with imprecise or too-small bounding boxes.
### Additional annotations
By inspecting validation samples and high-loss outliers from our first training runs, we found that more annotations were needed to boost quality:
- Occlusion: discard reference images where the object is not fully visible, e.g. partially hidden behind another object.
- Viewpoint: discard pairs with extreme viewpoint mismatch (e.g., front-facing reference vs. back-facing after).
- Same state: discard pairs where product states differ (lamp on vs. off, empty vs. full glass, etc.).
We expanded the filtering rules accordingly and ended up exporting roughly 40k training samples, along with a carefully crafted validation split for checkpoint selection.
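As an illustration, such a filter-and-export step with FiftyOne looks roughly like this (the dataset name and field names below are hypothetical, not the actual EditNet schema):

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Hypothetical field names -- the real EditNet schema differs.
dataset = fo.load_dataset("editnet")

view = (
    dataset
    .match(F("num_shots") >= 2)           # at least two shots per product
    .match(F("has_real_world_shot"))      # one shot in a real-world setup
    .match(F("bbox_rel_area") >= 0.05)    # discard imprecise or too-small boxes
    .match(~F("reference_occluded"))      # occlusion filter
    .match(~F("viewpoint_mismatch"))      # extreme viewpoint mismatch filter
    .match(F("same_state"))               # lamp on/off, empty/full glass, etc.
)

# Export the filtered samples for training
view.export(
    export_dir="/data/product_placement_train",
    dataset_type=fo.types.FiftyOneDataset,
)
```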
## Formulation
### Multi-image input
FLUX Kontext was designed with extensibility[^1] in mind. However, as of today it officially supports only a single image input. Multi-image support is listed as future work[^2].
Following the paper’s hints, we added support for an extra reference image through sequence concatenation by:
- Appending the reference latent tokens at the end of the sequence,
- Crafting proper position indices for the reference image (3D RoPE), i.e.:
  - Using 2 as the virtual time step (0 is for the noise, 1 for the scene image).
  - Adding an offset[^3] on the X-axis, since the reference image is not spatially aligned with the scene.
(see the corresponding utilities for more details)
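Schematically, assuming packed latent grids with illustrative sizes (this is not our actual utility code, which also handles latent packing and batching):

```python
import torch

def make_latent_ids(height: int, width: int, time_index: int, x_offset: int = 0) -> torch.Tensor:
    """Build (height * width, 3) position ids [time, y, x] for one latent grid."""
    ids = torch.zeros(height, width, 3)
    ids[..., 0] = time_index                                        # virtual time step
    ids[..., 1] = torch.arange(height)[:, None].float()             # y coordinates
    ids[..., 2] = torch.arange(width)[None, :].float() + x_offset   # x coordinates (+ offset)
    return ids.reshape(height * width, 3)

# 0 -> noisy target latents, 1 -> scene image latents, 2 -> reference image latents
noise_ids = make_latent_ids(64, 64, time_index=0)
scene_ids = make_latent_ids(64, 64, time_index=1)
ref_ids = make_latent_ids(32, 32, time_index=2, x_offset=64)  # shifted on the X-axis

# The reference latent tokens are appended at the end of the sequence,
# so their position ids are concatenated in the same order.
position_ids = torch.cat([noise_ids, scene_ids, ref_ids], dim=0)
```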
### Product Placement
#### 1st Attempt: Text Prompt
In our first attempts, we tried relying on text prompts to specify the location of the inserted object, e.g.:

```
Add the coffee cup next to the fork on the table
```
We used `o4-mini` via the OpenAI API with `"high"` reasoning effort to automatically annotate our images. The system prompt looked like this (simplified):
```
You will be given two images:
- Image A: the original
- Image B: the same image with one object - {category} - added
  and highlighted in a red bounding box

Your task is to describe where the new object was added using a
single, natural sentence that sounds like a human gave the instruction.

[... some guidelines and examples omitted ...]
```
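The call itself looked roughly like this (a simplified sketch; the helper names are ours, and the real system prompt includes the omitted guidelines):

```python
import base64
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "..."  # the (simplified) system prompt shown above, with a {category} placeholder

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def describe_placement(before_path: str, after_boxed_path: str, category: str) -> str:
    """Ask o4-mini for a single, natural placement instruction."""
    response = client.chat.completions.create(
        model="o4-mini",
        reasoning_effort="high",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(category=category)},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Image A, then Image B:"},
                    {"type": "image_url", "image_url": {"url": to_data_url(before_path)}},
                    {"type": "image_url", "image_url": {"url": to_data_url(after_boxed_path)}},
                ],
            },
        ],
    )
    return response.choices[0].message.content
```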
After a manual rating pass, we found that roughly 15% of the inferred prompts were wrong, imprecise, or overly complex. Training experiments also revealed that fine-grained control over position and size was hard to achieve with text prompts alone.
#### 2nd Attempt: Bounding Box
So we switched to another approach: leveraging FLUX Kontext’s visual cue capability[^4]. With just a bounding box, we gained direct control over placement and scale.
This setup worked consistently and produced predictable placements.
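Concretely, the cue can be as simple as drawing the target box onto the scene image before feeding it to the model (an illustrative sketch; our exact rendering may differ):

```python
from PIL import Image, ImageDraw

def draw_placement_box(scene: Image.Image, bbox: tuple[int, int, int, int]) -> Image.Image:
    """Return a copy of the scene with the target bounding box drawn as a visual cue."""
    cued = scene.copy()
    ImageDraw.Draw(cued).rectangle(bbox, outline="red", width=4)  # (x_min, y_min, x_max, y_max)
    return cued

# Example: ask for the product in a 256x256 region near the bottom-right of the scene
scene = Image.open("scene_before.jpg")
cued_scene = draw_placement_box(scene, (640, 700, 896, 956))
```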
## Training
### LoRA vs. Full Fine-Tuning
Supervised fine-tuning in LoRA mode is by far the most popular and convenient option. Still, we wanted to see how it compared to full fine-tuning of the transformer weights (with T5, CLIP, and the auto-encoder kept frozen).
We took our best LoRA so far (see setup below) and ran a side-by-side comparison (see slides 27-31 here). To our surprise, the rank-16 LoRA turned out to be superior, especially in terms of subject preservation.
We were also curious about the impact of going even smaller, so we tried a rank-8 LoRA. It matched the quality of the rank-16 LoRA on our eval set, and was arguably even slightly better!
### Final Setup
Here are a few more details about this LoRA training (a configuration sketch follows the list):
- Hardware: NVIDIA H100 x 8
- Software: custom diffusers-based training loop leveraging DeepSpeed and gradient checkpointing (aka activation checkpointing) to reduce the memory pressure due to longer sequences (extra reference tokens)
- Hyperparameters:
  - LoRA rank: 8 with target modules as in ostris’ ai-toolkit
  - Learning rate: 1e-4 with linear warmup
  - Effective batch size: 256
  - Precision: `bfloat16`
  - Optimizer: AdamW
  - ZeRO stage: 2 (i.e. optimizer + gradient state partitioning)
  - Multiple resolutions: `[256, 384, 512, 768, 1024]`
  - Resolution-dependent timestep shifting: enabled (see the FLUX Kontext paper, Appendix A.2)
- Checkpoint selection: using validation loss (as described in the Stable Diffusion 3 paper - “a strong predictor of performance”) + pairwise image similarity metrics (result vs. ground truth) such as DISTS or PieAPP
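For reference, here is a rough sketch of how this setup maps onto code (the target modules, warmup/step counts, and micro-batch values are illustrative rather than our exact configuration):

```python
import torch
from diffusers import FluxTransformer2DModel
from diffusers.optimization import get_scheduler
from peft import LoraConfig

# Load the Kontext transformer and attach a rank-8 LoRA (T5, CLIP and the auto-encoder stay frozen).
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # illustrative; we follow ai-toolkit's list
)
transformer.add_adapter(lora_config)
transformer.enable_gradient_checkpointing()  # aka activation checkpointing

trainable_params = [p for p in transformer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
lr_scheduler = get_scheduler(
    "constant_with_warmup", optimizer, num_warmup_steps=200, num_training_steps=10_000
)  # linear warmup, then constant learning rate

# DeepSpeed ZeRO stage 2: optimizer + gradient state partitioning, bf16,
# effective batch size 256 on 8 GPUs (8 micro-batch x 4 accumulation x 8 GPUs).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
}
```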
## In a Nutshell
It’s not perfect, but it’s playable. We had fun running this sprint - now it’s your turn to try it out and see where it shines (or stumbles).
[^1]: The diffusion transformer “[…] readily extends to multiple images y1, y2, …, yN […]”. See paper, section 3.
[^2]: “Future work should focus on extending to multiple image inputs […]”. See paper, section 5.
[^3]: See also OminiControl, UNO or omini-kontext, which also apply shifted positions.
[^4]: “FLUX.1 Kontext is able to leverage visual cues like bounding boxes […] to guide targeted modifications”. See paper, sections 4.3 and 4.4.