
ColorMNetV2 Project
#11
Hi Dan, thanks for your reply. I trained the model with XMem2 using train-xmem2.py (I didn't use the original files because they needed to be modified for compatibility between DINOv3 + ResNet and XMem2). It took me about 48 hours in FP16 AMP on my RTX 3090.

I modified dataset/vos_dataset.py so it can read all types of datasets (DAVIS/REDS/16mm).
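
To make the idea concrete, here is a minimal sketch of a multi-layout frame lister. The layout subfolders and the `list_videos` helper are assumptions for illustration, not the actual vos_dataset.py code:

```python
import os

# Hypothetical sketch: one reader that handles several dataset layouts
# (DAVIS-style, REDS-style, flat "16mm" scan folders). The subfolder
# names below are assumptions, not values from the real script.
LAYOUTS = {
    "davis": "JPEGImages/480p",   # root/JPEGImages/480p/<video>/<frame>.jpg
    "reds":  "train_sharp",       # root/train_sharp/<clip>/<frame>.png
    "16mm":  "",                  # root/<reel>/<frame>.png
}

def list_videos(root, layout):
    """Return {video_name: sorted frame paths} for the given layout."""
    sub = LAYOUTS[layout]
    base = os.path.join(root, sub) if sub else root
    videos = {}
    for name in sorted(os.listdir(base)):
        vdir = os.path.join(base, name)
        if not os.path.isdir(vdir):
            continue
        frames = sorted(
            os.path.join(vdir, f)
            for f in os.listdir(vdir)
            if f.lower().endswith((".jpg", ".png"))
        )
        if frames:
            videos[name] = frames
    return videos
```

The training loop then only ever sees a uniform `{video: frames}` mapping, regardless of which dataset the clip came from.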

In the script I shared, everything is set up to run a training session, except that we need to find a way to stabilize the model, if we understand XMem2 correctly. (Just so you understand, I used Gemini for the XMem2 code; I should have used the GitHub repo directly from the start.)

Even though I haven’t been able to stabilize it yet, I’m really surprised by the accuracy of the integration of the reference images into a full video! Even with Deep Exemplar-Based and Colormnet 2023, I couldn’t get this result!

Best

NASS
#12
Hi NASS,

  you are doing two big changes at once. Given the observed problem, it is better to perform the changes in two steps:

    1) move from DinoV2 to DinoV3
    2) move from XMem to XMem2

  The two steps can be inverted, but the important thing is that before moving to the next step you already get colored images with no artifacts.

Sorry for this, but even for me it is better to start from scratch than to try to find a problem that could be anywhere.

If you start from working code and then apply one change at a time, it is easier to find where a bug was introduced.

Dan
#13
Hi Dan, you're absolutely right—that's the right approach: start from scratch! To avoid any issues, here's a technical report on my previous attempt:

Technical Report on the Migration and Optimization of the ColorMNet Pipeline to DINOv3 Base (the major update to DINOv2) and XMem2 (the major update to XMem)
1. Background and Initial Objective

The project is based on the ColorMNet architecture (ECCV 2024), which was originally designed with a DINOv2 Small backbone (ViT-S/14). The objective was to transform this prototype into a high-fidelity film colorization tool.
2. Technical Conversion Process (Backbone & Latent Space)

The first phase involved replacing the model’s “eyes”:

    Backbone: Switch to DINOv3 Base (ViT-B/16). This change involved a radical modification of the dimensionality. By concatenating the last 4 hidden layers (hidden states), we went from a 1536-channel vector (4x384) to a 3072-channel vector (4x768).

    Semantic Compression: To optimize computation on the 3090, a projection layer (conv_proj) was implemented to compress these 3072 channels into a working space of 1024 channels (ValueDim) and 128 channels (KeyDim).

    Spatial Alignment: DINOv3 (patch 16) has been synchronized with the ResNet-50 branch (stride 16) to ensure exact pixel-to-pixel correspondence in the PVGFE fusion module.
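
The channel arithmetic above can be sketched numerically. This is an illustration of the dimensionality change only, not the actual model code; the random matrix stands in for the learned conv_proj layer:

```python
import numpy as np

# Sketch of the backbone change described above: concatenating the last 4
# DINOv3-Base hidden states (768 channels each) gives 3072 channels, which
# a learned 1x1 projection then compresses to the 1024-channel value space.
# The feature map size assumes a 480x480 input with patch 16.
H, W = 30, 30
hidden_states = [np.random.rand(768, H, W) for _ in range(4)]

feats = np.concatenate(hidden_states, axis=0)   # (3072, H, W) = 4 x 768
assert feats.shape[0] == 4 * 768 == 3072

# A 1x1 conv is a per-pixel matrix multiply: (1024, 3072) @ (3072, H*W)
proj = np.random.rand(1024, 3072) * 0.01        # stand-in for conv_proj
value = (proj @ feats.reshape(3072, -1)).reshape(1024, H, W)
print(value.shape)                               # (1024, 30, 30)
```

The same per-pixel view also makes the KeyDim path clear: a second (128, 3072) matrix would produce the 128-channel key space.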

3. Integration of XMem++ Memory

To address the memory loss issue in the original pipeline—which caused colors to fade after a few seconds—we implemented the XMem2 logic:

    Permanent Memory Bank: Unlike the original model, where the copy (reference image) was volatile, we created an immutable anchor in VRAM. The reference image is injected as “eternal memory” that is consulted at every frame to stop color drift.
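
The pinned-versus-volatile contrast can be sketched with a toy memory class. The class and method names are hypothetical, not taken from the XMem2 code:

```python
# Hypothetical sketch of the "eternal memory" idea described above: in the
# original pipeline the reference entry was volatile and could be evicted;
# here it is pinned and included in every readout, stopping color drift.
class PermanentMemory:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pinned = []     # reference entries, never evicted
        self.volatile = []   # per-frame entries, evicted when full

    def add(self, feat, pinned=False):
        if pinned:
            self.pinned.append(feat)
            return
        if len(self.volatile) >= self.capacity:
            self.volatile.pop(0)       # FIFO eviction of frame memory only
        self.volatile.append(feat)

    def query_set(self):
        # every frame's query attends over reference + recent memory
        return self.pinned + self.volatile

mem = PermanentMemory(capacity=2)
mem.add("ref", pinned=True)
for t in range(3):
    mem.add(f"f{t}")
print(mem.query_set())   # ['ref', 'f1', 'f2']
```

However many frames pass, `"ref"` stays in the query set, which is the property that keeps colors anchored.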



My recommendation: Start from the original XMem2 GitHub repository and try integrating the DINOv3 Base (ViT-B/16) backbone and ResNet-50, using the colorization technique inspired by the old ColorMNet 2023 pipeline.

I'm here to help with any requests for assistance.

Best

NASS
#14
Hi NASS,

  good news! I extended ColorMNet with XMem2.
  I named the project CMNET2, you can find it at the following link: https://github.com/dan64/cmnet2 
  The key features implemented are:
  • Reference-based colorization
  • Permanent memory (XMem++ style)
  • Preloading API
  • Sliding window memory management
  • Adaptive VRAM management
  • DINOv2 + ResNet50 fusion backbone
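
The last two features can be sketched together: a sliding window whose size is derived from free VRAM. The numbers and the `window_size` helper are illustrative assumptions, not values from the CMNET2 code:

```python
from collections import deque

# Sketch of sliding-window memory with a VRAM-driven window size.
def window_size(free_vram_bytes, bytes_per_frame, min_frames=2, max_frames=16):
    fit = free_vram_bytes // bytes_per_frame
    return max(min_frames, min(max_frames, fit))

# Illustrative budget: 2 GiB free, ~256 MiB of memory features per frame.
n = window_size(free_vram_bytes=2 * 1024**3, bytes_per_frame=256 * 1024**2)
memory = deque(maxlen=n)   # oldest frame features fall out automatically
for t in range(20):
    memory.append(t)
print(n, list(memory))     # 8 [12, 13, 14, 15, 16, 17, 18, 19]
```

Clamping between a minimum and maximum window keeps the model usable on small GPUs while still exploiting larger ones.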

I also tried to add DINOv3, but a full implementation requires a completely new training run, which takes a lot of time to perform; given that my time to develop this project is limited, I decided to skip this extension (my attempt to train only the last 7m nodes was unsuccessful).

The pipeline in which this model will be used involves extracting a certain number of reference images from a B&W video; these are colored (with Qwen-Image-Edit) and then passed to CMNET2, which colors the full clip. In this context, there are two main problems: 1) some frames may have no reference image; in this case, the colors provided by the model are faded, with people's faces appearing gray; 2) the same object may appear in multiple reference frames with different colors; in this case, the model often produces an intermediate color between the two.
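
The extraction step could be sketched as simple even sampling; the spacing policy and the `pick_reference_indices` helper are assumptions for illustration, not the actual pipeline logic:

```python
# Sketch of the reference-extraction step described above: pick evenly
# spaced frames from the B&W clip, color them externally (e.g. with
# Qwen-Image-Edit), then feed them to CMNET2 as references.
def pick_reference_indices(n_frames, n_refs):
    if n_refs >= n_frames:
        return list(range(n_frames))
    step = n_frames / n_refs
    return [int(i * step) for i in range(n_refs)]

print(pick_reference_indices(100, 4))   # [0, 25, 50, 75]
```

Problem 1 below follows directly from this: any frame far from all picked indices has no nearby reference to anchor its colors.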

DinoV3 doesn't solve either of the two problems:
Problem 1 (frame without reference → faded colors) — This is a temporal memory coverage problem, not a feature quality problem. DinoV3 extracts better features, but if there's no reference close in time, the result will still be faded.
Problem 2 (same object with different colors between references) — This is a semantic inconsistency problem between references, caused by Qwen. DinoV3 doesn't know that two references show the same object with different colors — it would calculate the same average as DinoV2.
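
Problem 2 can be seen with a toy similarity-weighted readout. The numbers are illustrative; the point is that when two references match equally well, the output lands between their colors regardless of how good the backbone features are:

```python
# Toy illustration of problem 2: a query pixel matches two references that
# color the same object differently; a similarity-weighted readout returns
# an intermediate color no matter the feature quality.
def readout(sims, colors):
    total = sum(sims)
    return tuple(
        sum(s * c[i] for s, c in zip(sims, colors)) / total
        for i in range(3)
    )

red, blue = (200, 30, 30), (30, 30, 200)
# equally strong matches in both references -> average of the two colors
print(readout([0.9, 0.9], [red, blue]))   # (115.0, 30.0, 115.0)
```

Better features would sharpen the similarities, but with two equally valid matches the weights stay symmetric, so the averaging persists.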

Instead, I'm working on including SAM3 in the pipeline; I hope I can further improve the coloring process this way. We'll see...

Dan
#15
Hi Dan,

This is really incredible work — what you’ve built with CMNET2 is seriously impressive.

The integration of XMem2-style memory, adaptive VRAM management, and the DINOv2 + ResNet50 fusion backbone makes the whole pipeline feel much more robust and production-ready. The sliding window memory approach is also a very smart addition for long sequences.

Your analysis of the two core problems is spot on. I completely agree — both issues are more related to temporal coverage and reference consistency than to feature extraction quality, so it makes sense that DINOv3 wouldn’t fundamentally solve them.

SAM3 could definitely add a very useful layer, especially for improving object-level consistency and controlling how colors propagate across frames. That direction sounds very promising.

I’m going to test the pipeline you’ve built and see how it performs in practice.

Great work, seriously — looking forward to seeing where you take this next.

Best,
Nass

Hi Dan,
I also wanted to mention that I’m available to help with SAM3, especially if you move toward training or experimentation.
I took a look at the GitHub earlier, and it’s really promising — the direction you’re taking makes a lot of sense for improving consistency and control in the pipeline.
If you need any support on that side, feel free to reach out.
Best,
Nass

Hi Dan,
I ran some tests, and the results with XMem2 are genuinely superior. There are no color jumps, the temporal consistency is very solid, and the integration of reference images works perfectly.
Really impressive work — congratulations.
I also have a strong feeling that SAM3 might do something really special here — maybe even a breakthrough.
Best,
Nass
#16
Hi NASS,

   thanks for your comments. I do agree with you: moving from XMem to XMem2 has significantly increased the temporal consistency.
   Feel free to star the cmnet2 repository and/or propose code enhancements.
   I published the project on GitHub so that it will be easier to contribute to it.
   These days I have been too busy to work on SAM3, but I hope to have more time next weekend.

Thanks,
Dan
#17
Hi Dan,
Thanks a lot for your message, I really appreciate it.
Wishing you the best of luck with the next steps — looking forward to seeing how the project evolves.
Best,
Nass
#18
Hi NASS,

  I completed the CMNET2 project. Ultimately, I decided against using SAM3 because it didn't offer any significant improvements that would justify the increased inference times. 
  Instead, I focused on optimizing the code and the inference. If you want to see CMNET2 in all its power, run the test_video_full.py script. 
  You will see a professionally colorized clip, with a quality that until now had never been achieved by an automatic colorization program.
  On my side, I will start working on replacing ColorMNet with CMNET2 in HAVC.

Thanks,
Dan

