Hello Dan & Selur,
I am working on a custom video colorization pipeline heavily inspired by ColorMNet, but I completely overhauled the core architecture to make it state-of-the-art:
1. Backbone Upgrade: Replaced DINOv2 with DINOv3 for denser and richer semantic feature extraction (a feature-extraction sketch follows this list).
2. Memory Upgrade: Upgraded the tracking engine to the XMem++ architecture (incorporating Permanent Memory).
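As a rough illustration of the backbone upgrade in step 1, here is a minimal sketch of pulling dense DINOv3 features through Hugging Face transformers (assumptions: a transformers version with DINOv3 support, access to the gated checkpoint already granted by Meta, and an illustrative input file name):
Code:
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Gated checkpoint: request access from Meta on Hugging Face first.
model_id = "facebook/dinov3-vitb16-pretrain-lvd1689m"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

image = Image.open("frame_0000.png").convert("RGB")  # illustrative name
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The token sequence starts with CLS (and register) tokens, followed by
# patch tokens: the dense features a colorization network conditions on.
features = outputs.last_hidden_state  # (1, num_tokens, hidden_dim)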
The Progress:
I successfully trained the model from scratch for 145,000 iterations (on DAVIS, REDS, and a 16mm film dataset).
The temporal stability and object tracking are mind-blowing. If I provide a reference frame with a red car, the car stays perfectly red throughout the whole video, even through severe occlusions.
The Problem:
While the tracking is perfect, I am experiencing a spatial issue: color bleeding/spilling, specifically over the ground/road and the sky.
Call for Collaboration:
I am reaching out to see if we can team up to stabilize this model. Once we fix this spatial bleeding, I truly believe this will be the ultimate upgrade to ColorMNet.
To get things started, I have attached all the files to this post:
The complete training and inference source code.
The test scripts.
The trained model weights (at 145k iterations).
The visual results along with the reference images.
Let's build something great together. Any advice or pull requests are welcome!
Best
NASS
Script and model:
https://drive.google.com/file/d/1JV7V2pp...sp=sharing
Results:
https://drive.google.com/file/d/1aKtCB5Q...sp=sharing
For Test: python nass.py --input 0000.mp4 --ref_path REF --model saves/color_v3_3090_145000.pth
Hello NASS,
This is an interesting project, but please move your post to a dedicated thread under "Small Talk" and name it something like "ColormnetV2 Project".
I strongly suggest you put your project on GitHub so that it will be easier to contribute to it.
I tried to run your test, but to get it to run I had to add the following code at the beginning of "nass.py":
Code:
from pathlib import Path
import sys

# Ensure the local module directory is on the import path
script_dir = Path(__file__).parent.resolve()
if str(script_dir) not in sys.path:
    sys.path.insert(0, str(script_dir))
But in any case I was unable to complete the test, because access to the dinov3-vitb16-pretrain-lvd1689m checkpoint is restricted.
I attach the result obtained using my current pipeline.
Compared to your version, the street is colored better in mine, but the colors of the girls crossing the street are faded/washed out (a known issue with ColorMNet when coloring small people's faces), while in yours they are well colored. I think there's room for improvement, but since your ColorMNet variant relies on external projects (DINOv3 and XMem++), I suspect some of the problems you're seeing actually come from those projects (the main candidate for the street color is XMem++).
Dan
moved posts to new thread
Hi Dan, thanks for your reply. For DINOv3 Base (https://huggingface.co/facebook/dinov3-v...n-lvd1689m), which I used, you can request access from Meta; they'll get back to you within 24 hours.
ColorMNet (2023) uses DINOv2 Small (training data: approximately 142 million images) and XMem.
For ColorMNet v2 (2026), I used DINOv3 Base (training data: approximately 1.7 billion images) and XMem++, which is superior to XMem.
I've noticed that the temporal consistency of the model I trained with DINOv3 Base and XMem++ is far superior to that of the old model, thanks, I believe, to the permanent memory. There are no color jumps, and DINOv3 provides superior object recognition in the video, allowing for precise integration of the reference images!
You're right: based on my research, it's XMem++ that needs to be properly configured to avoid that color spill on the ground and in the sky! Aside from that, if we can stabilize it, it will truly be superior to other image-guided colorization models! Let me know what you think.
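One configuration knob worth checking here: XMem-style readout supports top-k filtering of the query-to-memory affinity, which suppresses weak, diffuse matches and might reduce the ground/sky spill. A hedged sketch, where the function name and k=30 are illustrative rather than taken from the shared code:
Code:
import torch

def topk_readout_weights(affinity, k=30):
    # affinity: (B, N_q, N_m) raw query-to-memory similarity scores.
    # Keep only the k strongest memory matches per query location.
    values, indices = torch.topk(affinity, k=k, dim=2)
    values = torch.softmax(values, dim=2)   # renormalize over the survivors
    weights = torch.zeros_like(affinity)
    weights.scatter_(2, indices, values)    # all other matches get weight 0
    return weights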
I was able to reproduce your results. On my PC I colorized your clip at about 2.6 fps.
To increase the coloring speed it is possible to reduce the clip frame size and then use chroma transfer to colorize the full-resolution clip (a sketch follows below).
By coloring a smaller clip I reached about 11.3 fps (see attachment), but in this case the colors of the girls crossing the street are faded/washed out, as in the HAVC version of ColorMNet, which uses the same trick to increase inference speed.
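A minimal sketch of that chroma-transfer step, assuming OpenCV; the function name and frame handling are illustrative, not taken from the actual pipeline:
Code:
import cv2

def transfer_chroma(gray_full, colorized_small):
    # gray_full: full-resolution single-channel (luma) frame, uint8
    # colorized_small: low-resolution colorized BGR frame, uint8
    h, w = gray_full.shape[:2]
    # Take only the chroma planes from the low-res colorized frame.
    small_ycc = cv2.cvtColor(colorized_small, cv2.COLOR_BGR2YCrCb)
    _, cr_small, cb_small = cv2.split(small_ycc)
    # Upscale the chroma to full resolution; keep the original full-res luma.
    cr = cv2.resize(cr_small, (w, h), interpolation=cv2.INTER_LINEAR)
    cb = cv2.resize(cb_small, (w, h), interpolation=cv2.INTER_LINEAR)
    full_ycc = cv2.merge([gray_full, cr, cb])
    return cv2.cvtColor(full_ycc, cv2.COLOR_YCrCb2BGR)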
Dan
Hi Dan, thanks for your reply and your test! I think DINOv3 needs high resolution to recognize moving objects; that way the colorization is better! But the solution you suggested is really very useful, provided we can stabilize the model! Check out another test to see how accurate DINOv3 and XMem++ are:
https://drive.google.com/file/d/1oyDZcV1...sp=sharing
Hi NASS
I checked your code and it seems OK. It is possible that the observed color artifacts are introduced by the inference model "color_v3_3090_145000.pth".
Did you train this model yourself? If so, are you able to check the color quality produced by the model near the end of the training?
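A quick way to do that comparison, assuming several intermediate checkpoints were saved under saves/ (the glob pattern is illustrative), is to loop the documented test command over them:
Code:
import glob
import subprocess

# Run the documented test command once per saved checkpoint.
for ckpt in sorted(glob.glob("saves/color_v3_3090_*.pth")):
    subprocess.run(
        ["python", "nass.py", "--input", "0000.mp4",
         "--ref_path", "REF", "--model", ckpt],
        check=True,
    )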
Dan
Hi Dan, thanks for your reply. Yes, I trained the model from scratch! And you're probably right: there might be an issue because I included a dataset from a 16mm film, and the model might have interpreted the grain as color. Also, I don't think I included the full XMem++ code (https://github.com/mbzuai-metaverse/XMem2), because after a thorough analysis I read that it works with a permanent memory (already included) plus a working memory and a 15-frame jump (not included in the training).
You know, all the ColorMNet scripts are based on the XMem scripts. We were able to improve colorization by adding DINOv3 and ResNet. Now we need to understand how XMem2's memory works. (XMem2 provides a precise reference image, with surgical-grade quality! For example, if you want to apply a color to a person or a car, it will remain consistent throughout the video, even if the object moves or leaves and re-enters the field of view.) A conceptual sketch of that memory read follows below.
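For reference, here is how an XMem++-style read over permanent + working memory could look, as a hedged PyTorch sketch; the shapes and names are illustrative, not taken from the XMem2 repository:
Code:
import torch
import torch.nn.functional as F

def memory_read(query_key, perm_keys, perm_values, work_keys, work_values):
    # query_key: (B, C_k, N_q)  keys of the current frame
    # *_keys:    (B, C_k, N_m)  stored memory keys
    # *_values:  (B, C_v, N_m)  stored memory values (color features here)
    keys = torch.cat([perm_keys, work_keys], dim=2)
    values = torch.cat([perm_values, work_values], dim=2)
    # Affinity between every query location and every memory location.
    affinity = torch.einsum('bcq,bcm->bqm', query_key, keys)
    affinity = affinity / (query_key.shape[1] ** 0.5)
    weights = F.softmax(affinity, dim=2)  # normalize over all memory slots
    # Readout: weighted sum of memory values per query location.
    return torch.einsum('bqm,bcm->bcq', weights, values)
As described above, the permanent memory (built from the reference frames) is never evicted, while the working memory is refreshed periodically (the 15-frame jump); that separation is presumably what keeps the reference colors stable over long clips.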
I've sent you the latest version of the training file, train.py (the one I used for training):
https://drive.google.com/file/d/1qr9jkCa...sp=sharing
The command to run: python train.py --davis_root datasets/480p --exp_id ColorMNet_V3_Final --s2_batch_size 4 --s2_num_frames 10 --s2_lr 2e-5 --key_dim 128 --value_dim 512 --hidden_dim 64
Best regards,
NASS
Hi NASS,
Just to understand better: was the version that you shared trained using XMem or XMem2?
Was the model checkpoint "color_v3_3090_145000.pth" generated using train.py (already shared) or the latest train-xmem2.py?
I noted that you are using the DAVIS dataset; do you know how many pictures were used for the training?
How long did the training take on your RTX 3090 ?
Dan