11.04.2026, 18:44
Hi Dan, you're absolutely right: starting from scratch is the right approach. To avoid repeating past issues, here is a technical report on my previous attempt:
Technical Report on the Migration and Optimization of the ColorMNet Pipeline to DINOv3 Base (the successor to DINOv2) and XMem2 (the successor to XMem)
1. Background and Initial Objective
The project is based on the ColorMNet architecture (ECCV 2024), which was originally designed around a DINOv2 Small backbone (ViT-S/14). The objective was to turn this prototype into a high-fidelity film-colorization tool.
2. Technical Conversion Process (Backbone & Latent Space)
The first phase involved replacing the model’s “eyes”:
Backbone: Switch to DINOv3 Base (ViT-B/16). This change required a radical change in dimensionality: by concatenating the last 4 hidden states, the feature vector grows from 1536 channels (4 × 384) to 3072 channels (4 × 768).
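A minimal sketch of the concatenation step, using dummy tensors in place of a real DINOv3 Base forward pass (the 196-token count assumes a 224×224 input; this is illustrative, not ColorMNet's actual code):

```python
import torch

# Dummy stand-ins for the last 4 transformer hidden states of a ViT-B/16
# backbone: batch 1, 196 patch tokens (14x14 grid for a 224x224 input),
# 768 channels each.
hidden_states = [torch.randn(1, 196, 768) for _ in range(4)]

# Concatenate along the channel axis: 4 x 768 = 3072 channels.
feats = torch.cat(hidden_states, dim=-1)
print(feats.shape)  # torch.Size([1, 196, 3072])
```

With the ViT-S backbone the same step yields 4 × 384 = 1536 channels, which is why every downstream layer had to be re-dimensioned.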
Semantic Compression: To keep computation tractable on an RTX 3090, a projection layer (conv_proj) was implemented to compress these 3072 channels into a working space of 1024 channels (ValueDim) and 128 channels (KeyDim).
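Such a projection is typically a pair of 1×1 convolutions. A sketch under that assumption (the layer name conv_proj comes from the report; the actual ColorMNet layer names may differ):

```python
import torch
import torch.nn as nn

# Hypothetical 1x1-conv projections from the 3072-channel concatenated
# features down to the working value/key spaces described above.
value_proj = nn.Conv2d(3072, 1024, kernel_size=1)  # ValueDim
key_proj = nn.Conv2d(3072, 128, kernel_size=1)     # KeyDim

# Concatenated backbone features reshaped to a 2D grid (14x14 here).
x = torch.randn(1, 3072, 14, 14)
v, k = value_proj(x), key_proj(x)
print(v.shape, k.shape)  # torch.Size([1, 1024, 14, 14]) torch.Size([1, 128, 14, 14])
```

A 1×1 convolution mixes channels without touching the spatial grid, so the compression is purely semantic and costs little memory.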
Spatial Alignment: DINOv3 (patch size 16) was synchronized with the ResNet-50 branch (stride 16) to ensure exact pixel-to-pixel correspondence in the PVGFE fusion module.
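The alignment reduces to both branches downsampling by the same factor. A quick check under an assumed 448×448 input (note a stock ResNet-50 has stride 32; stride 16 implies e.g. a dilated last stage):

```python
# Both branches downsample by 16, so their feature grids match 1:1.
H, W = 448, 448                    # example input resolution (assumption)
vit_grid = (H // 16, W // 16)      # DINOv3 ViT-B/16: one token per 16x16 patch
resnet_grid = (H // 16, W // 16)   # ResNet-50 forced to stride 16
assert vit_grid == resnet_grid     # exact correspondence for PVGFE fusion
print(vit_grid)  # (28, 28)
```

With the old DINOv2 patch-14 backbone the grids disagreed (32×32 vs 28×28 here), which forced resampling before fusion; patch 16 removes that step.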
3. Integration of XMem2 (XMem++) Memory
To address the memory loss issue in the original pipeline—which caused colors to fade after a few seconds—we implemented the XMem2 logic:
Permanent Memory Bank: Unlike the original model, where the reference image lived in a volatile working memory, we created an immutable anchor in VRAM. The reference image is injected as "eternal memory" that is consulted at every frame to stop color drift.
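The idea can be sketched as a memory bank whose reference entry is never evicted. This is an illustrative sketch of the XMem2-style behavior, not the actual XMem2 API (class and method names are hypothetical):

```python
import torch

class PermanentMemoryBank:
    """Reference key/value features are stored once as an immutable anchor
    and prepended to the volatile working memory at every read."""

    def __init__(self, ref_key: torch.Tensor, ref_value: torch.Tensor):
        # Detach and keep the reference features: this anchor never changes.
        self.ref_key = ref_key.detach()
        self.ref_value = ref_value.detach()
        self.work_keys, self.work_values = [], []

    def add_working(self, key: torch.Tensor, value: torch.Tensor,
                    max_frames: int = 5):
        # Working memory is volatile: oldest frames are evicted...
        self.work_keys.append(key.detach())
        self.work_values.append(value.detach())
        if len(self.work_keys) > max_frames:
            self.work_keys.pop(0)
            self.work_values.pop(0)

    def read(self):
        # ...but the reference anchor is always consulted first, so the
        # reference colors can never "fade out" of memory.
        keys = torch.cat([self.ref_key] + self.work_keys, dim=-1)
        values = torch.cat([self.ref_value] + self.work_values, dim=-1)
        return keys, values
```

Example: with 128-dim keys, 1024-dim values and 196 tokens per frame, after 7 frames the bank holds the anchor plus the 5 most recent frames.

```python
bank = PermanentMemoryBank(torch.randn(128, 196), torch.randn(1024, 196))
for _ in range(7):
    bank.add_working(torch.randn(128, 196), torch.randn(1024, 196))
k, v = bank.read()
print(k.shape, v.shape)  # torch.Size([128, 1176]) torch.Size([1024, 1176])
```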
My recommendation: Start from the original XMem2 GitHub repository and integrate the DINOv3 Base (ViT-B/16) + ResNet-50 backbone, applying the colorization technique from the original ColorMNet pipeline.
I'm happy to help with any questions.
Best
NASS