Using Stable Diffision models for Colorization

RE: Using Stable Diffision models for Colorization - Dan64 - 11.05.2026

(11.05.2026, 15:44)Selur Wrote: I then stopped the server, and called start_server.cmd:

(.venv) F:\Hybrid\64bit\Vapoursynth\DiTServerRPC-main>start_server.cmd
'erver.cmd' is not recognized as an internal or external command, operable program or batch file.
'age:' is not recognized as an internal or external command, operable program or batch file.
'rt_server.cmd' is not recognized as an internal or external command, operable program or batch file.
...............................
'f' is not recognized as an internal or external command, operable program or batch file.
[ERROR] Unknown precision argument: "". Use "fp4" or "int4".
=> That didn't work.

Does it really make sense to add this to Hybrid? If yes, in what way?
Adding it to the torch add-on seems like a bad idea, since updates & co. could break things too easily (it is also huge).
So the only way this seems to make sense would be a separate add-on: depending on whether it is present, additional options could be made available in HAVC, assuming the plan is to use this in HAVC.

Cu Selur

Ps.: 'DiTServerRPC-main'-folder is ~5GB in size.

The problem is clear: the .cmd files have Unix line endings (LF) instead of Windows ones (CRLF). When Windows CMD reads an LF-only batch file, it does not split the lines correctly and interprets fragments of comments and configuration text as commands to execute, hence all those errors.
Cause: the files were created with LF endings, and when they are downloaded from GitHub with core.autocrlf=false they stay in LF.
Immediate fix: convert the downloaded files with Notepad++ (Edit → EOL Conversion → Windows (CRLF)), or in VS Code by clicking LF in the bottom right corner and choosing CRLF.
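
If many files are affected, the same conversion can be scripted. The following is only a minimal sketch using the Python standard library; the helper names `to_crlf` and `convert_cmd_files` are made up for this example and are not part of DiTServerRPC.

```python
from pathlib import Path


def to_crlf(data: bytes) -> bytes:
    """Return data with every line ending converted to CRLF."""
    # Normalise existing CRLF pairs to LF first, so they are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_cmd_files(folder: str) -> list[str]:
    """Rewrite all *.cmd files in folder with CRLF endings.

    Returns the names of the files that were actually changed.
    """
    changed = []
    for path in sorted(Path(folder).glob("*.cmd")):
        original = path.read_bytes()
        converted = to_crlf(original)
        if converted != original:
            path.write_bytes(converted)
            changed.append(path.name)
    return changed
```

Calling `convert_cmd_files(r"F:\Hybrid\64bit\Vapoursynth\DiTServerRPC-main")` would then fix the batch files in the folder Selur used; files that already have CRLF endings are left untouched.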

Does it really make sense to add this to Hybrid? No, and this is the reason why I split the project into client and server: only the client part, which is lightweight and has no dependencies, will be implemented in HAVC. Anyone who wants to use the client must download the server from GitHub and run it.

Dan

P.S.
If you run

python dit_client_pair_example.py --pipeline-config qwen_config_int4.json --use-shm

you should be able to colorize 2 images in about 12 sec, i.e. 6 sec per image, a 2x speed increase for free.

(11.05.2026, 18:12)didris Wrote: installation was not complicated, but I also could not get it to work this way.
I installed it on embedded Python. There are quite a few models: C:\Users\YOUR_USERNAME\.cache\huggingface\hub\ takes 30.8 GB.

The result is an average of 9 seconds per image, with an average of 23 GB of GPU memory and 39 GB of RAM in use during the process (my card is an RTX5090). It is probably possible to improve the time, but at the expense of quality. The colorization is not quite consistent: across frames the same object is sometimes colored differently, which probably depends a lot on what is set in the prompt.

Once again, thanks to Dan and Selur for what they have done.

I don't understand why you were not able to follow the instructions provided on GitHub; maybe you ran into the same LF/CRLF problem as Selur.

On my RTX5070Ti I'm getting the same speed of about 8/9 sec per image; here is my output:

(.venv) PS D:\PProjects\DiTServerRPC> python dit_client_pair_example.py --pipeline-config qwen_config_fp4.json --use-shm
[INFO] Connecting to http://127.0.0.1:8765/ ...
[INFO] Server is reachable.
[INFO] Transport: shared memory
[INFO] Pipeline already loaded on server.
[INFO] Image 1: sample1_bw.jpg  (1480x1080 px)
[INFO] Image 2: sample2_bw.jpg  (1480x1080 px)
[INFO] Running paired inference (gap=8px) ...
[INFO] Inference time : 8.12s total  (4.06s per image)
[INFO] Round-trip time: 8.28s
[INFO] Saved: sample1_colorized.jpg
[INFO] Saved: sample2_colorized.jpg

If you run

python dit_client_pair_example.py --pipeline-config qwen_config_fp4.json --use-shm

you can colorize 2 images in the time needed for 1 image, a 2x speed increase for free.
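
The thread does not show how the server combines the two images; the log only mentions a gap of 8 px between them. As a rough illustration of the idea, the sketch below assumes the two images are placed side by side on one canvas, separated by a filler gap, and split again after inference. The function names and the list-of-rows image representation are hypothetical and chosen only to keep the example dependency-free.

```python
def pack_pair(img1, img2, gap_px=8, fill=0):
    """Place two images (lists of pixel rows) side by side with a gap.

    Assumption: both images have the same height.
    """
    if len(img1) != len(img2):
        raise ValueError("both images must have the same height")
    gap = [fill] * gap_px
    return [row1 + gap + row2 for row1, row2 in zip(img1, img2)]


def unpack_pair(canvas, width1, gap_px=8):
    """Split a packed canvas back into the left and right image."""
    left = [row[:width1] for row in canvas]
    right = [row[width1 + gap_px:] for row in canvas]
    return left, right


# Two tiny "images": 2 rows each, widths 2 and 1.
a = [[1, 2], [3, 4]]
b = [[5], [6]]
canvas = pack_pair(a, b, gap_px=2)
left, right = unpack_pair(canvas, width1=2, gap_px=2)
```

The speed-up reported above comes from the model processing one larger canvas in roughly the time of a single image, so the cost per image is about halved.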

You can change qwen_config_fp4.json as follows

{
    "model_name":            "nunchaku-qwen",
    "model_precision":       "fp4",
    "model_rank":            "32",
    "model_inference_steps": "4",
    "cache_dir":             "C:\\Users\\YOUR_USERNAME\\.cache\\huggingface\\hub",
    "full_model_path":       ""
}

to use your HF cache dir.
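
Note that inside a JSON string the backslash is an escape character, so a Windows path needs doubled backslashes (or forward slashes). The sketch below assumes the server reads the file with a standard JSON parser, which the thread does not confirm; it shows that a single-backslash path is rejected and that `json.dumps` produces the correctly escaped form.

```python
import json

# A path written with single backslashes is not valid JSON: "\U" is an
# unknown escape sequence, so a standard parser rejects it.
raw_bad = r'{"cache_dir": "C:\Users\YOUR_USERNAME\.cache\huggingface\hub"}'
try:
    json.loads(raw_bad)
    bad_is_valid = True
except json.JSONDecodeError:
    bad_is_valid = False

# Letting json.dumps write the file takes care of the escaping.
config = {
    "model_name": "nunchaku-qwen",
    "model_precision": "fp4",
    "model_rank": "32",
    "model_inference_steps": "4",
    "cache_dir": r"C:\Users\YOUR_USERNAME\.cache\huggingface\hub",
    "full_model_path": "",
}
text = json.dumps(config, indent=4)

# Reading it back yields the original path with single backslashes.
round_trip = json.loads(text)["cache_dir"]
```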

Dan

P.S.
Use my version, which uses shared memory instead of converting the image PNG->bytes; it should increase the speed by about 25% (from 5 sec to 4 sec).
Also try changing line 143 in dit_colorize_main.py to

if torch.cuda.get_device_properties(0).total_memory / (1024 ** 3) < 48:

to see whether the optimizations implemented for 16GB VRAM also work on your RTX5090.
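
The condition converts the VRAM size reported by torch from bytes to GiB and compares it with a threshold. The torch-free sketch below only illustrates that arithmetic; the function names are hypothetical, and it assumes (as the post suggests) that the branch guarded by this condition is the one enabling the low-VRAM optimizations.

```python
GIB = 1024 ** 3  # bytes per GiB, as in the expression above


def total_vram_gib(total_memory_bytes: int) -> float:
    """Convert a byte count (e.g. device total_memory) to GiB."""
    return total_memory_bytes / GIB


def takes_low_vram_branch(total_memory_bytes: int, threshold_gib: float) -> bool:
    """True if a card with this much VRAM falls below the threshold."""
    return total_vram_gib(total_memory_bytes) < threshold_gib


# With the threshold raised to 48, a 32 GB card such as the RTX5090
# falls below it, just like a 16 GB card does.
rtx5090 = takes_low_vram_branch(32 * GIB, 48)
rtx5070ti = takes_low_vram_branch(16 * GIB, 48)
```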

RE: Using Stable Diffision models for Colorization - didris - 11.05.2026

Hi, Dan

I will try your optimization tips and report back. In ComfyUI I get about 5 seconds per frame, so it is possible.

It would be very good if qwen_edit were integrated into Hybrid.

If it is not too impudent to ask:
is this script correct for the subsequent coloring in Hybrid with reference frames, and can it be improved in some way? I have already processed a movie with it, and it turned out pretty well:

import vapoursynth as vs
from vapoursynth import core
import sys
import os

# ------------------------------------------------------------
# PATH TO HYBRID VSSCRIPTS (IMPORTANT FIX)
# ------------------------------------------------------------

scriptPath = r"D:/Programs/Hybrid/64bit/vsscripts"
sys.path.insert(0, os.path.abspath(scriptPath))

# ------------------------------------------------------------
# IMPORT HAVC (actually vsdeoldify wrapper)
# ------------------------------------------------------------

import vsdeoldify as havc

# ------------------------------------------------------------
# PATHS
# ------------------------------------------------------------

VideoPath = r"E:\Hybrid\video.mkv"
RefDir    = r"E:\DiTServerRPC\output"

# ------------------------------------------------------------
# LOAD VIDEO
# ------------------------------------------------------------

clip = havc.HAVC_read_video(source=VideoPath)

# ------------------------------------------------------------
# COLOR PROPAGATION (HAVC)
# ------------------------------------------------------------

clip = havc.HAVC_cmnet2(
    clip,
    method=4,
    sc_framedir=RefDir,
    encode_mode=0,
    render_speed="auto",
    max_memory_frames=50,
    ref_mode=0,
    render_vivid=False
)

# ------------------------------------------------------------
# RGB -> YUV420P10 (for x265)
# ------------------------------------------------------------

clip = core.resize.Bicubic(
    clip,
    format=vs.YUV420P10,
    matrix_in_s="709",
    matrix_s="709",
    range_in_s="full",
    range_s="limited",
    dither_type="error_diffusion"
)

# ------------------------------------------------------------
# OUTPUT TO VAPOURSYNTH PIPE
# ------------------------------------------------------------

clip.set_output()

RE: Using Stable Diffision models for Colorization - Dan64 - 12.05.2026

Hi didris,

Your script seems OK; the call to the function HAVC_cmnet2() is the one described in this post: #22

Using ComfyUI my inference time is about 22 sec; with the heavily optimized code of the server I was able to increase the speed by about 5x.
So on your RTX5090 you should be able to run the inference in less than 2 sec per image (using the pair() trick), probably in 1 sec.

The total space of the files needed to run the server is:

venv : 4.96GB (of which 4.28GB are the torch package)
.cache : 23.3GB (nunchaku-qwen-image) + 15.7GB (vae + text_encoder) = 39GB

In summary, about 44GB are needed to run the server.

The total memory (RAM + VRAM) needed to run the server is about 46GB (see post #5); on top of this comes the RAM needed to run Windows (about 12GB), for a total of about 58GB. As you can see, this is more than is usually available on a standard PC.

So I think that the usage of this model is limited to high-end workstations.

I'm happy to know that Selur was able to run the model on his RTX4080; using the pair() trick he should probably be able to run the inference of a full frame in about 5 sec.
With a reference frame every 25 frames, this implies that it could be possible to colorize a clip at about 5 fps, not too bad for a DiT model.
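
The estimate follows from simple arithmetic: if only one frame in every N is sent to the DiT model, the model time per clip frame is the inference time divided by N. The sketch below is just that calculation; it deliberately ignores the time HAVC needs to propagate the colors to the frames in between, so real throughput would be lower. The function name is made up for the example.

```python
def effective_fps(seconds_per_reference: float, reference_interval: int) -> float:
    """Clip frames covered per second of DiT inference.

    seconds_per_reference: inference time for one reference frame
    reference_interval:    one reference frame every this many clip frames
    """
    return reference_interval / seconds_per_reference


# 5 sec per reference frame, one reference every 25 frames.
rtx4080_estimate = effective_fps(5.0, 25)

# 1 sec per reference frame, same interval.
one_second_estimate = effective_fps(1.0, 25)
```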

I don't see any advantage in including the server in Hybrid, only disadvantages. Both Selur and you are asking for it, but I don't understand why.
If the steps to run the server are too complex, please suggest which steps should be improved.

In any case, to run the full DiT colorization in Hybrid it will be necessary to split the process into client and server, as I already did for CMNET2, because these processes are not compatible with VapourSynth threading.

Moreover, a client/server architecture allows users who want to use the DiT colorizer on standard hardware to rent a powerful GPU and run the server for a few hours. That is the cheapest solution compared to a hardware upgrade (especially these days). For example, with a rented RTX5090 it could be possible to colorize a clip at about 20/25 fps (almost real time).

Let me know what you think.

Dan

RE: Using Stable Diffision models for Colorization - Dan64 - 12.05.2026

(11.05.2026, 15:44)Selur Wrote: ... here's what I did:
opened a terminal inside 'Hybrid\64bit\Vapoursynth'

put the content of the repository into DiTServerRPC-main using:

changed into 'DiTServerRPC-main' folder

installed venv (portable Python usually isn't built with venv)

created the venv

activated the venv:

Installed the dependencies into the venv:

started the server with the preload:

ran the test script
Opened another terminal where I navigated to 'Hybrid\64bit\Vapoursynth' and called

python DiTServerRPC-main\dit_client_example.py --pipeline-config DiTServerRPC-main\qwen_config_fp4.json --use-shm

that ended with:

[INFO] Connecting to http://127.0.0.1:8765/ ...
[INFO] Server is reachable.
[INFO] Transport: shared memory
[INFO] Pipeline already loaded on server.
[INFO] Reading input image: F:\Hybrid\64bit\Vapoursynth\DiTServerRPC-main\assets\santa_bw.png
[INFO] Colorizing (1184x880 px) ...
[INFO] Inference time : 11.89s
[INFO] Round-trip time: 11.93s
[INFO] Saved: F:\Hybrid\64bit\Vapoursynth\DiTServerRPC-main\assets\santa_colorized.png
So far so good.

Hello Selur,

Your results are good; using the pair() trick I expect that you could reach an inference speed of about 5/6 sec per image.
The RTX4080 should have 16GB of VRAM; how much RAM do you have in your PC (just to better understand the HW requirements)?
The comments I wrote to didris in the previous post also apply to you.

Please let me know what you think.

Dan

RE: Using Stable Diffision models for Colorization - Selur - 12.05.2026

I got 64GB of RAM.

RE: Using Stable Diffision models for Colorization - Dan64 - 12.05.2026

Nice :)

But what do you think about the integration in Hybrid?

Thanks,
Dan

RE: Using Stable Diffision models for Colorization - Selur - 12.05.2026

I have no problem with adding support for using the server.
Assuming you add support for it to HAVC, or write a separate wrapper for the RPC calls, it doesn't seem like much work in Hybrid:
a. ask the user for data (1. the server URL, 2. maybe how many frames should be processed in parallel (1, 2, ?))
b. convert the video to, probably, RGB
c. call the wrapper
The UI elements would be minimal, and from what I gather the client would not need anything that the torch add-on with HAVC in it doesn't already provide.

IIRC the plan was to use this for reference images every xy frames, so adding it to HAVC would make sense, and it would require just a few additional parameters (a. server URL, b. interval at which reference frames get created, c. number of frames to process in parallel).
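
To make the last two parameters concrete, here is a small sketch of how reference frame numbers could be picked at a fixed interval and grouped for parallel processing. It is not HAVC code; the function name and the grouping strategy are assumptions made for illustration only.

```python
def reference_batches(num_frames: int, interval: int, batch_size: int) -> list[list[int]]:
    """Group reference frame numbers into batches.

    num_frames: total number of frames in the clip
    interval:   one reference frame every `interval` frames
    batch_size: how many reference frames are sent to the server at once
    """
    if interval < 1 or batch_size < 1:
        raise ValueError("interval and batch_size must be >= 1")
    refs = list(range(0, num_frames, interval))
    return [refs[i:i + batch_size] for i in range(0, len(refs), batch_size)]


# 100-frame clip, a reference every 25 frames, 2 frames per request.
batches = reference_batches(100, 25, 2)
```

With a batch size of 2, each batch would map onto one paired request to the server; a trailing batch with a single frame has to be handled separately.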

Cu Selur

RE: Using Stable Diffision models for Colorization - Selur - 12.05.2026

Quick and Dirty: just running all frames through the server:

# Imports
import sys
import os
import vapoursynth as vs
# getting Vapoursynth core
core = vs.core
# Limit frame cache to 48449MB
core.max_cache_size = 48449
# Import scripts folder
scriptPath = 'F:/Hybrid/64bit/vsscripts'
sys.path.insert(0, os.path.abspath(scriptPath))
# loading plugins
core.std.LoadPlugin(path="F:/Hybrid/64bit/Vapoursynth/Lib/site-packages/vapoursynth/plugins2/fmtconv.dll")
core.std.LoadPlugin(path="F:/Hybrid/64bit/Vapoursynth/Lib/site-packages/vapoursynth/plugins2/libbestsource.dll")
# Import scripts
import validate
# Source: 'G:\TestClips&Co\files\test.avi'
# clip current meta; color space: YUV420P8, bit depth: 8, resolution: 640x352, fps: 25, color matrix: 470bg, color primaries: Unspecific, color transfer: Unspecified, yuv luminance scale: limited, scanorder: progressive, full height: true ((Source))
# Loading 'G:\TestClips&Co\files\test.avi' using BestSource
clip = core.bs.VideoSource(source="G:/TestClips&Co/files/test.avi", cachepath="J:/tmp/test_bestSource", track=0, hwdevice="opencl")

import xmlrpc.client
import io
import numpy as np
from PIL import Image


clip_rgb = core.resize.Bicubic(clip, format=vs.RGB24, matrix_in_s="470bg")

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)
PROMPT = "Colorize this black and white image with natural, realistic colors."

def frame_to_png_bytes(f):
    w, h = f.width, f.height
    # VapourSynth R55+: planes are accessed with frame[plane]
    r = np.asarray(f[0])
    g = np.asarray(f[1])
    b = np.asarray(f[2])
    arr = np.dstack([r, g, b])
    img = Image.fromarray(arr, "RGB")
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return xmlrpc.client.Binary(buf.getvalue())

def write_png_to_frame(fout, png_bytes_data):
    out_img = Image.open(io.BytesIO(bytes(png_bytes_data))).convert("RGB")
    out_arr = np.array(out_img)
    for plane_idx in range(3):
        np.copyto(np.asarray(fout[plane_idx]), out_arr[:, :, plane_idx])

# Process pairs: frame N and N+1 together
# Use FrameEval with a clip-of-clips approach, or simply process even frames
# and carry the paired result. A simpler approach for offline encoding:

num_frames = clip_rgb.num_frames
results = {}  # cache colorized frames

def colorize_paired(n, f):
    if n in results:
        return results.pop(n)

    fout = f.copy()
    
    # Get frame n
    png1 = frame_to_png_bytes(f)
    
    # Get frame n+1 (if exists)
    n2 = min(n + 1, num_frames - 1)
    f2 = clip_rgb.get_frame(n2)
    png2 = frame_to_png_bytes(f2)
    fout2 = f2.copy()

    result = proxy.colorize_frame_pair(png1, png2, PROMPT, 8)
    # gap_px=8 is the separator between the two images during inference

    if result["ok"]:
        write_png_to_frame(fout, result["data1"])
        write_png_to_frame(fout2, result["data2"])
        if n2 != n:
            results[n2] = fout2  # cache the second result
    
    return fout

colorized = core.std.ModifyFrame(clip_rgb, clip_rgb, colorize_paired)
output = core.resize.Bicubic(colorized, format=vs.YUV420P8, matrix_s="470bg")
output.set_output()

With this I get ~4 s/frame:

2026-05-12 19:44:26,630 [INFO] colorize_frame_pair: 13.94s (6.97s/frame)
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.64s/it]
2026-05-12 19:44:54,805 [INFO] colorize_frame_pair: 7.55s (3.78s/frame)
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.63s/it]
2026-05-12 19:49:00,028 [INFO] colorize_frame_pair: 8.40s (4.20s/frame)
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.73s/it]
2026-05-12 19:49:37,091 [INFO] colorize_frame_pair: 8.01s (4.00s/frame)
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.64s/it]
2026-05-12 19:49:44,713 [INFO] colorize_frame_pair: 7.48s (3.74s/frame)
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.64s/it]
2026-05-12 19:49:52,345 [INFO] colorize_frame_pair: 7.49s (3.75s/frame)
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]
2026-05-12 19:50:00,027 [INFO] colorize_frame_pair: 7.54s (3.77s/frame)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.63s/it]
2026-05-12 19:50:07,564 [INFO] colorize_frame_pair: 7.40s (3.70s/frame)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.66s/it]
2026-05-12 19:50:15,198 [INFO] colorize_frame_pair: 7.50s (3.75s/frame)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]
2026-05-12 19:50:22,959 [INFO] colorize_frame_pair: 7.62s (3.81s/frame)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]
2026-05-12 19:50:30,665 [INFO] colorize_frame_pair: 7.56s (3.78s/frame)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.63s/it]
2026-05-12 19:50:38,270 [INFO] colorize_frame_pair: 7.46s (3.73s/frame)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.63s/it]
2026-05-12 19:50:45,998 [INFO] colorize_frame_pair: 7.58s (3.79s/frame)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.68s/it]
2026-05-12 19:50:53,992 [INFO] colorize_frame_pair: 7.84s (3.92s/frame)

Cu Selur

RE: Using Stable Diffision models for Colorization - Dan64 - 12.05.2026

Nice :)

As you can see, once the model is fully loaded, the inference time drops to about 3.8 seconds, much better than using ComfyUI, and with this speed it makes sense to add it to Hybrid.
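
The figure can be checked against the server log Selur posted. The sketch below parses the per-frame times from the `colorize_frame_pair` lines (progress bar lines omitted) and averages them, skipping the first call, which is treated here as a warm-up; that cut-off is a choice made for this example.

```python
import re
from statistics import mean

# colorize_frame_pair lines copied from the log above.
LOG = """\
2026-05-12 19:44:26,630 [INFO] colorize_frame_pair: 13.94s (6.97s/frame)
2026-05-12 19:44:54,805 [INFO] colorize_frame_pair: 7.55s (3.78s/frame)
2026-05-12 19:49:00,028 [INFO] colorize_frame_pair: 8.40s (4.20s/frame)
2026-05-12 19:49:37,091 [INFO] colorize_frame_pair: 8.01s (4.00s/frame)
2026-05-12 19:49:44,713 [INFO] colorize_frame_pair: 7.48s (3.74s/frame)
2026-05-12 19:49:52,345 [INFO] colorize_frame_pair: 7.49s (3.75s/frame)
2026-05-12 19:50:00,027 [INFO] colorize_frame_pair: 7.54s (3.77s/frame)
2026-05-12 19:50:07,564 [INFO] colorize_frame_pair: 7.40s (3.70s/frame)
2026-05-12 19:50:15,198 [INFO] colorize_frame_pair: 7.50s (3.75s/frame)
2026-05-12 19:50:22,959 [INFO] colorize_frame_pair: 7.62s (3.81s/frame)
2026-05-12 19:50:30,665 [INFO] colorize_frame_pair: 7.56s (3.78s/frame)
2026-05-12 19:50:38,270 [INFO] colorize_frame_pair: 7.46s (3.73s/frame)
2026-05-12 19:50:45,998 [INFO] colorize_frame_pair: 7.58s (3.79s/frame)
2026-05-12 19:50:53,992 [INFO] colorize_frame_pair: 7.84s (3.92s/frame)
"""

# Capture the per-frame time in parentheses.
PATTERN = re.compile(r"colorize_frame_pair: [\d.]+s \(([\d.]+)s/frame\)")

per_frame = [float(m.group(1)) for m in PATTERN.finditer(LOG)]
warm = per_frame[1:]          # drop the first (warm-up) call
steady_state = mean(warm)     # average seconds per frame once warmed up

print(f"{len(warm)} calls, {steady_state:.2f} s/frame on average")
```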

Tomorrow I will go on holiday and will be away for one week, so I hope to be able to deliver the new RC for HAVC 5.8.5 in 2 weeks.

Thanks,
Dan

RE: Using Stable Diffision models for Colorization - Selur - 12.05.2026

Take your time.

By the way, would it be complicated/possible to support RGBS, RGBH, YUV444PS, YUV444PH in HAVC?
Alternatively, I'll think about a wrapper that would:
convert the original video (if it's high bit depth) to YUV444PS, keep a copy of Y on the side, convert the video to RGB24, apply HAVC, convert the HAVC output to YUV444PS, and then combine its UV channels with the original Y; this way at least the high bit depth of the luma would be preserved.
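
A rough VapourSynth sketch of such a wrapper could look as follows. It is untested against HAVC: `colorize` stands for any callable that takes and returns an RGB24 clip (for example a lambda around a HAVC call), the function name is made up, and the "709" default matrix is an assumption that has to match the actual source.

```python
def colorize_keep_luma(clip, colorize, matrix="709"):
    """Colorize via an 8-bit RGB filter but keep the source luma precision.

    clip:     YUV source clip (any bit depth)
    colorize: callable taking an RGB24 clip and returning an RGB24 clip
    matrix:   matrix used for the YUV<->RGB conversions (assumed value)
    """
    import vapoursynth as vs  # imported lazily; needs a VapourSynth install
    core = vs.core

    # 1. Float copy of the source; its Y plane is the one that is kept.
    src444 = core.resize.Bicubic(clip, format=vs.YUV444PS)

    # 2. 8-bit RGB version as input for the colorizer.
    rgb = core.resize.Bicubic(clip, format=vs.RGB24, matrix_in_s=matrix)

    # 3. Run the colorizer (e.g. HAVC) on the RGB24 clip.
    colored = colorize(rgb)

    # 4. Bring the colorized result back to float YUV.
    col444 = core.resize.Bicubic(colored, format=vs.YUV444PS, matrix_s=matrix)

    # 5. Y from the original, U and V from the colorized clip.
    return core.std.ShufflePlanes(
        clips=[src444, col444, col444],
        planes=[0, 1, 2],
        colorfamily=vs.YUV,
    )
```

The chroma still passes through 8 bit in this approach; only the luma plane keeps the precision of the source.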

Cu Selur