mikazmaj · 4 hours ago

What “flexible output” is for in VS-MLRT
In normal VapourSynth usage, a filter usually returns a regular VideoNode: either a single-plane GRAY clip, a three-plane RGB clip, or a three-plane YUV clip. That works well for classic image/video restoration models, where the ONNX model takes an image and returns another image.
But not every ONNX model has that simple shape. Many ML models return tensors such as:

1 x 2 x H x W
1 x 4 x H x W
1 x 6 x H x W
1 x 8 x H x W
1 x N x H x W

Those channels are not always “RGB”. They may be:

U/V chroma planes
alpha
mask
confidence map
optical flow
depth
luma/chroma residuals
multiple temporal outputs
intermediate restoration planes
auxiliary model outputs
packed model-specific data

The problem is that a generic frontend like Hybrid cannot safely assume that a multi-channel ONNX output should be converted to RGB. If the model outputs 2 channels, 4 channels, 6 channels, or more, forcing it into a normal VapourSynth clip either fails or destroys the semantic meaning of the output.
That is exactly what VS-MLRT’s flexible output path solves.
VS-MLRT already exposes flexible_inference as a public API, alongside the regular inference function. In vsmlrt.py, flexible_inference is included in __all__, so it is intentionally part of the wrapper’s public interface, not just an internal hack.

How VS-MLRT flexible output works internally
The normal inference() path returns one VideoNode. That is suitable when the ONNX output can be represented as a standard VapourSynth clip.
The flexible path is different. flexible_inference_with_fallback() calls _inference() with a flexible_output_prop argument. Instead of returning only a normal clip, the call hands back a dictionary containing:

ret["clip"]
ret["num_planes"]

Then it extracts each output channel separately:

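clip, num_planes = ret["clip"], ret["num_planes"]
# One clip per output channel, read back from the per-frame properties.
planes = [
    clip.std.PropToClip(prop=f"{flexible_output_prop}{i}")
    for i in range(num_planes)
]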

and returns a Python list of VideoNodes, one per output channel.
So conceptually:

ONNX output:  N x C x H x W

regular inference:
    expects C to be representable as a normal clip

flexible inference:
    exposes output channel 0 as clip[0]
    exposes output channel 1 as clip[1]
    exposes output channel 2 as clip[2]
    ...
    exposes output channel C-1 as clip[C-1]

That means Hybrid would not need to guess whether C=2, C=4, C=6, or C=8 means RGB, YUV, mask, alpha, flow, or something else. It can expose the channels, then let the user or script decide how to combine them.

Why this matters for multi-channel models
A lot of ONNX models used in VapourSynth are not just “RGB in, RGB out”. Some are more like tensor processors.
For example:

input:  1 x 3 x H x W
output: 1 x 2 x H x W

This could mean the model predicts two chroma planes.
Another model might be:

input:  1 x 6 x H x W
output: 1 x 3 x H x W

That could mean two RGB frames are packed on input and one interpolated or restored RGB frame is returned.
Another one:

input:  1 x 3 x H x W
output: 1 x 4 x H x W

That might be RGB plus alpha, or RGB plus mask, or Y/U/V plus confidence. Without flexible output, Hybrid has no clean generic way to support that.
Another example:

input:  1 x 3 x H x W
output: 1 x 8 x H x W

This cannot be represented as a normal RGB/YUV/GRAY VapourSynth clip at all. But it is still a valid ONNX model and VS-MLRT can expose those eight channels individually through flexible output.
That is the key point: the ONNX model is valid, and VS-MLRT can run it, but a frontend that only expects one normal clip cannot represent the result correctly.

Existing proof inside VS-MLRT: ArtCNN chroma models
This is not only theoretical. The current vsmlrt.py already uses flexible output for ArtCNN chroma models.
For chroma variants, VS-MLRT calls:

clip_u, clip_v = flexible_inference_with_fallback(...)

Then it reconstructs a YUV clip with:

clip = core.std.ShufflePlanes([clip, clip_u, clip_v], [0, 0, 0], vs.YUV)

So the model output is not treated as one RGB image. Instead, the two output channels are extracted separately as clip_u and clip_v, then combined with the original luma plane.
That is an excellent example to show Selur, because it proves the feature is already useful in real VS-MLRT code:

model output channels → separate VapourSynth clips → custom recombination

For Hybrid, this means flexible output would not be an exotic feature. It would expose a capability VS-MLRT already uses internally.
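The recombination step can be wrapped in a small helper. This is a sketch, not code from vsmlrt.py: rebuild_yuv is a hypothetical name, and the core object and the YUV constant are passed in as arguments (in a real script they would be vs.core and vs.YUV) so the helper does not depend on how VapourSynth is imported.

```python
def rebuild_yuv(core, yuv_family, luma_src, clip_u, clip_v):
    """Take plane 0 of each input and assemble them as Y, U and V.

    core       -- a VapourSynth core object (vs.core in a real script)
    yuv_family -- the YUV colour-family constant (vs.YUV in a real script)
    luma_src   -- clip whose first plane supplies the luma
    clip_u/v   -- single-plane clips, e.g. from flexible output
    """
    return core.std.ShufflePlanes(
        clips=[luma_src, clip_u, clip_v],
        planes=[0, 0, 0],
        colorfamily=yuv_family,
    )
```

In a script this would be called as rebuild_yuv(vs.core, vs.YUV, clip, clip_u, clip_v). The three inputs are expected to agree in bit depth and sample type, so any format conversion should happen before this step.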

Why forcing everything into RGB is wrong
A frontend may be tempted to do something simple like:

if output has 3 channels → RGB
if output has 1 channel  → GRAY
else reject

That is safe only for very simple models.
But it breaks down quickly:

2 channels → could be UV, flow x/y, mask pair, chroma residuals
3 channels → not always RGB; could be YUV, Lab, residuals, flow+mask
4 channels → could be RGBA, RGB+mask, YUV+alpha, 4 feature maps
6 channels → could be two RGB frames, bidirectional flow, temporal packed data
8+ channels → often model-specific tensor data

The channel count alone does not define the meaning.
Flexible output avoids pretending that the frontend knows the model’s semantics. It gives Hybrid a lower-level, lossless way to access the model result.
That is important because with ML models, channel order is part of the model contract. Treating arbitrary channels as RGB can silently produce wrong colors, wrong masks, wrong temporal behavior, or completely meaningless output.
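To make the ambiguity concrete, here is a plain-Python sketch of a lookup that refuses to guess. The names CANDIDATES and resolve_meaning are hypothetical, and the table follows the examples listed in this section. The point is that a channel count maps to several plausible meanings, so the meaning has to be declared rather than inferred.

```python
from typing import Dict, List, Optional

# Plausible meanings per channel count (illustrative, not exhaustive).
CANDIDATES: Dict[int, List[str]] = {
    2: ["UV", "flow x/y", "mask pair", "chroma residuals"],
    3: ["RGB", "YUV", "Lab", "residuals", "flow+mask"],
    4: ["RGBA", "RGB+mask", "YUV+alpha", "4 feature maps"],
    6: ["two RGB frames", "bidirectional flow", "temporal packed data"],
}


def resolve_meaning(num_channels: int, declared: Optional[str] = None) -> str:
    """Return the declared meaning; raise instead of guessing if none is given."""
    options = CANDIDATES.get(num_channels, ["model-specific tensor data"])
    if declared is None:
        raise ValueError(
            f"{num_channels} channels could mean any of {options}; "
            "the user or script has to say which"
        )
    # The declaration is trusted as-is: only the model author knows the layout.
    return declared
```

Calling resolve_meaning(4) raises, while resolve_meaning(4, "RGB+mask") returns the declared layout unchanged.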

Why Hybrid should implement it
The main reason is simple:
Hybrid should not be less capable than VS-MLRT itself.
If VS-MLRT can run a model and expose all output channels, Hybrid should ideally allow the user to access that functionality instead of blocking the model or forcing a wrong interpretation.
The benefit for Hybrid would be:

Support more ONNX models.
Avoid wrong assumptions about output channel meaning.
Preserve model semantics.
Allow advanced users to combine channels manually.
Enable models that output masks, alpha, UV, flow, confidence, or auxiliary planes.
Use VS-MLRT’s native mechanism instead of inventing a workaround.

VS-MLRT’s own code path already passes flexible_output_prop into backend model calls when flexible output is requested, so the feature is designed to work through the existing backend infrastructure.

Quote:
VS-MLRT already supports flexible output through flexible_inference. This is useful for ONNX models whose output tensor has an arbitrary number of channels, not only 1 or 3. Some models output chroma planes, masks, alpha, optical flow, confidence maps, residuals, or other auxiliary data. Forcing such outputs into a normal RGB or GRAY VapourSynth clip either fails or destroys the model semantics.
With flexible output, VS-MLRT exposes every output channel as a separate VideoNode. Hybrid would not need to guess whether the output is RGB, YUV, UV, mask, alpha, or something else. It could simply expose the channels and let the user or script combine them correctly.
This is already used in VS-MLRT itself: for example, ArtCNN chroma models return separate U and V outputs through flexible_inference_with_fallback, and then VS-MLRT recombines them with the original luma plane into a YUV clip. So this is not a theoretical feature; it is already part of the intended VS-MLRT workflow.
Implementing flexible output in Hybrid would make Hybrid compatible with a broader class of ONNX models without modifying the models and without losing channel semantics.

Suggested Hybrid-level behavior
A practical Hybrid UI/script design could be:

Mode: normal output
    Use vsmlrt.inference()
    Expect output to be directly usable as GRAY/RGB/YUV.

Mode: flexible output
    Use vsmlrt.flexible_inference()
    Return/output channel clips separately:
        output_0
        output_1
        output_2
        ...
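A rough sketch of that two-mode dispatch, under some assumptions: run_model is a hypothetical name, the vsmlrt module is passed in as an argument instead of being imported, and backend, tiling, and other options are left out, which assumes the wrapper's defaults are acceptable for the model.

```python
def run_model(vsmlrt, clip, model_path, flexible=False):
    """Run a model and always return a dict of named output clips.

    vsmlrt is handed in by the caller; in a real script it would be the
    imported vsmlrt module.
    """
    if flexible:
        # Flexible mode: one clip per output channel, named output_0, output_1, ...
        planes = vsmlrt.flexible_inference(clip, model_path)
        return {f"output_{i}": plane for i, plane in enumerate(planes)}
    # Normal mode: a single clip that is already usable as GRAY/RGB/YUV.
    return {"output": vsmlrt.inference(clip, model_path)}
```

Returning a dict in both modes keeps the calling code uniform: it only has to look at the keys to see whether it got one clip or several.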

Then Hybrid could optionally provide helpers:

combine first 3 channels as RGB
combine first 3 channels as YUV
use channel 0 as GRAY
use channels 0/1 as UV
use channel 3 as alpha/mask
export all channels separately

But the important point is that these should be explicit choices, not automatic assumptions.
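One way such explicit helpers could be modelled is a table from helper name to the channel indices it needs, with a check that the model actually has those channels. Everything here is hypothetical (the helper names, HELPERS, pick_channels); it only illustrates "explicit choice plus validation" and is not an existing Hybrid or VS-MLRT API.

```python
from typing import Dict, List, Tuple

# Helper name -> channel indices it needs (hypothetical names).
HELPERS: Dict[str, Tuple[int, ...]] = {
    "rgb": (0, 1, 2),
    "yuv": (0, 1, 2),
    "gray": (0,),
    "uv": (0, 1),
    "alpha": (3,),
}


def pick_channels(num_planes: int, helper: str) -> List[int]:
    """Return the channel indices an explicitly chosen helper would use."""
    if helper == "all":
        # Export every channel separately.
        return list(range(num_planes))
    if helper not in HELPERS:
        raise ValueError(f"unknown helper {helper!r}")
    indices = HELPERS[helper]
    if max(indices) >= num_planes:
        raise ValueError(
            f"helper {helper!r} needs channel {max(indices)}, "
            f"but the model has only {num_planes} channel(s)"
        )
    return list(indices)
```

For a 4-channel model, pick_channels(4, "alpha") selects channel 3, while asking for "rgb" on a 2-channel model raises instead of silently producing a wrong picture.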

One-line summary
Flexible output is important because some ONNX models output tensors, not normal images. VS-MLRT can already expose those tensor channels safely; Hybrid should support that path so it does not reject valid models or incorrectly force their outputs into RGB/GRAY.
