05.01.2023, 03:23
A few comparisons between KNLMeansCL (strength increased by 50%, from 1.2 to 1.8) and FFT3DGPU (sigma reduced by 35%, from 2.0 to 1.3), both filtering YUV:
1080p:
KNLMeansCL: 27.14 fps, 7312.13 kbps
FFT3DGPU mode=2 precision=0: 57.18 fps (111% faster), 6618.81 kbps (-9.5%)
2160p (downscaled to 1080p with Lanczos after filtering):
KNLMeansCL: 7.49 fps, 7143.20 kbps
ConvertBits(10) + FFT3DGPU mode=2 precision=2: 11.98 fps (60% faster), 6369.00 kbps (-10.8%)
ConvertBits(10) + FFT3DGPU mode=1 precision=2: 15.47 fps (107% faster), 6495.53 kbps (-9.1%)
All FFT3DGPU runs above use the tunings described: ConvertBits(10) at 4K, bw=bh=64, wintype=2 and Prefetch(1,7).
Encoder settings:
NVEncC (x64) 7.06 (r2388) by rigaya, Dec 10 2022 12:26:56 (VC 1929/Win)
OS Version Windows 10 x64 (19043) [UTF-8]
CPU AMD FX(tm)-8350 Eight-Core Processor [4.54GHz] (4C/8T)
GPU #0: NVIDIA RTX A2000 (3328 cores, 1200 MHz)[PCIe2x16][527.27]
NVENC / CUDA NVENC API 12.0, CUDA 12.0, schedule mode: sync
Input Buffers CUDA, 46 frames
Input Info y4m(yv12(10bit))->p010 [SSE2], 1920x1080, 60000/1001 fps
Vpp Filters copyHtoD
Output Info H.265/HEVC main10 @ Level 6.2
1920x1080p 1:1 59.940fps (60000/1001fps)
Encoder Preset quality
Rate Control VBR
Multipass none
Bitrate 0 kbps (Max: 768000 kbps)
Target Quality 25.75
Initial QP I:20 P:23 B:25
QP Offset cb:0 cr:0
VBV buf size auto
Lookahead on, 32 frames, Adaptive I, B Insert
GOP length 600 frames
B frames 5 frames [ref mode: middle]
Ref frames 7 frames, MultiRef L0:6 L1:2
AQ on
CU max / min auto / auto
VUI matrix:bt709,range:limited
Others mv:Q-pel
Some suboptimal FFT3DGPU default settings:
2160p:
FFT3DGPU untuned default with banding (no additional Prefetch(1,7), no ConvertBits(10)), fastest setting without picture errors: mode=1, precision=1, bw=bh=32, wintype=1:
12.23 fps, 6451.48 kbps (the improved mode 1 is 26.5% faster)
FFT3DGPU untuned default with banding (no additional Prefetch(1,7), no ConvertBits(10)), fastest setting without picture errors: mode=1, precision=2, bw=bh=32, wintype=1 (precision=2 doesn't reduce banding; only precision=0 does, but that leads to grid errors):
11.94 fps, 6456.07 kbps (the improved mode 1 with ConvertBits(10), Prefetch(1,7) and bw=bh=64 is 29.6% faster)
1080p:
FFT3DGPU untuned mode=2, precision=0 (no additional Prefetch(1,7), bw=bh=32, wintype=1):
29.66 fps, 4372.59 kbps (the improved mode 2 with Prefetch(1,7) and bw=bh=64 is 92.8% faster)
-> Prefetch becomes more important if other multithreaded filters that need more resources, such as a simple resizer, follow FFT3DGPU.
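
For anyone who wants to double-check the percentages quoted in this post, here is a minimal Python sketch that recomputes them from the raw fps and kbps figures. The helper name pct_change and the list layout are only my own illustration and are not part of any tool mentioned here; the numbers are copied from the results above.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new / old - 1.0) * 100.0


# (label, baseline fps, compared fps), figures copied from this post
SPEED = [
    ("1080p FFT3DGPU mode=2 vs KNLMeansCL", 27.14, 57.18),
    ("2160p FFT3DGPU mode=2 vs KNLMeansCL", 7.49, 11.98),
    ("2160p FFT3DGPU mode=1 vs KNLMeansCL", 7.49, 15.47),
    ("2160p tuned mode=1 vs untuned precision=1", 12.23, 15.47),
    ("2160p tuned mode=1 vs untuned precision=2", 11.94, 15.47),
    ("1080p tuned mode=2 vs untuned mode=2", 29.66, 57.18),
]

# (label, KNLMeansCL kbps, FFT3DGPU kbps), figures copied from this post
BITRATE = [
    ("1080p FFT3DGPU mode=2 vs KNLMeansCL", 7312.13, 6618.81),
    ("2160p FFT3DGPU mode=2 vs KNLMeansCL", 7143.20, 6369.00),
    ("2160p FFT3DGPU mode=1 vs KNLMeansCL", 7143.20, 6495.53),
]

if __name__ == "__main__":
    for label, old, new in SPEED:
        print(f"{label}: {pct_change(new, old):+.1f}% fps")
    for label, old, new in BITRATE:
        print(f"{label}: {pct_change(new, old):+.1f}% kbps")
```

The results agree with the figures quoted above to within rounding.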