Hi,
When I investigated using the GPU for encoding, I found a lot of people mentioning that the speed advantage is only marginal. Some claimed 20 %, but when I tested it myself I only saw an advantage of 1-2 %.
I thought that logically and technically there must be something wrong, so I went and inspected my app's pipeline again. In most "regular" cases developers use the GPU for decoding/encoding/transcoding existing files, so the data sits in system memory, is moved to the GPU and then back again. The video files already exist and the frames can be consumed as fast as possible, so the encoding looks like:
sysmem(cpu)->vram(gpu)->sysmem(cpu)
In my case I have a "non-regular" GPU encoding at work. When I set my live encoding to, let's say, 25 fps, a frame becomes available for encoding only every 40 milliseconds. That alone eats into the GPU speed advantage already, but the more important problem is the whole delivery process to the GPU. My app's encoding looks like this:
vram(gpu)->sysmem(cpu)->sysmem(cpu)->vram(gpu)->sysmem(cpu)
The source in my encoding is a D3D surface (IDirect3DSurface9 or ID3D10/11Texture2D) and is copied to system memory so that the CPU- or GPU-implemented encoders can process it. The two sysmem steps in a row are because of the RGBA to BGRA color conversion: in sysmem the bytes get swapped and copied into a media buffer which is also in sysmem. And even though I am using SSE2 and AVX "non-temporal" store functions, this intermediate step of course slows down the whole encoding process by a good margin. Then the data gets moved to the GPU, and after encoding is done it gets moved back to system memory for storage.
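Just to illustrate what that intermediate sysmem step looks like, here is a minimal sketch of the kind of CPU swizzle I mean (not my actual code; it assumes 16-byte-aligned RGBA32 buffers whose size is a multiple of 16 bytes):

#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

// Swap R and B in each 32-bit RGBA pixel and write with non-temporal stores.
// Assumes src/dst are 16-byte aligned and byteCount is a multiple of 16.
void SwapRgbaToBgra(const uint8_t* src, uint8_t* dst, size_t byteCount)
{
    const __m128i maskR  = _mm_set1_epi32(0x000000FF); // R bytes
    const __m128i maskB  = _mm_set1_epi32(0x00FF0000); // B bytes
    const __m128i maskGA = _mm_set1_epi32(0xFF00FF00); // G and A stay put

    for (size_t i = 0; i < byteCount; i += 16)
    {
        __m128i px = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i r  = _mm_slli_epi32(_mm_and_si128(px, maskR), 16); // move R into B position
        __m128i b  = _mm_srli_epi32(_mm_and_si128(px, maskB), 16); // move B into R position
        __m128i ga = _mm_and_si128(px, maskGA);
        // Non-temporal store: bypass the cache, the data is only read again later.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i),
                         _mm_or_si128(ga, _mm_or_si128(r, b)));
    }
    _mm_sfence(); // make the streaming stores globally visible
}

Even with the streaming stores this is one full extra pass over every frame, which is exactly the cost I would like to get rid of.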
After I had inspected this, my mind went straight to the barricades: why all that senseless moving around when the data is already in vram and the GPU has full access to it? I could even use the GPU for the color conversion and would not have to move the data to system memory at all. The encoding should look like this:
vram(gpu)->sysmem(cpu)
I read about the "Hardware Handshake Sequence" and that one hardware MFT can connect its output to another hardware MFT's input by using MFT_CONNECTED_STREAM_ATTRIBUTE and MFT_CONNECTED_TO_HW_STREAM. As I understand it, it is not possible for hardware MFTs to consume data directly from device memory. The closest I could find was this thread, where it seems that it is possible with Intel Quick Sync. In my opinion it is technically possible with all GPU encoders, be it NVIDIA, AMD or Intel; the question is how much extra work outside of Media Foundation needs to be done. Or is it possible to rewrite the MFT implementations these vendors did and make the MFT consume from hardware memory?
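For reference, my reading of that handshake is that the topology loader wires the two stream attribute stores together roughly like this (only a sketch of the documented sequence, error handling shortened, nothing I have verified end to end):

#include <mftransform.h>
#include <mfapi.h>
#include <wrl/client.h>

// Rough sketch of the Hardware Handshake Sequence between an upstream
// hardware MFT and a downstream hardware MFT (e.g. the encoder).
HRESULT ConnectHardwareMfts(IMFTransform* upstream, DWORD upstreamOutputId,
                            IMFTransform* downstream, DWORD downstreamInputId)
{
    Microsoft::WRL::ComPtr<IMFAttributes> upstreamOut, downstreamIn;

    HRESULT hr = upstream->GetOutputStreamAttributes(upstreamOutputId, &upstreamOut);
    if (FAILED(hr)) return hr;
    hr = downstream->GetInputStreamAttributes(downstreamInputId, &downstreamIn);
    if (FAILED(hr)) return hr;

    // Point the upstream output stream at the downstream input stream ...
    hr = upstreamOut->SetUnknown(MFT_CONNECTED_STREAM_ATTRIBUTE, downstreamIn.Get());
    if (FAILED(hr)) return hr;

    // ... and mark both streams as connected to a hardware stream.
    hr = upstreamOut->SetUINT32(MFT_CONNECTED_TO_HW_STREAM, TRUE);
    if (FAILED(hr)) return hr;
    return downstreamIn->SetUINT32(MFT_CONNECTED_TO_HW_STREAM, TRUE);
}

But this only covers two hardware MFTs talking to each other; my source is not an MFT, it is a plain D3D surface in my own code, which is exactly where the documentation leaves me hanging.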
Another thought of mine is a slightly futuristic scenario, as it relates to "Unified Memory". Theoretically, on a platform that uses this model, like, let's say, hUMA from AMD, it should be possible to use MFCreateDXGISurfaceBuffer and send the sample to the hardware MFT. Internally it would only hand over the pointer, and the MFT could consume it directly without the data having to be moved beforehand. All of that, of course, would only be possible if Microsoft implemented support for these architectures in their memory handling.
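Roughly what I mean by handing the surface straight to the MFT (again only a sketch, assuming an ID3D11Texture2D source and ignoring the question of whether the encoder MFT will actually accept DXGI buffers):

#include <mfapi.h>
#include <mfidl.h>
#include <d3d11.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Wrap a D3D11 texture in an IMFSample without copying it to system memory first.
HRESULT WrapTextureInSample(ID3D11Texture2D* texture, LONGLONG timestamp100ns,
                            LONGLONG duration100ns, IMFSample** ppSample)
{
    ComPtr<IMFMediaBuffer> buffer;
    // The media buffer only references the DXGI surface; no readback to sysmem here.
    HRESULT hr = MFCreateDXGISurfaceBuffer(__uuidof(ID3D11Texture2D), texture,
                                           0 /*subresource*/, FALSE, &buffer);
    if (FAILED(hr)) return hr;

    ComPtr<IMFSample> sample;
    hr = MFCreateSample(&sample);
    if (FAILED(hr)) return hr;

    hr = sample->AddBuffer(buffer.Get());
    if (FAILED(hr)) return hr;

    sample->SetSampleTime(timestamp100ns);
    sample->SetSampleDuration(duration100ns);

    *ppSample = sample.Detach();
    return S_OK;
}

The sample would then go to the encoder via ProcessInput, provided the MFT is D3D11-aware (MF_SA_D3D11_AWARE) and has been given the device manager through MFT_MESSAGE_SET_D3D_MANAGER; whether the vendor MFTs actually take that path instead of copying internally is exactly my question.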
Phew, walls of text, but I hope that someone has knowledge or maybe just an idea to share. I will investigate this further, as it seems to be the fastest possible way to encode D3D surfaces.
regards
coOkie