Hello,
I'm attempting to write a simple Windows Media Foundation command line tool that uses IMFSourceReader and IMFSinkWriter to load a video, read the video and audio as uncompressed streams, and re-encode them to H.264/AAC with some specific hard-coded settings.
Gist of the full simple program: https://gist.github.com/m1keall1son/33ebaf1271a5234a4ed1d8ba765eafd6
A test video: https://www.videvo.net/video/alpaca-on-green-screen/3442/
(Note: the videos I've been testing with are all stereo, 48000 Hz sample rate.)
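For context, here is a minimal sketch of the reader/writer wiring I'm describing (simplified from the gist; error handling and the full attribute configuration are omitted):

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>

// Assumes MFStartup(MF_VERSION) has already been called.
void SetupReaderAndWriter(IMFSourceReader** reader, IMFSinkWriter** writer)
{
    MFCreateSourceReaderFromURL(L"input.mp4", nullptr, reader);

    // Ask the reader to decode audio to uncompressed PCM...
    IMFMediaType* pcmType = nullptr;
    MFCreateMediaType(&pcmType);
    pcmType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio);
    pcmType->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_PCM);
    (*reader)->SetCurrentMediaType(MF_SOURCE_READER_FIRST_AUDIO_STREAM, nullptr, pcmType);

    // ...and video to an uncompressed format (RGB32 here).
    IMFMediaType* rgbType = nullptr;
    MFCreateMediaType(&rgbType);
    rgbType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    rgbType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
    (*reader)->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, nullptr, rgbType);

    // The sink writer then gets H.264 / AAC output types with the
    // hard-coded settings (AddStream / SetInputMediaType / BeginWriting).
    MFCreateSinkWriterFromURL(L"output.mp4", nullptr, nullptr, writer);
}
```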
The program works; however, in some cases, when I compare the newly output video to the original in an editing program, the copied video streams match, but the audio stream of the copy is prefixed with some amount of silence and the audio is offset, which is unacceptable in my situation.
audio samples:

```
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy     - |[silence] [silence] [silence] [audio1] [audio2] [audio3] ... etc
```
In these cases the first video frames coming in have a **non-zero** timestamp, but the first audio frames have a timestamp of 0.
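For reference, the timestamps I'm referring to come straight from IMFSourceReader::ReadSample, roughly like this:

```cpp
// Inside the read loop: pull the next sample and its timestamp.
DWORD streamIndex = 0, streamFlags = 0;
LONGLONG timeStamp = 0; // presentation time in 100ns units
IMFSample* sample = nullptr;
reader->ReadSample(MF_SOURCE_READER_ANY_STREAM, 0,
                   &streamIndex, &streamFlags, &timeStamp, &sample);
// For the problematic files: first video sample -> timeStamp > 0,
//                            first audio sample -> timeStamp == 0.
```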
I would like to produce a copy whose first video and audio frames both start at 0, so I first attempted to subtract that initial timestamp (`videoOffset`) from all subsequent video frames. That produced the video I wanted, but resulted in this situation with the audio:
```
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy     - |[audio4] [audio5] [audio6] [audio7] [audio8] ... etc
```
The audio track is now shifted in the other direction by a small amount and still doesn't align.
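The rebasing itself was just a subtraction, roughly like this (`videoFrameTimeStamp` and `WriteVideoBuffer` are stand-ins for the video-side equivalents of my audio variables and helper):

```cpp
// Capture the first video sample's timestamp, then rebase everything
// after it so the copied video stream starts at 0 (100ns units).
static LONGLONG videoOffset = -1;
if (videoOffset < 0)
    videoOffset = videoFrameTimeStamp;

LONGLONG videoTime = videoFrameTimeStamp - videoOffset; // first frame -> 0
WriteVideoBuffer(dataPtr, dataSize, videoTime, llDuration);
```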
I've been able to fix the sync alignment and offset the video stream to start at 0 with the following code, inserted at the point where the audio sample data is passed to the IMFSinkWriter:
```cpp
// inside the read-sample while loop ...
// LONGLONG llDuration          - duration of the currently read sample
// DWORD audioOffset            - global audio offset, starts as 0
// LONGLONG audioFrameTimeStamp - timestamp of the currently read sample

// add some amount of silence in intervals of 1024 samples
static bool runOnce{ false };
if (!runOnce)
{
    size_t numberOfSilenceBlocks = 1; // how to derive how many I need!? It's arbitrary
    size_t samples = 1024 * numberOfSilenceBlocks;
    audioOffset = samples * 10000000 / audioSamplesPerSecond;
    std::vector<uint8_t> silence(samples * audioChannels * bytesPerSample, 0);
    WriteAudioBuffer(silence.data(), silence.size(), audioFrameTimeStamp, audioOffset);
    runOnce = true;
}

LONGLONG audioTime = audioFrameTimeStamp + audioOffset;
WriteAudioBuffer(dataPtr, dataSize, audioTime, llDuration);
```
Oddly, this creates an output video file that matches the original.
```
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy     - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
```
The solution was to insert extra silence, in blocks of 1024 samples, at the beginning of the audio stream. It doesn't matter what audio chunk sizes IMFSourceReader provides; the padding is always in multiples of 1024 samples. (Notably, 1024 samples is the size of one AAC frame.)
(Note: the linked video only requires one extra 1024-sample block to sync.)
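For scale, at my 48000 Hz sample rate each padding block corresponds to roughly 21.3 ms, matching the offset computation in the snippet above:

```cpp
// Duration of one 1024-sample silence block at 48000 Hz, in 100ns units:
LONGLONG blockDuration = 1024LL * 10000000 / 48000; // = 213333, ~21.3 ms
```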
A screenshot of the audio track offsets from the different attempts: https://i.stack.imgur.com/PP29K.png
My problem is that there seems to be no reason for a silence offset of this size to exist. Why do I need it? How do I know how much I need? I stumbled across the 1024-sample silence block solution after days of fighting this problem.
Some videos seem to need only one padding block, some need two or more, and some need no extra padding at all!
My questions here are:
- Does anyone know why this is happening?
- Am I using Media Foundation incorrectly in this situation to cause this?
- If my usage is correct, how can I use the video metadata to determine how many 1024-sample blocks of silence I need to pad the audio stream with, when the video stream starts at a later time than the audio stream?
Other random things I have tried:
- Increasing the duration of the first video frame to account for the offset: produces no effect.