Skip to content

Commit 3f80956

Browse files
committed
update coments
1 parent ac428a3 commit 3f80956

File tree

2 files changed

+37
-24
lines changed

2 files changed

+37
-24
lines changed

src/torchcodec/decoders/_core/VideoDecoder.cpp

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1403,7 +1403,10 @@ torch::Tensor VideoDecoder::convertFrameToTensorUsingFilterGraph(
14031403
ffmpegStatus =
14041404
av_buffersink_get_frame(filterState.sinkContext, filteredFrame.get());
14051405
TORCH_CHECK_EQ(filteredFrame->format, AV_PIX_FMT_RGB24);
1406-
std::vector<int64_t> shape = {filteredFrame->height, filteredFrame->width, 3};
1406+
auto frameDims = getHeightAndWidthFromResizedAVFrame(*filteredFrame.get());
1407+
int height = frameDims.height;
1408+
int width = frameDims.width;
1409+
std::vector<int64_t> shape = {height, width, 3};
14071410
std::vector<int64_t> strides = {filteredFrame->linesize[0], 3, 1};
14081411
AVFrame* filteredFramePtr = filteredFrame.release();
14091412
auto deleter = [filteredFramePtr](void*) {
@@ -1426,6 +1429,10 @@ VideoDecoder::~VideoDecoder() {
14261429
}
14271430
}
14281431

1432+
FrameDims getHeightAndWidthFromResizedAVFrame(const AVFrame& resizedAVFrame) {
1433+
return FrameDims(resizedAVFrame.height, resizedAVFrame.width);
1434+
}
1435+
14291436
FrameDims getHeightAndWidthFromOptionsOrMetadata(
14301437
const VideoDecoder::VideoStreamDecoderOptions& options,
14311438
const VideoDecoder::StreamMetadata& metadata) {

src/torchcodec/decoders/_core/VideoDecoder.h

Lines changed: 29 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -429,30 +429,32 @@ class VideoDecoder {
429429
// MaybePermuteHWC2CHW().
430430
//
431431
// Also, importantly, the way we figure out the the height and width of the
432-
// output frame varies and depends on the decoding entry-point:
433-
// - In all cases, if the user requested specific height and width from the
434-
// options, we honor that. Otherwise we fall into one of the categories below.
435-
// - In Batch decoding APIs (e.g. getFramesAtIndices), we get height and width
436-
// from the stream metadata, which itself got its value from the CodecContext,
437-
// when the stream was added.
438-
// - In single frames APIs:
439-
// - On CPU we get height and width from the AVFrame.
440-
// - On GPU, we get height and width from the metadata (same as batch APIs)
432+
// output frame tensor varies, and depends on the decoding entry-point. In
433+
// *decreasing order of accuracy*, we use the following sources for determining
434+
// height and width:
435+
// - getHeightAndWidthFromResizedAVFrame(). This is the height and width of the
436+
// AVframe, *post*-resizing. This is only used for single-frame decoding APIs,
437+
// on CPU, with filtergraph.
438+
// - getHeightAndWidthFromOptionsOrAVFrame(). This is the height and width from
439+
// the user-specified options if they exist, or the height and width of the
440+
// AVFrame *before* it is resized. In theory, i.e. if there are no bugs within
441+
// our code or within FFmpeg code, this should be exactly the same as
442+
// getHeightAndWidthFromResizedAVFrame(). This is used by single-frame
443+
// decoding APIs, on CPU, with swscale.
444+
// - getHeightAndWidthFromOptionsOrMetadata(). This is the height and width from
445+
// the user-specified options if they exist, or the height and width form the
446+
// stream metadata, which itself got its value from the CodecContext, when the
447+
// stream was added. This is used by batch decoding APIs, or by GPU-APIs (both
448+
// batch and single-frames).
441449
//
442-
// These 2 strategies are encapsulated within
443-
// getHeightAndWidthFromOptionsOrMetadata() and
444-
// getHeightAndWidthFromOptionsOrAVFrame(). The reason they exist is to make it
445-
// very obvious which logic is used in which place, and they allow for `git
446-
// grep`ing.
447-
//
448-
// The source of truth for height and width really is the AVFrame: it's the
449-
// decoded ouptut from FFmpeg. The info from the metadata (i.e. from the
450-
// CodecContext) may not be as accurate. However, the AVFrame is only available
451-
// late in the call stack, when the frame is decoded, while the CodecContext is
452-
// available early when a stream is added. This is why we use the CodecContext
453-
// for pre-allocating batched output tensors (we could pre-allocate those only
454-
// once we decode the first frame to get the info frame the AVFrame, but that's
455-
// a more complex logic).
450+
// The source of truth for height and width really is the (resized) AVFrame:
451+
// it's the decoded ouptut from FFmpeg. The info from the metadata (i.e. from
452+
// the CodecContext) may not be as accurate. However, the AVFrame is only
453+
// available late in the call stack, when the frame is decoded, while the
454+
// CodecContext is available early when a stream is added. This is why we use
455+
// the CodecContext for pre-allocating batched output tensors (we could
456+
// pre-allocate those only once we decode the first frame to get the info frame
457+
// the AVFrame, but that's a more complex logic).
456458
//
457459
// Because the sources for height and width may disagree, we may end up with
458460
// conflicts: e.g. if we pre-allocate a batch output tensor based on the
@@ -466,6 +468,10 @@ struct FrameDims {
466468
FrameDims(int h, int w) : height(h), width(w) {}
467469
};
468470

471+
// There's nothing preventing you from calling this on a non-resized frame, but
472+
// please don't.
473+
FrameDims getHeightAndWidthFromResizedAVFrame(const AVFrame& resizedAVFrame);
474+
469475
FrameDims getHeightAndWidthFromOptionsOrMetadata(
470476
const VideoDecoder::VideoStreamDecoderOptions& options,
471477
const VideoDecoder::StreamMetadata& metadata);

0 commit comments

Comments
 (0)