Opened 6 years ago

Closed 3 years ago

Last modified 3 years ago

#7706 closed defect (fixed)

20-30% perf drop in FFmpeg (H264) transcode performance with VAAPI

Reported by: eero-t
Owned by:
Priority: important
Component: avcodec
Version: git-master
Keywords: vaapi regression
Cc: linjie.fu@intel.com
Blocked By:
Blocking:
Reproduced by developer: yes
Analyzed by developer: no

Description

Summary of the bug:

VAAPI H264 transcode performance dropped 20-30% between the following commits:

There's no drop with the QSV backend. Within the indicated commit range there's a series of changes to FFmpeg VAAPI support (and a couple of other changes).

Setup:

  • Ubuntu 18.04
  • drm-tip kernel v4.20
  • FFmpeg and iHD driver built from git
  • HW supported by iHD driver

How to reproduce:

$ ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.264 -c:v h264_vaapi -y output.h264

I see the drop on all platforms supported by the iHD driver, i.e. Broadwell and newer. With the i965 driver (which also supports older platforms), the drop is visible on Braswell too, but I don't see it on Sandy Bridge or Haswell.

This drop may also affect other codecs, but I've tested only H264.

The GPU runs at full speed both before and after the change, but both it and the CPU use less power, i.e. they're underutilized compared to the earlier situation. Running many FFmpeg instances at the same time, so that the HW is definitely fully utilized, still gives the same perf => FFmpeg's VAAPI usage seems to have become more synchronous.

Change History (17)

comment:1 by Carl Eugen Hoyos, 6 years ago

Keywords: vaapi regression added
Version: unspecified → git-master

Please use git bisect to find the commit introducing the regression.

comment:2 by eero-t, 6 years ago

Did a manual bisect (the tested commits go from newer to older):

=> encode perf regression came from:

commit 5fdcf85bbffe7451c227478fda62da5c0938f27d
Author:     Mark Thompson <sw@jkqxz.net>
AuthorDate: Thu Dec 20 20:39:56 2018 +0000
Commit:     Mark Thompson <sw@jkqxz.net>
CommitDate: Wed Jan 23 23:04:11 2019 +0000

    vaapi_encode: Convert to send/receive API
    
    This attaches the logic of picking the mode of the next picture to
    the output, which simplifies some choices by removing the concept of
    the picture for which input is not yet available.  At the same time,
    we allow more complex reference structures and track more reference
    metadata (particularly the contents of the DPB) for use in the
    codec-specific code.
    
    It also adds flags to explicitly track the available features of the
    different codecs.  The new structure also allows open-GOP support, so
    that is now available for codecs which can do it.

comment:3 by Carl Eugen Hoyos, 6 years ago

Component: undetermined → avcodec
Priority: normal → important
Status: new → open

comment:4 by eero-t, 6 years ago

VAAPI transcode performance is now slower than with QSV, whereas earlier it was in most cases faster (at least for H264, on Intel).

The guilty commit is not codec-specific, so it likely regresses VAAPI encoding perf for codecs other than H264 as well.

in reply to:  4 comment:5 by eero-t, 6 years ago

Replying to eero-t:

The guilty commit is not codec-specific, so it likely regresses VAAPI encoding perf for codecs other than H264 as well.

Ticket #7797 could also be due to this regression, as there has been no improvement to VA-API performance since this regression in January (apart from the drm-tip kernel v4.20 -> 5.0 upgrade).

comment:6 by hbommi, 5 years ago

The VAAPI performance drop is not seen on Kaby Lake or Coffee Lake with kernel version 4.20 and the regressing patch.

comment:7 by Linjie.Fu, 5 years ago

Cc: linjie.fu@intel.com added
Reproduced by developer: set

comment:8 by Linjie.Fu, 5 years ago

One possible reason:

With the old encode2 API, vaBeginPicture and vaSyncSurface are called in a more "asynchronous" way:

Two pictures are sent to the encoder before any vaSyncSurface, so the encoder is not blocked by the sync and map procedure.

[mpeg2_vaapi @ 0x55a3ad371200] vaBeginPicture happens here.
    Last message repeated 1 times
[mpeg2_vaapi @ 0x55a3ad371200] vaSyncSurface happens here.
    Last message repeated 1 times

With the new send/receive API, each vaBeginPicture is strictly followed by its vaSyncSurface:

[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
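
To make the difference concrete, here is a minimal libva sketch of the two submission patterns (my illustration, not FFmpeg's actual code: the function shapes and per-picture buffer bookkeeping are simplified assumptions, and error handling is omitted):

#include <va/va.h>

/* Old encode2-style pattern: picture i is submitted before picture i-1 is
 * synced, so two pictures are in flight and the sync/map of one picture
 * overlaps with HW encoding of the next. */
static void encode_pipelined(VADisplay dpy, VAContextID ctx,
                             VASurfaceID *src, VABufferID **bufs,
                             int *nbufs, int n)
{
    for (int i = 0; i < n; i++) {
        vaBeginPicture(dpy, ctx, src[i]);
        vaRenderPicture(dpy, ctx, bufs[i], nbufs[i]);
        vaEndPicture(dpy, ctx);
        if (i > 0)
            vaSyncSurface(dpy, src[i - 1]); /* sync the *previous* picture */
    }
    if (n > 0)
        vaSyncSurface(dpy, src[n - 1]);     /* drain the last picture */
}

/* Send/receive pattern after the regressing commit: every submission is
 * immediately followed by a sync on the same surface, so the HW encoder
 * idles while each output is synced and mapped. */
static void encode_serialized(VADisplay dpy, VAContextID ctx,
                              VASurfaceID *src, VABufferID **bufs,
                              int *nbufs, int n)
{
    for (int i = 0; i < n; i++) {
        vaBeginPicture(dpy, ctx, src[i]);
        vaRenderPicture(dpy, ctx, bufs[i], nbufs[i]);
        vaEndPicture(dpy, ctx);
        vaSyncSurface(dpy, src[i]);         /* blocks until this picture is done */
    }
}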

comment:9 by Linjie.Fu, 5 years ago

Currently, vaapi encodes a picture as soon as all of its reference pictures are ready, and then outputs it immediately by calling vaapi_encode_output() (which does the vaSyncSurface).

While the output procedure runs, the hardware could be handling further encoding tasks in the meantime, for better performance.

So a more efficient approach is to encode, within one receive_packet() call, all the pictures whose references are ready, and to output the packet while the encoder is working on a new picture that is waiting for its references.

This is what vaapi originally did before the regression; restoring it improves performance by ~20%.
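
A rough sketch of that idea (hedged: modeled on the internals of libavcodec/vaapi_encode.c, but all_refs_issued() and oldest_issued_picture() are invented helper names, and this is not the actual patch):

#include "vaapi_encode.h"

static int receive_packet_sketch(AVCodecContext *avctx, AVPacket *pkt)
{
    VAAPIEncodeContext *ctx = avctx->priv_data;
    VAAPIEncodePicture *pic;
    int err;

    /* First issue every picture whose references have already been
     * issued, without syncing, so the HW queue stays full. */
    for (pic = ctx->pic_start; pic; pic = pic->next) {
        if (!pic->encode_issued && all_refs_issued(pic)) { /* assumed helper */
            err = vaapi_encode_issue(avctx, pic);
            if (err < 0)
                return err;
        }
    }

    /* Only now sync and map the oldest issued picture; the HW keeps
     * encoding the pictures queued above while we wait on this one. */
    pic = oldest_issued_picture(ctx); /* assumed helper */
    if (!pic)
        return AVERROR(EAGAIN);
    return vaapi_encode_output(avctx, pic, pkt);
}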

CMD:
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 \
  -hwaccel_output_format vaapi -i bbb_sunflower_1080p_30fps_normal.mp4 \
  -c:v h264_vaapi -f h264 -y /dev/null

Source:
https://download.blender.org/demo/movies/BBB/

Before: ~164 fps

After: ~198 fps

However, in my experiments it still didn't fully match the pre-regression performance.

Hi Eero,

Would you please help to verify this patch:
https://patchwork.ffmpeg.org/patch/16156/

comment:10 by eero-t, 5 years ago

I've started some VA-API tests with that patch applied to FFmpeg. I'll check the results (FPS & PSNR) tomorrow.

comment:11 by eero-t, 5 years ago

I ran several variants of 6 transcode operations and a few other media tests:

  • In 8-bit (max Full HD) AVC transcode tests, perf improves by up to 20% when running a single transcode operation
  • In the 10-bit 4K HEVC transcode [1], the perf increase was 3-4%
  • When running multiple transcode operations in parallel, there was no perf change (all changes were within daily variance)
  • There were no performance regressions

Even with the patch, there's still a very clear gap to the original January performance. Because the perf drop concerned only single transcode operations (parallel ones were not impacted), it's possible that some part of the gap is due to P-state power management (I'm intentionally not fixing CPU & GPU clocks in my tests).

I was testing this on a KBL i7 GT3e NUC with a 28W TDP. Some observations on power usage:

  • In the tests that improve most, the patch increases GPU power usage without increasing CPU power usage, i.e. FFmpeg is better able to feed work to the GPU
  • When many instances of the same test run in parallel, things are TDP limited. Either there's no change in power usage, or the patch causes slightly higher CPU usage, which results in the GPU using less power. I have no idea how the latter behavior maintains the same speed; maybe the P-state governor is better able to save GPU power with the interaction patterns caused by the patch?

[1] Note: I'm seeing a marginal, reproducible quality drop (0.1% SSIM, 2-3% PSNR) in this test-case: https://trac.ffmpeg.org/ticket/8328

I assume that's related to frame timings, as with QSV, rather than a change in encoded frame contents.

comment:12 by Linjie.Fu, 5 years ago

Thanks for verifying this patch.
There is still room for performance improvement.

So based on the test results, does this regression affect single-process encoding only?
If so, one possible reason is that multi-process encoding makes full use of the resources (hardware/encoder), while a single process leaves the encoding procedure idle or waiting for some time.

It's also a bit odd that HEVC benefits so little, since the modification is in the general vaapi encode code path.

And since the test covers the whole transcoding procedure, how does the performance of decode and encode look separately?

in reply to:  12 comment:13 by eero-t, 5 years ago

Replying to fulinjie:

So based on the test results, does this regression affect single-process encoding only?

Looking at the old results, correct. Running multiple (5-50) parallel transcode processes wasn't affected by the regression, even when they were not TDP limited.

If so, one possible reason is that multi-process encoding makes full use of the resources (hardware/encoder), while a single process leaves the encoding procedure idle or waiting for some time.

Yes, a single 8-bit AVC transcode doesn't fill any of the GPU engines to 100%; that happens only with multiple parallel transcode operations (easy to see from the IGT "intel_gpu_top" output).

If the GPU is "full" all the time, work needs to be queued. For average throughput (= what I'm measuring) it doesn't matter when work enters the queue; the GPU is fully utilized anyway. Your patch might help a bit with VA-API latency in the multiple parallel transcode cases, but I don't measure that.

It's also a bit odd that HEVC benefits so little, since the modification is in the general vaapi encode code path.

My HEVC test-case is 10-bit instead of 8-bit, and 4K instead of 2K or smaller, so it processes >4x more data than my AVC test-cases. HEVC encoding is also heavier.
=> As each frame takes longer, feeding the GPU in time is less of a problem for keeping average GPU utilization high.
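
To put illustrative numbers on that (rough arithmetic, not measurements): if the per-frame submit+sync overhead is s and the HW encode time per frame is t, serialized submission leaves the engine idle roughly s/(s+t) of the time. A 4K 10-bit HEVC frame plausibly takes several times longer to encode than a 2K 8-bit AVC frame, which shrinks that idle fraction correspondingly and would match the much smaller gain seen here.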

(I was a bit worried about potential extra CPU usage, because that comes out of the power/temperature "budget" shared with the iGPU, but it seems to be low enough not to be a problem.)

And since the test covers the whole transcoding procedure, how does the performance of decode and encode look separately?

I did some HEVC decode tests (on 2K 10-bit data) with and without hwdownload, and as expected, decode perf wasn't impacted.

(RAW data encoding is of less interest to me, as the input data is so large that end-to-end perf can be bottlenecked by disk/network transfer rather than GPU usage.)

comment:14 by wenbin,chen, 3 years ago

Resolution: fixed
Status: open → closed

comment:15 by eero-t, 3 years ago

Verified.

There's a large improvement in VA-API performance for all single transcode operations, whether for small-resolution AVC or 10-bit 4K HEVC transcode. In a few cases the improvement goes significantly beyond the pre-regression level, and in most cases it at least returns to the same level.

For parallel transcode operations which (more than) fill the whole GPU, there can be up to a couple of percent regression, but in at least one such case there was also a (1%) improvement.

=> In total, this is a significant improvement over the previous state.

comment:16 by Balling, 3 years ago

For parallel transcode operations which (more than) fill the whole GPU, there can be up to a couple of percent regression

Command?

comment:17 by eero-t, 3 years ago

For parallel transcode operations which (more than) fill the whole GPU, there can be up to a couple of percent regression

Command?

The general use-case for them is:

  • Start dozen(s) of *identical* FFmpeg transcode operations in parallel
  • Calculate the resulting average FPS over all of them

(Tested transcode use-cases are HEVC->HEVC, AVC->AVC, AVC->MPEG2, and include several different resolutions.)

While one use-case goes a couple of percent down on one HW, it goes up on another HW, and for another use-case the change is the other way round. For every use-case where the result goes marginally down on some HW, it goes marginally up on another. While the results are reproducible, there's no uniform regression from the fix, i.e. that effect can be ignored.

(With commit d165ce22a4a7cc4ed60238ce8f3d5dcbbad3e266 doing slightly more work while still improving perf, that kind of behaviour could even be expected, as both will affect kernel scheduling.)

Note that after this fix to a huge perf regression, which in some cases improves things even beyond just fixing the regression, VA-API is now faster in *all* of those cases than doing the identical thing through the QSV API. IMHO, what should be looked at now is QSV perf.
