My configuration is as follows:
Graphics Devices:
Name Version State
AMD Radeon HD 7900 Series 16.150.2211.0 Active
Intel(R) HD Graphics 4600 10.18.15.4256
System info:
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Microsoft Windows 10 Pro
Arch: 64-bit
With a 1.16 session (Intel(R)_Media_SDK_2016.0.2), I encode H.264 with the following parameters:

parms.AsyncDepth = 4;
parms.IOPattern = MFX_IOPATTERN_IN_VIDEO_MEMORY;
parms.mfx.CodecId = MFX_CODEC_AVC;
parms.mfx.CodecProfile = MFX_PROFILE_AVC_MAIN;
parms.mfx.EncodedOrder = 0;
parms.mfx.FrameInfo.FourCC = MFX_FOURCC_NV12;
parms.mfx.FrameInfo.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
parms.mfx.FrameInfo.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
parms.mfx.FrameInfo.Width = 1280;
parms.mfx.FrameInfo.Height = 720;
parms.mfx.FrameInfo.CropX = 0;
parms.mfx.FrameInfo.CropY = 0;
parms.mfx.FrameInfo.CropW = 1280;
parms.mfx.FrameInfo.CropH = 720;
parms.mfx.GopRefDist = 3;
parms.mfx.GopPicSize = 60;
parms.mfx.IdrInterval = 0;
parms.mfx.NumRefFrame = 1;
parms.mfx.NumSlice = 0;
parms.mfx.RateControlMethod = MFX_RATECONTROL_CBR;
parms.mfx.TargetUsage = MFX_TARGETUSAGE_BALANCED;
parms.mfx.TargetKbps = 5000;

I use a D3D11FrameAllocator and MFX_IMPL_HARDWARE2 (my main graphics device is the AMD card), together with MFX_IMPL_HARDWARE_ANY | MFX_IMPL_VIA_D3D11.
My encode speed was 179 ms per GOP (60 frames, 1280 x 720, 5000 kbps), which is around 3 ms per frame. My decode speed was 197 ms per GOP, which also rounds to about 3 ms per frame, so let's call them the same: approximately 333 frames per second.
How do these numbers compare to the theoretical maximums? We want to encode 2 streams of 1280 x 720 at 60 fps and decode 2, 3 or 4 (or more) streams simultaneously. We have our own pipeline that further processes decoded GOPs for a scientific/industrial application, so we aren't using VPP. Apart from TargetUsage, is there any other way of squeezing more performance out of the encoder or decoder? When profiling, I noticed that the vast majority of processor time (> 85%) is spent locking the D3D11 surfaces. Can we speed this up in any way? For example, is there optimised code around for converting UYUV420 to NV12 (and back again)? Our base format is UYUV420, and I haven't been able to find anything suitable on Google.
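For illustration, here is a minimal, unoptimised sketch of the conversion I have in mind, assuming "UYUV420" means a planar YUV 4:2:0 layout with separate U and V planes (NV12 keeps the Y plane as-is and interleaves the U/V samples). A production version would honour the surface pitch returned by the D3D11 lock, and would use SSE/AVX shuffles or a library such as libyuv (which provides an I420ToNV12 routine) rather than this scalar loop:

```cpp
#include <cstdint>
#include <cstring>

// Scalar reference conversion: planar YUV 4:2:0 (separate Y, U, V planes)
// to NV12 (Y plane followed by an interleaved UV plane).
// Assumes tightly packed planes; real D3D11 surfaces have a row pitch
// from the lock/Map call that must be used instead of `width`.
void PlanarYuv420ToNv12(const uint8_t* y, const uint8_t* u, const uint8_t* v,
                        uint8_t* nv12, int width, int height) {
    // The Y plane is identical in both layouts.
    std::memcpy(nv12, y, static_cast<size_t>(width) * height);

    // Interleave the half-resolution U and V planes into one UV plane.
    uint8_t* uv = nv12 + static_cast<size_t>(width) * height;
    const int chromaSamples = (width / 2) * (height / 2);
    for (int i = 0; i < chromaSamples; ++i) {
        uv[2 * i]     = u[i];  // U sample
        uv[2 * i + 1] = v[i];  // V sample
    }
}
```

The reverse direction (NV12 back to planar) is the same loop with the assignments swapped. Since the profile shows the time going into the surface lock itself rather than the copy, it may also be worth experimenting with write-combined staging surfaces and keeping the number of lock/unlock round trips per frame to a minimum.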
Note that the numbers I have here compare favourably with the output of sample_encode with the -calc_latency flag, so I think my implementation is close to optimal (assuming yours is!).
Thanks for any advice you can give me.