Move images used by documentation
mav-intel committed Nov 29, 2023
1 parent 71892ed commit 3ab5f53
Showing 21 changed files with 17 additions and 17 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -1,4 +1,4 @@
# ![oneAPI](assets/oneapi-logo.png "oneAPI") Video Processing Library
# ![oneAPI](doc/images/oneapi-logo.png "oneAPI") Video Processing Library

Intel® oneAPI Video Processing Library (oneVPL) supports AI visual inference, media delivery,
cloud gaming, and virtual desktop infrastructure use cases by providing access to hardware
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
4 changes: 2 additions & 2 deletions doc/smt-cascade-scaling-readme.md
@@ -4,12 +4,12 @@ Some transcoding pipelines can greatly benefit from so called cascade scaling. S

Let’s look at the transcoding case shown in the diagram below. Here, one decoder feeds eight encoders with different resolutions, frame rates, and picture structures.

![original pipeline](./pic/cs_org_pipeline.jpg)
![original pipeline](./images/cs_org_pipeline.jpg)


As can be seen, deinterlacing and downscaling from the original HD resolution are performed six times. Because deinterlacing is a slow operation and downscaling from the original resolution consumes a lot of memory bandwidth, this pipeline may be bottlenecked by VPP performance. To remove this bottleneck, cascade scaling may be used, as shown in the next diagram.

![CS pipeline](./pic/cs_cs_pipeline.jpg)
![CS pipeline](./images/cs_cs_pipeline.jpg)


Here, the number of deinterlacing operations is reduced to two, and only three downscaling operations are performed on the original HD resolution. As the number of channels and the resolution ratio between the decoder and encoder channels grow, the benefits of cascade scaling grow as well.
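
To make the idea concrete, below is a minimal sketch of cascade scaling in oneVPL terms, assuming already-initialized VPP sessions and allocated surfaces; all names are hypothetical and error handling is omitted. One VPP downscales the decoded HD frame to an intermediate resolution, and the low-resolution channels then scale from that intermediate surface instead of the full HD frame.

```cpp
#include "vpl/mfxvideo.h"

// Hypothetical sketch: scale once from HD to an intermediate resolution, then
// serve three low-resolution channels from the cheaper intermediate surface
// instead of reading the full HD frame three more times.
void CascadeScaleOneFrame(mfxSession vppHdToMid, mfxSession vppMidToLow[3],
                          mfxFrameSurface1 *decodedHd, mfxFrameSurface1 *mid,
                          mfxFrameSurface1 *low[3]) {
    mfxSyncPoint sp = nullptr;
    MFXVideoVPP_RunFrameVPPAsync(vppHdToMid, decodedHd, mid, nullptr, &sp);
    for (int i = 0; i < 3; ++i)
        MFXVideoVPP_RunFrameVPPAsync(vppMidToLow[i], mid, low[i], nullptr, &sp);
    // In real code each output surface would be synced or handed to its
    // encoder; return codes are ignored here for brevity.
}
```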
4 changes: 2 additions & 2 deletions doc/smt-parallel-encoding-readme.md
@@ -8,7 +8,7 @@ Sometimes, e.g., for very high resolutions, transcoding couldn’t be handled by

The picture below shows the pipeline configuration for the 2x2 mode; the numbers are for a 60 fps transcoding case. This is the most powerful mode and can give up to a 2x transcoding speedup. We use two GPUs here, each running a decode / encode pair. Because performance is usually limited by the encoder, each GPU decodes all frames in the input bitstream but encodes only half of them. That means the decoder runs at 60 fps and the encoder at 30 fps, halving the encoder workload in comparison to sequential transcoding. After two GOPs have been transcoded, we mux them back into a single bitstream and write it to file.

![CS pipeline](./pic/par_enc_2x2.jpg)
![CS pipeline](./images/par_enc_2x2.jpg)


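The frame distribution rule itself is simple. The sketch below is an illustrative implementation of the even/odd GOP split described above (not the sample’s actual code), assuming a fixed GOP size:

```cpp
// Illustrative even/odd GOP split for the 2x2 mode (assumes a fixed GOP size).
// Both GPUs decode every frame, but each encodes only every other GOP, so the
// per-GPU encode rate is half the decode rate (60 fps decode -> 30 fps encode).
bool GpuShouldEncodeFrame(int frameIdx, int gopSize, int gpuIdx /* 0 or 1 */) {
    return ((frameIdx / gopSize) % 2) == gpuIdx;
}
```
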
To run such a pipeline, use this command line:
@@ -32,7 +32,7 @@ Also note, that performance significantly depends on async depth value. It speci

We use just one GPU here, and the performance gain comes from a better distribution of workloads among the available HW units of that GPU. This mode gives a much smaller performance gain than the 2x2 mode; depending on the workload, it may be in the 20-30% range. This mode uses one decoder that feeds two encoders, each of which encodes a complete GOP. The rest of the pipeline is similar to the 2x2 mode.

![CS pipeline](./pic/par_enc_1x2.jpg)
![CS pipeline](./images/par_enc_1x2.jpg)


To run such a pipeline, use this command line:
24 changes: 12 additions & 12 deletions doc/smt-tracer-readme.md
@@ -13,26 +13,26 @@ Transcoding performance, e.g., cascade scaling performance, strongly depends on

This trace file shows surface pool utilization. The picture below is an example for the pipeline described in smt-cascade-scaling-readme.md.

![surface pools utilization](./pic/cs_surface_pool_utilization.jpg)
![surface pools utilization](./images/cs_surface_pool_utilization.jpg)


Eight pools are shown: the decoder pool, three pools for cascade scaling (VPP pools), and four encoder pools (for the 2nd, 3rd, 7th, and 8th channels). As can be seen, the decoder pool is completely utilized; the encoder pools for the 2nd, 7th, and 8th channels have optimal utilization, high enough but with some spare frames in reserve; and the cascade scaling pools for the 4th, 5th, and 6th channels are underutilized. Reducing the number of surfaces in these pools would reduce the memory footprint in this particular case.

The trace file also shows the general execution flow, as illustrated by the next three pictures. They were also captured for the pipeline described in the previous section.

![overall control flow](./pic/cs_control_flow.jpg)
![overall control flow](./images/cs_control_flow.jpg)


Reading this picture from top to bottom: the first line shows decoding events, then three cascade scaling VPPs for the 4th, 5th, and 6th channels, and then eight encoders. For each channel, task submission is shown under the “dec”, “csvpp”, “vpp”, and “enc” names, together with “busy” waits, sync operation waits (“syncp”), and task dependencies (arrows). Note that the actual processing time (decoding, scaling, or encoding) is not shown as a separate block on the diagram, but it can be deduced as the interval between a task submission and the corresponding sync operation. Also note that on a high-performance system the duration of some events may be less than 1 microsecond; such an event is shown with zero duration, and its dependencies may be drawn incorrectly.

It can be clearly seen here that the third channel is the bottleneck of this pipeline. All channels except the 3rd start the VPP operation and encoding soon after decoding; the 3rd channel starts only after a delay of several frames. It is also clearly visible from these traces that the 3rd channel issues two VPP and two encoder calls for each decoded frame, due to i60-to-p60 frame rate conversion.

![cascade scaling control flow](./pic/cs_control_flow_enc.jpg)
![cascade scaling control flow](./images/cs_control_flow_enc.jpg)


This is the zoomed-in area “1” of the previous picture. It shows the relations between the decoder, cascade scaling VPPs, and encoders in the different channels. Note that the 1st channel depends only on the decoder output, the 2nd uses the cascade scaler VPP but also runs its own VPP before encoding, the 4th uses the cascade scaling output directly, and so on.

![synchronization control flow](./pic/cs_control_flow_sync.jpg)
![synchronization control flow](./images/cs_control_flow_sync.jpg)


This is the zoomed-in area “2” of the previous picture. It shows sync point wait operations. Note that all processing has been completed before this wait, so the wait finishes almost immediately for all channels. This is one more confirmation that the 3rd channel is the bottleneck in this case.
@@ -44,24 +44,24 @@ Depending on use case, we may need to tune pipeline for E2E or ENC latency. To f

To measure E2E latency, we start a timer just before the MFXVideoDECODE_DecodeFrameAsync() call and stop it when MFXVideoCORE_SyncOperation() finishes waiting for the sync point returned from MFXVideoENCODE_EncodeFrameAsync(). That is, we measure latency from the moment we send the bitstream to the decoder until the moment we get the bitstream from the encoder. Note that we do not take auxiliary calls into account, such as a request for more bitstream data from the decoder, because then we would also count the time spent reading the bitstream from disk. The picture below shows an example of E2E latency measured for the first encoder.

![E2E latency](./pic/cs_e2e.jpg)
![E2E latency](./images/cs_e2e.jpg)


For ENC latency, we start the timer just before the MFXVideoENCODE_EncodeFrameAsync() call (the very small blue rectangle in the picture below) and stop it just after the MFXVideoCORE_SyncOperation() call (the long green rectangle in the picture below). That is, we measure pure encoding time, from the moment we send the surface to encoding until the moment we get the encoded bitstream.

![ENC latency](./pic/cs_enc.jpg)
![ENC latency](./images/cs_enc.jpg)


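As a concrete illustration of the two timer placements, here is a minimal sketch, assuming already-initialized sessions, surfaces, and bitstreams; the variable and function names are hypothetical, status codes are not checked, and the real SMT code differs:

```cpp
#include <chrono>
#include "vpl/mfxvideo.h"

// Sketch of the E2E timer placement. For the ENC variant, the timer would
// start just before EncodeFrameAsync() instead of before DecodeFrameAsync().
double MeasureE2ELatencyMs(mfxSession dec, mfxSession enc, mfxBitstream *inBs,
                           mfxFrameSurface1 *work, mfxBitstream *outBs) {
    using clock = std::chrono::steady_clock;
    mfxFrameSurface1 *decoded = nullptr;
    mfxSyncPoint decSp = nullptr, encSp = nullptr;

    auto start = clock::now();  // timer starts: bitstream is sent to the decoder
    MFXVideoDECODE_DecodeFrameAsync(dec, inBs, work, &decoded, &decSp);
    MFXVideoENCODE_EncodeFrameAsync(enc, nullptr, decoded, outBs, &encSp);
    MFXVideoCORE_SyncOperation(enc, encSp, MFX_INFINITE);
    auto stop = clock::now();   // timer stops: encoded bitstream is ready
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```
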
Important note: to correctly interpret the measured latencies, we have to take frame reordering into account. If the input or output bitstream has reordered frames, then DecodeFrameAsync() may consume one frame as input but return a surface that corresponds to another frame as output. The same is true for the encoder: it may consume one frame as input but return another as output in the same EncodeFrameAsync() call. In that case, the E2E or ENC latencies do not show timing for the same frame.

Another thing we have to take into account during latency measurement is the asynchronous nature of transcoding. Even for an async depth of 1, the sample calls EncodeFrameAsync() before the actual decoding is finished. That means the ENC latency includes not only the encoding time but also the decoding time. To measure only the encoding time, use the “-trace::ENC” command line option. It adds an additional synchronization point between the decoder and encoder, ensuring that the encoder is called after decoding has finished and only the encoding latency is measured. This is an example of a pipeline with the “-trace::ENC” option.

![ENC latency](./pic/trace_enc_option.jpg)
![ENC latency](./images/trace_enc_option.jpg)


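In sketch form, the extra synchronization introduced by “-trace::ENC” looks like this (the same hypothetical names and assumptions as in the sketch above; the sample’s actual implementation differs):

```cpp
// "-trace::ENC"-style measurement: force decode completion first, so the
// timed interval contains encoding only (hypothetical helper, no error handling).
double MeasurePureEncLatencyMs(mfxSession dec, mfxSession enc, mfxBitstream *inBs,
                               mfxFrameSurface1 *work, mfxBitstream *outBs) {
    using clock = std::chrono::steady_clock;
    mfxFrameSurface1 *decoded = nullptr;
    mfxSyncPoint decSp = nullptr, encSp = nullptr;

    MFXVideoDECODE_DecodeFrameAsync(dec, inBs, work, &decoded, &decSp);
    MFXVideoCORE_SyncOperation(dec, decSp, MFX_INFINITE);  // the added sync point

    auto start = clock::now();  // from here on, the interval is pure encoding
    MFXVideoENCODE_EncodeFrameAsync(enc, nullptr, decoded, outBs, &encSp);
    MFXVideoCORE_SyncOperation(enc, encSp, MFX_INFINITE);
    auto stop = clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```
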
A similar issue exists for E2E latency. Even for an async depth of 1, the decoder starts decoding the next frame before encoding of the previous frame is finished; as a result, the decoding workload overlaps with the encoding one, and the decoding time may be measured incorrectly (the decoder may finish earlier and wait for the encoder to start). To avoid this, use the “-trace::E2E” command line option. In this case, the decoder starts decoding the next frame only after encoding of the previous one has finished in all channels. This is an example with the “-trace::E2E” option.

![E2E latency](./pic/trace_e2e_option.jpg)
![E2E latency](./images/trace_e2e_option.jpg)


Note that both of these options, “-trace::E2E” and “-trace::ENC”, affect the pipeline and reduce throughput by introducing additional synchronization points. They should be used only to simulate a specific use case, e.g., real-time streaming, where processing of the next frame starts only after the previous one has been encoded and sent out.
@@ -74,7 +74,7 @@ Note, that both these options

Look at the very end of the SMT console output. File names are unique for each run.

![file names](./pic/cs_bufer_usage.jpg)
![file names](./images/cs_bufer_usage.jpg)


### How to choose tracer buffer size
@@ -93,16 +93,16 @@ Open Google Chrome (TM). Type

Open the file, for example, using Microsoft(R) Excel(R).

![csv file](./pic/cs_lat_chart1.jpg)
![csv file](./images/cs_lat_chart1.jpg)

The first five lines are a short description of the file format; the latency data follows. For each channel in a 1-to-N pipeline there are four columns. The first column is the frame number. Note that channels may have different numbers of frames, e.g., due to frame rate conversion. The second column is the time when processing of the current frame started; this may be the start of decoding or encoding, depending on the use case (E2E or ENC). This column shows so-called “wall clock” time, which usually counts from system boot, although depending on the implementation it may use a different reference point. The third column is the same moment in time measured from a different starting point: SMT start. Looking at the first frame in this column, we can estimate the application initialization time. The last column is the latency. All numbers are in milliseconds.

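For illustration, a single channel’s four columns might look like this (the values are invented, and the real header lines differ):

```
frame   wall clock, ms   from SMT start, ms   latency, ms
    1        8123456.2                251.3          16.7
    2        8123473.1                268.2          16.9
    3        8123489.9                285.0          17.1
```
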
To plot a latency chart, delete the irrelevant columns and rows, then select the data and click the insert chart button.

![chart](./pic/cs_lat_chart2.jpg)
![chart](./images/cs_lat_chart2.jpg)


![chart](./pic/cs_lat_chart3.jpg)
![chart](./images/cs_lat_chart3.jpg)


To debug system-wide behavior, we can combine several .csv files. To do so, keep the wall clock and latency columns for the required channels, move them into the same file, and plot the chart.
