SNUG Home Gateway Architecture Case Study
Intel, Singapore
Task Graphs
SNUG 2019
Agenda
• Predicting SoC Performance using Software Workload Models
✓ Introduction to Task Graphs
✓ Introduction to HW-SW co-design Methodology
• Home Gateway SoC Case Study
✓ Task Graph Creation and Validation
✓ Performance Simulations for Next Generation Architecture
✓ Limitations of Current HW-SW co-design methodology
• Q&A
Predicting SoC Performance Using Software Workload Models
Defining Application Task Graph
TASK GRAPH: a directed graph whose nodes are software tasks (annotated with processing cycles, memory accesses and activation rates) and whose edges are execution dependencies.
HW-SW Co-Design Methodology
INPUTS
• WAN traffic (rate: 10 Gbps, packet size: 64 B) and WIFI traffic
• Application task graph: tasks (e.g. Task A, Task C) annotated with processing cycles (cycles: 500, cycles: 2000) and memory accesses (load: 20 % / store: 10 %, load: 10 % / store: 5 %)

Multi-Processor SoC Platform Model (Synopsys VDK)
• x86 and ARM cores (characterized as Intel CPU), NOC, SRAM buffers, WAV, VPU and Gfast blocks with their internal micro-architecture

✓ Task-level parallelism and dependencies
✓ Characterized with processing cycles and memory accesses
✓ Activation rates

NOTE: INPUTS/SOFTWARE traces are generated from the N-1 design platform.
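The task-graph inputs above can be captured in a small data structure. A minimal sketch in Python; the task names, cycle counts and access ratios are the illustrative values from the slide, not measured data:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One node of an application task graph (values are illustrative)."""
    name: str
    cycles: int                # processing cycles per activation
    load: float = 0.0          # fraction of accesses that are memory loads
    store: float = 0.0         # fraction that are memory stores
    deps: list = field(default_factory=list)  # tasks that must finish first

# Toy graph mirroring the slide: Task A feeds Task C.
a = Task("A", cycles=500, load=0.20, store=0.10)
c = Task("C", cycles=2000, load=0.10, store=0.05, deps=[a])

def serial_cycles(task):
    """Cycles on the longest dependency path ending at `task`."""
    return task.cycles + max((serial_cycles(d) for d in task.deps), default=0)

print(serial_cycles(c))  # 2500
```

The dependency edges are what let a simulator decide which tasks may run in parallel; the cycle and access annotations are what the platform model charges against cores and memories.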
Home Gateway SoC Case Study
Software Workload Creation for Linux SoCs
OUTPUT
The ftrace current tracer is set to "function".
Example Recipe: ftrace
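Turning an ftrace function-tracer dump into task-graph input starts with parsing lines of the form `comm-pid [cpu] timestamp: func <-caller`. A minimal parser sketch, assuming the default ftrace output layout (the sample line is adapted from the kernel's ftrace documentation; this is illustrative, not the study's actual tooling):

```python
import re

# One line of ftrace "function" tracer output looks roughly like:
#   comm-pid  [cpu]  timestamp: function <-caller
# (modern kernels add a flags column; the optional group skips it)
LINE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]"
    r"\s+(?:[\w.]+\s+)?(?P<ts>\d+\.\d+):\s+(?P<func>\S+)"
)

def parse(line):
    """Return (comm, pid, cpu, timestamp, function), or None if unparsable."""
    m = LINE.match(line)
    if not m:
        return None
    return (m["comm"], int(m["pid"]), int(m["cpu"]), float(m["ts"]), m["func"])

sample = "            bash-4003  [000] 123.638713: finish_task_switch <-schedule"
print(parse(sample))  # ('bash', 4003, 0, 123.638713, 'finish_task_switch')
```

Grouping these tuples by comm/pid and diffing timestamps is what yields per-task processing-time annotations for a task graph.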
EMON TRACE
Flame Graph – Quick Analysis
• perf record -F 99 -p 13204 -g -- sleep 30
• Each box represents a function in the stack (a "stack frame").
• The y-axis shows stack depth (number of frames on the stack). The top box shows the function that was on-CPU; everything beneath it is its ancestry. The function beneath a function is its parent, just like the stack traces shown earlier.
• The x-axis spans the sample population. It does not show the passing of time from left to right, as most graphs do. The width of a box shows the total time the function was on-CPU or part of an ancestry that was on-CPU (based on sample count).
• Functions with wide boxes may consume more CPU per execution than those with narrow boxes, or they may simply be called more often. The call count is not shown (or known via sampling).
Task Graph Generation Tools
Input Trace Type          Workload Abstraction       Tool                Comments
OS Trace                  Thread;                    SPEED (internal),   SPEED: Linux, Windows;
                          Thread + Function          TGG (3rd party)     TGG: Linux, Android, QNX
x86 Binary                Function, Instruction                          Intel PIN based
Software Analysis (VDK)   Thread, Function                               Virtualizer Software Development Kits

The HW-SW partitioning use-case for the SAMBA transfer workload model required kernel tracing support, which is available with SPEED.
Task Graph Validation & Exploration Experiments
1. BASE_TRANSFER_RATE_MBPS
   NAS rate observed on the host reference platform. We observed X MiBps for our platform.

2. INPUT_TRANSFER_RATE_MBPS
   NAS rate required for the next-generation SoC. We will input 2.5X MiBps for our next-generation SoC.
Experiment 2: Workload Elasticity — Varying CPU Cores
The workload should behave realistically as the number of hardware resources changes.

System Configuration
• CPU frequency: F GHz
• Input transfer rate: X MiBps

Results
• 1 core: NAS rate 0.8X MiBps, 100 % utilization
• 2 cores (reference configuration): NAS rate X MiBps, 120 % utilization
• 4 cores: NAS rate X MiBps

✓ A single core is almost 100 % utilized, and the NAS transfer rate of 0.8X MiBps is limited by the compute load.
✓ Dual core shows better performance, verifying realistic behaviour of the task graph with respect to compute resources.
✓ The reference design simulation shows strong correlation in terms of CPU utilization (60 % vs 61.25 %).
✓ Quad core does not improve the NAS rate further, verifying the INPUT TRANSFER RATE configuration.
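These results follow from a back-of-envelope model in which the achievable NAS rate is the requested input rate capped by available compute. A sketch, assuming the workload needs ~1.25 cores at frequency F to sustain X MiBps (chosen so one core saturates at 0.8X, matching the slide; being linear, the model predicts ~125 % dual-core utilization versus the measured 120 %):

```python
def nas_rate(cores, input_rate_x=1.0, cost_cores_per_x=1.25):
    """Achievable NAS rate in units of X MiBps: the requested input
    rate, capped by compute (cores available / cores needed per X)."""
    rate = min(input_rate_x, cores / cost_cores_per_x)
    total_util_pct = 100.0 * rate * cost_cores_per_x  # summed over all cores
    return rate, total_util_pct

for n in (1, 2, 4):
    rate, util = nas_rate(n)
    print(f"{n} core(s): {rate:.1f}X MiBps at {util:.0f} % total utilization")
```

The model reproduces the qualitative picture: one core is compute limited at 0.8X, two cores reach the full input rate, and a fourth core buys nothing because the workload is now input limited.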
Elasticity of Workload: Increasing CPU Frequency

System Configuration
• CPU frequency: 1.25*F GHz
• Input transfer rate: X MiBps

• Average core utilization drops from 60 % to 48 % at the increased frequency.
Simulation Sweep
Parameter Name       Values
Core affinities      100 % (reference image)
Simulation Time (INPUT NAS RATE = X MiBps vs. 2.5X MiBps)

OBSERVATIONS
✓ With an INPUT_NAS_RATE of X MiBps, increasing frequency does not impact the performance of the multi-core scenarios, as the workload is not compute limited.
✓ With an INPUT_NAS_RATE of 2.5X MiBps, performance improves with increased compute power, be it frequency or number of cores.
NAS TRANSFER RATE X MiBps

[Chart: total CPU utilization and NAS transfer rate per configuration — single core: 100 % utilization, 0.8X MiBps; dual core: 120 % utilization; quad core at 1.25F GHz: 98 % utilization; dual and quad core: X MiBps]

OBSERVATIONS
✓ The core utilization of the multi-core scenarios decreases at higher frequencies, indicating that the multi-core scenarios are not compute limited.
✓ Only the single-core configuration is compute limited.
NAS TRANSFER RATE 2.5X MiBps

[Chart: total CPU utilization and NAS transfer rate per configuration — single core: 100 % utilization; dual core: 100 % utilization; quad core: 2X MiBps, limited software parallelism; quad core at 1.25F GHz: best performance, 260 % utilization]

OBSERVATIONS
✓ With an INPUT_NAS_RATE of 2.5X MiBps, the NAS rate improves with frequency and number of cores in all scenarios.
✓ The NAS rate is not limited by resource parallelism, but task-level parallelism is insufficient to fully utilize four cores.
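The limited-parallelism observation fits the same toy model once usable compute is capped by task-level parallelism rather than core count. A sketch (the ~2.6-core parallelism bound is back-solved from the 260 % quad-core utilization on the slide, not a measured value):

```python
def nas_rate_limited(cores, freq=1.0, input_rate_x=2.5,
                     cost_cores_per_x=1.25, max_parallelism=2.6):
    """NAS rate in units of X MiBps at relative frequency `freq`, with
    usable compute capped by task-level parallelism, not just cores."""
    usable_cores = min(cores, max_parallelism) * freq
    return min(input_rate_x, usable_cores / cost_cores_per_x)

print(nas_rate_limited(4, freq=1.25))  # best case on the slide: 2.5
print(nas_rate_limited(4))             # parallelism-limited: ~2.08, i.e. ~2X
```

Under this cap, a fourth core helps only in combination with the frequency bump, matching the slide: quad core at F delivers about 2X MiBps, while quad core at 1.25F reaches the 2.5X target.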
Results Summary
Hardware Architecture Feedback
➢ The maximum NAS speed of 2.5X MiBps is observed on a quad-core configuration with a total CPU utilization of 265 % at a frequency of 1.25F GHz.
➢ We can derive a requirement of 4 cores to reach the targeted performance of our next architecture design.
Limitations of Methodology
Conclusion
Benefits
• Predict SoC performance early, including the software workload
• Define better products and SoCs
• Reduce schedule risk
Thank You