SNUG Home Gateway Architecture Case Study

The document discusses using software workload models and cycle-accurate hardware models to accelerate design cycles. It introduces task graphs as a way to model application workloads and provide early performance estimates. It then presents a case study using a home gateway system-on-chip, where task graphs were created from Linux traces to model the workload and validate performance against the target architecture. Limitations of the current hardware-software co-design methodology are also noted.


Accelerating Design Cycles with Software Workload

Models and Cycle Accurate Hardware Models


A Home Gateway SoC Case Study

Vikrant Kapila Systems Architect


Ingo Volkening Systems Architect
Anant Raj Gupta Systems Architect

Intel, Singapore

June 26-27, 2019


SNUG India – Bangalore
SNUG 2019 1
SoC Design Evaluation
Methods of Predicting Performance and Power

Task Graphs

SNUG 2019 2
Agenda
• Predicting SoC Performance using Software Workload Models
✓ Introduction to Task Graphs
✓ Introduction to HW-SW co-design Methodology
• Home Gateway SoC Case Study
✓ Task Graph Creation and Validation
✓ Performance Simulations for Next Generation Architecture
✓ Limitations of Current HW-SW co-design methodology
• Q&A
SNUG 2019 3
Predicting SoC Performance Using Software Workload Models

SNUG 2019 4
Defining Application Task Graph
TASK GRAPH - Definition

A task graph is an execution graph in which
each node is an atomic task and each edge
represents a dependency between one task's
output and another task's input.

TASK GRAPH - Purpose

A task graph for an application is
created to provide the designer with reliable
early design estimates of the expected system
performance.

TASK GRAPH - Granularity

Each atomic task in a task graph can represent a
function, a thread, or a unique stack function call,
depending on use-case requirements.
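
To make the granularity concrete, here is a minimal sketch (not from the presentation) of how such a task graph could be represented; the task names, cycle counts, and memory-access figures are illustrative placeholders, not measured values:

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        """One atomic task, characterized the way the methodology describes:
        processing cycles plus memory accesses."""
        name: str
        cycles: int           # processing cycles consumed by the task
        loads: int = 0        # memory read accesses
        stores: int = 0       # memory write accesses
        deps: list = field(default_factory=list)  # tasks whose output this task consumes

    # Illustrative graph: B and C depend on A, D depends on B and C.
    a = Task("A", cycles=500, loads=200, stores=100)
    b = Task("B", cycles=2000, loads=100, stores=50, deps=[a])
    c = Task("C", cycles=800, loads=150, stores=60, deps=[a])
    d = Task("D", cycles=1200, loads=90, stores=40, deps=[b, c])

    task_graph = [a, b, c, d]
    print("total work:", sum(t.cycles for t in task_graph), "cycles")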

SNUG 2019 5
HW-SW Co-Design Methodology
[Figure: HW-SW co-design flow.
INPUTS: WAN/WiFi/LAN traffic profiles (e.g. 10 Gbps rate, 64-byte packet size), micro-architecture parameters, and OS traces (Linux, Windows, Android, QNX).
SOFTWARE: SW filters and task graph generation (TGG) turn the traces into software-driven CPU load, data-path load, and stochastic load, expressed as independent tasks (A-D) annotated with cycle counts and load/store ratios.
SW-HW MAPPING: data flows and core affinities.
WORKLOAD GENERATION: a cycle-accurate workload model (x86 program or ARM VDKs, characterized as an Intel CPU) capturing
✓ task-level parallelism and dependencies,
✓ processing cycles and memory accesses,
✓ activation rates.
TARGET: multi-processor SoC platform model with a multi-core CPU, L2, DDR, DMA, NoC, SRAM buffers, a CNN accelerator, SNPS VPU, and WAV Gfast.]

NOTE: INPUTS/SOFTWARE traces are generated using the N-1 design platform.
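
The actual mapping and simulation run on the cycle-accurate platform model shown above; purely as an illustration of the idea, here is a toy sketch (not the tool's algorithm) of mapping a characterized task graph onto a number of cores and bounding its execution time. The cycle counts, dependency structure, and the work/critical-path bound are all assumptions for illustration.

    # Toy estimate: bound the task-graph execution time on a multi-core CPU by the
    # larger of total work divided across cores and the longest dependency chain.
    # Cycle counts and dependencies below are illustrative placeholders.
    CYCLES = {"A": 500, "B": 2000, "C": 800, "D": 1200}
    DEPS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

    def critical_path(task):
        """Longest dependency chain ending at `task`, in cycles."""
        return CYCLES[task] + max((critical_path(d) for d in DEPS[task]), default=0)

    def estimate_seconds(cores, freq_hz):
        total_work = sum(CYCLES.values())
        longest_chain = max(critical_path(t) for t in CYCLES)
        bound_cycles = max(total_work / cores, longest_chain)   # work/span bound
        return bound_cycles / freq_hz

    for cores in (1, 2, 4):
        print(f"{cores} core(s): {estimate_seconds(cores, 1e9):.2e} s")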

SNUG 2019 6
Home Gateway SoC Case Study

SNUG 2019 7
Software Workload Creation for Linux SoCs

✓ Standard packet-processing flow for the NIC (network interface card)

A NAS transfer rate of X MiBps is observed on the reference SoC.

Setup: a NAS application transfer from a client to the NAS device over an Ethernet connection to the home gateway unit.

INPUT (captured on the Linux reference platform)
1. Linux ftrace
2. PMU statistics (via a PMU sampling driver)

OUTPUT (ftrace + PMU fed into task graph generation)
1. SAMBA task graph
2. Background task graph
SNUG 2019 8
Validating Workload Model
Experiment 1: Standalone Workload Validation
The standalone workload should correctly capture the processing and communication requirements from the input traces.

Validation Metric    Trace         Task Graph    Error
Execution Time       12.259 sec    11.859 sec    3.26 %
DDR Read             5337.54 MB    5350 MB       0.2 %
DDR Write            2002.10 MB    2014 MB       0.5 %

A CPU utilization of 61.25 % is observed.


SNUG 2019 9
Understanding Trace

• The tracer is "function".
• The difference between the number of entries written and the number of entries in the buffer is the number of events lost because the buffer filled up (250280 – 140080).
• The task name is "bash" and the task PID is "1977".
• The CPU it was running on is "000"; the timestamp is in <secs>.<usecs> format.
• The traced function name is "sys_close", and the parent function that called it is "system_call_fastpath".
• The timestamp is the time at which the function was entered.
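
As a minimal sketch (not part of the presentation) of pulling these fields out of a function-tracer line, assuming the usual task-pid [cpu] timestamp: function <-parent layout described above:

    import re

    # One line of ftrace "function" tracer output, e.g.:
    #   bash-1977  [000]  17284.993652: sys_close <-system_call_fastpath
    FTRACE_LINE = re.compile(
        r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+"      # task name and PID
        r"\[(?P<cpu>\d+)\]\s+"                   # CPU the task was running on
        r"(?P<ts>\d+\.\d+):\s+"                  # timestamp in <secs>.<usecs>
        r"(?P<func>\S+)\s+<-(?P<parent>\S+)"     # traced function and its caller
    )

    def parse_ftrace_line(line):
        """Return the fields described on this slide as a dict, or None."""
        m = FTRACE_LINE.match(line)
        return m.groupdict() if m else None

    print(parse_ftrace_line("bash-1977  [000]  17284.993652: sys_close <-system_call_fastpath"))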

SNUG 2019 10
Example Recipe : ftrace
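
The recipe itself appears as a screenshot in the deck; as a rough stand-in, here is a minimal sketch of driving the function tracer through the standard tracefs interface (the mount point and the 30-second capture window are assumptions, and it must be run as root):

    import time

    TRACING = "/sys/kernel/debug/tracing"   # tracefs mount point (assumed)

    def write(node, value):
        with open(f"{TRACING}/{node}", "w") as f:
            f.write(value)

    # Select the "function" tracer, start tracing, run the capture window, stop.
    write("current_tracer", "function")
    write("tracing_on", "1")
    time.sleep(30)                           # capture window while the NAS transfer runs
    write("tracing_on", "0")

    # Save the captured trace for task graph generation.
    with open(f"{TRACING}/trace") as src, open("ftrace_function.txt", "w") as dst:
        dst.write(src.read())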

SNUG 2019 11
EMON TRACE

LONGEST_LAT_CACHE.REFERENCE 4,001,560,225 127,915,172 15,890,489

SNUG 2019 12
Flame Graph – Quick Analysis
• perf record -F 99 -p 13204 -g -- sleep 30 (the recorded samples are turned into a flame graph by the sketch after this list)

• Each box represents a function in the stack (a
"stack frame").
• The y-axis shows stack depth (number of frames
on the stack). The top box shows the function
that was on-CPU. Everything beneath that is
ancestry. The function beneath a function is its
parent, just like the stack traces shown earlier.
• The x-axis spans the sample population. It
does not show the passing of time from left to
right, as most graphs do. The width of the box
shows the total time it was on-CPU or part of an
ancestry that was on-CPU (based on sample
count).
• Functions with wide boxes may consume more
CPU per execution than those with narrow boxes,
or, they may simply be called more often. The call
count is not shown (or known via sampling).
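
A minimal sketch (not from the deck) of turning those perf samples into the SVG, assuming Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl and flamegraph.pl) are on the PATH and perf.data sits in the current directory:

    import subprocess

    def run(cmd, stdin=None):
        """Run one stage of the pipeline and return its stdout."""
        return subprocess.run(cmd, input=stdin, capture_output=True,
                              text=True, check=True).stdout

    stacks = run(["perf", "script"])                  # expand perf.data into stack traces
    folded = run(["stackcollapse-perf.pl"], stacks)   # one line per unique stack
    svg = run(["flamegraph.pl"], folded)              # render the flame graph
    with open("flamegraph.svg", "w") as f:
        f.write(svg)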

SNUG 2019 13
Task Graph Generation Tools
Tools: SPEED (internal) and TGG (3rd party).

Input Trace Type               Workload Abstraction        Comments
OS Trace                       Thread; Thread + Function   SPEED: Linux, Windows; TGG: Linux, Android, QNX
X86 Binary                     Function, Instruction       Intel PIN based
Software Analysis (VDK)        Thread, Function            Virtualizer Software Development Kits
Custom Performance Statistics  Workload Characterization   Instruction count, cache statistics
Ptrace                         Function, Instruction       Linux ptrace

The HW-SW partitioning use case for the SAMBA transfer workload model required kernel-tracing support, which is available with SPEED.
SNUG 2019 14
Task Graph Validation & Exploration Experiments

Configurable Workload Model

1. BASE_TRANSFER_RATE_MBPS
   NAS rate observed on the host reference platform. We observed X MiBps for our platform.

2. INPUT_TRANSFER_RATE_MBPS
   NAS rate required for the next-generation SoC. We will input 2.5X MiBps for our next-generation SoC.

3. NAS Calibration Factor
   Based on the underlying extrapolation function, for example linear extrapolation (a sketch follows below).

Mapping
   Setting specific core affinities.

Huge Memory Bandwidth
   For the next experiment we assume huge memory bandwidth.
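
As an illustration of what a linear calibration could look like (the function, the scaling rule, and all numbers below are assumptions, not the tool's implementation):

    # Linear extrapolation from the observed base rate to the required input rate:
    # scale every task's activation rate by the same factor. Values are placeholders.
    BASE_TRANSFER_RATE_MBPS = 100.0      # "X MiBps" observed on the reference platform
    INPUT_TRANSFER_RATE_MBPS = 250.0     # "2.5X MiBps" required for the next-gen SoC

    def nas_calibration_factor(base_rate, input_rate):
        """Linear extrapolation: how much harder the workload must be driven."""
        return input_rate / base_rate

    factor = nas_calibration_factor(BASE_TRANSFER_RATE_MBPS, INPUT_TRANSFER_RATE_MBPS)
    activation_rates_hz = {"samba_rx": 8000, "samba_tx": 8000, "background": 500}
    scaled = {task: rate * factor for task, rate in activation_rates_hz.items()}
    print(scaled)   # {'samba_rx': 20000.0, 'samba_tx': 20000.0, 'background': 1250.0}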

SNUG 2019 15
Elasticity of Workload: Varying CPU Cores
System configuration: CPU frequency = F GHz, input transfer rate = X MiBps.

Experiment 2: Workload Elasticity
The workload should behave in a realistic manner to a change in the number of hardware resources.

1 core                              NAS rate 0.8X MiBps   100 % utilization
2 cores (reference configuration)   NAS rate X MiBps      120 % utilization
4 cores                             NAS rate X MiBps

OBSERVATIONS: Increased Number of Cores
✓ A single core is almost 100 % utilized, and the NAS transfer rate of 0.8X MiBps is limited by the compute load.
✓ Dual core shows better performance, verifying realistic behaviour of the task graph with respect to compute resources.
✓ The reference-design simulation shows strong correlation in terms of CPU utilization (60 % vs. 61.25 %).
✓ Quad core does not improve the NAS rate further, verifying the INPUT_TRANSFER_RATE configuration.

SNUG 2019 16
Elasticity of Workload: Increasing CPU Frequency
System configuration: CPU frequency = 1.25F GHz, input transfer rate = X MiBps.

Experiment 3: Workload Elasticity
The workload should behave in a realistic manner to a change in hardware compute power.

OBSERVATIONS: Increased CPU Frequency
✓ A single core is almost 98 % utilized on average, and the NAS rate improved to X MiBps against 0.8X MiBps in the F GHz configuration.
✓ Dual-core average utilization decreased from 60 % to 48 % for the same NAS file-transfer rate of X MiBps.
SNUG 2019 17
Performance Simulations for Next Gen Architecture
The SAMBA workload is mapped to the cycle-accurate next-generation SoC platform.

Simulation sweep (a sketch of iterating it follows below):
  INPUT_NAS_RATE     X MiBps, 2.5X MiBps
  CPU_FREQUENCY      F GHz, 1.25F GHz
  NUMBER_OF_CORES    1, 2, 4
  Core affinities    100 %

(The real cycle-accurate HW platform is not shown here; a reference image is used.)
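
A minimal sketch (not the actual simulation driver) of enumerating that sweep, with placeholder numbers standing in for X and F:

    import itertools

    # Placeholder values: X MiBps -> 100, F GHz -> 2.0 (illustrative only).
    INPUT_NAS_RATE_MIBPS = [100, 250]     # X, 2.5X
    CPU_FREQUENCY_GHZ = [2.0, 2.5]        # F, 1.25F
    NUMBER_OF_CORES = [1, 2, 4]

    def run_simulation(nas_rate, freq_ghz, cores):
        """Stand-in for launching one cycle-accurate simulation run."""
        print(f"simulate: rate={nas_rate} MiBps, freq={freq_ghz} GHz, cores={cores}")

    for rate, freq, cores in itertools.product(
            INPUT_NAS_RATE_MIBPS, CPU_FREQUENCY_GHZ, NUMBER_OF_CORES):
        run_simulation(rate, freq, cores)   # 2 x 2 x 3 = 12 simulation points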

SNUG 2019 18
Simulation Time
[Charts: simulated results for INPUT_NAS_RATE = X MiBps and INPUT_NAS_RATE = 2.5X MiBps.]

OBSERVATIONS
✓ With an INPUT_NAS_RATE of X MiBps, increasing frequency does not impact the performance of multi-core scenarios, as the workload is not compute limited.
✓ With an INPUT_NAS_RATE of 2.5X MiBps, performance improves with increased compute power, be it frequency or number of cores.

SNUG 2019 19
NAS Transfer Rate: X MiBps
[Charts: total CPU utilization and NAS transfer rate at INPUT_NAS_RATE = X MiBps. Annotations include: single core ~100 % utilization at 0.8X MiBps; dual core ~120 % utilization; ~98 % utilization at 1.25F GHz; dual-core and quad-core configurations reach X MiBps.]

OBSERVATIONS
✓ The core utilization of multi-core scenarios decreases with higher frequencies, indicating that the multi-core scenarios are not compute limited.
✓ Only the single-core configuration is compute limited.
SNUG 2019 20
NAS Transfer Rate: 2.5X MiBps
[Charts: total CPU utilization and NAS transfer rate at INPUT_NAS_RATE = 2.5X MiBps. Annotations include: best performance from the quad core at 1.25F GHz (~260 % utilization); dual core ~100 % utilization; single core ~100 % utilization; quad core reaches ~2X MiBps, pointing to limited software parallelism.]

OBSERVATIONS
✓ With an INPUT_NAS_RATE of 2.5X MiBps, the NAS rate improves with frequency and number of cores in all scenarios.
✓ The NAS rate is not limited by resource parallelism, but task-level parallelism is insufficient to fully utilize the quad cores.

SNUG 2019 21
Results Summary
Hardware Architecture Feedback
➢ The maximum NAS speed of 2.5X MiBps is observed on a quad-core configuration with a total CPU utilization of 265 % at a frequency of 1.25F GHz.
➢ We can derive a requirement of 4 cores to reach the targeted performance of our next architecture design.

Software Architecture Feedback
➢ The software architecture limits the full utilization of the available compute resources.
➢ After discussions with the software architecture team, we found that this speed will be further limited by specific core affinities.

SNUG 2019 22
Limitations of Methodology

• Cache Size Extrapolation
  ✓ No definitive way to account for a change in the cache hierarchy.
  ✓ Limited by the availability of exact memory-access patterns.

• VPU CPI Characterization
  ✓ The generic Virtual Processing Unit needs to be manually characterized to mimic the real hardware.
  ✓ For example, the software performance model comprising tasks is characterized using software instructions, while the CPI for the underlying hardware needs to come from the Virtual Processing Unit (a sketch follows below).
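
Purely to illustrate that manual characterization step (the per-class CPI values and the instruction mix below are assumptions, not measured numbers), converting a task's instruction counts into VPU cycles could look like:

    # Manually characterized CPI per instruction class for the generic VPU
    # (illustrative values; the real numbers must come from the target hardware).
    CPI = {"alu": 1.0, "load": 4.0, "store": 2.0, "branch": 1.5}

    def task_cycles(instruction_mix):
        """Cycles the VPU should charge for a task, given its instruction counts."""
        return sum(count * CPI[kind] for kind, count in instruction_mix.items())

    # Example task from the software performance model (counts are placeholders).
    samba_copy = {"alu": 120_000, "load": 40_000, "store": 25_000, "branch": 15_000}
    print(task_cycles(samba_copy))   # cycles annotated for this task in the workload model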

SNUG 2019 23
Conclusion

The HW-SW co-design methodology enables
early analysis of Linux SoC architectures:
• Linux workload creation & validation
• Platform mapping and KPI analysis

Benefits
• Predict SoC performance early, including the software workload.
• Define better products and SoCs.
• Reduce schedule risk.

SNUG 2019 24
Thank You

SNUG 2019 25
