DWF13 Euf Net T0645
November 2013
• The baseband market trends and requirements
• QorIQ Qonverge B4860 block diagram overview
• e6500 and SC3900 cores
• Memory system and interconnect
• CPRI
• MAPLE B3
• Data Path Acceleration Architecture (DPAA)
• Power Management
• Q&A
Higher Capacities and Data Rates
Smartphone Density
• 32x increase per km2 by 2015
Internet over Mobile
• 70% of mobile data by 2014
(Source: Bell Labs, Apr-11)
Global LTE Macro Base Station Deployments to Reach 1.5 Million by 2015
(In-Stat, Sep-11)
Global Capital Expenditures by Wireless Carriers for 4G LTE Infrastructure Gear will Reach $36.1 Billion by 2015
(iSuppli Research, Jan-12)
Connectivity
High Throughputs & Coverage
• Coverage: Urban, highways and rural
• Spectral efficiency: Radio and network performance
• Multi-standard: Supports variety of users
• Reliability: Zero down time
• Next generation, e6500 Dual-Thread Power Architecture® cores offer highest CoreMark/Watt with AltiVec technology for dramatic L2 scheduling acceleration
• 20GHz of Programmable Performance
3 sector, 20 MHz LTE with 5 major components, compared with 3 sector, 20 MHz LTE on a single SoC
[Figure: board-level comparison. Labels on the discrete design: CPRI, Antenna, Layer-1 DSP (x2), Layer-2/3 Transport, Maint. & Control, Flash, DDR. Labels on the B4860 SoC design: CPRI, Antenna, Back Haul, 10 Gbps GE PHY, 1Gbps GE PHY, I2C, sRIO, UART, DDR3.]
• 4X Cost Reduction
• 3X Power Reduction
High Density Baseband Solution
High Performance
• 64-bit Power Architecture core
• Dual threads provide 1.7 times the performance of a single thread
Energy Efficiency
• Drowsy: core, cluster, AltiVec
[Figure: four dual-threaded e6500 cores, each with an AltiVec unit, a PMC and 32K L1 caches; chart comparing the e500 core with the e6500 processor (2 thread)*]
*Based on simulation
StarCore SC3900 Flexible Vector Processor
• High DSP performance without compromising flexibility
• Step function in performance over previous generation
─ 8 instructions per cycle
o Up to 8 data lanes vector in a single instruction (SIMD8)
─ 38.4 GMACS per core @1.2 GHz & 1.2 Tbps memory bandwidth per core
• State-of-the-art support for control code with Branch Prediction
• Fully featured Memory Management Unit and Logical to Real Address Translation
StarCore SC3900 FVP Clusters
• Six SC3900 Cores
• Clustering two SC3900 under a 2MB, multi-banked L2 cache
• High bandwidth accelerator ports (up to 1Tbps per cluster)
• Hardware support for memory coherency between L1, L2 caches and the main memory
[Figure: SC3900 FVP cluster. Two SC3900 FVP cores with 32K L1 caches and a high speed baseband accelerators interface sit on a 2MB 16-way shared L2 cache with 4 banks, attached to the CoreNet Coherent Fabric.]
[Figure: Highest BDTI speed score. Freescale SC3900 at 1.2GHz: 37,460 (BDTIsimMark2000™). Texas Instruments C66x at 1.5GHz: 20,030 (BDTImark2000™).]
• Integrated vector processing unit for accelerating L2 scheduling
• Security acceleration for offload of transport functionality
• Saving CPU Cycles for higher value work
DPAA Control/Transport Hardware Accelerators
• FMAN (Frame Manager): >20 Gbps aggregate throughput, Parse, Classify, Distribute
• BMAN (Buffer Manager): Manages buffer pools for accelerators and network interfaces
• QMAN (Queue Manager): Simplified sharing of network interfaces and hardware accelerators by multiple cores
• RMAN (Rapid IO Manager): Seamless mapping sRIO to DPAA
• SEC (Security): SNOW-3G, Kasumi, ZUC, IPSec, AES, DES, MD5, SHA-1/2….
Advanced I/O: Glueless Interfaces to Antenna & Backhaul
• Advanced Real-time Tracing and Monitoring
• High-speed, industry-standard Antenna Interfaces
− 8x CPRI v4.2 at 9.8G
• 2x sRIO V2.1
− 2 controllers with 8 lanes at 5G
• PCIe v2.0
− Quad lanes at 5G
• IFC controller
− Modern NOR/NAND flash & Legacy ASIC connectivity
• Backhaul networking, delivering line-rate at smallest packet sizes
− 2x 10 GbE XFI/KR
− 4x 1 GbE/2.5 GbE
− Classify-Parse-Distribute
− Timing Synchronization: IEEE 1588v2
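To put "line-rate at smallest packet sizes" into numbers, the short sketch below computes the packet rate needed to saturate the two 10 GbE backhaul ports with minimum-size frames. The framing overheads are the standard Ethernet values; the function and constant names are made up for illustration and are not part of any Freescale software.

```python
# Ethernet framing overhead per frame on the wire (standard IEEE 802.3 values).
PREAMBLE_SFD_BYTES = 8      # 7-byte preamble + 1-byte start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # minimum idle time between frames
MIN_FRAME_BYTES = 64        # smallest legal Ethernet frame, including FCS


def max_packets_per_second(link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Packet rate needed to keep a link saturated with frames of one size."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


per_10g_port = max_packets_per_second(10e9)
both_10g_ports = 2 * per_10g_port
print(f"one 10 GbE port : {per_10g_port / 1e6:.2f} Mpps")
print(f"two 10 GbE ports: {both_10g_ports / 1e6:.2f} Mpps")
```

With 64-byte frames this gives about 14.88 Mpps per port and about 29.76 Mpps for both ports, which is the load a line-rate design has to absorb.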
• 64-bit Power Architecture
• e5500 core features plus:
− Two threads per core (SMT)
− Dual load/store units, one per thread
• Shared L2 in cluster of 4 cores (8 threads
per cluster)
− 2048KB 16-way, 4 Banks
− High-performance eLink bus between core Ld/St and instruction fetch units
• Power
− Drowsy core
− Power Mgt Unit
− Wait-on-reservation instruction
• Enhanced MP Performance
− Accelerated Atomic Operations
− Optimized Barrier Instructions
− Fast intra-cluster sharing
• AltiVec SIMD Unit
• CoreNet BIU
− 256-bit Din and Dout data busses
• 36-bit Real Address
− 256 GByte physical addr. space
• Each thread: Superscalar, seven-issue, out-of-order execution/in-order completion, Branch units with a 512-entry, 4-way set associative Branch Target/History
• Execution units: 1 Load/Store Unit per thread, 2 Simple integer per thread, 1 Complex for integer Multiply & Divide, 1 Floating-point Unit, AltiVec
• 64 TLB SuperPages, 1024-entry 4K Pages, 36-bit Physical Address
• Hardware Table Walk
• LRAT
− Logical to Real Addr. translation mechanism for improved hypervisor performance
• Cluster consists of 2x SC3900 under large and fast shared memory
− Multibank Cache of 2 MB
− AXI based accelerators coupling port (45-90 GBps)
[Figure: FVP Cluster (Kibo), a 2-core cluster plus MAPLE port. Two SC3900 sub-systems, each with an AXI port to MAPLE.]
• SC3900 has a fully cache based memory:
No constraints due to internal memory sizes and rigid allocations
No DMA management and scheduling overheads
Smaller internal memories required as only used code/data is allocated
• MAPLE/CPRI make coherent read/write accesses to the clusters' L2 caches
− Provides tight coupling of MAPLE/CPRI to the DSP cores
− Provides high BW (>76GBps per cluster) and low latency
− Significantly reduces DDR bus load and coherency traffic
− Parallel access to multiple SC3900 clusters
• MAPLE/CPRI accesses to DDRs directly via CoreNet fabric
• Target selection is based on MAPLE MMU
[Figure: FVP Cluster (Kibo), a 2-core cluster plus MAPLE port. CPRI and the MAPLE B3 baseband accelerators reach two SC3900 FVP core sub-systems (each with I-Cache and D-Cache) through AXI ports; the cores share an L2 cache that connects to the SoC fabric.]
SC3900 Main Features:
• 4 symmetric DMUs in the DALU
− 32xMACs per cycle
− 16 FLOPs/cycle Floating Point support
− 4xSIMD8/vector support
• High memory bandwidth
− Program bus – 256 bit/cycle
− Data bus – up to 1024 bit/cycle
• Address & integer unit
− 2 Load/Store units
− General Integer Processing Unit (IPU)
− Support multiply and shift
• Large register files
− For both AGU and DALU
[Figure: SC3900 core block diagram. Buses to caches/memory: XDBRA, XDBRB, XDBWA, XDBWB and PDB at 256 bits each; XABA, XABB and PAB at 32 bits each. PCU: fetch, program control, 8 loop registers, BRU with a 128-entry BTB. AGU: 32 address registers, two LSUs and the IPU. DALU: 64 data registers and four DMUs (MAC1 to MAC4), each with 8xMAC and 2xFMAD.]
• SC3900 is optimized to efficiently handle Baseband PHY Layer
processing
• PHY layer processing can be divided into three categories:
− Computation intensive DSP code (mainly MAC intensive)
− Data manipulation and less intensive DSP code
− Control code
• Each one of the categories is non-negligible in processing
requirements
• There is no clear boundary between the categories
• SC3900 accelerates all types of Baseband L1 processing
• SC3900 provides Vector processor capability by increasing the
execution units and the whole data-path accordingly
− Up to 32 MACs per cycle
− 64 dedicated data registers of 40 bits each
− Up to 1024 bits (128B) of core-to-L1 Data Cache throughput per cycle
− Strong and flexible cache based data streaming abilities
• Performance:
− SC3900 is 3.5x-4x better than SC3850 in intensive DSP code
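The per-cycle figures above line up with the headline numbers quoted earlier in the deck (38.4 GMACS and 1.2 Tbps per core at 1.2 GHz, 16 FLOPs per cycle). The arithmetic is sketched below; the only assumption beyond the slides is that one FMAD counts as two floating-point operations (a multiply and an add).

```python
CORE_CLOCK_HZ = 1.2e9            # SC3900 clock quoted in the slides
MACS_PER_CYCLE = 32              # 4 DMUs x 8 MACs each
DATA_BUS_BITS_PER_CYCLE = 1024   # core-to-L1 data cache throughput
FMADS_PER_CYCLE = 4 * 2          # 4 DMUs x 2 FMAD units each
FLOPS_PER_FMAD = 2               # assumption: one multiply plus one add

peak_gmacs = MACS_PER_CYCLE * CORE_CLOCK_HZ / 1e9
peak_data_tbps = DATA_BUS_BITS_PER_CYCLE * CORE_CLOCK_HZ / 1e12
peak_flops_per_cycle = FMADS_PER_CYCLE * FLOPS_PER_FMAD

print(f"peak MAC rate      : {peak_gmacs:.1f} GMACS per core")
print(f"peak data bandwidth: {peak_data_tbps:.2f} Tbps per core")
print(f"floating point     : {peak_flops_per_cycle} FLOPs per cycle")
```

These are peak rates; sustained throughput depends on how well a kernel keeps all four DMUs and the data buses busy.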
• “Data manipulation” covers many different functions in Baseband Layer 1, for example:
− Data preparation before/after intensive kernels
Ex: data re-ordering, matrix transpose, pack/unpack
− Less regular kernels or serial/cyclic kernels with low parallelism
Ex: QR Decomposition, IIR, Interleaver, encoder.
• Performance:
− SC3900 is 2x-3x better than SC3850 in “Data Manipulation”
• SC3900 control code efficiency
− L1 control functions are tightly integrated with the Arithmetic intensive SW
− Useful for running scheduling functions that are control intensive
• Robust and flexible Instruction set
− Significant improvement over previous generations
− Instructions are highly flexible and fit wide range of DSP operations
− For example, MAC instruction is defined to support:
Single precision 16bx16b, mixed-precision 16bx32b and double precision 32bx32b
Saturated and non-saturated arithmetic
SIMD and dot-product
Real x Real, Real x Complex, Complex x Complex, Complex x Conjugate
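The MAC variants listed above can be pictured with a small behavioural model. This is only a sketch in plain Python, not SC3900 code: the function names are invented, and the 40-bit accumulator width is an assumption taken from the 40-bit data registers mentioned elsewhere in the deck.

```python
ACC_BITS = 40  # assumption: accumulate into a 40-bit data register


def saturate(value, bits=ACC_BITS):
    """Clamp to the signed range of the accumulator (saturated arithmetic)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def wrap(value, bits=ACC_BITS):
    """Two's-complement wraparound (non-saturated arithmetic)."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b, saturating=True):
    """Real x Real multiply-accumulate: acc + a * b."""
    total = acc + a * b
    return saturate(total) if saturating else wrap(total)


def complex_conj_mac(acc, a, b, saturating=True):
    """Complex x Conjugate: acc + a * conj(b), values given as (re, im) pairs."""
    (acc_re, acc_im), (a_re, a_im), (b_re, b_im) = acc, a, b
    fix = saturate if saturating else wrap
    return (fix(acc_re + a_re * b_re + a_im * b_im),
            fix(acc_im + a_im * b_re - a_re * b_im))


# Correlating a sample with itself yields its energy and a zero imaginary part.
energy = complex_conj_mac((0, 0), (3, 4), (3, 4))
print(energy)
```

The same accumulate step clamps at the largest positive value in saturated mode and wraps to the most negative value in non-saturated mode, which is the behavioural difference between the two arithmetic options.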
• The Instruction set definition is based on deep analysis of baseband
requirements and MAPLE™ offloaded functionality
• Powerful application specific instructions are introduced in SC3900 (patents pending). A few examples:
Maximum/Peak search
• Find maximum value and index between 4 words and previous results
• Up to 4 Maxsearch (total of 20 elements) per cycle
Filter and correlation dedicated instructions
• Support for both complex and real filters
FFT/DFT highly optimized kernel and instructions
Specialized load/store instructions
• Support matrix transposed & manipulation (2x2, 2x4, 4x4, 2x8, 4x8, 8x8)
Bit manipulation
• Dedicated instruction for scrambling, puncturing, interleaver
Reciprocal (1/x), 1/Square Root and Log approximation instructions
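As an illustration of the maximum/peak search instruction described above, the sketch below models one search step in plain Python: four new words are compared against the running maximum, and both the value and its index are carried forward. The function names are hypothetical. One reading of the "20 elements per cycle" figure is four such steps in parallel, each covering four words plus one previous result; the slides do not spell this out.

```python
def max_search_step(previous, words, base_index):
    """One search step: compare up to 4 new words with the previous (value, index)."""
    best_value, best_index = previous
    for offset, word in enumerate(words):
        if word > best_value:          # ties keep the earlier index
            best_value, best_index = word, base_index + offset
    return best_value, best_index


def peak_search(samples):
    """Scan a non-empty buffer four words at a time, carrying the running maximum."""
    best = (samples[0], 0)
    for base in range(0, len(samples), 4):
        best = max_search_step(best, samples[base:base + 4], base)
    return best


print(peak_search([3, 9, -2, 7, 9, 12, 0, 1, 5]))  # -> (12, 5)
```

In hardware the inner comparison loop collapses into a single instruction, so the scan costs roughly one step per four words instead of one compare-and-branch per word.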
• Instead of using a conventional L3 cache, the B4860 has a CoreNet
Platform Cache (CPC) — 512KB for each of the two DRAM
controllers.
• Data and code sharing between cores on the same cluster
Low L2-cache access latency
Increased cores utilization
Reduce DDR traffic
Lower power
• B4860 has two main fabrics
CoreNet fabric
AXI fabric
• AXI fabric
Connects the MAPLE units and CPRI to the SC3900 clusters and to the CoreNet fabric
Allows for high throughput, low latency transfers in Layer 1 sub-system
• Coherent fabric
Maintains coherence among all the CPU and DSP caches and memories.
Reduces multicore software development effort
Easier software partitioning and upgrade
• High data bandwidth buses: 256-bit at 667 MHz
42.5 GB/s of raw bandwidth per cluster
• Performance features
Parallel accesses
Deep pipeline
Out-of-order completion
Inter-processor communication
• Stashing
• PAMU - Peripheral Access Management Unit
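The raw bandwidth figure above can be approximated from the bus width and clock. The sketch assumes the quoted number counts both 256-bit directions (data in and data out), which the slide does not state explicitly.

```python
BUS_WIDTH_BITS = 256
FABRIC_CLOCK_HZ = 667e6
DIRECTIONS = 2   # assumption: separate 256-bit read and write data buses

bytes_per_cycle = BUS_WIDTH_BITS // 8
one_way_gbytes = bytes_per_cycle * FABRIC_CLOCK_HZ / 1e9
both_ways_gbytes = DIRECTIONS * one_way_gbytes

print(f"one direction  : {one_way_gbytes:.1f} GB/s")
print(f"both directions: {both_ways_gbytes:.1f} GB/s")
```

The result, roughly 42.7 GB/s, is close to the 42.5 GB/s on the slide; the small gap presumably comes from rounding or from the exact fabric clock.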
[Figure: B4860 block diagram for a coherent read. SC3900 FVP clusters at 1.2GHz (32 KB I-Cache and 32 KB D-Cache per core, shared 2MB L2 cache per cluster), the e6500 dual-thread CPU cluster at 1.8GHz (32 KB I-Cache and 32 KB D-Cache per core, shared L2 cache), the MAPLE B3 baseband accelerators, the PAMU and two 64-bit DDR3 1.866GHz DRAM controllers, each behind a 512KB L3 cache, all connect to the CoreNet Coherency Fabric at 667MHz. Arrows 1 to 3 mark the steps below.]
• Step 1 - Initiator (Core/Accelerator/IO) requests data from the CoreNet fabric
• Step 2 - CoreNet broadcasts the request to the relevant initiators/targets that might have the data
• Step 3 - An initiator/target which has the latest data responds and the data reaches the requestor (and optionally from memory) without a need for SW intervention
[Figure: B4860 block diagram for a coherent write (SC3900 FVP clusters, e6500 CPU cluster, MAPLE B3, PAMU, DDR3 controllers with 512KB L3 caches, CoreNet Coherency Fabric at 667MHz). Arrows 1 to 3 mark the steps below.]
• Step 1 - Initiator (Core/Accelerator/IO) informs the CoreNet fabric of its intention to update the memory
• Step 2 - CoreNet broadcasts the request to find the relevant caches that might hold old copies of the data
• Step 3 - Initiators which hold an old version of the data invalidate it and the write data is written by the requestor
[Figure: B4860 block diagram for a write to a designated cache (SC3900 FVP clusters, e6500 CPU cluster, MAPLE B3, PAMU, DDR3 controllers with 512KB L3 caches, CoreNet Coherency Fabric at 667MHz). Arrows 1 to 3 mark the steps below.]
• Step 1 - Initiator (Core/Accelerator/IO) informs the CoreNet fabric of its intent to write data to a specific cache (or cache and memory)
• Step 2 - CoreNet broadcasts the request to find the relevant caches that might hold old copies of the data
• Step 3 - The data is written only to the designated target(s)
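The three fabric transactions on the preceding slides (coherent read, write with invalidation, and a write steered to a designated cache) can be mimicked with a toy model. This is a deliberately simplified sketch for intuition only: the class and method names are invented, and it ignores cache-line states, ordering and everything else the real CoreNet protocol handles.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                      # address -> data


class ToyFabric:
    """Toy snooping fabric; a stand-in for illustration, not the CoreNet protocol."""

    def __init__(self, caches):
        self.caches = {cache.name: cache for cache in caches}
        self.memory = {}

    def coherent_read(self, requester, addr):
        # Snoop the other caches first; fall back to memory if nobody holds the line.
        for name, cache in self.caches.items():
            if name != requester and addr in cache.lines:
                data = cache.lines[addr]
                break
        else:
            data = self.memory.get(addr)
        self.caches[requester].lines[addr] = data
        return data

    def coherent_write(self, requester, addr, data):
        # Invalidate stale copies elsewhere, then let the requester write.
        for name, cache in self.caches.items():
            if name != requester:
                cache.lines.pop(addr, None)
        self.caches[requester].lines[addr] = data

    def stash(self, addr, data, targets, to_memory=False):
        # Drop old copies everywhere, then place data only in the designated targets.
        for cache in self.caches.values():
            cache.lines.pop(addr, None)
        for name in targets:
            self.caches[name].lines[addr] = data
        if to_memory:
            self.memory[addr] = data


fabric = ToyFabric([Cache("dsp_l2"), Cache("cpu_l2")])
fabric.memory[0x100] = "old"
print(fabric.coherent_read("cpu_l2", 0x100))      # served from memory
fabric.coherent_write("dsp_l2", 0x100, "new")     # cpu_l2 copy is invalidated
print(fabric.coherent_read("cpu_l2", 0x100))      # served from dsp_l2
fabric.stash(0x200, "iq-samples", targets=["dsp_l2"])
```

In all three cases software never copies or flushes anything by hand, which is the point the slides make about reduced multicore development effort.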
• B4860 instantiates two OCeaN DMAs
Transfer data between PCI/SRIO and the device memories
Transfer data between two locations in the memory
• The CPRI complex enables communication among radio devices over a CPRI
bus.
• The CPRI complex is designed to support the CPRI V4.2 specification and
can be configured to support several air interface standards, including
WiMAX, LTE, and WCDMA.
• The complex supports up to 8 CPRI links (4 pairs) with each link configurable
as a master or slave port.
• Up to 9.8Gbps per lane
• Each CPRI link supports three types of service access points (SAPs):
• IQ samples for antenna transferred through the SAP IQ Interface
• CPRI frames synchronized by the SAP synchronization interface
• CPRI link control and management (C&M) data transferred between
SAPs in both CPRI master and slave ports.
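To give a feel for what one 9.8Gbps CPRI link carries, the sketch below estimates how many 20 MHz LTE antenna-carriers (AxC) fit on it. The line rate, 8b/10b coding and one control word per 16-word basic frame follow the CPRI specification, and 30.72 Msps is the standard sample rate for 20 MHz LTE. The 15-bit I and Q sample width is an assumption, since the slides do not give one.

```python
LINE_RATE_BPS = 9_830_400_000    # CPRI 9.8304 Gbps line rate ("9.8G")
LTE20_SAMPLE_RATE = 30_720_000   # samples per second for a 20 MHz LTE carrier
SAMPLE_BITS = 15                 # assumption: bits per I and per Q sample

# 8b/10b line coding leaves 8/10 of the bits; 15 of every 16 words carry IQ data.
iq_payload_bps = LINE_RATE_BPS * 8 // 10 * 15 // 16
axc_bps = LTE20_SAMPLE_RATE * 2 * SAMPLE_BITS   # I and Q for one antenna-carrier
axc_per_link = iq_payload_bps // axc_bps

print(f"IQ payload per link : {iq_payload_bps / 1e6:.1f} Mbps")
print(f"one 20 MHz LTE AxC  : {axc_bps / 1e6:.1f} Mbps")
print(f"AxC per 9.8G link   : {axc_per_link}")
```

Under these assumptions each link carries eight 20 MHz antenna-carriers; a wider sample format reduces that count.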
Programmable System Interface (PSIF)
Central programmable control for:
• Tasks Scheduling
• Efficient DMAing from/to SoC and internally
• Flexible processing flow allows multiple standards support
• BD parsing and job configurations
• Interrupts handling
• Internal Embedded Data Flows
Processing Elements (PE-s)
• Highly efficient HW implementation of computationally intensive baseband algorithms
• Lego like concept allowing:
− Fast solution derivation (Macro to Femto)
− Use of algorithms commonality between technologies (LTE, WCDMA, CDMA, WiMAX)
[Figure: the PSIF sits between the SoC/FVP clusters and the processing elements PE1, PE2 ... PEN.]
MAPLE-B3
• LTE/LTE-Advanced, HSPA/WCDMA, WiMAX and Multi-Mode acceleration solution.
[Figure: LTE Layer 1 processing chains per 3GPP TS36.211/212/213, with each block marked as handled by a MAPLE-PE, a MAPLE embedded flow or an SC3900 core.
PUSCH Processing (towards Layer-2): HARQ combine, Rate DeMatch, Turbo Decode, CRC24b, Code Block Assembly, Transport CRC24a, plus DTX test, decoded RI/ACK bits and CQI/PMI controls.
PDSCH Processing (from Layer-2): Transport CRC24a, Code Block Segment, CRC24b, Turbo Encoder, Rate Match, HARQ, Scrambling, QAM Mapper, Layer Mapper, MIMO Precode, Resource Block Mapper, Physical Downlink Reference Symbol Generation, IFFT, Guard Insertion and Cyclic Prefix, to Antenna.]
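The CRC24a blocks in the chains above attach and check the transport block CRC defined in 3GPP TS 36.212. A bit-serial software model is sketched below to show what that stage computes; the generator polynomial is the gCRC24A value from the specification, while the function names are made up and the code is far slower than any hardware or vectorised implementation.

```python
CRC24A_POLY = 0x864CFB   # gCRC24A from 3GPP TS 36.212; the D^24 term is implicit


def crc24(bits, poly=CRC24A_POLY):
    """Bit-serial CRC-24 over a list of 0/1 values, zero initial state, MSB first."""
    reg = 0
    for bit in bits:
        feedback = ((reg >> 23) & 1) ^ bit
        reg = (reg << 1) & 0xFFFFFF
        if feedback:
            reg ^= poly
    return reg


def attach_crc(bits):
    """Append the 24 parity bits, most significant first (transmit side)."""
    crc = crc24(bits)
    return bits + [(crc >> shift) & 1 for shift in range(23, -1, -1)]


def crc_ok(block):
    """A block with a valid CRC leaves a zero remainder (receive side)."""
    return crc24(block) == 0


payload = [1, 0, 1, 1, 0, 0, 1, 0] * 5
block = attach_crc(payload)
corrupted = block.copy()
corrupted[3] ^= 1
print(crc_ok(block), crc_ok(corrupted))
```

The receive-side check is what lets the uplink chain report a pass or fail per transport block to Layer-2.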
• Acceleration of frame/packet processing
− Network protocols (Layer-1 to 4)
− Standard algorithmics (security and content processing)
• Classification and Distribution of data flows among cores and software partitions
− Load balancing through parse/classify/distribute
− Load spreading through queues shared among multiple consumers
• Abstract and manage efficiently Intercore communications and the access to shared resources (NW interfaces, HW-Accelerators, Buffers, Queues)
− More sophisticated approach compared to basic BD/buffer list
• Scalability and Portability
− Across ‘any’ mix of cores, accelerators and device boundaries
− Across device generations
[Figure: DPAA block diagram. Ethernet interfaces feed the Frame Manager and SRIO interfaces feed RMan; the Frame Manager, SEC, Queue Manager, RMan and Buffer Manager connect through CoreNet™ to the cores (each with D$, I$ and L2$) and to memory.]
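The parse/classify/distribute idea above can be sketched in a few lines: a key built from parsed header fields is hashed to pick a frame queue, so every frame of one flow lands on the same queue (and stays in order) while different flows spread across consumers. Everything here is illustrative: the function names, the queue count and the use of CRC-32 as the hash are assumptions, not the DPAA hardware's actual key generation.

```python
import zlib

NUM_FRAME_QUEUES = 4   # hypothetical: one frame queue per consuming core


def frame_queue_for(flow_key, num_queues=NUM_FRAME_QUEUES):
    """Map a parsed flow key to a frame queue ID; one flow always maps to one queue."""
    key_bytes = "|".join(str(field) for field in flow_key).encode()
    return zlib.crc32(key_bytes) % num_queues


def distribute(frames, num_queues=NUM_FRAME_QUEUES):
    """Enqueue each frame on the queue selected by its flow key."""
    queues = {fqid: [] for fqid in range(num_queues)}
    for frame in frames:
        queues[frame_queue_for(frame["flow"], num_queues)].append(frame)
    return queues


# Flow key: (source IP, destination IP, protocol, source port, destination port)
flow_a = ("10.0.0.1", "10.0.0.9", 17, 2152, 2152)
flow_b = ("10.0.0.2", "10.0.0.9", 6, 40000, 443)
frames = [{"flow": flow, "seq": n} for n in range(3) for flow in (flow_a, flow_b)]
queues = distribute(frames)
for fqid, queued in queues.items():
    print(fqid, [(frame["flow"][0], frame["seq"]) for frame in queued])
```

Cores then dequeue from their own queue without coordinating with each other, which is the load-balancing property the slide refers to.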
[Figure: cores (each with D$ and I$), a HW accelerator, network interfaces and the Buffer Manager.]
Frame Manager is responsible for preprocessing and moving Ethernet packets into and out of the datapath
• General
− Supports offline PCD on frames extracted from QMan
− Supports “Independent” mode (no work with BMan & QMan, BD ring model)
− Per port egress rate limiting
− Statistics & Multicast support
− Support for IEEE1588 thru HW-Timestamping
• Parsing
− Packet Parsing at wire speed
− Supports standard protocols parsing and identification by HW (VLAN/IP/UDP/TCP/SCTP/PPPoE/PPP/MPLS/GRE/IPSec …)
• Classification / Distribution
− Coarse classification based on Key generation Hash and exact match lookup
− Supports aggregated speed of 20Gbps, 30Mpps@667MHz
− Lookups configured by user, can be chained
− Classification result is frame queue ID, storage profile and policing profile.
[Figure: Parse, Classify, Distribute block with buffers, two 1G/2.5G/10G ports and four 1G/2.5G ports.]
• Ingress Policing
− Two rate, three colour marking algorithm (RFC 2698 & 4115)
− Up to 256 internal profiles
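For readers unfamiliar with two-rate three-colour marking, here is a minimal software sketch of the colour-blind marker from RFC 2698 (the RFC 4115 variant is not shown). It only illustrates the algorithm the policer implements in hardware; the class name, parameter names and byte-based units are choices made for this example.

```python
class TwoRateThreeColorMarker:
    """Colour-blind trTCM in the style of RFC 2698; rates in bytes/s, bursts in bytes."""

    def __init__(self, cir, cbs, pir, pbs):
        self.cir, self.cbs = cir, cbs        # committed rate and burst size
        self.pir, self.pbs = pir, pbs        # peak rate and burst size
        self.tc, self.tp = float(cbs), float(pbs)   # both buckets start full
        self.last = 0.0

    def mark(self, now, size):
        # Refill both token buckets for the elapsed time, capped at the burst sizes.
        elapsed = now - self.last
        self.last = now
        self.tc = min(self.cbs, self.tc + self.cir * elapsed)
        self.tp = min(self.pbs, self.tp + self.pir * elapsed)
        if self.tp < size:                   # exceeds the peak rate
            return "red"
        if self.tc < size:                   # within peak, above committed
            self.tp -= size
            return "yellow"
        self.tp -= size                      # within the committed rate
        self.tc -= size
        return "green"


marker = TwoRateThreeColorMarker(cir=1000, cbs=1500, pir=2000, pbs=3000)
burst = [marker.mark(0.0, 1500) for _ in range(3)]
later = marker.mark(1.0, 1500)
print(burst, later)   # -> ['green', 'yellow', 'red'] yellow
```

A back-to-back burst drains the committed bucket first and then the peak bucket, so successive frames degrade from green to yellow to red until tokens refill.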
• Dual-AMC form factor
• Standalone operation or pluggable into open-top standard MicroTCA chassis
• 2x DDR3
− 4GB Dual rank 64b/72b, 1.866GHz w/ ECC
− 2GB 64b/72b, 1.866GHz w/ ECC
• Ethernet –
− Up to 6x 1G/2.5G SGMII
− Up to 2x 10G XFI/XAUI