
Lecture 2,3,4,5

A Single Board Computer (SBC) is a complete computer built on a single circuit board, integrating essential components like the microprocessor, memory, and I/O features, often used for educational purposes or as embedded controllers. Popular examples include the Raspberry Pi and BeagleBone, which serve various applications such as IoT and robotics, with different models offering varying capabilities. The Raspberry Pi features a Broadcom SoC, GPIO pins for interfacing, and supports various peripherals, making it a versatile platform for developers and hobbyists.

Single Board Computer

What is a Single Board Computer?


• A single board computer, or SBC, is a complete
computer built on a single circuit board,
with memory, microprocessor, I/O and the other
features needed for a functional computer.
• These are made as development systems, for
educational systems, or as embedded computer
controllers.
• Many kinds of portable or home computers are
integrated onto a single PCB (printed circuit
board).
• Unlike a desktop PC, single board computers frequently do
not rely on expansion slots for peripheral functions.
• Some single board computers are made to plug into a
backplane for system expansion.
• Single board computers have been built using a
wide range of microprocessors.
• E.g.: Raspberry Pi, BeagleBoard, BeagleBone, etc.
Single-board computer (SBC)
• A single-board computer (SBC) is a complete computer built on a
single circuit board, with microprocessor(s), memory, input/output
(I/O) and other features required of a functional computer. Single-
board computers are commonly made as demonstration or
development systems, for educational systems, or for use as
embedded computer controllers. Many types of home computers or
portable computers integrate all their functions onto a single printed
circuit board.
Unlike a desktop personal computer, single board computers often do
not rely on expansion slots for peripheral functions or expansion. Single
board computers have been built using a wide range of microprocessors.
Simple designs often use static RAM and low-cost 8- or 16-bit
processors.
Comparison of Raspberry Pi & BeagleBone
• Raspberry Pi is a small single board computer. By connecting
peripherals like Keyboard, mouse, display to the Raspberry Pi, it will act
as a mini personal computer.
• Raspberry Pi is popularly used for real time Image/Video Processing, IoT
based applications and Robotics applications.
• Raspberry Pi is slower than laptop or desktop but is still a computer
which can provide all the expected features or abilities, at a low power
consumption.
• Raspberry Pi is the name of a series of single board computers
made by the Raspberry Pi Foundation to educate people in computing.
Features
Quad-core 64-bit ARM Cortex-A53 processor.
4 USB ports.
HDMI video output.
VideoCore IV multimedia.
BeagleBone

BeagleBone Black is a low-cost, community-supported development
platform for developers and hobbyists. Boot Linux in under 10 seconds
and get started on development in less than 5 minutes with just a single
USB cable.
Feature
3D graphics
NEON Floating point
5V DC External via Expansion Header
Optional onboard serial header
Has Wi-Fi , Bluetooth
Basic Version Board Coprocessor
• Raspberry Pi has basic version board processor.
Architecture of Raspberry pi
Main blocks of Raspberry pi
Raspberry Pi consists of the following blocks:
○ Processor: Broadcom BCM2835
○ GPIO ports
○ RAM: 512 MB
○ USB: 2 USB 2.0
○ Network: Ethernet
○ Video out: HDMI
○ Audio out: 3.5 mm jack
○ SD Card Storage (Up to 32GB)
○ Micro USB power
○ Display Serial Interface Port (DSI)
○ Camera Serial Interface Port (CSI)
Pin description of Raspberry pi

A powerful feature of the Raspberry Pi is the row of
GPIO (general-purpose input/output) pins along the top
edge of the board. A 40-pin GPIO header is found on all
current Raspberry Pi boards.
Any of the GPIO pins can be designated (in software) as an input or output
pin and used for a wide range of purposes.

Voltages
Two 5V pins and two 3V3 pins are present on the board, as well as a
number of ground pins (0V), which are unconfigurable. The remaining pins
are all general purpose 3V3 pins, meaning outputs are set to 3V3 and
inputs are 3V3-tolerant.

Outputs
A GPIO pin designated as an output pin can be set to high (3V3) or low
(0V).
Inputs
A GPIO pin designated as an input pin can be read as high (3V3) or
low (0V). This is made easier with the use of internal pull-up or pull-
down resistors. Pins GPIO2 and GPIO3 have fixed pull-up resistors, but
for other pins this can be configured in software.
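
The effect of the pull-up and pull-down resistors described above can be sketched with a small software model. This is not a hardware driver and does not touch real pins; the names Pull and read_input are hypothetical and exist only for this illustration.

```python
from enum import Enum


class Pull(Enum):
    """Hypothetical labels for the internal resistor setting of an input pin."""
    NONE = "none"
    UP = "up"
    DOWN = "down"


def read_input(driven_level, pull):
    """Return the logic level seen on an input pin.

    driven_level is True (3V3), False (0V), or None when nothing
    is driving the pin (for example an open push button).
    """
    if driven_level is not None:
        # An external driver overrides the weak internal resistor.
        return driven_level
    if pull is Pull.UP:
        return True   # undriven pin is held at 3V3
    if pull is Pull.DOWN:
        return False  # undriven pin is held at 0V
    raise ValueError("floating input: level is undefined without a pull resistor")


# A button wired between the pin and ground, with the pull-up enabled:
print(read_input(None, Pull.UP))   # button released
print(read_input(False, Pull.UP))  # button pressed
```

The model shows why a pull resistor makes inputs easier to use: without one, an undriven pin has no defined level.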

More
As well as simple input and output devices, the GPIO pins can be used
with a variety of alternative functions, some are available on all pins,
others on specific pins.
PWM (pulse-width modulation)
Software PWM available on all pins
Hardware PWM available on GPIO12, GPIO13, GPIO18, GPIO19
SPI
SPI0: MOSI (GPIO10); MISO (GPIO9); SCLK (GPIO11); CE0 (GPIO8), CE1
(GPIO7)
SPI1: MOSI (GPIO20); MISO (GPIO19); SCLK (GPIO21); CE0 (GPIO18); CE1
(GPIO17); CE2 (GPIO16)
I2C
Data: (GPIO2); Clock (GPIO3)
EEPROM Data: (GPIO0); EEPROM Clock (GPIO1)
Serial
TX (GPIO14); RX (GPIO15)
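
Because several alternative functions share the same pins, it can help to keep the assignments listed above in a lookup table. The sketch below only encodes the pin numbers from this slide; the names PIN_FUNCTIONS, HARDWARE_PWM and functions_on are hypothetical.

```python
# Alternative-function assignments copied from the list above (GPIO numbers).
PIN_FUNCTIONS = {
    "SPI0": {"MOSI": 10, "MISO": 9, "SCLK": 11, "CE0": 8, "CE1": 7},
    "SPI1": {"MOSI": 20, "MISO": 19, "SCLK": 21, "CE0": 18, "CE1": 17, "CE2": 16},
    "I2C": {"Data": 2, "Clock": 3},
    "EEPROM": {"Data": 0, "Clock": 1},
    "Serial": {"TX": 14, "RX": 15},
}
HARDWARE_PWM = {12, 13, 18, 19}


def functions_on(gpio):
    """List every alternative function that the given GPIO number can carry."""
    names = [
        f"{bus}.{signal}"
        for bus, signals in PIN_FUNCTIONS.items()
        for signal, number in signals.items()
        if number == gpio
    ]
    if gpio in HARDWARE_PWM:
        names.append("PWM (hardware)")
    return names


print(functions_on(18))  # GPIO18 is shared by SPI1 and hardware PWM
print(functions_on(4))   # a pin with no alternative function in this list
```

A check like this makes pin conflicts visible before wiring, for example when a design wants both SPI1 and hardware PWM.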
PIN GROUP | PIN NAME | DESCRIPTION
POWER SOURCE | +5V, +3.3V, GND and Vin | +5V: power output. +3.3V: power output. GND: ground pin.
COMMUNICATION INTERFACE | UART Interface (RXD, TXD) [GPIO15, GPIO14] | UART (Universal Asynchronous Receiver Transmitter) used for interfacing sensors and other devices.
COMMUNICATION INTERFACE | SPI Interface (MOSI, MISO, CLK, CE) x 2 [SPI0: GPIO10, GPIO9, GPIO11, GPIO8] [SPI1: GPIO20, GPIO19, GPIO21, GPIO18] | SPI (Serial Peripheral Interface) used for communicating with other boards or peripherals.
COMMUNICATION INTERFACE | TWI Interface (SDA, SCL) x 2 [GPIO2, GPIO3] [ID_SD, ID_SC] | TWI (Two Wire Interface) can be used to connect peripherals.
INPUT OUTPUT PINS | 26 I/O | Although some of these pins have multiple functions, they can be considered as I/O pins.
PWM | Hardware PWM available on GPIO12, GPIO13, GPIO18, GPIO19 (*software PWM available on all pins) | These 4 channels can provide PWM (Pulse Width Modulation) outputs.
EXTERNAL INTERRUPTS | All I/O | All I/O pins on the board can be used as interrupts.
Comparison of Raspberry Pi models
• There are different versions of Raspberry Pi available, as listed below:
1. Raspberry Pi 1 Model A
2. Raspberry Pi 1 Model A+
3. Raspberry Pi 1 Model B
4. Raspberry Pi 1 Model B+
5. Raspberry Pi 2 Model B
6. Raspberry Pi 3 Model B
7. Raspberry Pi Zero
Features | Raspberry Pi Model B+ | Raspberry Pi 2 Model B | Raspberry Pi 3 Model B | Raspberry Pi Zero
SoC | BCM2835 | BCM2836 | BCM2837 | BCM2835
CPU | ARM11 | Quad Cortex A7 | Quad Cortex A53 | ARM11
Operating Freq. | 700 MHz | 900 MHz | 1.2 GHz | 1 GHz
RAM | 512 MB SDRAM | 1 GB SDRAM | 1 GB SDRAM | 512 MB SDRAM
GPU | 250 MHz Videocore IV | 250 MHz Videocore IV | 400 MHz Videocore IV | 250 MHz Videocore IV
Storage | micro-SD | micro-SD | micro-SD | micro-SD
Ethernet | Yes | Yes | Yes | No
Wireless | No | No | WiFi and Bluetooth | No


Architectural features of Raspberry Pi
• All models feature a Broadcom system on a chip (SoC), which includes an
ARM compatible central processing unit (CPU) and an on-chip graphics
processing unit (GPU, a VideoCore IV).

• CPU speed ranges from 700 MHz to 1.2 GHz for the Pi 3 and on board
memory range from 256 MB to 1 GB RAM.

• Secure Digital (SD) cards are used to store the operating system and
program memory in either the SDHC or MicroSDHC sizes.
• Most boards have between one and four USB slots, HDMI and
composite video output, and a 3.5 mm phono jack for audio.

• Lower level output is provided by a number of GPIO pins which
support common protocols like I²C.

• The B-models have an 8P8C Ethernet port and the Pi 3 and Pi Zero
W have on board Wi-Fi 802.11n and Bluetooth.
Peripherals used in BCM2835
• It contains following ARM peripherals:
• Timers
• Interrupt controller
• GPIO
• USB
• PCM / I2S
• DMA controller
• I2C master
• I2C / SPI slave
• SPI0, SPI1, SPI2
• PWM
• UART0, UART1
Raspberry pi processor
• The Broadcom BCM2835 SoC used in the first generation Raspberry
Pi includes a 700 MHz ARM1176JZF-S processor, VideoCore IV graphics
processing unit (GPU), and RAM. It has a level 1 (L1) cache of 16 KiB
and a level 2 (L2) cache of 128 KiB. The level 2 cache is used
primarily by the GPU. The SoC is stacked underneath the RAM chip, so
only its edge is visible. The ARM1176JZ(F)-S is the same CPU used in
the original iPhone, although at a higher clock rate, and mated with a
much faster GPU.
The earlier V1.1 model of the Raspberry Pi 2 used a Broadcom BCM2836 SoC
with a 900 MHz 32-bit, quad-core ARM Cortex-A7 processor, with 256 KiB
shared L2 cache. The Raspberry Pi 2 V1.2 was upgraded to a Broadcom
BCM2837 SoC with a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor, the
same SoC which is used on the Raspberry Pi 3, but underclocked (by default)
to the same 900 MHz CPU clock speed as the V1.1. The BCM2836 SoC is no
longer in production as of late 2016.

The Raspberry Pi 3 Model B uses a Broadcom BCM2837 SoC with a 1.2 GHz
64-bit quad-core ARM Cortex-A53 processor, with 512 KiB shared L2 cache.
The Raspberry Pi 3 Model A+ and B+ are clocked at 1.4 GHz.
The Raspberry Pi 4 uses a Broadcom BCM2711 SoC with a 1.5 GHz 64-
bit quad-core ARM Cortex-A72 processor, with 1 MiB shared L2
cache. Unlike previous models, which all used a custom
interrupt controller poorly suited for virtualisation, the interrupt
controller on this SoC is compatible with the ARM Generic Interrupt
Controller (GIC) architecture 2.0, providing hardware support for
interrupt distribution when using ARM virtualisation capabilities.
The Raspberry Pi Zero and Zero W use the same Broadcom BCM2835
SoC as the first generation Raspberry Pi, although now running at 1 GHz
CPU clock speed.
•ARM CPU/GPU -- This is a Broadcom BCM2835 System on a Chip (SoC)
that's made up of an ARM central processing unit (CPU) and a Videocore
4 graphics processing unit (GPU). The CPU handles all the computations
that make a computer work (taking input, doing calculations and
producing output), and the GPU handles graphics output.
•GPIO -- These are exposed general-purpose input/output connection
points that will allow the real hardware hobbyists the opportunity to
tinker.
•RCA -- An RCA jack allows connection of analog TVs and other similar
output devices.
•Audio out -- This is a standard 3.5-millimeter jack for connection of
audio output devices such as headphones or speakers. There is no audio
in.
•LEDs -- Light-emitting diodes, for all of your indicator light needs.
•USB -- This is a common connection port for peripheral devices of all
types (including your mouse and keyboard). Model A has one, and Model
B has two. You can use a USB hub to expand the number of ports or plug
your mouse into your keyboard if it has its own USB port.
•HDMI -- This connector allows you to hook up a high-definition
television or other compatible device using an HDMI cable.
•Power -- This is a 5V Micro USB power connector into which you can plug
your compatible power supply.
•SD card slot -- This is a full-sized SD card slot. An SD card with an operating
system (OS) installed is required for booting the device. They are available for
purchase from the manufacturers, but you can also download an OS and save
it to the card yourself if you have a Linux machine and the wherewithal.
•Ethernet -- This connector allows for wired network access and is only
available on the Model B.
Many of the features that are missing, such as WiFi and audio in, can be
added using the USB port(s) or a USB hub as needed.
The Broadcom BCM2835 is a System on Chip (SoC) with multimedia
capabilities, and usually utilised in mobile phones and portable
devices. It is a highly competitive industry and nearly all the big names
that manufacture SoC chips keep their designs a secret.

This IC provides a full HD video and stereo audio. You would expect to
find it in digital camcorders, digital cameras, mobile phones, games
boxes, and the applications are limitless. It has a VideoCore IV®
Multimedia Co-Processor, ARM1176JZF-S application processor, and
high performance OpenGL-ES® 1.1/2.0 GPU.
It is interesting to note that this chip also has an advanced Image Sensor
Pipeline (ISP) for connecting a 20 MP Camera. It can transfer data at the
rate of 220 MP/s. There is also an LCD MIPI type interface for driving a
high-resolution LCD panel at 1080p resolutions.
CPU Architecture
• The ARM1176JZ-S processor incorporates an integer core that
implements the ARM11 ARM architecture v6. It supports the ARM
and Thumb instruction sets, Jazelle technology to enable direct
execution of Java bytecodes, and a range of SIMD DSP instructions
that operate on 16-bit or 8-bit data values in 32-bit registers.
The ARM1176JZF-S (ARM11 family) has the following features:
• ARMv6 architecture
• Single core
• 32-bit RISC
• 700 MHz clock rate
• 8 pipeline stages
• Branch prediction
• Low interrupt latency configuration
• Internal coprocessors CP14 and CP15
• External coprocessor interface
• Instruction and data MMU
• 64-bit interface to both caches
• TCM which can be used as local RAM with DMA
• JTAG based debug
• Vector floating point
• provision for Intelligent Energy Management (IEM™)
• high-speed Advanced Microprocessor Bus Architecture
(AMBA) Advanced Extensible Interface (AXI) level two
interfaces supporting prioritized multiprocessor
implementations
• an integer core with integral Embedded ICE-RT logic
• Instruction and data caches, including a non-blocking data cache with
Hit-Under-Miss (HUM)
• virtually indexed and physically addressed caches
• 64-bit interface to both caches
• level one Tightly-Coupled Memory (TCM) that you can use as a local
RAM with DMA
• external coprocessor support
• trace support

Note: The only functional difference between the ARM1176JZ-S and
ARM1176JZF-S processors is that the ARM1176JZF-S processor includes a
Vector Floating-Point (VFP) coprocessor.
Components of ARM1176JZ-S Processor
● Core
● Load Store Unit
● Prefetch Unit
● Memory System
● Level One Memory System
● Interrupt Handling
● System Control
● AMBA Interface
● Coprocessor Interface
● Debug
● Instruction cycle summary and interlocks
● Vector Floating-point
ARM1176JZ-S architecture with Jazelle technology
The ARM1176JZ-S processor has three instruction sets:
• the 32-bit ARM instruction set used in ARM state, with media instructions
• the 16-bit Thumb instruction set used in Thumb state
• the 8-bit Java bytecodes used in Jazelle state.

1.Instruction compression: A typical 32-bit architecture can manipulate
32-bit integers with single instructions, and address a large address space
much more efficiently than a 16-bit architecture. When processing 32-bit
data, a 16-bit architecture takes at least two instructions to perform the
same task as a single 32-bit instruction.
When a 16-bit architecture has only 16-bit instructions, and a 32-bit
architecture has only 32-bit instructions, overall the 16-bit architecture has
higher code density, and greater than half the performance of the 32-bit
architecture.
Thumb implements a 16-bit instruction set on a 32-bit architecture, giving
higher performance than on a 16-bit architecture, with higher code density
than a 32-bit architecture.
The ARM1176JZ-S processor can easily switch between running in ARM state
and running in Thumb state. This enables you to optimize both code density
and performance to best suit your application requirements.
2.The Thumb instruction set
The Thumb instruction set is a subset of the most commonly used 32-bit ARM
instructions. Thumb instructions are 16 bits long, and have a corresponding
32-bit ARM instruction that has the same effect on the processor model.
Thumb instructions operate with the standard ARM register configuration,
enabling excellent interoperability between ARM and Thumb states.
Thumb has all the advantages of a 32-bit core:
• 32-bit address space
• 32-bit registers
• 32-bit shifter and Arithmetic Logic Unit (ALU)
• 32-bit memory transfer.
Thumb therefore offers a long branch range, powerful arithmetic operations,
and a large address space. The availability of both 16-bit Thumb and 32-bit
ARM instruction sets gives you the flexibility to emphasize performance or
code size on a subroutine level, according to the requirements of your
applications. For example, you can code critical loops for applications such as
fast interrupts and DSP algorithms using the full ARM instruction set, and
link them with Thumb code.
3.Java bytecodes
ARM architecture v6 with Jazelle technology executes variable length Java
bytecodes. Java bytecodes fall into two classes:
Hardware execution
Bytecodes that perform stack-based operations.

Software execution
Bytecodes that are too complex to execute directly in hardware are executed
in software. An ARM register is used to access a table of exception handlers to
handle these particular bytecodes.
1.Integer core
The ARM1176JZ-S processor is built around the ARM11 integer core. It is an
implementation of the ARMv6 architecture and runs the ARM, Thumb, and
Java instruction sets. The processor contains Embedded ICE-RT logic and a JTAG
debug interface to enable hardware debuggers to communicate with the
processor. The following sections describe the core in more detail:
•Instruction set categories
•Conditional execution
•Registers
•Modes and exceptions
•Thumb instruction set
•DSP instructions
•Media extensions
•Datapath
•Branch prediction
•Return stack
All sections of the integer core are discussed below.
i.Instruction set categories
The main instruction set categories are:
•branch instructions
•data processing instructions
•status register transfer instructions
•load and store instructions
•coprocessor instructions
•exception-generating instructions.
Note
Only load, store, and swap instructions can access data from memory.
ii.Conditional execution
The processor conditionally executes nearly all ARM instructions. You can
decide if the condition code flags, Negative, Zero, Carry, and Overflow, are
updated according to their result.
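
The interaction between flag-setting and conditional execution can be illustrated with a small simulation of a 32-bit add. This is a sketch of the general ARM flag rules, not code from the lecture; add32 and condition_passes are hypothetical helper names, and only a few condition codes are modelled.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, flags, set_flags):
    """Add two 32-bit values. The flags change only when set_flags is True."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    if set_flags:
        flags = {
            "N": bool(result >> 31),                         # result is negative
            "Z": result == 0,                                # result is zero
            "C": full > MASK32,                              # unsigned carry out
            "V": bool(((a ^ result) & (b ^ result)) >> 31),  # signed overflow
        }
    return result, flags


def condition_passes(cond, flags):
    """Decide whether an instruction with this condition code would execute."""
    table = {
        "EQ": flags["Z"],
        "NE": not flags["Z"],
        "MI": flags["N"],
        "PL": not flags["N"],
        "CS": flags["C"],
        "VS": flags["V"],
        "AL": True,
    }
    return table[cond]


clear = {"N": False, "Z": False, "C": False, "V": False}
result, flags = add32(0xFFFFFFFF, 1, clear, set_flags=True)
print(hex(result), flags, condition_passes("EQ", flags))
```

Passing set_flags=False models an instruction that leaves the flags untouched, so a later conditional instruction still sees the earlier result.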
iii.Registers
The ARM1176JZ-S core contains:
•33 general-purpose 32-bit registers
•7 dedicated 32-bit registers.

Note
At any one time, 16 general-purpose registers are visible. The remainder
are banked registers used to speed up exception processing.
iv.Modes and exceptions
The core provides a set of operating and exception modes, to support
systems combining complex operating systems, user applications, and
real-time demands. There are eight operating modes, six of them are
exception processing modes:
•User
•Supervisor
•fast interrupt
•normal interrupt
•Abort
•System
•Undefined
•Secure Monitor.
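
The eight modes above can be summarised in a small table. The 5-bit mode encodings in this sketch come from the ARM architecture (the mode field of the status register), not from this lecture, and the names MODES and mode_name are hypothetical.

```python
# name: (mode bits M[4:0] of the status register, is it an exception mode?)
MODES = {
    "User":             (0b10000, False),
    "Fast interrupt":   (0b10001, True),
    "Normal interrupt": (0b10010, True),
    "Supervisor":       (0b10011, True),
    "Secure Monitor":   (0b10110, True),
    "Abort":            (0b10111, True),
    "Undefined":        (0b11011, True),
    "System":           (0b11111, False),
}


def mode_name(status_register):
    """Decode the operating mode from the low five bits of a status word."""
    bits = status_register & 0x1F
    for name, (encoding, _) in MODES.items():
        if encoding == bits:
            return name
    raise ValueError(f"reserved mode encoding: {bits:#07b}")


exception_modes = [name for name, (_, is_exc) in MODES.items() if is_exc]
print(len(MODES), "modes,", len(exception_modes), "of them exception modes")
print(mode_name(0x600001D3))
```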
v.Thumb instruction set
The Thumb instruction set contains a subset of the most commonly-used 32-
bit ARM instructions encoded into 16-bit wide opcodes. This reduces the
amount of memory required for instruction storage.
vi.DSP instructions
The DSP extensions to the ARM instruction set provide:
•16-bit data operations
•saturating arithmetic
•MAC operations.
The processor executes multiply instructions using a single-cycle 32x16
implementation. The processor can perform 32x32, 32x16, and 16x16
multiply instructions (MAC).
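
Saturating arithmetic differs from ordinary wrap-around arithmetic only at the limits of the number range. The sketch below contrasts the two for signed 32-bit addition; the function names are hypothetical and the code models the arithmetic only, not any particular instruction encoding.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def wrapping_add(a, b):
    """Ordinary 32-bit signed add: the result wraps around on overflow."""
    total = (a + b) & 0xFFFFFFFF
    return total - 2**32 if total >= 2**31 else total


def saturating_add(a, b):
    """Saturating 32-bit signed add: the result is clamped to the range limits."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


print(wrapping_add(INT32_MAX, 1))    # wraps to the most negative value
print(saturating_add(INT32_MAX, 1))  # stays at the most positive value
```

Clamping is usually what signal-processing code wants: an overflowing audio sample becomes "as loud as possible" instead of flipping sign.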
vii.Media extensions
The ARMv6 instruction set provides media instructions to complement the DSP
instructions. There are four media instruction groups:

• Multiplication instructions for handling 16-bit and 32-bit data, including
dual-multiplication instructions that operate on both 16-bit halves of their
source registers. This group includes an instruction that improves the
performance and size of code for multi-word unsigned multiplications.

• Single Instruction Multiple Data (SIMD) instructions to perform operations on
pairs of 16-bit values held in a single register, or on sets of four 8-bit values
held in a single register. The main operations supplied are addition and
subtraction, selection, pack, and saturation.

• Instructions to extract bytes and half words from registers and zero-extend
or sign-extend them. These include a parallel extraction of two bytes followed
by extension of each byte to a half word.

• Unsigned Sum-of-Absolute-Differences (SAD) instructions. These are used in
MPEG motion estimation.
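
To make the SIMD and SAD groups concrete, the following sketch treats one 32-bit word as four independent 8-bit lanes. It models the idea only; lanes8, pack8, add8 and sad8 are hypothetical helper names, not ARM mnemonics.

```python
def lanes8(word):
    """Split a 32-bit word into four 8-bit lanes, least significant first."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


def pack8(lanes):
    """Pack four 8-bit lanes back into one 32-bit word."""
    return sum((value & 0xFF) << (8 * i) for i, value in enumerate(lanes))


def add8(a, b):
    """Lane-wise add: each 8-bit lane wraps on its own, with no carry between lanes."""
    return pack8([x + y for x, y in zip(lanes8(a), lanes8(b))])


def sad8(a, b):
    """Unsigned sum of absolute differences over the four 8-bit lanes."""
    return sum(abs(x - y) for x, y in zip(lanes8(a), lanes8(b)))


print(hex(add8(0x01FF0203, 0x01010101)))  # the 0xFF lane wraps to 0x00
print(sad8(0x10203040, 0x0F223040))       # how different are two 4-pixel groups?
```

A plain 32-bit add of the same two words would let the carry from the 0xFF lane spill into its neighbour, which is exactly what the lane-wise form avoids. The SAD result is the kind of "distance" score that motion estimation compares between pixel blocks.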
viii.Datapath
The datapath consists of three pipelines:
• ALU, shift and Sat pipeline
• MAC pipeline
• load or store pipeline, see Load Store Unit (LSU)
a)ALU, shift and Sat pipeline
The ALU, shift and Sat pipeline executes most of the ALU operations, and
includes a 32-bit barrel shifter. It consists of three pipeline stages:
Shift : The Shift stage contains the full barrel shifter. This stage performs all
shifts, including those required by the LSU.
The Shift stage implements saturating left shift that doubles the value of an
operand and saturates it.
ALU : The ALU stage performs all arithmetic and logic operations, and
generates the condition codes for instructions that set these flags.
The ALU stage consists of a logic unit, an arithmetic unit, and a flag
generator.
The pipeline logic evaluates the flag settings in parallel with the main
adder in the ALU. The flag generator is enabled only on flag-setting
operations.
The ALU stage separates the carry chains of the main adder for 8-bit and
16-bit SIMD instructions.
Sat : The Sat stage implements the saturation logic required by the various
classes of DSP instructions.
b)MAC pipeline
The MAC pipeline executes all of the enhanced multiply, and multiply-
accumulate instructions.
The MAC unit consists of a 32x16 multiplier and an accumulate unit that is
configured to calculate the sum of two 16x16 multiplies. The accumulate
unit has its own dedicated single register read port for the accumulate
operand.
To minimize power consumption, the processor only clocks each of the
MAC and ALU stages when required.
c)Return stack
The processor includes a three-entry return stack to accelerate returns
from procedure calls. For each procedure call, the processor pushes
the return address onto a hardware stack. When the processor
recognizes a procedure return, the processor pops the address held in
the return stack that the prefetch unit uses as the predicted return
address.
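
The push and pop behaviour of the return stack can be sketched as a tiny class. The three-entry depth comes from the text above; dropping the oldest entry when the stack is full is an assumption made for this illustration, and the class name ReturnStack is hypothetical.

```python
class ReturnStack:
    """Toy model of a small hardware return-address predictor."""

    def __init__(self, depth=3):
        self.depth = depth
        self.entries = []

    def call(self, return_address):
        """Record the return address of a procedure call."""
        if len(self.entries) == self.depth:
            # Assumption: when the stack is full, the oldest entry is lost.
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        """Pop the predicted return address, or None if nothing is stored."""
        return self.entries.pop() if self.entries else None


stack = ReturnStack()
for address in (0x100, 0x200, 0x300, 0x400):  # four nested calls
    stack.call(address)
print([stack.predict_return() for _ in range(4)])
```

With four nested calls and three entries, the three innermost returns are predicted and the outermost one has no prediction, which shows why deep call chains benefit less from a short return stack.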
2.Load Store Unit (LSU)
The Load Store Unit (LSU) manages all load and store operations. The
load-store pipeline decouples loads and stores from the MAC and ALU
pipelines. When the processor issues LDM and STM instructions to the
LSU pipeline, other instructions run concurrently, subject to the
requirements of supporting precise exceptions.

3)Prefetch unit
The prefetch unit fetches instructions from the instruction cache,
Instruction TCM, or from external memory and predicts the outcome
of branches in the instruction stream.
Branch prediction
The core uses both static and dynamic branch prediction. All
branches are predicted where the target address is an immediate
address, or fixed-offset PC-relative address.
The first level of branch prediction is dynamic, through a 128-
entry Branch Target Address Cache (BTAC). If the PC of a branch
matches an entry in the BTAC, the processor uses the branch history
and the target address to fetch the new instruction stream.
The processor might remove dynamically predicted branches from
the instruction stream, and might execute such branches in zero
cycles.
If the address mappings are changed, the BTAC must be flushed. A
BTAC flush instruction is provided in the CP15 coprocessor.
The processor uses static branch prediction to manage branches not
matched in the BTAC. The static branch predictor makes a prediction
based on the direction of the branches.
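
The two prediction levels can be sketched as a lookup followed by a fallback. This is a simplified model: the BTAC is represented as a plain dictionary rather than a 128-entry hardware structure, the branch history is reduced to a single taken/not-taken bit, and the static rule "backward branches taken, forward branches not taken" is an assumption about what "direction" means here. The function name predict is hypothetical.

```python
def predict(pc, target, btac):
    """Predict a branch at address pc whose encoded target is target.

    btac maps a branch address to (predicted_taken, cached_target).
    Returns (predictor_used, predicted_taken, next_fetch_address).
    """
    fall_through = pc + 4  # next sequential 32-bit ARM instruction
    if pc in btac:
        taken, cached_target = btac[pc]
        return ("dynamic", taken, cached_target if taken else fall_through)
    # Static fallback (assumed rule): backward branches are usually loops.
    taken = target < pc
    return ("static", taken, target if taken else fall_through)


btac = {0x2000: (True, 0x2400)}
print(predict(0x1000, 0x0F00, btac))  # backward branch, not in the BTAC
print(predict(0x1000, 0x1100, btac))  # forward branch, not in the BTAC
print(predict(0x2000, 0x2400, btac))  # branch with a BTAC entry
```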
4. Memory system :
The level-one memory system provides the core with:
• separate instruction and data caches
• separate instruction and data Tightly-Coupled Memories
• 64-bit datapaths throughout the memory system
• virtually indexed, physically tagged caches
• memory access controls and virtual memory management
• support for four sizes of memory page
• two-channel DMA into TCMs
• I-fetch, D-read/write interface, compatible with multi-layer AMBA AXI
• 32-bit dedicated peripheral interface
• export of memory attributes for second-level memory system.
The following sections describe the memory system in more detail:
• Instruction and data caches
• Cache power management
• Instruction and data TCM
• TCM DMA engine
• DMA features
• Memory Management Unit
i)Instruction and data caches
The core provides separate instruction and data caches. The cache has the
following features:
• Independent configuration of the instruction and data cache during
synthesis to sizes between 4KB and 64KB.
• 4-way set-associative instruction and data caches. You can lock each way
independently.
• Pseudo-random or round-robin replacement.
• Eight word cache line length.
• The MicroTLB entry determines whether cache lines are write-back or
write-through.
• Ability to disable each cache independently, using the system control
coprocessor.
• Data cache misses that are non-blocking. The processor supports up to
three outstanding data cache misses.
• Streaming of sequential data from LDM and LDRD operations, and
sequential instruction fetches.
• Critical word first filling of the cache on a cache-miss.
• You can implement all the cache RAM blocks, and the associated tag and
valid RAM blocks, using standard ASIC RAM compilers. This ensures optimum
area and performance of your design.
• Each cache line is marked with a Secure or Non-secure tag that defines if
the line contains Secure or Non-secure data.
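
The cache parameters above (4 ways, eight-word lines) fix how an address is split into tag, set index and line offset once a cache size is chosen. The sketch below works this out for a given size. It uses a single address for simplicity, whereas the real cache takes the index from the virtual address and the tag from the physical address; the function names are hypothetical.

```python
WAYS = 4
LINE_BYTES = 8 * 4  # eight 32-bit words per cache line


def cache_geometry(size_bytes):
    """Return (number of sets, offset bits, index bits) for a cache size."""
    sets = size_bytes // (WAYS * LINE_BYTES)
    offset_bits = LINE_BYTES.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(address, size_bytes):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes)
    offset = address & (LINE_BYTES - 1)
    index = (address >> offset_bits) & (sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


print(cache_geometry(16 * 1024))                  # a 16KB cache
print(split_address(0x12345678, 16 * 1024))
```

For a 16KB cache this gives 128 sets, so 5 offset bits and 7 index bits; the four lines that share one index are the four "ways" that can each be locked.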
ii)Cache power management:
To reduce power consumption, the core uses sequential cache operations
to reduce the number of full cache reads. If a cache read is sequential to
the previous cache read, and the read is within the same cache line, only
the data RAM set that was previously read is accessed. The core does not
access tag RAM during sequential cache operations. To reduce
unnecessary power consumption additionally, the core only reads the
addressed words within a cache line at any time.
iii)Instruction and data TCM :
Because some applications might not respond well to caching,
configurable memory blocks are provided for Instruction and Data Tightly
Coupled Memories (TCMs). These ensure high-speed access to code or
data.
An Instruction TCM typically holds an interrupt or exception code that the
processor must access at high speed, without any potential delay resulting
from a cache miss.
A Data TCM typically holds a block of data for intensive processing, such as
audio or video processing.
You can configure each TCM to be Secure or Non-secure.
Level one memory system
You can separately configure the size of the Instruction TCM (ITCM) and
the size of the Data TCM (DTCM) to be 0KB, 4KB, 8KB, 16KB, 32KB or
64KB. For each side (ITCM and DTCM):
• If you configure the TCM size to be 4KB you get one TCM, of 4KB, on this
side.
• If you configure the TCM size to be larger than 4KB you get two TCMs on
this side, each of half the configured size. So, for example, if you configure
an ITCM size of 16KB you get two ITCMs, each of size 8KB.
Table 1-1 lists all possible TCM configurations.
The TCM can be anywhere in the memory map. The INITRAM pin enables
booting from the ITCM
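
The sizing rule for each side can be written as a short function. This sketch follows the rule stated above; treating a 0KB configuration as "no TCM on this side" is an assumption, and the name tcm_layout is hypothetical.

```python
VALID_TCM_SIZES_KB = (0, 4, 8, 16, 32, 64)


def tcm_layout(configured_kb):
    """Return the sizes (in KB) of the TCMs created on one side."""
    if configured_kb not in VALID_TCM_SIZES_KB:
        raise ValueError(f"unsupported TCM size: {configured_kb}KB")
    if configured_kb == 0:
        return []             # assumption: 0KB means no TCM on this side
    if configured_kb == 4:
        return [4]            # a single 4KB TCM
    half = configured_kb // 2
    return [half, half]       # two TCMs, each half the configured size


for size in VALID_TCM_SIZES_KB:
    print(size, "KB ->", tcm_layout(size))
```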
iv)TCM DMA engine:
To support use of the TCMs by data-intensive applications, the core
provides two DMA channels to transfer data to or from the Instruction or
Data TCM blocks. DMA can proceed in parallel with CPU accesses to the
TCM blocks. Arbitration is on a cycle-by-cycle basis. The DMA channels
connect with the System-on-Chip (SoC) backplane through a dedicated 64-
bit AMBA AXI port.
The DMA controller is programmed using the CP15 system-control
coprocessor. DMA accesses can only be to or from the TCM, and an
external memory. There is no coherency support with the caches.
Only one of the two DMA channels can be active at any time.
v)DMA features:
The DMA controller has the following features:
• runs in background of CPU operations
• enables CPU priority access to TCM during DMA
• programmed with Virtual Addresses
• controls DMA to either the instruction or data TCM
• allocated by a privileged process (OS)
• software can check and monitor DMA progress
• interrupts on DMA event
• ability to configure each channel to transfer data between Secure TCM and
Secure external memory
vi)Memory Management Unit :
The Memory Management Unit (MMU) has a unified Translation Lookaside
Buffer (TLB) for both instructions and data.
The MMU includes a 4KB page mapping size to enable a smaller RAM and
ROM footprint for embedded systems and operating systems such as
Windows CE that have many small mapped objects.
The ARM1176JZ-S processor implements the Fast Context Switch Extension
(FCSE) and high vectors extension that are required to run Microsoft
Windows CE.
The MMU is responsible for protection checking, address translation,
and memory attributes, and some of these can be passed to an external
level two memory system. The memory translations are cached in
MicroTLBs for each of the instruction and data caches, with a single
Main TLB backing the MicroTLBs.
The MMU has the following features:
• matches Virtual Address, ASID, and NSTID
• each TLB entry is marked with the NSTID
• checks domain access permissions
• checks memory attributes
• translates virtual-to-physical address
• supports four memory page sizes
• maps accesses to cache, TCM, peripheral port, or external memory
• hardware handles TLB misses
• software control of TLB.
Paging
Four page sizes are supported:
• 16MB super sections
• 1MB sections
• 64KB large pages
• 4KB small pages.
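
Each page size determines how many low address bits pass through translation unchanged as the offset. The sketch below computes that split for the four sizes listed; PAGE_SIZES, offset_bits and split_virtual_address are hypothetical names, and the code ignores the translation tables themselves.

```python
PAGE_SIZES = {
    "supersection": 16 * 1024 * 1024,
    "section": 1024 * 1024,
    "large page": 64 * 1024,
    "small page": 4 * 1024,
}


def offset_bits(kind):
    """Number of low address bits that form the offset inside the page."""
    return PAGE_SIZES[kind].bit_length() - 1


def split_virtual_address(address, kind):
    """Return (page number, offset within page) for a 32-bit virtual address."""
    size = PAGE_SIZES[kind]
    return address // size, address % size


for kind in PAGE_SIZES:
    page, offset = split_virtual_address(0x12345678, kind)
    print(f"{kind}: {offset_bits(kind)} offset bits, page {page:#x}, offset {offset:#x}")
```

Larger pages leave fewer bits to translate, so one TLB entry covers more memory; smaller pages waste less RAM when many small objects are mapped.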
Domains
Sixteen access domains are supported.
TLB
A two-level TLB structure is implemented. Eight entries in the main TLB are
lockable.
Hardware TLB loading is supported, and is backwards compatible with
previous versions of the ARM architecture.
ASIDs
TLB entries can be global, or can be associated with particular processes or
applications using Application Space IDentifiers (ASIDs). ASIDs enable TLB
entries to remain resident during context switches to avoid subsequent
reload of TLB entries and also enable task-aware debugging.
NSTID
TrustZone extensions enable the system to mark each entry in the TLB as
Secure or Non-secure with the Non-Secure Table IDentifier (NSTID).
System control coprocessor
Cache, TCM, and DMA operations are controlled through a dedicated
coprocessor, CP15, integrated within the core. This coprocessor provides a
standard mechanism for configuring the level one memory system, and also
provides functions such as memory barrier instructions.
5) AMBA AXI interface
The bus interface provides high bandwidth connections between the
processor, second level caches, on-chip RAM, peripherals, and interfaces
to external memory.
There are separate bus interfaces for:
• instruction fetch, 64-bit data
• data read/write, 64-bit data
• peripheral access, 32-bit data
• DMA, 64-bit data.
All interfaces are AMBA AXI compatible. This enables them to be merged in
smaller systems.
Additional signals are provided on each port to support second-level cache.
The ports support the following bus transactions:
a)Instruction fetch
Servicing instruction cache misses and non cacheable instruction fetches.
b)Data read/write
Servicing data cache misses, hardware handled TLB misses, cache eviction
and non cacheable data reads and writes.
c)DMA
Servicing the DMA engine for writing and reading the TCMs. This behaves as
a single bidirectional port.
These ports enable several simultaneous outstanding transactions,
providing:
• high performance from second-level memory systems that support
parallelism
• high use of pipelined and multi-page memories such as SDRAM.
The following sections describe the AMBA AXI interface in more detail:
• Bus clock speeds
• Unaligned accesses
• Mixed-endian support
• Write buffer
• Peripheral port.
i)Bus clock speeds
The bus interface ports operate synchronously to the CPU clock if IEM is not
implemented.
ii)Unaligned accesses
The core supports unaligned data access. Words and halfwords can align to
any byte boundary. This enables access to compacted data structures with no
software overhead, which is useful for multi-processor applications and for
reducing memory space requirements.
The Bus Interface Unit (BIU) automatically generates multiple bus cycles for
unaligned accesses.
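To make the idea of "multiple bus cycles" concrete, the sketch below splits an access into pieces that each stay inside one aligned bus word. It assumes an 8-byte (64-bit) data bus, as listed for the data read/write port, and assumes the split happens at aligned bus-word boundaries. The function name `bus_beats` is invented for this example and the real BIU may sequence transfers differently.

```python
def bus_beats(address, size, bus_bytes=8):
    """Split an access into (address, length) pieces that do not cross an aligned bus word."""
    beats = []
    end = address + size
    while address < end:
        # First address of the next aligned bus word.
        boundary = (address // bus_bytes + 1) * bus_bytes
        length = min(end, boundary) - address
        beats.append((address, length))
        address += length
    return beats


if __name__ == "__main__":
    print(bus_beats(0x1000, 4))  # aligned word: one piece
    print(bus_beats(0x1006, 4))  # word straddling an 8-byte boundary: two pieces
```

Software sees a single load or store in both cases; only the number of bus transfers differs.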
iii)Mixed-endian support
The core provides the option of switching between little-endian and byte
invariant big endian data access modes. This means the core can share data
with big-endian systems, and improves the way the core manages certain
types of data.
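The difference between the two data access modes can be shown with a short Python sketch. In both modes each byte stays at the same address (this is what "byte invariant" refers to); only the way four consecutive bytes are assembled into a word changes. `load_word` is a hypothetical helper, not a processor or library function.

```python
def load_word(memory, address, big_endian=False):
    """Assemble a 32-bit word from four consecutive bytes of memory."""
    raw = bytes(memory[address:address + 4])
    return int.from_bytes(raw, "big" if big_endian else "little")


if __name__ == "__main__":
    memory = bytes([0x11, 0x22, 0x33, 0x44])
    print(hex(load_word(memory, 0)))                   # little-endian view
    print(hex(load_word(memory, 0, big_endian=True)))  # big-endian view
```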
iv)Write buffer
All memory writes take place through the write buffer. The write buffer
decouples the CPU pipeline from the system bus for external memory writes.
Memory reads are checked for dependency against the write buffer contents.
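A minimal software model of a write buffer with a read dependency check is sketched below. It is an illustration under stated assumptions: the class `WriteBuffer` and its methods are invented here, memory is a plain dictionary, and a dependent read is satisfied by returning the newest pending value. The real hardware may instead hold the read until the matching write has drained.

```python
from collections import deque


class WriteBuffer:
    def __init__(self, memory):
        self.memory = memory    # dict of address -> value, stands in for external memory
        self.pending = deque()  # queued (address, value) writes, oldest first

    def write(self, address, value):
        # The CPU side finishes immediately; the bus write happens later.
        self.pending.append((address, value))

    def read(self, address):
        # Dependency check: the newest pending write to this address wins.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.memory.get(address, 0)

    def drain_one(self):
        # Perform the oldest queued write on the (modelled) system bus.
        if self.pending:
            address, value = self.pending.popleft()
            self.memory[address] = value


if __name__ == "__main__":
    memory = {0x100: 1}
    buffer = WriteBuffer(memory)
    buffer.write(0x100, 7)
    print(memory[0x100], buffer.read(0x100))  # memory is stale, the read is not
    buffer.drain_one()
    print(memory[0x100])
```

Without the dependency check, a read issued right after a write to the same address could return the old value from memory.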
v)Peripheral port
The peripheral port is a 32-bit AMBA AXI interface that provides direct
access to local, non-shared devices separately. The peripheral port does not
use the main bus system. The memory regions that these non-shared
devices use are marked as Device and Non-Shared.
Accesses to these memory regions are routed to the peripheral port
instead of to the data read-write ports.
6.Coprocessor interface
The ARM1176JZ-S processor connects to external coprocessors through
the coprocessor interface. This interface supports all ARM coprocessor
instructions:
• LDC
• LDCL
• STC
• STCL
• MRC
• MRRC
• MCR
• MCRR
• CDP.
The memory system returns data for all loads to coprocessors in the
order of the accesses in the program. The processor suppresses
Hit-Under-Miss (HUM) operation of the cache for coprocessor instructions.
The external coprocessor interface relies on the coprocessor
executing all its instructions in order.
Externally-connected coprocessors follow the early stages of the core
pipeline to permit the exchange of instructions and data between the
two pipelines. The coprocessor runs one pipeline stage behind the
core pipeline.
To prevent the coprocessor interface introducing critical paths, wait
states can be inserted in external coprocessor operations. These wait
states enable critical signals to be retimed.
7. Debug
The ARM1176JZ-S core implements the ARMv6.1 Debug architecture
that includes extensions of the ARMv6 Debug architecture to support
TrustZone. It introduces three levels of debug:
• debug everywhere
• debug in Non-secure privileged and user, and Secure user
• debug in Non-secure only.
The core provides extensive support for real-time debug and performance
profiling.
The following sections describe debug in more detail:
• System performance monitoring
• ETM interface
• ETM trace buffer
• Software access to trace buffer
• Real-time debug facilities
• Debug and trace Environment
i)System performance monitoring
This is a group of counters that you can configure to monitor the operation
of the processor and memory system.

ii)ETM interface
You can connect an external Embedded Trace Macrocell (ETM) unit to the
processor for real-time code tracing of the core in an embedded system.
The ETM interface collects various processor signals and drives these
signals from the core. The interface is unidirectional and runs at the full
speed of the core. The ETM interface connects directly to the external ETM
unit without any additional glue logic. You can disable the ETM interface for
power saving.
iii)ETM trace buffer:
You can extend the functionality of the ETM by adding an on-chip trace
buffer. The trace buffer is an on-chip memory area. The trace buffer
stores trace information during capture that otherwise passes
immediately through the trace port at the operating frequency of the
core. When capture is complete the stored information can be read out
at a reduced clock rate from the trace buffer using the JTAG port of the
SoC, instead of through a dedicated trace port. This is a two-step process
that avoids you implementing a wide trace port that has many high-speed
device pins. In effect, a zero-pin trace port is created where the device
already has a JTAG port and associated pins.
iv)Software access to trace buffer:
You can access buffered trace information through an APB slave-based
memory-mapped peripheral included as part of the trace buffer. You can
perform internal diagnostics on a closed system where a JTAG port is not
normally brought out.

v)Real-time debug facilities:


The ARM1176JZ-S processor contains an EmbeddedICE-RT logic unit that
provides the following real-time debug facilities:
• up to six breakpoints
• thread-aware breakpoints
• up to two watchpoints
• Debug Communications Channel (DCC).
The EmbeddedICE-RT logic connects directly to the core and monitors the
internal address and data buses. You can access the EmbeddedICE-RT logic in
one of two ways:
• executing CP14 instructions
• through a JTAG-style interface and associated TAP controller.
The EmbeddedICE-RT logic supports two modes of debug operation:

1.Halting debug-mode:
On a debug event, such as a breakpoint or watchpoint, the debug logic stops
the core and forces the core into Debug state. This enables you to examine the
internal state of the core, and the external state of the system, independently
from other system activity. When the debugging process completes, the core
and system state is restored, and normal program execution resumes.
2.Monitor debug-mode
On a debug event, the core generates a debug exception instead of
entering Debug state, as in Halting debug-mode. The exception entry
activates a debug monitor program that debugs the processor while
still allowing critical interrupt service routines to run. The debug
monitor program communicates with the debug host over the DCC.
vi)Debug and trace Environment
Several external hardware and software tools are available for you to
enable:
• real-time debugging using the EmbeddedICE-RT logic
• execution trace using the ETM.
VFP11 coprocessor
• The VFP11 pipelines enable more than one instruction to be completed
per cycle.
• Instructions issued to the FMAC pipeline can complete out of order
with respect to operations in the LS and DS pipelines.
• Except for divide and square root operations, the pipelines support
single-cycle throughput for all single-precision operations and most
double-precision operations.
• Double-precision multiply and multiply and accumulate operations
have a two-cycle throughput.
• The LS pipeline is capable of supplying two single-precision operands or
one double-precision operand per cycle, balancing the data transfer
capability with the operand requirements.
8. Instruction cycle summary and interlocks
Complex instruction dependencies and memory system interactions make
it impossible to describe briefly the exact cycle timing behavior for all
instructions in all circumstances. The timings described here are accurate
in most cases. If precise timings are required you must use a cycle-
accurate model of the processor. Unless otherwise stated, cycle counts
and result latencies that this chapter describes are best case numbers.
They assume:
• no outstanding data dependencies between the current instruction and
a previous instruction
• the instruction does not encounter any resource conflicts
• all data accesses hit in the MicroTLB and Data Cache, and do not cross
protection region boundaries
• all instruction accesses hit in the Instruction Cache.
9. System control
The control of the memory system and its associated functionality, and
other system-wide control attributes are managed through a dedicated
system control coprocessor, CP15.
The purpose of the system control and configuration registers is to
provide overall management of:
• TrustZone behavior
• memory functionality
• interrupt behavior
• exception handling
• program flow prediction
• coprocessor access rights for CP0-CP13.
The system control and configuration registers also provide the processor
ID. The system control and configuration registers consist of three 32-bit
read only registers and eight 32-bit read/write registers.
10.Interrupt handling
Interrupt handling in the ARM1176JZ-S processor is compatible with previous
ARM architectures, but has several additional features to improve interrupt
performance for real-time applications.
The following sections describe interrupt handling in more detail:
• Vectored Interrupt Controller port
• Low interrupt latency configuration
• Configuration
• Exception processing enhancements
i)Vectored Interrupt Controller port
The core has a dedicated port that enables an external interrupt controller,
such as the ARM Vectored Interrupt Controller (VIC), to supply a vector
address along with an interrupt request (IRQ) signal. This provides faster
interrupt entry, but you can disable it for compatibility with earlier
interrupt controllers.
ii)Low interrupt latency configuration
• This mode minimizes the worst-case interrupt latency of the processor, with
a small reduction in peak performance, or instructions-per-cycle. You can tune
the behavior of the core to suit the requirements of the application.
• The low interrupt latency configuration disables HUM operation of the
cache. In low interrupt latency configuration, on receipt of an interrupt, the
ARM1176JZ-S processor:
  • abandons any pending restartable memory operations
  • restarts memory operations on return from the interrupt.
• To obtain maximum benefit from the low interrupt latency configuration,
software must only use multi-word load or store instructions that are fully
restartable. The software must not use multi-word load or store instructions
on memory locations that produce side-effects for the type of access
concerned. This applies to:
  • ARM: LDC, all forms of LDM, LDRD, and STC, and all forms of STM and STRD
  • Thumb: LDMIA, STMIA, PUSH, and POP.
• To achieve optimum interrupt latency, memory locations accessed with
these instructions must not have large numbers of wait-states associated
with them. To minimize the interrupt latency, the following is recommended:
  • multiple accesses to areas of memory marked as Device or Strongly
  Ordered must not be performed
  • accesses to slow areas of memory marked as Device or Strongly Ordered,
  that is, those that take many cycles to generate a response, must not be
  performed
  • SWP operations must not be performed to slow areas of memory.
iii)Configuration
You configure the processor for low interrupt latency mode by use of the
system control coprocessor. To ensure that a change between normal and
low interrupt latency configurations is synchronized correctly, you must
use software systems that only change the configuration while interrupts
are disabled.
iv)Exception processing enhancements
The ARMv6 architecture contains several enhancements to exception
processing, to reduce interrupt handler entry and exit time:
SRS Save return state to a specified stack frame.
RFE Return from exception.
CPS Directly modify the CPSR.
11.Vector Floating-Point (VFP)
The VFP coprocessor supports floating point arithmetic operations and is a
functional block within the ARM1176JZF-S processor. The VFP coprocessor
is mapped as coprocessor numbers 10 and 11. Software can determine
whether the VFP is present by the use of the Coprocessor Access Control
Register. See c1, Coprocessor Access Control Register on page 3-51 for more
details. The VFP implements the ARM VFPv2 floating point coprocessor
instruction set. It supports single and double-precision arithmetic on
vector-vector, vector-scalar, and scalar-scalar data sets. Vectors can consist
of up to eight single-precision, or four double-precision elements. The VFP
has its own bank of 32 registers for single-precision operands that you can:
• use in pairs for double-precision operands
• operate loads and stores of VFP registers in parallel with arithmetic
operations.
The VFP supports a wide range of single and double precision operations, including ABS, NEG,
COPY, MUL, MAC, DIV, and SQRT. The VFP effectively executes most of these in a single
cycle. Table 1-2 lists the exceptions. These issue latencies also apply to individual elements in
a vector operation.

Compliance with the IEEE 754 standard


The VFP supports all five floating point exceptions defined by the IEEE 754 standard:
• invalid operation
• divide by zero
• overflow
• underflow
• inexact.
You can individually enable or disable these exception traps. If disabled,
the default results defined by IEEE 754 are returned. All rounding modes
are supported, and basic single and basic double formats are used. For full
compliance, the VFP requires support code to handle arithmetic where
operands or results are de-norms. This support code is normally installed
on the Undefined instruction exception handler.

Flush-to-zero mode
A flush-to-zero mode is provided where a default treatment of de-norms is
applied. Table 1-3 lists the default behavior in flush-to-zero mode.
CPU Pipeline Stages
The pipeline consists of 3 groups of stages:

• Fetch stages

• Decode stage

• Execute stage
Pipeline stages

Figure shows:
• the two Fetch stages
• a Decode stage
• an Issue stage
• the four stages of the ARM1176JZ-S integer execution pipeline.
From Figure the pipeline operations are:
Fe1 First stage of instruction fetch where address is issued to memory and
data returns from memory

Fe2 Second stage of instruction fetch and branch prediction.

De Instruction decode.

Iss Register read and instruction issue.


Sh Shifter stage.

ALU Main integer operation calculation.

Sat Pipeline stage to enable saturation of integer results.

WBex Write back of data from the multiply or main execution pipelines.
MAC1 First stage of the multiply-accumulate pipeline.

MAC2 Second stage of the multiply-accumulate pipeline.

MAC3 Third stage of the multiply-accumulate pipeline.

ADD Address generation stage.

DC1 First stage of data cache access.

DC2 Second stage of data cache access.

WBls Write back of data from the Load Store Unit.


By overlapping the various stages of operation, the ARM1176JZ-S
processor maximizes the clock rate achievable to execute each
instruction. It delivers a throughput approaching one instruction for each
cycle.
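The claim of a throughput approaching one instruction per cycle can be checked with the usual idealized pipeline formula: the first instruction needs one cycle per stage, and every later instruction completes one cycle after the previous one. The sketch below assumes the eight integer pipeline stages listed above (Fe1, Fe2, De, Iss, Sh, ALU, Sat, WBex) and ignores interlocks and cache misses unless they are passed in as `stall_cycles`. The function names are made up for this example.

```python
def pipeline_cycles(instruction_count, stages=8, stall_cycles=0):
    """Idealized cycle count for a run of instructions through a linear pipeline."""
    if instruction_count == 0:
        return 0
    # Fill the pipeline once, then retire one instruction per cycle.
    return stages + instruction_count - 1 + stall_cycles


def instructions_per_cycle(instruction_count, stages=8, stall_cycles=0):
    return instruction_count / pipeline_cycles(instruction_count, stages, stall_cycles)


if __name__ == "__main__":
    for count in (1, 10, 100, 1000):
        print(count, pipeline_cycles(count), round(instructions_per_cycle(count), 3))
```

The fill cost of the pipeline is paid once, so its effect shrinks as the instruction stream gets longer.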
The Fetch stages can hold up to four instructions, where branch
prediction is performed on instructions ahead of execution of earlier
instructions.
The Issue and Decode stages can contain any instruction in parallel with a
predicted branch.
The Execute, Memory, and Write stages can contain a predicted branch,
an ALU or multiply instruction, a load/store multiple instruction, and a
coprocessor instruction in parallel execution.
Typical pipeline operations:
Figure shows all the operations in each of the pipeline stages in the ALU
pipeline, the load/store pipeline, and the HUM buffers.
Typical ALU pipeline operations:
Figure shows a typical ALU data processing instruction. The processor does not
use the load/store pipeline or the HUM buffer.
Typical multiply operation

Figure shows a typical multiply operation. The MUL instruction can loop in the
MAC1 stage until it has passed through the first part of the multiplier array
enough times. The MUL instruction progresses to MAC2 and MAC3 where it
passes through the second half of the array once to produce the final result.
Instruction progression
Figure shows an LDR/STR operation that hits in the data cache.
Fig. shows the progression of an LDM/STM operation that completes by use of the load/store pipeline. Other instructions
can use the ALU pipeline at the same time as the LDM/STM completes in the load/store pipeline.
Software Pipelining
Software pipelining (also known as loop pipelining and loop
folding) is a technique that overlaps loop iterations, i.e.,
a later iteration starts before earlier ones have finished. This
technique can increase performance but may also
increase register pressure (not a major problem in
reconfigurable array architectures with pipeline stages). One
of the most widely used software pipelining techniques is
iterative modulo scheduling.
Most optimizing compilers include software pipelining in
their set of optimizations. The technique is mostly applied at
the intermediate representation (IR) level of a program, but
it can also be applied at the source code level, in which
case it is considered a code transformation technique.
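A source-level version of the transformation can be sketched in Python. The loop body is split into three made-up steps (load, compute, store); the pipelined version has a prologue that fills the pipeline, a kernel in which each pass works on three different iterations, and an epilogue that drains it. This only shows the shape of the transformation. Python itself gains no speed from it, and a real compiler would choose the schedule using a technique such as modulo scheduling.

```python
def plain_loop(values):
    out = []
    for value in values:
        loaded = value          # step 1: load
        scaled = loaded * 2     # step 2: compute
        out.append(scaled + 1)  # step 3: store
    return out


def pipelined_loop(values):
    n = len(values)
    if n < 2:
        return plain_loop(values)  # too short to overlap anything
    out = []
    # Prologue: start iterations 0 and 1.
    loaded = values[0]
    scaled = loaded * 2
    loaded = values[1]
    # Kernel: one pass touches three different iterations.
    for i in range(2, n):
        out.append(scaled + 1)  # store for iteration i - 2
        scaled = loaded * 2     # compute for iteration i - 1
        loaded = values[i]      # load for iteration i
    # Epilogue: finish the last two iterations.
    out.append(scaled + 1)
    scaled = loaded * 2
    out.append(scaled + 1)
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(plain_loop(data))
    print(pipelined_loop(data))
```

The kernel keeps two values live across passes (`scaled` and `loaded`), which is the extra register pressure mentioned above.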
Cache organization
Each cache is implemented as a four-way set associative cache of
configurable size. The caches are virtually indexed and physically tagged. You
can configure the cache sizes in the range of 4 to 64KB. Both the Instruction
Cache and the Data Cache can provide two words per cycle for all requesting
sources.
Each cache way is architecturally limited to 16KB in size, because of the
limitations of the virtually indexed, physically tagged implementation. The
number of cache ways is fixed at four, but the cache way size can vary
between 1KB and 16KB in powers of 2. The line length is not configurable
and is fixed at eight words per line.
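The relationship between cache size, way size, and number of sets follows directly from the fixed parameters above (four ways, eight 32-bit words per line). The sketch below works it out and also splits an address into tag, set index, and line offset. It uses a single address for both index and tag, which is a simplification: in the real cache the index comes from the virtual address and the tag from the physical address. All names are invented for this example.

```python
LINE_BYTES = 8 * 4  # eight 32-bit words per line
WAYS = 4


def cache_geometry(cache_bytes):
    """Return (bytes per way, number of sets) for a given total cache size."""
    way_bytes = cache_bytes // WAYS
    sets = way_bytes // LINE_BYTES
    return way_bytes, sets


def split_cache_address(address, cache_bytes):
    """Return (tag, set index, byte offset within the line)."""
    _, sets = cache_geometry(cache_bytes)
    offset = address % LINE_BYTES
    index = (address // LINE_BYTES) % sets
    tag = address // (LINE_BYTES * sets)
    return tag, index, offset


if __name__ == "__main__":
    for kilobytes in (4, 16, 64):
        print(kilobytes, "KB cache:", cache_geometry(kilobytes * 1024))
    print([hex(part) for part in split_cache_address(0x12345678, 16 * 1024)])
```

A 4KB cache gives 1KB ways and a 64KB cache gives 16KB ways, which matches the 1KB to 16KB way size range stated above.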
Write operations must occur after the Tag RAM reads and
associated address comparisons are complete. A three-entry
Write Buffer is included in the cache to enable the written
words to be held until they can be written to cache. One or
two words can be written in a single store operation. The
addresses of these outstanding writes provide an additional
input to the Tag RAM comparison for reads.
To avoid a critical path from the Tag RAM comparison to the
enable signals for the data RAMs, there is a minimum of one
cycle of latency between the determination of a hit to a
particular way, and the start of writing to the data RAM of
that way. This requires the Data Cache Write Buffer to hold
three entries, for back-to-back writes. Accesses that read the
dirty bits must also check the Data Cache Write Buffer for
pending writes.
The cache dirty bits for the Data Cache are updated when the Data Cache
Write Buffer data is written to the RAM. This requires the dirty bits to be
held as a separate storage array. The dirty bits cannot be kept in the Tag
arrays, because the Tag arrays are not accessed during the data RAM writes.
Keeping them separate also permits the dirty bits to be implemented as a
small RAM.
The other main operations performed by the cache are cache line refills and
Write-Back. These occur to particular cache ways, that are determined at the
point of the detection of the cache miss by the victim selection logic.
To reduce overall power consumption, the number of full cache reads is
reduced by the sequential nature of many cache operations, especially on the
instruction side. On a cache read that is sequential to the previous cache read,
only the data RAM set that was previously read is accessed, if the read is within
the same cache line. The Tag RAM is not accessed at all during this sequential
operation.
To reduce unnecessary power consumption additionally, only the addressed
words within a cache line are read at any time. With the required 64-bit read
interface, this is achieved by disabling half of the RAMs on occasions when only
a 32-bit value is required. The implementation uses two 32-bit wide RAMs to
implement the cache data RAM, with the words of each line folded
into the RAMs on an odd and even basis. This means that cache refills can take
several cycles, depending on the cache line lengths. The cache line length is
eight words.
The control of the level one memory system and the associated
functionality, together with other system wide control attributes are
handled through the system control coprocessor, CP15.
System Control Coprocessor describes this.
Level one cache block diagram
Features of the cache system
The level one cache system has the following features:
•The cache is a Harvard implementation.
•The caches are lockable at a granularity of a cache way, using Format C
lockdown. See Cache control and configuration.
•Cache replacement policies are Pseudo-Random or Round-Robin, as
controlled by the RR bit in CP15 register c1. Round-Robin uses a single counter
for all sets, that selects the way used for replacement.
•Cache line allocation uses the cache replacement algorithm when all cache
lines are valid. If one or more lines is invalid, then the invalid cache line with
the lowest way number is allocated to in preference to replacing a valid cache
line. This mechanism does not allocate to locked cache ways unless all cache
ways are locked. See Cache miss handling when all ways are locked down.
•Data cache misses are nonblocking with three outstanding Data Cache
misses being supported.
•Streaming of sequential data from LDM and LDRD operations, and for
sequential instruction fetches is supported.
•Cache lines can contain either Secure or Non-secure data and the NS Tag, that
the MicroTLB provides, indicates when the cache line comes from Secure or
Non-secure memory.
•Cache lines can be either Write-Back or Write-Through, determined by the
MicroTLB entry.
•Only read allocation is supported.
•The cache can be disabled independently from the TCM, under control of the
appropriate bits in CP15 c1. The cache can be disabled in Secure state while
enabled in Non-secure state and enabled in Secure state while disabled in
Non-secure state.
• The CL bit in the system control coprocessor, see
c1, Non-Secure Access Control Register, reserves cache lockdown registers for
Secure world operation. When the CL bit is 0 the cache lockdown registers are
only available in the Secure world. When the CL bit is 1 they are available for
both Secure and Non-secure operation.
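The line allocation rules in the list above (prefer the lowest-numbered invalid way, never pick a locked way, otherwise use Round-Robin with a single counter shared by all sets) can be sketched as follows. This is a simplified model, not the hardware algorithm: the class `RoundRobinAllocator` is hypothetical, the way the counter skips locked ways is an assumption, and the case where every way is locked is left out because the text defers it to a separate section.

```python
class RoundRobinAllocator:
    def __init__(self, ways=4):
        self.ways = ways
        self.counter = 0  # single counter shared by all sets

    def choose_way(self, valid, locked):
        """valid and locked hold one boolean per way for the set being refilled."""
        unlocked = [way for way in range(self.ways) if not locked[way]]
        if not unlocked:
            return None  # all ways locked: handled elsewhere, not modelled here
        for way in unlocked:
            if not valid[way]:
                return way  # lowest-numbered invalid, unlocked way
        # Every candidate line is valid: step the counter to the next unlocked way.
        # Skipping locked ways like this is an assumption of the sketch.
        while locked[self.counter]:
            self.counter = (self.counter + 1) % self.ways
        victim = self.counter
        self.counter = (self.counter + 1) % self.ways
        return victim


if __name__ == "__main__":
    allocator = RoundRobinAllocator()
    print(allocator.choose_way([True, False, True, False], [False] * 4))
    full = [True] * 4
    print([allocator.choose_way(full, [False] * 4) for _ in range(5)])
```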
Branch folding
Branch folding is a technique where, on the prediction of most branches,
the branch instruction is completely removed from the instruction stream
presented to the execution pipeline. Branch folding can significantly
improve the performance of branches, taking the CPI for branches
significantly lower than 1.
Branch folding only operates in ARM and Thumb states.
Branch folding is done for all dynamically predicted branches, except that
branch folding is not done for:
•BL and BLX instructions, to avoid losing the link
•predicted branches onto branches
•branches that are breakpointed or have generated an abort when
fetched.
