Lecture 2,3,4,5
Voltages
Two 5V pins and two 3V3 pins are present on the board, as well as a
number of ground pins (0V), which are unconfigurable. The remaining pins
are all general purpose 3V3 pins, meaning outputs are set to 3V3 and
inputs are 3V3-tolerant.
Outputs
A GPIO pin designated as an output pin can be set to high (3V3) or low
(0V).
Inputs
A GPIO pin designated as an input pin can be read as high (3V3) or
low (0V). This is made easier with the use of internal pull-up or pull-
down resistors. Pins GPIO2 and GPIO3 have fixed pull-up resistors, but
for other pins this can be configured in software.
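The effect of a pull resistor on an undriven input can be modelled in a few lines. This is a pure-Python illustration, not a driver for real hardware; the function name read_input and its argument values are hypothetical.

```python
from typing import Optional


def read_input(driven_level: Optional[bool], pull: Optional[str]) -> Optional[bool]:
    """Return the logic level seen on an input pin.

    driven_level: True (3V3), False (0V), or None when nothing drives the pin.
    pull: "up", "down", or None when no pull resistor is enabled.
    """
    if driven_level is not None:
        # An external driver overrides the weak pull resistor.
        return driven_level
    if pull == "up":
        return True   # pulled to 3V3
    if pull == "down":
        return False  # pulled to 0V
    # Floating pin: the value read is unpredictable, modelled here as None.
    return None
```

With a pull-up enabled, an open switch reads high and a switch closed to ground reads low, which is why a button is usually wired between the pin and ground in that configuration.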
More
As well as simple input and output devices, the GPIO pins can be used
with a variety of alternative functions. Some are available on all pins,
others only on specific pins.
PWM (pulse-width modulation)
Software PWM available on all pins
Hardware PWM available on GPIO12, GPIO13, GPIO18, GPIO19
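The idea behind PWM can be sketched numerically: the pin switches between 0V and 3V3, and the fraction of each period spent high (the duty cycle) sets the average voltage. The helper names below are hypothetical, and the model ignores rise times and load effects.

```python
def pwm_average_voltage(duty_cycle: float, v_high: float = 3.3) -> float:
    """Average output voltage for a duty cycle between 0.0 and 1.0."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0.0 and 1.0")
    return duty_cycle * v_high


def pwm_high_time_ms(duty_cycle: float, frequency_hz: float) -> float:
    """Time in milliseconds that the pin stays high in each period."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0.0 and 1.0")
    period_ms = 1000.0 / frequency_hz
    return duty_cycle * period_ms
```

For example, a 50% duty cycle on a 3V3 pin averages about 1.65 V, and at 1 kHz a 25% duty cycle keeps the pin high for 0.25 ms of every 1 ms period.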
SPI
SPI0: MOSI (GPIO10); MISO (GPIO9); SCLK (GPIO11); CE0 (GPIO8), CE1
(GPIO7)
SPI1: MOSI (GPIO20); MISO (GPIO19); SCLK (GPIO21); CE0 (GPIO18); CE1
(GPIO17); CE2 (GPIO16)
I2C
Data: (GPIO2); Clock (GPIO3)
EEPROM Data: (GPIO0); EEPROM Clock (GPIO1)
Serial
TX (GPIO14); RX (GPIO15)
PIN GROUP / PIN NAME / DESCRIPTION
POWER SOURCE
  Pins: +5V, +3.3V, GND and Vin
  +5V: power output
  +3.3V: power output
  GND: ground pin
COMMUNICATION INTERFACE
  UART Interface (RXD, TXD) [(GPIO15, GPIO14)]
    UART (Universal Asynchronous Receiver Transmitter), used for interfacing sensors and other devices.
  SPI Interface (MOSI, MISO, CLK, CE) x 2
  [SPI0: (GPIO10, GPIO9, GPIO11, GPIO8)]
  [SPI1: (GPIO20, GPIO19, GPIO21, GPIO18)]
    SPI (Serial Peripheral Interface), used for communicating with other boards or peripherals.
  TWI Interface (SDA, SCL) x 2 [(GPIO2, GPIO3)] [(ID_SD, ID_SC)]
    TWI (Two Wire Interface), can be used to connect peripherals.
INPUT OUTPUT PINS
  26 I/O
    Although some of these pins have multiple functions, they can all be considered I/O pins.
PWM
  Hardware PWM available on GPIO12, GPIO13, GPIO18, GPIO19
  *Software PWM available on all pins
    These 4 channels can provide PWM (Pulse Width Modulation) outputs.
EXTERNAL INTERRUPTS
  All I/O
    All I/O pins on the board can be used as interrupts.
Comparison of Raspberry Pi models
• There are different versions of the Raspberry Pi available, as listed below:
1.Raspberry Pi 1 Model A
2.Raspberry Pi 1 Model A+
3.Raspberry Pi 1 Model B
4.Raspberry Pi 1 Model B+
5.Raspberry Pi 2 Model B
6.Raspberry Pi 3 Model B
7.Raspberry Pi Zero
Features          Raspberry Pi Model B+   Raspberry Pi 2 Model B   Raspberry Pi 3 Model B   Raspberry Pi Zero
Operating Freq.   700 MHz                 900 MHz                  1.2 GHz                  1 GHz
• CPU speed ranges from 700 MHz to 1.2 GHz (on the Pi 3), and on-board
memory ranges from 256 MB to 1 GB of RAM.
• Secure Digital (SD) cards are used to store the operating system and
program memory in either the SDHC or MicroSDHC sizes.
• Most boards have between one and four USB slots, HDMI and
composite video output, and a 3.5 mm phono jack for audio.
• The B-models have an 8P8C Ethernet port and the Pi 3 and Pi Zero
W have on board Wi-Fi 802.11n and Bluetooth.
Peripherals used in BCM2835
• The BCM2835 contains the following ARM peripherals:
• Timers
• Interrupt controller
• GPIO
• USB
• PCM / I2S
• DMA controller
• I2C master
• I2C / SPI slave
• SPI0, SPI1, SPI2
• PWM
• UART0, UART1
Raspberry Pi processor
• The Broadcom BCM2835 SoC used in the first-generation Raspberry
Pi includes a 700 MHz ARM1176JZF-S processor, a VideoCore IV graphics
processing unit (GPU), and RAM.
• It has a level 1 (L1) cache of 16 KiB and a level 2 (L2) cache of
128 KiB. The level 2 cache is used primarily by the GPU.
• The SoC is stacked underneath the RAM chip, so only its edge is
visible.
• The ARM1176JZ(F)-S is the same CPU used in the original iPhone,
although at a higher clock rate, and mated with a much faster GPU.
The earlier V1.1 model of the Raspberry Pi 2 used a Broadcom BCM2836 SoC
with a 900 MHz 32-bit, quad-core ARM Cortex-A7 processor, with 256 KiB
shared L2 cache. The Raspberry Pi 2 V1.2 was upgraded to a Broadcom
BCM2837 SoC with a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor, the
same SoC that is used on the Raspberry Pi 3, but underclocked (by default)
to the same 900 MHz CPU clock speed as the V1.1. The BCM2836 SoC is no
longer in production as of late 2016.
The Raspberry Pi 3 Model B uses a Broadcom BCM2837 SoC with a 1.2 GHz
64-bit quad-core ARM Cortex-A53 processor, with 512 KiB shared L2 cache.
The Raspberry Pi 3 Model A+ and B+ run at 1.4 GHz.
The Raspberry Pi 4 uses a Broadcom BCM2711 SoC with a 1.5 GHz 64-bit
quad-core ARM Cortex-A72 processor, with 1 MiB shared L2
cache. Unlike previous models, which all used a custom
interrupt controller poorly suited for virtualisation, the interrupt
controller on this SoC is compatible with the ARM Generic Interrupt
Controller (GIC) architecture 2.0, providing hardware support for
interrupt distribution when using ARM virtualisation capabilities.
The Raspberry Pi Zero and Zero W use the same Broadcom BCM2835
SoC as the first generation Raspberry Pi, although now running at 1 GHz
CPU clock speed.
•ARM CPU/GPU -- This is a Broadcom BCM2835 System on a Chip (SoC)
that's made up of an ARM central processing unit (CPU) and a VideoCore
IV graphics processing unit (GPU). The CPU handles all the computations
that make a computer work (taking input, doing calculations and
producing output), and the GPU handles graphics output.
•GPIO -- These are exposed general-purpose input/output connection
points that will allow the real hardware hobbyists the opportunity to
tinker.
•RCA -- An RCA jack allows connection of analog TVs and other similar
output devices.
•Audio out -- This is a standard 3.5-millimeter jack for connection of
audio output devices such as headphones or speakers. There is no audio
in.
•LEDs -- Light-emitting diodes, for all of your indicator light needs.
•USB -- This is a common connection port for peripheral devices of all
types (including your mouse and keyboard). Model A has one, and Model
B has two. You can use a USB hub to expand the number of ports or plug
your mouse into your keyboard if it has its own USB port.
•HDMI -- This connector allows you to hook up a high-definition
television or other compatible device using an HDMI cable.
•Power -- This is a 5 V Micro USB power connector into which you can plug
your compatible power supply.
•SD card slot -- This is a full-sized SD card slot. An SD card with an operating
system (OS) installed is required for booting the device. They are available for
purchase from the manufacturers, but you can also download an OS and save
it to the card yourself if you have a Linux machine and the wherewithal.
•Ethernet -- This connector allows for wired network access and is only
available on the Model B.
Many of the features that are missing, such as WiFi and audio in, can be
added using the USB port(s) or a USB hub as needed.
The Broadcom BCM2835 is a System on Chip (SoC) with multimedia
capabilities, typically used in mobile phones and portable
devices. It is a highly competitive industry and nearly all the big names
that manufacture SoC chips keep their designs a secret.
This IC provides full HD video and stereo audio. You would expect to
find it in digital camcorders, digital cameras, mobile phones, and games
boxes, among many other applications. It has a VideoCore IV®
Multimedia Co-Processor, an ARM1176JZF-S application processor, and
a high-performance OpenGL-ES® 1.1/2.0 GPU.
It is interesting to note that this chip also has an advanced Image Sensor
Pipeline (ISP) for connecting a 20 MP Camera. It can transfer data at the
rate of 220 MP/s. There is also an LCD MIPI type interface for driving a
high-resolution LCD panel at 1080p resolutions.
CPU Architecture
• The ARM1176JZ-S processor incorporates an integer core that
implements the ARM11 ARM architecture v6. It supports the ARM
and Thumb instruction sets, Jazelle technology to enable direct
execution of Java bytecodes, and a range of SIMD DSP instructions
that operate on 16-bit or 8-bit data values in 32-bit registers.
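To illustrate what "SIMD on 16-bit or 8-bit values in 32-bit registers" means, the sketch below emulates a packed add in plain Python: the 32-bit word is treated as independent lanes, and carries do not cross lane boundaries. The function name packed_add is hypothetical and the code models only the arithmetic, not any specific instruction encoding.

```python
def packed_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 32-bit words lane by lane (lanes of 8 or 16 bits).

    Each lane wraps around on overflow; no carry leaks into the next lane.
    """
    if lane_bits not in (8, 16):
        raise ValueError("lane_bits must be 8 or 16")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result
```

A single call therefore performs four 8-bit additions or two 16-bit additions at once, which is the source of the speed-up for audio and image data.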
ARM1176JZF-S (ARM11 family) has the following features:
ARMv6 architecture
Single core
32-bit RISC
700 MHz clock rate
8 pipeline stages
Branch prediction
Low interrupt latency configuration
Internal coprocessors CP14 and CP15
External coprocessor interface
Instruction and data MMU
64-bit interface to both caches
TCM, which can be used as local RAM with DMA
JTAG-based debug
Vector floating point
Provision for Intelligent Energy Management (IEM™)
High-speed Advanced Microprocessor Bus Architecture
(AMBA) Advanced Extensible Interface (AXI) level two
interfaces supporting prioritized multiprocessor
implementations
• an integer core with integral Embedded ICE-RT logic
• Instruction and data caches, including a non-blocking data cache with
Hit-Under-Miss (HUM)
• virtually indexed and physically addressed caches
• 64-bit interface to both caches
• level one Tightly-Coupled Memory (TCM) that you can use as a local
RAM with DMA
• external coprocessor support
• trace support
Software execution
Bytecodes that are too complex to execute directly in hardware are executed
in software. An ARM register is used to access a table of exception handlers to
handle these particular bytecodes.
1.Integer core
The ARM1176JZ-S processor is built around the ARM11 integer core. It is an
implementation of the ARMv6 architecture and runs the ARM, Thumb, and
Java instruction sets. The processor contains Embedded ICE-RT logic and a JTAG
debug interface to enable hardware debuggers to communicate with the
processor. The following sections describe the core in more detail:
•Instruction set categories
•Conditional execution
•Registers
•Modes and exceptions
•Thumb instruction set
•DSP instructions
•Media extensions
•Datapath
•Branch prediction
•Return stack
All sections of the integer core are discussed below.
i.Instruction set categories
The main instruction set categories are:
•branch instructions
•data processing instructions
•status register transfer instructions
•load and store instructions
•coprocessor instructions
•exception-generating instructions.
Note
Only load, store, and swap instructions can access data from memory.
ii.Conditional execution
The processor conditionally executes nearly all ARM instructions. You can
decide if the condition code flags, Negative, Zero, Carry, and Overflow, are
updated according to their result.
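The interaction between the condition code flags and conditional execution can be sketched as follows. The function adds models a 32-bit add that updates N, Z, C and V, and condition_passed checks a few condition codes against those flags. Both names are hypothetical, and only a subset of the condition codes is shown.

```python
MASK32 = 0xFFFFFFFF


def adds(a: int, b: int):
    """32-bit add that also returns the N, Z, C, V flags it would set."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "N": bool(result >> 31),   # result is negative as a signed value
        "Z": result == 0,          # result is zero
        "C": full > MASK32,        # unsigned carry out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        "V": bool(((a ^ result) & (b ^ result)) >> 31),
    }
    return result, flags


# A few condition codes expressed as tests on the flags.
CONDITIONS = {
    "EQ": lambda f: f["Z"],
    "NE": lambda f: not f["Z"],
    "MI": lambda f: f["N"],
    "PL": lambda f: not f["N"],
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
}


def condition_passed(cond: str, flags: dict) -> bool:
    """Return True when an instruction with this condition would execute."""
    return CONDITIONS[cond](flags)
```

An instruction carrying the EQ condition, for instance, executes only when the most recent flag-setting instruction produced zero.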
iii.Registers
The ARM1176JZ-S core contains:
•33 general-purpose 32-bit registers
•7 dedicated 32-bit registers.
Note
At any one time, 16 general-purpose registers are visible. The remainder
are banked registers used to speed up exception processing.
iv.Modes and exceptions
The core provides a set of operating and exception modes, to support
systems combining complex operating systems, user applications, and
real-time demands. There are eight operating modes, six of which are
exception processing modes:
•User
•Supervisor
•fast interrupt
•normal interrupt
•Abort
•System
•Undefined
•Secure Monitor.
v.Thumb instruction set
The Thumb instruction set contains a subset of the most commonly-used 32-
bit ARM instructions encoded into 16-bit wide opcodes. This reduces the
amount of memory required for instruction storage.
vi.DSP instructions
The DSP extensions to the ARM instruction set provide:
•16-bit data operations
•saturating arithmetic
•MAC operations.
The processor executes multiply instructions using a single-cycle 32x16
implementation. The processor can perform 32x32, 32x16, and 16x16
multiply instructions (MAC).
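Saturating arithmetic and multiply-accumulate can be illustrated in plain Python. The sketch below clamps a 32-bit signed add to the representable range instead of wrapping, and shows a 16x16 multiply accumulated into a wrapping 32-bit value. The function names are hypothetical and the code models the arithmetic only, not instruction timing.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def saturating_add32(a: int, b: int) -> int:
    """Signed 32-bit add that clamps to the limits instead of wrapping."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def wrap32(value: int) -> int:
    """Wrap an integer into the signed 32-bit range (ordinary overflow)."""
    return ((value + 2**31) % 2**32) - 2**31


def mac16(acc: int, x: int, y: int) -> int:
    """Multiply two signed 16-bit values and accumulate into 32 bits."""
    for operand in (x, y):
        if not -2**15 <= operand <= 2**15 - 1:
            raise ValueError("operands must fit in signed 16 bits")
    return wrap32(acc + x * y)
```

Saturation matters in signal processing because a clipped sample is usually far less objectionable than one that wraps from the largest positive value to the most negative one.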
vii.Media extensions
The ARMv6 instruction set provides media instructions to complement the DSP
instructions. There are four media instruction groups:
3)Prefetch unit
The prefetch unit fetches instructions from the instruction cache,
Instruction TCM, or from external memory and predicts the outcome
of branches in the instruction stream.
Branch prediction
The core uses both static and dynamic branch prediction. All
branches are predicted where the target address is an immediate
address, or fixed-offset PC-relative address.
The first level of branch prediction is dynamic, through a 128-
entry Branch Target Address Cache (BTAC). If the PC of a branch
matches an entry in the BTAC, the processor uses the branch history
and the target address to fetch the new instruction stream.
The processor might remove dynamically predicted branches from
the instruction stream, and might execute such branches in zero
cycles.
If the address mappings are changed, the BTAC must be flushed. A
BTAC flush instruction is provided in the CP15 coprocessor.
The processor uses static branch prediction to manage branches not
matched in the BTAC. The static branch predictor makes a prediction
based on the direction of the branches.
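The two-level scheme described above can be sketched as a small simulation. The 128-entry size comes from the text. The 2-bit history counter, the oldest-first eviction, the allocate-on-taken policy and the "backward taken, forward not taken" static rule are assumptions made for illustration, and the class name BranchPredictor is hypothetical.

```python
from collections import OrderedDict

BTAC_ENTRIES = 128  # size stated in the text


class BranchPredictor:
    def __init__(self):
        # Maps branch PC -> [target address, 2-bit history counter].
        self.btac = OrderedDict()

    def predict(self, pc: int, target: int) -> bool:
        """Return True when the branch is predicted taken."""
        entry = self.btac.get(pc)
        if entry is not None:
            return entry[1] >= 2        # dynamic prediction from history
        return target < pc              # static: backward taken (assumed)

    def update(self, pc: int, target: int, taken: bool) -> None:
        """Record the actual outcome of a branch."""
        entry = self.btac.get(pc)
        if entry is None:
            if not taken:
                return                  # assumption: allocate taken branches only
            if len(self.btac) >= BTAC_ENTRIES:
                self.btac.popitem(last=False)   # evict oldest (assumed policy)
            self.btac[pc] = [target, 2]
            return
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)

    def flush(self) -> None:
        """Model of the BTAC flush needed when address mappings change."""
        self.btac.clear()
```

A loop-closing branch is a backward branch, so the static rule predicts it taken the first time it is seen, and the BTAC takes over on later iterations.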
4. Memory system
The level-one memory system provides the core with:
• separate instruction and data caches
• separate instruction and data Tightly-Coupled Memories
• 64-bit datapaths throughout the memory system
• virtually indexed, physically tagged caches
• memory access controls and virtual memory management
• support for four sizes of memory page
• two-channel DMA into TCMs
• I-fetch, D-read/write interface, compatible with multi-layer AMBA AXI
• 32-bit dedicated peripheral interface
• export of memory attributes for second-level memory system.
The following sections describe the memory system in more detail:
ii)ETM interface
You can connect an external Embedded Trace Macrocell (ETM) unit to the
processor for real-time code tracing of the core in an embedded system.
The ETM interface collects various processor signals and drives these
signals from the core. The interface is unidirectional and runs at the full
speed of the core. The ETM interface connects directly to the external ETM
unit without any additional glue logic. You can disable the ETM interface for
power saving.
iii)ETM trace buffer:
You can extend the functionality of the ETM by adding an on-chip trace
buffer. The trace buffer is an on-chip memory area. The trace buffer
stores trace information during capture that otherwise passes
immediately through the trace port at the operating frequency of the
core. When capture is complete the stored information can be read out
at a reduced clock rate from the trace buffer using the JTAG port of the
SoC, instead of through a dedicated trace port. This is a two-step process
that saves you from implementing a wide trace port that has many high-speed
device pins. In effect, a zero-pin trace port is created where the device
already has a JTAG port and associated pins.
iv)Software access to trace buffer:
You can access buffered trace information through an APB slave-based
memory-mapped peripheral included as part of the trace buffer. You can
perform internal diagnostics on a closed system where a JTAG port is not
normally brought out.
1.Halting debug-mode:
On a debug event, such as a breakpoint or watchpoint, the debug logic stops
the core and forces the core into Debug state. This enables you to examine the
internal state of the core, and the external state of the system, independently
from other system activity. When the debugging process completes, the core
and system state is restored, and normal program execution resumes.
2.Monitor debug-mode
On a debug event, the core generates a debug exception instead of
entering Debug state, as in Halting debug-mode. The exception entry
activates a debug monitor program that performs critical interrupt service
routines to debug the processor. The debug monitor program
communicates with the debug host over the DCC.
vi)Debug and trace Environment
Several external hardware and software tools are available for you to
enable:
• real-time debugging using the EmbeddedICE-RT logic
• execution trace using the ETM.
VFP11 - coprocessor
• Except for divide and square root operations, the pipelines support
single-cycle throughput for all single-precision operations and most
double-precision operations.
• Double-precision multiply and multiply and accumulate operations
have a two-cycle throughput.
Flush-to-zero mode
A flush-to-zero mode is provided where a default treatment of de-norms is
applied. Table 1-3 lists the default behavior in flush-to-zero mode.
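The general idea of flush-to-zero can be sketched for single-precision values: any non-zero value smaller in magnitude than the smallest normal number is replaced by zero. This is an illustration of the concept only. The function name is hypothetical, and returning positive zero regardless of sign is an assumption, not a statement of the coprocessor's exact behavior.

```python
MIN_NORMAL_SINGLE = 2.0 ** -126  # smallest normal single-precision magnitude


def flush_to_zero(x: float) -> float:
    """Replace single-precision denormal magnitudes with zero."""
    if x != 0.0 and abs(x) < MIN_NORMAL_SINGLE:
        return 0.0   # assumption: flushed values become +0.0
    return x         # normal numbers, zeros, infinities and NaN pass through
```

The trade-off is a small loss of accuracy near zero in exchange for avoiding the slower handling that denormal values otherwise require.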
CPU Pipeline Stages
Broadly, the pipeline consists of three kinds of stage:
• Fetch stages
• Decode stage
• Execute stages
Pipeline stages
Figure shows:
• the two Fetch stages
• a Decode stage
• an Issue stage
• the four stages of the ARM1176JZ-S integer execution pipeline.
From the figure, the pipeline operations are:
Fe1 First stage of instruction fetch, where the address is issued to memory and
data returns from memory.
WBex Write back of data from the multiply or main execution pipelines.
MAC1 First stage of the multiply-accumulate pipeline.
Figure shows a typical multiply operation. The MUL instruction can loop in the
MAC1 stage until it has passed through the first part of the multiplier array
enough times. The MUL instruction progresses to MAC2 and MAC3 where it
passes through the second half of the array once to produce the final result.
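A rough way to see the benefit of the eight-stage pipeline (two Fetch stages, Decode, Issue and four execution stages) is to count cycles in the ideal case. The sketch assumes one instruction enters the pipeline per cycle with no stalls, so it ignores cache misses, mispredicted branches and multiplies that loop in MAC1. The function names are hypothetical.

```python
def pipelined_cycles(n_instructions: int, n_stages: int = 8) -> int:
    """Ideal cycle count: fill the pipeline once, then finish one per cycle."""
    if n_instructions <= 0:
        return 0
    return n_stages + n_instructions - 1


def unpipelined_cycles(n_instructions: int, n_stages: int = 8) -> int:
    """Cycle count if each instruction had to finish before the next starts."""
    return max(0, n_instructions) * n_stages
```

Under these assumptions, 100 instructions take 107 cycles when pipelined compared with 800 when executed one at a time, so throughput approaches one instruction per cycle as the instruction count grows.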
Instruction progression
Figure shows an LDR/STR operation that hits in the data cache.
Fig. shows the progression of an LDM/STM operation that completes by use of the load/store pipeline. Other instructions
can use the ALU pipeline at the same time as the LDM/STM completes in the load/store pipeline.
Software Pipelining
Software pipelining (also known as loop pipelining and loop
folding) is a technique that overlaps loop iterations (i.e.,
subsequent iterations start before previous ones have finished). This
technique can increase performance but may also
increase register pressure (not a major problem in
reconfigurable array architectures with pipeline stages). One
of the most widely used software pipelining techniques is
iterative modulo scheduling.
Optimizing compilers typically include software pipelining as part of
their set of optimizations. This technique is mostly applied at
the intermediate representation (IR) level of a program, but
can also be applied at the source code level (in which
case it is considered a code transformation technique).
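Software pipelining applied at the source level can be sketched with a loop whose body has three steps: load, multiply and store. In the pipelined version each pass of the kernel stores the result of iteration i, computes iteration i+1 and loads iteration i+2, with a prologue to fill the pipeline and an epilogue to drain it. Python gains no speed from this; the example only shows the shape of the transformation, and the function names are hypothetical.

```python
def plain_loop(src, k):
    """Reference loop: each iteration loads, multiplies, then stores."""
    out = []
    for x in src:
        loaded = x             # step 1: load
        product = loaded * k   # step 2: compute
        out.append(product)    # step 3: store
    return out


def software_pipelined_loop(src, k):
    """Same result, with the three steps of different iterations overlapped."""
    n = len(src)
    if n == 0:
        return []
    if n == 1:
        return [src[0] * k]
    out = []
    # Prologue: start iterations 0 and 1.
    loaded = src[0]
    product = loaded * k
    loaded = src[1]
    # Kernel: store iteration i, compute i+1, load i+2.
    for i in range(n - 2):
        out.append(product)
        product = loaded * k
        loaded = src[i + 2]
    # Epilogue: drain the last two iterations.
    out.append(product)
    out.append(loaded * k)
    return out
```

The extra variables that are live across kernel iterations (loaded and product here) are a small-scale picture of the increased register pressure mentioned above.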
Cache organization
Each cache is implemented as a four-way set associative cache of
configurable size. The caches are virtually indexed and physically tagged. You
can configure the cache sizes in the range of 4 to 64KB. Both the Instruction
Cache and the Data Cache can provide two words per cycle for all requesting
sources.
Each cache way is architecturally limited to 16KB in size, because of the
limitations of the virtually indexed, physically tagged implementation. The
number of cache ways is fixed at four, but the cache way size can vary
between 1KB and 16KB in powers of 2. The line length is not configurable
and is fixed at eight words per line.
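The numbers above (four ways, eight 32-bit words per line, way size between 1KB and 16KB) determine how an address splits into tag, set index and line offset. The sketch below works this out. The function names are hypothetical, and for simplicity a single address is used for both index and tag, whereas the real cache indexes with the virtual address and tags with the physical address.

```python
WORD_BYTES = 4
LINE_BYTES = 8 * WORD_BYTES   # eight words per line = 32 bytes
WAYS = 4
OFFSET_BITS = 5               # log2(32)


def cache_geometry(cache_bytes: int):
    """Return (way size in bytes, number of sets, index bits)."""
    way_bytes = cache_bytes // WAYS
    if cache_bytes % WAYS or way_bytes not in (1024, 2048, 4096, 8192, 16384):
        raise ValueError("way size must be a power of 2 from 1KB to 16KB")
    sets = way_bytes // LINE_BYTES
    index_bits = sets.bit_length() - 1
    return way_bytes, sets, index_bits


def split_address(addr: int, cache_bytes: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    _, sets, index_bits = cache_geometry(cache_bytes)
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (sets - 1)
    tag = addr >> (OFFSET_BITS + index_bits)
    return tag, index, offset
```

A 16KB cache, for example, has 4KB ways and 128 sets, so 7 address bits select the set and 5 bits select the byte within the line.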
Write operations must occur after the Tag RAM reads and
associated address comparisons are complete. A three-entry
Write Buffer is included in the cache to enable the written
words to be held until they can be written to cache. One or
two words can be written in a single store operation. The
addresses of these outstanding writes provide an additional
input to the Tag RAM comparison for reads.
To avoid a critical path from the Tag RAM comparison to the
enable signals for the data RAMs, there is a minimum of one
cycle of latency between the determination of a hit to a
particular way, and the start of writing to the data RAM of
that way. This requires the Data Cache Write Buffer to hold
three entries, for back-to-back writes. Accesses that read the
dirty bits must also check the Data Cache Write Buffer for pending writes.
The cache dirty bits for the Data Cache are updated when the Data Cache
Write Buffer data is written to the RAM. This requires the dirty bits to be
held as a separate storage array. The Tag arrays cannot be written at that
point, because they are not accessed during the data RAM writes. The
separate array also permits the dirty bits to be implemented as a small RAM.
The other main operations performed by the cache are cache line refills and
Write-Back. These occur to particular cache ways, that are determined at the
point of the detection of the cache miss by the victim selection logic.
To reduce overall power consumption, the number of full cache reads is
reduced by the sequential nature of many cache operations, especially on the
instruction side. On a cache read that is sequential to the previous cache read,
only the data RAM set that was previously read is accessed, if the read is within
the same cache line. The Tag RAM is not accessed at all during this sequential
operation.
To reduce unnecessary power consumption additionally, only the addressed
words within a cache line are read at any time. With the required 64-bit read
interface, this is achieved by disabling half of the RAMs on occasions when only
a 32-bit value is required. The implementation uses two 32-bit wide RAMs to
implement the cache data RAM shown in the level one cache block diagram, with the words of each line folded
into the RAMs on an odd and even basis. This means that cache refills can take
several cycles, depending on the cache line lengths. The cache line length is
eight words.
The control of the level one memory system and the associated
functionality, together with other system wide control attributes are
handled through the system control coprocessor, CP15.
System Control Coprocessor describes this.
Level one cache block diagram
Features of the cache system
The level one cache system has the following features:
•The cache is a Harvard implementation.
•The caches are lockable at a granularity of a cache way, using Format C
lockdown. See Cache control and configuration.
•Cache replacement policies are Pseudo-Random or Round-Robin, as
controlled by the RR bit in CP15 register c1. Round-Robin uses a single counter
for all sets, that selects the way used for replacement.
•Cache line allocation uses the cache replacement algorithm when all cache
lines are valid. If one or more lines is invalid, then the invalid cache line with
the lowest way number is allocated to in preference to replacing a valid cache
line. This mechanism does not allocate to locked cache ways unless all cache
ways are locked. See Cache miss handling when all ways are locked down.
•Data cache misses are nonblocking with three outstanding Data Cache
misses being supported.
•Streaming of sequential data from LDM and LDRD operations, and for
sequential instruction fetches is supported.
•Cache lines can contain either Secure or Non-secure data and the NS Tag, that
the MicroTLB provides, indicates when the cache line comes from Secure or
Non-secure memory.
•Cache lines can be either Write-Back or Write-Through, determined by the
MicroTLB entry.
•Only read allocation is supported.
•The cache can be disabled independently from the TCM, under control of the
appropriate bits in CP15 c1. The cache can be disabled in Secure state while
enabled in Non-secure state and enabled in Secure state while disabled in
Non-secure state.
• The CL bit in the system control coprocessor, see
c1, Non-Secure Access Control Register, reserves cache lockdown registers for
Secure world operation. When the CL bit is 0 the cache lockdown registers are
only available in the Secure world. When the CL bit is 1 they are available for
both Secure and Non-secure operation.
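The allocation rules in the list above (prefer the lowest-numbered invalid way, otherwise use Round-Robin with a single shared counter, and avoid locked ways unless every way is locked) can be sketched as a small model. The class name VictimSelector is hypothetical. How the counter skips locked ways, and the fallback to all ways when everything is locked, are assumptions made for illustration.

```python
WAYS = 4


class VictimSelector:
    def __init__(self):
        # Single round-robin counter shared by all sets, as the text states.
        self.counter = 0

    def choose(self, valid, locked) -> int:
        """Pick the way to allocate, given per-way valid and locked flags."""
        unlocked = [w for w in range(WAYS) if not locked[w]]
        # Assumption: if every way is locked, consider all ways.
        candidates = unlocked if unlocked else list(range(WAYS))
        # Prefer the invalid line with the lowest way number.
        for way in candidates:
            if not valid[way]:
                return way
        # Otherwise replace a valid line, round-robin over the candidates.
        while self.counter not in candidates:
            self.counter = (self.counter + 1) % WAYS
        victim = self.counter
        self.counter = (self.counter + 1) % WAYS
        return victim
```

Locking a way therefore keeps its contents resident, for example a time-critical interrupt handler, at the cost of reducing the effective associativity for everything else.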
Branch folding
Branch folding is a technique where, on the prediction of most branches,
the branch instruction is completely removed from the instruction stream
presented to the execution pipeline. Branch folding can significantly
improve the performance of branches, taking the CPI for branches
significantly lower than 1.
Branch folding only operates in ARM and Thumb states.
Branch folding is done for all dynamically predicted branches, except that
branch folding is not done for:
•BL and BLX instructions, to avoid losing the link
•predicted branches onto branches
•branches that are breakpointed or have generated an abort when
fetched.