Slide 1
(CS F342)
Motivation and Introduction
Automatic, Single & General Purpose Computing
Lab :
Lab. Test 15 Lab : TBA
Reference Books:
(R1) Digital Design: With an Introduction to the Verilog HDL by M. Morris Mano &
Michael D. Ciletti
(R2) Verilog HDL: A Guide to Digital Design and Synthesis by Samir Palnitkar.
(R3) Computer Organisation & Architecture: Designing for performance by William
Stallings.
We shall answer the following questions
1. What’s computation?
2. What’s uncomputable?
3. What’s automatic computation/automation?
4. What’s a single-purpose microprocessor?
5. What’s a general-purpose microprocessor?
6. How does a laptop solve the problems?
7. Past & Future of Microprocessors
© Kanchan Manna; BITS-Pilani, Goa Campus, India.
What is the meaning of Computable?
• Result: 6
Find the relation between input variables and output variables for compressing the table.

Half adder:
Input   Output
A B     S Cout
0 0     0 0
0 1     1 0
1 0     1 0
1 1     0 1
S = A (XOR) B
Cout = A (AND) B

Full adder:
Input       Output
Cin A B     S Cout
0   0 0     0 0
0   0 1     1 0
0   1 0     1 0
0   1 1     0 1
1   0 0     1 0
1   0 1     0 1
1   1 0     0 1
1   1 1     1 1
S = A (XOR) B (XOR) Cin
Cout = A (AND) B + A (AND) Cin + B (AND) Cin
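The compressed relations can be checked against ordinary addition. The following is a minimal sketch in C; the function names (half_adder, full_adder, full_adder_matches_arithmetic) are illustrative and not part of the slides.

```c
#include <assert.h>

/* One-bit half adder: S = A XOR B, Cout = A AND B. */
static void half_adder(unsigned a, unsigned b, unsigned *s, unsigned *cout) {
    assert(a <= 1 && b <= 1);
    *s = a ^ b;
    *cout = a & b;
}

/* One-bit full adder: S = A XOR B XOR Cin,
   Cout = (A AND B) OR (A AND Cin) OR (B AND Cin). */
static void full_adder(unsigned a, unsigned b, unsigned cin,
                       unsigned *s, unsigned *cout) {
    assert(a <= 1 && b <= 1 && cin <= 1);
    *s = a ^ b ^ cin;
    *cout = (a & b) | (a & cin) | (b & cin);
}

/* Returns 1 if the two equations agree with a + b + cin on all 8 rows. */
static int full_adder_matches_arithmetic(void) {
    for (unsigned row = 0; row < 8; row++) {
        unsigned cin = (row >> 2) & 1, a = (row >> 1) & 1, b = row & 1;
        unsigned s, cout;
        full_adder(a, b, cin, &s, &cout);
        if (2 * cout + s != a + b + cin) return 0;
    }
    return 1;
}
```

Calling full_adder_matches_arithmetic() walks the same eight rows as the truth table, so the two equations replace the table entirely.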
• Datapath
• Controller
Basic Elements:
Mapping of High-level Construct onto Digital
Construct
For example,
The C language provides datatypes to represent and store data.
Register:
32-bit register
31 1 0
A latch or flip-flop
• RS
• D
• T
Fig: D-ff
Fig: D-Latch
https://circuitfever.com/d-flip-flop-in-verilog
How to map Algorithm’s elements onto Architectural elements
if (sel)
a = 10;
else
a = 5;
Multiplexer
if (sel)
a = 10;
else
b = 5;
Decoder
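The two if/else fragments above differ in what the select line controls: in the first, one register receives one of two values (a multiplexer on the data); in the second, one of two registers is written (a decoder on the write enables). A minimal sketch in C, with hypothetical helper names mux2 and decoder1to2:

```c
#include <assert.h>

/* 2-to-1 multiplexer: the select line picks which input reaches the output. */
static int mux2(int sel, int in0, int in1) {
    return sel ? in1 : in0;
}

/* 1-to-2 decoder: exactly one of the two enable outputs is 1. */
static void decoder1to2(int sel, int *en0, int *en1) {
    *en0 = !sel;
    *en1 = !!sel;
}

/* if (sel) a = 10; else a = 5;
   One destination register, data chosen by a MUX. */
static int assign_one_target(int sel) {
    return mux2(sel, 5, 10);
}

/* if (sel) a = 10; else b = 5;
   Two destination registers, write enable chosen by a decoder. */
static void assign_two_targets(int sel, int *a, int *b) {
    int en_a, en_b;
    decoder1to2(sel, &en_b, &en_a);  /* output 1 enables a, output 0 enables b */
    if (en_a) *a = 10;
    if (en_b) *b = 5;
}
```

In hardware both registers exist at all times; the decoder only decides which one latches its input on the clock edge.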
} }
Mem Multiplexer
Decoder
How to map Algorithm’s elements onto Architectural elements
int i;
Counter/Register with Load (Max. Value), Decrement and CLK inputs

int i, k;
for (i=10; i>0; i--){
for (k=20; k>0; k--){
}
}
Fig: Counter (i) and Counter (k), each with Load (Max. Value), Decrement and CLK inputs and a Zero output; Counter (i) is loaded with Max. Value (10)
Zero = 1 if the content of the counter is 0.
Stop = 1 if the values of both counters are 0.
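The counter mapping can be simulated clock by clock. This is a sketch under an assumed control rule that the slides do not spell out: counter k decrements every clock, and when it reaches zero, counter i decrements and k is reloaded unless i has also reached zero. The type and function names are illustrative.

```c
#include <assert.h>

/* A down counter with Load, Decrement and a Zero output. */
typedef struct { int value, max; } DownCounter;

static void load(DownCounter *c)       { c->value = c->max; }
static void decrement(DownCounter *c)  { assert(c->value > 0); c->value--; }
static int  zero(const DownCounter *c) { return c->value == 0; }

/* Simulates for (i = max_i; i > 0; i--) for (k = max_k; k > 0; k--) { body }
   and returns the number of clock ticks in which the loop body runs. */
static int run_nested_counters(int max_i, int max_k) {
    DownCounter i = {0, max_i}, k = {0, max_k};
    int ticks = 0;
    if (max_i <= 0 || max_k <= 0) return 0;
    load(&i);
    load(&k);
    while (!(zero(&i) && zero(&k))) {   /* Stop = Zero(i) AND Zero(k) */
        ticks++;                        /* one execution of the loop body */
        decrement(&k);
        if (zero(&k)) {
            decrement(&i);
            if (!zero(&i)) load(&k);    /* start the next outer iteration */
        }
    }
    return ticks;
}
```

With the values from the slide (10 and 20) the body runs 10 × 20 = 200 times before Stop becomes 1.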
Comparison
Fig: MinMax Processor (PC/i counter addressing MEM; comparators < and > against MaxReg and MinReg; MUXes driven by LoadMin; Controller; MinReg initialised to ∞; all registers clocked by CLK)
Find maximum Time from all possible Time values
MinMax Processor
Necessity of General-purpose processor
• Is there an Algorithm which will execute or simulate other Algorithms?
• The processor executes any algorithm
• Programmable
• Turing Model
• Is there any limitation of such an Algorithm?
• Halting problem: Can we have an Algorithm which takes another Algorithm as input and decides, in general, whether the given Algorithm will halt/stop or not?
• Suppose [*] such an Algo. A(P, D) exists, where P is a program and D is its input. Define another Algo. B(X): loop forever if A(X, X) = “Halt”, else halt. Now run B(B): if A answers “Halt”, B(B) loops forever; if A answers “Loop”, B(B) halts. A is wrong either way, so A(P, D) doesn’t exist.
• Used Self-referential structure
• What kind of Algorithm do we need for making the processor general purpose
or programmable?
• Fetch-and-Execute Algorithm
• Stored program (?) [*]
• Generalized Datapaths
• Generalized Functional Unit
• Proposed by John von Neumann [*]
Fig: Fetch-and-Execute Processor (Controller, MEM, and a Datapath whose ALU/FU supports all possible operations)
[*] A URL is embedded.
Fetch-and-Execute Algorithm
• What is to be fetched?
• Program/instructions (Birth of the program or software)
Number representation
• Representation of instruction
• Instruction format
MIPS: Microprocessor without Interlocked Pipelined Stages
Fetch-and-Execute Algorithm
LW 100011
SW 101011
BEQ 000100
R-type 000000
addi 001000
j 000010
Shift instructions in MIPS
Fetch-and-Execute Algorithm
• Bit wise and (&) and shift operations (<< and >>)
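Field extraction with a mask and a shift can be sketched as follows. The mask names and accessor functions are illustrative; the masks are written in hexadecimal because binary literals (0b...) are a compiler extension before C23. The sample word 0x012A4020 is the standard encoding of add $t0, $t1, $t2.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks of a 32-bit MIPS instruction word. */
#define OPCODE_MASK 0xFC000000u  /* bits 31-26 */
#define RS_MASK     0x03E00000u  /* bits 25-21 */
#define RT_MASK     0x001F0000u  /* bits 20-16 */
#define RD_MASK     0x0000F800u  /* bits 15-11 */
#define SHAMT_MASK  0x000007C0u  /* bits 10-6  */
#define FUNCT_MASK  0x0000003Fu  /* bits 5-0   */
#define OFFSET_MASK 0x0000FFFFu  /* bits 15-0  */

/* Each accessor isolates a field with & and right-aligns it with >>. */
static uint32_t opcode_of(uint32_t instr) { return (instr & OPCODE_MASK) >> 26; }
static uint32_t rs_of(uint32_t instr)     { return (instr & RS_MASK) >> 21; }
static uint32_t rt_of(uint32_t instr)     { return (instr & RT_MASK) >> 16; }
static uint32_t rd_of(uint32_t instr)     { return (instr & RD_MASK) >> 11; }
static uint32_t shamt_of(uint32_t instr)  { return (instr & SHAMT_MASK) >> 6; }
static uint32_t funct_of(uint32_t instr)  { return instr & FUNCT_MASK; }
static uint32_t offset_of(uint32_t instr) { return instr & OFFSET_MASK; }
```

For 0x012A4020 the accessors yield opcode 0, rs 9 ($t1), rt 10 ($t2), rd 8 ($t0), shamt 0 and funct 0x20 (add).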
MIPS Processor/Generalized components
MIPS Microprocessor (32 bit)
#define OPCODE 0b11111100000000000000000000000000
#define RS     0b00000011111000000000000000000000
#define RT     0b00000000000111110000000000000000
#define DST    0b00000000000000001111100000000000 //RD
#define SHIFT  0b00000000000000000000011111000000
#define OFFSET 0b00000000000000001111111111111111
ALU(Src1, Src2){
  switch (ALUControl){
    case B-type: ZERO = (Src1 - Src2) == 0 ? 1 : 0; break;
    case ADD: return (Src1 + Src2); break;
    …
  }
}

int PC, IMM[1024], DMM[1024], RF[32], ALUControl; bool ZERO;
Load(IMM, DMM);
Set PC with the address of the 1st instruction, which is stored in IMM;
//IMM is addressed by the byte address PC here; a word-indexed C array would use IMM[PC >> 2]
while (1){
  switch ((IMM[PC] & OPCODE) >> 26){
    case R-type:
      Set ALUControl; //ADD, SUB, AND, OR, etc.
      RF[(IMM[PC] & DST) >> 11] = ALU(RF[(IMM[PC] & RS) >> 21], RF[(IMM[PC] & RT) >> 16]); PC = PC + 4; break;
    case SW-type: //Need conversion of the 16-bit offset to 32 bits
      Set ALUControl = 0b0010;
      DMM[ALU(RF[(IMM[PC] & RS) >> 21], IMM[PC] & OFFSET)] = RF[(IMM[PC] & RT) >> 16]; PC = PC + 4; break;
    case LW-type: //Need conversion of the 16-bit offset to 32 bits
      Set ALUControl = 0b0010;
      RF[(IMM[PC] & RT) >> 16] = DMM[ALU(RF[(IMM[PC] & RS) >> 21], IMM[PC] & OFFSET)]; PC = PC + 4; break;
    case B-type:
      Set ALUControl = 0b0110;
      ALU(RF[(IMM[PC] & RS) >> 21], RF[(IMM[PC] & RT) >> 16]);
      IF (ZERO == 1) PC = (PC + 4) + ((IMM[PC] & OFFSET) << 2); ELSE PC = PC + 4; break;
  }
}
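The fetch-and-execute loop can be made runnable with a few assumptions that are not in the slides: memories are word-indexed arrays of 64 entries, the opcodes are the standard MIPS encodings (lw = 0x23, sw = 0x2B), only add and sub are decoded for R-type, and the loop stops once the PC leaves the loaded program (a hypothetical halt rule). It is a sketch of the idea, not the course's reference model.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_RTYPE = 0x00, OP_J = 0x02, OP_BEQ = 0x04,
       OP_ADDI = 0x08, OP_LW = 0x23, OP_SW = 0x2B };
enum { FUNCT_ADD = 0x20, FUNCT_SUB = 0x22 };

typedef struct {
    uint32_t pc;       /* byte address of the current instruction */
    uint32_t imm[64];  /* instruction memory, one word per entry */
    int32_t  dmm[64];  /* data memory, one word per entry */
    int32_t  rf[32];   /* register file; rf[0] is kept at 0 */
} Cpu;

/* Converts the 16-bit offset/immediate field to a signed 32-bit value. */
static int32_t sign_extend16(uint32_t x) {
    return (int32_t)((x & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Fetches and executes until the PC leaves the first n_instr words. */
static void run(Cpu *c, uint32_t n_instr) {
    assert(n_instr <= 64);
    while (c->pc / 4 < n_instr) {
        uint32_t ir = c->imm[c->pc / 4];                      /* fetch */
        uint32_t op = ir >> 26, rs = (ir >> 21) & 31;
        uint32_t rt = (ir >> 16) & 31, rd = (ir >> 11) & 31;  /* decode */
        int32_t off = sign_extend16(ir);
        uint32_t next = c->pc + 4;
        switch (op) {                                         /* execute */
        case OP_RTYPE:
            if ((ir & 0x3F) == FUNCT_ADD) c->rf[rd] = c->rf[rs] + c->rf[rt];
            else if ((ir & 0x3F) == FUNCT_SUB) c->rf[rd] = c->rf[rs] - c->rf[rt];
            break;
        case OP_ADDI: c->rf[rt] = c->rf[rs] + off; break;
        case OP_LW: c->rf[rt] = c->dmm[((uint32_t)(c->rf[rs] + off) / 4) % 64]; break;
        case OP_SW: c->dmm[((uint32_t)(c->rf[rs] + off) / 4) % 64] = c->rf[rt]; break;
        case OP_BEQ:
            if (c->rf[rs] == c->rf[rt]) next += (uint32_t)off << 2;
            break;
        case OP_J: next = (next & 0xF0000000u) | ((ir & 0x03FFFFFFu) << 2); break;
        default: break;                     /* unknown opcode: treated as a no-op */
        }
        c->rf[0] = 0;
        c->pc = next;
    }
}
```

A short program such as addi $1,$0,5; addi $2,$0,7; add $3,$1,$2; sw $3,8($0); lw $4,8($0) leaves 12 in registers 3 and 4 and in the data word at byte address 8.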
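Analysis of data-path for Fetch stage
• IMM[PC]; PC = PC + 4
• Why 4?
• Clock Period (T-pc-to-pc): Source Reg is PC, Destination Reg is PC
Fig: Fetch data-path (PC, clocked by CLK, drives the Read address of the Instruction Memory, which outputs the Instruction; an adder computes PC + 4)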
Analysis of data-path for I-type instruction
• LW $S1, offset($S2) //$S1 ← DM[offset + $S2]
• SW $S1, offset($S2) //DM[offset + $S2] ← $S1
op rs rt offset
6 bits (31-26) 5 bits (25-21) 5 bits (20-16) 16 bits (15-0)
Is offset a physical address?
No. It is a relative address (here, relative to the base register $S2).
How does one measure the clock period?
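Because the offset is relative, the memory address used by LW and SW is computed at run time as base register plus sign-extended offset. A minimal sketch in C; effective_address and sign_extend16 are illustrative names, and the instruction words in the usage note are standard MIPS encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Converts the 16-bit offset field (bits 15-0) to a signed 32-bit value. */
static int32_t sign_extend16(uint32_t instr) {
    return (int32_t)((instr & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Effective byte address of LW/SW: base register value + signed offset. */
static uint32_t effective_address(uint32_t base, uint32_t instr) {
    return base + (uint32_t)sign_extend16(instr);
}
```

With a base of 0x1000, lw $4, 8($0) (0x8C040008) addresses 0x1008, while an offset field of 0xFFFC means -4 and addresses 0x0FFC.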
Analysis of data-path for I(B)-type instruction
• BEQ $S1, $S2, offset //Jump by offset no. of instr. when $S1 = $S2
• BNE $S1, $S2, offset //Jump by offset no. of instr. when $S1 != $S2
op rs rt offset
6 bits (31-26) 5 bits (25-21) 5 bits (20-16) 16 bits (15-0)
Fig: Branch data-path (the register file reads registers 25:21 and 20:16; the ALU subtracts them and sets Zero; bits 15:0 are sign-extended, shifted left by 2 bits and added to PC + 4 to give the Branch Address)
• Offset indicates the number of instructions
• Left shift by 2 bits to align to the instruction boundary
• ALUOp = (01)2
How does one measure the clock period?
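The branch address calculation described above can be sketched in C as follows. The function names are illustrative; the comparison stands in for the ALU subtraction that produces the Zero flag.

```c
#include <assert.h>
#include <stdint.h>

/* Converts the 16-bit offset field (bits 15-0) to a signed 32-bit value. */
static int32_t sign_extend16(uint32_t instr) {
    return (int32_t)((instr & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Branch Address = (PC + 4) + (sign-extended offset << 2).
   The offset counts instructions, so it is scaled by 4 bytes. */
static uint32_t branch_target(uint32_t pc, uint32_t instr) {
    uint32_t byte_offset = (uint32_t)sign_extend16(instr) << 2;
    return (pc + 4) + byte_offset;
}

/* Next PC for BEQ: the ALU subtracts the operands and the Zero flag
   selects between the branch address and PC + 4. */
static uint32_t next_pc_beq(uint32_t pc, uint32_t instr,
                            uint32_t rs_val, uint32_t rt_val) {
    int zero = (rs_val - rt_val) == 0;
    return zero ? branch_target(pc, instr) : pc + 4;
}
```

A negative offset branches backwards: with PC = 20 and an offset field of 0xFFFE (-2), the target is 24 - 8 = 16.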
Analysis of data-path I-type instruction
• ADDI $S1, $S2, -12 //$S1 ← $S2 + (-12)
op rs rt Immediate
6 bits (31-26) 5 bits (25-21) 5 bits (20-16) 16 bits (15-0)
How does one measure the clock period?
Analysis of data-path j-type instruction
• J addrs //PC ← {PC[31:28], addrs[27:0]}, where addrs[27:0] = Instruction[25:0] << 2
op address
6 bits (31-26) 26 bits (25-0)
Fig: Jump data-path (Instruction[25:0] is shifted left by 2 bits to give bits 27:0; bits 31:28 are taken from PC + 4)
How does one measure the clock period?
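The jump address is a concatenation rather than an addition. A minimal sketch in C, assuming (as in the figure) that the upper four bits come from PC + 4; jump_target is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Jump target = { (PC + 4)[31:28], Instruction[25:0] << 2 }. */
static uint32_t jump_target(uint32_t pc, uint32_t instr) {
    uint32_t addr26 = instr & 0x03FFFFFFu;         /* 26-bit address field */
    uint32_t upper4 = (pc + 4) & 0xF0000000u;      /* bits 31:28 of PC + 4 */
    return upper4 | (addr26 << 2);                 /* bits 27:0 */
}
```

Since only 28 bits come from the instruction, a J instruction can reach targets only within the current 256 MB region of the address space.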
Building Microprocessor
• Designed the individual datapath for
• Instruction Fetch
• R-type instructions
• I-type instructions
• J-type instructions
• For the ALU and the write register, an input can have more than one source of data
• Insert a MUX before each such input and select the source through the MUX-select line
Fig: Combined Fetch, R-type, LW/SW and Branch data-path. MUXes select the write register (RegDst), the second ALU input (ALUSrc), the register write data (MemtoReg) and the next PC (PC + 4 or Branch Address). Control signals shown: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, ALUOp.
How does one measure the clock period?
Combined Fetch cycle, R, M, I, B and J-type data-path
Fig: Combined data-path with the Jump MUX added. The jump address is formed from bits 31:28 of PC + 4 and Instruction[25:0] << 2 (bits 27:0).
How does one measure the clock period?
Identify the control signals
• Jump
• RegDst
• RegWrite
• ALUSrc
• Branch
• ALUOp
• MemRead
• MemWrite
• MemtoReg
How can we design a Controller?
ALUOp   Meaning
00      add
01      subtract
10      Look at funct field
11      n/a
Control Unit
Fig: Control unit with a Main Decoder and an ALU Decoder. Main Decoder outputs: MemtoReg, MemWrite, Branch, Jump, ALUSrc, RegDst, RegWrite, ALUOp1:0. The ALU Decoder takes ALUOp1:0 and produces ALUControl2:0.
Generation of Controls: Main decoder truth table
Inputs to the control unit: op-code part [31:26] and funct part [5:0] of the instruction
Main decoder input: op-code part [31:26] (6:2^6 decoder)
ALUOp   Meaning
00      add
01      subtract

Instr. (Input)  Jump  RegDst  RegWrite  ALUSrc  Branch  ALUOp1  ALUOp0  MemRead  MemWrite  MemtoReg
R-type          0     1       1         0       0       1       0       0        0         0
lw              0     0       1         1       0       0       0       1        0         1
sw              0     x       0         1       0       0       0       0        1         x
addi            0     0       1         1       0       0       0       0        0         0
B-type          0     x       0         0       1       0       1       0        0         x
J-type          1     x       0         x       x       x       x       0        0         x
Book: COD by P&H, Appendix D
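The truth table translates directly into a lookup on the op-code. The sketch below makes two assumptions that are not in the table: don't-care entries (x) are driven to 0, and the op-code values are the standard MIPS encodings (R-type 0x00, j 0x02, beq 0x04, addi 0x08, lw 0x23, sw 0x2B). The struct and function names are illustrative.

```c
#include <assert.h>

/* One field per control signal; alu_op holds the 2-bit ALUOp1:0. */
typedef struct {
    int jump, reg_dst, reg_write, alu_src, branch;
    int alu_op, mem_read, mem_write, mem_to_reg;
} Controls;

/* Main decoder: maps the 6-bit op-code to the control signals. */
static Controls main_decoder(unsigned opcode) {
    Controls c = {0, 0, 0, 0, 0, 0, 0, 0, 0};  /* unknown op-code: all signals off */
    assert(opcode < 64);
    switch (opcode) {
    /*                     J  RD RW AS Br Op MR MW M2R */
    case 0x00: c = (Controls){0, 1, 1, 0, 0, 2, 0, 0, 0}; break;  /* R-type */
    case 0x23: c = (Controls){0, 0, 1, 1, 0, 0, 1, 0, 1}; break;  /* lw     */
    case 0x2B: c = (Controls){0, 0, 0, 1, 0, 0, 0, 1, 0}; break;  /* sw     */
    case 0x08: c = (Controls){0, 0, 1, 1, 0, 0, 0, 0, 0}; break;  /* addi   */
    case 0x04: c = (Controls){0, 0, 0, 0, 1, 1, 0, 0, 0}; break;  /* beq    */
    case 0x02: c = (Controls){1, 0, 0, 0, 0, 0, 0, 0, 0}; break;  /* j      */
    default: break;
    }
    return c;
}
```

In a hardwired control unit each output is a logic function of the op-code bits; the switch statement plays the role of the 6:2^6 decoder followed by OR gates.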
Generation of Controls: Main decoder truth table
Inputs to the control unit: op-code part [31:26] and funct part [5:0] of the instruction
Output of the control unit:
If (Instr. is lw, sw or addi)
ALUSrc = 1;
Else
ALUSrc = 0;
How can we write it using a logical expression?
ALUSrc = (!R-type & !B-type & lw) + (!R-type & !B-type & sw) + (!R-type & !B-type & addi)
Hardwired CU
ALU Operations
ALU control lines   ALU Function
0000                AND
0001                OR
0010                Add
0110                Subtract

ALUOp   ALUControl       Funct
00      010 (add)        X
01      110 (subtract)   X
Generation of Controls
Inst. opcode   ALUOp   Instr. operation   Funct field   Desired ALU action   ALUControl
100011 (LW)    00      load word          xxxxxx        add                  0010
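The ALU decoder combines ALUOp with the funct field to produce ALUControl. The sketch below covers only the four ALU operations listed in these slides; the funct codes (add 0x20, sub 0x22, and 0x24, or 0x25) are the standard MIPS values, and the return value 0xF for unsupported inputs is a hypothetical marker.

```c
#include <assert.h>
#include <stdint.h>

/* ALU decoder: ALUOp 00 -> add, 01 -> subtract, 10 -> look at funct. */
static unsigned alu_decoder(unsigned alu_op, unsigned funct) {
    switch (alu_op & 3u) {
    case 0: return 0x2;               /* add: lw, sw, addi */
    case 1: return 0x6;               /* subtract: branches */
    case 2:
        switch (funct & 0x3Fu) {
        case 0x20: return 0x2;        /* add */
        case 0x22: return 0x6;        /* sub */
        case 0x24: return 0x0;        /* and */
        case 0x25: return 0x1;        /* or  */
        default: break;
        }
        break;
    default: break;                   /* ALUOp 11 is not used */
    }
    return 0xF;                       /* hypothetical "unsupported" marker */
}

/* ALU: performs the operation selected by ALUControl and sets Zero. */
static uint32_t alu(unsigned alu_control, uint32_t a, uint32_t b, int *zero) {
    uint32_t r;
    switch (alu_control) {
    case 0x0: r = a & b; break;
    case 0x1: r = a | b; break;
    case 0x2: r = a + b; break;
    case 0x6: r = a - b; break;
    default:  r = 0;     break;
    }
    *zero = (r == 0);
    return r;
}
```

For a load word, ALUOp = 00 ignores the funct field and selects add (0010), matching the table row above.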
Performance analysis of implementation
Resource Usage
Instr.  PC  IMM  ADD_PC  RF  EXTN  ALU  <<2  ADD_B  DMM  RF (Write)  Total Resource Usage
ADD     √   √    √       √   ×     √    ×    ×      ×    √           6
BNE     √   √    √       √   √     √    √    √      ×    ×           8
J       √   √    √       ×   ×     ×    ×    ×      ×    ×           3
SW      √   √    √       √   √     √    ×    ×      √    ×           7
LW      √   √    √       √   √     √    ×    ×      √    √           8
ADDI    √   √    √       √   √     √    ×    ×      ×    √           7
What could be the maximum time to update the PC? It matters because the updated PC brings the next instruction into the datapath.
Which instruction uses the maximum number of resources? Next, consider the time taken by each resource to compute its function.
Fig: Combined data-path annotated with the clock (CLK) inputs
Single-cycle implementation
• The previous design is called single-cycle implementation
• The instruction memory, register file and data memory are all read combinationally
• What does it mean?
• If the address changes, the new instruction appears at the output of the instruction memory after some propagation delay
• Operations are done on the rising edge of the clock
• The single-cycle microarchitecture executes an entire instruction in one clock cycle
• Simple control unit (why?)
• No next state is associated with it
• Every operation is done in one clock cycle
• How to calculate Delay
• Delay: the time between applying the input and producing the output; put another way, the time between two consecutive inputs, i.e., when we can update the PC
• Input: reading an instruction from Memory
• Output: the result produced by the instruction that was read
• The next input is available when we update the PC to (PC + 4) or to the Branch address
• XYZ-organization is going to build the … 100 billion instructions.
Parameter   Delay (ps)
30
250
150
200
25
20
Find the minimum and maximum number from a set of numbers
.text
main:
la $a0, array
lw $a1, array_size
lw $t2, maxE # max
lw $t3, minE # min
loop_array:
beq $a1, $zero, print_and_exit
lw $t0, ($a0)
bge $t0, $t3, not_min # if (current_element >= current_min) {don't modify min}
move $t3, $t0
not_min:
ble $t0, $t2, not_max # if (current_element <= current_max) {don't modify max}
move $t2, $t0
not_max:
addi $a1, $a1, -1
addi $a0, $a0, 4
j loop_array
print_and_exit:
# print maximum
li $v0, 4 #for string syscall
la $a0, array_max
syscall
li $v0, 1 #for number syscall
move $a0, $t2
syscall
# print minimum
li $v0, 4
la $a0, array_min
syscall
li $v0, 1
move $a0, $t3
syscall
# exit
li $v0, 10
syscall
Find the minimum and maximum number from a set of numbers
#include <stdio.h>

int main()
{
    int arr[] = {1, 2, -8, 0, 23, 11, -10};
    int N = sizeof(arr) / sizeof(arr[0]), i;  // number of elements in arr
    int minE = 9999, maxE = -9999;
    // Traverse the given array
    for (i = 0; i < N; i++) {
        // If current element is smaller than minE then update it
        if (arr[i] < minE) {
            minE = arr[i];
        }
        // If current element is greater than maxE then update it
        if (arr[i] > maxE) {
            maxE = arr[i];
        }
    }
    printf("The minimum element is %d", minE);
    printf("\n");
    printf("The maximum element is %d", maxE);
    return 0;
}
Algorithm and its Possible Architectures
Algorithm
Program/Language
Runtime system (OS, VM, MM)
ISA (Architecture)
Logic
Devices
How do we ensure problems are solved by electrons?
Yale Patt, “Requirements, Bottlenecks, and Good Fortune: Agents for Microprocessor Evolution,” Proc. of the IEEE, vol. 89, no. 11, Nov. 2001
History of computation
• Homework
• Go through the material on Gdrive, shared with you.
• https://drive.google.com/drive/folders/1JkTqBCwtP8o6-7YzjX8JfxbabKT8puzW
• Follow the order mentioned in the xlsx file
• Questions will be asked in the next class
Changes in Computation
• Manual
• Mechanical
• gears, chains, pulleys, and steam power
• Punch cards
• Electro-mechanical
• switches, relays
• Electrical
• plugboards, vacuum tubes
• later came DRUM memory, core memory, transistors and so on ...
http://www.computerhistory.org/timeline
Computation in 2004
• 64-bit Itanium processor developed by Intel
• 1.7 billion transistors
• 1.7 GHz, issues up to 8 instructions per cycle
• 26 MByte of cache
• In ~30 years, about 100,000-fold growth in transistor count and performance
What kind of growth is it?
Future is about Quantum Computing
Quantum Computing
• https://awards.acm.org/about/2020-acm-prize
• https://www.ibm.com/quantum-computing/
• Google wants to build a useful quantum computer by 2029
• https://www.theverge.com/2021/5/19/22443453/google-quantum-computer-2029-decade-commercial-useful-qubits-quantum-transistor
• Quantum Computing: Untangling the Hype (Talk at The Royal Institution)
• https://www.youtube.com/watch?v=wE1OCXvaDtc
Homework
• Design a single-purpose processor that performs only bubble sort.
• Design a general processor for a set of instructions:
• ADD R1, R2, R3
• SUB R1, R2, R3
• LW R2, offset(R3)
• SW R2, offset(R3)
• BNE R1, R2, offset
• Has an instruction memory and a data memory
• Has 32 registers
• Register R0 always holds the value zero.
Summary
• Motivation for automated Computation
• Dedicated processor Vs. General-purpose processor
• Limitation of Algorithm
• Building block of a program
• Steps to solve a problem by a computer/laptop
• Changes in Computation
ALU Operations
ALU control lines (2:0) ALU Functions
000 AND
001 OR
010 Add
110 Subtract
00 010 (add) X
01 110 (subtract) X