Verification and Computer Architecture Important Links
https://www.youtube.com/watch?v=TMJj015C93A&list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe&index=76 - coherence
https://www.youtube.com/watch?v=fT4DJcNCisM&list=PLAwxTw4SYaPndXEsI4kAa6BDSTRbkCKJN - consistency
Architecture vs microarchitecture
Architecture - ISA
Microarchitecture - organisation
A load-store queue can be used to handle a load that follows stores (memory disambiguation and store-to-load forwarding)
https://www.youtube.com/watch?v=bEB7sZTP8zc&list=PLAwxTw4SYaPkNw98-MFodLzKgi6bYGjZs&t=22
note: WAW is a problem in multiple-issue or OOO designs.
WAW and WAR are also called false dependencies. (If instruction i3 writes register R3
and i5 also writes R3, then the instructions after i5 (i6, i7, i8, ...)
should see the R3 value written by i5.)
1)Tomasulo:
Issue: The instruction is sent to a reservation station, going
through the RAT (the RAT maps registers --> reservation stations).
If the reservation stations are full then we need to stall.
Dispatch: Once the operands are available, dispatch the instruction.
If multiple instructions are ready to dispatch then we can do one of these three:
a) send the oldest first
b) send the one which has the most dependents -- power hungry and difficult
c) random
Broadcast: Once the instruction completes, broadcast the result to the
reservation stations, and make that particular RS free.
If multiple instructions are ready to broadcast then give preference to the slower execution unit.
Tomasulo drawbacks: exception handling creates problems in OOO execution
(because many later instructions will already have executed instead of waiting).
Branch mispredictions: e.g. a div takes 40 cycles, a beq depends on it,
and a later add is already done (say R3 already holds the sum);
but if the branch turns out not taken then that is a problem.
Register values are stored out of order - to overcome this we use the ROB.
2)ROB:
a) Execute OOO
b) Broadcast OOO
c) Deposit results in order
A reorder buffer is required to hold the instructions so that we can store the results in order.
OOO has the following steps https://www.youtube.com/watch?v=0w6lXz71eJ8&list=PLAwxTw4SYaPnhRXZ6wuHnnclMLfg_yjHs&index=6
Issue: The instruction is sent to the reservation stations and to the
ROB in order (remember the order in which instructions were issued).
Dispatch: Once the operands are available, dispatch the instruction.
Once the instruction is dispatched, the RS is made free.
Broadcast: Once the instruction completes, broadcast the result to the
reservation stations and store the result in the ROB (in that same issue order).
Commit: Store the computed results in order using the commit and issue pointers.
To commit instruction I5, everything up to I4 must already have
committed; otherwise I5 waits in the ROB.
On commit, change the RAT to point back to the register.
The commit pointer and issue pointer in the ROB help in knowing
which was the previous commit and which is the latest issue.
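The issue/broadcast/commit flow above can be sketched with a toy reorder buffer. This is a hypothetical minimal model (the class and method names are made up, not from any real design): instructions finish out of order, but the commit pointer at the head only advances past completed entries.

```python
from collections import deque

class ROB:
    """Toy reorder buffer: results arrive out of order, commit stays in order."""
    def __init__(self):
        self.buf = deque()              # entries in issue (program) order

    def issue(self, name):
        entry = {"name": name, "done": False, "result": None}
        self.buf.append(entry)          # issue pointer = tail of the deque
        return entry

    def broadcast(self, entry, result):
        entry["done"] = True            # result deposited in the ROB, possibly OOO
        entry["result"] = result

    def commit(self):
        """Commit pointer = head: pop only while the oldest entry is done."""
        committed = []
        while self.buf and self.buf[0]["done"]:
            committed.append(self.buf.popleft()["name"])
        return committed

rob = ROB()
i1, i2, i3 = rob.issue("I1"), rob.issue("I2"), rob.issue("I3")
rob.broadcast(i3, 30)                   # I3 finishes first (out of order)
assert rob.commit() == []               # but it cannot commit before I1 and I2
rob.broadcast(i1, 10)
assert rob.commit() == ["I1"]           # I2 still pending, so I3 keeps waiting
rob.broadcast(i2, 20)
assert rob.commit() == ["I2", "I3"]     # everything older has now committed
```

This mirrors the I5/I4 rule in the notes: an entry waits in the buffer until every older entry has committed.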
Data hazard
schedule
stalling
bypass
speculate
Cache
Why cache was needed? DRAM - 1000 cycles
How to place a block in the cache - direct mapped, set associative, fully associative:
(block address mod number_of_sets) = set where the data has to be placed.
note: number_of_sets = 1 for fully associative
What does the cache exploit - temporal locality and spatial locality
How block is found?
Block offset: determined by the number of bytes in each cache line
Index: which cache set (cache_line everywhere here means cache_set)
Tag: identifies which memory address the block came from
More associativity requires more tag comparators and fewer index bits
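The tag/index/offset split above can be sketched as a small helper (a toy illustration; the function name and the example geometry of 64-byte blocks with 128 sets are made up here):

```python
def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, index, block offset).

    block_size and num_sets are assumed to be powers of two; for a
    fully associative cache num_sets == 1, so there are no index bits.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    offset = addr & (block_size - 1)            # byte within the line
    index = (addr >> offset_bits) & (num_sets - 1)  # which set
    tag = addr >> (offset_bits + index_bits)    # rest identifies the block
    return tag, index, offset

# 64-byte blocks, 128 sets: 6 offset bits, 7 index bits
addr = 0x1234ABCD
tag, index, offset = split_address(addr, 64, 128)
assert offset == addr & 0x3F
assert index == (addr >> 6) & 0x7F
assert tag == addr >> 13
```

Doubling associativity halves the number of sets, so one index bit moves into the tag, which is why more comparators and fewer index bits go together.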
Misses:
Compulsory miss (first access to a block),
capacity miss (working set does not fit in the cache),
conflict miss (even though there is space elsewhere, blocks map to the
same set, e.g. direct mapped cache),
coherence misses
Cache Optimizations:
1. Larger block size - fewer compulsory misses
2. Larger cache size - fewer capacity misses
3. Higher associativity - fewer conflict misses
4. Reducing Miss penalty, hit time- multilevel cache
5. Giving reads priority over writes - using write buffers
Cache Performance:
Mem access time = hit time + miss rate * miss penalty
We can increase cache performance by reducing hit time, miss rate or miss penalty
Reducing cache size will decrease hit time but increase miss rate
Reducing associativity will decrease hit time but increase miss rate
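The average memory access time (AMAT) formula above can be worked through with hypothetical numbers (the cycle counts below are examples, not from the notes):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty
assert amat(1, 0.05, 100) == 6.0

# Multilevel cache (optimization 4): the L1 miss penalty is itself the AMAT of L2
l2 = amat(10, 0.20, 200)    # 10 + 0.20*200 = 50 cycles
l1 = amat(1, 0.05, l2)      # 1 + 0.05*50 = 3.5 cycles
assert l1 == 3.5
```

The two-level example shows why a multilevel cache reduces miss penalty: L1 misses pay the (much smaller) L2 AMAT instead of going straight to DRAM.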
Cache pipelining: an L1 cache which takes more than 1 cycle can be easily pipelined
index --> reading data - one stage
comparing tags - one stage
selecting data using block offset - one stage
Pipelining reduces the wait time for consecutive cache hits
[without it, even if I1 and I2 are both hits, I2 must wait until I1's
cache operation is done]
Physically indexed physically tagged (PIPT) cache:
VA --> TLB --> PA --> Cache
Adv: when the process changes (context switch) the TLB contents change,
but that is not a problem here
Disadv: SLOW (the TLB lookup sits before every cache access)
Virtually indexed virtually tagged (VIVT) cache:
VA --> Cache; VA --> TLB only on a cache miss
Disadv: on a process change we need to flush the cache (because the VA
translations are now different)
Many virtual addresses may map to the same cache line (aliasing)
Adv: SPEED
Virtually indexed physically tagged (VIPT) cache:
VA(index) --> Cache and VA --> TLB happen in parallel; the physical tag
from the TLB is then compared
Adv: no cache flush on context switch (better than VIVT)
Fast --> better than PIPT
Disadv: need to solve aliasing (2 VAs pointing to the same PA)
Mixture of set associative and direct mapped cache:
In a set associative cache (low miss rate), instead of reading
and checking all the tags,
speculate which way is going to win and choose that (less hit time)
this is called way prediction
Prefetching:
Speculating what the next memory access will be and bringing
that block into the cache.
right speculation - less latency
wrong speculation - a useful block gets evicted and a useless one
brought into the cache - this is called cache pollution
Prefetchers: stream buffers - if block A is accessed then fetch the next
consecutive blocks.
Stride based prefetcher - A...B...C... (constant stride between accesses)
Correlation prefetcher - ABC....ABC
detects patterns of memory accesses
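The stride-based idea can be sketched as a tiny predictor (a toy sketch with a made-up function name; real stride prefetchers track per-PC state in a table rather than a raw access list):

```python
def predict_next(accesses):
    """Stride-based prefetch guess: if the last two gaps between accesses
    are the same constant stride, prefetch last + stride; else nothing."""
    if len(accesses) < 3:
        return None                     # not enough history to see a stride
    s1 = accesses[-1] - accesses[-2]
    s2 = accesses[-2] - accesses[-3]
    return accesses[-1] + s1 if s1 == s2 else None

assert predict_next([100, 108, 116]) == 124   # stable stride of 8 detected
assert predict_next([100, 108, 120]) is None  # no stable stride, no prefetch
```

Guessing wrong here is exactly the cache-pollution case from the notes: the prefetched block evicts something useful.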
Multi_processors:
NUMA - non uniform memory access time
a group of clusters where each cluster is core + cache + memory (almost a
uniprocessor system)
Each cluster accesses other clusters' memory through message passing
f with N cores == f/4 with 4N cores; if f drops to f/4 we can also reduce
the voltage, so it is power efficient
- but only if we parallelize to a good extent
ILP - pipelining, OOO execution
DLP - SIMD
Task LP - different tasks
Loosely coupled MP: cores have different address spaces .... not shared memory
Tightly coupled MP: cores have the same address space .. shared memory
Multi threading: https://www.youtube.com/watch?v=ZpqeeHFWxes&list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe&t=405
Coarse grained - switch on an event like a cache miss
Fine grained - every cycle - a different thread
Simultaneous MT - instructions from multiple threads in the same cycle
Exclusive state: the data is present only in my cache, nowhere else
(reduces bus traffic)
Take the example of: Read A then Write A
in MOSI - I --> S --> M (2 bus transactions: one for the read,
one to upgrade for the write)
in MESI - I --> E --> M (1 bus transaction: the E --> M upgrade is silent)
The exclusive (E) state in MESI states that the cache block is
clean and valid (same value as in main memory)
and cached in only one cache, whereas the owned (O) state in MOSI
means the cache value is dirty and this cache is responsible for
supplying it (other caches may hold it shared).
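The read-then-write example can be modeled as a toy bus-transaction counter. This is a simplified sketch of the scenario above, assuming a single core is the only sharer of block A (the `run` helper and its state encoding are made up; real protocols handle many more cases):

```python
def run(protocol, ops):
    """Count bus transactions for one core accessing block A alone.
    With the E state, a read miss followed by a write needs only one
    bus transaction; without E (MOSI-style), it needs two."""
    state, bus = "I", 0
    for op in ops:
        if op == "read" and state == "I":
            bus += 1                                  # bus read on the miss
            state = "E" if protocol == "MESI" else "S"
        elif op == "write":
            if state in ("S", "I"):
                bus += 1                              # upgrade/invalidate others
            state = "M"                               # E -> M is a silent upgrade
    return bus

assert run("MESI", ["read", "write"]) == 1
assert run("MOSI", ["read", "write"]) == 2
```

The difference is entirely the silent E --> M transition: in MESI the cache already knows it is the only holder, so the write needs no bus traffic.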
Pipelining:
- structural hazard - lack of resources
- RAW - true dependency - data hazard
- WAR and WAW make OOO execution difficult
forwarding and stalling are used to avoid hazards
- Branch has only a one cycle delay slot
- Branch was resolved in the ID stage itself
- load followed by a normal instruction:
the normal instruction in the EX stage gets the loaded value from the MEM stage
through the forwarding unit
- store followed by a normal instruction - no problem
- store followed by load - no problem
- load followed by store - no problem
Branch Prediction:
- Static branch prediction - we predict either always taken or always not taken
- Dynamic branch prediction - 1 bit: the prediction bit flips if the
decision was wrong
- 2 bit: the prediction bits follow a state diagram;
2 consecutive wrong decisions result in a prediction change
BHT - branch history table (holds the history of the branch, 1 bit or 2 bits as above)
BTB - branch target buffer (holds the branch target PC)
Initial state: it is good to start from the weak states rather than the strong states,
but in some corner cases like (T,N,T,N) starting from a weak state will always
give wrong predictions
- History based predictors
Used when there is a pattern: T,N,T,N,T,N or TT,NN,TT,NN
Learns and fills the BHT during the initial iterations;
from then on it will almost always be right (eg: if the last decision was
not taken then the next decision should be taken)
- Tournament predictors
an amalgamation of two predictors (P1 and P2), each good for a different
branch pattern
there is one more meta predictor which selects either predictor1 or predictor2
P1 and P2 are trained as above, but the meta predictor must also be trained
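The 2-bit scheme and the (T,N,T,N) corner case above can be checked with a small saturating-counter model (class and method names are made up for illustration):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict
    taken. Two consecutive wrong decisions are needed to flip the prediction."""
    def __init__(self, state=1):          # default: a weak state
        self.state = state

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A single not-taken blip does not flip a strong-taken predictor
p = TwoBitPredictor(state=3)
preds = []
for actual in [False, True]:
    preds.append(p.predict())
    p.update(actual)
assert preds == [True, True]

# Corner case from the notes: alternating T,N,T,N from a weak state
p = TwoBitPredictor(state=1)              # weak not-taken
wrong = sum(1 for actual in [True, False, True, False]
            if (p.predict() != actual, p.update(actual))[0])
assert wrong == 4                         # every single prediction is wrong
```

The alternating pattern keeps bouncing the counter between the two weak states, which is exactly why a history-based predictor is needed for T,N,T,N.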
Virtual Memories:
MIPS provides a 32 bit address space, i.e. load word and store word can
have 32 bit addresses,
so the programmer has 4GB in his hands.
But if your RAM is just 1GB, then YOU NEED VIRTUAL MEMORY.
Each process has its own virtual memory and corresponding page tables.
If we have 4GB of virtual memory (eg MIPS) then mapping each virtual
address individually would need a table with 4G entries.
Hence, to overcome this, memory is divided into blocks called pages.
A virtual memory page is mapped to a physical memory page through lookup
tables called page tables.
Virtual memory provides indirection, i.e. it MAPs program addresses to
physical (RAM) addresses.
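The page-table indirection can be sketched in a few lines (a toy single-level table; the 4KB page size is a common choice used here for illustration, and the `translate` helper is made up):

```python
PAGE_SIZE = 4096          # 4KB pages -> low 12 bits are the page offset

def translate(page_table, va):
    """Toy single-level page table lookup: VPN -> PPN, offset unchanged."""
    vpn, offset = divmod(va, PAGE_SIZE)   # split virtual page number / offset
    if vpn not in page_table:
        raise KeyError("page fault: VPN %d not mapped" % vpn)
    return page_table[vpn] * PAGE_SIZE + offset

page_table = {0: 7, 1: 3}                 # VPN 0 -> PPN 7, VPN 1 -> PPN 3
assert translate(page_table, 0x123) == 7 * 4096 + 0x123
assert translate(page_table, 4096 + 5) == 3 * 4096 + 5
```

Because only pages are mapped (not individual addresses), a 32-bit space with 4KB pages needs at most 2^20 entries per process instead of 4G.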
Basics of OOP:
What if you want to edit that constraint? You can use inheritance.
base_class bas_obj;
ext_xyz ext_obj;
If you want to redefine a function in the extended class then you need to declare the
function as VIRTUAL in the base class. This is also called overriding.
https://www.youtube.com/watch?v=dX2ojPL0Y5M
An abstract class is also called a virtual class, i.e. a class declared with the
keyword virtual.
These virtual classes cannot be instantiated, they can only be extended.
i.e. virtual class abc ... endclass
abc var = new(); XXX not valid - an abstract class cannot be constructed
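The same idea can be shown with a Python analogue (not SystemVerilog) using the standard `abc` module; the class names here are made up for illustration:

```python
from abc import ABC, abstractmethod

class Transaction(ABC):            # analogue of "virtual class ... endclass"
    @abstractmethod
    def display(self):
        ...

class AluTransaction(Transaction):
    def display(self):             # override, like a virtual function in SV
        return "alu txn"

# Constructing the abstract base fails, just like "new()" on a virtual class
try:
    Transaction()
    instantiated = True
except TypeError:
    instantiated = False
assert not instantiated

# The extended class is fine, and the override is the method that runs
assert AluTransaction().display() == "alu txn"
```

As in SV, the base type can still be used as a handle type for polymorphism; only construction of the abstract base is forbidden.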
https://www.youtube.com/watch?v=Mb1y-L-ZjD0&list=PL792F8AED9E6F3E9F
TO BE READ:
public,private classes
friend functions
static, const declarations
SYSTEM VERILOG:
cover,assert,sequence,property
https://www.youtube.com/watch?v=siXo9_fCt7k&index=3&list=PL589BOiAVX7YFqkUsZjcoLVxfkkInhXh3
sequence delays: ##1, ##2, ##[1:2]
consecutive repetition [*2]: start,start,x,x,x,x,x,x or a range [*3:5]
non consecutive repetition [=2]: start,z,z,z,z,start,z,z,z,z (the match may
end any time after the 2nd start)
goto non consecutive operator [->2]: start,z,z,z,z,start (the match ends
exactly at the 2nd start)
overlapping sequence implication operator |->: like p(b|a); checks the
consequent in the same cycle the antecedent completes
non overlapping sequence implication operator |=>: like p(b|a); checks the
consequent one cycle after the antecedent completes
We can also have named sequences and properties: sequence xyz; ...; endsequence,
then use xyz everywhere
Every assertion can have an action block which is executed immediately on
pass/fail, like an error or warning message
$onehot, $isunknown, $rose, $fell, $stable, $past, throughout
Assertion patterns:
How and where to create assertions
- $rose(req) |-> (req throughout (~grant[*0:8] ##1 grant ##1 start)) -
template patterns that can be reused by others
- assert property (@(posedge clk) disable iff (~rst_n) !$isunknown(var_name)) - X
detection
- assert property (@(posedge clk) disable iff (~rst_n) (var_name <= 7)) - valid
range
- assert property (@(posedge clk) disable iff (~rst_n) ($rose(start) |=>
(!start throughout done[->1]))) - bounded window
For state machines, write auxiliary state machines in the tb and write properties against them
https://www.youtube.com/watch?v=Es-rRRI7Bq8&index=1&list=PL589BOiAVX7YFqkUsZjcoLVxfkkInhXh3
linear tb: normal testbench; the simulator has to create a lot of events; bad
linear random tb: good, but covers unnecessary cases; no coverage
directed verif: the engineer writes different test cases in test case files
and uses those files,
but writing all the test cases for a big design consumes a
lot of time,
and what if we get scenarios which we never thought of?
constrained random verif: the tb generates the stimulus automatically
less simulation time
will not by itself provide info about how well the
design has been verified
Functional Coverage (CIT) - item FC: 'a' should take values from 4 to 10 and
'b' should take values from 0 to 1
Cross FC: item_a x item_b (when b=0 all 'a'
values should be seen, and when b=1 all 'a' values should be seen)
Transition FC: 'a' should take values
only in this order: 4-5-6-7-8-9-10
Assertion based FC: by using cover,
property, assert
Note: there are two kinds of monitors, active and passive. Active can drive the
DUT but passive can't.
Checks operational features: like
push, pop, full, empty in a fifo
Arrays
1. Static array: no run time change of size
eg: fixed array
packed and unpacked
memory is allocated at compile time
cannot change the size at runtime
2. Dynamic array: we can change the size at run time
a. Dynamic array: can change the size at run time
not good for sparse data elements
dynamic arrays are always unpacked
memory is allocated at run time
dynamic arrays have methods like new[] and delete() to
allocate or deallocate memory
b. Associative array: good for sparse data
https://www.youtube.com/watch?v=qTZJLJ3Gm6Q
https://www.youtube.com/watch?v=Bts4c-sPOiE
https://www.youtube.com/watch?v=Bts4c-sPOiE
VLSI Design:
SRAM, DRAM and memory model basics: https://www.youtube.com/watch?v=7k_3EAkKfak&list=PLAwxTw4SYaPn79fsplIuZG34KwbkYSedj&t=35
DRAM needs constant refreshing; reads are destructive (we need a write back once we read
something)
Miscellaneous:
UVM:
One of the great advantages of UVM is that it is very easy to replace
components without
having to modify the entire testbench; this comes from the concept
of classes and objects
in SystemVerilog.
UVM has many classes which we can derive from to build our own,
eg: uvm_component, uvm_driver, uvm_agent, uvm_sequencer etc
Note:
1) The interface is a module that holds all the signals of the DUT.
The monitor, the driver and the DUT are all going to be connected
to this module.
package ALU_pkg;
`include "class_add.svh"
`include "class_sub.svh"
`include "class_div.svh"
`define mul (a*b)
endpackage
import ALU_pkg::*;
module tb;
begin end
then work on it
UVM components:
UVM is a large set of classes. Many of these classes are derived from one
another,
eg. uvm_test is derived from uvm_component.
Each of the component classes has these default virtual methods inside
the class definition:
1. function void build_phase(); when this is called it calls the build
phase of all components in the hierarchy (top-down).
2. function void connect_phase(); when this is called it connects
all the classes.
3. task run_phase(); it is the only task, as it can consume
delays; when this is called it runs the test (like an initial block).
4. function void report_phase(); reports the data.
vsim +UVM_TESTNAME=dog
vsim +UVM_TESTNAME=cat -- we can call different tests from the command line
how to do it:
class dog extends uvm_test;
`uvm_component_utils(dog) // tells the UVM factory that
there is a test called dog which can be built by name
https://hardwaregeeksblog.files.wordpress.com/2016/12/fifodepthcalculationmadeeasy2.pdf
Depth = burst size - (total time to write the burst / time to read a
single data item)
For the worst case scenario, the difference between the write
and read data rates should be maximum
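The depth formula above can be turned into a small calculator. The numbers in the example (a 120-word burst, writes every 2 ns, reads every 10 ns) are hypothetical, and the `fifo_depth` helper is made up here:

```python
import math

def fifo_depth(burst_size, t_write, t_read):
    """Worst-case FIFO depth for a burst of writes.

    Occupancy peaks at the end of the burst:
      written = burst_size
      read    = (time to write the whole burst) / (time to read one word)
    depth = burst_size - floor(burst_size * t_write / t_read)
    Assumes reads run continuously during the burst; real designs add margin.
    """
    total_write_time = burst_size * t_write
    words_read = math.floor(total_write_time / t_read)
    return burst_size - words_read

# Hypothetical example: 120-word burst, write every 2 ns, read every 10 ns
# burst takes 240 ns; 24 words are read in that time; 96 must be buffered
assert fifo_depth(120, 2, 10) == 96
```

This matches the worst-case note: the bigger the gap between write and read rates, the fewer words drain during the burst and the deeper the FIFO must be.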
PERL:
https://www.youtube.com/watch?v=WEghIXs8F6c
scalars, arrays and hashes
scalars: my $var_name = 'Derek';
"" === qq{}
say "5+4=", 5+4;
rand, log, sqrt, int, hex, oct, exp
Ask Vivek:
Accessing multi dimensional array elements
About clock domain crossing and reset domain crossing
Can we do clock gating in Verilog?
Brief about MSI, MOSI, MESI, MESIF
Queues and fifos? Where are they used in testbenches?
Given the read and write frequencies, how do we calculate FIFO depth?
When were we stalling in MIPS?
Projects:
MIPS is a RISC, Harvard architecture
Points in book:
A ring oscillator is one which has an odd number of NOT gates
C_out = G + P*C_in -- carry look ahead adder
Note:
test calls the constructor of the environment
the environment calls the constructors of the driver, monitor and scoreboards
the driver calls the constructors of the covergroup and stimulus
http://www.sunburst-design.com/papers/CummingsSNUG2008Boston_CDC.pdf
Queue methods:
push_back(),push_front,pop_front,pop_back,delete,insert,size
pre/post randomization functions are called when we call randomize(); we can use
them to initialize, randomize and assign to variables
randomize() is virtual by default so it always calls the pre_randomize and post_randomize
of the child class (similarly, constraints are virtual)
obj.rand_mode() can be used to disable a random variable
we can randomize non-random variables by passing them as arguments to randomize()
The built-in method randomize() is not only used for randomization, it can also be
used as a checker. When randomize(null) is called,
the randomize() method behaves as a checker instead of a random generator: it evaluates
all the constraints and returns the status.
obj.randomize() with { Var == 50; } -- inline constraints
Global constraints: constraints between variables of different objects
If a constraint block is declared static, then constraint_mode on that block
will affect all the instances of that class
Weighted distribution:
Var dist { 10 := 1; 20 := 2; 30 := 2 } -- var will be
10 with probability 1/5, and 20 and 30 with probability 2/5 each
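The `dist` semantics can be mimicked in Python to see the weights in action (a toy model; `dist_sample` is a made-up helper built on `random.choices`, not part of any SV tool):

```python
import random

def dist_sample(weights, n, seed=0):
    """Toy model of `Var dist { 10 := 1; 20 := 2; 30 := 2 }`:
    each value is drawn with probability weight / sum(weights)."""
    rng = random.Random(seed)            # fixed seed for repeatability
    values = list(weights)
    w = [weights[v] for v in values]
    return [rng.choices(values, weights=w)[0] for _ in range(n)]

samples = dist_sample({10: 1, 20: 2, 30: 2}, 10000)
# Expected proportions: 10 -> 1/5, 20 -> 2/5, 30 -> 2/5
assert abs(samples.count(10) / 10000 - 0.2) < 0.03
assert abs(samples.count(20) / 10000 - 0.4) < 0.03
assert abs(samples.count(30) / 10000 - 0.4) < 0.03
```

Note this models `:=` (weight per value); the SV `:/` operator instead spreads a weight across a range of values.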
Implication:
constraint c { (a == 0) -> (b == 1); }
Constraints can have foreach loops (for constraining each element of an array) and
function calls
randcase can provide probability distributions
implicit bins,
explicit bins, bins bin_name[7:0] = var_name;
default bins,
ignore_bins,
illegal_bins,
transition bins: 3=>4=>5=>6, bins trans[] = (3,4 => 5,6);
wildcard bins: wildcard bins trans = (2'b0X => 2'b1X);
bins can be arrays, eg. bins bin_name[4] = {[0:7]}; (creates 4 bins)