
Introduction to High Performance Computing
Unit-I
Syed Rameem Zahra (Assistant Professor)
Department of CSE, NSUT
What & Why
❑ What is high performance computing (HPC)?
❖ The use of the most efficient algorithms on computers capable of
the highest performance to solve the most demanding problems.
❑ Why HPC?
❖ Numerical simulation to predict the behaviour of physical
systems.
❖ High performance graphics, particularly visualization and animation.
❖ Big data analytics for strategic decision making.
❖ Synthesis of molecules for designing medicines.
HPC Examples: Blood Flow in the Human Vascular Network
❑ Cardiovascular disease accounts for about 50% of deaths in the western world.
❑ Formation of arterial disease is strongly correlated with blood flow patterns.
❑ In one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, about a quarter of the distance between the Earth and the Moon.
❑ Blood flow involves multiple scales.
❑ Computational challenge: enormous problem size.
HPC Examples
❑ Earthquake simulation: surface velocity 75 sec after the earthquake.
❑ Flu pandemic simulation: 300 million people tracked; density of the infected population, 45 days after the outbreak.
How HPC fits into Scientific Computing
❑ Physical processes (e.g., air flow around an airplane)
❑ Mathematical models (e.g., the Navier-Stokes equations)
❑ Numerical solutions: algorithms, boundary conditions, solvers, application codes, supercomputers (this is where HPC enters)
❑ Data visualization, validation, physical insight (viz software)
Major Applications of Next Generation Supercomputer
❑ Targeted as grand challenges.
Performance Metrics
❑ FLOPS, or FLOP/S: FLoating-point Operations Per Second
❖ MFLOPS: MegaFLOPS, 10^6 flops
❖ GFLOPS: GigaFLOPS, 10^9 flops, home PC
❖ TFLOPS: TeraFLOPS, 10^12 flops, present-day supercomputers (www.top500.org)
❖ PFLOPS: PetaFLOPS, 10^15 flops, by 2011
❖ EFLOPS: ExaFLOPS, 10^18 flops, by 2020
❖ MIPS = Mega Instructions Per Second = MegaHertz (if exactly one instruction completes per cycle)
Note: the von Neumann computer -- 0.00083 MIPS
Performance Metrics
❑ Theoretical peak performance (R_theor): maximum FLOPS a machine can reach in theory.
❖ R_theor = clock_rate * no_cpus * no_FPUs_per_CPU
❖ Example: 3 GHz, 2 CPUs, 1 FPU/CPU, then R_theor = 3x10^9 * 2 = 6 GFLOPS
❑ Real performance (R_real): FLOPS for specific operations, e.g. vector multiplication.
❑ Sustained performance (R_sustained): performance on a full application, e.g. CFD.
Not uncommon: R_sustained << R_real << R_theor, with R_sustained < 10% of R_theor.
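❖ A minimal Python sketch of the R_theor formula above; the clock rate, CPU count and FPU count are just the example values from this slide, not measurements:

def peak_flops(clock_hz, n_cpus, fpus_per_cpu=1):
    """Theoretical peak performance: clock_rate * no_cpus * no_FPUs_per_CPU."""
    return clock_hz * n_cpus * fpus_per_cpu

# Example from the slide: 3 GHz, 2 CPUs, 1 FPU per CPU -> 6 GFLOPS
r_theor = peak_flops(3e9, 2, 1)
print(f"R_theor = {r_theor / 1e9:.1f} GFLOPS")   # 6.0 GFLOPS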
Computer Performance
❑ CPU operates on data. If no data, CPU has to wait;
performance degrades.
❖ typical workstation: 3.2GHz CPU, Memory 667MHz. Memory 5
times slower.
❖ Moore’s law: CPU speed doubles every 18 months
❖ Memory speed increases much more slowly;
❑ Fast CPU requires sufficiently fast memory.
❑ Rule of thumb: Memory size in GB=R_theor in GFLOPS
❖ 1CPU cycle (1 FLOPS) handles 1 byte of data
❖ 1MFLOPS needs 1MB of data/memory
❖ 1GFLOPS needs 1GB of data/memory
Many “tricks” designed for performance improvement target the memory.
CPU Performance
❑ Computer time is measured in terms of CPU cycles
❖ Minimum time to execute 1 instruction is 1 CPU cycle
❑ Time to execute a given program:
t = n_c * t_c = n_i * CPI * t_c
❖ n_c: total number of CPU cycles
❖ n_i: total number of instructions
❖ CPI = n_c/n_i, average cycles per instruction
❖ t_c: cycle time; for a 1 GHz clock, t_c = 1/(10^9 Hz) = 10^(-9) s = 1 ns
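❖ A small Python sketch of the relation t = n_i * CPI * t_c; the instruction count and CPI used below are made-up illustrative numbers:

def exec_time(n_instructions, cpi, clock_hz):
    """t = n_c * t_c = n_i * CPI * (1 / clock_rate)."""
    t_c = 1.0 / clock_hz          # cycle time in seconds
    n_c = n_instructions * cpi    # total CPU cycles
    return n_c * t_c

# Illustrative values: 10^9 instructions, CPI = 1.5, 1 GHz clock
print(f"{exec_time(1e9, 1.5, 1e9):.2f} s")   # 1.50 s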
To Make a Program/Computer Faster…
❑ Reduce cycle time t_c:
❖ Increase clock frequency; however, there is a physical limit
❖ In 1ns, light travels 30cm
❑ Reduce number of instructions n_i:
❖ More efficient algorithms
❖ Better compilers
❑ Reduce CPI -- The key is parallelism.
❖ Instruction-level parallelism. Pipelining technology
❖ Internal parallelism, multiple functional units; superscalar processors;
multi-core processors
❖ External parallelism, multiple CPUs, parallel machine
To Make a Program/Computer Faster…
● use parallelism in a single processor computer
○ Overlap execution of a number of instructions by pipelining, or by
using multiple functional units, or multiple processor “cores”.
○ Overlap operation of different units of a computer.
○ Increase the speed of arithmetic logic unit by exploiting data and/or
temporal parallelism.
● use parallelism in the problem to solve it on a parallel
computer.
○ Use number of interconnected computers to work cooperatively to
solve the problem.
Flynn’s Classification of Parallel Architectures
● M.J. Flynn proposed a classification of computer organisations based on the number of instruction streams and data streams that are handled at the same time.
● An instruction stream is a sequence of instructions read from memory. A data stream is the result of the operations performed on the data in the processor.
● The term ‘stream’ refers to the flow of data or instructions.
● Parallel processing can happen in the data stream, the instruction stream, or both.
Flynn’s Classification of Parallel Architectures
SISD
■ These are also called scalar processors, i.e., they execute one instruction at a time and each instruction has only one set of operands.
■ Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle.
■ Single data: only one data stream is used as input during any one clock cycle.
■ Characteristics:
■ Deterministic execution.
■ Instructions are executed sequentially.
■ This is the oldest and, until recently, the most prevalent form of computer.
■ Example: most PCs, single-CPU workstations and mainframes.
SISD Bottleneck
■ Level of parallelism is low
■ Data dependency
■ Control dependency
■ Improvements to address these limitations:
■ Pipelining
■ Superscalar execution
■ Super-pipelined superscalar execution
Pipelining
❑ Overlapping execution of multiple instructions
❖ 1 instruction per cycle
❑ Sub-divide instruction into multiple stages;
Processor handles different stages of adjacent
instructions simultaneously
❑ Suppose 4 stages in instruction:
❖ Instruction fetch and decode (IF)
❖ Read data (RD)
❖ Execute (EX)
❖ Write-back results (WB)
Instruction Pipeline
Cycle:     1  2  3  4  5  6  7  8  9  10
Instr 1:   IF RD EX WB
Instr 2:      IF RD EX WB
Instr 3:         IF RD EX WB
Instr 4:            IF RD EX WB
Instr 5:               IF RD EX WB
Instr 6:                  IF RD EX WB
Instr 7:                     IF RD EX WB
Depth of pipeline: number of stages in an instruction.
After the pipeline is full, 1 result is produced per cycle; in a pipelined system, CPI = (n + depth - 1)/n.
With the pipeline, 7 instructions take 10 cycles; without the pipeline, 7 instructions take 28 cycles.
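❖ The cycle counts quoted above can be reproduced with a short Python sketch, assuming the 4-stage IF/RD/EX/WB pipeline of this example:

def pipelined_cycles(n_instructions, depth):
    """Cycles to finish n instructions on a pipeline of the given depth."""
    return n_instructions + depth - 1

def unpipelined_cycles(n_instructions, depth):
    """Without pipelining, each instruction takes `depth` cycles."""
    return n_instructions * depth

n, depth = 7, 4                                 # 7 instructions, IF/RD/EX/WB stages
print(pipelined_cycles(n, depth))               # 10 cycles
print(unpipelined_cycles(n, depth))             # 28 cycles
print(round((n + depth - 1) / n, 2))            # CPI = 10/7 = 1.43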
Inhibitors of Pipelining
❑ Dependencies between instructions interrupt pipelining and degrade performance:
❖ Control dependence.
❖ Data dependence.
❖ Structural dependence.
❑ These are called pipeline hazards.
Pipeline hazards
● These are delays in pipeline execution of instructions due to
non-ideal conditions.
○ Non-ideal conditions include:
○ Available resources in a processor are limited.
○ Successive instructions are not independent of one another. The result
generated by an instruction may be required by the next instruction.
○ All programs have branches and loops. Execution of a program is thus not in a
“straight line”. An ideal pipeline assumes a continuous flow of tasks.
● Delays due to resource constraints are known as structural hazards.
● Delays due to data dependency between instructions are known as data hazards.
● Delays due to branch instructions or control dependency in a program are known as control hazards.
Structural Dependency
● This dependency arises due to the resource conflict in the pipeline.
● A resource conflict is a situation when more than one instruction
tries to access the same resource in the same cycle.
● A resource can be a register, memory, or ALU.
● Solution for structural dependency:
○ To minimize structural dependency stalls in the pipeline, we use a hardware
mechanism called Renaming.
○ Renaming: According to renaming, we divide the memory into two
independent modules used to store the instruction and data separately called
Code memory (CM) and Data memory (DM) respectively.
○ CM will contain all the instructions and DM will contain all the operands that are
required for the instructions.
Structural Dependency: Example
● Suppose that in cycle 4, instructions I1 and I4 both try to access the same resource (memory), which introduces a resource conflict.
● To avoid this problem, the later instruction is kept waiting until the required resource (memory in our case) becomes available. This wait introduces stalls in the pipeline.
Control Dependence
❑ Branching: when an instruction follows a conditional branch, it is not known in advance whether that instruction will be executed.
❖ Loops: for(i=0;i<n;i++)…; do…enddo
❖ Jumps: goto …
❖ Conditions: if…else…, e.g. if(x>y) n=5;
Branching in programs interrupts the pipeline and degrades performance.
Avoid excessive branching!
Control Dependence: Example
● Consider the following sequence of instructions in the program:
○ 100: I1
○ 101: I2 (JMP 250)
○ 102: I3
○ .
○ .
○ 250: BI1
● Expected execution order: I1 -> I2 -> BI1. However, by the time the branch target of I2 is known, the pipeline has already fetched I3, the instruction that follows it in memory.
● To eliminate this problem we need to stop instruction fetch until we get the target address of the branch instruction. This can be implemented by introducing a delay slot until the target address is available.
Control Dependence
● Solution for control dependency:
● Branch prediction is the method through which stalls due to control dependency can be reduced or eliminated.
● In this approach, a prediction of which way the branch will go is made at the first (fetch) stage.
● When the prediction is correct, the branch penalty is zero.
● Branch penalty: the number of stalls introduced during branch operations in the pipelined processor.
● Total number of stalls introduced in the pipeline due to branch instructions = branch frequency * branch penalty.
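● A one-line Python check of the stall formula above; the 20% branch frequency and 3-cycle penalty are assumed illustrative values, not figures from the slide:

def branch_stalls_per_instruction(branch_frequency, branch_penalty):
    """Average stalls per instruction = branch frequency * branch penalty."""
    return branch_frequency * branch_penalty

# Assumed example: 20% of instructions are branches, 3-cycle penalty each
print(f"{branch_stalls_per_instruction(0.20, 3):.2f}")   # 0.60 stall cycles per instruction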
Data Dependence
❑ when an instruction depends on data from a
previous instruction
x = 3*j;
y = x+5.0; // depends on previous instruction
Data Hazards
● Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Such hazards cause delays in the pipeline.
● There are mainly three types of data hazards:
○ RAW (Read After Write) [flow/true data dependency]
○ WAR (Write After Read) [anti-data dependency]
○ WAW (Write After Write) [output data dependency]
● Examples:
1) A RAW hazard occurs when instruction J tries to read data before instruction I writes it. E.g.:
I: R2 <- R1 + R3
J: R4 <- R2 + R3
2) A WAR hazard occurs when instruction J tries to write data before instruction I reads it. E.g.:
I: R2 <- R1 + R3
J: R3 <- R4 + R5
3) A WAW hazard occurs when instruction J tries to write its output before instruction I writes it. E.g.:
I: R2 <- R1 + R3
J: R2 <- R4 + R5
● WAR and WAW hazards occur during the out-of-order execution of instructions.
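● As a small illustration of the three hazard types, the following Python sketch classifies the dependence between an earlier instruction I and a later instruction J from their read/write register sets; representing instructions by register sets is an assumption made for this example:

def classify_hazards(read_i, write_i, read_j, write_j):
    """Return the hazards a later instruction J has on an earlier instruction I."""
    hazards = []
    if write_i & read_j:
        hazards.append("RAW")   # J reads what I writes (true dependence)
    if read_i & write_j:
        hazards.append("WAR")   # J writes what I reads (anti-dependence)
    if write_i & write_j:
        hazards.append("WAW")   # J writes what I writes (output dependence)
    return hazards

# I: R2 <- R1 + R3 ; J: R4 <- R2 + R3  ->  RAW hazard on R2
print(classify_hazards({"R1", "R3"}, {"R2"}, {"R2", "R3"}, {"R4"}))  # ['RAW']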
Data Dependence: Solution
● To minimize data dependency stalls in the pipeline, operand forwarding is used.
● Operand forwarding: the interface registers present between the stages are used to hold an intermediate output, so that the dependent instruction can access the new value directly from the interface register.
Types of pipeline
● 1. Uniform delay pipeline:
● In this type of pipeline, all stages take the same time to complete an operation.
● In a uniform delay pipeline, Cycle Time (Tp) = Stage Delay
● If buffers are included between the stages, then Cycle Time (Tp) = Stage Delay + Buffer Delay
● 2. Non-uniform delay pipeline:
● In this type of pipeline, different stages take different times to complete an operation.
● Here, Cycle Time (Tp) = Maximum (Stage Delay)
● For example, if there are 4 stages with delays of 1 ns, 2 ns, 3 ns, and 4 ns, then Tp = Maximum (1 ns, 2 ns, 3 ns, 4 ns) = 4 ns
● If buffers are included between the stages, Tp = Maximum (Stage Delay + Buffer Delay)
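● A minimal Python sketch of the cycle-time rule above, using the 1 ns to 4 ns stage delays from the example; the 0.5 ns buffer delay is an assumed value:

def cycle_time(stage_delays_ns, buffer_delay_ns=0.0):
    """Non-uniform delay pipeline: Tp = max(stage delay + buffer delay)."""
    return max(d + buffer_delay_ns for d in stage_delays_ns)

stages = [1, 2, 3, 4]                              # ns, from the example above
print(cycle_time(stages))                          # 4 ns (no buffers)
print(cycle_time(stages, buffer_delay_ns=0.5))     # 4.5 ns (assumed 0.5 ns buffers)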
Performance of pipeline with stalls
MISD
■ Each processing unit operates on the data independently through an independent instruction stream: a single data stream is forwarded to different processing units, each connected to its own control unit and executing the instructions given to it by that control unit.
■ Thus, in these computers the same data flows through a linear array of processors executing different instruction streams.
■ This architecture is also known as systolic arrays, for pipelined execution of specific instructions.
MISD bottleneck
■ Low level of parallelism
■ High synchronization overhead
■ High bandwidth required
■ CISC bottleneck
■ High complexity
SIMD
■ This is a type of parallel computer.
■ Single instruction: all processing units execute the same instruction, issued by a single control unit, at any given clock cycle.
■ Multiple data: each processing unit can operate on a different data element; the processors are connected to a shared memory or an interconnection network that supplies multiple data elements to the processing units.
■ This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units. Thus, a single instruction is executed by different processing units on different sets of data.
■ Best suited for specialized problems characterized by a high degree of regularity, such as image processing and vector computation.
■ Synchronous (lockstep) and deterministic execution.
SIMD
■ A wide variety of applications can be solved by parallel algorithms with SIMD,
■ but only problems that can be divided into subproblems, all of which can be solved simultaneously by the same set of instructions.
■ Such algorithms are typically easy to implement.
SIMD
■ Examples of SIMD applications:
■ Ordinary desktop and business applications
■ Word processors, databases, OS, and many more
■ Multimedia applications
■ 2D and 3D image processing, games, etc.
■ Scientific applications
■ CAD, simulations
Examples of CPUs with SIMD extensions
■ Intel P4 & AMD Athlon x86 CPUs
■ 8 x 128-bit SIMD registers
■ G5 vector CPU with SIMD extension
■ 32 x 128-bit registers
■ PlayStation 2
■ 2 vector units with SIMD extensions
SIMD operations
SIMD
■ SIMD instruction sets support:
■ Load and store
■ Integer operations
■ Floating-point operations
■ Logical and arithmetic instructions
■ Additional instructions (optional)
■ Cache instructions to support the different locality patterns of different application characteristics
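■ To give a feel for the single-instruction-multiple-data idea in software, here is a small NumPy sketch: one vector operation applies the same arithmetic to many data elements at once (NumPy typically maps such operations onto the CPU's SIMD extensions, though that is a library implementation detail, not something stated on this slide):

import numpy as np

a = np.arange(8, dtype=np.float32)       # [0, 1, ..., 7]
b = np.full(8, 2.0, dtype=np.float32)

# One "instruction" (vector add) applied to multiple data elements at once,
# in contrast to a scalar loop that adds one pair per iteration.
c = a + b
print(c)                                  # [2. 3. 4. 5. 6. 7. 8. 9.]

# Scalar (SISD-style) equivalent for comparison:
c_scalar = np.array([a[i] + b[i] for i in range(len(a))], dtype=np.float32)
assert np.allclose(c, c_scalar)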
MIMD
■ Multiple Instructions: every processor may be executing a different instruction stream.
■ Multiple Data: every processor may be working with a different data stream; the multiple data streams are provided by shared memory.
■ Can be categorized as loosely coupled or tightly coupled depending on the sharing of data and control.
■ Execution can be synchronous or asynchronous, deterministic or non-deterministic.
■ Examples: most current supercomputers, networked parallel computer "grids", and multiprocessor SMP computers, including some types of PCs.
MIMD
■ Requires:
■ Synchronization
■ Inter-process communication
■ Parallel algorithms
■ These algorithms are difficult to design, analyze and implement.
Kinds of parallelism
Temporal Parallelism

● (Running example: a set of answer papers is corrected by several teachers; a job is the correction of one paper, and a task is the correction of the answer to one question.)
● This method of parallel processing is appropriate if:
○ The jobs to be carried out are identical.
○ A job can be divided into many independent tasks (i.e., each task can be done independently of the other tasks) and each task can be performed by a different teacher.
○ The time taken for each task is the same.
○ The time taken to send a job from one teacher to the next is negligible compared to the time needed to do a task.
○ The number of tasks is much smaller than the total number of jobs to be done.
Temporal Parallelism
● Let the number of jobs = n
● Let the time to do a job = p
● Let each job be divisible into k tasks and let each task be done by a different individual.
● Let the time for doing each task = p/k.
● Time to complete n jobs with no pipeline processing = np.
● Time to complete n jobs with a pipeline organization of k individuals = p + (n - 1)(p/k): the first job takes a full p, after which one job is completed every p/k.
● Speedup due to pipeline processing = np / [p + (n - 1)(p/k)] = nk / (n + k - 1), which approaches k when n >> k (a small numeric sketch follows).
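● A small Python sketch of the pipeline speedup expression reconstructed above; n and k below are arbitrary illustrative numbers:

def pipeline_speedup(n_jobs, k_stages):
    """Speedup = n*p / (p + (n-1)*p/k) = n*k / (n + k - 1)."""
    return (n_jobs * k_stages) / (n_jobs + k_stages - 1)

# Illustrative: 1000 jobs, each split into 4 equal tasks
print(f"{pipeline_speedup(1000, 4):.2f}")   # 3.99, approaching k = 4 for large n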


Problems with Temporal Parallelism
● Synchronization
● Bubbles in pipeline
● Fault tolerance
● Inter-task communication
● Scalability
● In spite of these problems, this method is a very effective way of exploiting parallelism, as it is easy to recognize opportunities for temporal parallelism in many jobs.
● Pipelining is used extensively in processor design. It was the main technique used by vector
supercomputers such as CRAY to attain their high speed.
Data Parallelism
● In data parallelism, the jobs themselves are divided among the k individuals: each teacher corrects complete answer papers independently.
● Let the time to distribute the jobs to k individuals be kq. Observe that this time is proportional to the number of individuals.
● The time to complete n jobs by a single individual = np.
● The time to complete n jobs by k individuals = kq + np/k.
● Speedup due to parallel processing = np / (kq + np/k) = k / (1 + k^2 q/(np)) (a small numeric sketch follows).
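● The corresponding Python sketch for the data-parallel case; the values of n, p, k and q are assumptions chosen only to show the shape of the formula:

def data_parallel_speedup(n_jobs, p_time, k_workers, q_distribute):
    """Speedup = n*p / (k*q + n*p/k)."""
    serial = n_jobs * p_time
    parallel = k_workers * q_distribute + n_jobs * p_time / k_workers
    return serial / parallel

# Assumed values: 1000 jobs of 10 time units each, 4 workers, distribution cost q = 1
print(f"{data_parallel_speedup(1000, 10, 4, 1):.2f}")   # 3.99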
Data Parallelism

● The main advantages of this method are:


○ There is no synchronization required between teachers. Each
teacher can correct papers independently at his own pace.
○ The problem of bubbles is absent. If a question is unanswered in a
paper it only reduces the time to correct that paper.
○ This method is more fault tolerant. One of the teachers can take a
coffee break without affecting the work of other teachers.
○ There is no communication required between teachers as each
teacher works independently. Thus, there is no inter-task
communication delay.
Data Parallelism
● The main disadvantages of this method are:
○ The assignment of jobs to each teacher is pre-decided. This is called a
static assignment. Thus, if an individual is slow then the completion time
of the total job will be slowed down.
○ We must be able to divide the set of jobs into subsets of mutually
independent jobs. Each subset should take the same time to complete.
○ Each individual must be capable of correcting answers to all questions.
This is to be contrasted with pipelining in which each individual
specialized in correcting the answer to only one question.
○ The time taken to divide a set of jobs into equal subsets of jobs should
be small. Further, the number of subsets should be small as compared to
the number of jobs.
Mixed Parallelism: Parallel Pipeline Processing

● Even though this method reduces the time to complete the set of
jobs, it also has the disadvantages of both temporal parallelism and
to some extent that of data parallelism.
● The method is effective only if the number of jobs given to each
pipeline is much larger than the number of stages in the pipeline.
● Multiple pipeline processing was used in supercomputers such as
Cray and NEC-SX as this method is very efficient for numerical
computing in which a number of long vectors and large matrices
are used as data and could be processed simultaneously.
Other Parallelisms
● Data Parallelism with Dynamic Assignment
○ Here a head examiner gives one answer paper to each teacher and
keeps the rest with him. All teachers simultaneously correct the paper
given to them. A teacher who completes correction goes to the head
examiner for another paper which is given to him for correction. If a
second teacher completes correction at the same time, then he
queues up in front of the head examiner and waits for his turn to get
an answer paper. The procedure is repeated till all the answer papers
are corrected.
● Data Parallelism with Quasi-dynamic Scheduling
○ giving each teacher unequal sets of answer papers to correct.
Other Parallelisms
● Specialist Data Parallelism
● Coarse-Grained Specialist Temporal Parallel Processing
Detecting Parallelism: Bernstein’s Conditions
● In 1966, Bernstein formulated a set of conditions under which two processes can execute in parallel.
● We define the input set Ii, of a process Pi, as the set of all input variables
needed to execute the process (aka read set).
● Similarly, the output set Oi consists of all output variables generated after
execution of the process Pi (aka write set).
● Input variables are essentially operands which can be fetched from
memory or registers, and output variables are the results to be stored in
working registers or memory locations.
● Now, consider two processes P1 and P2 with their input sets I1 and I2 and
output sets O1 and O2, respectively. These two processes can execute in
parallel and are denoted P1 || P2 if they are independent and hence create
deterministic results.
Detecting Parallelism: Bernstein’s Conditions
● Formally, these conditions are stated as follows:
○ I1 ∩ O2 = ∅
○ I2 ∩ O1 = ∅
○ O1 ∩ O2 = ∅
● These three conditions are known as Bernstein’s conditions.
● In terms of data dependencies, Bernstein's conditions simply imply that two processes can execute in parallel if they are flow-independent, anti-independent, and output-independent.
Detecting Parallelism: Example
● Consider the simple case in which each process is a single HLL statement. We want to detect the parallelism embedded in five statements, labeled P1, P2, P3, P4, and P5, given in program order.
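● The slide's five statements appear in a figure that is not reproduced here, so the Python sketch below checks Bernstein's conditions on hypothetical single-statement processes instead; the statements in the comments are invented for illustration:

def bernstein_parallel(I1, O1, I2, O2):
    """P1 || P2 iff I1 ∩ O2 = ∅, I2 ∩ O1 = ∅ and O1 ∩ O2 = ∅."""
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

# Hypothetical processes (not the ones from the slide's figure):
# Pa: a = b + c   -> reads {b, c}, writes {a}
# Pb: d = e * f   -> reads {e, f}, writes {d}
# Pc: b = a - 1   -> reads {a},    writes {b}
print(bernstein_parallel({"b", "c"}, {"a"}, {"e", "f"}, {"d"}))  # True:  Pa || Pb
print(bernstein_parallel({"b", "c"}, {"a"}, {"a"}, {"b"}))       # False: Pa and Pc conflict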
Processor Types
❑ Vector processor:
❖ Cray X1/T90; NEC SX#; Japan Earth Simulator;
Early Cray machines; Japan Life Simulator
(hybrid)
❑ Scalar processor:
❖ CISC: Complex Instruction Set Computer
• Intel 80x86 (IA32)
❖ RISC: Reduced Instruction Set Computer
• Sun SPARC, IBM Power #, SGI MIPS
CISC Processor
❑ CISC
❖ Complex instructions; Large number of
instructions; Can complete more complicated
functions at instruction level
❖ Instruction actually invokes microcode.
Microcodes are small programs in processor
memory
❖ Slower; many instructions access memory; varying instruction lengths make pipelining difficult;
RISC Processor
❑ No microcode
❑ Simple instructions; Fewer instructions; Fast
❑ Only load and store instructions access
memory
❑ Common instruction word length
❑ Allows pipelining
Almost all present-day high performance computers
use RISC processors
Locality of References
❑ Spatial/Temporal locality
❖ If processor executes an instruction at time t, it is
likely to execute an adjacent/next instruction at
(t+delta_t);
❖ If processor accesses a memory location/data
item x at time t, it is likely to access an adjacent
memory location/data item (x+delta_x) at
(t+delta_t);
Pipelining, caching and many other techniques are all based on the locality of references.
The Parallel Random Access Machine (PRAM)
The PRAM is one of the popular models for designing parallel algorithms. It consists of the following:
● A set of N identical processors (P1, P2, ..., PN). In principle, N is unbounded.
● A memory with M locations which is shared by all the N processors. Again, in principle, M is unbounded.
● A memory access unit (MAU) which allows the processors to access the shared memory.
The Parallel Random Access Machine (PRAM)
● The PRAM model can be subdivided into 4 categories based on the way simultaneous memory accesses are handled:
○ Exclusive Read Exclusive Write (EREW) PRAM
■ Every access to a memory location (read or write) has to be exclusive.
○ Concurrent Read Exclusive Write (CREW) PRAM
■ Only write operations to a memory location are exclusive; concurrent reads are allowed.
○ Exclusive Read Concurrent Write (ERCW) PRAM
■ Multiple processors can concurrently write into the same memory location, but reads are exclusive.
○ Concurrent Read Concurrent Write (CRCW) PRAM
■ Allows both multiple read and multiple write operations to a memory location.
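● As an illustration of designing against the PRAM model, here is a small sequential Python simulation of an EREW-style parallel maximum: in each of the ceil(log2 N) steps, the "processors" each combine a disjoint pair of shared-memory cells, so no location is read or written concurrently. This is only a sketch of the model, not a real parallel implementation:

import math

def erew_parallel_max(shared):
    """Simulate a binary-tree reduction: ceil(log2 N) parallel steps, where each
    step's reads and writes touch disjoint memory locations (EREW discipline)."""
    m = list(shared)              # shared memory M[0..N-1]
    n = len(m)
    stride = 1
    while stride < n:
        # One parallel step: each processor handles the disjoint pair (left, left + stride)
        for left in range(0, n - stride, 2 * stride):
            m[left] = max(m[left], m[left + stride])
        stride *= 2
    return m[0]

data = [7, 3, 9, 1, 5, 8, 2, 6]
print(erew_parallel_max(data))                    # 9
print(math.ceil(math.log2(len(data))), "steps")   # 3 steps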
The Parallel Random Access Machine (PRAM)
● There are many methods to implement the PRAM model, but the most prominent ones are:
○ Shared memory model
○ Message passing model
○ Data parallel model
