SPPU - BE - HPC - Unit 1 Notes

High Performance Computing Unit 1 Notes - Slides with Explanation

High Performance Computing (410250)
Prof. Vaishali Jorwekar

What comes to your mind when you see these three pictures of computers? (Pictures: personal laptop, gaming laptop, supercomputer.)

The first laptop on the left is a personal laptop, which we use for our day-to-day work. A gaming laptop has a higher configuration and a graphics card for high-definition games. Supercomputers are used by scientists and large companies for complex mathematical modelling. So the main differentiating factor is computing power. Computing power refers to how fast and capable a computer is at performing tasks and calculations.

Sample specifications, to highlight the key computing differences:

Personal Laptop
+ Processor: AMD Ryzen 5 7530U, 12 threads, 16 MB L3 cache
+ Memory: dual-channel capable, upgradable up to 40 GB
+ Storage: 512 GB SSD M.2

Gaming Laptop
+ Processor: 13th Gen Intel Core i9-13980HX, 2.2 GHz base, 24 cores
+ Memory: 16 GB (8 GB SO-DIMM x2) DDR5 4800 MHz, up to 32 GB over 2 SO-DIMM slots
+ Storage: 1 TB PCIe 4.0 NVMe M.2 SSD

Super Computer
+ Peak performance: 200 PFlops
+ Number of nodes: 4,608
+ Memory per node: 512 GB DDR4 + 96 GB HBM2
+ Storage: 250 PB IBM Spectrum Scale (GPFS), 2.5 TB/s
+ Power consumption: 13 MW
+ Operating system: Red Hat Enterprise Linux (RHEL) version 7.4

As the computational need increases, the processor requirements also increase; this is met by increasing the number of processors and cores and the cycle frequency. A typical personal computer has around 6 cores and up to 16 GB of RAM, whereas a gaming laptop has a higher number of cores and more RAM. But if you compare a supercomputer, the number of processors runs into the thousands and memory runs into petabytes.

High Performance Computing

High Performance Computing (HPC) refers to the use of powerful computers and parallel processing techniques to solve complex problems or perform tasks at a much faster rate than traditional computers.

We saw in the last slide what makes computers powerful: the number of processors, their cores, frequency, and RAM. In this first chapter we will look at parallel processing techniques in detail.

Applications of High Performance Computing
+ Financial institutions - transactions and card fraud detection
+ Bio-sciences and the human genome - drug discovery, disease detection/prevention
+ Computer-aided engineering - automotive design and testing, transportation, commerce, structural outlook, mechanical design
+ Chemical engineering - process and molecular design
+ Digital content creation and distribution - computer-aided graphics in film and media
+ Economics/financial - Wall Street risk analysis, portfolio management, automated trading
+ Electronic design and automation - electronic component design
+ Geo-sciences and geo-engineering - oil and gas exploration and reservoir modelling
+ Mechanical design and drafting - 2D and 3D design and verification, mechanical modelling
+ Defense and energy - nuclear stewardship, basic and applied research
+ Government labs, universities/academia - basic and applied research
+ Meteorological departments - weather forecasting

These are some of the applications of high performance computing.
Parallel Processing

A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem. Parallel computing is a form of computation in which many instructions are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (in parallel).

Here is a simplified example to help illustrate the concept of parallel processing. Imagine you have a really challenging puzzle to solve, and you want to do it as quickly as possible. If you try to solve it alone, it might take a long time. However, if you have a group of friends working together simultaneously, each focusing on a different part of the puzzle, you can finish much faster. In the context of computing, traditional computers are like individuals trying to solve the puzzle on their own. High Performance Computing, on the other hand, is like having a team of super-fast computers working together to tackle a complex problem.

Serial Processing vs Parallel Processing

Serial processing (to be run on a single computer having a single CPU):
+ A problem is broken into a discrete series of instructions.
+ Instructions are executed one after another.
+ Only one instruction may execute at any moment in time.

Parallel processing (to be run using multiple CPUs):
+ A problem is broken into discrete parts that can be solved concurrently.
+ Each part is further broken down into a series of instructions.
+ Instructions from each part execute simultaneously on different CPUs.

In serial programming, the problem is broken into a series of instructions. Just recall your C programs: each line contains some instructions, these instructions run one after another, and only one instruction is executed at any moment. That is serial programming. For parallel processing, we first need multiple CPUs. The problem is broken into parts that can be solved in parallel, each part is further broken down into a series of instructions, and the instructions from each part execute simultaneously on different CPUs.

Example: count from 1 to 1000.

Serial: using a regular computer, you would start at 1 and incrementally count each number one by one until you reach 1000. This process might take some time, but it is manageable for a personal computer.

Parallel: with HPC, you could divide the task among multiple processors. For instance, if you have 10 processors, each processor could be responsible for counting a range of 100 numbers. So Processor 1 counts from 1 to 100, Processor 2 from 101 to 200, and so on. All processors work simultaneously, and the entire task of counting from 1 to 1000 is completed much faster than on a personal computer. A small code sketch of this range-splitting idea follows.
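To make the range-splitting idea concrete, here is a minimal sketch in C using OpenMP. OpenMP is my choice of illustration here, not something prescribed by the slides, and summing the numbers is used as the "counting" work so the combined result can be checked.

/* Minimal illustrative sketch: splitting "count from 1 to 1000" across threads.
 * Assumes a compiler with OpenMP support, e.g.  gcc -fopenmp count.c  */
#include <stdio.h>
#include <omp.h>

int main(void) {
    long total = 0;

    /* Each thread processes its own chunk of the range (Processor 1 gets
     * roughly 1-100, Processor 2 gets 101-200, and so on); the reduction
     * clause combines the per-thread partial results at the end. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= 1000; i++) {
        total += i;
    }

    printf("Sum of 1..1000 = %ld (max threads: %d)\n",
           total, omp_get_max_threads());
    return 0;
}

With 10 threads, the loop iterations are divided into chunks of roughly 100 each, mirroring the 10-processor example on the slide.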
Motivating Parallelism

Reasons for growth:
+ Advancements in specifying and coordinating complex concurrent tasks.
+ Portable algorithms facilitating parallel processing.
+ Specialized execution environments and software development toolkits.

Reasons:
+ Increased computational power
+ Enhanced memory/disk speed
+ Improved data communication

In recent years there has been a big improvement in how computers handle multiple tasks at once, that is, parallel processing. This is because we have become better at organizing and managing complex tasks happening at once, creating portable algorithms (sets of instructions), using special environments for executing tasks, and developing toolkits for building software. This progress rests on three main reasons:

1. Increased computational power: modern computers, equipped with CMOS chip-based processors and advanced networking, have become significantly more powerful. This has driven the development of applications capable of handling multiple tasks simultaneously.
2. Enhanced memory/disk speed: progress in hardware interfaces has expedited the transition from microprocessor creation to the development of entire machines that efficiently execute parallel tasks.
3. Improved data communication: standardization of programming environments has seen notable advancements. This ensures that applications designed for parallel processing remain relevant and useful for an extended period.

Modern Processor

Stored-program computer architecture

Stored-program computer architecture is a design where instructions and data are stored in the same memory, allowing a central processing unit to sequentially fetch, decode, and execute instructions, enabling versatile programmability. (Block diagram: Central Processing Unit, Memory, Input Device, Output Device.)

Stored-program computer architecture is like having a recipe book for your computer. In this analogy, the recipe book is the computer's memory, and the chef is the central processing unit (CPU). Let's break it down:
+ Memory unit: just like a recipe book contains both instructions and a list of ingredients, the computer's memory stores both program instructions and data.
+ Chef (CPU): the CPU acts like a chef following the instructions in the recipe book. It fetches each step, processes it, and moves on to the next one.
+ Fetching and execution: imagine the CPU as a chef turning the pages of the recipe book (fetching), reading the instructions (decoding), and then cooking accordingly (execution).
An example is the personal computer.

General-purpose cache-based microprocessor architecture

General-purpose cache-based microprocessor architecture is a design incorporating a cache memory hierarchy to enhance data access speed and overall performance in executing a wide range of computational tasks. (Memory hierarchy: CPU, cache, main memory, from fast to slow, with word and block transfers between levels.)

Again, let's understand this with the same analogy of chef and kitchen:
+ Microprocessor (chef): the microprocessor is like the chef, responsible for executing instructions and processing data.
+ Cache (countertop): think of the cache as the countertop near the chef. This is where the chef keeps ingredients that are used frequently.
+ Main memory (pantry): the main memory is like the pantry, storing a larger quantity of ingredients. However, it takes more time for the chef to go to the pantry to get less frequently used ingredients.
+ Fetching ingredients (data): when the chef needs an ingredient (data), the chef first checks the countertop (cache) for commonly used ingredients. If the ingredient is on the countertop (in the cache), it is quickly accessed. If not, the chef goes to the pantry (main memory) to retrieve it.

Everyday products:
+ Smartphones and laptops: just like a chef needs quick access to ingredients, your smartphone and laptop use cache memory to store frequently accessed data and instructions for faster processing.
+ Web browsing: when you load a webpage, the browser uses a cache to store elements of the page for quicker retrieval. It's like having the ingredients ready for the chef without going to the pantry every time.
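Beyond the analogy, the effect of the cache can be seen directly in code. The following is my own illustrative sketch, not part of the notes; the array size and the use of clock() for timing are arbitrary choices. The same additions are done twice: once walking the array in row-major order (cache-friendly, mostly "countertop" hits) and once in column-major order (mostly trips to the "pantry").

/* Illustrative sketch: cache-friendly vs cache-unfriendly access in C. */
#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N];

int main(void) {
    double sum = 0.0;
    clock_t t0, t1;

    /* Fill the array so the traversal loops cannot be optimised away. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    /* Row-major walk: consecutive elements, so each cache line fetched
     * from main memory is fully reused before the next one is needed. */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    t1 = clock();
    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Column-major walk: each access jumps N*8 bytes ahead, so most
     * accesses miss the cache and go all the way to main memory. */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    t1 = clock();
    printf("column-major: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    printf("sum = %.0f\n", sum);
    return 0;
}

On most machines the column-major version is noticeably slower, even though both loops perform exactly the same additions; the only difference is how well they use the cache.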
Parallel Programming Platforms

Explicit Parallelism
+ The programmer specifically defines and instructs the system on parallel tasks.
+ The programmer actively incorporates parallel constructs or directives into the code.
+ The system follows the programmer's explicit instructions for parallel execution.

Implicit Parallelism
+ The system automatically identifies and executes tasks concurrently without explicit instructions from the programmer.
+ The programmer writes regular, step-by-step code without specific parallel constructs.
+ The compiler, runtime system, and hardware work together to find and exploit parallel opportunities.

Implicit parallelism is a kind of parallelism in which the computer automatically handles multiple tasks at the same time without you needing to ask for it explicitly. It means you can write your programs in a regular, step-by-step way, and behind the scenes the computer's compiler and hardware work together to find opportunities to speed things up by doing tasks simultaneously. So, as an engineer, you focus on your code's logic, and the system takes care of making it run faster using parallel processing, all without you having to add any special parallel instructions. A small sketch contrasting the two styles is shown below.
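The sketch below is my own illustration of the contrast, not taken from the slides: the same loop written once in plain step-by-step style (the compiler and hardware are free to overlap or vectorise the independent iterations on their own, i.e. implicit parallelism) and once with an explicit parallel directive added by the programmer (here an OpenMP pragma, which is only one possible choice of explicit construct).

/* Illustrative sketch: implicit vs explicit parallelism on the same loop.
 * Compile e.g. with  gcc -O3 -fopenmp vecadd.c  */
#include <stdio.h>

#define N 100000
float a[N], b[N], c[N];

/* Implicit parallelism: ordinary sequential code.  An optimising compiler
 * may vectorise or otherwise overlap these independent iterations without
 * the programmer adding anything special. */
void add_implicit(void) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* Explicit parallelism: the programmer inserts a parallel construct
 * telling the system to split the loop across threads. */
void add_explicit(void) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

int main(void) {
    add_implicit();
    add_explicit();
    printf("c[0] = %f\n", c[0]);
    return 0;
}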
Implicit Parallelism - Pipelining Execution

Pipelining in high-performance computing:
+ Maximizing processor utilization: keep the ALU, buses, registers, etc. busy continuously.
+ Pipelining concept: instructions flow through the processor like a pipe, moving through stages to accomplish operations.
+ Continuous processor usage: each unit handles an instruction, keeping the processor busy.

Imagine your processor is like a well-designed assembly line. Each part of the processor, such as the ALU, buses, and registers, has a specific job. The goal? Keep all these parts busy all the time. So, what is pipelining? It's like turning your processor into a pipe. Instructions flow through it, moving from one stage to the next to get the job done. This way, each part of the processor is always working on something, with no downtime. In simpler terms, it's like a well-oiled machine where instructions smoothly move through different stages, making sure your processor is always doing something useful.

Overlapping Execution with Pipelining
+ Non-pipelined approach: fetch, decode, read, execute, and write happen sequentially; hardware is idle during waiting periods.
+ Pipelining technique: overlap the execution of several instructions; a two-stage pipelining example is fetch and execute.
+ Benefits: faster execution by fetching the next instruction during the current one's execution; all units stay busy, preventing idle time.

Now, let's explore why pipelining is good and how it improves the efficiency of our processors. In the past, processors followed a step-by-step approach: fetch, decode, read, execute, and write, one after another. The drawback? Many components of the hardware would remain inactive, patiently waiting for others to complete their tasks. With the pipelining approach, it's like managing multiple instructions simultaneously. Picture this: accomplishing two tasks in just two stages, fetching the next instruction while executing the current one. It's an intelligent way to overlap tasks and maintain a smooth workflow. What's the result? Quicker execution. Every part of the processor remains engaged, avoiding any downtime. Think of it as orchestrating a production line where everyone has a role and the line keeps moving without interruptions.

Implicit Parallelism - Superscalar Execution
+ From scalar to superscalar: scalar processors had one pipelined unit for integer and one for floating-point operations.
+ Need for parallelism: a single pipeline isn't enough; pipelines enable parallelism by having multiple instructions at different stages.
+ Superscalar processors execute more than one instruction per clock cycle; they fetch and decode multiple instructions simultaneously.

So, back in the day, processors were scalar, meaning they had one pipeline for integer operations and one for floating-point operations. But designers realized that having just one pipeline wasn't enough to get things done faster; we needed more parallelism. That's where superscalar came into the picture. It's like having a processor that can do more than one thing at a time during a single clock cycle. Imagine fetching and decoding multiple instructions simultaneously. That's the essence of superscalar: making our processors more efficient by doing multiple tasks at once.

Implicit Parallelism - Superscalar Execution (continued)
+ Instruction Level Parallelism (ILP): superscalar architecture exploits ILP through multiple pipelines for various instructions (e.g., integer and floating-point). (Diagram: integer and floating-point register files feeding pipelined integer and floating-point functional units.)
+ Complexity considerations: superscalar scheduler complexity and hardware cost are crucial in processor design.
+ VLIW solution: Very Long Instruction Word (VLIW) processors use compile-time analysis, bundling instructions for concurrent execution and addressing the complexity.

Let's start with Instruction Level Parallelism (ILP). We have multiple pipelines for different instructions, such as arithmetic, load, and store; it's about taking advantage of parallelism to speed things up. Now, here's the catch: making a superscalar processor is not easy. It is complex, and the hardware cost is something we really need to think about in processor design. To tackle this, we have something called VLIW, or Very Long Instruction Word, processors. They use a clever trick at compile time to identify and bundle together instructions that can be done at the same time. It's like putting a bunch of instructions into one very long instruction word to simplify the process.
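To make the idea of instruction-level parallelism concrete, here is a small sketch of my own (not from the slides). The first group of statements has no data dependences between its lines, so a superscalar processor is free to issue them to different functional units in the same clock cycle; the second group is a dependent chain where each line needs the previous result, leaving no ILP to exploit.

/* Illustrative sketch: independent operations vs a dependent chain. */
#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;

    /* Independent operations: no statement uses another's result, so
     * multiple pipelines/functional units can work on them in parallel. */
    int e = a + b;
    int f = c * d;
    int g = a - d;

    /* Dependent chain: each statement consumes the previous result,
     * so the operations must execute one after another. */
    int h = e + 1;
    int i = h * 2;
    int j = i - f;

    printf("%d %d %d %d %d %d\n", e, f, g, h, i, j);
    return 0;
}

A superscalar scheduler discovers this difference in hardware at run time; a VLIW compiler would instead detect it at compile time and bundle the independent operations into one long instruction word.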
Implicit Parallelism - VLIW Processor Structure
+ Need for separate units: to perform multiple operations in one execution stage, separate units for each operation are essential.
+ VLIW architecture: separate units for operations such as floating-point add, multiply, branching, and integer ALU. VLIW (Very Long Instruction Word) executes more than one basic instruction at a time, with multiple operations stored in a single instruction word.

When we want to do multiple things at once in a single execution stage, we need separate units for each operation. Picture this: for floating-point addition, multiplication, branching, and integer ALU operations, we have dedicated units (see Fig. 1.4.3 for a visual). Now, VLIW stands for Very Long Instruction Word. It is a way for our processors to handle more than one basic instruction at a time. How? By storing multiple operations in a single instruction word. So, when we issue one instruction, multiple operations kick off simultaneously during the execution cycle of the pipelining process. Simple, right?

Implicit Parallelism - VLIW Processor Structure: Execution and Compiler Role
+ Simultaneous operations: VLIW executes multiple operations simultaneously with one instruction.
+ Compiler's role: the compiler identifies parallelism and schedules dependency-free code, resolving dependencies among instructions at compile time.
+ Characteristics: multiple independent operations in a VLIW instruction, with no flow dependences.

So, VLIW does multiple operations all at once with one instruction, no waiting around. But here's the trick: the compiler is crucial. It spots where we can run things in parallel and arranges the code to avoid any dependencies. The compiler keeps everything in harmony; it identifies and schedules operations that can run side by side, resolving any issues before the program even runs. One more thing: in a VLIW instruction, all these operations are independent; they don't rely on each other. It's like having a set of tasks that can be done simultaneously without any fuss.

Dichotomy of Parallel Computing Platforms

The division is based on the logical and physical organization of parallel platforms. Physical organization is the actual hardware organization of a platform; logical organization refers to a programmer's view of the platform.
+ Control structure: the various ways of expressing parallel tasks.
+ Communication model: the mechanisms for specifying interaction between the parallel tasks.

There are several platforms which facilitate parallel computing. In this section, the division based on the logical and physical organization of parallel platforms is discussed. From the programmer's perspective, the two important components of parallel computing are the control structure and the communication model.

Physical Organization of Parallel Platforms - Evolution

Let's start with the conventional architecture, representing the traditional uni-processor system. While some parallel features can improve a single processor's speed, there are limitations. The foundation of processor architecture traces back to the Von Neumann computer, characterized by its CPU, memory, and I/O devices. This system follows the Von Neumann architecture, where the CPU consists of arithmetic and control units, operating on the stored-program concept. Both program and data share the same memory unit, with each location having a unique address. Execution proceeds sequentially unless the program explicitly alters this flow.

Fig. 1.8.2 marks the initial steps toward parallelism, introducing lookahead, overlapping fetch and execute, and parallelism in functions. This last concept involves two mechanisms: pipelining and multiple functional units. In the second mechanism, various functional units operate simultaneously, enhancing processing speed.
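The stored-program, sequential-execution behaviour described above can be sketched in a few lines of C. This is my own toy illustration with an invented three-opcode instruction set: program and data live in the same memory array, each location has a unique address, and a loop fetches, decodes, and executes instructions one after another.

/* Illustrative sketch of the stored-program (Von Neumann) idea. */
#include <stdio.h>

enum { LOAD = 1, ADD = 2, HALT = 3 };   /* hypothetical opcodes */

int main(void) {
    /* One shared memory: program at addresses 0..5, data at 8..9. */
    int mem[16] = {
        LOAD, 8,      /* acc = mem[8]        */
        ADD,  9,      /* acc = acc + mem[9]  */
        HALT, 0,
        0, 0,
        40, 2         /* data operands       */
    };

    int pc = 0, acc = 0, running = 1;
    while (running) {
        int op  = mem[pc];            /* fetch  */
        int arg = mem[pc + 1];
        pc += 2;
        switch (op) {                 /* decode and execute */
            case LOAD: acc = mem[arg];       break;
            case ADD:  acc = acc + mem[arg]; break;
            case HALT: running = 0;          break;
        }
    }
    printf("Result in accumulator: %d\n", acc);  /* prints 42 */
    return 0;
}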
Vector instructions, which apply a common operation to large arrays of data, were initially handled by pipeline processors controlled by software looping. Subsequently, processors explicitly tailored for vector instructions emerged. Two variations in vector processing are memory-to-memory and register-to-register, with the former using memory for operand storage and the latter using registers. The evolution of the register-to-register architecture led to the creation of two processor types: Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). These developments signify the gradual integration of parallelism into processors, contributing to enhanced processing capabilities.

Physical Organization of Parallel Platforms - Parallel Random Access Machine (PRAM)

The various PRAM models differ in how they handle read or write conflicts:
+ EREW (Exclusive Read Exclusive Write): p processors can simultaneously read and write the contents of p distinct memory locations.
+ CREW (Concurrent Read Exclusive Write): p processors can simultaneously read the contents of p' memory locations, where p' <= p, while writes remain exclusive to distinct locations.
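Coming back to the vector-instruction and SIMD idea discussed above, the following is a small sketch of my own (it assumes the GCC/Clang vector extensions, which are a compiler feature rather than anything from the notes): one source-level operation is applied to several data elements at once, which the compiler lowers to SIMD hardware instructions.

/* Illustrative sketch of the SIMD / vector-instruction idea. */
#include <stdio.h>

typedef float v4sf __attribute__((vector_size(16)));  /* 4 packed floats */

int main(void) {
    v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4sf b = {10.0f, 20.0f, 30.0f, 40.0f};

    /* One operation in the source; all four lanes are added in parallel
     * by a single vector instruction (Single Instruction, Multiple Data). */
    v4sf c = a + b;

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}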
