Loop Level Parallelism in Computer Architecture
Since the advent of multiprocessors, programmers have faced the challenge of how to exploit the processing power available to them. Sometimes parallelism exists in a program, but it is present in a form that is too complicated for the programmer to reason about. In addition, there is a large body of sequential code that for years has relied on the incremental performance improvements afforded by advances in single-core execution. For a long time, automatic parallelization has been seen as a good solution to some of these challenges, since it removes the programmer's burden of expressing and understanding the parallelism that exists in the algorithm.
Loop-level parallelism in computer architecture extracts parallel tasks from within loops in order to speed up execution. It is most useful where data is stored in random-access data structures such as arrays. A sequential program iterates over the array and operates on one index at a time, whereas a program with loop-level parallelism uses multiple threads or processes that operate on several indices at the same time (or at different times).
Loop Level Parallelism Types:
- DO-ALL parallelism (Independent multithreading (IMT))
- DO-ACROSS parallelism (Cyclic multithreading (CMT))
- DO-PIPE parallelism (Pipelined multithreading (PMT))
1. DO-ALL parallelism (Independent multithreading (IMT)):
In DO-ALL parallelism, every iteration of the loop is executed in parallel and completely independently, with no inter-thread communication. Iterations are assigned to threads in a round-robin fashion; for example, with 4 cores, core 0 executes iterations 0, 4, 8, 12, and so on. This type of parallelization is possible only when the loop contains no loop-carried dependences, or can be transformed so that no conflicts occur between iterations executing simultaneously. Loops that can be parallelized in this way are likely to see good speedups, since there is no inter-thread communication overhead. However, the lack of communication also limits the applicability of the technique, as many loops are not amenable to this form of parallelization.
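A minimal sketch of DO-ALL parallelism in C with OpenMP is shown below. The array name a and size N are illustrative, not from the article; schedule(static, 1) is used only to mimic the round-robin assignment of iterations to threads described above.

/* DO-ALL sketch: every iteration is independent, so all of them
 * may run in parallel with no inter-thread communication.
 * Compile with: gcc -fopenmp doall.c */
#include <stdio.h>

#define N 16

int main(void) {
    int a[N];

    /* chunk size 1 mimics round-robin assignment: on a 4-thread run,
       thread 0 gets iterations 0, 4, 8, 12, ... */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++)
        a[i] = i * i;           /* no loop-carried dependence */

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}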
2. DO-ACROSS parallelism (Cyclic multithreading (CMT)):
DO-ACROSS parallelism, like independent multithreading, assigns iterations to threads in a round-robin manner, and the optimization techniques used to increase parallelism in independent-multithreading loops are available here as well. In this technique, dependences are identified by the compiler, and the beginning of each loop iteration is delayed until all dependences from previous iterations are satisfied. In this way, the parallel portion of one iteration overlaps with the sequential portion of the subsequent iteration, which results in parallel execution. For example, a statement such as x = x->next; causes a loop-carried dependence, since it cannot be evaluated until the same statement has completed in the previous iteration. Once all cores have started their first iteration, this approach can reach close to linear speedup, provided the parallel part of the loop is large enough to keep the cores fully utilized.
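The following C sketch approximates DO-ACROSS scheduling with OpenMP's ordered construct; the running-sum variable running and the array heavy are illustrative and not from the article. The independent work in each iteration runs in parallel, while the loop-carried update is delayed until the previous iteration's ordered region has finished, which is the overlap described above.

/* DO-ACROSS sketch: parallel part runs concurrently, the sequential
 * part is executed in iteration order.
 * Compile with: gcc -fopenmp doacross.c */
#include <stdio.h>

#define N 16

int main(void) {
    int heavy[N];
    long running = 0;

    #pragma omp parallel for ordered schedule(static, 1)
    for (int i = 0; i < N; i++) {
        int work = i * i;       /* parallel part: no dependence */

        #pragma omp ordered
        {
            running += work;    /* sequential part: depends on the
                                   previous iteration's value */
            heavy[i] = (int)running;
        }
    }

    for (int i = 0; i < N; i++)
        printf("%d ", heavy[i]);
    printf("\n");
    return 0;
}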
3. DO-PIPE parallelism (Pipelined multithreading (PMT)):
DO-PIPE parallelism is an approach for parallelizing loops with cross-iteration dependences. The loop body is divided into a number of pipeline stages, with each stage assigned to a different core. Each iteration of the loop is then distributed across the cores, with each stage of the loop body executed by the core to which that stage was assigned; an individual core executes only the code associated with its own stage. For instance, the loop body might be divided into four stages A, B, C, and D: every iteration passes through all four stages, but each stage is executed by only one core.
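Below is a minimal two-stage DO-PIPE sketch in C using POSIX threads. The queue type, the stage_a/stage_b functions, and the sizes are illustrative assumptions, not from the article; a real pipelined-multithreading implementation would typically use more stages and lower-overhead inter-core queues.

/* DO-PIPE sketch: the loop body is split into two stages, each run by
 * its own thread; a small bounded queue passes each iteration's
 * intermediate result from stage A to stage B.
 * Compile with: gcc -pthread dopipe.c */
#include <pthread.h>
#include <stdio.h>

#define N     16    /* number of loop iterations */
#define QSIZE  4    /* capacity of the inter-stage queue */

/* bounded single-producer/single-consumer queue */
typedef struct {
    int buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

static queue_t q = {
    .head = 0, .tail = 0, .count = 0,
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER
};

static void q_push(queue_t *qp, int v) {
    pthread_mutex_lock(&qp->lock);
    while (qp->count == QSIZE)
        pthread_cond_wait(&qp->not_full, &qp->lock);
    qp->buf[qp->tail] = v;
    qp->tail = (qp->tail + 1) % QSIZE;
    qp->count++;
    pthread_cond_signal(&qp->not_empty);
    pthread_mutex_unlock(&qp->lock);
}

static int q_pop(queue_t *qp) {
    pthread_mutex_lock(&qp->lock);
    while (qp->count == 0)
        pthread_cond_wait(&qp->not_empty, &qp->lock);
    int v = qp->buf[qp->head];
    qp->head = (qp->head + 1) % QSIZE;
    qp->count--;
    pthread_cond_signal(&qp->not_full);
    pthread_mutex_unlock(&qp->lock);
    return v;
}

/* stage A: first half of the loop body, runs on one core */
static void *stage_a(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        q_push(&q, i * i);      /* produce an intermediate value */
    return NULL;
}

/* stage B: second half of the loop body, runs on another core */
static void *stage_b(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        int v = q_pop(&q);      /* consume iteration i's value */
        printf("iteration %d -> %d\n", i, v + 1);
    }
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, stage_a, NULL);
    pthread_create(&tb, NULL, stage_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}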