Multicore Architecture
Course Plan
• Introduction to Multi-Core Architecture
• System Overview of Threading
• Fundamental Concepts of Parallel Programming
• Threading and Parallel Programming Constructs
• Threading APIs
• OpenMP: A Portable Solution for Threading
• Solutions to Common Parallel Programming Problems
Course content
• This course content is organized into three major sections.
• The first section (Chapters 1–4) presents an introduction to software threading.
• This section includes background material on why chipmakers have shifted to multi-core architectures, how threads work, and how to measure the performance improvements achieved by a particular threading implementation.
• Overall, the goal is to understand why hardware platforms are evolving the way they are and the basic principles required to write parallel programs.
Motivation for Concurrency
• A system designer looking to build a computer system capable of streaming a Web broadcast might view the system as shown in the figure.
• On the server side, the provider must be able to receive the original broadcast, encode/compress it in near real time, and then send it over the network to potentially hundreds of thousands of clients.
• Amdahl's law, given in Equation 1.1, estimates the speedup of the parallelized version:
  Speedup = 1 / (S + (1 - S) / n)   (Equation 1.1)
• In this equation, S is the time spent executing the serial portion of the parallelized version and n is the number of processor cores.
• Setting n = ∞ in Equation 1.1 shows that the speedup can never exceed 1/S.
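• As a quick worked example (the numbers are illustrative, not from the source): with S = 0.1 and n = 4, Speedup = 1 / (0.1 + 0.9/4) = 1 / 0.325 ≈ 3.08; as n → ∞, the speedup approaches 1/S = 10, no matter how many cores are added.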
// OpenMP program to print Hello World
// from multiple threads, using C

// OpenMP header
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Compiler directive: begin a parallel region
    #pragma omp parallel
    {
        printf("Hello World from thread %d\n",
               omp_get_thread_num());
    }
    // End of parallel region
    return 0;
}
“Hello World” Program Using Pthreads
• As can be seen, the OpenMP code has no function call that corresponds to thread creation. This is because OpenMP creates threads automatically in the background.
• In Pthreads, by contrast, a call to pthread_create() explicitly creates a single thread and points it at the work to be done in PrintHello(), as sketched below.
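A minimal sketch of the Pthreads version, assuming a PrintHello() work function as named above; the thread count and message format are illustrative assumptions, not from the source:

// "Hello World" using Pthreads: each thread must be created explicitly.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4   // assumed thread count

// Work function: each thread prints the ID passed in as its argument.
void *PrintHello(void *arg)
{
    long tid = (long)arg;
    printf("Hello World from thread %ld\n", tid);
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    // Unlike OpenMP, thread creation is explicit: one call per thread.
    for (long t = 0; t < NUM_THREADS; t++) {
        if (pthread_create(&threads[t], NULL, PrintHello, (void *)t) != 0) {
            fprintf(stderr, "pthread_create failed for thread %ld\n", t);
            exit(EXIT_FAILURE);
        }
    }

    // Wait for all threads to finish before the process exits.
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}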
Threads inside the OS
• The key to viewing threads from the perspective of a modern operating
system is to recognize that operating systems are partitioned into two
distinct layers: the user-level partition (where applications are run) and
the kernel-level partition (where system-oriented activities occur).
Contd….
• The kernel is the nucleus of the operating system and
maintains tables to keep track of processes and threads.
• Threading libraries such as OpenMP and Pthreads (POSIX
standard threads) use kernel-level threads.
• User-level threads, which are called fibers on the Windows
platform, require the programmer to create the entire
management infrastructure for the threads and to manually
schedule their execution.
• Kernel-level threads provide better performance, and multiple
kernel threads from the same process can execute on
different processors or cores.
User-level Threads
User-level threads are mapped to kernel threads, so when they are executing, the processor knows them only as kernel-level threads.
Contd....
• The figure below shows the relationship between processors, processes, and threads in modern operating systems.
• A processor runs threads from one or more processes, each of
which contains one or more threads.
• A program has one or more processes, each of which contains
one or more threads, each of which is mapped to a processor
by the scheduler in the operating system.
Various mapping models are used between threads and processors:
• Many-to-one (M:1): many user-level threads are mapped to a single kernel thread; in programming terms, the threading library, not the kernel, schedules these threads.
• One-to-one (1:1): each user-level thread is mapped to exactly one kernel thread.
• Many-to-many (M:N): many user-level threads are multiplexed over a smaller or equal number of kernel threads.
Data Decomposition
• Data decomposition, also known as data-level
parallelism, breaks down tasks by the data they
work on rather than by the nature of the task.
• Programs that are broken down via data
decomposition generally have many threads
performing the same work, just on different data
items.
• For example, consider recalculating the values in
a large spreadsheet. Rather than have one thread
perform all the calculations, data decomposition
would suggest having two threads, each
performing half the calculations, or n threads
performing 1/nth the work.
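A minimal sketch of this idea in OpenMP (the array name, size, and recalculation formula are assumptions, not from the source):

// Data decomposition sketch: the loop iterations, i.e. the cells to
// recalculate, are divided among the threads, so n threads each do
// roughly 1/nth of the work.
#include <omp.h>
#include <stdio.h>

#define NUM_CELLS 1000000

int main(void)
{
    static double cells[NUM_CELLS];   // the "spreadsheet" data

    // OpenMP divides the iterations among the available threads,
    // so each thread works on a different slice of the data.
    #pragma omp parallel for
    for (int i = 0; i < NUM_CELLS; i++)
        cells[i] = cells[i] * 1.05 + 1.0;   // placeholder recalculation

    printf("cells[0] = %f\n", cells[0]);
    return 0;
}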
Contd....
• If the gardeners used the principle of data
decomposition to divide their work, they
would both mow half the property and then
both weed half the flower beds.
• As in computing, determining which form of
decomposition is more effective depends a lot
on the constraints of the system.
• For example, if the area to mow is so small that it does not need two mowers, that task would be better done by just one gardener; that is, task decomposition is the best choice.
Task-Level Parallelism Pattern
In this pattern, the problem is decomposed into a set of tasks that operate independently. Problems that fit into this pattern include the so-called embarrassingly parallel problems, those where there are no dependencies between threads.
Other common patterns include:
• Divide and Conquer Pattern
• Geometric Decomposition Pattern
• Pipeline Pattern
• Wavefront Pattern
A Motivating Problem: Error Diffusion
The scope of synchronization is broad. Proper synchronization orders the updates to data and provides an expected outcome. In Figure 4.2, shared data d can be accessed by threads Ti and Tj at times ti, tj, tk, tl, where ti ≠ tj ≠ tk ≠ tl; proper synchronization maintains the order of the updates to d at these instants and treats the state of d as a synchronization function of time. This synchronization function, s, represents the behavior of a synchronized construct with respect to the execution time of a thread.
Synchronization operations in an actual multi-threaded implementation
Synchronization Primitives
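A minimal sketch of one common primitive, a Pthreads mutex, ordering updates to shared data (the variable and function names are assumptions, not from the source):

// Two threads increment shared data d; the mutex serializes the
// updates so the final value is deterministic.
#include <pthread.h>
#include <stdio.h>

static int d = 0;                                      // shared data
static pthread_mutex_t d_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&d_lock);     // enter the critical section
        d++;                             // this update is now ordered
        pthread_mutex_unlock(&d_lock);   // leave the critical section
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("d = %d\n", d);   // always 200000 with the mutex held
    return 0;
}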
Group activities
OpenMP
A Portable Solution for Threading
• OpenMP plays a key role by providing an easy method for threading
applications without burdening the programmer with the
complications of creating, synchronizing, load balancing, and
destroying threads.
• The OpenMP standard was formulated in 1997 as an API for writing
portable, multithreaded applications.
• The current version is OpenMP Version 2.5, which supports Fortran,
C, and C++. Intel C++ and Fortran compilers support the OpenMP
Version 2.5 standard.
• The OpenMP programming model provides a platform-
independent set of compiler pragmas, directives, function calls, and
environment variables that explicitly instruct the compiler how and
where to use parallelism in the application.
• Many loops can be threaded by inserting only one pragma right before the loop. The full potential of OpenMP is realized when it is used to thread the most time-consuming loops.
Contd....
• The simplest way to create parallelism in OpenMP is to use
the parallel pragma.
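As a minimal sketch (the num_threads(4) clause and the message format are assumptions, not from the source), the parallel pragma alone makes every thread in the team execute the entire enclosed block:

// The parallel pragma replicates the enclosed block across a team of
// threads; num_threads(4) requests a team of four (an assumption).
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        // Every thread executes this whole block once.
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}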
Contd....
[Figure: multiple threads executing the parallel region]
Contd....
• The for loop construct (or simply the loop construct)
specifies that the iterations of the following for loop will
be executed in parallel. The iterations of the loop are
distributed among multiple threads.
• Here, OpenMP first creates several threads; then the iterations of the loop are divided among them.
• A parallel region may also contain a sections construct. Once the loop is finished, the sections are divided among the threads so that each section is executed exactly once, but in parallel with the other sections, as sketched below.
• If the program contains more sections than threads, the remaining sections get scheduled as threads finish their previous sections.
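A minimal sketch combining both constructs (the array size and the work in each section are assumptions, not from the source):

// A parallel region containing a loop construct followed by a
// sections construct.
#include <omp.h>
#include <stdio.h>

#define N 16

int main(void)
{
    int a[N];

    #pragma omp parallel
    {
        // Loop construct: iterations are divided among the threads.
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = i * i;

        // Sections construct: each section runs exactly once, in
        // parallel with the other sections.
        #pragma omp sections
        {
            #pragma omp section
            printf("section 1 on thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("section 2 on thread %d\n", omp_get_thread_num());
        }
    }

    printf("a[N-1] = %d\n", a[N - 1]);
    return 0;
}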
Thank you