Questions On Chapter 1 and 2 Color New V2
Guys, if there is any mistake in any question, please say so and I will fix it; it's natural to make the occasional slip.
3. Describe how n processors can be used to compute and add n values in an efficient way?
➢ We can pair the cores so that while core 0 adds in the result of core 1, core 2 adds in the
result of core 3, core 4 adds in the result of core 5, and so on.
Then we repeat the process with only the even-ranked cores:
core 0 adds in the result of core 2, core 4 adds in the result of core 6, and so on.
Next, the cores divisible by 4 repeat the process, and so on, until core 0 holds the final sum.
This tree-structured scheme needs only about log2(n) stages instead of n - 1 sequential additions.
5. Suppose the main memory consists of 16 lines with indexes 0–15, and the cache consists of 4 lines
with indexes 0–3. Where should lines be stored using direct-mapped, fully associative, and 2-way set
associative mapping?
➢ In a fully associative cache, any memory line can go anywhere: line 0 can be assigned to cache location 0, 1, 2, or 3.
In a direct-mapped cache, we assign lines by their remainder after division by 4 (the number of cache lines). So lines 0,
4, 8, and 12 would be mapped to cache index 0; lines 1, 5, 9, and 13 would be mapped to cache index 1;
and so on. In a 2-way set associative cache, we group the cache into two sets: indexes 0 and 1
form one set, set 0, and indexes 2 and 3 form another, set 1. We then use the remainder of the
main memory index modulo 2, so memory line 0 maps to set 0 and can be stored in either cache index 0 or cache index 1.
6. With an example, show how matrix processing can be made efficient by considering how the cache works?
double A[MAX][MAX], x[MAX], y[MAX];
int i, j;
...
/* Initialize A and x, assign y = 0 */
...
/* First pair of loops */
for (i = 0; i < MAX; i++)
   for (j = 0; j < MAX; j++)
      y[i] += A[i][j]*x[j];
...
/* Assign y = 0 */
...
/* Second pair of loops */
for (j = 0; j < MAX; j++)
   for (i = 0; i < MAX; i++)
      y[i] += A[i][j]*x[j];
Suppose that we have a direct-mapped cache with a maximum size of 8 elements of A (2 lines),
MAX is 4, and A is stored in memory row by row (see the table above).
First pair of loops: 4 misses, and 2 lines are evicted due to the size of the cache (2 lines only).
Second pair of loops: 16 misses, and 14 lines are evicted due to the size of the cache (2 lines only).
If MAX is 1000, the first pair of loops is much faster, approximately three times faster than the second pair.
7. What is the drawback of a page table, and how can this drawback be solved?
➢ The drawback is that it can double the time needed to access a location in main memory.
In order to address this issue, processors have a special address translation cache called a translation-
lookaside buffer, or TLB. It caches a small number of entries ( typically 16–512 ) from the page table in very
fast memory. Using the principle of spatial and temporal locality, we would expect that most of our memory
references will be to pages whose physical address is stored in the TLB, and the number of memory
references that require accesses to the page table in main memory will be substantially reduced.
• Speculation ➢ In speculation, the compiler or the processor makes a guess about an instruction, and
then executes the instruction on the basis of the guess
Example: in the following code, the system might predict that z = x + y will give z a
positive value and, as a consequence, speculatively execute w = x before the comparison is known.
z = x + y;
if (z > 0)
w = x;
else
w = y;
13. Draw the architecture for UMA and NUMA multicore systems and describe them.
- UMA (uniform memory access) systems are usually easier to program, since the programmer doesn't
need to worry about different access times for different memory locations.
- NUMA (nonuniform memory access) systems have the potential to use larger amounts of memory than
UMA systems.
- The lines are bidirectional communication links, the squares are cores or memory modules, and the circles
are switches.
15. Draw ring, toroidal mesh, fully connected network, hypercube, crossbar, and omega network for
eight processors and compute the bisection width for each interconnect.
- Ring: 2; toroidal mesh: 2√p; fully connected network: p²/4; hypercube: p/2. The bisection width of a
p × p crossbar is p, and the bisection width of an omega network is p/2.
16. Show the difference between crossbar and omega network switches.
➢ In the crossbar, as long as two processors don't attempt to communicate with the
same processor, all the processors can simultaneously communicate with another
processor. In the omega network, the switches are two-by-two crossbars, and unlike the
crossbar, there are communications that cannot occur simultaneously.
18. Describe with an example how the cache coherence problem can happen?
➢ The caches we described for single-processor systems provide no mechanism for ensuring that when the
caches of multiple processors store the same variable, an update by one processor to the cached variable is
"seen" by the other processors; that is, nothing guarantees that the cached value stored by the other
processors is also updated.
19. Explain the concepts: snooping cache coherence, directory-based cache coherence, and false
sharing?
• Snooping cache coherence ➢ The idea behind snooping comes from bus-based systems: when the
cores share a bus, any signal transmitted on the bus can be "seen" by all the cores connected to the bus.
Snooping cache coherence isn't scalable, because for larger systems a broadcast across the interconnect
will be very slow relative to the speed of accessing local memory.
• Directory-based cache coherence ➢ Solves the problem that a broadcast across the interconnect is very
slow relative to the speed of accessing local memory, through the use of a data structure called a directory.
The directory stores the status of each cache line. Typically, this data structure is distributed.
• False sharing ➢ Does not cause incorrect results. However, it can ruin the performance of a program by
causing many more accesses to memory than necessary, e.g., when threads update distinct variables that
happen to lie in the same cache line.
CODE:
printf("Thread %d > my_val = %d\n", my_rank, my_x);
Then the output could be:
Thread 0 > my_val = 7
Thread 1 > my_val = 19
but it could also be:
Thread 1 > my_val = 19
Thread 0 > my_val = 7
24. How can a mutex and busy-waiting be used to ensure only one thread executes certain instructions
at a time?
➢ A mutual exclusion lock (mutex, or lock) is a special type of object that has support in the underlying
hardware. The basic idea is that each critical section is protected by a lock. Before a thread can execute the
code in the critical section, it must "obtain" the mutex by calling a mutex function, and when it's done it
"relinquishes" the mutex. In busy-waiting, a thread enters a loop whose sole purpose is to test a condition.
25. Calling functions designed for serial programs can be problematic in a parallel program; give an
example.
➢ The C string library function strtok splits an input string into substrings. When it's first called, it's passed a
string, and on subsequent calls it returns successive substrings. This is arranged through the use of a
static char* variable that refers to the string that was passed on the first call. Now suppose two threads are
splitting strings into substrings. Clearly, if, for example, thread 0 makes its first call to strtok, and then
thread 1 makes its first call to strtok before thread 0 has finished splitting its string, then
thread 0's string will be lost or overwritten, and on subsequent calls thread 0 may get
substrings of thread 1's string.
26. Write lines of code for sending a message from process 0 to process 1.
char message[100];
int my_rank;
...
my_rank = Get_rank();
if (my_rank == 0) {
   sprintf(message, "Greetings from process 0");
   Send(message, MSG_CHAR, 100, 1);
} else if (my_rank == 1) {
   Receive(message, MSG_CHAR, 100, 0);
   printf("Process 1 > Received: %s\n", message);
}
27. Give examples of functions for collective communication.
- MPI_Reduce() { Reduction (all to one) }
- MPI_Allreduce() { Reduction (all to all) }
- MPI_Bcast() { Broadcast (one to all) }
- MPI_Scatter() { Distribute data (one to all) }
- MPI_Gather() { Collect data (all to one) }
- MPI_Allgather() { Collect data (all to all) }
30. A lot of issues can appear with input and output in a parallel system; give some rules to avoid
these issues.
➢ - In distributed-memory programs, only process 0 will access stdin.
- In shared-memory programs, only the master thread (thread 0) will access stdin.
- In both distributed-memory and shared-memory programs, all the processes/threads can access stdout
and stderr.
- Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr.
34. How can the running time of serial and parallel programs be computed?
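No answer is written here; a minimal sketch of the standard relations from chapter 2 (p is the number of processes/threads, S is speedup, and E is efficiency):

```latex
T_{\text{parallel}} = \frac{T_{\text{serial}}}{p} + T_{\text{overhead}},
\qquad S = \frac{T_{\text{serial}}}{T_{\text{parallel}}},
\qquad E = \frac{S}{p}
```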
36. Apply Foster's methodology for making a histogram out of data.
BIG question: the answer is on pages 66–70 of chapter 2, under the title "2.7.1 An example".