
HOME WORK 3

ON

COMPUTER ORGANIZATION AND ARCHITECTURE

Class: M.C.A

SECTION: RE3004

Course Code: CAP211

Submitted To: Lect. Richa Malhotra
Submitted by: Vikas Gupta
Part – A

Q1: Give a comparative study of RISC and CISC architectures.

Ans. RISC vs CISC is a topic quite popular on the Net. Every time Intel (CISC) or Apple (RISC) introduces a new CPU, the topic pops up again. But what are CISC and RISC exactly, and is one of them really better?

This answer explains in simple terms what RISC and CISC are and what the future might bring for both of them. It is by no means intended to be pro-RISC or pro-CISC; draw your own conclusions.

CISC:-

Pronounced sisk, CISC stands for Complex Instruction Set Computer. Most PCs use CPUs based on this architecture. Typically, CISC chips have a large number of different and complex instructions. The philosophy behind them is that hardware is always faster than software, so one should build a powerful instruction set that lets programmers do a lot with short assembly programs. In general, CISC chips are relatively slow per instruction (compared to RISC chips), but programs need fewer instructions.

RISC:-

Pronounced risk, RISC stands for Reduced Instruction Set Computer. RISC chips evolved around the mid-1980s as a reaction to CISC chips. The philosophy behind them is that almost no one uses the complex assembly-language instructions offered by CISC; people mostly use compilers, which rarely emit complex instructions. Apple, for instance, uses RISC chips. Therefore fewer, simpler and faster instructions would be better than the large, complex and slower CISC instructions, although more instructions are needed to accomplish a task. Another advantage of RISC is that, in theory, because of the simpler instructions, RISC chips require fewer transistors. Finally, it is easier to write powerful optimised compilers, since fewer instructions exist.
Characteristics of RISC vs CISC

RISC                                   CISC
Simple instructions taking 1 cycle     Complex instructions taking multiple cycles
Only LOAD/STORE reference memory       Any instruction can reference memory
Highly pipelined                       Not pipelined or less pipelined
Instructions executed by hardware      Instructions interpreted by microprogram
Fixed-format instructions              Variable-format instructions
Few instructions and modes             Many instructions and modes
Complexity is in the compiler          Complexity is in the microprogram

There is still considerable controversy among experts about which architecture is better. Some say that RISC is cheaper and faster and therefore the architecture of the future. Others note that by making the hardware simpler, RISC puts a greater burden on the software: software needs to become more complex, and software developers need to write more lines for the same tasks. They therefore argue that RISC is not the architecture of the future, since conventional CISC chips are becoming faster and cheaper anyway.

RISC has now existed for more than 10 years and hasn't been able to kick CISC out of the market. If we forget about the embedded market and mainly look at the market for PCs, workstations and servers, I guess at least 75% of the processors are based on the CISC architecture. Most of them use the x86 standard (Intel, AMD, etc.), but even in mainframe territory CISC is dominant via the IBM/390 chip. It looks like CISC is here to stay.

Is RISC then really not better? The answer isn't quite that simple. RISC and CISC architectures are becoming more and more alike. Many of today's RISC chips support just as many instructions as yesterday's CISC chips. The PowerPC 601, for example, supports more instructions than the Pentium, yet the 601 is considered a RISC chip, while the Pentium is definitely CISC. Furthermore, today's CISC chips use many techniques formerly associated with RISC chips.

Q2: Taking a suitable example, illustrate how compiler-based optimization is performed in RISC systems.

Ans. The numerical solution of Maxwell's equations is a computationally intensive task, and the use of high-performance parallel computing facilities is necessary for the larger class of practical problems in scattering, propagation and antenna modeling. It is therefore necessary to carefully consider algorithm optimizations aimed at improving the code's run-time performance on the computing platform employed. Although some performance improvement can be derived from compiler-level optimizations, further speed-up may involve manual effort in algorithm restructuring, data layout, and parallelization.

Typically, the most dramatic speed-up after code optimization is achieved by concentrating on the matrix-fill and solution steps (LU decomposition or the inverse FFT employed in the iterative solver) of the code. We therefore concentrate on optimization techniques which emphasize performance improvements for these steps and are aimed at reducing both CPU and wall-clock time.

Optimization Process Diagram

Source code
  |
  | 1. Hand-tuning
  v
Hand-tuned source code
  |
  | 2. Preprocessor optimization
  v
Preprocessor-tuned source code
  |
  | 3. Compiler front-end optimization
  v
Intermediate language code
  |
  | 4. Back-end optimization
  v
Object code

The optimization process

At the hand-tuning stage the user performs a number of optimizations at the source-code level. Examples of such optimizations include reordering programming statements or expressions and changing the memory access patterns of loops [2, 3].

Next, an optimizing preprocessor, if available, takes the source code and performs transformations enabled by user-selectable switches. Typical examples are dead-code elimination, inlining, interprocedural analysis, library-call generation, etc.

The output source code is also optimized to take advantage of architectural features of the host system. Then the output source code is submitted to the compiler, whose front end translates it into intermediate language (IL).

The compiler's back-end optimizer then translates the IL code into machine language and, in this process, may apply a wide range of optimizations at the IL level, depending on user-selectable flag settings which invoke specific sets of optimizations.

The sufficiency of the overall program performance is subsequently assessed either intuitively or, more formally, by using performance characterization tools which provide bounds on the achievable performance of the code [4, 5].

A performance-bound hierarchy model may successively include the effects of machine peak performance, high-level application workloads, compiler-inserted overhead, compiler-generated instruction schedule, cache effects, etc., and may be applied to loops, procedures, sections or entire codes. If the later effects can be reduced or eliminated, performance approaches the earlier bounds, which represent the potential performance of the application.

Once a program has been sufficiently optimized for a single processor, the next step is to assess whether the application code can take advantage of multi-processor computing platforms. If so, the application is then parallelized and optimized for parallel execution.

Optimization Examples

It is hard to overstress the importance of optimization. The examples below demonstrate some simple techniques that can significantly improve code performance and speed-up. Details on the employed techniques themselves and further examples may be found in [2, 3].

An Array-Processing Example
Consider the two codes shown below, which perform element-wise array multiplication [6]. The two codes are clearly functionally equivalent.
a) stride_n.f:

      do i=1,n
        do j=1,n
          c(i,j)=c(i,j)+a(i,j)*b(i,j)
        enddo
      enddo

b) stride1.f:

      do j=1,n
        do i=1,n
          c(i,j)=c(i,j)+a(i,j)*b(i,j)
        enddo
      enddo

FIGURE: The array multiplication. (a) Original (stride_n). (b) Loop interchanged (stride1).

The only difference between the two examples in the figure is that the array elements are referenced in a different order. All runs were made on an IBM RISC System/6000, Model 530 with a 64 KB cache. The arrays were all declared REAL*8. A timing loop was inserted around the loops in the examples so that the reported time is the average over 50 million inner-loop iterations.

The next figure shows this time (in microseconds) as a function of n.

[Chart omitted: time per iteration in microseconds (0 to 5) plotted against n (10 to 500) for the stride1 and stride_n versions]

FIGURE: Performance on a RISC System/6000 Model 530 with 64 KB data cache

As seen in the figure, the performance differs significantly between the two codes. For small n, there is little difference in performance, but as n grows, stride1 runs significantly faster than stride_n. In FORTRAN, arrays are stored in "column major order", implying that the leftmost subscript changes most rapidly as memory-adjacent elements are accessed.

In the stride1 routine, successive iterations of the inner loop access array elements that are adjacent in memory. That is, the array elements are accessed in the same order as they are stored in memory.

However, in stride_n successive iterations of the inner loop access array elements that are stored n entries (one array column) apart in memory. In this case the arrays are said to be accessed with stride n.

When a single element is read into the processor, adjacent elements


(comprising one “cache line”) are automatically brought into the high-
speed cache memory along with it. The user has no choice regarding this
automatic procedure of cache loading. Clearly, if all entries brought into
the cache are soon referenced (as in stride1), there is a memory access
delay only for the first element in each cache line that the processor reads
in.

However, if other entries in this line are referenced much later (as in
stride_n), the line with the referenced entries may get replaced in the
cache before they are referenced; referencing an element that is in the
cache is called a cache hit, otherwise the reference is a cache miss and
suffers a delay called the miss penalty. The advantage of the stride1 code
is that there is roughly one miss per cache line of elements accessed,

whereas almost every access to an element is a miss in the stride_n code, unless n is small enough that all three arrays fit in the cache and remain there indefinitely. This scenario is easily seen in the figure, where both stride1 and stride_n versions take the same time to run for n ≤ 50: three arrays of 50 × 50 × 8 bytes = 60 KB < size of the cache (64 KB).

Obviously, an understanding of the machine’s cache structure is


important in writing code routines that have the best potential for
optimum performance.
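The same effect can be reproduced in C. Note that C stores arrays in row-major order, the opposite of FORTRAN's column-major order, so in C it is the rightmost subscript that must vary fastest for stride-1 access. A minimal sketch (the array size and function names are illustrative, not from the original study):

```c
#include <assert.h>

#define N 512   /* illustrative size; three double arrays exceed a
                   64 KB cache once N is larger than about 53 */

static double a[N][N], b[N][N], c[N][N];

/* stride-N version: the inner loop steps down a column, jumping a
 * whole row (N doubles) per iteration, so nearly every access misses
 * the cache once the arrays outgrow it. */
void multiply_stride_n(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            c[i][j] += a[i][j] * b[i][j];
}

/* stride-1 version: the inner loop walks along a row, touching
 * memory-adjacent elements, so every element of a fetched cache
 * line is used before the line is evicted. */
void multiply_stride_1(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] += a[i][j] * b[i][j];
}
```

Timing each function (for example with clock() around repeated calls) should show the stride-1 version running several times faster for large N, even though the two are functionally identical.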

Q3: Can instructions be executed in a pipeline? If yes, take an example instruction and execute it using an instruction pipeline.

Ans. An instruction pipeline is very similar to a manufacturing assembly line. Imagine an assembly line partitioned into four stages:

1st stage receives some parts, performs its assembly task, and passes the results to the second stage;

2nd stage takes the partially assembled product from the first stage,
performs its task, and passes its work to the third stage;

3rd stage does its work, passing the results to the last stage, which
completes the task and outputs its results.

As the first piece moves from the first stage to the second stage, a new set
of parts for a new piece enters the first stage. Ultimately, every stage
processes a piece simultaneously. This is how time is saved. Each product
requires the same amount of time to be processed (actually slightly more,
to account for the transfers between stages), but products are
manufactured more quickly because several are being created at the same
time.

Consider a non-pipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, 50 ns, and 50 ns.
- Find the instruction latency on this machine.
- How much time does it take to execute 100 instructions?

Instruction latency = 50 + 50 + 60 + 60 + 50 + 50 = 320 ns
Time to execute 100 instructions = 100 × 320 ns = 32,000 ns

Suppose we introduce pipelining on this machine. Assume that when introducing pipelining, the clock skew adds 5 ns of overhead to each execution stage.
- What is the instruction latency on the pipelined machine?
- How much time does it take to execute 100 instructions?

Solution:

Remember that in the pipelined implementation, the lengths of the pipe stages must all be the same, i.e., the speed of the slowest stage plus overhead. With 5 ns overhead this comes to:

Length of pipelined stage = MAX(lengths of unpipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 × 65 ns = 390 ns
Time to execute 100 instructions = 6 × 65 + 99 × 65 = 390 + 6435 = 6825 ns
(the first instruction takes the full six-stage latency; after that, one instruction completes every 65 ns cycle)
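The arithmetic above can be checked mechanically. A small sketch in C (the stage lengths and 5 ns overhead are those of the question; the function names are mine):

```c
#include <assert.h>

/* Non-pipelined latency: the sum of all stage lengths. */
int nonpipelined_latency(const int *stages, int n) {
    int t = 0;
    for (int i = 0; i < n; i++)
        t += stages[i];
    return t;
}

/* Pipelined clock period: the slowest stage plus the skew overhead. */
int pipelined_cycle(const int *stages, int n, int overhead) {
    int max = 0;
    for (int i = 0; i < n; i++)
        if (stages[i] > max)
            max = stages[i];
    return max + overhead;
}

/* Total pipelined time for k instructions: the first takes a full
 * n-stage latency, then one instruction completes every cycle. */
int pipelined_total(const int *stages, int n, int overhead, int k) {
    int cycle = pipelined_cycle(stages, n, overhead);
    return n * cycle + (k - 1) * cycle;
}
```

Plugging in the six stage lengths and 5 ns overhead reproduces the 320 ns, 65 ns, 390 ns and 6825 ns figures worked out above.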

Part - B

.
Q4: How are RISC pipelines implemented in a RISC environment?

Ans. RISC AND PIPELINING

One of the major advantages of the RISC approach is the simplicity of its pipeline implementation.

RISC design features that make pipelining easy include:

(a) Single-length instructions
(b) Relatively few instruction formats
(c) A load/store instruction set (only loads and stores access memory)
(d) Operands aligned in memory

One possible configuration of a RISC pipeline is the one implemented in the SPARC MB86900 CPU.

The IBM 801, the first RISC computer, also uses a four-stage instruction pipeline. Other processors, such as the RISC II, use only three stages; they combine the execute and store-result operations into a single stage.

A RISC pipeline therefore has three or four stages. Note that each stage has a register that latches its data at the end of the stage to synchronize data flow between stages.
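As a sketch of how such a pipeline overlaps work (a simplified model, not of any particular CPU), the cycle in which each instruction leaves a k-stage pipeline can be computed as follows:

```c
#include <assert.h>

#define STAGES 4   /* e.g. fetch, decode, execute, store result */

/* With one instruction entering the pipeline per cycle, instruction i
 * (0-based) enters the fetch stage in cycle i + 1 and leaves the last
 * stage STAGES - 1 cycles later. */
int completion_cycle(int i) {
    return i + STAGES;
}

/* Total cycles to run n instructions: the pipeline fills once, then
 * retires one instruction per cycle. */
int total_cycles(int n) {
    return STAGES + (n - 1);
}
```

For 100 instructions this gives 103 cycles instead of the 400 a non-pipelined four-stage machine would need, which is exactly the overlap the assembly-line analogy in Q3 describes.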

Q5: Give the super scalar architectures of Pentium processor.

Ans. Let us explain the superscalar architecture of the Pentium processor with the help of an example:

Intel Pentium ("P5" / "P54C")


Intel's new fifth-generation chip was expected to be called the 586, following their earlier naming conventions. However, with the rise of AMD and Cyrix, Intel wanted to be able to register the name of their new CPU as a trademark, and numbers can't be trademarked. Thus, the Pentium was born. It is now one of the most recognized trademarks in the computer world, one reason why Intel doesn't seem to ever want to make another processor whose name doesn't have "Pentium" in it somewhere.
The Pentium is the defining processor of the fifth generation. It has in fact
had several generations itself; the first Pentiums are different in many ways
from the latest ones. It has been the target for compatibility for AMD's K5
and Cyrix's 6x86 chips, as well as generations that have followed. The chip
itself is instruction set compatible with earlier x86 CPUs, although it does
include a few new (rarely used) instructions.

The Pentium provides greatly increased performance over the 486 chips that
precede it, due to several architectural changes. Roughly speaking,
a Pentium chip is double the speed of a 486 chip of the same clock speed. In
addition, the Pentium goes to much higher clock speeds than the 486 ever
did. The following are the key architectural enhancements made in the
Pentium over the 486-class chips (note that some of these are present in
Cyrix's 5x86 processor, but that chip was developed after the Pentium):

Superscalar Architecture: The Pentium is the first superscalar processor; it uses two parallel execution units. Some people have likened the Pentium to a pair of 486s in the same chip for this reason, though this really isn't totally accurate. It is only partially superscalar because the second execution unit doesn't have all the capabilities of the first; some instructions won't run in the second pipeline. In order to take advantage of the dual pipelines, code must be optimized to arrange the instructions in a way that will let both pipelines run at the same time. This is why you sometimes see reference to "Pentium optimization". Regardless, the performance is much higher than the single pipeline of the 486.

Wider Data Bus: The Pentium's data bus is doubled to 64 bits, providing
double the bandwidth for transfers to and from memory.

Much Faster Memory Bus: Most Pentiums run on 60 or 66 MHz system buses; most 486s run on 33 MHz system buses. This greatly improves performance. Pentium motherboards also incorporate other performance-enhancing features, such as pipelined burst cache. The Pentium processor was also the first specifically designed to work with the (then new) PCI bus.

Branch Prediction: The Pentium uses branch prediction to prevent pipeline stalls when branches are encountered.

Integrated Power Management: All Pentiums have built-in SMM power management (optional on most of the 486s).

Split Level 1 Cache: The Pentium uses a split level 1 cache, 8 KB each for
data and instructions. The cache was split so that the data and instruction
caches could be individually tuned for their specific use.
Improved Floating Point Unit: The floating point unit of the Pentium is
significantly faster than that of the 486.

The Pentium is available in a wide variety of speeds, and in regular and OverDrive versions. It is also available in several packaging styles, although the pin grid array (PGA) is still the most prevalent. The original Pentiums, the 60 and 66 MHz versions, were very different from the later versions used in most PCs; they used older, 5-volt technology and had significant problems with heat. Intel solved this with the later (75-200 MHz) versions by going to a smaller circuit size and 3.3-volt power.

Pentiums use three different sockets. The original Pentium 60 and 66 use Socket 4. Pentiums from 75 to 133 MHz will fit in either Socket 5 or Socket 7; the Pentium 150, 166 and 200 require Socket 7. Intel makes Pentium OverDrives that allow the use of faster Pentiums in older Pentium sockets (in addition to OverDrives that go in 486 motherboards).

The Pentium processor achieved a certain level of "fame" as a result of the bug that was discovered in its floating point unit not long after it was released. This is commonly known as the "FDIV" bug, after the instruction (floating point divide) in which it most commonly turns up. While bugs in processors are relatively common, they usually are minor and don't have a direct impact on computation results. This one did, and achieved great notoriety in part because Intel didn't own up to the problem and offer to correct it immediately. Intel does offer a replacement on affected processors, which were only found in early versions (60 to 100 MHz) sold in 1994 and earlier.

If you suspect your Pentium of having the FDIV bug, try this computation
test using a spreadsheet or calculator program: take the number 4,195,835
and divide it by 3,145,727. Then take the result and multiply it by the same
number again (3,145,727). You should of course get the same 4,195,835
back that you started with. On a PC with the FDIV bug you will get
4,195,579 (an error of 256), but beware that some operating systems and
applications have been patched to compensate for this bug, so a simple math
test isn't necessarily conclusive. Intel's web site has replacement information, if you suspect that you have an FDIV bug on your older Pentium chip.
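The check described above is easy to script. A minimal sketch in C (the function name is mine; on any FDIV-free FPU it returns 0):

```c
#include <assert.h>

/* The FDIV self-test from the text: divide 4,195,835 by 3,145,727 and
 * multiply the result back.  On a correct FPU the round trip matches
 * the original number up to floating-point rounding; on a Pentium
 * with the FDIV bug the result is 4,195,579, off by roughly 256. */
int has_fdiv_bug(void) {
    double x = 4195835.0;
    double y = 3145727.0;
    double roundtrip = (x / y) * y;
    double diff = roundtrip - x;
    if (diff < 0.0)
        diff = -diff;
    /* Anything beyond a tiny rounding error signals the bug. */
    return diff > 1.0;
}
```

As the text notes, a patched OS or application can mask the bug, so this software test isn't conclusive on an affected chip.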

For many years, the Pentium processor was the mainstream processor of
choice, but finally the Pentium with MMX has driven it to the economy
market. With the regular Pentium maxing out at 200 MHz and the Pentium
with MMX 166 dropping well below $200, the "Pentium Classic" doesn't
make nearly as much sense as it used to for new PCs. The 60 and 66 are
obsolete due to their slow speed and older technology, and the 75 to 150 are
obsolete because their performance is much lower than the 166 and 200, for
almost the same amount of money.

The entire classic Pentium line is now technically obsolete, due to the
availability of inexpensive, faster Pentium with MMX chips (as well as
comparable offerings from AMD and Cyrix). The non-MMX Pentium is no
longer generally used in new systems. However, since the Pentium with
MMX requires split rail voltage, the classic Pentium 200 remains a great
chip for those who have socket 7 motherboards and want to upgrade, but
who do not have split rail voltage support.

Q6: Is there any difference between the working of vector processors and array processors? Differentiate between SIMD and MIMD array processors.

Ans. Array processors are able to efficiently handle large amounts of data, but since this capability requires the CPU to be more complex, simpler operations are harder to perform. Differences between scalar and array processors became less pronounced as microprocessors advanced.

Vector and array processing are essentially the same: with slight and rare differences, a vector processor and an array processor are the same type of processor. A processor, or central processing unit (CPU), is a computer chip that handles most of the information and functions processed through a computer; in a vector or array processor, the instruction set includes operations that can perform mathematical operations on multiple data elements simultaneously. (The traditional distinction is one of implementation: a vector processor streams the elements through deeply pipelined functional units, while an array processor spreads them across many processing elements operating in lockstep.)

SIMD processor organization

• This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.

• Thus a single instruction is executed by different processing units on different sets of data, as shown in the figure.

• Best suited for specialized problems characterized by a high degree of regularity, such as image processing and vector computation.

• Synchronous (lockstep) and deterministic execution.

• Two varieties: processor arrays (e.g., Connection Machine CM-2, MasPar MP-1, MP-2) and vector pipeline processors (e.g., IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S82).

P1                 P2                 Pn
Prev instruct      Prev instruct      Prev instruct
Load A(1)          Load A(2)          Load A(n)
Load B(1)          Load B(2)          Load B(n)
C(1)=A(1)*B(1)     C(2)=A(2)*B(2)     C(n)=A(n)*B(n)
Store C(1)         Store C(2)         Store C(n)
Next instruction   Next instruction   Next instruction

Execution of instructions in a SIMD processor
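The lockstep operation in the figure can be mimicked in C. This scalar sketch performs the lane bodies one after another; on a SIMD machine (or with a vectorizing compiler) all lanes execute the same instruction simultaneously. The names and lane count are illustrative:

```c
#include <assert.h>

#define LANES 8   /* number of processing elements P1..Pn */

/* One "instruction" from the figure: every lane k loads A(k) and B(k),
 * computes C(k) = A(k) * B(k), and stores C(k).  The loop stands in
 * for the parallel hardware that would execute all lanes at once. */
void simd_multiply(const double *a, const double *b, double *c) {
    for (int k = 0; k < LANES; k++)
        c[k] = a[k] * b[k];
}
```

The point of the SIMD organization is that this whole loop body, for all lanes, is one instruction fetch and decode rather than LANES of them.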

Multiple instruction stream, multiple data stream (MIMD)

• Multiple Instruction: every processor may be executing a different instruction stream.

• Multiple Data: every processor may be working with a different data stream, as shown in the figure; shared memory can provide the multiple data streams.

• Can be categorized as loosely coupled or tightly coupled, depending on the sharing of data and control.

• Execution can be synchronous or asynchronous, deterministic or nondeterministic.

P1                 P2                 Pn
Prev instruct      Prev instruct      Prev instruct
Load A(1)          Call funcD         Do 10 i=1,n
Load B(1)          X=y*z              Alpha=w**3
C(1)=A(1)*B(1)     Sum=x*2            Zeta=c(i)
Store C(1)         Call sub1(i,j)     10 continue
Next instruction   Next instruction   Next instruction

Execution of instructions in a MIMD processor
