
Influence of Input/Output Operations on the Performance
of Dynamically Scheduled Processors

Carlos Rioja del Río, José María Rodríguez Corral. Lenguajes y Sistemas Informáticos, Universidad de Cádiz, Escuela Superior de Ingeniería, C/ Chile 1, 11002 Cádiz, Spain.

Arturo Morgado Estévez, Fernando Pérez Peña. Ingeniería de Sistemas y Automática, Tecnología Electrónica y Electrónica, Universidad de Cádiz, Escuela Superior de Ingeniería, C/ Chile 1, 11002 Cádiz, Spain.

Antón Civit Balcells. Arquitectura y Tecnología de Computadores, Universidad de Sevilla, Escuela Superior de Ingeniería Informática, Avda. Reina Mercedes s/n, 41012 Sevilla, Spain.

Abstract— Nowadays, computers are frequently equipped with peripherals that transfer great amounts of data between them and the system memory using direct memory access techniques (e.g. digital video cameras, high speed networks, ...). Those peripherals prevent the processor from accessing system memory for significant periods of time (i.e. while they are communicating with system memory in order to send or receive data blocks). In this paper we study the negative effects that I/O operations from computer peripherals have on the performance of dynamically scheduled processors. With the help of an object-oriented tool (DESP-C++) used to make discrete event simulators, we have developed configurable software that simulates a computer processor and main memory as well as the I/O scenarios where the peripherals operate. This software has been used to analyse the performance of four different processors in four I/O scenarios: video capture, video capture and playback, high speed network and serial transmission.

1. INTRODUCTION

In the last two decades processor architecture has experienced great advances. After sequential processors, pipelined processors [1][2] were designed, and then superscalar, superpipelined and VLIW processors [1][3]. Furthermore, instruction sets have evolved with the development of new processor architectures, and these facts have led to Reduced Instruction Set Computer (RISC) processors [1][2]. At present, multicore processors [4][5] are becoming more frequent in the computer industry. A multicore processor with two or more low-clock-speed cores is designed to provide excellent performance while minimizing power consumption and delivering lower heat output than configurations that rely on a single high-clock-speed core.

On the other hand, I/O devices have evolved in order to get faster and faster to meet the requirements of today's users, such as high resolution graphics, high quality sound and video playback or high speed data transmission. Furthermore, new system buses [6][7][8][9][10] have been developed with higher bandwidth in order to improve data transfer speeds among computer devices (i.e. processor, memory and I/O peripherals). Finally, memories [1][11][12] and caches [1][13] have also evolved.

However, as peripherals become faster and are able to transfer large blocks of data directly to system memory, the influence of I/O operations on processor performance becomes more important and, thus, it must be seriously analysed. In order to study this negative effect, we will develop two simulation programs [14]. The first one (processor and system memory simulator) will be configured with a set of typical parameters for each type of processor. The second one (I/O subsystem simulator) generates a set of I/O scenarios for the different processor and system memory models selected. Then, once the sample code fragment has been chosen, the simulation result will depend on two variables: the processor type (depending on the issue degree and on whether it is statically or dynamically scheduled) and the I/O scenario used.

2. RELATED WORK

The work described in [15] was a first step in our study of the influence of I/O operations on processor performance. The simulation language SMPL [16] - a set of C functions designed for making discrete event simulators [17] using the C language - was chosen in order to develop the processor and system memory simulator as well as the I/O subsystem simulator. Figure 1 shows the basic processor structure: it is a sliding window, statically scheduled superscalar 32-bit RISC processor [1][2] with an instruction cache, a non-blocking data cache and a branch target buffer.

In relation to the sample program for the experiments, we chose the code fragment corresponding to the key step in Gaussian elimination [18]. As this code uses double precision variables, it is commonly named DAXPY because of the arithmetic operations that it performs.
i = 0;
do {
    y[i] = a * x[i] + y[i];
    i++;
} while (i < 1000);

For our purposes, we used a generic RISC assembler loosely based on the DLX assembler [1].

0   SUB R0, R0, R0      ; i = 0
1   LD R1, 0(R0)        ; R1 = #08
2   LD R2, 50(R0)       ; R2 = #8000
3   LDF F0, 100(R0)     ; F0 = a
4   LDF F1, 8000(R0)    ; F1 = x[i]
5   LDF F2, 16000(R0)   ; F2 = y[i]
6   MULF F1, F1, F0     ; F1 = a * x[i]
7   ADDF F2, F1, F2     ; F2 = F1 + y[i]
8   STF 16000(R0), F2   ; y[i] = F2
9   ADD R0, R0, R1      ; i++
10  SUB R2, R2, R0      ; R2 = #8000 - i
11  BRC R2, 4           ; jump if i < #8000
12  END                 ; psinstr: end of program
13  STP                 ; psinstr: stops ID phase
14  NOP                 ; no operation

As our first processor and system memory simulator [15] modelled a statically scheduled processor, we unrolled and reordered the loop instructions in order to eliminate data dependences and increase the amount of parallelism between instructions in each iteration [1]. These dependences decreased the performance of the simulated processor in such a way that the results obtained were not valid for drawing conclusions.

The I/O scenarios complete the system to be simulated, as processor access to main memory will be blocked during the data transfers between the peripherals of a particular I/O scenario and the memory. The I/O subsystem simulator is provided with the selected scenario parameters, so it generates a memory block trace and, optionally, an interrupt trace if there are peripherals that must warn the processor about events related to their operation using IRQs. The I/O scenarios in which the processor and system memory simulator 'executes' the sample program are: video capture, video capture and playback, high speed network and serial transmission.

Figure 2 shows the I/O subsystem simulator block diagram. The main() function initializes all the program data structures and requests the trace file name and the simulation time. Next, chooscen() requests the user to choose the specific I/O scenario to simulate as well as its parameters. The simulation loop resides in the simulate() function, which generates the different events related to peripheral operations. Finally, the mark() auxiliary function is called by simulate() in order to register the initial and final instants of each memory block and each interrupt in the data structure that supports the traces. The writing of the memory block and interrupt traces is performed by write_file(), and the presentation of the simulation timetable on the monitor is carried out by chrono().

There are also other research works on this subject. For example, in [19] the impact of the PCI bus load on the performance of a PC processor is analysed from an empirical point of view (with a Pentium PC and five identical ATM network cards for generating the bus load) and from a mathematical one (using an abstract instruction mix for calculating the impact of the external bus load on the execution time of an application). However, in our study we have adopted a simulation based approach, and we have considered the influence of some processor microarchitectural characteristics, such as instruction issue and functional units.

On the other hand, a method for reducing the influence of the PCI bus on the execution time of real-time software is presented in [20]. This method achieves a more deterministic behaviour for accesses from the processor through the PCI bus: when switching to real-time context, the relevant devices can be configured so that only the Host/PCI bridge is allowed to become a bus master (an "initiator" in PCI bus [6] terminology) and so all the system peripherals operate exclusively as slave devices. Thus, the DMA transfers of PCI devices are postponed in order to obtain more deterministic execution times for real-time software.

3. SIMULATOR DESIGN

The selection of a simulation tool [21][22] - either a programming language, a simulation language or a simulator - is a decision of great importance, as it conditions the difficulty of the subsequent work as well as the obtaining of results. We have chosen the simulation language DESP-C++ [23], a discrete-event [17] random simulation engine based on the C++ programming language [24], since this type of simulation seems more suitable than continuous simulation given the nature of the system to be modelled, based on synchronous sequential circuits (main processor, DMA controllers and peripherals), where the time unit is the clock cycle.

There also exist processor simulators with many interesting features. The SimpleScalar toolset [25][26][27] was written in 1992 at the University of Wisconsin, and it was released in 1995 as an open source distribution freely available to academic users. Nowadays it is very popular among researchers and instructors in the computer architecture research community. SimpleScalar infrastructure components implement many common modelling tasks [26], such as instruction-set simulation, I/O emulation, discrete-event management and modelling of common microarchitectural components (e.g. branch predictors, instruction queues and caches).

However, we have preferred to develop a fully custom-made tool using a simulation language as a first step in our research work (i.e. a prototype) and a more direct way
to obtain results. In the future, we want to develop a new version of our simulator using SimpleScalar. DESP-C++ complies with an important number of requirements [23] that are relevant when choosing a simulator or a simulation language [21][22] and that recommend its use, such as flexibility and simplicity, validity of the simulation results, efficiency, compactness, portability and extensibility, object-oriented approach and use of the Standard Template Library (STL) [28], statistical facilities, random number generator and independent replication generator.

The model of the dynamically scheduled processor (figure 3) adds to the model of the statically scheduled one [15] those aspects (figure 4) related to the reservation stations and the common data bus (CDB) [1][29][30]. In particular, when issuing an instruction to a functional unit (IS stage), there must be at least one free reservation station for that functional unit. Also, when a functional unit provides the result of an operation, it must be propagated through the CDB in order to be written into the corresponding processor destination register and to be read by those reservation stations whose instructions need it for their execution. Finally, this new version of the processor and system memory simulator also comprises the statically scheduled processor model (figure 1), allowing the execution of the corresponding simulation experiments (table 1).

In order to choose the different input parameters which configure the processor model for the simulations, we have selected a set of typical values after reviewing the features of various well-known processors [31][32][33][34][35]. These parameters are:

• Frequency of operation: 1 GHz [31][33].
• System memory access time: 2.5 ns for a DDR400 SDRAM at 200 MHz [1][11].
• Hit rate for the instruction and data caches: 0.95 for the instruction cache and 0.9 for the data cache [1][13].
• Line size for the instruction and data caches: 32 bytes (capacity for eight 32-bit instructions) for the instruction cache and also 32 bytes for the data cache [34][35].
• Number of consecutive misses for the data cache without blocking [1]: four outstanding misses is an acceptable value [33].
• Code prefetch queue size: 32 bytes (capacity for eight 32-bit instructions) [35].
• Latency of floating point (FP) units [1][33][34]: addition (ADDF), 3 clocks; product (MULF), 6 clocks; division (DIVF), 21 clocks.
• Latency of load/store units: one clock for calculating the effective address of the access and another clock for accessing the data cache [1][34].

Finally, we have chosen four types of processors for the experiments. Thus, for each I/O scenario four results will be obtained, one for each processor type.

• 4-issue superscalar processor with sufficient resources: twelve FP addition units (12 ADDF), twenty-four FP product units (24 MULF), eight load/store units (8 MEM) and four integer units (4 EX).
• 4-issue superscalar processor with limited resources: six FP addition units (6 ADDF), twelve FP product units (12 MULF), four load/store units (4 MEM) and two integer units (2 EX).
• 2-issue superscalar processor with sufficient resources: six FP addition units (6 ADDF), twelve FP product units (12 MULF), four load/store units (4 MEM) and two integer units (2 EX).
• 2-issue superscalar processor with limited resources: three FP addition units (3 ADDF), six FP product units (6 MULF), two load/store units (2 MEM) and one integer unit (1 EX).

As well as redesigning the processor and system memory simulator using DESP-C++, its functionality has been extended with the incorporation of the dynamic scheduling feature. So, the number of reservation stations [1][29][30] must be supplied as an additional parameter when simulating the operation of a dynamically scheduled processor, and it will be the same for all the processor functional units. The concrete values selected for the simulation experiments are 1, 2, 4, 8 and 16 reservation stations [36][37].

4. SCENARIO DESIGN

In this section, we will treat those questions related to the I/O scenarios where our processor model will 'execute' the sample program instructions (of course, we mean a simulated execution). The I/O scenario generator software has the same structure (figure 2) as its predecessor [15], and it works in a similar way. However, it has been rewritten using the simulation engine DESP-C++ and an object-oriented design methodology.

The I/O scenarios in which the processor and system memory simulator will 'execute' the sample program are updated versions of those used for the experiments performed in [15]: video capture, video capture and playback, high speed network and serial transmission.

• Video capture: A digital video camera sends frames in high definition format (HDTV) [38] to a computer through a channel with an adequate bandwidth (IEEE 1394 bus [39]). As frame data arrive to main memory, they are stored into the system hard disk. General scenario attributes are:

- Frame resolution is 1920 x 1080 pixels with 24-bit colour.
- Vertical frequency is 30 frames/sec.
- Hard disk transfer rate (sustained bandwidth) is 115 Mbytes/sec [1][40].
- The camera adapter and the hard disk controller are both connected to the system through a PCI Express 3.0 x1 bus [9].
• Video capture and playback: It is similar to the previous scenario, with the only difference that there is an additional peripheral. When a data block arrives to main memory from the camera adapter, it is sent to the hard disk buffer and also to the video adapter memory in order to achieve real time playback. We assume that the video adapter is connected to the system through a PCI Express 3.0 x16 bus.

• High speed network: A remote computer sends information to our system through a 1000 Mbits/sec baseband IEEE 802.3 (Gigabit Ethernet) network [41]. As data blocks arrive to main memory, they are stored into the system hard disk. General scenario attributes are:

- Gigabit Ethernet network transfer rate is 1000 Mbits/sec.
- Hard disk transfer rate (sustained bandwidth) is 115 Mbytes/sec.
- The network adapter and the hard disk controller are both connected to the system through a PCI Express 3.0 x1 bus.

• Serial transmission: The system continuously receives bytes through a USB [7] port. Input reports are periodically sent from a HID-compliant mouse [8] to the USB Host in low-speed mode. Every time a data block of a certain size has completely arrived to main memory, it is written into the system hard disk. General scenario attributes are:

- Every millisecond (i.e. a USB frame), the USB Host driver requests information from the mouse about its state and receives a four-byte report (state of the three buttons, X, Y and wheel relative positions) from it.
- USB low-speed transfer rate is 1.5 Mbits/sec.
- When an input report from the mouse arrives to the USB Host, the latter requests an interrupt. In the ISR, the processor reads the report data through the PCI Express bus and writes it into the system main memory.
- Hard disk transfer rate (sustained bandwidth) is 115 Mbytes/sec.
- The hard disk controller is connected to the system through a PCI Express 3.0 x1 bus.

5. ANALYSIS OF RESULTS

Let X1, X2, ..., XN be the results obtained from various simulation experiments. The parameters that are normally of interest are the mean (1) and the variance (2). If N is high enough, the results of equations (1) and (2) are approximately equal to the really important parameters: the expected value of X (3) and its second moment, respectively.

(1)  \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i

(2)  S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2

(3)  \mu = \lim_{n \to \infty} E(\bar{X}_n)

As the Xi values are approximately normally distributed when N is high enough, a confidence interval for μ, as estimated by \bar{X}, is given by equation (4), where t_{α/2,N-1} is the value that leaves (α/2 · 100) percent of the Student's t distribution area in the upper tail. Thus, P[\bar{X} - H ≤ μ ≤ \bar{X} + H] is equal to the confidence level (1 - α) for the interval.

(4)  H = t_{\alpha/2, N-1} \frac{S}{N^{1/2}}

However, the samples (Xi) obtained from the experiments must be independent and identically distributed for the confidence interval calculated in equation (4) to be valid. We may use one of the following two methods [42], which are also easy to apply:

The method of batch means divides an execution into various blocks so that the means obtained for each block are approximately independent. However, the means calculated in this way are not strictly independent and, furthermore, estimating the necessary duration of each block is difficult [43]. On the other hand, the method of replications is the simplest one, and it is correct as replications are independent whenever the seed of the random number generator functions is different in each replication.

We have chosen the second method for our experiments as it seems the most suitable: resource availability is sufficient and execution times are not too high. Each experiment will consist of a temporal simulation of the execution of 2,000,000,000 instructions from the sample program. As the processor model frequency is 1 GHz, we will consider that the selected amount of instructions is statistically significant. For each experiment we will make ten executions (replications) and we will state an accuracy of 10% with a confidence level [44] of 95%. If the desired accuracy were not achieved for an experiment, the corresponding execution lengths would be increased.

Experiment results [14] (instruction execution mean times) for statically scheduled processors are shown in Table 1. For each processor, the video capture and playback scenario is the most aggressive for processor performance whereas the serial transmission scenario is the least, as was predictable. Furthermore, the processor that gives the best result in each column is the 4-issue superscalar one with sufficient resources (without structural hazards), whereas the
2-issue processor with limited resources gives the worst performance.

Tables 2.a, 2.b, 2.c, 2.d and 2.e show the instruction execution mean times (measured in clock cycles per instruction [1]) for dynamically scheduled processors with 1, 2, 4, 8 and 16 reservation stations, in the ideal case and in the four I/O scenarios. The CPI values from the simulations of processors with one reservation station per functional unit have been obtained only for theoretical purposes (i.e. a dynamically scheduled processor with only one reservation station has no practical utility), and they are greater than the results obtained from the simulations of statically scheduled processors, shown in table 1. This is normal considering that in a dynamically scheduled pipeline, the effective latency between a producing instruction and a consuming one is at least one cycle longer than the latency of the functional unit producing the result, as the result cannot be used until the Write Result (WR) stage has finished [1].

Tables 3.a, 3.b, 3.c, 3.d and 3.e show the speed-ups (%) in relation to statically scheduled processors for dynamically scheduled processors with 1, 4 and 16 reservation stations, in the ideal case and in the four I/O scenarios. Obviously, the best values are achieved for each I/O scenario when the processors work using 16 reservation stations.

In order to explain the evolution of the CPI values obtained in tables 2.a, 2.b, 2.c, 2.d and 2.e as the number of reservation stations increases, a set of regression functions has been obtained using the statistical program "R" [45][46]. After trying various types of functions (e.g. linear, quadratic, cubic, inverse, exponential, logarithmic, ...), we selected only the three types of regression functions which gave the highest determination coefficients (i.e. multiple R-squared and adjusted R-squared).

Figures 5.a, 5.b and 5.c show three cases of regression functions for the CPI values in the first row of table 2.b (4-issue superscalar processor with sufficient resources operating in the video capture and playback I/O scenario), corresponding to equations (5)-(7) respectively.

(5)  f(x) = 0.00213 x^2 - 0.05017 x + 0.85016
     Multiple R-squared: 0.9225. Adjusted R-squared: 0.845.

(6)  f(x) = -0.00046 x^3 + 0.01344 x^2 - 0.11799 x + 0.93310
     Multiple R-squared: 0.9947. Adjusted R-squared: 0.9786.

(7)  f(x) = 0.27759 (1/x) + 0.57187
     Multiple R-squared: 0.9782. Adjusted R-squared: 0.9709.

Equation (6) represents the best case, considering that the determination coefficients corresponding to the cubic fit (figure 5.b) are the highest, close to unity, and equation (5) represents the worst case, since the determination coefficients corresponding to the quadratic fit (figure 5.a) are the lowest. The inverse fit (figure 5.c), whose analytical expression is provided by equation (7), has a slightly lower quality than the cubic fit, as the determination coefficients of the former are a bit lower than those of the latter. However, the inverse fit seems the most feasible and logical, since there are systems and phenomena whose characteristics and behaviours can be modelled using an inverse function (e.g. the evolution of the page fault rate of a process as the number of frames - pages maintained in main memory - allocated to it grows [47], the equilateral hyperbolic functional relationships of the A-T and C-G variable pairs of DNA bases - adenine, thymine, cytosine and guanine - [48], and the relationship between joint viscosity and joint angular velocity in agonist skeletal muscles [49]) and, furthermore, it allows us to make predictions easily, as the inverse function tends to a horizontal asymptote.

Thus, for each dynamically scheduled processor working in 'ideal conditions' or under the influence of a concrete I/O scenario, we can obtain from the CPI values registered in tables 2.a, 2.b, 2.c, 2.d and 2.e a set of inverse regression functions, whose general expression is described by equation (8) and whose particular analytical expressions are shown in table 4, along with their corresponding determination coefficients. The constant A0 of each function (ideal CPI) represents the limit (horizontal asymptote) when the number of reservation stations tends to infinity and, therefore, the data dependencies existing in the sample program (i.e. DAXPY [18]) have no negative effects on the issue of instructions to the functional units. The term A1 measures the concavity degree of the hyperbola, and it is related to the cost of the hardware that would be necessary to add to the processor so that its performance approximated the ideal CPI sufficiently (i.e. as the value of A1 increases, a greater number of reservation stations will be needed in order to achieve a performance close to the ideal CPI).

(8)  CPI(R_s) = \frac{A_1}{R_s} + A_0

Finally, table 5 shows the ideal speed-ups for dynamically scheduled processors with an unlimited number of reservation stations. Each of these speed-ups is defined as the relation between the ideal instruction execution mean time (provided by the limit of the corresponding regression function in table 4 when the number of reservation stations tends to infinity) for a concrete type of dynamically scheduled processor and a specific I/O scenario, and the
instruction execution mean time shown in table 1 for the corresponding statically scheduled processor (i.e. with the same features but without supporting dynamic scheduling) and the same I/O scenario.

6. CONCLUSIONS

Simulation software has been designed in order to study the influence of peripheral I/O operations on the performance of dynamically scheduled processors. This software has been used to analyse the performance of four different processors in four I/O scenarios (video capture, video capture and playback, high speed network and serial transmission).

The performance improvement provided by dynamic scheduling is evaluated from the results obtained using this simulation software. The relation between the cycles per instruction (CPI) of a dynamically scheduled processor and the CPI of a statically scheduled one with identical features gives a speed-up that increases as the number of reservation stations of the dynamically scheduled processor does. Thus, it can be shown that dynamically scheduled processors have a greater tolerance to the memory blocking caused by the I/O operations of the peripherals working in the system.

From the analysis performed, a simple analytical model using a regression method has been developed in order to explain the performance evolution - measured in CPI - of a concrete dynamically scheduled processor operating in a particular I/O scenario as the number of reservation stations increases. This analytical model allows us to calculate a set of ideal speed-ups for dynamically scheduled processors with an unlimited number of reservation stations, for which the data dependencies existing in the sample program have no negative effects on the issue of instructions to the functional units.

Embedded systems usually include statically scheduled processors [32][50], since the extra hardware needed to dynamically schedule instructions is wasteful in terms of both chip area and power consumption [1]; thus, the programs they execute are optimized using code scheduling and loop unrolling [1][51]. This optimization is useful and the embedded system performance is good whenever there are no external factors which can have a negative influence on the execution time of such programs. The results described in this work show the utility of using dynamically scheduled processors in those embedded systems which must support an important volume of I/O transfers (i.e. an active I/O).

7. FUTURE RESEARCH LINES

• Extend the study described in this paper to multicore processors.
• Redesign the processor and system memory simulator and the I/O subsystem simulator using the SimpleScalar toolset.
• Extend the study to new I/O scenarios for specific purpose machines (e.g. client and server of web pages, server of network games, server of databases, ...), so that the analysis of the results obtained allows suitable configurations to be found for the processor and the system memory of these machines.
• With regard to the previous line, we propose the development of a software module that captures the parameters of real I/O scenarios - by means of which computer applications and real I/O loads can be properly characterized - in order to be used as a generator of inputs (traces of main memory blocks due to I/O operations) to the processor and system memory simulator.

8. REFERENCES

[1] J.L. Hennessy, D.A. Patterson. Computer Architecture. A Quantitative Approach. Fourth Edition. Elsevier Inc., 2007.
[2] D. Patterson. Reduced Instruction Set Computers. Communications of the ACM 28(1), 1985. Pp. 8-21.
[3] N.P. Jouppi, D.W. Wall. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. ACM SIGARCH Computer Architecture News - Special issue: Proceedings of ASPLOS-III: The Third International Conference on Architecture Support for Programming Languages and Operating Systems 17(2), 1989. Pp. 272-282.
[4] J. Fruehe. Planning Considerations for Multicore Processor Technology. Dell Power Solutions. May 2005.
[5] R.M. Ramanathan. Intel Multi-Core Processors. Making the Move to Quad-Core and Beyond. White Paper. Intel Corp., 2006. Available at: http://www.intel.com/technology/architecture/downloads/quad-core-06.pdf
[6] T. Shanley, D. Anderson. PCI System Architecture. Third Edition. Mindshare Inc., 1995.
[7] Hewlett-Packard Company, Intel Corp., Microsoft Corp. et al. Universal Serial Bus 3.0 Specification. Revision 1.0. November, 2008.
[8] J. Axelson. USB Complete: The Developer's Guide. Fourth Edition. Lakeview Research, 2009.
[9] PCI Special Interest Group. PCI Express Base 3.0 Specification. November, 2010.
[10] R. Budruk, D. Anderson, T. Shanley. PCI Express System Architecture. Mindshare Inc., 2003.
[11] Y. Katayama. Trends in Semiconductor Memories. IEEE Micro 17(6), 1997. Pp. 10-17.
[12] Hewlett-Packard Development Company. Memory technology evolution: an overview of system memory technologies. Technology brief, ninth edition. Technical White Paper TC101004TB. December, 2010.
[13] J.K. Peir, W.W. Hsu, A.J. Smith. Functional Implementation Techniques for CPU Cache Memories. IEEE Transactions on Computers 48(2), 1999. Pp. 100-110.
[14] C. Rioja del Río. Estudio de la Influencia de la Entrada/Salida en el Rendimiento de los Procesadores (in Spanish), Ph.D. thesis. Universidad de Cádiz, 2011.
[15] J.M. Rodríguez, A. Civit, G. Jiménez, J.L.
Sevillano, A. Morgado. Influence of Input/Output Operations on Processor Performance. Journal of Circuits, Systems and Computers 15(1), 2006. Pp. 43-56.
[16] M.H. MacDougall. Simulating Computer Systems: Techniques and Tools. The MIT Press, 1987.
[17] J. Banks, J.S. Carson, B.L. Nelson, D.M. Nicol. Discrete-Event System Simulation. Fifth Edition. Prentice Hall, 2010.
[18] Intel Corp. Using Streaming SIMD Extensions 2 (SSE 2) for SAXPY/DAXPY. Version 2.0. Application Note AP-935. July, 2000.
[19] S. Schönberg. Impact of PCI-Bus Load on Applications in a PC Architecture. Proceedings of the 24th IEEE International Real-Time Systems Symposium (RTSS). Cancun (México), 2003. Pp. 430-439.
[20] J. Stohr, A. von Bülow, G. Färber. Controlling the Influence of PCI DMA Transfers on Worst Case Execution Times of Real-Time Software. Proceedings of the 4th Intl Workshop on Worst-Case Execution Time (WCET) Analysis. Sicily (Italy), 2004. Pp. 19-22.
[21] J. Nikoukaran, R.J. Paul. Software Selection for Simulation in Manufacturing: a Review. Simulation Practice and Theory 7(1), 1999. Pp. 1-14.
[22] T.W. Tewoldeberhan, G. Bardonnet. An Evaluation and Selection Methodology for Discrete-Event Simulation Software. Proceedings of the 2002 Winter Simulation Conference (WSC). San Diego (California), 2002. Pp. 67-75.
[23] J. Darmont. DESP-C++: a discrete-event simulation package for C++. Software: Practice and Experience 30(1), 2000. Pp. 37-60.
[24] B. Stroustrup. The C++ Programming
[32] ARM Ltd. Cortex-A8. Technical Reference Manual. May, 2010. Available at: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/index.html
[33] IBM Corp. IBM PowerPC 750GX RISC Microprocessor. Revision Level DD1.X (Datasheet). September, 2005.
[34] K.C. Yeager. The Mips R10000 Superscalar Microprocessor. IEEE Micro 16(2), 1996. Pp. 28-40.
[35] T. Shanley. Pentium Pro and Pentium II Processor System Architecture. Second Edition. Mindshare Inc., 1998.
[36] L. Gwennap. Intel's P6 Uses Decoupled Superscalar Design. Microprocessor Report 9(2), 1995. Pp. 9-15.
[37] J.M. Colmenar, O. Garnica, J. Lanchares, J.I. Hidalgo. Characterizing asynchronous variable latencies through probability distribution functions. Microprocessors and Microsystems 33(7-8), 2009. Pp. 483-497.
[38] C. Basile, A.P. Cavallerano, M.S. Deiss, R. Keeler et al. The US HDTV standard. IEEE Spectrum 32(4), 1995. Pp. 36-45.
[39] IEEE Computer Society. IEEE Standard for a High-Performance Serial Bus. IEEE Std 1394-2008 (Revision of IEEE Std 1394-1995). October, 2008.
[40] Seagate Technologies LLC. Barracuda 7200.11 Serial ATA. Rev. C. Product Manual. August, 2008.
[41] IEEE Computer Society. Carrier Sense Multiple Access with Collision Detection (CSMA/CD) access method and Physical Layer specifications. IEEE Std 802.3-2008 (Revision of IEEE Std 802.3-2005). December, 2008.
[42] A.M. Law. Statistical Analysis of Simulation
Language. Third Edition. Addison-Wesley, 1997. Output Data. Operations Research 31(6), 1983. Pp. 983-
[25] D. Burger, T.M. Austin. The SimpleScalar Tool 1029.
Set, Version 2.0. Technical Report 1342. Department of [43] P. Heidelberger, S.S. Lavenberg. Computer
Computer Sciences. University of Wisconsin-Madison, Performance Evaluation Methodology. IEEE Transactions
1997. Available at: on Computers 33(12), 1984. Pp. 1195-1220.
http://www.cs.wisc.edu/~mscalar/simplescalar.html [44] J. Banks. Output Analysis Capabilities of
[26] T. Austin, E. Larson, D. Ernst. SimpleScalar: An Simulation Software. SIMULATION 66(1), 1996. Pp. 23-30.
Infrastructure for Computer System Modeling. IEEE [45] J. Verzani. Using R for Introductory Statistics.
Computer 35(2), 2002. Pp. 59-67. Chapman & Hall/CRC Press, 2005.
[27] D. Burger, T.M. Austin, S.W. Keckler. Recent [46] S.J. Sheather. A Modern Approach to Regression
Extensions to The SimpleScalar Tool Suite. ACM with R. Springer, 2009.
SIGMETRICS Performance Evaluation Review - Special [47] W. Stallings. Operating Systems. Internals and
issue on tools for computer architecture research 31(4), 2004. Design Principles. Sixth Edition. Prentice Hall, 2009.
Pp. 4-7. [48] L.L. Gatlin. Base Composition Hyperbolic
[28] M.H. Austern. Generic Programming and the STL: Functional Relationships in DNA. Journal of Theoretical
Using and Extending the C++ Standard Template Library. Biology 7(1), 1964. Pp. 129-140.
Addison-Wesley Professional, 1999. [49] Y. Takeda, M. Iwahara, T. Kato, T. Tsuji. Analysis
[29] R.M. Tomasulo. An Efficient Algorithm for of Human Wrist Joint Impedance: Does Human Joint
Exploiting Multiple Arithmetic Units. IBM Journal of Viscosity Depend on Its Angular Velocity? Proceedings of
Research and Development 11(1), 1967. Pp. 25-33. the 2004 IEEE Conference on Cybernetics and Intelligent
[30] G.S. Sohi. Instruction Issue Logic for High- Systems. Singapore, 2004. Pp. 999-1004.
Performance, Interruptible, Multiple Functional Unit, [50] E. Kappos, D.J. Kinniment. Application-specific
Pipelined Computers. IEEE Transactions on Computers Processor Architectures for Embedded Control: Case
39(3), 1990. Pp. 349-359. Studies. Microprocessors and Microsystems 20(4), 1996. Pp.
[31] ARM Ltd. Cortex-A8 Processor. Available at: 225-232.
http://www.arm.com/products/processors/cortex-a/cortex- [51] R. Leupers. Code Generation for Embedded
a8.php Processors. Proceedings of the 13th International
Symposium on System Synthesis. Madrid (Spain), 2000. Pp. 4-issue & hazards 112.8 94.0 81.9
173-178. 2-issue 110.0 97.6 92.7
Processor           Ideal    C. & Play  Capture  Network  Serial
4-issue             0.4991   0.7656     0.7324   0.6202   0.5106
4-issue & hazards   0.5314   0.7786     0.7613   0.6392   0.5330
2-issue             0.7819   1.2821     1.1900   1.0385   0.8296
2-issue & hazards   0.8997   1.3609     1.2994   1.1124   0.9265

Table 1. Instruction execution mean times (processor clock cycles) for statically scheduled processors.

Ideal               1 Rs     2 Rs     4 Rs     8 Rs     16 Rs
4-issue             0.5206   0.4634   0.4209   0.4016   0.3941
4-issue & hazards   0.5782   0.5315   0.4955   0.4478   0.4354
2-issue             0.8376   0.7920   0.7812   0.7734   0.7517
2-issue & hazards   0.9084   0.8663   0.8549   0.8226   0.7858

Capt. & Play        1 Rs     2 Rs     4 Rs     8 Rs     16 Rs
4-issue             0.8342   0.7358   0.6532   0.6108   0.5902
4-issue & hazards   0.8676   0.7821   0.7220   0.6388   0.6170
2-issue             1.4139   1.2700   1.2299   1.1998   1.1480
2-issue & hazards   1.4256   1.3402   1.2930   1.2390   1.1783

Capture             1 Rs     2 Rs     4 Rs     8 Rs     16 Rs
4-issue             0.7701   0.6507   0.5800   0.5372   0.5126
4-issue & hazards   0.8344   0.7571   0.7022   0.6195   0.5842
2-issue             1.2495   1.1600   1.1200   1.0952   1.0597
2-issue & hazards   1.3299   1.2400   1.2185   1.1472   1.0921

Network             1 Rs     2 Rs     4 Rs     8 Rs     16 Rs
4-issue             0.6761   0.5687   0.5094   0.4718   0.4452
4-issue & hazards   0.7081   0.6199   0.5499   0.5069   0.4792
2-issue             1.1865   1.0350   0.9611   0.9180   0.8608
2-issue & hazards   1.2332   1.0698   1.0300   0.9594   0.8898

Serial              1 Rs     2 Rs     4 Rs     8 Rs     16 Rs
4-issue             0.5498   0.4761   0.4307   0.4102   0.4014
4-issue & hazards   0.6011   0.5451   0.5011   0.4497   0.4364
2-issue             0.9127   0.8266   0.8100   0.7980   0.7694
2-issue & hazards   0.9743   0.8967   0.8723   0.8341   0.7938

Tables 2.a, 2.b, 2.c, 2.d and 2.e. Instruction execution mean times (processor clock cycles) for dynamically scheduled processors.

Ideal               1 Rs    4 Rs    16 Rs
4-issue             104.3   84.3    79.0
4-issue & hazards   108.8   93.2    81.9
2-issue             107.1   99.9    96.1
2-issue & hazards   101.0   95.0    87.3

Capt. & Play        1 Rs    4 Rs    16 Rs
4-issue             109.0   85.3    77.1
4-issue & hazards   111.4   92.7    79.2
2-issue             110.3   96.0    89.5
2-issue & hazards   104.8   95.0    86.6

Capture             1 Rs    4 Rs    16 Rs
4-issue             105.2   79.2    70.0
4-issue & hazards   109.6   92.2    76.7
2-issue             105.0   94.1    89.1
2-issue & hazards   102.4   93.8    84.1

Network             1 Rs    4 Rs    16 Rs
4-issue             109.0   82.1    71.8
4-issue & hazards   110.8   86.0    75.0
2-issue             114.3   92.5    82.9
2-issue & hazards   110.9   90.6    80.0

Serial              1 Rs    4 Rs    16 Rs
4-issue             107.7   84.4    78.6
4-issue & hazards   112.8   94.0    81.9
2-issue             110.0   97.6    92.7
2-issue & hazards   105.2   94.2    85.7

Tables 3.a, 3.b, 3.c, 3.d and 3.e. Speed ups (%) in relation to statically scheduled processors for dynamically scheduled processors.

4-issue
  Ideal:      CPI(Rs) = 0.14399 (1/Rs) + 0.38203   (Multiple R-squared 0.9899, Adjusted 0.9865)
  C. & Play:  CPI(Rs) = 0.27759 (1/Rs) + 0.57187   (Multiple R-squared 0.9782, Adjusted 0.9709)
  Capture:    CPI(Rs) = 0.28812 (1/Rs) + 0.49289   (Multiple R-squared 0.9895, Adjusted 0.9860)
  Network:    CPI(Rs) = 0.25650 (1/Rs) + 0.42947   (Multiple R-squared 0.9873, Adjusted 0.9831)
  Serial:     CPI(Rs) = 0.16418 (1/Rs) + 0.38858   (Multiple R-squared 0.9975, Adjusted 0.9967)

4-issue & hazards
  Ideal:      CPI(Rs) = 0.16244 (1/Rs) + 0.43043   (Multiple R-squared 0.9178, Adjusted 0.8904)
  C. & Play:  CPI(Rs) = 0.28425 (1/Rs) + 0.60788   (Multiple R-squared 0.9227, Adjusted 0.8969)
  Capture:    CPI(Rs) = 0.28461 (1/Rs) + 0.57896   (Multiple R-squared 0.8907, Adjusted 0.8542)
  Network:    CPI(Rs) = 0.25769 (1/Rs) + 0.46691   (Multiple R-squared 0.9680, Adjusted 0.9573)
  Serial:     CPI(Rs) = 0.18407 (1/Rs) + 0.43182   (Multiple R-squared 0.9343, Adjusted 0.9124)

2-issue
  Ideal:      CPI(Rs) = 0.08985 (1/Rs) + 0.74965   (Multiple R-squared 0.9443, Adjusted 0.9258)
  C. & Play:  CPI(Rs) = 0.28318 (1/Rs) + 1.13537   (Multiple R-squared 0.9694, Adjusted 0.9592)
  Capture:    CPI(Rs) = 0.20553 (1/Rs) + 1.05183   (Multiple R-squared 0.9734, Adjusted 0.9645)
  Network:    CPI(Rs) = 0.35492 (1/Rs) + 0.84576   (Multiple R-squared 0.9768, Adjusted 0.9690)
  Serial:     CPI(Rs) = 0.14763 (1/Rs) + 0.76349   (Multiple R-squared 0.9622, Adjusted 0.9496)

2-issue & hazards
  Ideal:      CPI(Rs) = 0.14321 (1/Rs) + 0.78186   (Multiple R-squared 0.7833, Adjusted 0.7111)
  C. & Play:  CPI(Rs) = 0.27621 (1/Rs) + 1.17552   (Multiple R-squared 0.8890, Adjusted 0.8520)
  Capture:    CPI(Rs) = 0.26459 (1/Rs) + 1.08990   (Multiple R-squared 0.8563, Adjusted 0.8084)
  Network:    CPI(Rs) = 0.36667 (1/Rs) + 0.88307   (Multiple R-squared 0.9421, Adjusted 0.9229)
  Serial:     CPI(Rs) = 0.19313 (1/Rs) + 0.79294   (Multiple R-squared 0.9267, Adjusted 0.9023)

Table 4. Regression functions for instruction execution mean times (processor clock cycles) and coefficients of determination.

Processor           Ideal   C. & Play  Capture  Network  Serial
4-issue             76.5    74.7       67.3     69.2     76.1
4-issue & hazards   81.0    78.1       76.1     73.1     81.0
2-issue             95.9    88.6       88.4     81.4     92.0
2-issue & hazards   86.9    86.4       83.9     79.4     85.6

Table 5. Ideal speed ups (%) in relation to statically scheduled processors for dynamically scheduled processors with an unlimited number of reservation stations.
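The relationships among the tables above can be checked numerically. Fitting CPI(Rs) = a (1/Rs) + b to the Table 2 mean times by ordinary least squares gives coefficients close to the Table 4 entries (the published fits were presumably computed on the raw simulation output, so an exact match is not expected); the Table 3 percentages equal the ratio of dynamic to static mean instruction times; and the Table 5 entries follow from dividing the Table 4 intercept b, the CPI limit as Rs grows, by the corresponding static value from Table 1. A minimal Python sketch for the Ideal workload and the 4-issue configuration:

```python
# Sketch: reproduce the Table 3/4/5 arithmetic for the Ideal workload,
# 4-issue configuration. Data are copied from Tables 1, 2.a and 4; the
# exact fitting procedure used in the paper is an assumption here.

rs_values = [1, 2, 4, 8, 16]                            # reservation stations
cpi_dynamic = [0.5206, 0.4634, 0.4209, 0.4016, 0.3941]  # Table 2.a, 4-issue
cpi_static = 0.4991                                     # Table 1, Ideal, 4-issue

# Ordinary least squares for CPI(Rs) = a * (1/Rs) + b.
x = [1.0 / rs for rs in rs_values]
n = len(x)
sx, sy = sum(x), sum(cpi_dynamic)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, cpi_dynamic))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print(f"fit: CPI(Rs) = {a:.5f} (1/Rs) + {b:.5f}")  # Table 4: 0.14399 / 0.38203

# Table 3.a entries are dynamic/static mean-time ratios as percentages
# (values above 100 mean the dynamic processor is slower).
for rs, cpi in zip(rs_values, cpi_dynamic):
    print(f"{rs:2d} Rs: {100 * cpi / cpi_static:.1f} %")  # 104.3 ... 79.0

# Table 5 is the same ratio in the limit Rs -> infinity, i.e. using the
# published Table 4 intercept as the asymptotic CPI.
print(f"ideal speed up: {100 * 0.38203 / cpi_static:.1f} %")  # 76.5
```

Because CPI(Rs) is modelled as a hyperbola in Rs, the intercept b bounds the benefit obtainable by adding reservation stations, which is exactly what Table 5 reports.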
Captions to Illustrations

Figure 1. Statically scheduled processor structure.
Figure 2. I/O subsystem simulator diagram.
Figure 3. Dynamically scheduled processor structure.
Figure 4. Dynamically scheduled processor and system memory simulator diagram.
Figures 5a. Quadratic adjustment, 5b. Cubic adjustment and 5c. Inverse adjustment.

Influence of Input/Output Operations on the Performance of Dynamically Scheduled Processors
José María Rodríguez Corral, Carlos Rioja del Río, Antón Civit Balcells, Arturo Morgado Estévez, Fernando Pérez Peña

[Figure 1: block diagram showing an instruction queue, decoding and issue logic, a register file, and functional units 1 to K connected through a result bus.]

[Figure 2: I/O subsystem simulator modules MAIN, CHOOSCEN, SIMULATE, WRITE_FILE, CHRONO and MARK.]

[Figure 3: block diagram showing an instruction queue, decoding and issue logic, a register file, and reservation stations in front of functional units 1 to K, all connected through a common data bus.]

[Figure 4: simulator modules MAIN, INIDATA, READFILES, PIPELINE and CHRONO; pipeline stages IF_STAGE, QUEUE, IS_STAGE, JUMP, EX_STAGE, DATACACHE, WR_STAGE and END_STAGE; support modules INSTRCACH, RSFREE, ACTCDB and TACCMEM.]

[Figures 5a, 5b and 5c: quadratic, cubic and inverse adjustment plots.]
