ICProc

Abstract— Nowadays, computers are frequently equipped with peripherals that transfer large amounts of data between themselves and the system memory using direct memory access techniques (e.g. digital video cameras, high speed networks, ...). Those peripherals prevent the processor from accessing system memory for significant periods of time (i.e. while they are communicating with system memory in order to send or receive data blocks). In this paper we study the negative effects that I/O operations from computer peripherals have on the performance of dynamically scheduled processors. With the help of an object-oriented tool (DESP-C++) used to build discrete event simulators, we have developed configurable software that simulates a computer processor and main memory as well as the I/O scenarios where the peripherals operate. This software has been used to analyse the performance of four different processors in four I/O scenarios: video capture, video capture and playback, high speed network and serial transmission.

1. INTRODUCTION
In the last two decades processor architecture has experienced great advances. After sequential processors, pipeline processors [1][2] were designed, and then superscalar, superpipeline and VLIW processors [1][3]. Furthermore, instruction sets have evolved with the development of new processor architectures, and these facts have led to the Reduced Instruction Set Computer (RISC) processors [1][2]. At present, multicore processors [4][5] are becoming more frequent in the computer industry. A multicore processor with two or more low-clock-speed cores is designed to provide excellent performance while minimizing power consumption and delivering lower heat output than configurations that rely on a single high-clock-speed core.

On the other hand, I/O devices have become faster and faster in order to meet the requirements of today's users, such as high resolution graphics, high quality sound and video playback or high speed data transmission. Furthermore, new system buses [6][7][8][9][10] have been developed with a higher bandwidth in order to improve data transfer speeds among computer devices (i.e. processor, memory and I/O peripherals). Finally, memories [1][11][12] and caches [1][13] have also evolved.

However, as peripherals become faster and are able to transfer large blocks of data directly to system memory, the influence of I/O operations on processor performance becomes more important and, thus, it must be seriously analysed. In order to study this negative effect, we will develop two simulation programs [14]: The first one (processor and system memory simulator) will be configured with a set of typical parameters for each type of processor. The second one (I/O subsystem simulator) generates a set of I/O scenarios for the different processor and system memory models selected. Then, once the sample code fragment has been chosen, the simulation result will depend on two variables: the processor type (depending on the issue degree and on whether it is statically or dynamically scheduled) and the I/O scenario used.

2. RELATED WORK
The work described in [15] was a first step in our study of the influence of I/O operations on processor performance. The simulation language SMPL [16], a set of C functions designed for building discrete event simulators [17] in the C language, was chosen in order to develop the processor and system memory simulator as well as the I/O subsystem simulator. Figure 1 shows the basic processor structure: It is a sliding window superscalar statically scheduled 32-bit RISC processor [1][2] with an instruction cache, a non-blocking data cache and a branch target buffer.

In relation to the sample program for the experiments, we chose the code fragment corresponding to the key step in Gaussian elimination [18]. As this code uses double precision variables, it is commonly named DAXPY because of the arithmetic operations that it performs.
    i = 0;
    do {
        y[i] = a * x[i] + y[i];
        i++;
    } while (i < 1000);

For our purposes, we used a generic RISC assembler loosely based on the DLX assembler [1].

     0  SUB   R0, R0, R0      ; i = 0
     1  LD    R1, 0(R0)       ; R1 = #08
     2  LD    R2, 50(R0)      ; R2 = #8000
     3  LDF   F0, 100(R0)     ; F0 = a
     4  LDF   F1, 8000(R0)    ; F1 = x[i]
     5  LDF   F2, 16000(R0)   ; F2 = y[i]
     6  MULF  F1, F1, F0      ; F1 = a * x[i]
     7  ADDF  F2, F1, F2      ; F2 = F1 + y[i]
     8  STF   16000(R0), F2   ; y[i] = F2
     9  ADD   R0, R0, R1      ; i++
    10  SUB   R2, R2, R0      ; R2 = #8000 - i
    11  BRC   R2, 4           ; jump if i < #8000
    12  END                   ; psinstr: end of program
    13  STP                   ; psinstr: stops ID phase
    14  NOP                   ; no operation

As our first processor and system memory simulator [15] modelled a statically scheduled processor, we unrolled and reordered the loop instructions in order to eliminate data dependences and increase the amount of parallelism between instructions in each iteration [1]. These dependences decreased the performance of the simulated processor to such an extent that no valid conclusions could be drawn from the results obtained.

The I/O scenarios complete the system to simulate, as processor access to main memory will be blocked during the data transfers between the peripherals of a particular I/O scenario and the memory. The I/O subsystem simulator is provided with the selected scenario parameters, so it generates a memory block trace and, optionally, an interrupt trace if there are peripherals that must warn the processor about events related to their operation using IRQs. The I/O scenarios in which the processor and system memory simulator ‘execute’ the sample program are: video capture, video capture and playback, high speed network and serial transmission.

Figure 2 shows the I/O subsystem simulator block diagram. The main() function initializes all the program data structures and requests the trace file name and the simulation time. Next, chooscen() asks the user to choose the specific I/O scenario to simulate as well as its parameters. The simulation loop resides in the simulate() function, which generates the different events related to peripheral operations. Finally, the mark() auxiliary function is called by simulate() in order to register the initial and final instants of each memory block and each interrupt in the data structure that supports the traces. The memory block and interrupt traces are written by write_file(), and the simulation timetable is presented on the monitor by chrono().

There are also other research works on this subject. For example, in [19] the impact of the PCI bus load on the performance of a PC processor is analysed from an empirical point of view (with a Pentium PC and five identical ATM network cards for generating the bus load) and from a mathematical one (using an abstract instruction mix for calculating the impact of the external bus load on the execution time of an application). However, in our study we have adopted a simulation based approach, and we have considered the influence of some processor microarchitectural characteristics, such as instruction issue and functional units.

On the other hand, a method for reducing the influence of the PCI bus on the execution time of real-time software is presented in [20]. This method achieves a more deterministic behaviour for the accesses from a processor through the PCI bus: when switching to real-time context, the relevant devices can be configured so that only the Host/PCI bridge is allowed to become a bus master (an “initiator” using PCI bus [6] terminology), so that all the system peripherals operate exclusively as slave devices. Thus, the DMA transfers of PCI devices are postponed in order to obtain more deterministic execution times for real-time software.

3. SIMULATOR DESIGN
The selection of a simulation tool [21][22], whether a programming language, a simulation language or a simulator, is a decision of great importance, as it conditions the difficulty of the subsequent work as well as the obtaining of results. We have chosen the simulation language DESP-C++ [23], a discrete-event [17] random simulation engine based on the C++ programming language [24], since this type of simulation seems more suitable than continuous simulation due to the nature of the system to model, based on synchronous sequential circuits (main processor, DMA controllers and peripherals), where the time unit is the clock cycle.

There also exist processor simulators with many interesting features: The SimpleScalar toolset [25][26][27] was written in 1992 at the University of Wisconsin, and it was released in 1995 as an open source distribution freely available to academic users. It has since become very popular among researchers and instructors in the computer architecture research community. SimpleScalar infrastructure components implement many common modelling tasks [26], such as instruction-set simulators, I/O emulation, discrete-event management and modelling of common micro-architectural components (e.g. branch predictors, instruction queues and caches).

However, we have preferred to develop a fully custom-made tool using a simulation language as a first step in our research work (i.e. a prototype) and a more direct way
to obtain results. In the future, we want to develop a new version of our simulator using SimpleScalar. DESP-C++ complies with a significant number of requirements [23] that are relevant when choosing a simulator or a simulation language [21][22] and that recommend its use, such as flexibility and simplicity, validity of the simulation results, efficiency, compactness, portability and extensibility, object-oriented approach and use of the Standard Template Library (STL) [28], statistical facilities, random number generator and independent replication generator.

The model of the dynamically scheduled processor (figure 3) adds to the model of the statically scheduled one [15] those aspects (figure 4) related to the reservation stations and the common data bus (CDB) [1][29][30]. In particular, when issuing an instruction to a functional unit (IS stage), there must be at least one free reservation station for that functional unit. Also, when a functional unit provides the result of an operation, it must be propagated through the CDB in order to be written into the corresponding processor destination register and to be read by those reservation stations whose instructions need it for their execution. Finally, this new version of the processor and system memory simulator also comprises the statically scheduled processor model (figure 1) in order to allow the execution of the corresponding simulation experiments (table 1).

In order to choose the different input parameters which will configure the processor model for the simulations, we have selected a set of typical values after reviewing the features of various well-known processors [31][32][33][34][35]. These parameters are:

• Frequency of operation: 1 GHz [31][33].
• System memory access time: 2.5 ns for a DDR400 SDRAM at 200 MHz [1][11].
• Hit rate for the instruction and the data caches: 0.95 for the instruction cache and 0.9 for the data cache [1][13].
• Line size for the instruction and the data caches: 32 bytes (capacity for eight 32-bit instructions) for the instruction cache and also 32 bytes for the data cache [34][35].
• Number of consecutive misses for the data cache without blocking [1]: four outstanding misses is an acceptable value [33].
• Code prefetch queue size: 32 bytes (capacity for eight 32-bit instructions) [35].
• Latency of floating point (FP) units [1][33][34]: Addition (ADDF): 3 clocks. Product (MULF): 6 clocks. Division (DIVF): 21 clocks.
• Latency of load/store units: One clock for calculating the effective access address and another clock for accessing the data cache [1][34].

Finally, we have chosen four types of processors for carrying out the experiments. Thus, for each I/O scenario four results will be obtained, one for each processor type.

• 4-issue superscalar processor with sufficient resources: twelve FP addition units (12 ADDF), twenty four FP product units (24 MULF), eight load/store units (8 MEM) and four integer units (4 EX).
• 4-issue superscalar processor with limited resources: six FP addition units (6 ADDF), twelve FP product units (12 MULF), four load/store units (4 MEM) and two integer units (2 EX).
• 2-issue superscalar processor with sufficient resources: six FP addition units (6 ADDF), twelve FP product units (12 MULF), four load/store units (4 MEM) and two integer units (2 EX).
• 2-issue superscalar processor with limited resources: three FP addition units (3 ADDF), six FP product units (6 MULF), two load/store units (2 MEM) and one integer unit (1 EX).

As well as being redesigned using DESP-C++, the processor and system memory simulator has been extended with the dynamic scheduling feature. So, the number of reservation stations [1][29][30] must be supplied as an additional parameter when simulating the operation of a dynamically scheduled processor, and it will be the same for all the processor functional units. The concrete values selected for the simulation experiments are 1, 2, 4, 8 and 16 reservation stations [36][37].

4. SCENARIO DESIGN
In this section, we address the questions related to the I/O scenarios where our processor model will ‘execute’ the sample program instructions (of course, we mean a simulated execution). The I/O scenario generator software has the same structure (figure 2) as its predecessor [15], and it works in a similar way. However, it has been rewritten using the simulation engine DESP-C++ and an object-oriented design methodology.

The I/O scenarios in which the processor and system memory simulator will ‘execute’ the sample program are updated versions of those used for the experiments performed in [15]: video capture, video capture and playback, high speed network and serial transmission.

• Video capture: A digital video camera sends frames in high definition format (HDTV) [38] to a computer through a channel with an adequate bandwidth (IEEE bus 1394 [39]). As frame data arrive in main memory, they are stored on the system hard disk. General scenario attributes are:

- Frame resolution is 1920 x 1080 pixels with 24-bit colour.
- Vertical frequency is 30 frames/sec.
- Hard disk transfer rate (sustained bandwidth) is 115 Mbytes/sec [1][40].
- The camera adapter and the hard disk controller are both connected to the system through a PCI Express 3.0 x1 bus [9].

• Video capture and playback: It is similar to the previous scenario, with the only difference that there is an additional peripheral. When a data block arrives to main

[Equation (2), displaced from a later page and only partly recoverable: $S^2 = \sum (X_i - \bar{X})^2$, with an N and a 1 whose positions in the original layout are lost.]
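The general attributes of the video capture scenario listed above (1920 x 1080 pixels, 24-bit colour, 30 frames/sec) determine the raw size of a frame and the raw data rate of the stream. The following is a minimal sketch of that arithmetic, assuming uncompressed frames with no blanking or protocol overhead; this assumption, and the names `frame_bytes` and `stream_bytes_per_second`, are illustrative and are not taken from the simulator.

```cpp
#include <cstdint>

// Scenario attributes quoted in the text.
constexpr std::uint64_t kWidthPixels     = 1920;
constexpr std::uint64_t kHeightPixels    = 1080;
constexpr std::uint64_t kBitsPerPixel    = 24;
constexpr std::uint64_t kFramesPerSecond = 30;

// Bytes in one frame.
// Assumption: frames are uncompressed and carry no extra overhead.
constexpr std::uint64_t frame_bytes() {
    return kWidthPixels * kHeightPixels * kBitsPerPixel / 8;
}

// Raw bytes per second produced by the camera under the same assumption.
constexpr std::uint64_t stream_bytes_per_second() {
    return frame_bytes() * kFramesPerSecond;
}
```

Under these assumptions, `frame_bytes()` is 6,220,800 bytes and `stream_bytes_per_second()` is 186,624,000 bytes per second. How this raw figure relates to the 115 Mbytes/sec disk transfer rate depends on details, such as compression, that the scenario description does not give.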
Influence of Input/Output Operations on the Performance of Dynamically Scheduled Processors

José María Rodríguez Corral, Carlos Rioja del Río, Antón Civit Balcells, Arturo Morgado Estévez, Fernando Pérez Peña

[Table fragment]
Serial      1 Rs     4 Rs     16 Rs
4-issue     107.7    84.4     78.6

Captions to Illustrations
Figure 1. Statically scheduled processor structure.
Figure 3.
[Diagram labels recovered from Figures 1 and 3: instruction queue, decoding and issue, register file, reservation stations, functional units (1 to N and 1 to K), result bus. Other stray labels: MAIN, TACCMEM, MARK.]
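Section 3 states two rules for the dynamically scheduled model: an instruction can be issued to a functional unit only if that unit has a free reservation station, and a result placed on the CDB is read by every reservation station that is waiting for it. The sketch below illustrates both rules only. The types `ReservationStation` and `FunctionalUnit`, the integer tags and the method names are hypothetical and do not reproduce the simulator's DESP-C++ code.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical reservation station entry (illustrative only).
struct ReservationStation {
    bool busy = false;
    std::optional<int> waiting_tag;  // tag of the result this entry waits for
    double operand = 0.0;            // value captured from the CDB
};

// Hypothetical functional unit with a fixed number of reservation stations.
class FunctionalUnit {
public:
    explicit FunctionalUnit(std::size_t stations) : stations_(stations) {}

    // IS-stage rule: issue succeeds only if a free station exists.
    bool issue(std::optional<int> waiting_tag) {
        for (auto& rs : stations_) {
            if (!rs.busy) {
                rs.busy = true;
                rs.waiting_tag = waiting_tag;
                return true;
            }
        }
        return false;  // no free station: the instruction must wait
    }

    // CDB broadcast: every busy station waiting on `tag` captures `value`.
    // Returns how many stations captured it.
    std::size_t capture(int tag, double value) {
        std::size_t captured = 0;
        for (auto& rs : stations_) {
            if (rs.busy && rs.waiting_tag == tag) {
                rs.operand = value;
                rs.waiting_tag.reset();
                ++captured;
            }
        }
        return captured;
    }

private:
    std::vector<ReservationStation> stations_;
};
```

With `FunctionalUnit fu(2)`, two calls to `issue` succeed and a third returns `false`, which corresponds to the stall that the number of reservation stations is meant to control in the experiments. Releasing a station after execution is left out to keep the sketch short.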