Advanced Microprocessors (SEM - VII) Comp PDF

Uploaded by

Cristian
Copyright
© © All Rights Reserved
For Private Circulation Only

Advanced Microprocessors
BE-Semester-VII-Computer
PART-I

Chapter-1 Overview
Chapter-2 Advanced Intel Microprocessors
Chapter-3 Study of Pentium Family

Notes Prepared By Prof. Faruk Kazi

Printed notes are revised as per the changed question paper pattern and are not suitable for independent reading. Combine them with class notes for better conceptual understanding and exam-oriented preparation.

NOTE: Downloaded from FaaDooEngineers.com

B.E. COMPUTER ENGINEERING, FOURTH YEAR, SEMESTER VII
SUBJECT: ADVANCED MICROPROCESSORS

Lectures: 4 Hrs per week      Theory: 100 Marks
Practical: 2 Hrs per week     Term work: 25 Marks
                              Oral Exam.: 25 Marks

Objective: To study microprocessor basics and the fundamental principles of architecture related to advanced microprocessors.
Pre-requisite: Microprocessors

DETAILED SYLLABUS
1. Overview of new generation of modern microprocessors
2. Advanced Intel Microprocessors
   Protected mode operation of x86 Intel family, study of Pentium, superscalar architecture and pipelining, register set and special instructions, memory management, cache organization, bus operation, branch prediction logic
3. Study of Pentium Family of Processors
   Pentium I, Pentium II, Pentium III, Pentium IV, architectural features, comparative study
4. Advanced RISC Microprocessors
   Overview of RISC development and current systems, Alpha AXP architecture, Alpha AXP implementation and applications
5. Study of Sun SPARC Family
   SPARC architecture, the Super SPARC, SPARC implementation and application
6. Standard for Bus Architecture and Ports
   EISA, VESA, PCI, SCSI, PCMCIA cards and slots, ATA, ATAPI, LPT, USB, AGP, RAID
7. System Architecture for desktop and server based systems
   Study of memory subsystems and I/O subsystems, integration issues

BOOKS
Text Books:
1. Daniel Tabak, "Advanced Microprocessors", Tata McGraw Hill
2. Barry Brey, "The Intel Microprocessors, Architecture, Programming and Interfacing"
3. Tom Shanley, "Pentium Processor System Architecture", Addison Wesley Press
References:
1. Ray and Bhurchandi, "Advanced Microprocessors and Peripherals", TMH
2. James Antonakos, "The Pentium Microprocessor", Pearson Education
3. Badri Ram, "Advanced Microprocessors and Interfacing", TMH
4. Intel Manuals

TERM WORK
1. Term work shall consist of at least 10 practical experiments and two assignments covering the topics of the syllabus.

ORAL EXAMINATION
An oral examination is to be conducted based on the above syllabus.

BE-Sem-VII-COMP-Advanced Microprocessor Notes By Prof. Faruk Kazi - 9820223893

The major status signals in the LPT standard are:
SLCT: When the printer is selected, this signal goes high.
ACKNLG: When it is low, it indicates that the data character has been accepted and the printer is ready for the next character.
BUSY: If, for some reason, the printer is not ready to receive a character, this signal goes high. An example of such a reason is the printer being out of paper.
PE: This signal goes high if the out-of-paper switch in the printer is activated.
ERROR: This signal goes low for a number of problems in the printer.

RAID
Redundant Array of Independent Drives (or Disks), also known as Redundant Array of Inexpensive Drives (or Disks).
The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive. That is, RAID combines multiple hard disks into a single logical unit. There are two ways this can be done: in hardware and in software. Hardware RAID combines the drives into a logical unit in dedicated hardware, which then presents the drives as a single drive to the operating system.
Software RAID does this within the operating system and presents the drives as a single drive to the users of the system. A quick summary of the most commonly used RAID levels is given below. (You may also look into your COA notes!)
* RAID 0: Striped Set (2 disks minimum) without parity. Provides improved performance and additional storage but no fault tolerance from disk errors or disk failure. Any disk failure destroys the array, which becomes more likely with more disks in the array. A single disk failure destroys the entire array because, when data is written to a RAID 0 drive, the data is broken into "fragments". The number of fragments is dictated by the number of disks in the drive. Each of these fragments is written to its respective disk simultaneously on the same sector. This allows the entire chunk of data to be read off the drive in parallel, giving this type of arrangement huge bandwidth. When one sector on one of the disks fails, though, the corresponding sector on every other disk is rendered useless because part of the data is now corrupted. RAID 0 does not implement error checking, so any error is unrecoverable. More disks in the drive means higher bandwidth, but greater risk of data loss.
* RAID 1: Mirrored Set (2 disks minimum) without parity. Provides fault tolerance from disk errors and single disk failure. Increased read performance occurs when using a multi-threaded operating system that supports split seeks, with a very small performance reduction when writing. The array continues to operate so long as at least one drive is functioning.
* RAID 3 and RAID 4: Striped Set (3 disks minimum) with Dedicated Parity. Each parity bit is the XOR of the corresponding bits on the data disks, so the contents of any one failed disk can be recomputed from the remaining disks. Provides improved performance and fault tolerance similar to RAID 5, but with a dedicated parity disk rather than rotated parity stripes.
  The single parity disk is a bottleneck for writing, since every write requires updating the parity data. One minor benefit is that the dedicated parity disk allows the parity drive to fail and operation will continue without parity or performance penalty.
* RAID 5: Striped Set (3 disks minimum) with Distributed Parity. Distributed parity requires all but one drive to be present to operate; a drive failure requires replacement, but the array is not destroyed by a single drive failure. Upon drive failure, any subsequent reads can be calculated from the distributed parity such that the drive failure is masked from the end user. The array will have data loss in the event of a second drive failure and is vulnerable until the data that was on the failed drive is rebuilt onto a replacement drive.
* RAID 6: Striped Set (4 disks minimum) with Dual Distributed Parity. Provides fault tolerance from two drive failures; the array continues to operate with up to two failed drives. This makes larger RAID groups more practical. This is becoming a popular choice for SATA drives as they approach 1 terabyte in size, because the single-parity RAID levels are vulnerable to data loss until the failed drive is rebuilt. The larger the drive, the longer the rebuild will take. Dual parity gives the array time to rebuild onto a large drive while retaining the ability to sustain another drive failure.

Subjects For SEM VIII: ROBOTICS & SYSTEM SECURITY [email protected]

Chapter 1 - Overview

DEC07/10M: Compare x86 processors - 8086 to Pentium
(Please refer class notes of first lecture for complete answer)

8086
  Buses: 16-bit data and 20-bit address bus
  Architecture: CISC with separate bus interface and execution units. Six-byte instruction queue.
  Processing: Fetch and execution cycles overlap.
  Clock speed: [illegible] MHz with 0.33 MIPS; 10 MHz with 0.75 MIPS
  Number of transistors: 29,000 at 3 µm
  Addressable memory: 1 MB

80186
  Refer class notes

80286
  Buses: 16-bit data and 24-bit address bus
  Architecture: Added protected-mode features to 8086 with essentially the same instruction set. Included memory protection hardware to support multitasking operating systems with per-process address space.
  Clock speed: 10 MHz with 1.5 MIPS; 12.5 MHz with [illegible] MIPS
  Number of transistors: 134,000 at 1.5 µm
  Addressable memory: 16 MB physical

80386 DX (also read about its other versions)
  Buses: Complete 32-bit microprocessor
  Architecture: Real and protected mode with 32-bit memory management.
  Processing: 3-stage pipeline as fetch, decode and execute. Reduced instruction [illegible].
  Clock speed: 16 MHz with 5 MIPS; 25 MHz with 8.5 MIPS; 33 MHz with [illegible] MIPS
  Number of transistors: 275,000 at [illegible] µm

Differential SCSI is a great idea in theory, and one might have thought it would become very popular. In fact, this never happened in the PC world, largely due to cost. The circuits needed to drive differential signals are more expensive and use more power than those for single-ended SCSI. For many years, single-ended SCSI was "good enough", and allowed cable lengths sufficient for the needs of most users.

PCMCIA CARDS AND SLOTS
* PCMCIA stands for Personal Computer Memory Card International Association, an international standards body and trade association based in San Jose, California, founded in 1989.
* It is a standard for peripherals whose size is that of a credit card.
* PCMCIA cards are also called PC cards.
* The card size is approximately 2 inches wide and 3 inches long.
* The thickness varies from about 1/8 inch (3.3 mm) to about 5/8 inch (16 mm) depending on its type.
* Originally, the standard was developed for removable memory cards for portable computers. Nowadays the standards are available for a variety of peripheral devices such as fax, modem, SCSI adapter, Ethernet adapter, disk drives etc.
* The standard specifies the physical design of cards, the physical design of the connector, the electrical interface to cards etc. PCMCIA cards are becoming standard features on portable and desktop computers.
* The PCMCIA slot supports hot insertion, which allows devices to be plugged and unplugged without switching off the power supply to the computer.
* The types of PCMCIA cards are Type I to Type IV.
* The Type I card is 3.3 mm thick, has a 68-pin connector and supports actual memory cards such as DRAM or flash memory (for example ATA Type I Flash Memory Cards).
* The Type II card is 5 mm thick and has a 68-pin connector. It can interface fax, cellular modem, LAN adapter, wireless LAN adapter, SCSI adapter etc.
* The Type III card is 10.5 mm thick and has a 68-pin connector. It is used for hard disk drives up to [capacity illegible] GB.
* The Type IV card is 16 mm thick and was developed by Toshiba for removable hard disks.

LPT - Line PrinTer - Parallel port
A parallel port can be used to interface printers, CD-ROM drives, external hard disk drives, ZIP drives or other mass storage devices. This technique makes the devices slower than if they were directly connected to the PC's I/O bus via a plug-in card, but it is convenient for hooking up external peripheral devices to the parallel port of a PC. IEEE 1284 is a standard for peripheral devices to be connected to a parallel port. It allows five modes of operation.

Centronics Standard
Centronics is the name of a company. It developed a standard for interfacing a printer to a parallel port. It used a 36-pin interface, but the 36-pin connector is bulky.
IBM preferred to use a 25-pin connector for the printer interface to a parallel port and named it the LPT port.

The major control signals in the LPT standard are:
INIT: It directs the printer to perform its internal initialization sequence.
STROBE: It tells the printer that there is a character for it.
SLCTIN: When this signal is low, data can be entered into the printer.
AUTOFEED XT: When this signal is low, the paper is automatically fed one line after printing.

80486
  Buses: 32-bit address as well as data bus
  Architecture: Onboard 8 KB cache and FPU.
  Processing: 5-stage pipeline
  Ratio (column header illegible in source): 1:1
  Clock speed: 25 MHz with [illegible] MIPS; 33 MHz with 27 MIPS; 50 MHz with 41 MIPS
  Number of transistors: 1.2 million at 1 µm; the 50 MHz version was at 0.8 µm

Pentium-I
  Buses: 64 bits
  Architecture: Superscalar, split cache. Redesigned FPU with 8-stage pipeline. Branch prediction logic.
  Processing: Dual pipelines allow two instructions to be executed simultaneously.
  Ratio (column header illegible in source): 2:1
  Clock speed: 60 MHz with 100 MIPS; [further entries illegible]
  Number of transistors: 3.1 million

P6 (refer 3rd chapter notes)
  Buses: 64 bits
  Architecture: Three instruction decoders with 12-stage pipeline. Out-of-order execution with on-board level one and level two cache.
  Processing: Instructions are broken into micro-ops that move through the pipeline in a fetch/decode, dispatch/execute, retire sequence.
  Ratio (column header illegible in source): 3:1
  Clock speed: 300-500 MHz

Table 1.1: Comparison of Intel family

Note: Clock frequency and MIPS are dependent on processor versions. Pentium-I has three basic versions, P5, P54 and P54C, of which P5 is considered here for comparison. For memory hierarchy and its calculations, please refer class notes of first lecture.
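
The addressable-memory figures quoted in Table 1.1 follow from the address-bus width: an n-bit address bus can select 2^n byte locations. The short Python sketch below works this out for the 20-, 24- and 32-bit buses mentioned in these notes. The function names are illustrative and not from the notes.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of byte locations an address bus of the given width can select."""
    return 1 << address_bits  # 2 ** address_bits


def human(n: int) -> str:
    """Format a power-of-two byte count using binary units (1 KB = 1024 B)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"


# Address-bus widths as given in these notes.
for name, bits in [("8086", 20), ("80286", 24), ("80386 DX", 32)]:
    print(f"{name}: {bits}-bit address bus -> {human(addressable_bytes(bits))}")
```

This reproduces the 1 MB and 16 MB entries of the table and the 4 GB physical address space of the 80386.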
Common features of all advanced microprocessors:
* 32 or 64 bit microprocessor
* Wider data bus: double the ALU size
* On-chip FPU
* On-chip MMU
* Dedicated L1 code cache
* Dedicated L1 data cache
* Branch Prediction Logic (BPL)
* Superscalar
* Multiple stage pipeline

REQ (REQUEST): A signal driven by a target to request a REQ/ACK data transfer handshake.
ACK (ACKNOWLEDGE): A signal driven by an initiator to acknowledge a REQ/ACK data transfer handshake.
ATN (ATTENTION): A signal driven by an initiator to indicate the Attention condition (the initiator has a message for the target).
RST (RESET): A signal that indicates the Reset condition.
DB(7-0,P) (DATA BUS): Eight data-bit (DB) signals, plus a parity-bit signal, that form a Data Bus. DB(7) is the most significant bit. Bit number, significance, and priority decrease downward to DB(0). A data bit is defined as one when the signal value is true and defined as zero when the signal value is false. Data parity DB(P) shall be odd, but parity is undefined during the Arbitration phase.

Single-Ended (SE) and Differential (High Voltage Differential, HVD) SCSI
Conventional SCSI signaling is very similar to that used for most other interfaces and buses within the PC. Conventional logic is used: a positive voltage is a "one", and a zero voltage (ground) is a "zero". This is called single-ended signaling, abbreviated SE. Up until recently, single-ended SCSI had been by far the most popular signaling type, for a simple reason: it is relatively simple and inexpensive to implement.
There is an important problem with SE signaling, however. SCSI is a high-speed bus capable of supporting multiple devices, including devices connected both inside and outside the PC. As with all high-speed parallel buses, there is always a concern about signal integrity on the bus: problems can arise due to bouncing signals, interference, degradation over distance and cross-talk from adjacent signals.
The faster the bus runs, the more these problems manifest themselves; the longer the cable, the more the problems exist for any given interface speed. As a result, the length of a single-ended SCSI cable is rather limited, and the faster the bus runs, the shorter the maximum allowable cable length.
To get around this problem, a different signaling method was also defined for SCSI, which uses two wires for each signal that are mirror images of each other. For a logical "zero", zero voltage is sent on both wires. For a logical "one", the first wire of each signal pair contains a positive voltage, similar to the signal on an SE bus, but not necessarily at the same voltage. The second wire contains the electrical opposite of the first wire. The circuitry at the receiving device takes the difference between the two signals sent, and thus sees a relatively high voltage for a one, and a zero voltage for a zero. It is called differential signaling, after the technique used by the recipient to determine the value of each signal. The two signals in each pair are usually named with "+" and "-" signs; for example, the signal carrying data bit 0 would use "+DB(0)" and "-DB(0)".
The table below shows the great difference in cable length that exists between SE and differential devices, particularly as bus speed increases:

Signaling Speed   Bus Speed (MHz)   Single-Ended SCSI Maximum Cable Length (m)   Differential SCSI Maximum Cable Length (m)
Slow              5                 6                                            25
Fast              10                3                                            25
Fast-20           20                1.5                                          25

As you can see, each doubling of the bus speed results in a halving of the maximum cable length for single-ended SCSI, but differential SCSI allows long (25 m) cables for all three speeds.
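
To see why taking the difference helps, the following minimal Python sketch models one signal pair as described above and adds the same noise voltage to both wires. The drive voltage `V_HIGH`, the thresholds and all function names are assumptions chosen for illustration; they are not values from the SCSI standard.

```python
V_HIGH = 2.5  # assumed drive voltage in volts (illustrative, not a SCSI spec value)


def drive_differential(bit):
    """Return (plus, minus) wire voltages: both zero for 0, mirror images for 1."""
    return (V_HIGH, -V_HIGH) if bit else (0.0, 0.0)


def receive_differential(plus, minus):
    """Decode a bit from the difference between the two wires of a pair."""
    return 1 if (plus - minus) > V_HIGH else 0


def receive_single_ended(wire):
    """Decode a bit on a single wire against a fixed mid-level threshold."""
    return 1 if wire > V_HIGH / 2 else 0


def add_noise(voltages, noise):
    """Add the same (common-mode) noise voltage to every wire."""
    return tuple(v + noise for v in voltages)


noise = 2.0  # volts of interference picked up equally by both wires
for bit in (0, 1):
    plus, minus = add_noise(drive_differential(bit), noise)
    print("sent", bit,
          "| differential receiver reads", receive_differential(plus, minus),
          "| single-ended receiver reads", receive_single_ended(plus))
```

In this toy model the noise cancels in the subtraction, so the differential receiver recovers both bits, while a receiver that looks at one wire only misreads the noisy zero as a one. Real cables are more complicated, but this is the effect that allows the longer differential cable lengths in the table.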
Table 1.2: Generations of Intel processors

P1  8086: 16-bit registers and data bus, real mode only
    8088: Same as 8086 with 8-bit external data bus
P2  80286: Added protected mode
P3  80386DX: Introduced IA-32, 32-bit registers and buses, added virtual 8086 mode
    80386SX: Same as 80386DX with 16-bit external data bus
P4  80486DX: Same as 80386DX with integrated FPU and L1 cache
    80486SX: Same as 80486DX without coprocessor
    80486DX2 and 80486DX4: Same as 80486DX with faster (2x or 3x) internal clock
P5  Pentium Classic: Superscalar architecture, dual instruction pipelines, 64-bit external data bus, branch prediction
    Pentium MMX: Same as Classic with support for MMX operations
P6  Pentium Pro: Dynamic execution, L2 cache in same package, no MMX
    Pentium II: Same as Pro with new cartridge package, MMX support
    Celeron: Same as Pentium II but no integrated L2
    Pentium III: Same as Pentium II with SSE support
    Pentium 4: NetBurst architecture
P7  Itanium (released May 29, 2001): IA-64, 64-bit registers, 128-bit instruction bundles with explicit parallelism, 128-bit data bus, 64-bit address bus. It has 16 KB of Level 1 instruction cache and 16 KB of Level 1 data cache. The L2 cache is unified (both instruction and data) and is 256 KB. The Level 3 cache is also unified and varies in size from 1.5 MB to 24 MB.

There is also a SCSI command called SCSI Reset, which forces the SCSI bus to go immediately to the BUS FREE phase.

[Figure 6.7: Typical SCSI Configuration]

SCSI Bus Signals:
BSY (BUSY): A signal which indicates that the bus is being used.
SEL (SELECT): A signal used by an initiator to select a target, or by a target to reselect an initiator.
C/D (CONTROL/DATA): A signal driven by a target to indicate whether control or data information is on the data bus. True indicates control.
I/O (INPUT/OUTPUT): A signal driven by a target to control the direction of data movement on the data bus. True indicates input to the initiator. This signal is also used to distinguish between selection and reselection phases.
MSG (MESSAGE): A signal driven by a target during the Message phase.

* Clock Generator
* 2 Independent DMA Channels
* 3 Programmable 16-Bit Timers
* Dynamic RAM Refresh Control Unit
* Programmable Memory and Peripheral Chip Select Logic
* Programmable Wait State Generator
* Local Bus Controller
* System-Level Testing Support
* Direct Addressing Capability to 1 Mbyte Memory and 64 Kbyte I/O
* Supports Intel 80187 Numeric Coprocessor Interface

SCSI Bus Phases
The SCSI bus can be time-shared, which results in greater usage of bus bandwidth. This is how it works: while one device is using the bus, other devices may be active and performing internal activities. Devices do not use the bus unless they are involved in data transfer or have status to report. Devices may disconnect from the bus while time-consuming activities internal to the device are occurring. As soon as a device is ready to resume communication, the device can arbitrate for the bus (when the bus is free) to reattach to the host. System performance is significantly increased when devices disconnect and reconnect to the bus. During the bus phases (refer figure), devices must first contend for access to the bus. Then a physical path is established between the initiator and target. Remember, the SCSI bus cannot be in more than one phase at a time.
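
The contention for the bus and the role of the C/D, I/O and MSG signals can be sketched in a few lines of Python. This is a simplified model and not an implementation of the bus protocol: it assumes an 8-bit data bus with SCSI IDs 0 to 7, where each contending device asserts the data bit matching its ID and the highest bit wins. The phase table is the usual SCSI assignment of the three target-driven signals, stated here from general knowledge of the standard rather than from these notes. All names are illustrative.

```python
def arbitration_winner(contending_ids):
    """Return the SCSI ID that wins arbitration: the highest ID bit on the data bus."""
    if not contending_ids:
        raise ValueError("no device is arbitrating")
    data_bus = 0
    for scsi_id in contending_ids:
        if not 0 <= scsi_id <= 7:
            raise ValueError("this sketch assumes an 8-bit bus with IDs 0..7")
        data_bus |= 1 << scsi_id  # each device asserts its own ID bit
    return data_bus.bit_length() - 1  # index of the highest asserted bit


# (MSG, C/D, I/O) -> information transfer phase; 1 means the signal is true.
INFORMATION_PHASES = {
    (0, 0, 0): "DATA OUT",
    (0, 0, 1): "DATA IN",
    (0, 1, 0): "COMMAND",
    (0, 1, 1): "STATUS",
    (1, 1, 0): "MESSAGE OUT",
    (1, 1, 1): "MESSAGE IN",
}


def information_phase(msg, cd, io):
    """Decode the current information transfer phase from the three signals."""
    return INFORMATION_PHASES.get((int(msg), int(cd), int(io)), "RESERVED")


print(arbitration_winner([2, 5, 7]))   # host adapter on ID 7 wins
print(information_phase(0, 1, 0))      # C/D true, I/O and MSG false
print(information_phase(0, 0, 1))      # data moving towards the initiator
```

Since the bus is in exactly one phase at a time, a single lookup on the three signals is enough to name the current information transfer phase.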
Bus Free Phase: The Bus Free Phase is used to indicate that no SCSI device is actively using the SCSI bus and that it is available for subsequent users. SCSI devices shall detect the Bus Free Phase after SEL and BSY are both false.
Arbitration Phase: The Arbitration Phase allows one SCSI device to gain control of the SCSI bus so that it can assume the role of an initiator or target. If no higher priority SCSI ID bit is true on the Data Bus, then the SCSI device has won the arbitration and it asserts SEL. Any other SCSI device that is participating in the Arbitration Phase has lost the arbitration. The SCSI device that won the arbitration has both BSY and SEL asserted.
Selection Phase: The Selection Phase allows an initiator to select a target for the purpose of initiating some target function, for example the Read or Write command. The initiator sets the Data Bus to a value which is the OR of its SCSI ID bit and the target's SCSI ID bit for selection of the target.
Reselection Phase: Reselection is an optional phase that allows a target to reconnect to an initiator for the purpose of continuing some operation that was previously started by the initiator but was suspended by the target. For example, a host system may have requested a Read from a disk. The disk can Disconnect and Reconnect if the Read involves a time-consuming seek operation. This is one of the optimization features of SCSI.
Information Transfer Phases: The Command, Data, Status, and Message Phases are all grouped together as the Information Transfer Phases because they are all used to transfer data or control information via the Data Bus. The C/D, I/O, and MSG signals are used to distinguish between the different Information Transfer Phases. The target drives these three signals and therefore controls all changes from one phase to another.
Command Phase: The Command Phase allows the target to request command information from the initiator.
The target shall assert the C/D signal and negate the I/O and MSG signals during the REQ/ACK handshake(s) of this phase.
Data Phase: The Data Phase is a term that encompasses both the Data In Phase and the Data Out Phase. The Data In Phase allows the target to request that data be sent to the initiator from the target. The Data Out Phase allows the target to request that data be sent from the initiator to the target.
Status Phase: The Status Phase allows the target to request that status information be sent from the target to the initiator.
Message Phase: The Message Phase is a term that references either a Message In or a Message Out Phase. Multiple messages may be sent during either phase. The first byte transferred in either of these phases is either a single-byte message or the first byte of a multiple-byte message. Multiple-byte messages are wholly contained within a single message phase. The Message In Phase allows the target to request that message(s) be sent to the initiator from the target.
Bus Free Phase: Once the bus has gone to the BUS FREE phase, the target is no longer in control of the bus. At this point a device is free to proceed with another SCSI transaction.

MAY06/10 Marks: State versions of 80386. Draw its block diagram and explain.
[Figure 1.2a: Functional Block Diagram of Intel 80386 Processor]

SCSI Standard: (MAY05/4M, NOV05/10M, MAY06/10M, NOV06/20M, MAY07/10M)
The development of the Small Computer Systems Interface (SCSI) was a major step forward in hardware interfaces for "small computers" (as opposed to mainframes and minicomputers).
Interfaces prior to SCSI were not intelligent and were designed for specific devices. Thus there was a hard disk interface for a hard drive, a tape drive interface for a tape drive, and so on. With SCSI, a standard interface was defined for all devices so that only a single adapter was required. The first SCSI standard, referred to as SCSI-1, supported up to seven devices per adapter and was approved in 1986. It had its roots in SASI (Shugart Associates Systems Interface), which was developed by Al Shugart's Shugart Associates in 1979.
SCSI is an intelligent adapter and not a controller. The controller is built on the drive itself. It has a separate I/O bus called the SCSI bus. As its data transfer rate is high, it is connected to the PCI bus. Its latest version, called SCSI-3, can interface up to 15 devices. Each device connected to the SCSI bus is assigned an identification number. The highest identification (ID) number is used by the host adapter. A device designed to be connected to a SCSI bus is called a SCSI device. A wide variety of devices such as hard disk drive, optical disk drive, ZIP drive, tape drive, printer, scanner, graphics tablet etc. can be connected to a SCSI bus. The devices are connected in a daisy-chain fashion, as shown in Fig. 6.5. A flat cable which contains 50 wires runs from the SCSI host adapter card to the SCSI devices. To reduce reflections, the cable is terminated at the end. The terminator terminates lines to the ground through resistors.
SCSI devices, except hard disk drives, can be moved with their data from a host adapter of one computer to another. A hard disk is treated as a new disk on the other computer. This problem does not arise in the case of optical disks and removable hard disks. In these cases the driver software which supports their use has been provided with the features needed for such movement. The SCSI standard allows multiple, independent conversations between SCSI devices to go on simultaneously across a single SCSI bus.
This feature allows a high degree of multiprocessing in a PC with a SCSI bus. For this purpose suitable software is needed.
On the SCSI host adapter card, there are connectors for external SCSI devices as well as internal SCSI devices. SCSI, being costly, is used on servers. Desktop computers use EIDE connectors.

[Fig. 6.5: SCSI Interface (SCSI Device 1, SCSI Device 2 and further devices connected by flat cables)]

SCSI Layered Architecture:
The peripheral interface is made up of many layers. The peripheral interface model with four layers for SCSI is agreed by the American National Standards Institute (ANSI).
The lowest layer is the Physical Interface Layer. It describes the cable and connector types. It also defines signal voltages and current requirements of the drivers used in the interface. The timing specifications and the coordination of all the signals at the interface bus are described in this layer.
Above the physical layer resides the Protocol Layer. The protocol is nothing but the set of rules. The protocol layer gives rules for the exchange of messages between devices connected through an interface. It also describes the use of error correction if the data is corrupted. It defines a data byte and separates it from an instruction.
The Device Model Layer lies on top of the protocol layer. This layer describes the behavior of the device to be connected to the interface. For example, a printer interface may define a printer, page printer or line printer, depending on the interface. These descriptions can be detailed and precise.
The Command Set Layer represents the fourth layer of the interface model. The command set builds upon the device model. It defines the commands that must be understood by the interface devices.
FEATURES OF 80386 (Refer class notes)
* 32-bit microprocessor having 8 general purpose 32-bit registers; supports 8, 16 and 32 bit data types.
* Very large address space: 4 GB physical and up to 64 TB virtual memory support.
* Variable segment size from one byte to 4 gigabytes.
* 4 levels of protection, PL0 - PL3.
* Paging on demand.
* Optimized for multitasking operations.
* Virtual 8086 mode for running 8086 software in a protected and paged system.
* Pipelined instruction execution.
* TLB as address translation cache.
* High speed numeric support via 80287 and 80387 coprocessor.

The 80386 Functional Units: (Diagram to be drawn in exam)

[Figure 1.2b: Functional Block Diagram of Intel 80386 Processor]

The 80386 microprocessor consists of five functional units.
> BUS UNIT: Handles communication with devices external to the microprocessor chip.
> CODE PREFETCH UNIT: Fetches instructions from memory before the microprocessor actually requests them. It has a 16-byte prefetch queue. Code prefetch requests are given a lower priority by the bus unit than the requests from the Execution Unit.
> INSTRUCTION DECODE UNIT: Decodes the instruction prior to passing it to the execution unit for execution. It has a 3-instruction-deep decode queue for use by the execution unit.
> EXECUTION UNIT: Consists of 8 general purpose 32-bit registers, a 64-bit barrel shifter and the ALU. The execution unit executes each instruction received from the decoded instruction queue.
* PIO Modes: ATA includes support for PIO modes 0, 1 and 2.
* DMA Modes: ATA includes support for single word DMA modes 0, 1 and 2, and multiword DMA mode 0.
* The flat ribbon cable has 40 wires in it, and usually has three identical female connectors: one is intended for the IDE controller (or motherboard header for PCs with built-in PCI ATA controllers) and the other two are for the master and slave devices on the interface.
* Flat ribbon cables have no insulation or protection from electromagnetic interference.
* The cable was originally designed for very slow hard disks that transferred less than 5 MB/s, not the high-speed devices of today.
* The main issue is the length of the cable. The longer the cable, the more the chance of data corruption due to interference on the cable and uneven signal propagation; therefore, it is often recommended that the cable be kept as short as possible. According to the ATA standards, the official maximum length is 18 inches.
"Plain" ATA does not include support for enhancements such as ATAPI support for non-hard-disk IDE/ATA devices, block mode transfers, logical block addressing, Ultra DMA modes or other advanced features. Drives developed to meet this standard are no longer made, as the standard is old and obsolete. In fact, at the recommendation of the T13 Technical Committee, ATA-1 was withdrawn as an official ANSI standard in 1999. This is presumably due to its age, and the large number of replacement ATA standards already published by that time.

SATA/PATA:
SATA (Serial AT Attachment) is a high-speed serial version of the IDE [ATA] specification. It uses a 4-conductor cable with two differential pairs [Tx/Rx], plus an additional three ground pins and a separate power pin. Data runs at 150 MBps (1.5 GHz) with 250 mV signal swings. Serial ATA is not compatible with IDE [Parallel ATA, PATA] because the connectors are different, the voltage levels are different, and the data format is different.
SATA sends one bit at a time while PATA sends 16 bits at once. SATA will not interface with the IDE bus, and no cable can be made to connect SATA with IDE. However, a converter may be purchased which translates SATA to PATA.

ATA Packet Interface (ATAPI):
Originally, the IDE/ATA interface was designed to work only with hard disks. CD-ROMs and tape drives used either proprietary interfaces (often implemented on a sound card), the floppy disk interface (which is slow and cumbersome) or SCSI. In the early 1990s it became apparent that there would be enormous advantages to using the standard IDE/ATA interface to support devices other than hard disks, due to its high performance, relative simplicity, and universality. The intention was not to replace SCSI, of course, but rather to get rid of the proprietary interfaces (which nobody really likes) and the slow floppy interface for tape drives.

Unfortunately, because of how the ATA command structure works, it wasn't possible to simply put non-hard-disk devices on the IDE channel and expect them to work. Therefore, a special protocol was developed called the AT Attachment Packet Interface, or ATAPI. The ATAPI standard is used for devices like optical, tape and removable storage drives. It enables them to plug into the standard IDE cable used by IDE/ATA hard disks, and be configured as master or slave, etc., just like a hard disk would be. When you see a CD-ROM or other non-hard-disk peripheral advertised as being an "IDE device" or working with IDE, it is really using the ATAPI protocol.

Internally, however, the ATAPI protocol is not at all identical to the standard ATA (ATA-2, etc.) command set used by hard disks. The name "packet interface" comes from the fact that commands to ATAPI devices are sent in groups called packets. ATAPI in general is a much more complex interface than regular ATA, and in some ways resembles SCSI more than IDE in terms of its command set and operation.
(At the time it was created, SCSI was the interface of choice for many CD-ROM and higher-end tape drives.)

A special ATAPI driver is used to communicate with ATAPI devices. This driver must be loaded into memory before the device can be accessed (most newer operating systems support ATAPI internally and, in essence, load their own drivers for the interface). The actual transfers over the channel use regular PIO or DMA modes, just like hard disks, although support for the various modes differs much more widely by device than it does for hard disks. For the most part, ATAPI devices will coexist with IDE/ATA devices and, from the user's perspective, they behave as if they are regular IDE/ATA hard disks on the channel.

> MEMORY MANAGEMENT UNIT : When the microprocessor must address a memory location, the MMU forms the physical memory address that is driven out onto the address bus by the bus unit during a bus cycle.

MAY08/5M: DIFFERENCE BETWEEN 80386 SX & 80386 DX
Address Bus : Unlike the DX address bus, which consists of A31-A2 and four byte-enable lines, the SX address bus consists of A23-A0 and BHE# (i.e. identical to the 80286).
Data Bus : Like the 80286 microprocessor, the SX has only two data paths, versus four for the DX microprocessor. The 80386 SX therefore has substantially slower throughput than an 80386 DX.

80386 SX : Single eXecution speed, 16-bit data
80386 DX : Double eXecution speed, 32-bit data
80386 operating modes : Real Mode (fast 8086), Protected Mode, Virtual 8086

Intel's 80386 DX can operate in real mode, in protected mode, or in a variation of protected mode called virtual 8086 mode. When the processor is reset or powered up, it is initialized in real mode. The real mode has the same base architecture as the 8086, but allows access to the 32-bit register set of the 80386 DX.
Basically it functions as a fast 8086. This mode is usually used to:
* Initialize the peripheral devices.
* Load the main part of the operating system from disk into memory.
* Load some registers.
* Enable the interrupts.
* Enter into the protected mode.

NOTE (VIVA): There are a few more versions of the 80386, like the EX and SL. The 80386 EX microprocessor is designed for embedded applications that require high integration and low power. Key features include power management, low-voltage operation, and on-chip integration of numerous common peripherals such as interrupt controllers, chip selects, counters and timers.
The 80386SL is basically an 855,000-transistor version of the 386SX processor, with cache, bus, and memory controllers, ISA compatibility and power management circuitry. It added a special system management mode (SMM), in which the BIOS could more easily perform power management and other functions without requiring OS support. The 386SL was the first chip specifically made for portable computers.

Difference between USB and serial port:
An ordinary serial port provided on the back of a PC can connect only one serial device. From practical considerations, two serial and two parallel ports can be provided on the back of a PC. Therefore, up to two serial devices can be connected to the serial ports, and up to two devices designed for parallel data transfer can be connected to the parallel ports. There is no such limitation when USB is used. A dozen USB devices, covering a wide range of input/output devices, can easily be connected to the USB bus.

IDE, EIDE, ATA, ATAPI
IDE stands for Integrated Drive (or Device) Electronics. It is a standard according to which the IDE interface is built. EIDE is Enhanced IDE. IDE was developed to interface hard disk drives.
EIDE can interface hard disk, floppy disk drive, optical disk drive and tape drive. The motherboard of a new PC has two EIDE connectors. Each is an adapter and not a controller; the controller is on the drive itself. Commercial motherboards of PCs provide only two EIDE connectors. From each connector a flat cable runs, and each flat cable provides one channel. Two EIDE devices can be connected to each channel: the cable has two more connectors at the other end, each of which can connect an EIDE (or IDE) device. With two EIDE connectors, two flat cables run and up to four EIDE devices can be connected. EIDE, however, is capable of providing up to 4 channels. Two additional channels, if required, can be provided by adding plug-in cards on the ISA bus. Figure 6.4 shows the EIDE interface.

[Fig. 6.4 EIDE Interface]

IDE drive cost: Because the separate controller or host adapter is eliminated and the cable connections are simplified, IDE drives cost much less than a standard controller-and-drive combination. These drives are also more reliable, because the controller is built into the drive. Therefore, the data separator (the converter between the digital and analog signals on the drive) stays close to the media. Because the drive has a short analog-signal path, it is less susceptible to external noise and interference.

ATA is AT Attachment. It is a standard which specifies how to deal with hard disks over the EIDE channel. ATAPI is ATA Packet Interface. It extends ATA to deal with optical disks and other types of devices on the EIDE channel.

ATA (ATA-1)
The first formal standard defining the AT Attachment interface was submitted to ANSI for approval in 1990. The original IDE/ATA
standard defines the following features:
* Two Hard Disks: The specification calls for a single channel in a PC, shared by two devices that are configured as master and slave.

[Figure 6.2 AGP workstation]

UNIVERSAL SERIAL BUS (USB)
It is a serial bus designed to connect several devices. It can interface a wide variety of peripherals such as monitor, keyboard, mouse, modem, speaker, microphone, scanner, printer, etc. It can handle up to 127 devices.
A USB cable contains four wires: two for supplying electrical power and two for transmitting data and commands. Low-power devices such as keyboard, mouse, etc. can get power from the USB cable, eliminating a bulky power supply. A device which needs a larger amount of power, for example a big loudspeaker, must have a local power supply.
The USB controller assigns each device an identification number and allows devices to communicate with one another.
It has two operating modes: low-speed and medium-speed. In low-speed mode, the data transfer rate is 1.5 Mbps. In medium-speed mode, the data transfer rate is 12 Mbps.
It provides three types of data transfer schemes: isochronous (or real-time), interrupt-driven and bulk data transfer. In the isochronous data transfer scheme, there is no interruption in the flow of data, for example video or sound. In such a case a uniform amount of data must be transferred every second, and fixed amounts of data must be transferred in chunks on a regular schedule.
The USB provides plug-and-play facility.
USB devices can be connected in a daisy-chain fashion, as shown in Figure 6.3. One device is connected to the USB controller. Another device is plugged into the device which has already been connected.
In this way a number of devices can be connected in a daisy chain. Sometimes the resulting chain of cables may branch at some devices. The system treats all the devices alike, as if they were all connected in series or directly at the PC.

[Figure 6.3 USB Connections in Daisy-Chain]

80286 Features: (Same as 8086 except protected mode for multitasking - refer class notes)
* It is a 16-bit microprocessor.
* Its 24-bit address bus gives a 16 MByte address space.
* It has a 16-bit data bus.
* Its prefetch queue is 6 bytes.
* Protected mode operation was first introduced with the 80286.
* It implements a protection mechanism with 4 levels of privilege, PL0 - PL3.
* On-chip support for multitasking.
* Its segment size is variable from 1 byte to 64 KBytes.
* The descriptor structure was first implemented with the 80286.
* It supports virtual memory of 1 GByte (refer class notes for this calculation).

Q: Explain general structure of an advanced microprocessor (Also useful for VIVA)

[Figure 1.4: General structure of Advanced Microprocessors. Blocks shown: Prefetch Unit and Instruction Queue, Decoding Unit, Bus Interface Unit (data, address and control buses), Branch Target Buffer, Control Unit, Cache, Memory Management Unit, internal bus, Integer Unit with integer register file and integer operation units, Floating Point Unit with floating-point register file and floating-point operation units, special function units (e.g. MMX).]
[Fig 6.1 Typical PCI workstation]

ACCELERATED GRAPHICS PORT (AGP)
With the increase of processor speed, the speed of the host bus (or processor bus) is also increasing. When the speed of a processor was about 400-500MHz, the speed of the host bus was 100MHz. Today the speed of a processor is in the range of 700-1000MHz and that of the processor bus is 33-400MHz. The PCI was developed when the processor bus speed was 33MHz. Its present speed remains the same, i.e. 33MHz. Today a new bus standard is needed to cope with the processor bus speed. A new bus, the AGP bus, has been developed to operate at processor bus speed.
Though AGP is called a port, it is actually an expansion slot. It is a new 32-bit bus, specially designed for the video card. Figure 6.2 shows AGP along with the PCI and ISA buses. Its data transfer rate is 528 MB/sec or more. The video card contains a video accelerator, which can access main memory at high speed through the AGP bus and the chipset.
A graphics and video accelerator performs image calculation. It generates and processes pixels, receives commands from the CPU, converts graphics commands into a data stream and keeps it in the local memory. A video accelerator is provided with local memory. The video accelerator includes a digital-to-analog converter (DAC), which receives information from the local memory and controls the intensity of the red, blue and green electron beams.

PCI BUS: (NOV04/5M, NOV05/10M, MAY06/8M, NOV06/10M, MAY07/10M)
+ PCI stands for Peripheral Component Interconnect.
+ It was developed by Intel Corporation in 1992. It is a kind of local bus, which is directly connected to the processor bus. In other words a local bus is an extension of the processor bus.
+ It is a widely used bus architecture.
+ The PCI bus provides plug-and-play facility or auto-configuration.
+ It is a 32-bit bus and can be expanded up to 64 bits if the need arises (Pentium).
+ It operates at 33MHz and its data transfer rate is 130 MB/sec.
+ It is faster than the EISA and MCA buses.
+ Its address and data buses are multiplexed to reduce the size of the connector. In a 32-bit bus, all 32 lines are multiplexed for address and data: at one moment they carry an address and at another moment they carry data.
+ In the case of a 64-bit bus system, 32 address lines are multiplexed with data lines. The remaining 32 lines only carry data.
+ An error checking mechanism is provided for all address and data transfers.
+ It supports reflected-wave switching to reduce power consumption. Hence it is sometimes called GREEN MACHINE.
+ It uses hidden bus arbitration, in which arbitration can be performed even if the PCI bus is not free.

Figure 6.1 shows a workstation system with PCI bus. A PCI slot does not accept 8- or 16-bit ISA cards. Therefore, the ISA bus is also used in combination with the PCI bus, to interface 8- and 16-bit cards. The PCI bus provides plug-and-play facility, which gives the user the ability to insert any hardware peripheral into the system and use it without any configuration or setup. In other words it provides auto-configuration, which enables the peripheral to configure itself, rather than configuration being supplied by the user. The PCI interface contains a number of registers that hold information about the board and allow the computer to automatically configure a PCI card. This auto-configuration feature is called Plug-and-Play.
There is a bridge chip (chipset) between the processor and the PCI bus, which connects the PCI bus to the processor bus. Once a host chipset is included in the system, the processor can access all available PCI peripherals. This makes the PCI bus processor independent. When a new processor is to be used, only the chipset needs to be changed.
PowerPC and Apple's Macintosh systems also use the PCI bus. A PCI controller immediately stores data in a buffer. This allows the CPU to go quickly to the next operation, rather than waiting for the data transfer to complete. The PCI bus is designed to operate without termination (unlike a SCSI bus).
There may be more than one PCI bus. A PCI-to-PCI bridge is available. The second PCI bus is connected to the first PCI bus through the PCI-to-PCI bridge IC.

Chapter 2- Advanced Intel Microprocessors
Notes by Prof. Faruk Kazi

2.1 Protected Mode Operation of X86 Intel Family
The first processor in the 80x86 family was the 16-bit 8086, which was capable of addressing one megabyte of memory, a significant improvement over the 8-bit machines available in the late 1970s. Twenty address lines were provided on the processor to access the 1 MB of memory. The advanced processors that followed the 8086, beginning with the 80286, all contained additional address lines. The 80386, 80486, and Pentium all contain 32 address lines, giving them the ability to access 2^32 bytes, or 4 GB, of memory. This large addressing space allows the advanced Intel microprocessors to perform many operating system chores, such as multitasking, that are difficult, or even impossible, on the 8086.
Beginning with the 80286, the advanced Intel microprocessors all contained the ability to operate in two different modes of operation, real mode and protected mode. In real mode, the advanced processors, including the Pentium, simply operate like a very fast 8086, with the associated 1 MB memory limit. Real mode operation is automatically selected upon power-up. So a Pentium-based PC that boots up into DOS is operating in real mode (DOS is a real mode operating system).
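The relation between the number of address lines and the addressable memory quoted in these notes (20 lines for 1 MB, 24 lines for 16 MB on the 80286, 32 lines for 4 GB) can be checked with a small calculation. The following is only an illustrative Python sketch; the function names `address_space_bytes` and `human_size` are invented for this example and do not come from the notes.

```python
def address_space_bytes(address_lines):
    """Bytes addressable with the given number of address lines (2**n)."""
    return 1 << address_lines


def human_size(num_bytes):
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    index = 0
    # Divide by 1024 while the value stays a whole number of the next unit.
    while num_bytes >= 1024 and num_bytes % 1024 == 0 and index < len(units) - 1:
        num_bytes //= 1024
        index += 1
    return f"{num_bytes} {units[index]}"


if __name__ == "__main__":
    # Address-line counts taken from the notes: 8086, 80286, 80386 and later.
    for cpu, lines in [("8086", 20), ("80286", 24), ("80386/80486/Pentium", 32)]:
        print(cpu, lines, "address lines ->", human_size(address_space_bytes(lines)))
```

Each extra address line doubles the addressable memory, which is why moving from 20 to 32 lines raises the limit from 1 MB to 4 GB.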
In protected mode, the full 4 GB of memory is available to the processor, as are special privileged instructions and many other architectural features, including support for multitasking, virtual memory addressing, memory management and protection, and control over the internal data and instruction cache. The Windows operating system runs in protected mode to take advantage of these improvements. Writing programs that run in protected mode requires special background knowledge of operating systems theory.

66287.77 Hz. This output of the timer is used as a request signal for DRAM refresh in ISA systems. Upon sensing a DRAM refresh request, the DRAM refresh logic asserts the DRQ0 signal of the master DMAC to execute the refresh cycle.
Timer 2 (Speaker Timer): The output of Timer 2 is used to drive the speaker. Timer 2 is also driven from the 1.19318 MHz timebase.

EISA BUS (32 Bit version)
* EISA stands for Extended Industry Standard Architecture.
* It was introduced in 1988.
* It uses 32-bit address lines and 32-bit data lines.
* It is suitable for multiuser systems and is faster than the ISA bus.
* ISA cards can also be inserted into EISA slots. An EISA connector contains two layers of contacts. The top layer contacts are the contacts for additional EISA signals. The connector of an EISA bus is of the same physical size as that of an ISA bus, so that either an ISA or an EISA card can be inserted into the EISA connector slot.
* Though an EISA bus has a 32-bit data bus, its clock speed is only 8 MHz and its data transfer rate is 33 MB/sec. Hence, it is slower than the PCI bus, which operates at 33 MHz.
* The EISA bus is no longer used and has been replaced by the PCI bus.
* As it was expensive, it was used in servers.

MCA BUS
* MCA stands for Micro Channel Architecture.
* It was developed by IBM in 1987.
* It uses 32-bit address lines and 32-bit data lines.
* It does not accept older 8-bit and 16-bit ISA expansion cards.
* MCA and EISA bus architectures are completely incompatible.
* The MCA bus operates at 10MHz and its data transfer rate is 80 MB/sec.
* It was used in IBM's servers.
* At the time of the MCA release, there was already a large established base of products and machines that were ISA compatible. It was also expensive. Due to these reasons it did not gain industry acceptance and was replaced by the PCI bus.

VESA BUS
* VESA stands for Video Electronics Standards Association.
* It was introduced in 1992.
* It is a 32-bit local bus, directly connected to the processor bus.
* It operates at 33MHz, and its data transfer rate is 130 MB/sec.
* It is also called VL (VESA Local) bus.
* It allows up to 3 peripherals.
* It does not provide auto-configuration and has 64-bit expansion capability.
* It was used in combination with the ISA bus. It contained a bus controller to arbitrate between the bus masters and the CPU.

2.2 Study of Pentium: (Architecture & Features)
Pentium Processor Features (Pl. refer class notes for EQ and detailed answer)
The on-chip memory management unit (MMU) of the Pentium is completely compatible with the Intel 386 and 486 CPUs. The Pentium processor (510/60, 567/66 MHz) contains all the features of the Intel 486 CPU. The significant features and additions are the following:
* Improved Instruction Execution Time
* Bus Cycle Pipelining
* Address Parity
* Internal Parity Checking
* Functional Redundancy Checking
* Execution Tracing
* Performance Monitoring
* System Management Mode
* Virtual Mode Extensions
The Pentium processor operates at a very high speed. It also overcomes many performance bottlenecks associated with earlier X86 processors.
This enhanced performance is achieved through its superscalar architecture. The important features of the Pentium architecture are:
* Wider (64-bit) Data Bus: With its 64-bit-wide external data bus (in contrast to the Intel486 processor's 32-bit-wide external bus) the Pentium processor can handle up to twice the data load of the Intel486 processor at the same clock frequency.
* Superscalar Architecture: Dual Instruction Pipeline. The Intel486 processor can execute only one instruction at a time. With superscalar execution, the Pentium processor can sometimes execute two instructions simultaneously.
* Dynamic Branch Prediction Logic: The Pentium processor fetches the branch target instruction before it executes the branch instruction.
* Enhanced Floating Point Unit: The Pentium processor executes individual instructions faster through execution pipelining, which allows multiple floating-point instructions to be executed at the same time.
* Dedicated Instruction and Data Cache: The Pentium processor has two separate 8-kilobyte (KB) caches on chip, one for instructions and one for data, which allows the Pentium processor to fetch data and instructions from the cache simultaneously.
* Write-Back MESI Protocol in Data Cache: When data is modified, only the data in the cache is changed. Memory data is changed only when the Pentium processor replaces the modified data in the cache with a different set of data.

As Figure 2.1 (or Figure 2.2 - any figure can be drawn for the EQ) shows, the Pentium processor is a complex machine with many interlocking parts. At the heart of the processor are the two integer pipelines, the U pipeline and the V pipeline. These pipelines are responsible for executing 80x86

Chapter 6- Standard for Bus Architecture and Ports
Notes by Faruk Kazi

Pl. NOTE: Printed notes are not sufficient for this chapter.
Please refer class notes for better understanding and exam oriented preparation, as usually one out-of-syllabus question comes on this topic.
The following major bus architectures have been developed over the past two decades:
1. ISA Bus (8-bit and 16-bit versions)
2. EISA Bus
3. MCA Bus
4. PCI Bus
5. VESA Bus
6. AGP
Out of these bus architectures PCI, ISA and AGP are used in a modern computer. The other bus architectures have been replaced by the PCI bus architecture.

ISA BUS (8/16 Bit Version)
+ ISA stands for Industry Standard Architecture. It is pronounced as "eye-sa".
+ It was introduced in 1984.
+ It contains 24 address lines and 16 data lines. (8-bit version: 20-bit address and 8-bit data)
+ Since it is a 16-bit bus it does not take full advantage of the 32-bit address bus and 32-bit data bus of a 32-bit microprocessor (for example 80386 and onwards).
+ It operates at 8MHz and its data transfer rate is 8 MB/sec. (8-bit version: 4 MB/sec)
The ISA expansion slot is designed in such a way that an older 8-bit card can also be plugged into it. ISA connector slots are still used in most PCs to connect 8-bit and 16-bit cards, usually alongside other types of expansion slots. The ISA bus is also known as the AT bus because it first appeared with IBM's PC/AT.

ISA Timers: (Pl. refer class notes for ISA DMA and ISA Interrupt)
ISA systems have a free-running clock with frequency 14.31818 MHz. This clock is available on the ISA connector with the name "OSC" and may be used in add-on cards. Its frequency (14.31818 MHz) is four times the television color burst frequency, and therefore after division by 4 it can be used as a clock in video display adapter cards. The divider logic on the system board divides the OSC clock by 12 to provide a driving signal with frequency 1.19318 MHz for Timer 0, Timer 1 and Timer 2.
Timer 0 (System Timer): The system timer, Timer 0, is used as a programmable frequency source. It is a 16-bit down counter. The programmer can load a divisor count in the count register of the timer to get the desired output frequency.
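The divisor relationship described above (output frequency = input clock / count loaded in the count register) can be worked through numerically. Below is a minimal Python sketch, assuming the 1.19318 MHz time base quoted in these notes and the POST counts the notes give for Timer 0 (FFFFH) and Timer 1 (0012H); the name `timer_output_hz` is made up for this illustration and is not part of any real BIOS or hardware API.

```python
# 1.19318 MHz time base as quoted in the notes (14.31818 MHz OSC divided by 12).
TIMER_CLOCK_HZ = 1_193_180


def timer_output_hz(divisor, clock_hz=TIMER_CLOCK_HZ):
    """Output frequency of a 16-bit down counter reloaded with `divisor`."""
    if not 1 <= divisor <= 0xFFFF:
        raise ValueError("divisor must fit in a 16-bit count register")
    return clock_hz / divisor


if __name__ == "__main__":
    # Counts loaded during POST according to the notes.
    print("Timer 0 (count FFFFH):", round(timer_output_hz(0xFFFF), 2), "Hz")
    print("Timer 1 (count 0012H):", round(timer_output_hz(0x0012), 2), "Hz")
```

With these assumed inputs, the Timer 1 result rounds to 66287.78 Hz; the notes quote the same figure as 66287.77 Hz.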
During the POST, the count register is loaded with count FFFFH (65535 decimal). After every clock input the count in the count register is decremented by one, until the count becomes zero. When the count in the count register becomes zero, the count register is automatically reloaded with the original count and the cycle repeats. In ISA systems the output of Timer 0 is used to trigger the IRQ0 interrupt of the 8259 interrupt controller.
Timer 1 (Refresh Timer): The refresh timer, Timer 1, is also used as a programmable frequency source. It is a 16-bit down counter. The programmer can load a divisor count in the count register of the timer to get the desired output frequency. During POST, the count register is loaded with count 0012H (18 decimal), giving the output frequency

A floating-point unit is included on the chip to execute instructions previously handled by the external 80x87 math coprocessors. During execution, the U and V pipelines are capable of executing two integer instructions at the same time, under special conditions, or one floating-point instruction.
The Pentium communicates with the outside world via a 32-bit address bus and a 64-bit data bus. The bus unit is capable of performing burst reads and writes of 32 bytes to memory and, through bus cycle pipelining, allows two bus cycles to be in progress simultaneously.
An 8KB instruction cache is used to provide quick access to frequently used instructions. When an instruction is not found in the instruction cache, it is read from the external data bus and a copy is placed into the instruction cache for future reference. The branch target buffer and prefetch buffers work together with the instruction cache to fetch instructions as fast as possible. The prefetch buffers maintain a copy of the next 32 bytes of prefetched instruction code, and can be loaded from the cache in a single clock cycle, due to the 256-bit wide data output of the instruction cache.
A separate 8KB data cache stores a copy of the most frequently accessed memory data. Since memory accesses are significantly longer than processor clock cycles, it pays to keep a copy of memory data in a fast-reading cache. The data and instruction caches may both be enabled or disabled by hardware or software. Both also employ a translation lookaside buffer, which converts logical addresses into physical addresses when virtual memory is employed.
The Pentium uses a technique called branch prediction to maintain a steady flow of instructions into the pipelines. To support branch prediction, the branch target buffer maintains a copy of instructions in a different part of the program, located at an address called the branch target.

Why not 80586?
In 1993, following their earlier naming conventions, Intel's new fifth-generation chip was expected to be named the 586. However, Intel wanted to be able to register the name of their new processor as a trademark, and since numbers cannot be trademarked, the Pentium was born. Since then the Pentium name has become one of the most widely recognized trademarks throughout the computer world.

+ Implemented using 0.5 micron, 4-layer metal CMOS technology operating at 3.3 volts.
+ Packaged using a 521-pin plastic Ball Grid Array (BGA).
+ High speed memory transfers (1.3 GB/sec).
+ Implements the new Ultra Port Architecture (UPA) interconnect.

Other SPARC Implementations:
The very first SPARC architecture implementations included a CPU chip with a single-ALU IU, a register file (usually 136 registers from the start), PCs, PSR, and a few other control and status registers. The FPU, MMU, and cache had to be realized on separate chips. The original Cypress SPARC CPU was labeled CY7C601.
Cypress followed it up with a subsequent model called HyperSPARC, or CY7C620. The HyperSPARC is a two-issue superscalar, with over 1 million transistors, containing on-chip IU and FPU, operating at 55.5 MHz, with a 64-bit wide data bus. The MMU and cache have to be configured outside of the CPU chip. TI also featured a lower-level, scalar, single-pipeline microprocessor, called MicroSPARC. The MicroSPARC is a 0.8 micron, 800,000-transistor, 5V microprocessor, consuming 3.5 W at 50 MHz. It contains an IU, FPU, and a modest on-chip dual cache: 4 Kbytes code and 2 Kbytes data.

[Figure 2.2: Pentium Architecture]

The floating-point unit of the Pentium maintains a set of floating-point registers and provides 80-bit precision when performing high-speed math operations. This unit has been completely redesigned from the one used inside the 80486 and is also pipelined. The floating-point unit uses hardware in the U and V pipelines to perform the initial work during a floating-point instruction (such as fetching a 64-bit operand), and then uses its own pipeline to complete the operation. Since both integer pipelines are used, only one floating-point instruction may be executed at a time.
Altogether, the Pentium processor includes many features designed to increase performance over earlier 80x86 machines.

and cache organization. This organization uses the term 'set' in a different manner than in the computer literature. In the SuperSPARC a set is a block of data of 4 Kbytes, the page size in this system. The instruction cache contains five such sets for a total of five pages or 20 Kbytes. The data cache contains four such sets for a total of four pages or 16 Kbytes.
Based on this, the instruction cache is said to be five-way set-associative, and the data cache four-way set-associative. Both caches use a pseudo-LRU replacement algorithm. The line size of the instruction cache is 64 bytes, and the line size of the data cache is 32 bytes.
The instruction cache is accessed by a 128-bit fetch path, allowing the fetching of 4 instructions simultaneously. The data cache is accessed by a 64-bit path, allowing transmission of double-precision floating-point data in one bus cycle. The hit rate was reported by Sun to be 98 percent for the instruction cache and 90 percent for the data cache. The cache is physically addressed. The SuperSPARC TLB has 64 entries and is fully associative.

The UltraSPARC-I Processor: MAY08/5M
One of the first implementations of the new SPARC-V9 architecture, the UltraSPARC-I retains complete upwards compatibility with the 32-bit SPARC-V8 specification, ensuring binary compatibility with existing applications. UltraSPARC-I not only provides 64-bit data and addressing, but adds a number of other features to improve operating system and application performance:
+ Nine-stage pipeline; can issue up to 4 instructions per cycle.
+ Better cache management and greatly reduced memory latency.
+ On-chip cache of 16K data and 16K instruction, with up to 4 MB external cache allowed.
+ Integrated multi-processor support with low latency to shared data.
+ On-chip graphics and imaging support.

2.3 Superscalar Architecture and Pipelining
In the Pentium processor, the integer instructions traverse a five-stage pipeline. The pipeline stages are as follows:
PF - Prefetch
D1 - Instruction Decode
D2 - Address Generate
EX - Execute - ALU and Cache Access
WB - Write-Back
The Pentium processor is a superscalar machine, capable of executing two instructions in parallel.
The two five-stage pipelines operate in parallel, allowing integer instructions to execute in a single clock in each pipeline. The pipelines in the Pentium processor are called the U and V pipes, and the process of issuing two instructions in parallel is termed two-issue superscalar. There are two execution units in the Pentium, and instruction pairing allows each unit to complete the execution of an instruction at the same time. Figure 2.3 depicts how ten instructions move through the pipeline of the Pentium processor.

Figure 5.7 (a) Taken conditional branch (b) Untaken conditional branch

The SuperSPARC implements a precise exception model. At any given time there can be up to nine instructions in the IU pipeline and four more in the floating-point queue. Exceptions and the instructions that caused them propagate through the IU pipeline. They are resolved in the execute stage before their results can modify visible state in the register file, control registers, or memory.

The FPU consists of a floating-point controller (FPC), two independent pipelines (FADD and FMUL), a floating-point queue, and a 32-register, 32-bit floating-point file. The FP file is organized as sixteen 64-bit double words to optimize double-precision performance. Each 32-bit word of the FP file can be accessed separately, however. The FP file has three read ports and two write ports. The FPC is tightly coupled to the IU pipeline and is capable of executing a floating-point memory event and a floating-point operation in the same cycle. The FPC also handles floating-point exceptions. There are two types of floating-point instructions:
1. FPOPs: floating-point operations, such as add, multiply, convert, and so on.
2. FPEVENTs: floating-point events, such as load/store to/from a floating-point register, load/store to/from the floating-point status register, store floating-point queue, integer multiply, and integer divide.
FPEVENTs are executed by the FPU but do not enter the FPU queue.

The FPU pipeline consists of four stages:
1. FRD: decode and read
2. FM/FA: execute multiply or add
3. FN/FR: normalization and rounding
4. FWB: write-back to FP file

The SuperSPARC has a dual cache: a 20-Kbyte Icache and a 16-Kbyte Dcache, for a total of 36 Kbytes of on-chip primary cache. There is support for an external, second-level 1-Mbyte cache. Figure shows the SuperSPARC MMU

Assuming that all the instructions have followed the pairing rules, the five instruction pairs are shown in Figure 2.3. Five clock cycles are needed to pass through the five pipeline stages. In clock cycle 1, the prefetch (PF) action is performed: a pair of instructions is prefetched from the on-chip code cache. During clock 2, this first pair is issued in parallel to the U and V pipelines for decoding (D1 stage), while another pair is being prefetched (PF stage). In clock 3, the first instruction pair moves to the decode 2 (D2) stage, while the second pair is issued to the decode 1 (D1) stage of both pipelines and the third pair of instructions is being fetched (PF stage). In this way, each pair of instructions can proceed to the next stage in the pipeline with each cycle of the processor clock (PCLK). During clock cycle 5, the first instruction pair completes its execution. If we observe the column of CLK5, the first pair is in the last stage (WB) of the pipeline, whereas the second pair is in the 4th stage (EX), the third instruction pair is at the 3rd stage (D2), and so on. Thus, ten different instructions are present at the various pipeline stages during a single clock cycle. After clock cycle 5, each succeeding clock cycle shows the completion of another instruction pair.
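The clock-by-clock walk-through above can be reproduced with a few lines of Python. This is only a sketch under ideal conditions (every pair obeys the pairing rules and nothing stalls); the function names pipeline_table and instructions_in_flight are made up for illustration and are not from any Intel document.

```python
# Ideal (stall-free) model of the Pentium's five-stage integer pipeline.
# Assumption: one perfectly paired instruction pair enters PF every clock.
STAGES = ["PF", "D1", "D2", "EX", "WB"]


def pipeline_table(num_pairs, num_clocks):
    """Return {clock: {stage: pair_number}} for clocks 1..num_clocks."""
    table = {}
    for clock in range(1, num_clocks + 1):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            # Pair k enters PF in clock k and advances one stage per clock.
            pair = clock - stage_index
            if 1 <= pair <= num_pairs:
                row[stage] = pair
        table[clock] = row
    return table


def instructions_in_flight(row):
    """Each occupied stage holds one pair: one U-pipe and one V-pipe instruction."""
    return 2 * len(row)


if __name__ == "__main__":
    table = pipeline_table(num_pairs=5, num_clocks=5)
    for clock, row in table.items():
        cells = "  ".join(
            f"{stage}:pair{row[stage]}" if stage in row else f"{stage}:-"
            for stage in STAGES
        )
        print(f"CLK{clock}  {cells}")
    print("Instructions in flight during CLK5:", instructions_in_flight(table[5]))
```

In the CLK5 row, pair 1 sits in WB, pair 2 in EX, and pair 3 in D2, which matches the description of the CLK5 column in the text.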
Integer Pipeline Stages:

1. Prefetch (PF) Stage
There are two prefetch buffers (queues) in the Pentium, and only one of them is active at a time. The active queue fetches instruction codes from the on-chip cache or memory until the branch prediction logic predicts that a branch will be taken when the branch instruction reaches the execution stage. During normal pipeline operation, the active queue supplies two consecutive instructions to the U and V pipelines.

2. Decode 1 (D1) Stage
The two instructions filling the pipelines are decoded in the D1 stage. Besides branch prediction, the instructions are first checked for pairability.

* Instruction Pairing (Refer class notes for complete answer)
Certain rules govern instruction pairing; not all instructions are pairable. The first limitation comes from the V pipeline. During normal operation, the active queue delivers the first instruction to the U pipeline and the second to the V pipeline. But the V pipeline has no barrel shifter and cannot execute all types of instructions; it can execute only simple instructions. The U pipeline, an enhanced version of the 486 pipeline, can execute any instruction in the Intel architecture. Considering all the limitations, certain criteria are defined. Two instructions are pairable only if they satisfy the following conditions:
- Both instructions in the pair must be simple.
- There are no register dependencies/contention between them.

Instructions that are completely hardwired are called simple instructions. They do not require any microcode control and execute in 1, 2, or at most 3 clock cycles. The following integer instructions are considered simple and may be paired:
MOV reg, reg/mem/imm
MOV mem, reg/imm

data cache. The other three ALUs are used for data processing.
If we have integer instructions that are data dependent, the first instruction can be executed in one of the upper ALUs, forwarding the result and another operand to the lower ALU, completing both computations within a single cycle. After that, both results are stored in the IU register file.

The SuperSPARC integer pipeline consists of four stages (cycles). Each cycle has two phases. The eight pipeline phases are:
1. F0: Instruction cache (Icache) access and TLB lookup.
2. F1: Icache match detect. Four instructions sent to the instruction queue (Iqueue).
3. D0: Issue one, two, or three instructions. Select register indices for load/store instructions.
4. D1: Read register file for load/store instructions. Resource allocation for ALU instructions. Evaluate branch target address.
5. D2: Read register file for ALU operands. Calculate EA for load/store instructions.
6. E0: First stage of ALU. Data cache (Dcache) access and TLB lookup. Floating-point instruction dispatch.
7. E1: Second stage of ALU. Dcache match detect. Load data available. Resolve exceptions.
8. WB: Write back result into the register file. Retire store into the store buffer.

The FPU pipeline is tightly coupled to the integer pipeline. An operation may be started every cycle; the delay of most floating-point operations is three cycles. In the E0 phase, one floating-point arithmetic instruction is selected for execution, and its operands are read during E1. Two stages of execution delay are required for the double-precision FPU adder and FPU multiplier. The first cycle of the adder examines exponents, aligns mantissas, and produces a result. The first cycle of the multiplier computes and adds partial products.
Independent second stages round and normalize the results of the respective units. Forwarding paths are provided to chain the result of one FPU operation into the source of a subsequent operation.

Figure 5.6 illustrates an example of pipelined execution of a set of ALU operations with a load instruction in between. Up to four instructions can be fetched during the (F0, F1) cycle, but only up to three instructions can be issued as a group (GRP) during the D0 phase. Forwarding of results between subsequent groups of instructions is shown by arrows in Figure 5.6. But where is Figure 5.6? Refer to the class notes of Faruk Kazi.

Pipelined execution of a set of instructions that includes a conditional branch is shown in Figure 5.7: the taken-branch case in Figure 5.7(a) and the untaken case in Figure 5.7(b). The original sequential instructions are denoted S1 and S2, and the target instructions T1, T2, T3, and T4. The delay instruction, placed after the conditional branch instruction (BNE in this example), is denoted DI, while C1 refers to the certainty instruction stream. The SuperSPARC processor can group the compare (CMP) and conditional branch (BNE) instructions to speed execution. The processor statically predicts that all branches are taken.

When a control transfer instruction relative to the PC is issued, its DI is fetched concurrently. During the D1 phase the target address (TA) is computed. As the branch instruction enters phase D2, the target instruction stream is fetched (FT). The fetch completes as the DI advances to phase D1 and the compare and branch instructions enter phase E0. The compare instruction computes new integer condition codes in phase E0, and the branch direction is resolved. When a branch is taken, all sequential path instructions (S1 and on; grouped together with the DI) are invalidated (squash S1), as shown in the figure.
When a branch is not taken (untaken), the sequential path instructions (S1 and on) remain valid and the fetched target instructions (T1 and on) are discarded. This scheme does not introduce a pipeline bubble (stall) for either branch path. The PC and prefetch PC values for both directions are precomputed. The SuperSPARC branch implementation can execute untaken branches somewhat more efficiently than taken branches.

ALU reg, reg/mem/imm
ALU mem, reg/imm
INC reg/mem
DEC reg/mem
PUSH reg/mem
POP reg
LEA reg, mem
JMP/CALL/Jcc near
NOP

If the two instructions are not pairable, instruction I2 in the V pipeline's D1 stage is deleted and shifted to the D1 stage of the U pipeline when I1 moves to the D2 stage of the U pipeline.

Instruction Issue Algorithm (EQ):
Decode two consecutive instructions I1 and I2
If the following are all true:
    I1 is a "simple" instruction
    I2 is a "simple" instruction
    I1 is not a jump instruction
    Destination of I1 ≠ source of I2
    Destination of I1 ≠ destination of I2 (i.e. no contention)
Then, issue I1 to U pipe and I2 to V pipe
Else, issue I1 to U pipe

* Branch Prediction (Refer Section 2.8 for detailed answer for separate EQ)
The Pentium processor includes branch prediction logic, allowing it to avoid pipeline stalls if it correctly predicts whether or not the branch will be taken when the branch instruction is executed. When a branch is correctly predicted, no performance penalty is incurred. However, when the prediction is not correct, a three-cycle penalty is incurred if the branch is executed in the U pipeline and a four-cycle penalty if the branch is in the V pipeline.

3. Decode 2 (D2) Stage
The D1 stage is followed by the D2 stage, in which the instructions are further decoded and the addresses of memory-resident operands are calculated. It performs segmentation addressing.
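The Instruction Issue Algorithm listed under the D1 stage can be tried out with a small Python sketch. The Instr record and its fields are hypothetical modelling choices for these notes, not an Intel data structure: registers are plain strings, and flag dependencies and the other pairing restrictions covered in class are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical, simplified view of a decoded instruction."""
    text: str
    simple: bool                    # hardwired, no microcode needed
    is_jump: bool = False
    dests: frozenset = frozenset()  # registers written
    srcs: frozenset = frozenset()   # registers read


def issue(i1, i2):
    """Apply the issue rule to two consecutive instructions I1 and I2.

    Returns (u_pipe, v_pipe). v_pipe is None when I2 cannot be paired and
    must wait to be issued to the U pipe in a later clock.
    """
    pairable = (
        i1.simple
        and i2.simple
        and not i1.is_jump
        and not (i1.dests & i2.srcs)    # destination of I1 != source of I2
        and not (i1.dests & i2.dests)   # destination of I1 != destination of I2
    )
    return (i1, i2) if pairable else (i1, None)


if __name__ == "__main__":
    mov = Instr("MOV EAX, EBX", True,
                dests=frozenset({"EAX"}), srcs=frozenset({"EBX"}))
    add_free = Instr("ADD ECX, EDX", True,
                     dests=frozenset({"ECX"}), srcs=frozenset({"ECX", "EDX"}))
    add_dep = Instr("ADD ECX, EAX", True,
                    dests=frozenset({"ECX"}), srcs=frozenset({"ECX", "EAX"}))
    for second in (add_free, add_dep):
        u, v = issue(mov, second)
        print(f"U: {u.text:<14} V: {v.text if v else '(empty)'}")
```

The second case is not paired because ADD ECX, EAX reads EAX, which the MOV in the U pipe is still writing.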
Address calculation at this stage is much faster: the Pentium requires a single clock cycle to calculate the address for instructions containing a base-and-index addressing mode with displacement and an immediate addressing mode.

5.3 The SuperSPARC (SPARC Implementation)

The SuperSPARC is a 3.1-million-transistor, 0.8-micron, three-layer-metal BiCMOS microprocessor in a 293-pin ceramic pin grid array (PGA), manufactured by TI in cooperation with Sun Microsystems. The processor chip contains an IU, FPU, MMU, and a dual cache (20 Kbytes code, 16 Kbytes data, 36 Kbytes total). A block diagram of the SuperSPARC is shown in Figure 5.5.

Figure 5.5 - SuperSPARC functional block diagram

The SuperSPARC is a three-issue superscalar system. It can issue and execute three instructions every cycle, subject to the following constraints:
1. Maximum of two integer results
2. Maximum of one data memory reference
3. Maximum of one floating-point arithmetic instruction
4. Terminate group of instructions after each control transfer

Data dependencies are resolved on the SuperSPARC by:
1. Cascading dependent instructions in the same group
2. Forwarding dependent instructions in consecutive groups

The block diagram in Figure 5.5 shows the detailed structure of the IU. There are four ALUs and a shifter. The lower leftmost ALU is used for address computations; its output is forwarded to the MMU and the

During the D2 stage, the processor also performs the segmentation protection checks required when the processor is forming memory addresses in protected mode.
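What the D2 stage computes can be illustrated in Python. This is a deliberately simplified sketch: the function names and the ProtectionFault exception are invented for these notes, only an expand-up segment and a single-byte access are considered, and access-rights checks are left out.

```python
class ProtectionFault(Exception):
    """Stands in for the fault raised when a segment limit check fails."""


def effective_address(base=0, index=0, scale=1, disp=0):
    """Offset within the segment: base + index * scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF  # wrap to 32 bits


def linear_address(segment_base, segment_limit, offset):
    """Segment limit check followed by segment base + offset."""
    if offset > segment_limit:
        raise ProtectionFault(f"offset {offset:#x} exceeds limit {segment_limit:#x}")
    return (segment_base + offset) & 0xFFFFFFFF


if __name__ == "__main__":
    # Example operand of the form [EBX + ESI*4 + 8] with EBX = 0x1000, ESI = 4
    offset = effective_address(base=0x1000, index=4, scale=4, disp=8)
    print(hex(offset))                                    # 0x1018
    print(hex(linear_address(0x20000, 0xFFFF, offset)))   # 0x21018
```

On the real processor the segment base and limit come from the descriptor cached with the segment register; here they are simply passed in as numbers.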
Figure 2.4: D2 Stage

4. Execution (EX) Stage
Figure 2.5 illustrates the execution stage of the dual instruction pipelines.

Figure 2.5: EX Stage

SUBX (SUBXcc)        Subtract with Carry (and modify icc)
TSUBcc (TSUBccTV)    Tagged Subtract and modify icc (and Trap on overflow)
MULScc               Multiply Step and modify icc
AND (ANDcc)          And (and modify icc)
ANDN (ANDNcc)        And Not (and modify icc)
OR (ORcc)            Inclusive-Or (and modify icc)
ORN (ORNcc)          Inclusive-Or Not (and modify icc)
XOR (XORcc)          Exclusive-Or (and modify icc)
XNOR (XNORcc)        Exclusive-Nor (and modify icc)
SLL                  Shift Left Logical
SRL                  Shift Right Logical
SRA                  Shift Right Arithmetic
SETHI                Set High 22 bits of r register
SAVE                 Save caller's window
RESTORE              Restore caller's window
Bicc                 Branch on integer condition codes
FBfcc                Branch on floating-point condition codes
CBccc                Branch on coprocessor condition codes
CALL                 Call
JMPL                 Jump and Link
RETT*                Return from Trap
Ticc                 Trap on integer condition codes
RDY                  Read Y register
RDPSR*               Read Processor State Register
RDWIM*               Read Window Invalid Mask Register
RDTBR*               Read Trap Base Register
WRY                  Write Y register
WRPSR*               Write Processor State Register
WRWIM*               Write Window Invalid Mask Register
WRTBR*               Write Trap Base Register
UNIMP                Unimplemented instruction
IFLUSH               Instruction cache Flush
FPop                 Floating-point Operate
CPop                 Coprocessor operate
* Privileged instruction.

The execution stage is comprised of the arithmetic logic unit, or ALU. The U pipeline's ALU incorporates a barrel shifter, while the V pipeline's does not.
It is obvious, then, that the U pipeline can handle instructions that cannot be handled in the V pipeline. When necessary, data cache accesses (on a cache hit) or memory accesses (on a cache miss) are performed in this stage. The U pipeline and V pipeline can access the data cache simultaneously.

Note that both instructions enter the execution stage at the same time. If the instruction in the V pipeline stalls, the U pipeline instruction is permitted to proceed to the write-back stage (i.e. the last stage in the integer pipeline). However, if the U pipeline instruction stalls, the V pipeline instruction will not proceed to the write-back stage.

5. Write-Back (WB) Stage
This is the final stage of integer instruction execution. In the WB stage, the processor state is modified by updating the target registers and the EFLAGS register (if necessary).

(EQ) Floating Point Instruction Pipeline Stages (Pl. refer class notes)
Most floating-point instructions are issued singly to the U pipeline and cannot be paired with integer instructions. The floating-point pipeline consists of eight stages. The first four stages are shared with the integer pipeline, and the last four reside within the floating-point unit itself.

The 8 pipeline stages are:

Stage | Description
Prefetch (PF) | Identical to the integer prefetch stage.
Instruction Decode 1 (D1) | Identical to the integer D1 stage.
Instruction Decode 2 (D2) | Identical to the integer D2 stage.
Execution Stage (EX) | Register read, memory read, or memory write performed as required by the instruction (to access an operand).
FP Execution 1 Stage (X1) | Information from a register or memory is written into an FP register. Data is converted to floating-point format before being loaded into the floating-point unit.
FP Execution 2 Stage (X2) | Floating-point operation performed within the floating-point unit.
Write FP Result (WF) | Floating-point results are rounded and the result is written to the target floating-point register.
Error Reporting (ER) | If an error is detected, the error is reported and the FPU status word is updated.

Instruction set

The SPARC architecture features the following types of instructions:
1. Load/store
2. Arithmetic/logical/shift
3. Control transfer
4. Read/write control registers
5. Floating-point operate
6. Coprocessor operate (not needed on latest highly integrated implementations)

The SPARC instruction set is summarized in Table 5.1.

Table 5.1 SPARC Instruction Set

Opcode               Name
LDSB (LDSBA*)        Load Signed Byte (from Alternate space)
LDSH (LDSHA*)        Load Signed Halfword (from Alternate space)
LDUB (LDUBA*)        Load Unsigned Byte (from Alternate space)
LDUH (LDUHA*)        Load Unsigned Halfword (from Alternate space)
LD (LDA*)            Load Word (from Alternate space)
LDD (LDDA*)          Load Doubleword (from Alternate space)
LDF                  Load Floating-point
LDDF                 Load Double Floating-point
LDFSR                Load Floating-point State Register
LDC                  Load Coprocessor
LDDC                 Load Double Coprocessor
LDCSR                Load Coprocessor State Register
STB (STBA*)          Store Byte (into Alternate space)
STH (STHA*)          Store Halfword (into Alternate space)
ST (STA*)            Store Word (into Alternate space)
STD (STDA*)          Store Doubleword (into Alternate space)
STF                  Store Floating-point
STDF                 Store Double Floating-point
STFSR                Store Floating-point State Register
STDFQ*               Store Double Floating-point Queue
STC                  Store Coprocessor
STDC                 Store Double Coprocessor
STCSR                Store Coprocessor State Register
STDCQ*               Store Double Coprocessor Queue
LDSTUB (LDSTUBA*)    Atomic Load-Store Unsigned Byte (in Alternate space)
SWAP (SWAPA*)        Swap r Register with Memory (in Alternate space)
ADD (ADDcc)          Add (and modify icc)
ADDX (ADDXcc)        Add with Carry (and modify icc)
TADDcc (TADDccTV)    Tagged Add and modify icc (and Trap on overflow)
SUB (SUBcc)          Subtract (and modify icc)

Instruction Pairing Rules for Floating Point Instructions
The rules for how floating-point (FP) instructions are issued on the Pentium processor are given below.
i. FP instructions are normally issued to the U pipeline singly, as they do not get paired with integer instructions. However, limited pairing of two FP instructions can be performed.
ii. Pairing can occur only if the first instruction, issued to the U pipeline, is a simple (F set) instruction and the second instruction is the floating-point exchange instruction, FXCH.
The F set, or simple, instructions are FLD single/double precision, FLD ST(i), and all forms of FADD, FSUB, FMUL, FDIV, FCOM, FUCOM, FABS, and FCHS.

FPU Internal Pipelining - resources
Inside the FPU, all the resources are allocated to one or more instructions at one time. This permits pipelined execution within the FPU. This is explained with the help of the following examples:
i. An FDIV instruction cannot be executed with any other instruction, since FDIV requires all of the FPU resources.
ii. Similarly, two consecutive FMUL instructions cannot be executed simultaneously.
iii. An FMUL instruction can be executed in parallel with one or two FADD instructions.
iv. Three FADD instructions can be executed simultaneously.
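The floating-point pairing rule above (an F set instruction in the U pipe followed by FXCH) reduces to a very small check. The Python sketch below is only an illustration: the name can_pair_fp is made up, and instructions are matched by mnemonic alone, so operand restrictions such as FLD being limited to single/double precision and ST(i) forms are not modelled.

```python
# F set ("simple") floating-point mnemonics as listed in the notes above.
F_SET = {"FLD", "FADD", "FSUB", "FMUL", "FDIV", "FCOM", "FUCOM", "FABS", "FCHS"}


def can_pair_fp(first, second):
    """True if `first` (U pipe) and `second` may issue together.

    Assumption: only the mnemonic is examined; operand forms are ignored.
    """
    return first.upper() in F_SET and second.upper() == "FXCH"


if __name__ == "__main__":
    for first, second in [("FADD", "FXCH"), ("FADD", "FMUL"), ("FSQRT", "FXCH")]:
        verdict = "pair" if can_pair_fp(first, second) else "issue singly"
        print(f"{first} + {second}: {verdict}")
```

Only the first combination pairs; the other two are issued singly to the U pipeline.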
2.4 The Register Set (Software Model/Architecture)
The Intel x86 architecture's register set is subdivided into the following groups:
1. Base architecture registers (application register set)
   i. General-purpose registers (8 x 32 bit)
   ii. Instruction pointer (EIP, 32 bit)
   iii. Flags register (EFLAGS, 32 bit) (DEC07/MAY08)
   iv. Segment registers (6 x 16 bit)
2. System registers (MAY08)
   i. Memory management registers (MAY07)
   ii. Control registers (NOV05/MAY06/MAY07)
3. Floating-point registers (same as 8087, hence not discussed here)
   i. Data registers
   ii. Tag word
   iii. Status word
   iv. Instruction and data pointers
4. Debug registers (DEC07) (Refer class notes)

The base architecture and floating-point registers are accessible by application programs. The system and debug registers are accessible only by system programs (such as the OS) running at the highest privilege level.

Figure 5.4 - SPARC instruction formats

As can be seen in Figure 5.4, the operate instructions implement three-operand addressing, and all formats are a single word (32 bits) long. The fields in the instructions have the following designations:
op: bits 31 and 30 in all formats. They select the instruction format.
rd: bits 29 to 25 in formats 2 and 3. Selects the source register for store instructions and the destination register for all other instructions.
a: bit 29 in format 2. Annul bit. Changes the behavior of the instruction encountered immediately after a control transfer.
cond: bits 28 to 25 in format 2. Selects the condition code for conditional branches.
imm22: bits 21 to 0 in format 2. A 22-bit constant value used by the SETHI instruction.
disp22: bits 21 to 0 in format 2. A 22-bit sign-extended word displacement for branch instructions.
disp30: bits 29 to 0 in format 1. A 30-bit sign-extended word displacement for PC-relative call instructions.
op3: bits 24 to 19 in format 3. Opcode extension.
i: bit 13 in format 3. Selects the type of the second ALU operand for non-floating-point operate instructions. i = 0: the second operand is in register rs2. i = 1: the second operand is the sign-extended simm13.
asi: bits 12 to 5 in format 3. An 8-bit address space identifier generated by load and store alternate instructions.
rs1: bits 18 to 14 in format 3. Selects the first source operand register.
rs2: bits 4 to 0 in format 3. Selects the second source operand register.
simm13: bits 12 to 0 in format 3. A sign-extended 13-bit immediate value.
opf: bits 13 to 5 in format 3. Identifies a floating-point operate instruction.

5.2.4 Addressing modes (Refer class notes for examples and complete answer)
Besides the standard register direct and immediate addressing modes, there are only three addressing modes for memory access:
1. Register indirect with displacement: register + signed 13-bit constant
2. Register indirect indexed: register1 + register2
3. PC-relative: used in CALL instructions with a 30-bit displacement

Base architecture registers:
The base architecture registers (or the application register set) are shown in Fig. 2.6. There are eight 32-bit general-purpose registers.

The flags register, shown in Fig. 2.7, is a 32-bit register called EFLAGS. The specified bits and bit fields of EFLAGS control a number of operations and indicate the status of the processor. The lower 16 bits of EFLAGS, called FLAGS, are used when executing 8086 or 80286 code.
* Bit 21, ID - Identification Flag: The ability of a program to set and clear the ID flag indicates that the processor supports the CPU identification (CPUID) instruction.
* Bit 20, VIP - Virtual Interrupt Pending Flag: The VIP flag together with the VIF (bit 19) enables each application program in a multitasking environment to have a virtualized version of the system's IF flag (bit 9). The processor reads this flag but never modifies it.
* Bit 19, VIF - Virtual Interrupt Flag: The VIF is a virtual image of the IF flag, used with VIP. The processor recognizes the VIF flag when either the VME or PVI bit in CR4 is set and the IOPL is less than 3. The VME flag enables the virtual 8086 mode extensions, while the PVI flag enables protected-mode virtual interrupts.

Figure 5.3 - Processor data types

5.2.3 Instruction formats
The SPARC instruction formats are shown in Figure 5.4. There are three basic instruction format types:
1. CALL
2. Branch instructions
3. Operate instructions (register-to-register)

* Bit 18, AC - Alignment Check: Setting the AC flag and the AM bit in Control Register 0 (CR0) enables alignment checking on memory references. An alignment check exception is generated when a reference is made to an unaligned operand, such as a word at an odd byte address. Alignment check exceptions are generated only in user mode (PL=3).
* Bit 17, VM - Virtual Mode: If VM is set (VM=1), the processor is placed in virtual 8086 mode, which is an emulation of the programming environment of the 8086 microprocessor.
* Bit 16, RF - Resume Flag: When RF is set (RF=1), it temporarily disables debug faults so that an instruction can be restarted after a debug fault without immediately causing another debug fault.
* Bit 14, NT - Nested Task: If NT is set (NT=1), it indicates that the currently executing task is nested within another task and has a valid link to the previous task in its TSS.
* Bits 13-12, IOPL - Input/Output Privilege Level: The IOPL encoded values (0, 1, 2, 3) indicate the numerically maximum current privilege level permitted to access the I/O address space.
* Bit 11, OF - Overflow Flag: The OF is set (OF=1) if the operation resulted in a signed overflow.
* Bit 10, DF - Direction Flag: DF defines whether the ESI and/or EDI registers are incremented (post-increment) or decremented (post-decrement) during the execution of string instructions. Post-increment occurs if DF=0; post-decrement occurs if DF=1.
* Bit 9, IF - Interrupt Enable Flag: When IF is set (IF=1), it allows recognition of external interrupts signaled on the INTR pin. When IF=0, external interrupts on INTR are not recognized.
* Bit 8, TF - Trap Enable Flag: When TF is set (TF=1), the processor is put into single-step mode for debugging. In this mode, the processor generates a debug exception after each instruction, which allows a program to be inspected as it executes each instruction.
* Bit 7, SF - Sign Flag: SF is set (SF=1) if the MSB of the result is set (MSB=1), or in other words, the result is negative. SF reflects the state of bit 7, 15, or 31 for 8-, 16-, and 32-bit operations, respectively.
* Bit 6, ZF - Zero Flag: ZF is set (ZF=1) if all bits of the result are zero. Otherwise, ZF=0.
* Bit 4, AF - Auxiliary Carry Flag: The AF is used for BCD operations. AF is set (AF=1) if the operation resulted in a carry out of bit 3. Otherwise, AF=0.
* Bit 2, PF - Parity Flag: PF is set (PF=1) if the low-order 8 bits of the result contain an even number of 1s (even parity). PF is reset (PF=0) if the low-order 8 bits have odd parity (odd number of 1s).

Window Invalid Mask (WIM): Refer class notes.

Trap Base Register (TBR): The Trap Base Register (TBR) contains the address to which control is transferred when a trap occurs.

Program Counters (PC, nPC): The 32-bit PC contains the address of the instruction currently being executed by the IU. The nPC holds the address of the next instruction to be executed (assuming a trap does not occur).
For a delayed control transfer, the instruction that immediately follows the transfer instruction is known as the delay instruction. This delay instruction is executed (unless the control transfer instruction annuls it) before control is transferred to the target. During execution of the delay instruction, the nPC points to the target of the control transfer instruction, while the PC points to the delay instruction.

Ancillary State Registers (ASRs): SPARC provides for up to 31 Ancillary State Registers (ASRs), numbered from 1 to 31. ASRs numbered 1-15 are reserved for future use by the architecture and should not be referenced by software. ASRs numbered 16-31 are available for implementation-dependent uses, such as timers, counters, diagnostic registers, self-test registers, and trap-control registers. A particular IU may choose to implement from zero to sixteen of these ASRs. The semantics of accessing any of these ASRs is implementation-dependent. Whether a particular Ancillary State Register is privileged or not is implementation-dependent. An ASR is read and written with the RDASR and WRASR instructions.

IU Deferred-Trap Queue: An implementation may contain zero or more deferred-trap queues. Such a queue contains sufficient state to implement resumable deferred traps caused by the IU.

5.2.2 Data types
SPARC architecture recognizes the following data types, as shown in Figure 5.3:
1. Integer
   Signed, unsigned byte: 8 bits
   Signed, unsigned halfword: 16 bits
   Signed, unsigned word: 32 bits
   Doubleword: 64 bits
2. Floating-point (IEEE 754 standard)
   Single-precision: 32 bits
   Double-precision: 64 bits
   Quad-precision: exponent 15 bits, mantissa 63 bits

* Bit 0, CF - Carry Flag: CF is set (CF=1) if the operation resulted in a carry out of the MSB (the sign bit). Otherwise CF=0.
For 8-, 16-, or 32-bit operations, CF is set according to the carry out of bit 7, 15, or 31, respectively.

Segment registers
Six 16-bit segment registers (CS, SS, DS, ES, FS, and GS) hold segment selector values identifying the currently addressable memory segments. The selector in CS indicates the current code segment, the selector in SS indicates the current stack segment, and the selectors in DS, ES, FS, and GS indicate the current four data segments.

System memory management registers (Refer class notes for detailed answer) EQ
Four memory management registers are used to control segmented memory management. The Global Descriptor Table Register (GDTR) and Interrupt Descriptor Table Register (IDTR) can be loaded with instructions that get a 6-byte data item from memory. The Local Descriptor Table Register

IU control/status registers: A SPARC processor includes two types of registers: general-purpose or "working" data registers and control/status registers. The IU's general-purpose registers are called r registers, as discussed above, and the FPU's general-purpose registers are called f registers. IU control/status registers include:
* Processor State Register (PSR)
* Window Invalid Mask (WIM)
* Trap Base Register (TBR)
* Program Counters (PC, nPC)
* Implementation-dependent Ancillary State Registers (ASRs)
* Implementation-dependent IU Deferred-Trap Queue

Processor State Register (PSR)

impl: Bits 31 through 28 are hardwired to identify an implementation or class of implementations of the architecture. Together, the impl and ver fields define a unique implementation or class of implementations of the architecture.
ver: Bits 27 through 24 are implementation-dependent.
The ver field is either hardwired to identify one or more particular implementations or is a readable and writable state field whose properties are implementation-dependent.
icc  Bits 23 through 20 are the IU's condition codes. These bits are modified by the arithmetic and logical instructions whose names end with the letters cc (e.g., ANDcc), and by the WRPSR instruction. The Bicc and Ticc instructions cause a transfer of control based on the value of these bits, which are defined as follows: bit 23 = n (negative), bit 22 = z (zero), bit 21 = v (overflow), bit 20 = c (carry).
reserved  Bits 19 through 14 are reserved.
EC (Enable Coprocessor)  Bit 13 determines whether the implementation-dependent coprocessor is enabled. If disabled, a coprocessor instruction will trap. 1 = enabled, 0 = disabled. If an implementation does not support a coprocessor in hardware, EC should always read as 0.
EF (Enable Floating-point)  Bit 12 determines whether the FPU is enabled. If disabled, a floating-point instruction will trap. 1 = enabled, 0 = disabled.
PIL (Processor Interrupt Level)  Bits 11 (the most significant bit) through 8 (the least significant bit) identify the interrupt level above which the processor will accept an interrupt.
S  Bit 7 determines whether the processor is in supervisor or user mode. 1 = supervisor mode, 0 = user mode.
PS (Previous Supervisor)  Bit 6 contains the value of the S bit at the time of the most recent trap.
ET (Enable Traps)  Bit 5 determines whether traps are enabled. 1 = traps enabled, 0 = traps disabled.
CWP (Current Window Pointer)  Bits 4 (the MSB) through 0 (the LSB) comprise the current window pointer, a counter that identifies the current window into the r registers.
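The field list above can be turned into a small decoder. The following is a minimal sketch, not part of the original notes: the bit positions are taken from the PSR description (with ET assumed to be bit 5, as in the SPARC V8 layout), while the names PSR_FIELDS, decode_psr and the sample value are hypothetical.

```python
# Bit ranges (high, low) for each PSR field, taken from the field list above.
# ET is assumed to be bit 5, as in the SPARC V8 PSR layout.
PSR_FIELDS = {
    "impl": (31, 28),
    "ver": (27, 24),
    "icc": (23, 20),
    "EC": (13, 13),
    "EF": (12, 12),
    "PIL": (11, 8),
    "S": (7, 7),
    "PS": (6, 6),
    "ET": (5, 5),
    "CWP": (4, 0),
}


def decode_psr(psr):
    """Split a 32-bit PSR value into its named fields."""
    fields = {}
    for name, (high, low) in PSR_FIELDS.items():
        width = high - low + 1
        fields[name] = (psr >> low) & ((1 << width) - 1)
    return fields


# Hypothetical PSR value: supervisor mode, traps enabled, PIL = 3, window 2.
example = (1 << 7) | (1 << 5) | (3 << 8) | 2
print(decode_psr(example))
```

Reading CWP out of such a value is the first step a window overflow handler would take to find the active window.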
(LDTR) and Task Register (TR) can be loaded with instructions which take a 16-bit segment selector as an operand. The remaining bytes of these registers are then loaded automatically by the processor from the descriptor referenced by the operand.

Control registers (Caution: wrong figure is given in some text book)
There are five control registers (CR0, CR1, CR2, CR3, CR4). Only four of them are used by the current implementation; register CR1 is reserved for future use. The CR0 register contains system control flags, which control modes of operation or indicate states of the processor. Only bits 0 to 5, 16, 18, and 29 to 31 are currently used. The other bits are reserved for future implementation. The function of the CR0 bits is briefly explained in the following.
* Bit 31, PG-Paging Enable: When PG is set (PG=1), paging is enabled. When PG=0, paging is disabled.
* Bit 30, CD-Cache Disable: The CD bit is used to enable or disable the on-chip cache fill mechanism. When CD=1, the cache will not be filled on cache misses. When CD=0, cache fills may be performed on misses.
* Bit 29, NW-Not Write-through: When NW is cleared (NW=0), on-chip cache write-through is enabled: all writes, including cache hits, are sent out to the pins. When NW=1, write-through and write-invalidate cycles are disabled. The only write cycles that reach the external bus when NW=1 are cache misses. Invalidate cycles are ignored. Write hits with NW=1 do not update main memory.
* Bit 18, AM-Alignment Mask: The AM bit allows alignment checking when set (AM=1) and disables alignment checking when clear (AM=0).
* Bit 16, WP-Write Protect: When WP is set (WP=1), it write-protects user-level pages against supervisor-level write operations.
When WP is clear (WP=0), read-only user-level pages can be written by a supervisor process.
* Bit 5, NE-Numeric Error: When NE is set (NE=1), it enables the standard mechanism for reporting floating-point numeric errors.
* Bit 4, ET-Extension Type: The ET bit indicates support of the i387 mathematical coprocessor instructions.
* Bit 3, TS-Task Switched: The TS bit is set (TS=1) whenever a task switch operation is performed.
* Bit 2, EM-Emulation: When EM is set (EM=1), execution of a numeric floating-point instruction generates the coprocessor-not-available exception. The EM bit must be set when the processor does not have a floating-point unit.
* Bit 1, MP-Monitor coProcessor: On the i286 and i386 processors, the MP bit controls the function of the WAIT instruction, which is used to synchronize with a coprocessor. The WAIT

The group subdivision of the window registers is:
r31 to r24   ins, contain parameters passed to the procedure by the calling procedure
r23 to r16   locals, contain local parameters of the procedure
r15 to r8    outs, contain parameters passed to the called procedure
As can be seen in Figure 5.1, the outs registers of the calling procedure are physically the ins registers of the called procedure. The calling procedure therefore passes parameters to the called procedure through its own outs registers. The register window of the currently running procedure, called the active window, is pointed to by the current window pointer (CWP) in the processor state register (PSR).
The number of windows (NWINDOWS) that can be used in different versions of the SPARC ranges from 2 to 32. Each window adds 16 registers (eight locals plus eight shared ins/outs) to the eight globals, so the total number of general-purpose IU registers ranges from 40 to 520, respectively.
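The register count and the ins/outs overlap described above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the physical slot numbering in physical_index is hypothetical (real implementations may order the register file differently), and the called procedure is assumed to run in window CWP-1 modulo NWINDOWS. The counting rule of 8 globals plus 16 registers per window reproduces the figure of 136 registers for eight windows.

```python
GLOBALS = 8       # r0..r7, shared by every procedure
PER_WINDOW = 16   # each window contributes 8 locals + 8 registers shared with a neighbour


def total_registers(nwindows):
    """Total general-purpose IU registers for a given NWINDOWS."""
    return GLOBALS + PER_WINDOW * nwindows


def physical_index(window, r, nwindows):
    """Map register r (0..31), as seen from `window`, to a physical slot.

    Hypothetical numbering: slots 0..7 hold the globals; the windowed
    registers form a ring of 16 * nwindows slots after them.
    """
    if r < 8:
        return r
    return GLOBALS + (window * PER_WINDOW + (r - 8)) % (PER_WINDOW * nwindows)


print(total_registers(2), total_registers(8), total_registers(32))

# Caller runs in window 3; the callee is assumed to run in window 2.
caller_outs = [physical_index(3, r, 8) for r in range(8, 16)]    # r8..r15
callee_ins = [physical_index(2, r, 8) for r in range(24, 32)]    # r24..r31
print(caller_outs == callee_ins)
```

The same mapping wraps around at the ends of the ring, so the outs of window 0 coincide with the ins of window NWINDOWS-1.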
Most current SPARC implementations feature eight windows, for a total of 136 registers. Implemented windows are contiguously numbered 0 to (NWINDOWS-1). An example of an eight-window implementation, where the windows are circularly interconnected, is shown in Figure 5.2.

Figure 5.2 Circular stack of window registers

The CPU contains a 32-bit control register called the window invalid mask (WIM). Each bit of WIM, wi (i = 0, 1, ..., 31), corresponds to one of the possible 32 windows (even if fewer than 32 are implemented). If wi = 1, window i is considered to be invalid, and a trap condition exists.
The CPU's program counter (PC) is a separate register, not included in the general-purpose register file. SPARC implementations may have several PCs containing addresses of subsequent instructions.
Some of the SPARC IU registers have specially designated tasks. Register r0 is hardwired to a zero value, as it is in many other RISC-type systems. A CALL instruction writes its own address into the outs register r15. The CWP is decremented by a SAVE instruction on a procedure call and incremented by a RESTORE instruction on a procedure return. Procedures can also be called without changing the window.
Suppose that in the case of NWINDOWS = 8 (Figure 5.2), window 0 is the currently running active window. In this case, CWP=0. Since window 0 is the last free window, when the procedure using window 0 calls another procedure, a window overflow occurs. A new register window wraps around to overwrite the previously used window 7, whose contents must be saved in memory by software. After a return, when the register file was out of windows, we have a window underflow. Software must restore previously used register windows in this case. A window overflow trap is caused by the overflow. The overflow trap handler uses the locals of window 7 for pointers into the memory where the overflowed window is stored.
Window 7 is invalidated during the trap handling by setting bit w7 of the WIM register.

instruction is not needed on processors with an on-chip FPU, such as the i486 and the Pentium. When running i286 and i386 programs on i486 and the Pentium FPU, MP should be set (MP=1). It should be cleared (MP=0) on i486 and the Pentium.
* Bit 0, PE-Protection Enable: When PE is set (PE=1), the protection mechanism is enabled. When PE=0, the processor operates in unprotected real (8086) mode.
The low-order 16 bits of the CR0 are also known as the machine status word (MSW), for compatibility with the i286. The MSW of the i286 has 4 bits that are used: bits 3 through 0 (TS, EM, MP, and PE).
The CR2 register holds the 32-bit linear address that caused the last page fault detected.
The CR3 register contains in its upper 20 bits (bits 31 through 12) the address of the page directory base. The CR3 is also known as the page directory base register (PDBR). The page directory occupies a regular page frame, that is, it must be aligned to a page boundary, so the low 12 bits of the CR3 are not used as address bits. On the i486 and the Pentium (and possibly on future implementations) the state of bits 4 and 3 is driven on the outside pins PCD and PWT respectively, and they are used as follows:
* Bit 4, PCD-Page-level Cache Disable: When PCD=1, the on-chip cache is disabled. When PCD=0, on-chip caching is enabled, provided it is not disabled by other means (such as cache deactivation by a signal from an external pin).
* Bit 3, PWT-Page-level Write Transparent: The PWT bit can be used to control the write policy of an external second-level cache. When PWT=1, it allows a write-through policy for the external cache.
If PWT=0, a write-back policy for the external cache is adopted.
Register CR4, new on the Pentium, contains bits that enable certain architectural extensions. Only bits 6 and 4 through 0 are currently used, as follows:
* Bit 6, MCE-Machine Check Enable: Setting MCE (MCE=1) enables the machine check exception.
* Bit 4, PSE-Page Size Extension: Setting PSE (PSE=1) enables paging with large 4-Mbyte pages.
* Bit 3, DE-Debugging Extensions: Setting DE (DE=1) enables I/O breakpoints.
* Bit 2, TSD-Time Stamp Disable: Setting TSD (TSD=1) makes the read from time stamp counter (RDTSC) a privileged instruction.

Chapter 5 - Sun SPARC Family

5.1 Overview
The SPARC architecture was initiated by Sun Microsystems. Before announcing the SPARC, Sun Microsystems produced a very popular family of M68000-based Sun workstations. One of the things that differentiates SPARC from other RISC-type systems is that Sun did not have a history of preceding microprocessors, RISC or CISC, so it had no software compatibility constraints to worry about. SPARC designers could start from a clean slate.
The name SPARC stands for Scalable Processor ARChitecture. The concept of scalability, as seen by the creators of SPARC, is the wide spectrum of its possible price/performance implementations, ranging from microcomputers to supercomputers. The scalability of the SPARC can also be interpreted in terms of the number of CPU registers that can be used in various versions of products implementing the SPARC architecture. The SPARC architecture follows the Berkeley RISC design philosophy by stressing the importance of a relatively large CPU register file and by its register window feature.
5.2 SPARC Architecture
5.2.1 Register Organization
General Purpose Register file
SPARC architecture features a comparatively large CPU register file of over 100 registers. As in the Berkeley RISC, any procedure running on the SPARC can access only 32 registers, denoted r0 to r31. Eight of the registers (r0 to r7) are global, accessible by all procedures. The other 24 registers are the window registers, assigned to each procedure, with an overlap of eight registers between procedures. The 24 window registers are subdivided into three groups of eight registers each, as illustrated in Figure 5.1 for a sequence of three nested procedures.

Figure 5.1 Three overlapping windows and globals

* Bit 1, PVI-Protected-mode Virtual Interrupts: Setting PVI (PVI=1) enables support for a virtual interrupt flag in protected mode. This feature can enable some programs designed for execution at privilege level 0 to execute at privilege level 3 (applications level; least privileged).
* Bit 0, VME-Virtual-8086 Mode Extensions: Setting VME (VME=1) enables support for a virtual interrupt flag in virtual-8086 mode. This feature may improve performance in this mode.

Fig 2.7 Control Registers

Floating-point registers (Pl refer 8087 class notes for diagram)
The on-chip FPU includes eight 80-bit data registers R0 to R7, a 16-bit tag word, a 16-bit control register, a 16-bit status register, a 48-bit instruction pointer, and a 48-bit data pointer.
Data Registers: The data registers R0 to R7 are used by floating-point computations.
These registers can be accessed in two ways:
1 As a stack whose top is pointed to by bits 13 to 11 (TOP field) of the status register (or status word), with instructions operating on the top one or two stack elements.
2 As a fixed register set, with instructions operating on explicitly designated registers.
A PUSH operation decrements TOP by 1 and loads a value into the new top data register. A POP operation stores the value from the current top data register and then increments TOP by one. Like other x86 stacks in memory, the FPU data register stack grows down towards lower-addressed

Paging in DEC Alpha AXP
The system supports pages of 8 Kbytes, 64 Kbytes, 512 Kbytes and 4 Mbytes.

Other AXP Implementations
A number of subsequent Alpha AXP implementation microprocessors have been produced by DEC.
The 21064A is a 0.5-micron, three-metal-layer, CMOS-5, 2.5-million transistor, 431-pin PGA microprocessor, running at frequencies ranging from 225 to 275 MHz. It has double the on-chip cache of the 21064: 16 Kbytes instruction and 16 Kbytes data, for a total of 32 Kbytes.
The 21066 is a highly integrated implementation of Alpha, whose on-chip functions include an I/O controller, IU, FPU, memory controller, graphics accelerator, instruction and data caches (8 Kbytes each, as on the 21064) and an external cache controller. The 21066 is a 0.68-micron, three-metal-layer, CMOS-4, 287-pin PGA microprocessor, running at 166 MHz. The 21068 is a lower-frequency version of the 21066, running at 66 MHz.

4.5 Applications of DEC Alpha AXP: (Pl refer class notes for CRAY T3D MPP)
The T3D (Torus, 3-Dimensional) was Cray Research's first attempt at a massively parallel supercomputer architecture. Launched in 1993, it also marked Cray's first use of a non-proprietary microprocessor architecture in a supercomputer.
The T3D consisted of between 32 and 2048 Processing Elements (PEs), each comprising a 150 MHz DEC Alpha 21064 (EV4) processor and either 16 or 64 MB of DRAM. PEs were grouped in pairs, or nodes, which incorporated a 6-way processor interconnect switch. These switches had a peak bandwidth of 300 MB/second in each direction and were connected to form a three-dimensional torus network topology. The T3D was designed to be hosted by a Cray Y-MP Model E, M90 or C90-series "front-end" system and to rely on it and its UNICOS operating system for all I/O and most system services. The T3D PEs ran a simple microkernel called UNICOS MAX.

VIVA: The first processor of the Alpha family was called 21064 ("21" implied that Alpha was an architecture of the 21st century, "0" the processor's generation, "64" the computational capability in bits), also code-named EV4 ("EV" was the abbreviation of "Extended VAX" and "4" the generation of the technological process). But then what is AXP? Check your class notes.

registers.
Tag word: The tag word marks the content of each data register, R0 to R7. Each 2-bit tag, Tag(0) to Tag(7), represents one of the R0 to R7 registers, respectively.
Status word: The 16-bit status word, located in the status register, reflects the overall state of the FPU.
Control word: The control word provides the user with several programmable processing options. The low-order 6 bits contain individual masks for each of the six exceptions that the FPU recognizes. They correspond to the low-order 6 bits of the status word.
Instruction and data pointers: In case of an FPU error, the 48-bit instruction pointer contains the address of the failing instruction and the 48-bit data pointer contains the address of its numeric memory operand, if appropriate.

Debug registers
The x86 architecture features eight debug registers, DR0 to DR7.
Only programs executing at the highest privilege level can access these registers. Registers DR0 to DR3 specify the four linear breakpoint addresses. The debug control register, DR7, is used to set the breakpoints.

2.5 Memory Management
The primary functions of the MMU are:
1 Translation of the virtual (logical) address into a physical (real) address.
2 Provide for the paging mechanism involved in the virtual memory organization. The paging unit does this.
3 Provide for the segmentation mechanism by the segmentation unit.
4 Provide for memory protection. This is usually done within the paging or segmentation unit, or both.
5 Inclusion and management of a fast-access translation lookaside buffer (TLB).

Segmentation (Refer class notes for comparison of real/protected mode segmentation)
Segmented memory is utilized by protected mode to allow tasks to have their own separate memory spaces, which are protected from access by other tasks. A segment can be from 1 byte to 4 GB long. Segments can start at any base address in memory, and storage overlapping between segments is allowed.

Address Translation Mechanism
A virtual (logical) address in the x86 architecture is formed out of two components:
1 A 16-bit selector, used to determine the linear base address of the segment (the address of the first byte of the segment).
2 A 32-bit offset, used to address internally within a segment. The offset of a given memory

Figure 4.8 CPU external interface

Pipeline in DEC Alpha AXP
The 21064 IU and FPU pipelines are illustrated in the figure (Pl refer class notes of Faruk Kazi). The integer pipeline is seven stages deep. The first four stages are associated with instruction fetching, decoding, and scoreboard checking of operands for possible data dependency. Pipeline stages 0 through 3 can be stalled. Beyond stage 3, however, all pipeline stages advance every cycle.
Most ALU operations complete in cycle 4 (A1). Primary cache accesses complete in cycle 6 (WR), so cache delay is three cycles. The instruction stream is based on autonomous prefetching in cycles 0 and 1, with the final resolution of a cache hit not occurring until cycle 5. The prefetcher includes a branch history table and a subroutine return stack. The architecture provides a convention for compilers to predict branch decisions and destination addresses, including those for register indirect jumps. The penalty for a branch mispredict is four cycles.

Figure 4.9 Pipeline in DEC Alpha AXP

The FPU pipeline is 10 stages deep. It is identical to, and mostly shared with, the IU pipeline in stages 0 through 3. All operations, 32- and 64-bit, have the same timing (except divide). Divide is handled by a non-pipelined, single-bit-per-cycle, dedicated divide unit.
In cycle 4 (F1), the register file data is formatted to fraction, exponent, and sign. In the first-stage adder, the exponent difference is calculated and a 3x multiplicand is generated for multiplies. In addition, a predictive leading 1 or 0 detector using the input operands is initiated for use in result normalization. In cycles 5 (F2) and 6 (F3), for add/subtract, alignment or shift is performed. For both single- and double-precision multiplication, the multiply is done in a radix-8 pipelined array multiplier. In cycles 7 (F4) and 8 (F5), the final addition and rounding are performed in parallel, and the final result is selected and driven back to the register file in cycle 9 (FWR). With an allowed bypass of the register write data, floating-point delay is six cycles.
Pairability in DEC Alpha AXP
The superscalar dual issue of instructions is restricted to the following pairs:
* Any load/store in parallel with any operate
* An integer operate in parallel with a floating-point operate
* A floating-point operate and a floating-point branch
* An integer operate and an integer branch

location address is its distance in bytes from the segment base address. (Called EIP for instruction fetch.)
The addressing of a memory operand within a segment is illustrated in Fig 2.8.
When a 32-bit x86 processor is reset or powered up, it is initialized in real mode. Real mode has the same architecture as the 8086 but allows access to the 32-bit register set. The default operand size in real mode is 16 bits. However, the regular mode of operation of a 32-bit x86 architecture processor is protected virtual address mode (PVAM) or, simply, protected mode. In protected mode, the 16-bit selector is used to specify an index into an OS-defined table. The table contains the 32-bit base address of a given segment. Adding the base address obtained from the table to the offset forms the physical address (if PG=0).

Segment Descriptors (EQ)
Each segment has a segment descriptor associated with it. The segment descriptor is 8 bytes long and contains the following information about the segment:
1 A 32-bit segment base linear address
2 A 20-bit segment limit, specifying the size of the segment
3 Access rights byte, containing protection mechanism information
4 Control bits
The segment limit field of the segment descriptor has only 20 bits (and not 32) because the segment size does not have byte granularity for all segment sizes. Segments have a byte granularity (i.e. segments may differ in size by a single byte) for segment sizes up to 1 Mbyte.
For segments above 1 Mbyte and up to 4 GB, there is a page granularity; that is, segment sizes may differ by a page size, which is 4 Kbytes (2^12 bytes). With page granularity, the 20-bit limit covers 2^20 pages of 4 Kbytes each, i.e. the full 4 GB.

Figure 4.7 Block diagram of the 21064

IBOX - issues instructions (two at a time), maintains the integer pipeline, and performs PC calculations. It decodes two instructions in parallel and checks availability of resources. There is no out-of-order issue. There will be an issue if appropriate resources are available. If resources are not available for the first instruction, there will be no issue. The IBOX contains branch prediction logic, instruction translation buffers (ITBs), interrupt logic, and performance counters (issues, non-issues, total cycles, pipe dry, pipe freeze, cache misses). There are two ITBs:
1. Small-page ITB, eight-entry, fully associative; contains recently used instruction stream page table entries (PTEs) for 8-Kbyte pages.
2. Large-page ITB, four-entry, fully associative, for 512 x 8 Kbyte pages (4 Mbytes).
EBOX - integer execution unit. It contains a 64-bit adder, logic box, barrel shifter, bypassers, integer multiplier, and a 32 x 64 IRF with four read and two write ports.
ABOX - address generation unit. It contains the address translation datapath, load silo, data cache interface, internal processor registers (IPRs), the BIU, and a 32-entry, fully associative data translation buffer (DTB). The load silo is a memory reference pipeline that can accept a new load or store instruction every cycle until a data cache fill is required. The BIU has an external 128-bit data bus.
FBOX - the FPU. It contains, in addition to the operation units, a 32 x 64 floating-point register file (FRF) and a user-accessible floating-point control register (FPCR).
ICACHE - instruction cache. 8 Kbytes, direct-mapped, physically addressed, 32 bytes/line.
DCACHE - data cache. 8 Kbytes, direct-mapped, physically addressed, 32 bytes/line, write-through, read-allocate.
An example of an external interface interconnection of the 21064 is shown in the figure. It is designed to directly support an off-chip secondary cache (also called backup cache, or B-cache) that can range from 128 Kbytes to 8 Mbytes and can be constructed from ordinary SRAM. The interface is designed to allow all cache policy decisions to be controlled by logic external to the CPU chip. There are 3 control bits associated with each B-cache line: valid (V), shared (S), and dirty (D). The chip completes a B-cache read as long as valid is true. A write is processed by the CPU only if valid is true and shared is false. When a write is performed, the dirty bit is set to true. In all other cases, the chip defers to an external state machine to complete the transaction.

Fig 2.9 Segment Selector

Fig 2.10 Segment Descriptor format (Pl refer class notes for TYPE field)

Segment descriptors are stored in descriptor tables in memory. The descriptor tables define all the segments which are used in the system. There are three types of descriptor tables:
1 The Global Descriptor Table (GDT) contains descriptors that are possibly available to all tasks in the system.
2 The Local Descriptor Table (LDT) contains descriptors associated with a given task. Each task may have a separate LDT. A segment cannot be accessed by a task if its segment descriptor does not exist in either the current LDT or the GDT.
3 The Interrupt Descriptor Table (IDT) contains the descriptors for up to 256 ISRs. The IDT is basically the interrupt vector table.
The upper 13 bits of a selector are used as an index into the descriptor table, so a table holds at most 2^13 = 8192 descriptors of 8 bytes each.
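How a selector indexes a descriptor table can be sketched in a few lines. This is an illustrative sketch rather than text from the notes: the index field (upper 13 bits) is as stated above, while the table-indicator bit (bit 2) and requested privilege level (bits 1 and 0) are taken from the standard x86 selector layout that Fig 2.9 depicts. The function name decode_selector and the sample value are hypothetical.

```python
DESCRIPTOR_SIZE = 8  # bytes per segment descriptor


def decode_selector(selector):
    """Split a 16-bit x86 segment selector into its fields."""
    index = (selector >> 3) & 0x1FFF      # upper 13 bits: descriptor index
    table_indicator = (selector >> 2) & 1  # 0 = GDT, 1 = LDT (standard layout)
    rpl = selector & 0x3                   # requested privilege level
    return {
        "index": index,
        "table": "LDT" if table_indicator else "GDT",
        "rpl": rpl,
        # Byte offset of the descriptor from the base of its table.
        "byte_offset": index * DESCRIPTOR_SIZE,
    }


# Hypothetical selector: descriptor 3 of the GDT, requested at privilege level 3.
print(decode_selector(0x001B))
```

The largest index, 8191, gives a byte offset of 65528, which is why a descriptor table never exceeds 64 Kbytes.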
All of the above descriptor tables are variable-length memory arrays. They can range in size between 8 bytes (a single descriptor) and 64 Kbytes (upper limit: 2^13 = 8192 descriptors, 8 bytes each).
Each of the above tables has a register, located in the CPU, associated with it and pointing to it:
1 GDT register (GDTR), 48 bits, associated with the GDT
2 LDT register (LDTR), 16 bits, associated with the LDT
3 IDT register (IDTR), 48 bits, associated with the IDT

There are two types of floating-point compare instructions:
1 CMPGxx  Compare G_floating, operands: Fa.rg, Fb.rg, Fc.wq
  where xx may take the options:
  EQ   equal
  LE   less than or equal
  LT   less than
  for a total of three instructions.
2 CMPTxx  Compare T_floating, operands: Fa.rx, Fb.rx, Fc.wq
  where xx may take the options EQ, LE, LT as for CMPGxx, and another option:
  UN   unordered
In all the floating-point compare instructions the operands in Fa and Fb are compared. If the specified relationship is true, a nonzero floating-point value (0.5 for CMPGxx, 2.0 for CMPTxx) is written into Fc. Otherwise, a true zero is written into Fc.

Privileged architecture library (PAL) code
The PAL code provides a mechanism to implement the following functions without resorting to a microcoded machine:
* Instructions that require complex sequencing as an atomic operation
* Instructions that require VAX-style interlocked memory accesses
* Privileged instructions
* Memory management control
* Context swapping
* Interrupt and exception dispatching
* Power-up initialization and booting
* Console functions
* Emulation of instructions with no hardware support
PAL functions are implemented in the Alpha architecture in standard machine code, resident in main memory. The PAL code environment differs from the normal environment in the following ways:
1 There is complete control of the machine state, allowing all functions of the machine to be controlled.
2 Interrupts are disabled, allowing the system to provide multi-instruction sequences as atomic operations.
3 Implementation-specific hardware functions are enabled, allowing access to low-level system hardware.
4 Instruction stream memory management traps are prevented, allowing PAL code to implement memory management functions such as translation buffer (TB) fills.

4.4 Alpha AXP Implementations
The first implementation of the Alpha architecture is the 21064 microprocessor chip. The 21064 is fabricated in a 0.75-micron CMOS technology utilizing three levels of metallization and optimized for 3.3-V operation. The die size is 16.8 x 13.9 mm and it contains 1.68 million transistors. Its initial operating frequency is within the 150 to 200 MHz interval. Power dissipation at 200 MHz is 30 W. The processor is a two-issue superscalar. The chip includes a dual cache: 8-Kbyte instruction and 8-Kbyte data. It also includes a four-entry, 32-bytes-per-entry write buffer, a pipelined 64-bit integer execution unit with a 32 x 64 register file, and a pipelined FPU with a 32 x 64 register file of its own. The pin interface includes integral support for an external secondary cache of 128 Kbytes up to 8 Mbytes. The internal caches are direct-mapped. All caches have 32 bytes/line. The internal data cache is a write-through, read-allocate, physical cache. The chip package is a 431-pin pin grid array (PGA) with 140 pins dedicated to power supply voltage and ground.

The LGDT, LLDT and LIDT instructions load the base and the limit of the GDT, LDT, and IDT, respectively, into the appropriate register: GDTR, LDTR and IDTR, respectively. The SGDT, SLDT and SIDT instructions store the contents of GDTR, LDTR, and IDTR, respectively, into a specified destination address.

Paging Mechanism
The paging mechanism is optional.
It is enabled when the PG bit (bit 31) in CR0 is set. Paging works beneath segmentation and is transparent to the segmentation. The standard page size of the x86 is 4 KB = 2^12 bytes, but it can be extended to 4 Mbytes on the Pentium processor. The x86 uses two levels of tables to translate the linear address into a physical address. There are three components to the paging mechanism: the page directory, the page tables, and the page frame. A uniform size for all the elements simplifies memory allocation and reallocation schemes, since there is no problem with memory fragmentation. Figure 2.11 illustrates the paging mechanism.

Fig 2.11 Paging Mechanism (4 KB Page) (For 4 MB Page Pl. Refer Class Notes)

The control register CR2 is the page fault linear address register. It holds the 32-bit linear address which caused the last page fault detected. Register CR3 points to the base of the page directory.
The page directory is 4 Kbytes long and allows up to 1024 PDEs. Each PDE, shown in Fig. 2.12 (a), contains the address of the next-level tables, the page tables, and information about the page table pointed to. The upper 10 bits of the linear address (the directory field; bits 31 to 22) are used as an index to select the correct PDE.

SUBF   Subtract F_floating
SUBG   Subtract G_floating
SUBS   Subtract S_floating
SUBT   Subtract T_floating
MULF   Multiply F_floating
MULG   Multiply G_floating
MULS   Multiply S_floating
MULT   Multiply T_floating
DIVF   Divide F_floating
DIVG   Divide G_floating
DIVS   Divide S_floating
DIVT   Divide T_floating
The four arithmetic operations are performed as follows:
Addition:         Fc ← Fav + Fbv
Subtraction:      Fc ← Fav - Fbv
Multiplication:   Fc ← Fav * Fbv
Division:         Fc ← Fav / Fbv
The floating-point convert operations are summarized in Table 4.9.
In all the above operations Fb contains the datum to be converted and Fc is the destination where the converted datum is stored.

Table 4.9 Floating-point convert operations
Mnemonic   Operation
CVTLQ      Convert longword to quadword
CVTQL      Convert quadword to longword
CVTDG      Convert D to G_floating
CVTGD      Convert G to D_floating
CVTGF      Convert G to F_floating
CVTGQ      Convert G_floating to quadword
CVTQF      Convert quadword to F_floating
CVTQG      Convert quadword to G_floating
CVTQS      Convert quadword to S_floating
CVTQT      Convert quadword to T_floating
CVTTQ      Convert T_floating to quadword
CVTTS      Convert T to S_floating
CVTST      Convert S to T_floating

Each page table is 4 Kbytes and holds up to 1024 PTEs. A PTE, shown in Fig. 2.12 (b), contains the starting address of the page frame and access information about the page. Linear address bits 21 to 12 (the table field) are used as an index to select one of the 1024 PTEs. Bits 31 to 12 of the PTE contain the upper 20 bits of the page frame base. The lower 12 bits of the PTE are identical to those of the PDE. The function of the bits currently in use is as follows:
* Bit 6, D-Dirty Bit: D is set before a write to an address covered by the PTE occurs. It is undefined for the PDE.
* Bit 5, A-Accessed Bit: A is set before a read or write access occurs to an address covered by the entry.
* Bit 4, PCD-Page Cache Disable: The PCD bit controls the on-chip cacheability of the page. When PCD=0, the on-chip cache is enabled. When PCD=1, on-chip caching is disabled.
* Bit 3, PWT-Page Write-Through: The PWT bit controls the page write policy. PWT=1 defines a write-through policy for the current page. PWT=0 allows the possibility of write-back. Bits PCD and PWT are also bits 4 and 3 of the CR3. The state of the PCD and PWT bits is driven out on the PCD and PWT pins during a memory access.
Bit 2, User/Supervisor: Bit U/S differentiates between lower-privilege user mode and higher-privilege supervisor mode.
Bit 1, Read/Write: Bit R/W establishes read and write protection privileges for the page.
Bit 0, Present: Present in physical memory.

LDT  Load T_floating
STF  Store F_floating
STG  Store G_floating
STS  Store S_floating
STT  Store T_floating

Floating-point control instructions
There are six floating-point branch instructions. These instructions test the value of a floating-point register Fa and conditionally change the value of the PC. The instructions are summarized in Table 4.7.

Table 4.7 Floating-point branch instructions
Mnemonic  Operation
FBEQ  Floating branch equal
FBGE  Floating branch greater than or equal
FBGT  Floating branch greater than
FBLE  Floating branch less than or equal

[...] or by an immediate value of a literal #b. The number of bytes to extract is specified in the function code. Remaining bytes are filled with zeros.

3. Byte insert: The byte insert instruction INSxx has seven options:
INSxx  Option (Insert)
BL  Byte low
WL  Word low
LL  Longword low
QL  Quadword low
WH  Word high
LH  Longword high
QH  Quadword high
INSxL and INSxH shift bytes from register Ra and insert them into a field of zeros, storing the result in Rc. Register Rbv<2:0> or a literal #b selects the shift amount (0 to 7), and the function code selects the maximum field width: 1, 2, 4 or 8 bytes.

4. Byte mask: The byte mask instruction MSKxx has the same seven options as the byte insert and byte extract instructions (MSKBL, MSKWL, MSKLL, MSKQL, MSKWH, MSKLH, MSKQH). MSKxL and MSKxH set selected bytes of register Ra to zero, storing the result in register Rc. Register Rbv<2:0> or a literal selects the starting position of the field of zero bytes, and the function code selects the maximum width: 1, 2, 4 or 8 bytes.
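The low-order extract, insert and mask operations described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names ext_low, ins_low and msk_low are hypothetical, width stands for the 1, 2, 4 or 8 bytes selected by the function code, registers are modelled as 64-bit unsigned integers, and the high (xH) variants are not modelled.

```python
MASK64 = (1 << 64) - 1  # registers are modelled as 64-bit unsigned values


def _field_mask(width):
    """Mask covering `width` bytes (1, 2, 4 or 8) starting at byte 0."""
    return (1 << (8 * width)) - 1


def ext_low(rav, rbv, width):
    """EXTxL-style extract: shift Ra right by Rbv<2:0> bytes, keep `width` bytes."""
    shift = (rbv & 7) * 8
    return (rav >> shift) & _field_mask(width)


def ins_low(rav, rbv, width):
    """INSxL-style insert: place the low `width` bytes of Ra in a field of zeros."""
    shift = (rbv & 7) * 8
    return ((rav & _field_mask(width)) << shift) & MASK64


def msk_low(rav, rbv, width):
    """MSKxL-style mask: zero `width` bytes of Ra starting at byte Rbv<2:0>."""
    shift = (rbv & 7) * 8
    return rav & ~(_field_mask(width) << shift) & MASK64


def replace_byte(quad, position, new_byte):
    """Replace one byte of a quadword by combining mask and insert."""
    return msk_low(quad, position, 1) | ins_low(new_byte, position, 1)
```

Combining a mask with an insert, as replace_byte does, is the usual way such instructions are paired to update a single byte inside a register without touching its neighbours.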
5. Zero bytes: This group contains two instructions:
ZAP     Zero bytes
ZAPNOT  Zero bytes not
These instructions set selected bytes of register Ra to zero and store the result in register Rc. Register Rbv<7:0> or a literal selects the bytes to be zeroed; bit 0 of Rbv corresponds to byte 0 of Ra, bit 1 of Rbv corresponds to byte 1 of Ra, and so on.
The CMPBGE and ZAP instructions allow very fast implementations of the C language string routines, among other uses.

Floating-point load and store instructions
The floating-point load and store instructions (a total of eight) move floating-point data between memory and floating-point registers. The instructions are summarized in Table 4.6. In the load instructions Fa is the destination register. The memory address is computed by adding a sign-extended displacement to Rbv for both load and store instructions. Register Fa serves as the source register in the store instructions.

Table 4.6 Floating-point load and store instructions
Mnemonic  Operation
LDF  Load F_floating
LDG  Load G_floating
LDS  Load S_floating

2.6 Pentium Cache Organization (EQ NOV06/20 Marks)
Cache Revision (Pl. refer notes of COA for basic understanding)

Cache architecture
A cache system (the cache and cache controller) can be interfaced to the processor in two ways:
i. Look-through cache architecture
ii. Look-aside cache architecture

Cache Coherency
It is required that the data present in the cache and main memory are exact duplicates of each other, i.e. they should always provide the processor or any other bus master with the latest copy of the information. This is known as maintaining cache coherency or consistency. When the processor updates any information in the cache, the same change should be made in the main memory before any other bus master tries to access it.
This is achieved by the cache controller by following one of these write policies:
i. Write through
ii. Buffered write through
iii. Write back

First and Second Level Caches
The 80486 and the Pentium processors have an internal cache. This cache is known as the level 1 cache or L1 cache. The L1 cache provides the processor with the most often used code and data and is usually small in size (4 KB to 64 KB). A second level (L2) cache can also be added between the L1 cache and the main memory. L2 caches are usually larger than L1 caches (64 KB to 512 KB). L1 caches are thus subsets of the L2 caches. The use of two cache levels substantially increases the hit rate, because when a miss occurs in the L1 cache, the L2 cache can provide the required information in zero wait states.

Pentium Processor Cache: General Features
* The Pentium processor has separate code and data caches, each of 8 KB.
* The cache line size is 32 bytes.
* Since the Pentium processor has a data bus of 8 bytes (64 bits), it requires a burst of four consecutive transfers to fill a cache line of 32 bytes.
* Each cache is organized as two-way set-associative.
* The data cache can be configured as a write-through or a write-back cache on a line-by-line basis, and it follows the MESI protocol.
* The code cache does not require a write policy, as it is a read-only cache.
* Each cache has a dedicated translation look-aside buffer (TLB) to translate linear addresses to physical addresses.
* The data cache tags are triple ported to support two data transfers and a snoop cycle in the same clock.
* The code cache tags are also triple ported to support snooping and split line access
Table 4.5 Logical and shift instructions
Mnemonic  Operation
AND     Logical AND
BIC     Logical AND with complement
BIS     Logical OR
EQV     Logical equivalence (XORNOT)
ORNOT   Logical OR with complement
XOR     Exclusive OR
CMOVxx  Conditional move integer
SLL     Shift left logical
SRA     Shift right arithmetic
SRL     Shift right logical

The logical operations are performed as follows:
Mnemonic  Operation
AND    Rc ← Rav AND Rbv
BIC    Rc ← Rav AND (NOT Rbv)
BIS    Rc ← Rav OR Rbv
ORNOT  Rc ← Rav OR (NOT Rbv)
XOR    Rc ← Rav XOR Rbv
EQV    Rc ← Rav XOR (NOT Rbv)

Byte-manipulation instructions
The Alpha architecture features five types of byte-manipulation instruction (a total of 24 instructions) that operate within registers. This is an unusual feature compared to other systems. The byte-manipulation instructions can be used with the load and store unaligned instructions to manipulate short unaligned strings of bytes. All of the byte-manipulation instructions have the same operand specifications as the logical and shift instructions.
1. Compare byte: This group contains a single instruction, CMPBGE (compare byte greater or equal). It does eight parallel unsigned byte comparisons between corresponding bytes of Rav and Rbv, storing the eight results in the low 8 bits of Rc.
2. Extract byte: This group features seven options for the EXT instruction:
EXTxx  Option (Extract)
BL  Byte low
WL  Word low
LL  Longword low
QL  Quadword low
WH  Word high

simultaneously.
* Individual pages in the main memory can be configured as cacheable or non-cacheable by software or hardware.
* The cache can be enabled or disabled by software or hardware.
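The parameters in the general features list (8 KB per cache, 32-byte lines, two ways, 64-bit data bus) fix the rest of the cache geometry. The short Python sketch below derives it and splits a 32-bit address accordingly. The function and field names are illustrative and are not taken from the notes.

```python
CACHE_BYTES = 8 * 1024   # size of each Pentium L1 cache (code or data)
LINE_BYTES = 32          # cache line size
WAYS = 2                 # two-way set-associative
BUS_BYTES = 8            # 64-bit data bus
ADDRESS_BITS = 32


def cache_geometry():
    """Derive way size, lines per way, address field widths and burst length."""
    way_bytes = CACHE_BYTES // WAYS
    lines_per_way = way_bytes // LINE_BYTES
    offset_bits = LINE_BYTES.bit_length() - 1      # log2(32) = 5
    index_bits = lines_per_way.bit_length() - 1    # log2(128) = 7
    tag_bits = ADDRESS_BITS - index_bits - offset_bits
    return {
        "way_bytes": way_bytes,
        "lines_per_way": lines_per_way,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "transfers_per_line_fill": LINE_BYTES // BUS_BYTES,
    }


def split_address(addr):
    """Split a 32-bit address into (tag, line index, byte offset)."""
    g = cache_geometry()
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> g["offset_bits"]) & (g["lines_per_way"] - 1)
    tag = addr >> (g["offset_bits"] + g["index_bits"])
    return tag, index, offset
```

The derived values (4 KB per way, 128 lines per way, a 20-bit tag, 7-bit index and 5-bit byte offset, and four bus transfers per line fill) agree with the figures quoted elsewhere in these notes.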
Handling of Memory Read Transaction
The sequence of events when a memory read request is made to the L1 cache by one of the execution units (to the data cache) or by the prefetcher (to the code cache) is given below.
1. If an L1 cache hit occurs, the request is immediately fulfilled.
2. If an L1 cache miss occurs, the Pentium processor must perform a cache line fill from the L2 cache or external memory.
3. The L2 cache now detects the memory read bus cycle and checks its directory to see if it has a copy of the requested information.
4. If it is an L2 cache hit, the L2 cache asserts KEN# to indicate that the address is cacheable. The L2 cache then supplies the data in a burst of four consecutive 64-bit transfers (cache line fill operation).
5. If it is an L2 cache miss, the L2 cache passes the bus cycle to the system bus. The NCA (non-cacheable address) logic decodes this address to determine whether the address is cacheable.
6. If the address is non-cacheable, the NCA logic deasserts KEN#. Thus the bus cycle is not converted into a cache line fill; instead a single-transfer bus cycle is run to fetch the requested information directly from memory.
7. If the address is cacheable, the NCA logic asserts KEN# (the processor has already asserted CACHE#), and the read cycle is converted into a cache line fill for both the L2 and L1 caches.
8. The L2 cache copies the first quadword (8 bytes) of data into its cache line-fill buffer and simultaneously forwards it to the processor while asserting BRDY# to indicate that valid data is present on the processor's data bus.
9. The Pentium processor reads the first quadword and stores it in its cache line-fill buffer. This quadword is the one that was requested, and hence it is immediately passed on to the requester (execution unit or prefetcher). The next three quadwords are awaited and, as they are received, they are stored in the buffer.
When the entire line is received, both L1 and L2 caches copy the line from their respective line-fill buffers into their respective caches.

The Write Once Policy & the MESI Cache Consistency Model
The MESI (Modified-Exclusive-Shared-Invalid) protocol provides a method to maintain cache coherency. The full MESI protocol applies only to the data cache; the code cache uses only the S and I states. Each line in the data cache can be in one of the four MESI states, as indicated by two bits stored along with the tag address.
Modified
It indicates that this line in the cache has been updated or modified due to a write hit in the cache. In this case, when the cache subsystem snoops the system bus and finds a snoop hit, it should write the modified line back to memory (update the memory).

3. Describe subroutine and coroutine returns: By marking each branch and jump as 'call', 'return' or 'neither', the architecture gives an implementation enough information to maintain a small stack of likely subroutine return addresses.
The conditional move instructions and the branching hints eliminate some branches and speed up the remaining ones without compromising multiple instruction issue.

Integer arithmetic instructions
The integer arithmetic instructions of Alpha (a total of 20) are listed in Table 4.4.

Table 4.4 Integer arithmetic instructions
Mnemonic  Operation
ADDL    Add longword
ADDQ    Add quadword
S4ADDL  Scaled add longword by 4
S8ADDL  Scaled add longword by 8
S4ADDQ  Scaled add quadword by 4
S8ADDQ  Scaled add quadword by 8
CMPEQ   Compare signed quadword equal
CMPLT   Compare signed quadword less than
CMPLE   Compare signed quadword less than or equal

disp<15:14>  Mnemonic       Predicted Target<15:0>   Prediction Stack Action
00           JMP            PC + {4*disp<13:0>}      -
01           JSR            PC + {4*disp<13:0>}      Push PC
10           RET            Prediction stack         Pop
11           JSR_COROUTINE  Prediction stack         Pop, push PC

Table 4.3 Integer control instructions
Mnemonic  Operation
BEQ   Branch if Rav = 0
BGE   Branch if Rav >= 0
BGT   Branch if Rav > 0
BLBC  Branch if Ra LSB = 0
BLBS  Branch if Ra LSB = 1
BLE   Branch if Rav <= 0
BLT   Branch if Rav < 0
BNE   Branch if Rav not equal to 0
BR    Unconditional branch
BSR   Branch to subroutine
JMP   Jump
JSR   Jump to subroutine
RET   Return from subroutine
JSR_COROUTINE  Jump to subroutine return

The Alpha architecture specifies three types of branching hints in instructions:
1. Architected static branch prediction rule: forward conditional branches are predicted not-taken, and backward ones taken. To the extent that compilers and hardware implementations follow this rule, programs can run more quickly with little hardware cost. This hint does not preclude doing dynamic branch prediction in an implementation, but it may reduce the need to do so.
2. Describes computed jump targets: Otherwise unused instruction bits are defined to give the low bits of the most likely target, using the same target calculation as unconditional branches. The 14 bits provided are enough to specify the instruction offset within a page, which is often enough to start a first-level instruction cache fetch many cycles before the actual target value is known.

When the prefetcher issues an instruction request, the code cache is checked to see if a copy is available. Assuming a cache miss, a cache line-fill request is made to the bus unit, i.e. a cache line is brought in from L2 cache/memory.
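The read-transaction sequence given earlier in this section (L1 lookup, L2 lookup, then the KEN# decision made by the NCA logic) can be summarised as a small decision function. The Python sketch below is an illustration only: the function name memory_read and the wording of the event strings are hypothetical, and bus timing is not modelled.

```python
def memory_read(l1_hit, l2_hit=False, cacheable=True):
    """Return (events, transfers) for one read request.

    `transfers` is the number of 64-bit transfers seen on the processor bus.
    """
    if l1_hit:
        # Step 1: the L1 cache satisfies the request, no bus cycle is run.
        return ["L1 hit: request fulfilled immediately"], 0

    events = ["L1 miss: read bus cycle started with CACHE# asserted"]
    if l2_hit:
        # Step 4: L2 asserts KEN# and supplies the whole line in a burst.
        events.append("L2 hit: L2 asserts KEN# and bursts four 64-bit transfers")
        events.append("L1 line filled")
        return events, 4

    # Step 5: the cycle goes to the system bus and the NCA logic decides.
    events.append("L2 miss: cycle passed to system bus, NCA logic decodes address")
    if cacheable:
        # Step 7: line fill for both caches.
        events.append("KEN# asserted: line fill of four 64-bit transfers")
        events.append("L1 and L2 lines filled")
        return events, 4

    # Step 6: non-cacheable address, single transfer only.
    events.append("KEN# deasserted: single transfer, no line fill")
    return events, 1
```

For example, memory_read(False, False, False) describes a read of a non-cacheable location: both caches miss and exactly one transfer is run.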
The 32-bit address given by the processor is interpreted as shown in the figure:
TAG/PAGE (20 bits) | INDEX (7 bits) | BYTE (5 bits)
Figure 2.14 - Interpretation of 32-Bit Address

The access attribute is a letter denoting the operand access type. It may be one of the following:
Access type  Meaning
a  Used in address calculation: "al" means scale by 4 (longwords), "aq" means scale by 8 (quadwords), "ab" means the operand is in byte units
i  The operand is an immediate literal
r  The operand is read only
m  The operand is both read and written
w  The operand is write only

The type attribute is a letter denoting the data type of the operand. It may be one of the following:
Data type  Meaning
b  Byte
f  F_floating
g  G_floating
l  Longword
q  Quadword
s  S_floating
t  T_floating
w  Word
x  Specified by the instruction

Integer load and store instructions
The memory access integer load and store instructions (a total of 12) are summarized in Table 4.2.

Table 4.2 Integer load and store instructions
Mnemonic  Operation
LDA    Load address
LDAH   Load address high
LDL    Load sign-extended longword
LDQ    Load quadword
LDQ_U  Load quadword unaligned
LDL_L  Load sign-extended longword locked
LDQ_L  Load quadword locked
STL    Store longword
STQ    Store quadword
STL_C  Store longword conditional
STQ_C  Store quadword conditional
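The load and store instructions of Table 4.2 use the memory instruction format, in which a 16-bit displacement is sign-extended and added to the contents of Rb. A minimal Python sketch of that address calculation follows. The helper names are hypothetical, and the LDAH behaviour (displacement scaled by 65536) is taken from the Alpha architecture definition rather than from these notes.

```python
MASK64 = (1 << 64) - 1  # Alpha virtual addresses are modelled as 64-bit values


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def effective_address(rbv, disp16):
    """Virtual address for a load/store: Rbv + SEXT(disp), wrapped to 64 bits."""
    return (rbv + sign_extend(disp16, 16)) & MASK64


def lda(rbv, disp16):
    """LDA: the computed address itself is the result written to Ra."""
    return effective_address(rbv, disp16)


def ldah(rbv, disp16):
    """LDAH (assumed semantics): Rbv + SEXT(disp) * 65536."""
    return (rbv + sign_extend(disp16, 16) * 65536) & MASK64
```

Under these assumptions a 32-bit constant such as 0x12345678 could be built with the pair lda(ldah(0, 0x1234), 0x5678); a negative displacement such as 0xFFF8 moves the address 8 bytes below the base in Rb.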
Figure 4.6 Alpha instruction formats: Operate, Floating-Point Operate and PAL Code instruction formats (check corrections)

Addressing modes: (Pl. refer class notes for examples)
The Alpha architecture features four simple addressing modes, as practiced on RISC-type systems:
* Register
* Immediate
* Register indirect with displacement
* PC-relative

Instruction set:
The Alpha architecture features the following types of instructions:
1. Integer load and store
2. Integer control
3. Integer arithmetic
4. Logical and shift
5. Byte manipulation
6. Floating-point load and store
7. Floating-point control
8. Floating-point operate
9. Miscellaneous

An instruction operand is specified by the following attributes: name, access type and data type.
The name may be any of the registers Ra, Rb, Rc, Fa, Fb, Fc, or one of:
disp  The displacement field of the instruction
fnc   The PAL function field of the instruction
#b    An integer literal operand in the Rb field of the instruction

Split-Line Access: NOV06/5 Marks (Pl. refer class notes)
In a CISC processor, instructions are of variable length. In the Pentium processor the smallest instruction is one byte, while the maximum legal length is 15 bytes. A code cache miss always results in a 32-byte cache line fill, if it is a cacheable address.
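Because instructions are 1 to 15 bytes long while a code cache line holds 32 bytes, an instruction can begin in one line and end in the next. A minimal Python sketch of that check is given here; the function name straddles_line is hypothetical and the address is treated as a plain byte address.

```python
LINE_BYTES = 32        # code cache line size
MAX_INSTR_BYTES = 15   # maximum legal Pentium instruction length


def straddles_line(addr, length):
    """Return True if an instruction of `length` bytes at `addr` crosses a line."""
    if not 1 <= length <= MAX_INSTR_BYTES:
        raise ValueError("instruction length must be 1 to 15 bytes")
    first_line = addr // LINE_BYTES                 # line holding the first byte
    last_line = (addr + length - 1) // LINE_BYTES   # line holding the last byte
    return first_line != last_line
```

A 4-byte instruction at offset 28 of a line fits exactly, while the same instruction at offset 30 spills two bytes into the following line.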
Multi-byte instructions may straddle two sequential lines stored in the code cache. When the prefetcher determines that an instruction straddles two lines, it would have to perform two sequential cache accesses, which would hamper performance. For this reason the Pentium processor incorporates a split-line access, which allows the upper half of one line and the lower half of the next line to be accessed in one cycle. When a split-line access is made, the bytes must be rotated so that they are in proper order. For the split access to work efficiently, instruction boundaries within the cache line need to be defined. When an instruction is decoded for the first time, the length of the instruction is fed back to the cache. Each code cache entry marks instruction boundaries within the line so that, if necessary, split-line accesses can be performed.

LIT is an 8-bit literal value from 0 to 255.
All instruction formats have a 6-bit major opcode field (bits <31:26>). Any unused register field (5 bits) of an instruction (Ra, Rb, Fa, or Fb) must be set to a value of 31 (11111 binary). The five instruction formats are described below.
1. Memory instruction format
This format is used to transfer information between registers and memory, to load an effective address, and for subroutine jumps. The Memory_disp field is a byte offset. It is sign-extended and added to the contents of register Rb to form a virtual address. The virtual address is used as a memory load/store address or a result value, depending on the specific instruction. For some instructions, the Memory_disp field is replaced by the Function field. It serves as an extension of the opcode that designates a set of miscellaneous instructions.
2.
Branch instruction format
The branch format is used for conditional branch instructions (in which case the Ra field specifies the register whose value is tested) and for PC-relative subroutine jumps. As each instruction is decoded, the PC value is advanced to point to the next sequential instruction. The new PC value is referred to as the updated PC. The Branch_disp field is treated as a longword offset. It is shifted left 2 bits (to address a longword boundary), sign-extended to 64 bits, and added to the updated PC value to form the target virtual address.
3. Operate instruction format
The operate instruction format is used for instructions that perform integer register-to-register operations. Fields Ra and Rb specify source operands. Field Rc specifies the destination. The Function field is an extension of the opcode. If bit 12 is 0, Rb specifies a source register operand. If bit 12 is 1, an 8-bit zero-extended literal constant is formed by bits <20:13> of the instruction. The literal is interpreted as a positive integer between 0 and 255 and is zero-extended to 64 bits.
4. Floating-point operate instruction format
This format is used for instructions that perform floating-point register-to-register operations. The Fa and Fb fields specify floating-point register source operands. The Fc field specifies the destination. Floating-point convert instructions use a subset of the floating-point operate format and perform register-to-register conversion operations. The Fb operand specifies the source, the Fa field must be F31 (i.e. zero) and Fc is the destination.
5. PAL code instruction format
The Privileged Architecture Library (PAL) code format is used to specify extended processor functions. The 26-bit PAL code Function field specifies the particular PAL code operation. The source and destination operands for PAL code instructions are supplied in fixed registers that are specified in the individual instruction descriptions. (VIVA: An opcode of zero and a PAL code function of zero specify the HALT instruction.)

Memory Instruction Format:  Opcode <31:26> | Ra <25:21> | Rb <20:16> | Memory_disp <15:0> (or Function <15:0>)
Branch Instruction Format:  Opcode <31:26> | Ra <25:21> | Branch_disp <20:0>

The Data Cache
The data cache is 8 KB and is organized as a 2-way set-associative cache. The two ways are called way 0 and way 1. Each cache line is 32 bytes.
* Total size of cache = 8 KB
* Size of each way = 4 KB
* Cache line size = 32 bytes
* Number of lines per way = 128

Operation of Internal Data Cache
Each 4 KB cache way is divided into 128 lines. There are thus correspondingly 128 entries in each tag directory. Each directory entry stores a 20-bit tag (page) address A[31:12]. The entry also contains two state bits (to indicate one of the four states M, E, S or I) and a parity bit P.
Each data cache line is 32 bytes, or eight doublewords. Parity is generated for each byte within a data cache line, as shown in the figure.
When a byte of information is read from the data cache, the parity is checked. On detecting a parity error, an internal parity error is signaled to external logic through the IERR# (Internal Error) output. The processor also generates a special shutdown bus cycle and stops execution.
The data cache itself is single ported, but the cache directories are triple ported to allow access from both pipelines (U and V) and to allow an external snoop simultaneously.
The interpretation of the 32-bit address as viewed by the internal data cache controller is shown below.
* A31:A12 identify the page in which the target location resides.
* A11:A5 identify the line that the target address occupies within the page (hence its position in the cache way).
* A4:A2 identify which doubleword within the line the target address occupies (hence the internal data cache bank in which the addressed data resides).
* A1:A0 are not used (don't care).

Figure 4.5 Integer data storage in memory and FPU registers

Instruction formats
All Alpha instructions are 32 bits long. The Alpha architecture features five basic instruction formats, illustrated in Fig. 4.6. The notation used in Fig. 4.6 is the following:
Ra, Rb, Rc are integer register operands.
Fa, Fb, Fc are floating-point register operands.
disp is a displacement, added to the value in Rb to form a virtual address.
SBZ: Should Be Zero.

Figure - Internal Data Cache Structure (way 0 and way 1, each with bank select logic; ports for pipeline U, pipeline V and snoop)

Figure 4.2 Memory storage of VAX floating-point formats
Figure 4.3 Register format of VAX floating-point formats

IEEE floating-point formats
The IEEE standard features the single-precision 32-bit (S_floating) and the double-precision 64-bit (T_floating) formats. Their memory and floating register storage are illustrated in Fig. 4.4. Location A may be anywhere in memory, but for better performance it should be naturally aligned, as in the case of the VAX formats.
Longword and quadword integers may be stored in FPU registers. Their storage in memory and FPU registers is illustrated in Fig. 4.5.

Request From U and V Pipelines
When the U and V pipelines request information and the operands are present in different banks (banks are selected by A[4:2]), both operands can be accessed simultaneously. If the U and V pipelines simultaneously require data from the same bank, a bank conflict occurs, since the data cache banks are single ported. In such cases the U-pipe access is completed first and the V-pipe is made to wait. Thus a bank conflict incurs a one-clock penalty on the V-pipe instruction. If the requests from both the U and V pipes happen to be cache misses, two cache line-fill requests are made to the bus controller at the same time. The U-pipe read occurs first, followed by the V-pipe read.

EQ: Anatomy of a Read Hit and Miss - Cache Line Fill Algorithm
The steps involved when the execution unit of the processor requests a memory read cycle are:
1. The directory entries at the index given by A[11:5] in both cache ways are checked.
2. If in both ways the state bits are 'I', then no further checking is required and it is a cache miss.
3. If in any one of the ways the state bits are other than I (i.e. M, E or S), then the page number or tag given by A[31:12] is compared with the corresponding tag entry.
4. If these tags match, the required line is available (cache read hit). The required data is then made available immediately.
5. If not, step (3) is repeated for the other way.
6. Also, each directory entry has a parity bit that is generated and written each time the directory is updated. This parity is checked each time the tag entry is accessed to check for a hit or miss. If a parity error is detected, IERR# is asserted, a shutdown special cycle is run and the processor stops.

2.7 Pentium Bus Operation
The Pentium processor supports a number of different types of bus cycles. The three main types are:
i. Single Transfer Cycles
ii. Burst Cycles
iii. Special Cycles

Architecture of Alpha AXP:
Data types - The Alpha architecture recognizes the following data types.
Integer data types
* Byte, 8 bits. Basic addressable unit.
* Word, 16 bits. Two contiguous bytes starting on an arbitrary byte boundary. A word is addressed by the address of its least significant byte (the byte that contains bit zero).
* Longword, 32 bits. Four contiguous bytes starting on an arbitrary byte boundary. A longword is addressed by the address of its least significant byte.
* Quadword, 64 bits. Eight contiguous bytes starting on an arbitrary byte boundary. A quadword is addressed by the address of its least significant byte. In a 64-bit integer, bit 63 is the sign bit.
The Alpha integer data types are shown in Fig. 4.1.
Figure 4.1 Alpha integer data types
Although words, longwords, and quadwords may be stored at any byte address, better performance can be achieved if they are naturally aligned.
That is, longwords are stored at addresses divisible by 4 (the low-order 2 bits of the address are zero), and quadwords are stored at addresses divisible by 8 (the low-order 3 bits are zero).

Floating-point data types
The Alpha architecture features two groups of floating-point data types:
* VAX floating-point formats, for backward compatibility with VAX software
* IEEE standard (IEEE 754) floating-point formats, as practiced in practically all other modern systems

VAX floating-point formats
The Alpha architecture features three VAX floating-point formats:
1. F_floating, 32 bits
2. G_floating, 64 bits (11-bit exponent)
3. D_floating, 64 bits (8-bit exponent)
The memory storage of the above formats is illustrated in Fig. 4.2, and their CPU floating register storage is illustrated in Fig. 4.3. Although A may be any address in memory, better performance will be attained if A is naturally aligned (divisible by 4 for F, and by 8 for G and D formats). The main difference between the G and D formats is that G has an 11-bit exponent field, while D has an 8-bit one. Thus the G format has a much higher range. The D format is not fully supported on the Alpha, and no D_floating arithmetic operations are provided. For VAX compatibility, exact D_floating arithmetic may be provided by software emulation.

Single Transfer Bus Cycles Using the Bus Cycle Definition Signals
The table shows all of the bus cycles initiated by the Pentium processor except the special cycles.
M/IO#  D/C#  W/R#  CACHE#  KEN#  Cycle description                               No. of transfers
0      0     0     1       x     Interrupt acknowledge (2 locked cycles)         1 transfer each cycle
0      0     1     1       x     Special cycle                                   1
0      1     0     1       x     I/O read, 32 bits or less, non-cacheable        1
0      1     1     1       x     I/O write, 32 bits or less, non-cacheable       1
1      0     0     1       x     Code read, 64 bits, non-cacheable               1
1      0     0     0       0     Code read, 256-bit burst line fill              4
1      1     0     1       x     Memory read, 64 bits or less, non-cacheable     1
1      1     0     0       0     Memory read, 256-bit burst line fill            4
1      1     1     1       x     Memory write, 64 bits or less, non-cacheable    1

Single transfer bus cycles are categorized into two classes: Non-Pipelined Cycles and Pipelined Cycles.

Memory Read & Write Bus Cycles - Non pipelined
It is the simplest type of bus cycle, either with or without wait states. The following figure shows non-pipelined memory read (zero wait state) and write (one wait state) cycles. The cycle is divided into two T-states, T1 and T2. The sequence of operations performed in the two T-states is as follows.

T1 State
It is also called the address phase, since the address is placed on the bus during this state. The processor initiates the cycle by asserting the address status (ADS#) signal. The ADS# output indicates that a valid bus cycle definition and address are available on the cycle definition pins (M/IO#, D/C#, W/R#) and the address bus (A31:A3, BE0# to BE7#). The CACHE# output is deasserted (high) to indicate a single transfer cycle.

T2 State
It is also known as the data phase, as data is sent out or received in this state. If it is a write operation, the processor drives the data over the data bus at the beginning of the T2 state. At the end of the T2 state clock, the processor samples the BRDY# signal (this signal is generated by the memory subsystem and sent to the microprocessor). If asserted, BRDY# indicates that the external memory subsystem has presented valid data in response to a read, or has accepted data in response to a write.
If BRDY# is found not asserted, the processor is forced to insert another T2 state, i.e. one wait state. Any number of wait states can be added to Pentium processor bus cycles by keeping BRDY# inactive. The deasserted BRDY# signal indicates that the system is not ready to drive or to accept data.

such as IBM RS/6000 it is close to 200.
Among the RISC manufacturers there are companies that started with a RISC product, such as MIPS Computer Systems (now a part of Silicon Graphics) with its Rx000 series, and Sun Microsystems with its SPARC. There are other manufacturers, known for their CISC microprocessor families, who also started their own RISC families, such as Intel, with its x86 family, which started the RISC i860 family, and Motorola, with its M68000 family, which started the RISC M88000 family.
Of particular note is IBM, which was actually the first to start with the development of an experimental RISC system, the 801, and now features the RISC System/6000. This effort is continued jointly by IBM, Motorola, and Apple in creating a new RISC-type family of microprocessors, called PowerPC, with the 6xx series. DEC, some of whose professionals opposed the RISC idea in the beginning, now features its own RISC product, the Alpha AXP, considered to be one of the fastest microprocessors of the early nineties.
Practically all new RISC-type products, as well as some CISC, are superscalar. The MIPS R4000 and R4400 are superpipelined. Of the superscalar systems, the majority are two-issue.
The application of RISC processors is widening. Generally speaking, most RISC processors are universal and their field of application is not limited.
However, some of the most notable recent RISC applications are in workstations, multiprocessors, and real-time systems, primarily because of their superior performance at a relatively low cost. The application area of RISC is expected to widen in the future.

Most RISC systems implement the following features, although they may not constitute the basic principles of RISC.
* HLL support
* Implementation of register windows
* Pipelining
* Delayed branch
* Scoreboarding
* Dual cache
* ILP (Instruction Level Parallelism: superscalar or superpipelined)

4.3 The Alpha AXP Architecture & Features:
Features:
The Alpha is a 64-bit RISC-type microprocessor manufactured by Digital Equipment Corporation (DEC) in 1991. Its features are listed below:
* It is a two-issue superscalar implementation of instruction level parallelism (ILP).
* It has a dual cache with 8 Kbyte for code and 8 Kbyte for data.
* It has an on-chip Floating Point Unit (FPU).
* It has an on-chip Memory Management Unit (MMU).
* Instruction size is fixed at 32 bits, supporting three operands.
* Operations are register-to-register; memory is accessed by load and store instructions only.
* It has two sets of thirty-two 64-bit registers, R0 to R31 for the Integer Unit (IU) and F0 to F31 for the FPU.
* It is byte addressable, i.e. its basic addressable unit is the byte.
* Memory is accessed using a 64-bit virtual address. The minimum virtual address is 43 bits.
* It operates at frequencies starting at 150 MHz and reaching 300 MHz.

Ti State
Every bus cycle starts and ends with the idle state Ti. This idle state is required because of timing constraints associated with faster bus speed.
Hence most of the signals, including the valid address, bank enables and bus cycle definition signals, extend slightly into the idle state following T2. Here CACHE# and KEN# are both deasserted to indicate that the bus cycles are not dealing with the internal cache of the Pentium.

[Figure: timing of the non-pipelined memory read and write bus cycles; the surviving label is "Data to processor"]

Chapter 4 - DEC Alpha AXP Family
Notes by Faruk Kazi

4.1 RISC versus CISC:
The microprocessor families Intel x86 and Motorola M68000 are known for their abundant instruction sets, multiple addressing modes, and multiple instruction formats and sizes. Their control is microprogrammed, and different instructions execute within a different number of cycles. The control units of such microprocessors are naturally complex, since they have to distinguish between a large number of opcodes, addressing modes, and formats. This type of system belongs to the category called Complex Instruction Set Computer (CISC).

As opposed to the traditional CISC design, in the early eighties there emerged a new trend of computer design called RISC (Reduced Instruction Set Computer). What is "reduced" in a RISC? Practically everything: the number of instructions, addressing modes, and formats. In an ideal RISC all instructions have the same size (usually 32 bits) and execute within a single CPU cycle. In practice, only the majority of the instructions (over 80 percent in most RISC systems) execute in a single cycle. Some of the important RISC properties are listed below.
* Single-cycle execution of all (or at least most, over 80 percent) instructions
* Single-word standard length of all instructions
* Small number of instructions, not to exceed about 128
* Small number of instruction formats, not to exceed about
* Small number of addressing modes, not to exceed about 4
* Memory access by load and store instructions only
* All operations, except load and store, are register-to-register within the CPU
* Hardwired control unit
* A relatively large (at least 32) general-purpose CPU register file

Advantages of RISC:
The advantages of RISC-based microprocessors can be summarized as:
* VLSI realization
* Computing speed
* Design cost and reliability
* HLL support

Shortcomings of RISC:
RISC shortcomings are directly related to some of its points of advantage. The principal RISC disadvantage is its reduced number of instructions. Since a RISC has a small number of instructions, a number of functions performed on a CISC by a single instruction will need two, three, or more instructions on a RISC. This in turn causes the RISC code to be longer. More memory has to be allocated for RISC programs, and the instruction traffic between the memory and the CPU is increased. Recent studies have shown that, on average, a RISC program is about 30 percent longer than a CISC program performing the same function. This is because only a minority of the instructions is used most of the time, and this minority is usually featured on RISC systems.

4.2 Overview of RISC Development and Current Systems:
As can be seen from the preceding discussion, the RISC concept is not quite clear-cut; it has both advantages and shortcomings. It has encountered opposition right from its inception. The RISC controversy continued over a number of years. Notwithstanding the controversy, an important fact is notable: there are a considerable number of commercial computer products announced as RISC-type by their manufacturers.
To be sure, some of them do not adhere to all the RISC properties specified above. One particular RISC "violation" is in the number of instructions. In some systems

Non-Pipelined I/O Read and Write Bus Cycles:
The figure illustrates I/O read and write cycles. The I/O read shown in the figure is a zero wait state transfer. The subsequent I/O write is a one wait state transfer. It is important to note that the CACHE# signal in these cycles is always deasserted and KEN# is not sampled.

Figure - Timing of an I/O Read Followed by an I/O Write Bus Cycle (Non-Pipelined)

Single Transfer Bus Cycle (Pipelined)
When back-to-back cycles are run to memory or I/O devices that require one or more wait states to complete a transfer, pipelining can improve performance. These devices must decode the address and assert the NA# (Next Address) signal. When the processor samples NA# asserted, it drives the next pending bus cycle early, before the current bus cycle completes. This allows devices designed to take advantage of pipelining to decode the address early in preparation for the next transfer. These devices can also latch the current data access and return that data to the processor while starting the next access during the current cycle.

The following figure illustrates two bus cycle transfer sequences, one without pipelining and the other with pipelining. The first sequence consists of three back-to-back bus cycles to a device requiring two wait states to complete each transfer.
The processor samples NA# deasserted at all sample points, therefore the next cycle is not pipelined early. The second sequence consists of the same three back-to-back bus cycles. However, in this example NA# is sampled asserted by the processor. This causes the processor to start the next bus cycle prior to completing the first. Notice that the three pipelined cycles complete in four fewer clock cycles than the non-pipelined transfers.

Special Cycles:
The special cycles are indicated by the byte enable signals BE0# to BE5#. These signals define six special cycles, described in the table below. The bus cycle definition pins for them are in the following state: M/IO# = 0, D/C# = 0 and W/R# = 1.

BE7# | BE6# | BE5# | BE4# | BE3# | BE2# | BE1# | BE0# | Special Bus Cycle
1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | Shutdown
1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | Flush (INVD, WBINVD instruction)
1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | Halt
1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | Write-Back (WBINVD instruction)
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | Flush Acknowledge
1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | Branch Trace Message
Table - Special Bus Cycles

During special cycles, the data bus is undefined or floated and the address lines A31-A3 are driven to '0'. When the external logic detects that a special cycle is in progress, it decodes the byte enables to determine which special cycle is being run.

* Shutdown Special Cycle
The shutdown cycle is executed for the following reasons:
i. Any other exception occurs while the Pentium is attempting to invoke the double-fault handler (triple fault situation).
ii. An internal parity error is detected.
During shutdown, the internal caches remain in the same state unless an inquire cycle is run or the cache is flushed. The pins FLUSH#, SMI# and R/S# are recognized during this state. The processor comes out of shutdown if NMI, INIT or RESET is asserted.
* Halt Special Cycle
The processor executes the halt cycle when a HLT instruction is executed. During halt, the internal processor state is the same as during the shutdown cycle. The halt cycle can be recognized externally because the byte enables are driven differently from those of the shutdown cycle. The Pentium processor exits the halt state if INTR is asserted while maskable interrupts are enabled, or if NMI, INIT or RESET is asserted.

* Branch Trace Message Special Cycle
If the execution tracing enable bit (bit 1) in Test Register 12 (TR12) is set to one, the processor generates a branch trace message special cycle whenever a branch is taken. The processor also asserts the IBT (Instruction Branch Taken) pin. As in the other special bus cycles, the data bus is undefined and the bus cycle definition signals are set the same way. The only difference is that the processor does not drive '0's on the address bus. The following is driven on the address bus during a branch trace message special cycle:
A31-A3: Bits 31-3 of the branch target linear address.
BT2-BT0: Bits 2-0 of the branch target linear address (the byte enables should not be decoded for A2-A0).

Intel Dual-Core Processors: (for VIVA)
In April of 2005, Intel announced the Intel Pentium processor Extreme Edition, featuring an Intel dual-core processor, which can provide immediate advantages for people looking to buy systems that boost multitasking computing power and improve the throughput of multithreaded applications. An Intel dual-core processor consists of two complete execution cores in one physical processor, both running at the same frequency. Both cores share the same packaging and the same interface with the chipset/memory. Overall, an Intel dual-core processor offers a way of delivering more capabilities while balancing energy-efficient performance, and is the first step in the multi-core processor future.
An Intel dual-core processor-based PC will enable new computing experiences, as it delivers value by providing additional computing resources that expand the PC's capabilities in the form of higher throughput and simultaneous computing. Imagine that a dual-core processor is like a four-lane highway: it can handle up to twice as many cars as its two-lane predecessor without making each car drive twice as fast. Similarly, with an Intel dual-core processor-based PC, people can perform multiple tasks, such as downloading music and gaming, simultaneously.

When combined with Hyper-Threading Technology (HT Technology), the Intel dual-core processor is the next step in the evolution of high-performance computing. Intel dual-core products supporting Hyper-Threading Technology can process four software threads simultaneously by more efficiently using resources that otherwise may sit idle.

A new Intel dual-core processor-based PC gives people the flexibility and performance to handle robust content creation or intense gaming while simultaneously managing background tasks such as virus scanning and downloading. Cutting-edge gamers can play the latest titles and experience ultra-realistic effects and gameplay. Entertainment enthusiasts will be able to create and improve digital content while encoding other content in the background.

The new Intel Core Duo processors have ushered in a new era in processor architecture design in which multi-core processors become the standard for delivering greater performance, improved performance per watt, and new capabilities across Intel's desktop, mobile, and server platforms. The Intel dual-core products also represent a vital first step on the road to realizing Platform 2015, Intel's vision for the future of computing and the evolving processor and platform architectures that support it.

BT3: High if the default operand size is 32 bits. Low if the default operand size is 16 bits.

* Flush Special Cycle
This cycle is generated by the execution of two instructions:
a. INVD - Invalidate
When this instruction is executed, the processor sets all cache entries to the I (invalid) state and runs a flush special cycle. This cycle notifies external logic that all internal cache lines have been invalidated and that the L2 cache, if present, should also invalidate itself. In this case, modified data in L2 is not written back to memory and is hence lost.
b. WBINVD - Write Back and Invalidate
The execution of this instruction causes all modified data to be written back to memory, and then all the cache entries are invalidated. As each modified line is written back, the processor invalidates its entry. After all the lines are written back, the processor runs a write-back special cycle followed by a flush special cycle. This informs the L2 cache that it should invalidate its entries after writing back to main memory.

* Write-back Special Cycle
This cycle is run after the WBINVD instruction. It indicates that all modified lines in the L1 data cache have been written back to main memory / L2 cache. The write-back special cycle is followed by a flush special cycle, which forces the L2 cache to invalidate its entries after writing back to main memory.

* Flush Acknowledge Special Cycle
This cycle is run in response to the FLUSH# signal pin being asserted. When FLUSH# is asserted, the processor writes back all modified lines in the data cache and invalidates all cache entries in both the code and data caches. This cycle notifies the external logic that all modified lines have been written back and all cache entries invalidated.

Interrupt Acknowledge Bus Cycle
Unlike the cycles discussed earlier, interrupt acknowledge cycles have a unique cycle type generated on the cycle type pins. The Pentium processor generates two back-to-back interrupt acknowledge bus cycles when an interrupt request (INTR) is recognized.
It is the response given by the Pentium to a maskable interrupt request generated on the INTR pin if interrupts are enabled. The processor uses this locked pair of interrupt acknowledge cycles to communicate with the interrupt controller(s). It is the system designer's responsibility to insert wait states if required to meet the specified data setup and hold time requirements of the interrupt controllers.

Comparison of Pentium family pipeline: MAY08
As the figure below shows, the NetBurst pipeline is twice as deep as that of the P6, which in turn had twice the depth of the P5's. Increasing pipeline depth increases logic complexity and branch penalties, but it also allows clock speeds to increase.

Figure: Comparison of Pentium Family Pipeline

* Trace Cache next Instruction Pointer: The trace cache fetch logic gets a pointer to the next instruction in the trace cache. Trace cache is Intel's name for putting the L1 cache inside of the first functional unit for speed.
* Trace Cache Fetch: Uses the pointer to fetch an instruction from the cache.
* Drive: The two Drive stages shown in the figure represent the time required to move signals across the chip. No other work is done during these stages. NetBurst is the first pipeline with dedicated stages for wire delays. This is apparently necessary for multiple-gigahertz speeds.
* Allocate and Rename: The CPU actually contains more registers than are named in the specification, in order to speed things up and be able to execute operations in a superscalar fashion (which is to say, more than one operation at once). At this time, the CPU associates different physical registers with the names of the registers.
* Queue: Operations are now placed into either the memory queue or the arithmetic (everything else) queue for scheduling.
* Schedule: In a superscalar processor, operations are often executed out of order so that they do not step on each other and so that they are completed as rapidly as possible. The P4 has four queues: Memory, Fast ALU, Slow ALU/General FPU, and Simple FP. All instructions get dumped into one of them for later execution by the appropriate (and linked) functional unit. Operations in each queue are sorted based on the order in which they were submitted, what instructions are waiting on them, and the number of cycles required by an ALU (or FPU, or LSU) to complete the instruction.
* Dispatch: Instructions are moved from the queues to the functional units.
* Register Files: The instructions are now loaded into the functional units for actual execution.
* Execute: The functional units process the instructions in the files along with data in the registers. This is the seventeenth stage! A lot has happened before we got here.
* Flags: There is a status register (sometimes called a flag register) in all CPUs of the x86 family, which is used for conditional jumps. The flags are set in this stage.
* Branch Check: Now, in the nineteenth out of twenty stages, we finally check whether the branch predictor predicted incorrectly and whether we have to discard some operation we have just spent eighteen stages (and a cycle) on.
* In other words, the Pentium 4 is split up into 20 very short pipeline stages. Some of them are so short that they aren't long enough to fit an entire function, and so that function actually takes two stages.
Chopping execution up into so many short stages means that very high clock

[Figure: interrupt acknowledge bus cycles, with at least one idle state between the two cycles; signals shown include the address, CACHE#, W/R#, BRDY# and DATA]

The first and the second interrupt acknowledge cycles are distinguished by the state of address bit 2 (A2, encoded from the byte enables). If A2 = 1, it corresponds to the first interrupt acknowledge cycle, and if A2 = 0, it indicates the second interrupt acknowledge cycle.
1st interrupt acknowledge: A2 = 1, A31-A3 = 0, hence address = 4
2nd interrupt acknowledge: A2 = 0, A31-A3 = 0, hence address = 0
The data is returned by the interrupt controller at the end of both cycles. The data returned during the first cycle is ignored by the processor. During the second cycle, the interrupt vector is returned on the lower 8 bits of the data bus. The Pentium has 256 possible interrupt vectors. Both cycles are separately terminated when the external system returns BRDY#. Wait states can be added by withholding BRDY#. The Pentium processor automatically generates at least one idle clock between the first and second cycle.

Memory subsystem
The processor provides three levels of on-package cache for scalable performance across a variety of workloads. At the first level, the instruction and data caches are split, each 16 Kbytes in size, four-way set-associative, and with a 32-byte line size. The dual-ported data cache has a load latency of two cycles, is write-through, and is physically addressed and tagged.
The L1 caches are effective on moderate-size workloads and act as a first-level filter for capturing the immediate locality of large workloads.

The second cache level is 96 Kbytes in size, is six-way set-associative, and uses a 64-byte line size. The cache can handle two requests per clock via banking. This cache is also the level at which ordering requirements and semaphore operations are implemented. The L2 cache uses a four-state MESI (modified, exclusive, shared, and invalid) protocol for multiprocessor coherence. The cache is unified, allowing it to service both instruction and data side requests from the L1 caches. This approach allows optimal cache use for both instruction-heavy (server) and data-heavy (numeric) workloads. Since floating-point workloads often have large data working sets and are used with compiler optimizations such as data blocking, the L2 cache is the first point of service for floating-point loads. Also, because floating-point performance requires high bandwidth to the register file, the L2 cache can provide four double-precision operands per clock to the floating-point register file, using two parallel floating-point load pair instructions.

The third level of on-package cache is 4 Mbytes in size, uses a 64-byte line size, and is four-way set-associative. It communicates with the processor at core frequency (800 MHz) using a 128-bit bus. This cache serves the large workloads of server and transaction processing applications, and minimizes the cache traffic on the front-side system bus. The L3 cache also implements a MESI protocol for multiprocessor coherence.

NOV04/MAY07: Bus State Transition Diagram
The Pentium processor's bus control state machine operates in various states.
It has six bus states as follows:
Ti: Bus Idle State
T1: Address Phase
T2: Data Phase
T12: Address Phase (new cycle) and Data Phase (current cycle)
T2P: Data Phase (1st cycle, pipelined) and Data Phase (2nd cycle, pipelined)
TD: Dead State
The bus control state machine diagram is shown in the following figure. The state transitions are listed below:

[Figure: bus control state machine diagram with the six bus states and transitions #0 to #14]

and efficiently deliver this information to the hardware. The processor provides a six-wide and 10-stage deep pipeline, running at 800 MHz on a 0.18-micron process. This combines both abundant resources to exploit ILP and high frequency for minimizing the latency of each instruction. The resources consist of four integer units, four multimedia units, two load/store units, three branch units, two extended-precision floating-point units, and two additional single-precision floating-point units (FPUs). The hardware employs dynamic prefetch, branch prediction, non-blocking caches, and a register scoreboard to optimize for compilation time nondeterminism. Three levels of on-package cache minimize overall memory latency. This includes a 4-Mbyte level-3 (L3) cache, accessed at core speed, providing over 12 Gbytes/s of data bandwidth.

The figure below provides the block diagram of the Itanium processor. The 16-Kbyte, four-way set-associative instruction cache is fully pipelined and can deliver 32 bytes of code (two instruction bundles or six instructions) every clock.
The cache is supported by a single-cycle, 64-entry instruction translation look-aside buffer (TLB). The fetched code is fed into a decoupling buffer that can hold eight bundles of code. As a result of this buffer, the machine's front end can continue to fetch instructions into the buffer even when the back end stalls. Conversely, the buffer can continue to feed the back end even when the front end is disrupted by fetch bubbles due to branches or instruction cache misses.

Hierarchy of branch predictors: The processor employs a hierarchy of branch prediction structures to deliver high-accuracy and low-penalty predictions across a wide spectrum of workloads. Note that if a branch misprediction led to a full pipeline flush, there would be nine cycles of pipeline bubbles before the pipeline is full again. This would mean a heavy performance loss.

After instructions are fetched in the front end, they move into the middle pipeline, which disperses instructions, implements the architectural renaming of registers, and delivers operands to the wide parallel hardware. The processor has a total of nine issue ports capable of issuing up to two memory instructions (ports M0 and M1), two integer instructions (ports I0 and I1), two floating-point instructions (ports F0 and F1), and three branch instructions (ports B0, B1, and B2) per clock. The processor's 17 execution units are fed through the M, I, F, and B groups of issue ports.

The processor provides an abundance of execution resources to exploit ILP. The integer execution core includes two memory and two integer ports, with all four ports capable of executing arithmetic, shift-and-add, logical, compare, and most integer SIMD multimedia operations. The memory ports can also perform load and store operations, including loads and stores with post-increment functionality. The integer ports add the ability to perform the less common integer instructions, such as test bit, look for zero byte, and variable shift.
Additional uncommon instructions are also implemented on only the first integer port.

For data speculation, the software issues an advanced load instruction. When the hardware encounters an advanced load, it places the address, size, and destination register of the load into the ALAT structure. The ALAT then observes all subsequent explicit store instructions, checking for overlaps with the valid advanced load addresses present in the ALAT. In the common case, there is no match, the ALAT state is unchanged, and the advanced load result is used normally. In the case of an overlap, all address-matching advanced loads in the ALAT are invalidated.

Note that once NA# is sampled asserted, the Pentium processor latches it and will pipeline a cycle when one becomes pending, even if NA# is subsequently deasserted.
#0 No request pending.
#1 The Pentium processor starts a new bus cycle by driving the address and bus cycle definition signals and asserting ADS# in the T1 state.
#2 The Pentium processor always moves from T1 to T2 to process the data transfer. The only exception is the assertion of BOFF#: the cycle is then terminated, and it is restarted once BOFF# is released.
#3 The Pentium stays in the T2 state (adding wait states) until the transfer is over (BRDY# sampled asserted), if no new request becomes pending or if NA# is not asserted.
#4 If the current transfer is complete, the processor begins from the T1 state again if there is a new request pending and NA# is sampled asserted. If NA# is not asserted, the processor enters the Ti state from the T2 state.
#5 If the current transfer is over and no other bus cycle is pending or NA# is not asserted, the processor goes to the idle state Ti.
#6 If, before the current cycle ends, NA# is found asserted and another cycle becomes pending, the processor has two outstanding cycles. ADS# is asserted for the second cycle.
#7 When the current cycle completes in the T12 state and no dead clock is needed, the processor moves back from T12 to the T2 state.
#8 When the processor finishes the current bus cycle but a dead clock is needed, it goes to the TD state. (This happens for consecutive read and write cycles or write and read cycles.)
#9 When the current bus cycle is still running and BOFF# is not asserted, the processor always transitions from T12 to the T2P state to process the data transfer. Both the current and the next cycle execute their data phase in state T2P.
#10 The processor stays in T2P until the first cycle's transfer is over.
#11 The bus state goes from T2P to T2 when the processor finishes the first transfer and no dead clock is needed (no consecutive read-write or write-read operation).
#12 The processor enters the TD state from T2P if the first transfer is complete but a dead clock is required.
#13 The bus state returns from TD to T12 if NA# is sampled asserted and there is a new request pending.
#14 If there is no new request pending or NA# is not asserted, the processor transitions from TD to the T2 state.

Summary of Pentium Family of Processors: (Also useful for VIVA Exam)

Processor Name (code name) | Pentium Processor (P5/P54C) | Pentium Pro Processor (P6) | Pentium II Processor (Klamath/Deschutes) | Pentium III Processor (Katmai/Coppermine) | Pentium 4 Processor (Willamette/Northwood)
Introduced | 03/22/93 | 11/01/95 | 05/07/97 | 02/26/99 | 11/2000
Operations Per Clock Cycle | 2 | 3 | 3 | 3 | 6
Max Clock Speed | 60MHz system bus: 150MHz; 66MHz system bus: 200MHz | 60MHz system bus: 180MHz; 66MHz system bus: 200MHz | 66MHz system bus: 333MHz; 100MHz system bus: 450MHz | 100MHz system bus: 1.0GHz; 133MHz system bus: 1.4GHz | 400MHz system bus: 2.40GHz; 533MHz system bus: 2.53GHz
Bus Frequency | 60MHz, 66MHz | 60MHz, 66MHz | 66MHz, 100MHz | 100MHz, 133MHz | 400MHz (100*4), 533MHz (133*4)
Number of Transistors | 3,100,000 (0.8 micron) | 5,500,000 (0.35 micron) | 7,500,000 (0.35 micron) | 24,000,000 (0.13 micron) | 42,000,000 (0.13 micron)
L1 Cache | 16KB (8KB Code + 8KB Data) | 16KB (8KB Code + 8KB Data) | 32KB (16KB Code + 16KB Data) | 32KB (16KB Code + 16KB Data) | 12K µop + 8KB Data
L2 Cache | (off-chip, not specified) | 1MB (on-chip) | 512KB (off-chip) | 512KB (on-chip) | 512KB (on-chip)
Addressable Memory | 4GB | 64GB | 64GB | 64GB | 64GB
Integer Pipelines | 2 | 2 | 2 | 2 | 4
Floating Point Pipelines | 1 | 1 | 1 | 1 | 2
Brief Description | Superscalar architecture | Intel's first true server/workstation chip | Dual independent bus, dynamic execution, Intel MMX technology | Data Prefetch Logic, Level 2 Advanced Transfer Cache | Capable of delivering 4.2GB of data per second into and out of the processor

3.8 Intel Itanium Processor: DEC04/MAY05 (Please refer class notes for EQ answer)
The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms. The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information

Burst Bus Cycles
In case of an L1 data cache miss, the required line has to be brought in from external memory or the L2 cache. This is done as a cache line fill operation in a burst cycle.
Burst bus cycles are made up of four consecutive bus cycles, one quadword each (4 x 8 bytes = one 32-byte cache line). The processor outputs the start address for only the first required quadword. The memory subsystem should latch this address and compute the addresses for the other three quadwords in the burst sequence. The fastest burst cycle possible requires 2 clocks for the first data item to be returned/driven, with the subsequent data items returned/driven in every clock. Thus a fast burst cycle takes 5 clocks to complete (2 + 1 + 1 + 1).
When a cache line fill operation has to be done for the L1 cache, the L2 cache is accessed first. If it is an L2 cache hit, we have a burst cycle with no wait states, as explained later. If it is an L2 cache miss, the slow DRAM has to be accessed, which results in a slow burst cycle, as explained later.

Burst Read Cycle
We consider a U pipeline request resulting in an L1 data cache miss and an L2 hit.
* The bus cycle begins with T1, when the processor outputs the address and bus cycle definition signals on A31-A3, BE0#-BE7#, M/IO# and D/C#.
* ADS# is asserted as the address is driven, indicating that the address and bus cycle definition currently on the bus are valid.
* CACHE# is driven low to indicate that the processor wishes to perform a burst line fill. W/R# driven low indicates that this is a burst read cycle.
* On receiving the address, the L2 cache finds a cache hit, which also implies that the address is cacheable. Hence the L2 cache asserts KEN# on the clock in which it first returns BRDY# asserted. (KEN# is sampled only once during a cycle to determine cacheability.)
* Since the processor samples BRDY# asserted, it reads the first quadword that the L2 cache has placed on the data bus.
* The processor also samples the WB/WT# signal and finds it low, indicating that the line should be placed in the L1 cache in the 'S' state, i.e. under a write-through policy.
* Since this is a burst cycle, the L2 cache computes the additional addresses as described before and then supplies the corresponding data. It asserts BRDY# to indicate valid data on the data bus.
* When the first quadword is received by the processor, it sends the requested data to the U pipe immediately and also stores the quadword in the 32-byte line-fill buffer. Each of the remaining quadwords is placed in the line-fill buffer as it is received. When the entire 32-byte line is in the buffer, the whole cache line is written into cache memory and the cache directory is updated.

The NetBurst microarchitecture adds further improvements to the execution units over that of the P6 microarchitecture. For example, the arithmetic logic units operate twice as fast as those of previous microarchitectures.
As with the previous implementations, the retirement section receives the results of the executed µops from the execution core and processes the results so that the proper architectural state is updated according to the original program order. For semantically correct execution, the results of IA-32 instructions must be committed in original program order before they are retired. Exceptions may be raised as instructions are retired. Thus, exceptions cannot occur speculatively; they occur in the correct order, and the machine can be correctly restarted after an exception.
When a µop completes and writes its result to the destination, it is retired. Up to three µops may be retired per cycle. Again, the reorder buffer (ROB) is the unit in the processor that buffers completed µops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB) to update branch history.
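The in-order retirement rule described above can be illustrated with a small model. This is only a sketch under stated assumptions, not Intel's implementation: the reorder buffer is modelled as a Python deque of [name, completed] entries in program order, and the names retire_cycle and RETIRE_WIDTH are hypothetical. It shows that retirement stops at the oldest µop that has not completed, and that at most three µops leave the buffer per cycle.

```python
from collections import deque

RETIRE_WIDTH = 3  # from the notes: up to three µops may be retired per cycle


def retire_cycle(rob):
    """Retire completed µops from the head of the buffer, oldest first.

    rob is a deque of [name, completed] entries kept in original program
    order (a hypothetical model of the reorder buffer). Retirement stops at
    the first µop that has not completed, even if younger ones are done.
    """
    retired = []
    while rob and len(retired) < RETIRE_WIDTH and rob[0][1]:
        retired.append(rob.popleft()[0])
    return retired


# u1 is still executing, so u2..u4 must wait even though they are complete.
rob = deque([["u0", True], ["u1", False], ["u2", True], ["u3", True], ["u4", True]])
cycle1 = retire_cycle(rob)   # only u0 can retire
rob[0][1] = True             # u1 completes
cycle2 = retire_cycle(rob)   # u1, u2, u3 retire (limit of three per cycle)
cycle3 = retire_cycle(rob)   # u4 retires
print(cycle1, cycle2, cycle3)
```

Because only the head of the buffer is ever examined, results always become architecturally visible in program order, which is what allows exceptions to be reported in the correct order.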
3.7 The Intel Pentium M Processor Family
The Intel Pentium M processor family is designed for low power consumption. Its enhanced microarchitecture includes the following features:
* Support for Intel Architecture with Dynamic Execution
* A high-performance, low-power core manufactured using Intel's advanced process technology with copper interconnect
* On-die, primary 32-KByte instruction cache and 32-KByte write-back data cache
* On-die, second-level cache (up to 2 MBytes) with Advanced Transfer Cache Architecture
* Advanced Branch Prediction and Data Prefetch Logic
* Support for MMX Technology, Streaming SIMD instructions, and the SSE2 instruction set
* A 400 MHz, source-synchronous processor system bus
* Advanced power management using Enhanced Intel SpeedStep technology

[Figure - Burst Read Cycle (Basic): timing diagram with signals including A31-A3, BE0#-BE7# and BRDY#]

SSE2 extends Intel's MMX technology to 128 bits and supports packed integer operations. Whereas SIMD integer operations used to be 64 bits wide, these new instructions double the SIMD integer bandwidth over SSE/MMX technology. This accelerates a broad range of applications, including video, speech, and image and photo processing. The new 64-bit adds/subtracts and 32x32 unsigned multiply provide significant enhancements to encryption operations as well. In the future, encryption/decryption capabilities will be more important in driving a secure e-Business infrastructure for the connected world. The 128-bit SIMD double-precision floating point delivers the capability to execute two 64-bit double-precision floating-point operations at once, doubling the performance capability.
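The idea of a 128-bit value carrying two 64-bit doubles that are processed together can be sketched in plain Python. This is an illustration of the data layout only, not of how the processor executes the operation; the helper names pack_doubles and add_packed_doubles are hypothetical, and the lane-wise add is the kind of work a packed double-precision add instruction performs in a single operation.

```python
import struct


def pack_doubles(a, b):
    """Pack two 64-bit doubles into one 16-byte (128-bit) value."""
    return struct.pack("<2d", a, b)


def add_packed_doubles(x, y):
    """Lane-wise add of two 128-bit values, each holding two doubles."""
    x0, x1 = struct.unpack("<2d", x)
    y0, y1 = struct.unpack("<2d", y)
    return struct.pack("<2d", x0 + y0, x1 + y1)


a = pack_doubles(1.5, 2.0)
b = pack_doubles(0.25, 4.0)
result = struct.unpack("<2d", add_packed_doubles(a, b))
register_bits = len(a) * 8   # 16 bytes -> 128 bits
print(result, register_bits)
```

Both lanes are produced by one call here, which mirrors the claim in the notes that two 64-bit double-precision operations are carried out at once.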
In addition, SSE2 offers a full set of SIMD double-precision floating-point operations, and additional operations that convert between double and single precision. This double-precision floating-point support accelerates content creation, financial, engineering, and scientific applications.

Balanced Platform Solution
As part of a complete platform solution, the Intel Pentium 4 processor was designed in tandem with the Intel 850 chipset to create a powerful new platform for high-performance users. The 400-MHz system bus in the processor is balanced by dual RDRAM memory channels in the 850 chipset that operate in lock-step to deliver 3.2 GB/s of memory bandwidth. Coupled with more efficient protocols and the 400-MHz system bus, the Intel Pentium 4 processor and Intel 850 chipset deliver three times the bandwidth of platforms based on high-performance Intel Pentium III processors. The increased bandwidth enables faster memory accesses, which increase performance on any application requiring intensive memory access, such as many 3D and video applications.

More on Intel NetBurst MicroArchitecture:
Figure 3.6 is an overview of the Intel NetBurst microarchitecture. The pipeline of this microarchitecture is made up of three sections: (1) the front end pipeline, (2) the out-of-order execution core, and (3) the retirement unit.
The concept behind the Intel NetBurst microarchitecture (Pentium 4 processor, Intel Xeon processor) was to improve the throughput, improve the efficiency of the out-of-order execution engine, and create a processor that can reach much higher frequencies with higher performance relative to the P5 and P6 microarchitectures, while maintaining backward compatibility.
The Intel NetBurst microarchitecture addressed some of the common problems found in high-speed, pipelined microprocessors.
Limiting factors for processor performance were delays from prefetch and decoding of instructions into µops, the efficiency of the branch prediction algorithm, and cache misses. The execution trace cache addresses these problems by storing decoded IA-32 instructions. Instructions are fetched and decoded by a translation engine, which builds the decoded instructions into sequences of µops called traces, which are then stored in the trace cache. The execution trace cache stores these µops in the path of predicted program execution flow, where the results of branches in the code are integrated into the same cache line. This increases the instruction flow from the cache and makes better use of the overall cache storage space, since the cache no longer stores instructions that are branched over and never executed. The trace cache delivers up to three µops per clock to the core.
Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch targets are fetched from the execution trace cache if they are cached there; otherwise, they are fetched from the memory hierarchy. The translation engine's branch prediction information is used to form traces along the most likely paths.
The core's ability to execute instructions out of order remains a key factor in enabling parallelism. The processor employs several buffers to smooth the flow of µops. This means that when one portion of the processor pipeline experiences a delay, that delay may be covered by other operations executing in parallel (for example, in the core) or by the execution of µops that were previously queued up in a buffer (for example, in the front end).
2.8 Branch Prediction (Regularly asked EQ)
The Pentium processor includes branch prediction logic, allowing it to avoid pipeline stalls if it correctly predicts whether or not the branch will be taken when the branch instruction is executed. When a branch is correctly predicted, no performance penalty is incurred. However, when the prediction is not correct, a three-cycle penalty is incurred if the branch is executed in the U pipeline, and a four-cycle penalty if the branch is in the V pipeline.
The prediction mechanism is implemented using a four-way, set-associative cache with 256 entries. This is referred to as the branch target buffer, or BTB. The directory entry for each line contains the following information:
* A valid bit that indicates whether or not the entry is in use.
* History bits that track how often the branch has been taken each time it entered the pipeline before.
* The source memory address that the branch instruction was fetched from.
The BTB is a look-aside cache that sits off to the side of the D1 stages of the two pipelines and monitors for branch instructions.

[Figure - Relationship of the D1 pipeline stages and the BTB (prefetcher, BTB hit)]

The first time a branch instruction enters either pipeline, the BTB uses its source memory address to perform a lookup in the cache. Since the instruction has not been seen before, this results in a BTB miss. This essentially means that the branch prediction logic has no history on the instruction. It therefore predicts that the branch will not be taken when the instruction reaches the execution stage of the pipeline, and does not instruct the prefetcher to alter program flow. Even unconditional jumps will be predicted as not taken the first time they are seen by the BTB.
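The lookup behaviour described above can be sketched as a small model. It is a hedged illustration, not the real hardware: the class name BranchTargetBuffer, the use of low-order address bits to select a set, and the oldest-first replacement are all assumptions made for the example. Only the geometry (256 entries, four ways) and the penalty figures come from the notes.

```python
WAYS = 4                 # four-way set-associative (from the notes)
ENTRIES = 256            # total BTB entries (from the notes)
SETS = ENTRIES // WAYS   # 64 sets of four entries each

MISPREDICT_PENALTY = {"U": 3, "V": 4}  # cycles per pipeline, from the notes


class BranchTargetBuffer:
    def __init__(self):
        # One dict per set, mapping source address -> branch target address.
        self.sets = [dict() for _ in range(SETS)]

    def _select_set(self, source):
        # Assumption: low-order address bits pick the set.
        return self.sets[source % SETS]

    def lookup(self, source):
        """Return the recorded target on a hit, or None on a miss."""
        return self._select_set(source).get(source)

    def record_taken(self, source, target):
        """Create or refresh an entry after the branch was actually taken."""
        entries = self._select_set(source)
        if source not in entries and len(entries) == WAYS:
            # Assumption: evict the oldest entry in the set.
            del entries[next(iter(entries))]
        entries[source] = target


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x1000)    # never seen: miss, so predicted not taken
btb.record_taken(0x1000, 0x2000)     # the branch turned out to be taken
second_lookup = btb.lookup(0x1000)   # hit: the target address is available
print(first_lookup, hex(second_lookup))
```

A miss returns None, which corresponds to the default not-taken prediction for a branch with no history, including an unconditional jump seen for the first time.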
When the instruction reaches the execution stage, the branch will either be taken or not taken. If taken, the next instruction to be executed should be the one fetched from the branch target address.

With 3.2 GB/s of system bandwidth, the Intel Pentium 4 processor delivers the highest-bandwidth desktop bus currently in the industry.

Hyper-pipelined Technology
With the Intel Pentium 4 processor, Intel has doubled the pipeline depth to 20 stages, enabling a higher clock frequency. The additional pipeline stages establish a new baseline for processor speed, delivering 1.40 GHz at launch on Intel's 0.18 micron process. This higher core frequency significantly increases processor performance and frequency capability and provides the scalability needed for future applications.

Advanced Dynamic Execution
The Advanced Dynamic Execution engine is a very deep, out-of-order speculative execution engine that keeps the execution units busy executing instructions. It does so by providing a very large window of instructions from which the execution units can choose. The large out-of-order instruction window allows the processor to significantly reduce stalls that can occur while instructions are waiting for dependencies to resolve. One of the more common forms of stall is waiting for data to be loaded from memory on a cache miss. This aspect is very important in high-frequency designs, as the latency to main memory increases relative to the core frequency. The NetBurst microarchitecture can have up to 126 instructions in this window (in flight), versus the P6 microarchitecture's much smaller window of 42 instructions.
The Advanced Dynamic Execution engine also delivers an enhanced branch prediction capability that allows the Pentium 4 processor to be more accurate in predicting program branches. This has the net effect of reducing the number of branch mispredictions by about 33 percent compared with the P6 microarchitecture's branch prediction capability. It does this by implementing a 4-KB branch target buffer that stores more detail on the history of past branches, as well as by implementing a more advanced branch prediction algorithm. This enhanced branch prediction capability is one of the key design elements that reduce the overall sensitivity of the NetBurst microarchitecture to the branch misprediction penalty.

Rapid Execution Engine
The two Arithmetic Logic Units (ALUs) in the Intel Pentium 4 processor run at twice the core frequency of the processor. This makes it possible to execute basic integer instructions (such as add, subtract, logical AND, and logical OR) in half a clock cycle, with higher execution throughput and reduced execution latency. With the 1.40-GHz Intel Pentium 4 processor, each of the ALUs runs at 2.80 GHz, increasing performance on integer-based applications.

Revolutionary Cache Subsystem
In order to increase performance and scalability, the Intel Pentium 4 processor features an innovative new cache subsystem designed to optimize data transfer to the core. An execution trace cache stores 12K decoded instructions (µops) in the order of program flow, instead of predecoded instructions that cannot take code branches into consideration. The execution trace cache removes the decoder from the main instruction loop and results in a higher-performance, more efficient level 1 instruction cache. The Intel Pentium 4 processor also includes a level 2 Advanced Transfer Cache (ATC).
While still only 256 KB in size, this ATC improves the data transfer rate between the on-die level 2 cache and the processor core to 44.8 GB/s at 1.40 GHz, compared with 16 GB/s on a 1-GHz Intel Pentium III processor. The level 2 Advanced Transfer Cache is able to clock 256 bits (32 bytes) of data into and out of the cache on every clock cycle (32 bytes x 1.40 GHz = 44.8 GB/s), unlike previous microarchitectures. The overall gain with the new cache subsystem is that the transfer rate between the cache subsystem and the processor core is improved over previous subsystems in both bandwidth and latency. The enhanced cache subsystem delivers increased performance and responsiveness on a wide variety of applications.

Streaming SIMD Extensions 2 (SSE2)
To make the SIMD instruction set even more powerful, the Intel Pentium 4 processor provides 144 new performance-improving instructions, including 128-bit SIMD double-precision floating point, 128-bit SIMD integer, and improved cache and memory management instructions.

If the branch is not taken, the next instruction executed should be the one fetched from the next sequential memory address after the branch instruction.
When the branch is taken for the first time, the execution unit provides feedback to the branch prediction logic. The branch target address is sent back and recorded in the BTB. A directory entry is made containing the source memory address that the branch instruction was fetched from, and the history bits are set to indicate that the branch has been strongly taken. The history bits can indicate one of four possible states.
1. Strongly Taken
The history bits are initialized to this state when the entry is first made. In addition, if a branch marked weakly taken is taken again, it is upgraded to the strongly taken state.
When a branch marked strongly taken is not taken the next time, it is downgraded to weakly taken.
2. Weakly Taken
A branch marked weakly taken is upgraded to the strongly taken state when it is taken again. When it is not taken, it is downgraded to the weakly not taken state.
3. Weakly Not Taken
If a branch marked weakly not taken is taken again, it is upgraded to the weakly taken state. When a branch marked weakly not taken is not taken the next time, it is downgraded to strongly not taken.
4. Strongly Not Taken
If a branch marked strongly not taken is taken again, it is upgraded to the weakly not taken state. When a branch marked strongly not taken is not taken the next time, it remains in the strongly not taken state.

[Figure - History-bit state diagram showing the movement between states when the branch is taken and when it is not taken. In the D1 stage, a hit on a strongly or weakly taken entry results in a positive prediction (i.e., the branch is predicted taken); a hit on a weakly or strongly not taken entry results in a negative prediction.]

If the branch is predicted taken, the BTB supplies the branch target address back to the prefetcher and indicates that a positive prediction is being made. In response, the prefetcher switches to the opposite prefetch queue and immediately begins to prefetch from memory starting at the branch target address. The instructions fetched are supplied to the instruction pipelines immediately behind the branch instruction.
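The four history-bit states above behave like a two-bit saturating counter, which can be sketched as follows. This is a minimal model of the transitions as the notes describe them, not a description of the actual circuit; the function names and the integer encoding of the states are assumptions made for the example.

```python
# States ordered from "strongly not taken" (0) to "strongly taken" (3).
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


def update_history(state, taken):
    """Move one state up when the branch is taken, one down when it is not.

    The counter saturates: strongly taken stays put when taken again, and
    strongly not taken stays put when not taken again.
    """
    if taken:
        return min(state + 1, STRONGLY_TAKEN)
    return max(state - 1, STRONGLY_NOT_TAKEN)


def predict_taken(state):
    """Assumption: the two 'taken' states give a positive prediction."""
    return state >= WEAKLY_TAKEN


def run(outcomes, state=STRONGLY_TAKEN):
    """Replay actual branch outcomes; a new entry starts as strongly taken."""
    history = []
    for taken in outcomes:
        history.append((predict_taken(state), taken))
        state = update_history(state, taken)
    return history, state


# Each pair is (predicted taken, actually taken).
history, final_state = run([False, False, True, True, True])
mispredictions = sum(1 for predicted, actual in history if predicted != actual)
print(history, final_state, mispredictions)
```

In this run the entry needs two not-taken outcomes in a row before the prediction flips, which is the point of keeping two history bits rather than one: a single unusual outcome does not immediately reverse the prediction.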
