Component Operation: 16.1 Pipeline and Instruction Flow
PF Prefetch
F Fetch (embedded Pentium processor with MMX technology only)
D1 Instruction Decode
D2 Address Generate
EX Execute - ALU and Cache Access
WB Writeback
The embedded Pentium processor is a superscalar machine, built around two general-purpose
integer pipelines and a pipelined floating-point unit, and is capable of executing two instructions in
parallel. Both integer pipelines operate in parallel, allowing integer instructions to execute in a
single clock in each pipeline. Figure 16-1 depicts instruction flow in the embedded Pentium processor.
The pipelines in the embedded Pentium processor are called the “u” and “v” pipes and the process
of issuing two instructions in parallel is termed “pairing.” The u-pipe can execute any instruction in
the Intel architecture, whereas the v-pipe can execute “simple” instructions as defined in “Pairing
Two MMX™ Instructions” on page 16-194. When instructions are paired,
the instruction issued to the v-pipe is always the next sequential instruction after the one issued to
the u-pipe.
[Figure 16-1: instruction pairs i1–i8 advancing one stage per clock through the PF, D1, D2, EX and WB pipeline stages of the embedded Pentium processor, and through the PF, F, D1, D2, EX and WB stages of the embedded Pentium processor with MMX technology.]
The first stage of the pipeline is the Prefetch (PF) stage in which instructions are prefetched from
the on-chip instruction cache or memory. Because the processor has separate caches for
instructions and data, prefetches do not conflict with data references for access to the cache. If the
requested line is not in the code cache, a memory reference is made. In the PF stage, two
independent pairs of line-size (32-byte) prefetch buffers operate in conjunction with the branch
target buffer. This allows one prefetch buffer to prefetch instructions sequentially while the other
prefetches according to the branch target buffer predictions. The prefetch buffers alternate their
prefetch paths. In the embedded Pentium processor with MMX technology, four 16-byte prefetch
buffers operate in conjunction with the BTB to prefetch up to four independent instruction streams.
See “Instruction Prefetch” on page 16-181 for further details on prefetch buffers.
In the embedded Pentium processor with MMX technology only, the next pipeline stage is Fetch
(F), which is used for instruction length decode. It replaces the D1 instruction-length decoder and
eliminates the need for end-bits to determine instruction length. Also, any prefixes are decoded in
the F stage. The Fetch stage is not supported by the embedded Pentium processor (at 100, 133, 166
MHz) or the embedded Pentium processor with VRT.
The embedded Pentium processor with MMX technology also features an instruction FIFO
between the F and D1 stages. This FIFO is transparent; it does not add additional latency when it is
empty. During every clock cycle, two instructions can be pushed into the instruction FIFO
(depending on availability of the code bytes, and on other factors such as prefixes). Instruction
pairs are pulled out of the FIFO into the D1 stage. Since the average rate of instruction execution is
less than two per clock, the FIFO is normally full. As long as the FIFO is full, it can buffer any
stalls that may occur during instruction fetch and parsing. If such a stall occurs, the FIFO prevents
the stall from causing a stall in the execution stage of the pipe. If the FIFO is empty, an execution
stall may result from the pipeline being “starved” for instructions to execute. Stalls at the FIFO
entrance may be caused by long instructions or prefixes, or “extremely misaligned targets” (i.e.,
branch targets that reside in the last bytes of a 16-byte-aligned block).
The pipeline stage after the PF stage in the embedded Pentium processor is Decode1 (D1), in which
two parallel decoders work to decode and issue the next two sequential instructions. The decoders
determine whether one or two instructions can be issued contingent upon the instruction pairing
rules described in “Pairing Two MMX™ Instructions” on page 16-194. The embedded Pentium
processor requires an extra D1 clock to decode instruction prefixes. Prefixes are issued to the u-
pipe at the rate of one per clock without pairing. After all prefixes have been issued, the base
instruction is issued and paired according to the pairing rules. The one exception to this is that the
embedded Pentium processor decodes near conditional jumps (long displacement) in the second
opcode map (0FH prefix) in a single clock in either pipeline. The embedded Pentium processor
with MMX technology handles 0FH as part of the opcode and not as a prefix. Consequently, 0FH
does not take one extra clock to get into the FIFO. Note that in the embedded Pentium processor
with MMX technology, MMX instructions can be paired. This is discussed in “Pairing Two
MMX™ Instructions” on page 16-194.
The D1 stage is followed by Decode2 (D2) in which addresses of memory resident operands are
calculated. In the Intel486™ processor, instructions containing both a displacement and an
immediate or instructions containing a base and index addressing mode require an additional D2
clock to decode. The embedded Pentium processor removes both of these restrictions and is able to
issue instructions in these categories in a single clock.
The embedded Pentium processor uses the Execute (EX) stage of the pipeline for both ALU
operations and for data cache access; therefore, those instructions specifying both an ALU
operation and a data cache access require more than one clock in this stage. In EX, all u-pipe
instructions and all v-pipe instructions except conditional branches are verified for correct branch
prediction. Microcode is designed to utilize both pipelines; therefore, those instructions requiring
microcode execute faster.
The final stage is Writeback (WB), in which instructions are enabled to modify the processor state
and complete execution. In this stage, v-pipe conditional branches are verified for correct branch
prediction.
During their progression through the pipeline, instructions may be stalled due to certain conditions.
Both the u-pipe and v-pipe instructions enter and leave the D1 and D2 stages in unison. When an
instruction in one pipe is stalled, the instruction in the other pipe is also stalled at the same pipeline
stage. Thus both the u-pipe and the v-pipe instructions enter the EX stage in unison. Once in EX, if
the u-pipe instruction is stalled, then the v-pipe instruction (if any) is also stalled. If the v-pipe
instruction is stalled, then the instruction paired with it in the u-pipe is not allowed to advance. No
successive instructions are allowed to enter the EX stage of either pipeline until the instructions in
both pipelines have advanced to WB.
The embedded Pentium processor with MMX technology’s prefetch stage has four 16-byte buffers
that can prefetch up to four independent instruction streams, based on predictions made by the
BTB. In this case, the Branch Target Buffer predicts whether the branch will be taken or not in the
PF stage. The embedded Pentium processor with MMX technology features an enhanced two-stage
branch prediction algorithm compared to the embedded Pentium processor.
For more information on branch prediction, see “Component Introduction” on page 15-175.
Simple instructions are entirely hardwired; they do not require any microcode control and, in
general, execute in one clock. The exceptions are the ALU mem,reg and ALU reg,mem
instructions, which are three- and two-clock operations, respectively. Sequencing hardware is used to
allow them to function as simple instructions. The following integer instructions are considered
simple and may be paired:
• mov reg, reg/mem/imm
• mov mem, reg/imm
• alu reg, reg/mem/imm
• alu mem, reg/imm
• inc reg/mem
• dec reg/mem
• push reg/mem
• pop reg
• lea reg,mem
• jmp/call/jcc near
• nop
• test reg, reg/mem
• test acc, imm
In addition, conditional and unconditional branches may be paired only if they occur as the second
instruction in the pair. They may not be paired with the next sequential instruction. Also,
SHIFT/ROT by 1 and SHIFT by IMM may pair as the first instruction in a pair.
The register dependencies that prohibit instruction pairing include implicit dependencies via
registers or flags not explicitly encoded in the instruction. For example, an ALU instruction in the
u-pipe (which sets the flags) may not be paired with an ADC or an SBB instruction in the v-pipe.
There are two exceptions to this rule. The first is the commonly occurring sequence of compare and
branch, which may be paired. The second exception is pairs of pushes or pops. Although these
instructions have an implicit dependency on the stack pointer, special hardware is included to allow
these common operations to proceed in parallel.
Although two paired instructions generally may proceed in parallel independently, there is an
exception for paired “read-modify-write” instructions. Read-modify-write instructions are ALU
operations with an operand in memory. When two of these instructions are paired, there is a
sequencing delay of two clocks in addition to the three clocks required to execute the individual
instructions.
Although instructions may execute in parallel, their behavior as seen by the programmer is exactly
the same as if they were executed sequentially.
Information regarding pairing of FPU and MMX instructions is discussed in “Floating-Point Unit”
on page 16-185 and “Intel MMX™ Technology Unit” on page 16-189. For additional details on
code optimization, refer to Optimizing for Intel’s 32-Bit Processors (order number 241799).
The processor accesses the BTB with the address of the instruction in the D1 stage. The BTB
contains a branch prediction state machine with four states: (1) strongly not taken, (2) weakly not
taken, (3) weakly taken, and (4) strongly taken. In the event of a correct prediction, a branch executes without
pipeline stalls or flushes. Branches that miss the BTB are assumed to be not taken. Conditional and
unconditional near branches and near calls execute in one clock and may be executed in parallel
with other integer instructions. A mispredicted branch (whether a BTB hit or miss) or a correctly
predicted branch with the wrong target address causes the pipelines to be flushed and the correct
target to be fetched. Incorrectly predicted unconditional branches incur an additional three clock
delay, incorrectly predicted conditional branches in the u-pipe incur an additional three clock delay,
and incorrectly predicted conditional branches in the v-pipe incur an additional four clock delay.
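To make the four-state scheme concrete, here is a minimal C sketch that models a single BTB entry as a two-bit saturating counter; the state names and encoding are illustrative assumptions, not the documented hardware encoding.

    /* Two-bit saturating branch predictor: one state per BTB entry. */
    enum btb_state {
        STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN
    };

    /* Predict taken in the two "taken" states. */
    static int predict_taken(enum btb_state s)
    {
        return s >= WEAKLY_TAKEN;
    }

    /* Saturating update: move one state toward the actual outcome. */
    static enum btb_state update(enum btb_state s, int taken)
    {
        if (taken && s < STRONGLY_TAKEN)
            return s + 1;
        if (!taken && s > STRONGLY_NOT_TAKEN)
            return s - 1;
        return s;
    }

With this model, two mispredictions in a row are needed to flip a “strong” prediction, which keeps a loop-closing branch predicted taken across an occasional exit.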
The benefits of branch prediction are illustrated in the following example. Consider the following
loop from a benchmark program for computing prime numbers:
for (k = i + prime; k <= SIZE; k += prime)
    flags[k] = FALSE;
A popular compiler generates the following assembly code (prime is allocated to ECX, k is
allocated to EDX, and AL contains the value FALSE):
inner_loop:
mov byte ptr flags[edx], al
add edx, ecx
cmp edx, SIZE
jle inner_loop
Each iteration of this loop executes in six clocks on the Intel486™ processor. On the embedded
Pentium processor, the MOV is paired with the ADD; the CMP with the JLE. With branch
prediction, each loop iteration executes in two clocks.
Note: The dynamic branch prediction algorithm speculatively runs code fetch cycles to addresses
corresponding to instructions executed some time in the past. Such code fetch cycles are run based
on past execution history, regardless of whether the instructions retrieved are relevant to the
currently executing instruction sequence.
One effect of the branch prediction mechanism is that the processor may run code fetch bus cycles
to retrieve instructions that are never executed. Although the opcodes retrieved are discarded, the
system must complete the code fetch bus cycle by returning BRDY#. It is particularly important
that the system return BRDY# for all code fetch cycles, regardless of the address.
It should also be noted that upon entering SMM, the branch target buffer (BTB) is not flushed and
thus it is possible to get a speculative prefetch to an address outside of SMRAM address space due
to branch predictions based on code executed prior to entering SMM. If this occurs, the system
must still return BRDY# for each code fetch cycle.
Furthermore, the processor may run speculative code fetch cycles to addresses beyond the end of
the current code segment (approximately 100 bytes past end of last executed instruction). Although
the processor may prefetch beyond the CS limit, it will not attempt to execute beyond the CS limit.
Instead, it will raise a GP fault. Thus, segmentation cannot be used to prevent speculative code
fetches to inaccessible areas of memory. On the other hand, the processor never runs code fetch
cycles to inaccessible pages (i.e., not present pages or pages with incorrect access rights), so the
paging mechanism guards against both the fetch and execution of instructions in inaccessible
pages.
For memory reads and writes, both segmentation and paging prevent the generation of bus cycles
to inaccessible regions of memory. If paging is not used, branch prediction can be disabled by
setting TR12.NBP (bit 0) and flushing the BTB by loading CR3 before disabling any areas of
memory. Branch prediction can be re-enabled after re-enabling memory.
[Figure: example code sequence in which a FAR CALL transfers control into the segment at address C000H before that memory is disabled.]
The branch prediction mechanism of the embedded Pentium processor, however, predicts that the
RET instruction is going to transfer control to the segment at address C000H and performs a
prefetch from that address prior to the OUT instruction that re-enables that memory address. The
result is that no BRDY is returned for that prefetch cycle and the system hangs.
In this case, branch prediction should be disabled (by setting TR12.NBP and flushing the BTB by
loading CR3) prior to disabling memory at address C000H, and re-enabled after the RET
instruction by clearing TR12.NBP as indicated above. (See Chapter 26, “Model Specific Registers
and Functions” for more information on register operation.)
In the embedded Pentium processor with MMX technology, the branch prediction algorithm
changes from the embedded Pentium processor in the following ways:
• BTB Lookup is done when the branch is in the PF stage.
• The BTB Lookup tag is the Prefetch address.
• A Lookup in the BTB performs a search spanning sixteen consecutive bytes.
• BTB can contain four branch instructions for each line of 16 bytes.
• BTB is constructed from four independent Banks. Each Bank contains 64 entries and is 4-way
associative.
• Enhanced two-stage branch prediction algorithm.
For information on code optimization, please refer to Optimizing for Intel’s 32-Bit Processors
(order number 241799).
PF Prefetch
F Fetch (applicable to the embedded Pentium processor with MMX technology only)
D1 Instruction decode
D2 Address generation
EX Memory and register read; conversion of FP data to external memory format and memory
write
X1 Floating-Point Execute stage one; conversion of external memory format to internal FP
data format and write operand to FP register file; bypass 1 (bypass 1 is described in “FPU
Bypasses” on page 16-188)
X2 Floating-Point Execute stage two
WF Perform rounding and write floating-point result to register file; bypass 2 (bypass 2 is
described in “FPU Bypasses” on page 16-188)
ER Error Reporting/Update Status Word
The embedded Pentium processor stack architecture instruction set requires that all instructions
have one source operand on the top of the stack. Since most instructions also have their destination
as the top of the stack, most instructions see a “top of stack bottleneck.” New source operands must
be brought to the top of the stack before an arithmetic instruction can be issued on them. This calls
for extra usage of the exchange instruction, which allows the programmer to bring an available
operand to the top of the stack. The processor FPU uses pointers to access its registers to allow fast
execution of exchanges and the execution of exchanges in parallel with other floating-point
instructions. An FP exchange that is paired with other FP instructions takes zero clocks for its
execution. Because such exchanges can be executed in parallel, it is recommended that one use
them when necessary to overcome the stack bottleneck.
Note that when exchanges are paired with other floating-point instructions, they should not be
followed immediately by integer instructions. The processor stalls such integer instructions for a
clock if the FP pair is declared safe, or for four clocks if the FP pair is unsafe.
Also note that the FP exchange must always follow another FP instruction to get paired. The
pairing mechanism does not allow the FP exchange to be the first instruction of a pair that is issued
in parallel. If an FP exchange is not paired, it takes one clock for its execution.
For normal data, the rules used on the embedded Pentium processor for declaring an instruction
safe are as follows.
Note that arithmetic overflow of the double precision format occurs when the unbiased exponent of
the result is ≥ 400H, and underflow occurs when the exponent is ≤ −3FFH. Hence, the SIR (safe
instruction recognition) algorithm on the embedded Pentium processor allows improved throughput
on a much greater range of numbers than that spanned by the double precision format.
With bypass 1, the result of a floating-point load (that writes to the register file in the X1 stage) can
bypass the X1 stage write and be sent directly to the operand fetch stage or E stage of the next
instruction.
With bypass 2, the result of any arithmetic operation can bypass the WF stage write to the register
file, and be sent directly to the desired execution unit as an operand for the next instruction.
Note that the FST instruction reads the register file with a different timing requirement, so that for
the FST instruction, which attempts to read an operand in the E stage:
1. There is no bypass between the X1 stage write port and the E stage read port, i.e., no added
bypass for FLD followed by FST. Thus FLD (double) followed by FST (double) takes four clocks
(two for FLD, and two for FST).
2. There is no bypass between the WF stage write port and the E stage read port. The E stage read for
the FST happens only in the clock following the WF write for any preceding arithmetic
operation.
Furthermore, there is no memory bypass for an FST followed by an FLD from the same memory
location.
Note that all FP instructions update the status word only in the ER stage. Hence there is a built-in
status word interlock between FP instruction1 and the FSTSW AX instruction. The above piece of
code takes nine clocks before execution of code begins at the target of the jump. These nine clocks
are counted as:
Note that if there is a branch mispredict, there is a minimum of three clocks added to the clock
count of nine.
It is recommended that such attempts to branch upon numeric condition codes be preceded by
integer instructions; i.e., you should insert integer instructions in between FP instruction1 and the
FSTSW AX instruction that is the first instruction of the “numeric test and branch” construct. This
allows the elimination of up to four clocks (the 4 E-stage stalls on FSTSW AX) from the cost
attributed to this construct, so that numeric branching can be accomplished in five clocks.
MMX technology defines a simple and flexible software model, with no new mode or operating-
system visible state. All existing software runs correctly, without modification, on Intel
architecture processors that incorporate MMX technology, even in the presence of existing and
new applications that incorporate this technology.
The following sections of this chapter describe the basic programming environment for the
technology, the MMX technology register set, data types and instruction set. Detailed descriptions
of the MMX instructions are provided in Chapter 3 of the Intel Architecture Software Developer’s
Manual, Volume 2. The manner in which the MMX technology extensions fit into the Intel
architecture system programming model is described in Chapter 10 of the Intel Architecture
Software Developer’s Manual, Volume 3.
[Figure: MMX™ technology register set, eight 64-bit registers MM0 through MM7.]
Although the MMX registers are defined in the Intel architecture as separate registers, they are
aliased to the registers in the FPU data register stack (R0 through R7). (See Chapter 10 in the Intel
Architecture Software Developer’s Manual, Volume 3, for a more detailed discussion of MMX
technology register aliasing.)
The bytes in the packed bytes data type are numbered 0 through 7. Byte 0 is contained in the least
significant bits of the data type (bits 0 through 7) and byte 7 is contained in the most significant bits
(bits 56 through 63). The words in the packed words data type are numbered 0 through 3. Word 0 is
contained in bits 0 through 15 of the data type and word 3 is contained in bits 48 through 63.
The doublewords in a packed doublewords data type are numbered 0 through 1. Doubleword 0 is
contained in bits 0 through 31 and doubleword 1 is contained in bits 32 through 63.
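This numbering maps directly onto bit shifts, as the following C sketch shows (the helper names are ours, for illustration):

    #include <stdint.h>

    /* Byte n occupies bits 8n..8n+7 of the 64-bit quantity, n = 0..7. */
    static uint8_t packed_byte(uint64_t q, unsigned n)
    {
        return (uint8_t)(q >> (8 * n));
    }

    /* Word n occupies bits 16n..16n+15 of the quantity, n = 0..3. */
    static uint16_t packed_word(uint64_t q, unsigned n)
    {
        return (uint16_t)(q >> (16 * n));
    }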
The MMX instructions move the packed data types (packed bytes, packed words or packed
doublewords) and the quadword data type to-and-from memory or to-and-from the Intel
architecture general-purpose registers in 64-bit blocks. However, when performing arithmetic or
logical operations on the packed data types, the MMX instructions operate in parallel on the
individual bytes, words or doublewords contained in a 64-bit MMX register.
When operating on the bytes, words and doublewords within packed data types, the MMX
instructions recognize and operate on both signed and unsigned byte integers, word integers and
doubleword integers.
The SIMD execution model supported in the MMX technology directly addresses the needs of
modern media, communications and graphics applications, which often use sophisticated
algorithms that perform the same operations on a large number of small data types (bytes, words
and doublewords). For example, most audio data is represented in 16-bit (word) quantities. The
MMX instructions can operate on four of these words simultaneously with one instruction. Video
and graphics information is commonly represented as palettized 8-bit (byte) quantities. Here, one
MMX instruction can operate on eight of these bytes simultaneously.
The 64-bit access mode is used for 64-bit memory access, 64-bit transfer between registers, all
pack, logical and arithmetic instructions, and some unpack instructions.
The 32-bit access mode is used for 32-bit memory access, 32-bit transfer between integer registers
and MMX technology registers, and some unpack instructions.
These instructions provide a rich set of operations that can be performed in parallel on the bytes,
words or doublewords of an MMX packed data type.
When operating on the MMX packed data types, the data within a data type is cast to the type
specified by the instruction. For example, the PADDB (add packed bytes) instruction adds two
groups of eight packed bytes. The PADDW (add packed words) instruction, which adds packed
words, can operate on the same 64 bits as the PADDB instruction, treating the 64 bits as four 16-bit
words.
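The following scalar C sketch illustrates that casting behavior for the wraparound (non-saturating) forms of the two instructions: the same 64 bits are split into either eight byte lanes or four word lanes, and each lane is added independently.

    #include <stdint.h>

    /* Scalar model of PADDB: eight independent byte additions. */
    static uint64_t paddb_model(uint64_t a, uint64_t b)
    {
        uint64_t r = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t lane = (uint8_t)((a >> (8 * i)) + (b >> (8 * i)));
            r |= (uint64_t)lane << (8 * i);
        }
        return r;
    }

    /* Scalar model of PADDW: the same 64 bits as four word additions. */
    static uint64_t paddw_model(uint64_t a, uint64_t b)
    {
        uint64_t r = 0;
        for (int i = 0; i < 4; i++) {
            uint16_t lane = (uint16_t)((a >> (16 * i)) + (b >> (16 * i)));
            r |= (uint64_t)lane << (16 * i);
        }
        return r;
    }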
The embedded Pentium processor with MMX technology adds an additional fetch stage to the
pipeline. The instruction bytes are prefetched from the code cache in the prefetch (PF) stage, and
they are parsed into instructions (and prefixes) in the fetch (F) stage. Additionally, any prefixes are
decoded in the F stage.
When instructions execute in the two pipes, their behavior is exactly the same as if they were
executed sequentially. When a stall occurs, successive instructions are not allowed to pass the
stalled instruction in either pipe. Figure 16-6 shows the pipelining structure for this scheme.
Figure 16-6. MMX™ Technology Pipeline Structure
[Figure: the pipeline stages PF, F, D1, D2, EX and WB, with the MMX instruction pipeline integrated in the integer pipeline; MMX instructions use the execution stages EX1, EX2 and EX3, while the integer pipeline uses the single EX stage.]
Instruction parsing is decoupled from the instruction decoding by means of an instruction FIFO,
which is situated between the F and D1 (Decode 1) stages. The FIFO has slots for up to four
instructions. This FIFO is transparent; it does not add additional latency when it is empty.
Every clock cycle, two instructions can be pushed into the instruction FIFO (depending on the
availability of the code bytes, and on other factors such as prefixes). Instruction pairs are pulled out
of the FIFO into the D1 stage. Since the average rate of instruction execution is less than two per
clock, the FIFO is normally full. If the FIFO is full, then the FIFO can buffer a stall that may have
occurred during instruction fetch and parsing. If this occurs, then that stall will not cause a stall in
the execution stage of the pipe. If the FIFO is empty, then an execution stall may result from the
pipeline being “starved” for instructions to execute. Also, if the FIFO contains only one instruction,
then the instruction will not pair. Additionally, if an instruction is longer than 7 bytes, then only one
instruction will be pushed into the FIFO. Figure 16-6 details the MMX pipeline on superscalar
processors and the conditions where a stall may occur in the pipeline.
The data cache fully supports the MESI (modified/exclusive/shared/invalid) cache consistency
protocol. The code cache is inherently write protected to prevent code from being inadvertently
corrupted, and as a consequence supports a subset of the MESI protocol, the S (shared) and I
(invalid) states.
The caches have been designed for maximum flexibility and performance. The data cache is
configurable as writeback or writethrough on a line-by-line basis. Memory areas can be defined as
non-cacheable by software and external hardware. Cache writeback and invalidations can be
initiated by hardware or software. Protocols for cache consistency and line replacement are
implemented in hardware, easing system design.
On the embedded Pentium processor, replacement in both the data and instruction caches is
handled by the LRU mechanism, which requires one bit per set in each of the caches. The
embedded Pentium processor with MMX technology uses a pseudo-LRU replacement algorithm
that requires three bits per set in each of the caches. When a line must be replaced, the cache selects
which of L0:L1 and L2:L3 was least recently used. Then the cache determines which of the two
lines was least recently used and marks it for replacement. This decision tree is shown in
Figure 16-7.
Figure 16-7. Pseudo-LRU Cache Replacement Strategy
[Figure: decision tree. If B0 = 0, L0 or L1 was least recently used and B1 selects between them; if B0 = 1, L2 or L3 was least recently used and B2 selects between them.]
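The decision tree reduces to a few lines of C. In this sketch a bit value of 0 selects the left side of each decision, following the figure; the update rule (repoint the bits away from the line just used) is the conventional pseudo-LRU policy and is our assumption rather than documented behavior.

    /* Select the victim way (0..3) from the three pseudo-LRU bits. */
    static int plru_victim(unsigned b0, unsigned b1, unsigned b2)
    {
        if (b0 == 0)                    /* L0 or L1 least recently used */
            return (b1 == 0) ? 0 : 1;
        else                            /* L2 or L3 least recently used */
            return (b2 == 0) ? 2 : 3;
    }

    /* On an access to `way`, repoint the bits away from it. */
    static void plru_touch(int way, unsigned *b0, unsigned *b1, unsigned *b2)
    {
        if (way <= 1) { *b0 = 1; *b1 = (way == 0); }
        else          { *b0 = 0; *b2 = (way == 2); }
    }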
The data cache consists of eight banks interleaved on 4-byte boundaries. The data cache can be
accessed simultaneously from both pipes, as long as the references are to different cache banks. A
conceptual diagram of the organization of the data and code caches is shown in Figure 16-8. The
data cache supports the MESI writeback cache consistency protocol, which requires two state bits,
while the code cache supports the S and I state only and therefore requires only one state bit.
Figure 16-8. Conceptual Organization of Code and Data Caches
[Figure: each data cache set holds two ways (WAY 0 and WAY 1), each with a tag address and MESI state bits, plus an LRU bit per set; each code cache set holds two ways, each with a tag address and a single state bit (S or I), plus an LRU bit per set.]
The storage array in the data cache is single ported but interleaved on 4-byte boundaries to be able
to provide data for two simultaneous accesses to the same cache line.
Each cache is parity protected. In the instruction cache, there are parity bits on a quarter-line
basis and there is one parity bit for each tag. The data cache contains one parity bit for each tag
and a parity bit per byte of data.
Each cache is accessed with physical addresses, and each cache has its own TLB
(translation lookaside buffer) to translate linear addresses to physical addresses. The TLBs
associated with the instruction cache are single-ported whereas the data cache TLBs are fully dual-
ported to be able to translate two independent linear addresses for two data references
simultaneously. The tag and data arrays of the TLBs are parity protected with a parity bit associated
with each of the tag and data entries in the TLBs.
The data cache of the embedded Pentium processor has a 4-way set associative, 64-entry TLB for
4-Kbyte pages and a separate 4-way set associative, 8-entry TLB to support 4-Mbyte pages. The
code cache has one 4-way set associative, 32-entry TLB for 4-Kbyte pages and 4-Mbyte pages,
which are cached in 4-Kbyte increments. Replacement in the TLBs is handled by a pseudo-LRU
mechanism (similar to the Intel486 processor) that requires 3 bits per set. The embedded Pentium
processor with MMX technology has a 64-entry fully associative data TLB and a 32-entry fully
associative code TLB. Both TLBs can support 4-Kbyte pages as well as 4-Mbyte pages.
When the L1 cache is disabled (CR0.NW and CR0.CD bits are both set to ‘1’) external snoops are
accepted in a DP system and inhibited in a UP system. Note that when snoops are inhibited,
address parity is not checked, and APCHK# will not be asserted for a corrupt address. When
snoops are accepted, address parity is checked (and APCHK# will be asserted for corrupt
addresses).
To completely disable the cache, the following two steps must be performed:
1. CD and NW must be set to 1.
2. The caches must be flushed.
If the cache is not flushed, cache hits on reads will still occur and data will be read from the cache.
In addition, the cache must be flushed after being disabled to prevent any inconsistencies with
memory.
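A minimal ring-0 sketch of that two-step sequence, using GCC inline assembly for 32-bit x86 (CR0.CD is bit 30 and CR0.NW is bit 29); this is illustrative only and must run at privilege level 0.

    /* Disable the L1 caches: set CR0.CD and CR0.NW, then flush. */
    static void disable_caches(void)
    {
        unsigned long cr0;

        __asm__ __volatile__("mov %%cr0, %0" : "=r"(cr0));
        cr0 |= (1UL << 30) | (1UL << 29);              /* CD = 1, NW = 1 */
        __asm__ __volatile__("mov %0, %%cr0" : : "r"(cr0));
        __asm__ __volatile__("wbinvd" : : : "memory"); /* write back modified
                                                          lines, invalidate */
    }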
The PCD bit controls cacheability on a page-by-page basis. The PCD bit is internally ANDed with
the KEN# signal to control cacheability on a cycle-by-cycle basis. PCD = 0 enables caching,
while PCD = 1 disables it. Cache linefills are enabled when PCD = 0 and KEN# = 0.
The value driven on PWT is a function of the PWT bits in CR3, the page directory entry and the
page table entry, and the PG bit in CR0 (CR0.CD does not affect PWT).
CR0.CD = 1
If caching is disabled, the PCD pin is always driven high. CR0.CD does not affect the PWT pin.
CR0.PG = 0
If paging is disabled, the PWT pin is forced low and the PCD pin reflects CR0.CD. The PCD
and PWT bits in CR3 are assumed to be 0 during the caching process.
The PCD and PWT bits from the last entry (either the PDE or the PTE, depending on whether
4-Mbyte or 4-Kbyte pages are in use) are cached in the TLB and are driven anytime the page
mapped by the TLB entry is referenced.
During TLB refresh cycles when the PDE and PTE entries are read, the PWT and PCD bits are
obtained as shown in Table 16-3 and Table 16-4.
Table 16-3. Source of PCD and PWT During TLB Refresh (4-Kbyte pages)
PCD/PWT taken from:    Cycle:
CR3                    PDE read
PDE                    PTE read
PTE                    All other paged memory references

Table 16-4. Source of PCD and PWT During TLB Refresh (4-Mbyte pages)
PCD/PWT taken from:    Cycle:
CR3                    PDE read
PDE                    All other paged memory references
[Figure: linear address translation. The 10-bit directory field indexes the page directory (located via CR3), the 10-bit table field indexes the optional page table, and the 12-bit offset selects the byte within the page; PCD and PWT are picked up from CR3, the page directory entry, and the page table entry at each step.]
[Figure: PCD, PWT and CACHE# generation logic. The PCD output is gated by CR0.CD (cache disable) and CR0.PG (paging enable); PWT is driven from the paging-structure PWT bits and CR3.PWT; TR12.3 cache inhibit, unlocked memory reads and writeback cycles factor into CACHE#; WB/WT# enables the cache transition to E-state.]
Inquire cycles are driven to the processor when a bus master other than the processor initiates a
read or write bus cycle. Inquire cycles are driven to the processor when the bus master initiates a
read to determine if the processor data cache contains the latest information. If the snooped line is
in the processor data cache in the modified state, the processor has the most recent information and
must schedule a writeback of the data. Inquire cycles are driven to the processor when the other bus
master initiates a write to determine if the processor code or data cache contains the snooped line
and to invalidate the line if it is present. Inquire cycles are described in detail in Chapter 19, “Bus
Functional Description.”
Flushing the cache through hardware is accomplished by driving the FLUSH# pin low. This causes
the cache to write back all modified lines in the data cache and mark the state bits for both caches
invalid. The Flush Acknowledge special cycle is driven by the processor when all writebacks and
invalidations are complete.
The INVD and WBINVD instructions cause the on-chip caches to be invalidated also. WBINVD
causes the modified lines in the internal data cache to be written back, and all lines in both caches
to be marked invalid. After execution of the WBINVD instruction, the Writeback and Flush special
cycles are driven to indicate to any external cache that it should write back and invalidate its
contents.
INVD causes all lines in both caches to be invalidated. Modified lines in the data cache are not
written back. The Flush special cycle is driven after the INVD instruction is executed to indicate to
any external cache that it should invalidate its contents. Care should be taken when using the INVD
instruction that cache consistency problems are not created.
Note that the implementation of the INVD and WBINVD instructions are processor dependent.
Future processor generations may implement these instructions differently.
Every line in the data cache is assigned a state dependent on both processor generated activities and
activities generated by other bus masters (snooping). The embedded Pentium processor Data Cache
Protocol consists of four states that define whether a line is valid (HIT/MISS), if it is available in
other caches, and if it has been MODIFIED. The four states are the M (Modified), E (Exclusive), S
(Shared) and the I (Invalid) states and the protocol is referred to as the MESI protocol. A definition
of the states is given below:
M - Modified: An M-state line is available in only one cache and it is also MODIFIED
(different from main memory). An M-state line can be accessed (read/written to)
without sending a cycle out on the bus.
E - Exclusive: An E-state line is also available in only one cache in the system, but the line is
not MODIFIED (i.e., it is the same as main memory). An E-state line can be
accessed (read/written to) without generating a bus cycle. A write to an E-state
line causes the line to become MODIFIED.
S - Shared: This state indicates that the line is potentially shared with other caches (i.e., the
same line may exist in more than one cache). A read to an S-state line does not
generate bus activity, but a write to a SHARED line generates a write-through
cycle on the bus. The write-through cycle may invalidate this line in other
caches. A write to an S-state line updates the cache.
I - Invalid: This state indicates that the line is not available in the cache. A read to this line
will be a MISS and may cause the processor to execute a LINE FILL (fetch the
whole line into the cache from main memory). A write to an INVALID line
causes the processor to execute a write-through cycle on the bus.
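From the processor's side, the write behavior of the four states can be summarized in a small C sketch; the bus action is reduced to a comment, and the S-state outcome additionally depends on PWT and WB/WT#, as described below.

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* Next state of a data cache line on a processor write. */
    static mesi_t on_processor_write(mesi_t s)
    {
        switch (s) {
        case MODIFIED:  return MODIFIED;  /* serviced internally, no bus cycle */
        case EXCLUSIVE: return MODIFIED;  /* no bus cycle; line becomes dirty */
        case SHARED:    return SHARED;    /* write-through cycle on the bus;
                                             may move to E depending on PWT
                                             and WB/WT# */
        default:        return INVALID;   /* write-through cycle; the line is
                                             not allocated on a write */
        }
    }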
Table 16-5. Data Cache State Transitions for UNLOCKED Processor Initiated Read Cycles†
[Table: columns Present State, Pin Activity, Next State and Description.]
Note the transition from I to E or S states (based on WB/WT#) happens only if KEN# is sampled
low with the first of BRDY# or NA#, and the cycle is transformed into a LINE FILL cycle. If
KEN# is sampled high, the line is not cached and remains in the I state.
A write to a SHARED line in the data cache generates a write cycle on the processor bus to update
memory and/or invalidate the contents of other caches. If the PWT pin is driven high when the
write cycle is run on the bus, the line is updated and stays in the S-state regardless of the
status of the WB/WT# pin that is sampled with the first BRDY# or NA#. If PWT is driven low, the
status of the WB/WT# pin sampled along with the first BRDY# or NA# for the write cycle
determines which state (E or S) the line transitions to.
The state transition from S to E is the only transition in which the data and the status bits are not
updated at the same time. The data is updated when the write is written to the processor write
buffers. The state transition does not occur until the write has completed on the bus (BRDY# has
been returned). Writes to the line after the transition to the E-state do not generate bus cycles.
However, it is possible that writes to the same line that were buffered or in the pipeline before the
transition to the E-state generate bus cycles after the transition to E-state.
An inactive EWBE# input stalls subsequent writes to an E- or an M-state line. All subsequent
writes to E- or M-state lines are held off until EWBE# is returned active.
Table 16-6. Data Cache State Transitions for Processor Initiated Write Cycles
[Table: columns Present State, Pin Activity, Next State and Description.]
The state transition tables for inquire cycles are given below:
In the case of a read hit, the cycle is serviced internally to the processor and no bus activity is
generated. In the case of a read miss, the read is sent to the external bus and may be converted to a
linefill.
Lines are never overwritten in the code cache. Writes generated by the processor are snooped by
the code cache. If the snoop is a hit in the code cache, the line is invalidated. If there is a miss, the
code cache is not affected.
The embedded Pentium processor with MMX technology has four write buffers that can be used by
either the u-pipe or v-pipe. Posting writes to these buffers enables the pipe to continue advancing
when consecutive writes to memory occur. The writes will be executed on the bus as soon as it is
free, in FIFO order. Reads cannot bypass writes posted in these buffers.
[Figure: on the embedded Pentium processor with MMX technology, the u-pipe and v-pipe (PF, F, D1, D2, EX, WB) share four write buffers to the external bus; on the embedded Pentium processor, each pipe (PF, D1, D2, EX, WB) has one dedicated write buffer to the external bus.]
The embedded Pentium processor supports strong write ordering only. That is, writes generated by
the embedded Pentium processor are driven to the bus or updated in the cache in the order in which
they occur. The embedded Pentium processor does not write to E or M-state lines in the data cache
if there is a write in either write buffer, if a write cycle is running on the bus, or if EWBE# is
inactive.
Note that only memory writes are buffered and I/O writes are not. There is no guarantee of
synchronization between completion of memory writes on the bus and instruction execution after
the write. The OUT instruction or a serializing instruction needs to be executed to synchronize
writes with the next instruction. Refer to “Serializing Operations” on page 16-206 for more
information.
No re-ordering of read cycles occurs on the embedded Pentium processor. Specifically, the write
buffers are flushed before the IN instruction is executed.
Note that if an OUT instruction is used to modify A20M#, this will not affect previously prefetched
instructions. A serializing instruction must be executed to guarantee recognition of A20M# before
a specific instruction.
The embedded Pentium processor serializes instruction execution after executing one of the
following instructions: MOV to Debug Register, MOV to Control Register, INVD, INVLPG,
IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD, CPUID, RSM and WRMSR.
The CPUID instruction can be executed at any privilege level to serialize instruction execution.
When the processor serializes instruction execution, it ensures that it has completed any
modifications to memory, including flushing any internally buffered stores; it then waits for the
EWBE# pin to go active before fetching and executing the next instruction. Systems may use the
EWBE# pin to indicate that a store is pending externally. In this manner, a system designer may
ensure that all externally pending stores complete before the processor begins to fetch and execute
the next instruction.
The processor does not generally writeback the contents of modified data in its data cache to
external memory when it serializes instruction execution. Software can force modified data to be
written back by executing the WBINVD instruction.
Whenever an instruction is executed to enable/disable paging (that is, change the PG bit of CR0),
this instruction must be followed with a jump. The instruction at the target of the branch is fetched
with the new value of PG (i.e., paging enabled/disabled); however, the jump instruction itself is
fetched with the previous value of PG. Intel386™, Intel486 and embedded Pentium processors
have slightly different requirements to enable and disable paging. In all other respects, an MOV to
CR0 that changes PG is serializing. Any MOV to CR0 that does not change PG is completely
serializing.
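A hedged ring-0 sketch (GCC inline assembly, 32-bit x86) of that rule: the MOV to CR0 that sets PG (bit 31) is followed immediately by a jump so that the next instruction is fetched under the new paging state.

    /* Enable paging and immediately jump, per the requirement above. */
    static void enable_paging(void)
    {
        __asm__ __volatile__(
            "mov %%cr0, %%eax\n\t"
            "or  $0x80000000, %%eax\n\t"  /* set CR0.PG */
            "mov %%eax, %%cr0\n\t"
            "jmp 1f\n\t"                  /* jump target fetched with PG = 1 */
            "1:"
            : : : "eax", "memory");
    }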
Whenever an instruction is executed to change the contents of CR3 while paging is enabled, the
next instruction is fetched using the translation tables that correspond to the new value of CR3.
Therefore the next instruction and the sequentially following instructions should have a mapping
based upon the new value of CR3.
Although the I/O instructions are not “serializing” because the processor does not wait for these
instructions to complete before it prefetches the next instruction, they do have the following
properties that cause them to function in a manner that is identical to previous generations. I/O
reads are not re-ordered within the processor; they wait for all internally pending stores to
complete. Note that the embedded Pentium processor does not sample the EWBE# pin during
reads. If necessary, external hardware must ensure that externally pending stores are complete
before returning BRDY#. This is the same requirement that exists on Intel386 and Intel486
processor systems. The OUT and OUTS instructions are also not “serializing,” as they do not stop
the prefetcher. They do, however, ensure that all internally buffered stores have completed, that
EWBE# has been sampled active indicating that all externally pending stores have completed and
that the I/O write has completed before they begin to execute the next instruction. Note that unlike
the Intel486 processor, it is not necessary for external hardware to ensure that externally pending
stores are complete before returning BRDY#.
On the embedded Pentium processor with MMX technology, serializing instructions require an
additional clock to complete compared to the embedded Pentium processor due to the additional
pipeline stage.
A dedicated replacement writeback buffer stores writebacks caused by linefills that replace
modified lines in the data cache. In addition, an external snoop writeback buffer stores writebacks
caused by inquire cycles that hit modified lines in the data cache. Finally, an internal snoop
writeback buffer stores writebacks caused by internal snoop cycles that hit modified lines in the
data cache. Internal and external snoops are discussed in detail in “Cache Consistency Cycles
(Inquire Cycles)” on page 19-353. Write cycles are driven to the bus with the following priority:
1. Contents of external snoop writeback buffer
2. Contents of internal snoop writeback buffer
3. Contents of replacement writeback buffer
4. Contents of write buffers.
Note that the contents of the write buffer that was written into first are driven to the bus first. If
both write buffers were written to in the same clock, the contents of the u-pipe buffer are written out
first. In the embedded Pentium processor with MMX technology, the write buffers are written out in
order; there are no dedicated u-pipe and v-pipe buffers.
The embedded Pentium processor implements two linefill buffers, one for the data cache and one
for the code cache. As information (data or code) is returned to the processor for a cache linefill, it
is written into the linefill buffer. After the entire line has been returned to the processor it is
transferred to the cache. Note that the processor requests the needed information first and uses that
information as soon as it is returned. The processor does not wait for the linefill to complete before
using the requested information.
If a line fill causes a modified line in the data cache to be replaced, the replaced line remains in the
cache until the linefill is complete. After the linefill is complete, the line being replaced is moved
into the replacement writeback buffer and the new linefill is moved into the cache.
The embedded Pentium processor interrupt priority scheme is shown in Table 16-8.
In this manual, in order to distinguish between two processors in dual processing mode, one
processor is designated as the Primary processor and the other as the Dual processor. Note that this
is a different concept than that of “master” and “checker” processors.
The Dual processor is a configuration option of the embedded Pentium processor. The Dual
processor must operate at the same bus and core frequency and bus/core ratio as the Primary
processor.
The Primary and Dual processors include logic to maintain cache consistency between the
processors and to arbitrate for the common bus. The cache consistency and bus arbitration activity
causes the dual processor pair to issue extra bus cycles that do not appear in an embedded Pentium
processor uniprocessor system.
Chapter 17, “Microprocessor Initialization and Configuration,” describes in detail how the DP
bootup, cache consistency, and bus arbitration mechanisms operate. In order to operate properly in
dual processing mode, the Primary and Dual processors require private APIC, cache consistency,
and bus arbitration interfaces, as well as a multiprocessing-ready operating system.
The dual processor interface allows the Dual processor to be added for a substantial increase in
system performance. The interface allows the Primary and Dual processor to operate in a coherent
manner that is transparent to the system.
The memory subsystem transparency was the primary goal of the cache coherency and bus
arbitration mechanisms.
The Primary and Dual processors implement a fair arbitration scheme. If the Least Recent Master
(LRM) requests the bus from the Most Recent Master (MRM), the bus is granted. The embedded
Pentium processor arbitration scheme provides no penalty to switch from one master to the next. If
pipelining is used, the two processors pipeline into and out of each other’s cycles according to the
embedded Pentium processor specification.
Cache coherency is maintained between the two processors by snooping on every bus access. The
LRM must snoop with every ADS# assertion of the MRM. Internal cache states are maintained
accordingly. If an access hits a modified line, a writeback is scheduled as the next cycle, in
accordance with the embedded Pentium processor specification.
Using the Dual processor may require special design considerations. Refer to Chapter 18,
“Hardware Interface” for more details.
The dual processor pair appears to the system bus as a single, unified processor. The operation is
identical to a uni-processor embedded Pentium processor, except as noted in “Summary of Dual
Processing Bus Cycles” on page 19-363. The interface shields the system designer from the cache
consistency and arbitration mechanisms that are necessary for dual processor operation.
Both the Primary and Dual processors contain local APIC modules. It is recommended that the
system designer supply an I/O APIC or other multiprocessing interrupt controller in the chip set
that interfaces to the local APIC blocks over a three-wire bus. The APIC allows directed interrupts
as well as inter-processor interrupts.
The Primary and Dual processors, when operating in dual processing mode, require the local APIC
modules to be hardware enabled in order to complete the bootup handshake protocol. This method
is used to “wake up” the Dual processor at an address other than the normal Intel architecture high
memory execution address. On bootup, if the Primary processor detects that a Dual processor is
present, the dual processor cache consistency and arbitration mechanisms are automatically
enabled. The bootup handshake process is supported in a protocol that is included in the embedded
Pentium processor. See Chapter 17, “Microprocessor Initialization and Configuration,” for more
details on the APIC.
The dual processor pair must arbitrate for use of the system bus as requests are generated. The
processors implement a fair arbitration mechanism.
When the LRM processor needs to run a cycle on the bus it submits a request for bus ownership to
the MRM. The MRM processor grants the LRM processor bus ownership as soon as all
outstanding bus requests have finished on the processor bus. The LRM processor assumes the
MRM state, and the processor that was just the MRM, becomes the LRM. Figure 16-12 further
illustrates this point:
Diagram (a) of Figure 16-12 shows a configuration where the Primary processor is in the MRM
state and the Dual processor is in the LRM state. The Primary processor is running a cycle on the
system bus when it receives a bus request from the Dual processor. In diagram (b) of Figure 16-12
the MRM (still the Primary processor) has received an indication that the bus request has finished.
The bus ownership has transferred in diagram (c) of Figure 16-12, where the Dual processor is now
the MRM. At this point, the Dual processor starts a bus transaction and continues to own the bus
until the LRM requests the bus.
Figure 16-12. Dual Processor Arbitration Mechanism
[Figure: diagrams (a) through (c) showing bus ownership passing from the Primary processor (MRM) to the Dual processor (LRM) after a bus request, with the roles exchanged at the hand-off.]
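The arbitration amounts to a role swap once the MRM's outstanding cycles drain, as in this toy C model (the names are ours):

    typedef enum { PRIMARY, DUAL } cpu_t;

    struct bus { cpu_t mrm; };  /* the other processor is implicitly the LRM */

    /* The LRM has requested the bus; ownership transfers as soon as the
       MRM's outstanding bus requests have finished. */
    static void lrm_request(struct bus *b, int outstanding_done)
    {
        if (outstanding_done)
            b->mrm = (b->mrm == PRIMARY) ? DUAL : PRIMARY;
    }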
A situation can arise where the Primary and Dual processors are operating in dual processor mode
with shared code or data. The first-level caches attempt to cache this code and data whenever
possible (as indicated by the page cacheability bits and the cacheability pins). The private cache
coherency mechanism guarantees data consistency across the processors. If any data is cached in
one of the processors, and the other processor attempts to access the data, the processor containing
the data notifies the requesting processor that it has cached the data. The state of the cache line in
the processor containing the data changes depending on the current state and the type of request the
other processor has made.
In some cases, the data returned by the system is ignored. This constraint is placed on the dual
processor cache consistency mechanism so that the dual processor pair looks like a single
processor to the system bus. However, in general, bus accesses are minimized to efficiently use the
available bus bandwidth.
The basic coherency mechanism requires the processor that is in the LRM state to snoop all MRM
bus activity. The MRM processor running a bus cycle watches the LRM processor for an indication
that the data is contained in the LRM cache. The following diagrams illustrate the basic coherency
mechanism. These figures show an example in which the Primary processor (the MRM) is
performing a cache line fill of data. The data requested by the Primary processor is cached by the
Dual processor (the LRM), and is in the modified state.
In diagram (a) of Figure 16-13, the Primary processor has already negotiated with the Dual
processor for use of the system bus and has started a cycle. As the Primary processor starts running
the cycle on the system bus, the Dual processor snoops the transaction. The key for the start of the
snoop sequence for the LRM processor is an assertion of ADS# by the MRM processor.
Diagram (b) of Figure 16-13 shows the Dual processor indicating to the Primary processor that the
requested data is cached and modified in the Dual processor cache. The snoop notification
mechanism uses a dedicated, two-signal interface that is private to the dual processor pair. At the
same time that the Dual processor indicates that the transaction is contained as Modified in its
cache, the Dual processor requests the bus from the Primary processor (still the MRM). The MRM
processor continues with the transaction that is outstanding on the bus, but ignores the data
returned by the system bus.
After the Dual processor notifies the Primary processor that the requested data is modified in the
Dual processor cache, the Dual processor waits for the bus transaction to complete. At this point,
the LRM/MRM state will toggle, with the Primary processor becoming the LRM processor and the
Dual processor becoming the MRM processor. This sequence of events is shown in diagram (c) of
Figure 16-13.
Diagram (c) of Figure 16-13 also shows the Dual processor writing the data back on the system
bus. The write back cycle looks like a normal cache line replacement to the system bus. The final
state of the line in the Dual processor is determined by the value of the W/R# pin as sampled during
the ADS# assertion by the Primary processor.
Finally, diagram (d) of Figure 16-13 shows the Primary processor re-running the bus transaction
that started the entire sequence. The requested data is returned by the system as a normal line fill
request without intervention from the LRM processor.
The Advanced Programmable Interrupt Controller (APIC) is an on-chip interrupt controller that
supports multiprocessing. In a uniprocessor system, the APIC may be used as the sole system
interrupt controller, or may be disabled and bypassed completely.
In a multiprocessor system, the APIC operates with an additional and external I/O APIC system
interrupt controller. The dual-processor configuration requires that the APIC be hardware enabled.
The APICs of the Primary and Dual processors are used in the bootup procedure to communicate
start-up information.
On the embedded Pentium processor, the APIC uses 3 pins: PICCLK, PICD0, and PICD1.
PICCLK is the APIC bus clock while PICD0-PICD1 form the two-wire communication bus.
To use the 8259A interrupt controller, or to completely bypass it, the APIC may be disabled using
the APICEN pin. You must use the local APICs when using the dual-processor component.
In a dual-processor configuration, the local APIC may be used with an additional device similar to
the I/O APIC. The I/O APIC is a device that captures all system interrupts and directs them to the
appropriate processors via various programmable distribution schemes. An external device
provides the APIC system clock. Interrupts that are local to each processor go through the APIC on
each chip. A system example is shown in Figure 16-14.
Figure 16-14. APIC System Configuration
[Figure: the Primary and Dual processors each contain a local APIC receiving local interrupts on LINT0 and LINT1; both local APICs connect to an I/O APIC over the PICD0, PICD1 and PICCLK lines, with a clock generator driving PICCLK; the I/O APIC collects system I/O interrupts, including those cascaded from an 8259A.]
The APIC devices in the Primary and Dual processors may receive interrupts from the I/O APIC
via the three-wire APIC bus, locally via the local interrupt pins (LINT0, LINT1), or from the other
processor via the APIC bus. The local interrupt pins, LINT0 and LINT1, are shared with the INTR
and NMI pins, respectively. When the APIC is bypassed (hardware disabled) or programmed in
“through local” mode, the 8259A interrupt (INTR) and NMI are connected to the INTR/LINT0 and
NMI/LINT1 pins of the processor. Figure 16-15 shows the APIC implementation in the embedded
Pentium processor. Note that the PICCLK has a maximum frequency of 16.67 MHz.
When the local APIC is hardware enabled, data memory accesses to its 4-Kbyte address space are
executed internally and do not generate an ADS# on the processor bus. However, a code memory
access in the 4-Kbyte APIC address space is not recognized by the APIC and generates a
cycle on the processor bus.
Note: Internally executed data memory accesses may cause the address bus to toggle even though no
ADS# is issued on the processor bus.
Figure 16-15. Local APIC Interface
[Figure: the local APIC module inside the processor drives INIT and SMI# to the interrupt logic, shares the LINT0/INTR and LINT1/NMI pins, connects externally through PICCLK, PICD0 and PICD1, and is controlled by an APIC enable input.]
To use the Through Local Mode of the local APIC, the APIC must be enabled in both hardware and
software. This is done by programming two local vector table entries, LVT1 and LVT2, at
addresses 0FEE00350H and 0FEE00360H, as external interrupts (ExtInt) and NMI, respectively.
The 8259A responds to the INTA cycles and returns the interrupt vector to the processor.
The local APIC should not be sent any interrupts prior to being programmed. Once the APIC is
programmed, it can receive interrupts.
Note that although external interrupts and NMI are passed through the local APIC to the core, the
APIC can still receive messages on the APIC bus.
• The APIC can continue to receive SMI, NMI, INIT, “startup,” and remote read messages.
• Local interrupts are masked.
• Software can enable/disable the APIC at any time. After software disables the local APIC,
pending interrupts must be handled or masked by software.
• The APIC PICCLK must be driven at all times.
The APIC ID register bits are loaded from the byte enable pins at RESET as follows:
bit 24: BE0#
bit 25: BE1#
bit 26: BE2#
bit 27: BE3#
Warning: An APIC ID of all 1s is an APIC special case (i.e., a broadcast) and must not be used. Since the
Dual processor inverts the lowest order bit of the APIC ID placed on the BE pins, the value “1110”
should also be avoided when operating in Dual Processing mode.
In a dual processor configuration, the OEM and Socket 5 should have the four BE pairs tied
together. The OEM processor loads the value seen on these four pins at RESET. The dual processor
loads the value seen on these pins and automatically inverts bit 24 of the APIC ID Register. Thus,
the two processors have unique APIC ID values.
These four pins must be valid and stable two clocks before and after the falling edge of RESET.
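The resulting IDs can be modeled in a few lines of C (the function name is ours):

    #include <stdint.h>

    /* APIC ID bits 27:24 come from BE3#..BE0#; the Dual processor inverts
       bit 24 so that the two processors get unique IDs. */
    static uint32_t apic_id(uint32_t be_pins /* 4-bit value */, int is_dual)
    {
        uint32_t id = (be_pins & 0xFu) << 24;
        if (is_dual)
            id ^= 1u << 24;
        return id;
    }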
Note: Only the Low-power Embedded Pentium Processor with MMX technology has a BF2 pin.
The external bus frequency is set on power-up RESET through the CLK pin. The processor
samples the BFn pins on the falling edge of RESET to determine which bus-to-core ratio to use.
When the BFn pins are left unconnected, the embedded Pentium processor defaults to the 2/3 ratio
and the embedded Pentium processor with MMX technology defaults to the 1/2 ratio. The BFn pins
must not change value while RESET is active. Changing the external bus speed or bus-to-core
ratio requires a “power-on” RESET pulse initialization. Once a frequency is selected, it may not be
changed with a warm reset (15 clocks). The BFn pins must meet a 1 ms setup time to the falling
edge of RESET.
Each embedded Pentium processor is specified to operate within a single bus-to-core ratio and a
specific minimum to maximum bus frequency range (corresponding to a minimum to maximum
core frequency range). Operation in other bus-to-core ratios or outside the specified operating
frequency range is not supported. Tables 16-10 through 16-12 summarize these specifications.
BF2 BF1 BF0    Bus-to-Core Ratio    Bus/Core Frequency (MHz)
0   0   0      2/5                  66/166
1   0   0      1/4                  66/266
[Figures: timing diagrams showing internal/external clock alignment, the movement of output data (A, B) from the internal to the external data latches, and the movement of input data from the external to the internal latches.]
Figure 16-18 shows how the embedded Pentium processor prevents data from changing in clock 2,
where the 2/3 external clock rising edge occurs in the middle of the internal clock phase, so it can
be properly synchronized and driven.
Figure 16-18. Processor 2/5 Bus Internal/External Data Movement
[Figure: timing diagram over six clocks showing output data A and B moving from the internal to the external latches, and input data C and D moving from the external to the internal latches.]
Stop clock is enabled by asserting the STPCLK# pin of the embedded Pentium processor. While
STPCLK# is asserted, the embedded Pentium processor stops execution and does not service
interrupts, but it allows external and interprocessor (Primary and Dual processor) snooping.
AutoHalt Powerdown is entered once the embedded Pentium processor executes a HLT instruction.
In this state, most internal units are powered-down, but the embedded Pentium processor
recognizes all interrupts and snoops.
Embedded Pentium processor pin functions (D/P#, etc.) are not affected by STPCLK# or AutoHalt.
For additional details on power management, refer to Chapter 24, “Power Management.”
The following information is defined for the CPUID instruction executed with EAX = 1. The
processor version EAX bit assignments are given in Figure 16-20. Table 16-13 lists the feature flag
bit definitions.
Figure 16-20. EAX Bit Assignments for CPUID
[Figure: EAX bit fields: type in bits 13:12, family in bits 11:8, model in bits 7:4, stepping in bits 3:0; bits 31:14 are reserved.]
The family field for the embedded Pentium processor family is 0101B (5H). The model value for
the embedded Pentium processor is 0010B (2H) or 0111B (7H), and the model value for the
embedded Pentium processor with MMX technology is 0100B (4H). The model value for the low-
power embedded Pentium processor with MMX technology is 1000B (8H).
Note: Use the MMX technology feature flag (bit 23) in the EDX register returned by CPUID, not the
model value, to detect the presence of the MMX technology feature set.
For specific information on the stepping field, consult the embedded Pentium processor family
Specification Update. The type field is defined in Table 16-14.
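The version fields can be decoded in software; the following C sketch uses GCC's <cpuid.h> helper to extract them and to test the MMX technology flag (EDX bit 23).

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;

        /* Fields per Figure 16-20: type 13:12, family 11:8, model 7:4,
           stepping 3:0. */
        printf("type %u family %u model %u stepping %u\n",
               (eax >> 12) & 0x3, (eax >> 8) & 0xF,
               (eax >> 4) & 0xF, eax & 0xF);
        printf("MMX technology: %s\n", (edx & (1u << 23)) ? "yes" : "no");
        return 0;
    }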
Two instructions, RDMSR and WRMSR (read/write model specific registers) are used to access
these registers. When these instructions are executed, the value in ECX specifies which model
specific register is being accessed.
Software must not depend on the value of reserved bits in the model specific registers. Any writes
to the model specific registers should write “0” into any reserved bits.
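In practice, a read-modify-write keeps reserved bits at the value they read back, which satisfies the rule above when reserved bits read as 0. A hedged ring-0 sketch using GCC inline assembly; the MSR index passed in is a placeholder chosen by the caller.

    #include <stdint.h>

    static uint64_t rdmsr(uint32_t msr)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((uint64_t)hi << 32) | lo;
    }

    static void wrmsr(uint32_t msr, uint64_t val)
    {
        __asm__ __volatile__("wrmsr" : : "c"(msr), "a"((uint32_t)val),
                             "d"((uint32_t)(val >> 32)));
    }

    /* Set one bit, leaving every other (including reserved) bit as read. */
    static void msr_set_bit(uint32_t msr, unsigned bit)
    {
        wrmsr(msr, rdmsr(msr) | (1ULL << bit));
    }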
For more information, refer to Chapter 26, “Model Specific Registers and Functions.”