March  06, 2003 (updated)

Looking at Intel's Prescott die 

(by Hans de Vries)

 

 

We take a hard look at Intel's Prescott die to see if we can discover any undisclosed features

 

Intel gave its first presentation on the new Prescott x86 processor during its Spring 2003 Developer Forum. The processor will be introduced in Q4 2003 on Intel's new 90 nm strained-silicon process. Intel had already stated that Prescott would be a significant extension of the Pentium 4 Netburst architecture, and the first glimpses of the Prescott layout show just that. It is a completely new layout with numerous changes. Looking closely at the die one can see (or at least imagine...) numerous improvements, many of them not yet publicly disclosed by Intel. 

  The info given by Intel until now.  

 

            -   A larger L2 Unified Cache:            1024 kByte versus 512 kByte  for Northwood. 

            -   A larger L1 Data Cache:               16 kByte versus 8 kByte for the Pentium 4.

            -   An extremely low clock skew:       7 picoseconds versus 22 picoseconds for Northwood.

            -   "La Grande" Support                     protection for providers or consumers?  It's what exactly?   

            -   13 new instructions  (PNI)

 Intel also states many improvements (but improved how?):

 

            -   Improved Hyper-Threading technology.

            -   Improved pre-fetcher 

            -   Improved branch predictor.

            -   Improved Integer Multiply latency.

            -   Improved Power management.

            -   Additional Write Combining buffers

 

Our starting point:  Spring IDF 2003 foils showing (some) layout information:

 

  In the article below we'll discuss the following new features we discovered:

 

         (1)     Instruction Trace Cache extended from 12 to 16 k uOps ?

         (2)     4 instructions/cycle fetch and retire ? (up from 3)

         (3)     Floating Point unit changed location on the Die

         (4)     Two (!) Rapid Execution Engines ?

         (5)     Very wide high speed L3 Cache Bus ?

         (6)     Prescott die size 109 mm2 (updated March 7, 2003)

 

A speculative Die diagram:

 

 

The many small white rectangles in the die diagram are so-called macro-cells. These are pre-routed blocks such as RAMs and ROMs, but also critical units that have been laid out by hand, such as the high-speed ALUs of the Rapid Execution Engine. The automated placement-and-routing software then handles the rest of the die.

 


 

  

(1)  Instruction Trace Cache extended from 12 to 16k uOps ?

 

Comparing SRAM sizes 

 

We see a significant increase in size when we compare the Trace Caches of Northwood and Prescott with their 256 kByte L2 Cache blocks. If we may presume that Intel used its densest type of SRAM for both large structures, then we can obtain an indication of the Trace Cache sizes in bytes as well.

 

                                Northwood Trace Cache:    256 kByte / 2.4   =     ~  106 kByte  +/- ?

                                Prescott Trace Cache:        256 kByte / 1.6   =     ~  160 kByte  +/- ?
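
The same arithmetic as a small sketch (in Python). The 2.4 and 1.6 area ratios are our own estimates from the die plots, and the equal-SRAM-density assumption is exactly that, an assumption:

    # Rough Trace Cache size estimate from die-plot area ratios (sketch).
    # Assumption: the Trace Cache uses the same SRAM density as the 256 kByte
    # L2 block, so its byte size scales directly with its relative area.
    L2_BLOCK_KBYTE = 256

    # L2-block area divided by Trace Cache area, measured from the die plots.
    area_ratios = {"Northwood": 2.4, "Prescott": 1.6}

    for core, ratio in area_ratios.items():
        print(f"{core} Trace Cache: ~{L2_BLOCK_KBYTE / ratio:.1f} kByte")
    # Northwood Trace Cache: ~106.7 kByte
    # Prescott  Trace Cache: ~160.0 kByte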

 

The Trace Cache CPUID

 

Northwood's Trace Cache contains 12 kOps. That's 4096 lines of 3 uOps each. One line can be read each cycle. 

( The actual implementation may provide 6 uOps every 2 cycles, at least according to some patents )

The best place to look for the Prescott Trace Cache size in uOps is the CPUID table. The Trace Cache values were already published with the introduction of the Willamette and are still the same in the latest Prescott PNI document.

 In this table we find the following three entries for the Trace Cache:

 

70h:    12 kOps,  Trace Cache,  8-way set associative
71h:    16 kOps,  Trace Cache,  8-way set associative
72h:    32 kOps,  Trace Cache,  8-way set associative

 

So it looks as though we may expect a value of 71h in Prescott:   16k uOps.
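
As an aside, here is a minimal sketch of how such a leaf-2 descriptor byte maps onto the table above (the descriptor values are taken from Intel's CPUID documentation; the little helper function is ours, for illustration only):

    # CPUID leaf 2 Trace Cache descriptor bytes, as listed by Intel.
    TRACE_CACHE_DESCRIPTORS = {
        0x70: "12 kOps, Trace Cache, 8-way set associative",
        0x71: "16 kOps, Trace Cache, 8-way set associative",
        0x72: "32 kOps, Trace Cache, 8-way set associative",
    }

    def describe_trace_cache(descriptor_byte):
        """Return the Trace Cache description for a leaf-2 descriptor, or None."""
        return TRACE_CACHE_DESCRIPTORS.get(descriptor_byte)

    print(describe_trace_cache(0x71))   # the value we expect Prescott to report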

 


 

 

(2)  Four instructions/cycle fetch and retire (up from 3).

 

Double odd or 4-way

 

So if we have 16k uOps (16,384 uOps), how many lines of three uOps do we have?

16,384 / 3 = 5461.33 lines? Or maybe a whole number, 5461? The 3 is already an awkward number, and 5461 entries in a memory is equally odd. It would not be impossible if the addressing were fully associative, but the same CPUID table states that the Trace Cache is 8-way set associative. This means that the number of entries divided by 8 (the number of sets) must be a power of 2. Clearly 5461 / 8 = 682.625 is nowhere near a power of 2!

 

Doing it four way

 

We get nice numbers again, however, if we presume that each Trace Cache entry now holds 4 instructions, up from 3.

The Trace Cache keeps the same 4096 entries, but each now contains 4 instructions. This would mean that Prescott can send 4 instructions per cycle into the processing pipeline, up from 3. One can be happy if an average program effectively reaches 1.5 instructions per cycle, so 4 per cycle is sufficient to fully support at least two threads.
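
Both organizations in one short sketch (the power-of-two set count is the usual constraint for an 8-way set-associative structure; the 16k uOps figure for Prescott is of course still our speculation):

    import math

    TOTAL_UOPS = 16 * 1024   # 16k uOps, as suggested by CPUID descriptor 71h
    WAYS = 8                 # 8-way set associative, per the CPUID table

    def organization(uops_per_line):
        """Lines and sets for a given line width; the set count must be a power of 2."""
        lines = TOTAL_UOPS / uops_per_line
        sets = lines / WAYS
        valid = sets.is_integer() and math.log2(sets).is_integer()
        return lines, sets, valid

    print(organization(3))   # (5461.33..., 682.66..., False) -> awkward
    print(organization(4))   # (4096.0, 512.0, True)          -> 4096 lines, 512 sets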

 

Looking at the layout.

 

The pipeline stages following the Trace Cache that are affected by this change would be stages 6 through 9: ALLOCATE, RENAME1, RENAME2 and QUEUE. These stages are located at the top-right corner of the die. We see that the whole layout has been thoroughly re-arranged, including a total vertical flip. The instructions flow from left to right. At the start we expect the microcode ROM plus the microcode sequencer, followed by a queue for the instructions from the Trace Cache and the microcode sequencer. The Allocate and Rename stages reserve entries in various buffers further down the pipeline (outside the area shown above). At the end we expect the specialized queues for memory accesses and for general integer and floating point instructions. Here the 4-way (formerly 3-way) division into equal instruction paths ends.

 

What goes in must come out:   4-way retiring.

 

A logical consequence of a Trace Cache that can provide 4 micro-operations per cycle is the processor's ability to retire micro-operations at the same rate at the very end of the processing pipeline. 

 


 

 

(3)  Floating Point unit changed location on the Die.

 

Making room for ...

 

The top two images below are from one of the Intel presentation sheets. They show what the layout engineer sees on his monitor when looking at the Floating Point units of Northwood and Prescott. The Prescott view shows how various units are intertwined. This is because the layout software was allowed to place cells anywhere it wanted within the entire area, unlike in Northwood's case, where it was not allowed to place cells outside their own bounding boxes. 

 

The two middle images show the same Floating Point units. The Northwood version comes from a high resolution die plot, while the vague Prescott Floating Point unit was found on the Prescott die plot shown during the Spring 2003 IDF.

The lower two images show how the location has changed. Again, this shows that Prescott is a significant change from its Willamette / Northwood predecessors.


 

(4)  Two (!) Rapid Execution Engines ?

 

 

Most remarkable is what we may see at the location where we expect the Rapid Execution Engine and the L1 Data Cache: almost the same location that we know from the Pentium 4. The L1 Data Cache, connected to the L2 with its very wide data bus, is located close to the middle of the L2 cache. It seems that there are two identical copies, partly mirrored, next to each other. Now when we compare this area with the central part of Northwood's integer core, we recognize a number of "hand-routed" macro-cells. These macros were hand-routed because they are the timing-critical building blocks of the Rapid Execution Engine: units like the ALUs, the AGUs and the bypass network.

Two copies of the L1 Data Cache as well?

 

It looks like there may be two copies of the L1 Data Cache as well, both with an increased size of 16 kByte.


 

(5)  Very wide high speed L3 Cache Bus ?

 

The drawing below stems from a year-old article that I never published. The only thing that has changed now is the new die plot of Prescott. The 52,428,800-bit (presumably) L3 cache SRAM was the first 90 nm device to be shown. The L3 cache contains a significant amount of extra logic and smaller memories that may be tag RAM.

 

A lot of IO buffers

 

One eye-catching detail on the L3 cache is the long row of little rectangles that runs along 90% of one of the die sides.

If we zoom in, we can actually count them: 2 rows of 128 cells. These are most likely IO buffers. The fact that we now see a very similar row at the opposite side of Prescott's die only makes this more likely.

Intel did mention that the SRAM runs at a frequency higher than that of the then-fastest Pentium 4. This may mean that the old Xeon tradition of running the L3 cache bus at the same frequency as the processor core is continued with the Pentium 5.
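
Purely as a back-of-the-envelope sketch: if all 2 x 128 cells really are data I/O and the bus really does run at core frequency, the resulting bandwidth would look roughly like this (the 3.2 GHz clock is just an example figure, not an Intel statement):

    # Hypothetical L3 bus bandwidth, purely illustrative.
    io_cells = 2 * 128        # two rows of 128 cells counted on the die plot
    bus_bits = io_cells       # assumption: every cell carries one data bit
    core_clock_hz = 3.2e9     # example core frequency (assumption)

    bandwidth_bytes_per_s = (bus_bits / 8) * core_clock_hz
    print(f"~{bandwidth_bytes_per_s / 1e9:.0f} GB/s")   # ~102 GB/s under these assumptions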

 

One wonders what the relation is with the newly announced 775-contact pinless Land Grid Array (LGA) package, which has 297 more contacts than the current 478-pin package.

 


 

 

(6)  Prescott die size 109 mm2  (10.7 x 10.2 mm).

(updated March 7, 2003)

 

 Thanks to Hisa Ando-san from Japan:  Prescott's correct die size

 

Thanks to Ando-san, who wrote how he first calculated Prescott's die size from Louis Burns' Spring 2003 IDF presentation and then found the exact values in presentation 19.7 at ISSCC 2003: "A Scalable Sub-10ps Skew Global Clock Distribution for a 90 nm Multi-GHz IA Microprocessor" (N. Bindal, T. Kelly, N. Velastegui, R. Raman, K. Wong).

The exact dimensions are 10.7 mm x 10.2 mm. The wafer calculations give 10.9 and 10.34 mm, from which we must subtract the narrow scribe lines, which are on the order of 100 um or so. 

 

  Pentium 5  Width:    10.7  mm
  Pentium 5  Height:     10.2  mm
  Pentium 5  Die Size:     109  mm2
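
The arithmetic behind these figures, as a quick sketch (the 0.1 to 0.2 mm scribe-line width is only the rough estimate mentioned above):

    # Die area from the exact ISSCC 2003 dimensions.
    width_mm, height_mm = 10.7, 10.2
    print(f"die area: {width_mm * height_mm:.1f} mm2")    # 109.1 mm2

    # Cross-check: the wafer measurements (10.9 x 10.34 mm) include the scribe
    # lines; subtracting something on the order of 0.1 - 0.2 mm per dimension
    # lands in the same range.
    for scribe_mm in (0.1, 0.2):
        print(f"scribe {scribe_mm} mm -> {10.9 - scribe_mm:.2f} x {10.34 - scribe_mm:.2f} mm")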

 

Related articles 

 

March 26, 2003:     Clues for Yamhill

April    20, 2003:     Looking at Intel's Prescott die, part II
