The Core architecture has a 14-stage pipeline, down from the 31 stages that NetBurst processors from the Prescott onwards featured. A classic case of less being more! The K8 architecture has a 12-stage pipeline, up from its former 10
The driver of evolution is need. PCs are full-blown entertainment stations, and the term "rig" is used to describe that monster lurking inside your cabinet… and it's not a 400-horse V8! Such powerhouses are the realm of Intel and AMD. Their latest Core 2 and AMD 64 architectures are revolutionary, not merely evolutionary, advancements over their earlier NetBurst and K7 cores respectively.
AMD 64s have dominated the desktop market, as far as performance goes, for nigh on three years. Intel's infamous NetBurst suffered from a serious case of performance-throttling heat-stroke, and lagged behind on nearly all fronts. Any market needs competition to survive, and it's here: the Core 2 Duo emerges with significant performance leads over everything in the desktop processor realm! AMD's response is highly anticipated… but that's another story for another time.
The Fearsome Twosome
The processor industry as a whole is moving towards "parallelism". Dual cores have been available for a while now, and quad cores are expected to make an entry shortly. Both the latest architectures (Intel's Core 2 Duo and AMD's Athlon 64) are significantly faster than previous-generation products, with performance leads across most applications. Let's take a side-by-side look (literally) at all the features the latest architectures from these two CPU giants sport.
The Philosophy Of The Core…
Intel's Core range of CPUs-"Core" being the architectural handle for the Core 2 Duos-is being made on a 65 nm fab. Intel is no stranger to 65 nm: their Pentium D Presler series were the first 65 nm CPUs, remember? The earlier 90 nm core had a bad history as far as heat dissipation goes, but the 90 nm process wasn't the only culprit. Hand-in-glove was the longer pipeline the Prescott sported: 31 stages, against the 20 of the older Northwood cores.
The die shrink to 65 nm translates to reduced power leakage, and with it less heat dissipation. Besides this, the Core architecture also has fewer transistors than the previous Presler (9xx series) cores. The size of the transistors has also come down, so while the Core 2 Duo's transistor count is less than that of the Pentium D 9xx family, the density of the transistors can be greater.
Intel is talking about "performance per watt" with the Core range. Gone are the gigahertz wars where 4 GHz was considered the Holy Grail. These days the focus is on performance with power saving, and Intel has learnt its lesson from the 130 W Prescott cores, dubbed "nuclear reactors" by many: that hot doesn't pay, at least as far as TDP (Thermal Design Power) goes!
Firstly, the Core 2 Duo loses quite a few pipeline stages. Of course, clock frequencies will be lower with a shorter pipeline, but the Core micro-architecture has some other nifty tricks up its sleeve. The Core 2 Duos are designed from the ground up as dual-core solutions, unlike the Pentium D CPUs, which were "pseudo" dual-cores-more aptly, "double cores". The Smithfield and Presler Pentium Ds were basically two discrete cores put together: the Smithfield consisted of two Prescott cores, while the Presler comprised two Cedar Mill cores.
Memory Subsystems-Advanced Smart Cache
Minimising the role of memory latencies as a performance deterrent was a major objective for Intel. The processor cache subsystem also had to be beefed up to counter the latencies that arise from the lack of an on-die memory controller such as the AMD 64 has. The L1 cache is a hefty 128 KB (64 KB for each core); that's 32 KB for data and 32 KB for instructions per core. Earlier Pentium D processors had exactly half the L1 cache.
What Intel did with the Conroe’s L2 cache is somewhat revolutionary-it’s shared! The huge 4 MB of L2 cache is completely shared between the two cores.
Let's take a hypothetical situation: if one of the cores needs more cache during operation (assuming the other core is idle), all 4 MB is at its disposal. Intel has termed this technology "Advanced Smart Cache."
Let's look at another scenario, where Advanced Smart Cache scores big time! Suppose both cores are working on the same data. That data needs to be stored only once in the cache, not twice as in the case of traditional discrete-cache CPUs, where each core would need a copy of the data in its own L2 cache-causing redundancy, and therefore wastage of cache. This is a huge advantage for the Core architecture over Intel's previous products, and over AMD's current products. Furthermore, with discrete caches, each core has to communicate with the other to ensure the other hasn't modified the data in its cache in any way, since both caches need the latest copy of the data, else "cache contamination" will occur. This communication causes CPU overhead, and since it must occur via the FSB, the data bus is further bogged down.
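The capacity effect can be shown with a toy calculation. Everything here-the function, the 512 KB shared working set-is an illustrative assumption, not real hardware behaviour:

```python
# Illustrative model (not real hardware): aggregate L2 capacity left for
# private data when both cores use the same working set.

def effective_capacity(total_l2_kb, shared_data_kb, shared_cache):
    """Return the cache (KB) left over after the common working set is stored."""
    if shared_cache:
        # Shared L2 (Advanced Smart Cache): the common data is stored once.
        return total_l2_kb - shared_data_kb
    # Two discrete L2 caches: each core keeps its own copy of the common data.
    return total_l2_kb - 2 * shared_data_kb

# 4 MB of total L2, with a 512 KB working set used by both cores.
print(effective_capacity(4096, 512, shared_cache=True))   # 3584 KB left
print(effective_capacity(4096, 512, shared_cache=False))  # 3072 KB left
```

The shared design also sidesteps the coherency traffic described above, since there is only ever one copy of the line to keep current.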
Wide Dynamic Execution Revisited
One of the technologies present in the Conroe touted by Intel isn't new at all: "Wide Dynamic Execution" has been at work since Pentium Pro days! What "wide" means is a higher IPC (Instructions Per Clock) count. While each of the cores on the Conroe can handle up to four instructions per clock, previous-generation Pentium 4s could handle a maximum of three.

The Core architecture also features "Micro-Ops Fusion" technology. Suppose one complex command decodes into more than one micro-instruction. Micro-Ops Fusion bundles these micro-ops together, to be executed in a particular order-for example, an ALU operation and a Load operation will be combined to form a single micro-op. To the CPU's pipeline, this encapsulation of micro-instructions appears as a single command; it's only at the execution stage that the individual operations are executed in succession, and not simultaneously.
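A rough Python model of the idea-the operation names and the load+ALU pairing rule are invented for illustration, not taken from Intel's decoder rules:

```python
# Toy model of Micro-Ops Fusion: a load + ALU pair travels the pipeline as
# one fused entry, then splits into its components at execute time.

def fuse(micro_ops):
    """Bundle adjacent load+add pairs into single fused pipeline entries."""
    fused, i = [], 0
    while i < len(micro_ops):
        if micro_ops[i] == "load" and i + 1 < len(micro_ops) and micro_ops[i + 1] == "add":
            fused.append(("load", "add"))   # tracked as ONE entry in the pipeline
            i += 2
        else:
            fused.append((micro_ops[i],))
            i += 1
    return fused

def execute(fused_entries):
    # At the execution stage the components run in succession, not simultaneously.
    return [op for entry in fused_entries for op in entry]

prog = ["load", "add", "store"]
entries = fuse(prog)
print(len(entries))        # 2 pipeline entries instead of 3
print(execute(entries))    # ['load', 'add', 'store']
```

Fewer tracked entries means less pressure on the buffers between decode and execute, which is exactly the benefit the next paragraphs describe.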
Note the four decoders-three simple, one complex. The reservation station improves data bandwidth. Note the massive L2 cache shared between the cores, and the larger-entry Translation Lookaside Buffer in the L1 cache
The proverbial partner, Macro-Op Fusion, allows a set of two commands to be executed as a single command. This means the four decoder units on the Core 2 Duo can in reality decode five instructions, instead of four, per clock cycle. Such command combinations aren't always possible, but even if it happens only once per clock (and the chances of it happening this frequently are very good), five instructions will have been decoded in that cycle.
This fusion has other merits. Since the fused instructions now move along the pipeline as a single entity, buffer space is saved in case of Out of Order Execution, and the decoding bandwidth requirement is also reduced. Macro-Op Fusion increases the number of instructions that can be stuffed into the pipeline, raising CPU processing speed, and saves power, because the CPU is able to tackle queued commands that much faster.
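Under the assumption of four decode slots per clock and a fusible compare + conditional-branch pair in the instruction stream, the decode-bandwidth gain can be sketched like so (a toy model, not a cycle-accurate simulation; the instruction names are illustrative):

```python
# Toy decode model: four decoder slots per clock, with Macro-Op Fusion
# letting a cmp + jcc (conditional jump) pair occupy a single slot.

def cycles_to_decode(instrs, fusion=True):
    i, cycles = 0, 0
    while i < len(instrs):
        slots = 4                      # four decoder slots per clock
        while slots and i < len(instrs):
            if (fusion and instrs[i] == "cmp"
                    and i + 1 < len(instrs) and instrs[i + 1] == "jcc"):
                i += 2                 # fused pair fills only one slot
            else:
                i += 1
            slots -= 1
        cycles += 1
    return cycles

stream = ["add", "cmp", "jcc", "mov", "sub"] * 4   # 20 instructions
print(cycles_to_decode(stream, fusion=True))    # 4 cycles (5 instructions/clock)
print(cycles_to_decode(stream, fusion=False))   # 5 cycles (4 instructions/clock)
```

With one fusible pair per group of five instructions, the front end effectively decodes five instructions per clock with only four slots.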
Micro-Op Fusion and Macro-Op Fusion work hand-in-glove to improve CPU load and execution efficiency.
Intelligent Power Capability-I’m The Frugal One!
The Core 2 range supports Enhanced Halt State and Enhanced Intel SpeedStep. This enables huge power savings, as the CPU throttles down when not under load. The Conroe will work at speeds from 1.6 GHz to 2.93 GHz. This down-clocking is achieved by reducing the multiplier: the Core allows a minimum multiplier of 6 and a maximum of 11, on an FSB quad-pumped at 266 MHz.
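The clock range follows directly from base clock × multiplier-a quick check (266.67 MHz is the usual nominal value behind a "266 MHz" bus figure):

```python
# Core clock = FSB base clock x multiplier (266 MHz bus, multipliers 6 to 11).
fsb_mhz = 266.67

for mult in (6, 11):
    ghz = fsb_mhz * mult / 1000
    print(f"x{mult}: {ghz:.2f} GHz")   # x6: 1.60 GHz, x11: 2.93 GHz
```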
Equally important-perhaps more so-is the fact that the Conroe as a whole can interactively disable any of its subsystems that aren’t in use at a point in time. Intel takes pains to explain that there is no latency whatsoever involved in this On and Off switching process.
The Conroe can also dynamically switch off parts of its L2 cache that aren't in use-previously impossible. Traditionally, for any cache transaction, however small, the entire block of cache needed to be activated.
Advanced Digital Media Boost-Sounds Impressive?
Multimedia is all about SIMD (Single Instruction, Multiple Data) instruction sets. SSE (Streaming SIMD Extensions) is Intel's baby, born in 1999 and adopted first by their Pentium III processors. Applications like video and audio editing and encoding, data encryption and their ilk use a lot of SSE. Most SSE instructions today are 128-bit, but are operated on 64 bits at a time-meaning that a complete SSE instruction takes two processor clock cycles.
The Conroe changes things with the ability to process a 128-bit SSE in a single clock cycle. SSE 4 makes its first appearance in the Conroe, consisting of eight new SSE commands-over and above the existing SSE 3 instructions.
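The cycle counts above are just the instruction width divided by the execution datapath width, rounded up-a sketch of that arithmetic:

```python
import math

# Cycles to push one SSE operation through an execution datapath of a given width.
def sse_cycles(op_bits, datapath_bits):
    return math.ceil(op_bits / datapath_bits)

print(sse_cycles(128, 64))    # pre-Conroe: 64-bit datapath, two cycles
print(sse_cycles(128, 128))   # Conroe: full 128-bit datapath, one cycle
```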
What this should mean for home users is quicker video and audio encoding, and a richer multimedia experience-for example, DVD playback or MP3 encoding.
On the AMD side, the major difference between Socket 754 and Socket 939 was in the design of the memory controller. Socket 754 had a single-channel, 64-bit integrated memory controller, while Socket 939's architecture included a dual-channel, 128-bit controller. The difference in performance came not only from the memory controller type used, but also from the fact that the faster CPUs were all made with 939 pins. The IMC supported up to dual-channel DDR 400; later revisions saw native DDR 500 support.
Prefetching is the name of the cache game. The Core 2 Duo features two data and one instruction prefetcher per core, and two prefetchers for the L2 cache, making for a grand total of eight prefetchers.
While Intel's switch from DDR to DDR2 was relatively simple-no architectural changes were required at the processor level-for AMD, things got a lot more complex. In case you haven't guessed why: the on-die memory controller needed redesigning to support DDR2.
Architecturally, other than a few tweaks to the memory controller (namely DDR2 support), the AM2 Socket-based AMD 64s are nearly identical to their 939-pin brethren. AMD is still using Fab 30, and a newer Fab 36-both 90 nm fabs, at their plant in Dresden, Germany.
AMD has followed “true” dual-core architecture from the very start with their 939 dual-cores. The cores can communicate with each other through the Crossbar Controller (internal to the processor). In the case of Intel dual-cores, any communication between the cores occurs via the FSB (external to the processor), which causes latency, especially when you consider the speed of the FSB is a measly 800 MHz. Compare this number to the CPU clock figures and you’ll get the drift.
AMD’s On-Die Memory Controller
AMD 64's mainstay has been the Integrated Memory Controller, or IMC. This has been one of the major advancements in desktop processing in modern times. What is it? Simply take the memory controller traditionally present on the MCH (Memory Controller Hub) or Northbridge, and relocate it to the CPU die. What happens? Well, for one, the pin count increases (939 and now 940 are a lot of pins!). Performance-wise, the bottleneck between the CPU and the memory is removed, because the FSB route between CPU and memory becomes defunct. Communication occurs at CPU clock speeds rather than FSB clock speeds. Communication with the memory therefore becomes much faster, and in comparison to all Intel CPUs, AMD 64s utilise memory bandwidth more efficiently.
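A back-of-the-envelope illustration of why removing the FSB hop matters-all the numbers here are invented for the example, not measured figures:

```python
# Illustrative latency model (all numbers invented): an on-die memory
# controller removes the CPU -> Northbridge trip that every memory request
# otherwise pays over the FSB.
fsb_hop_ns = 20       # assumed round-trip cost of reaching the Northbridge MCH
dram_access_ns = 50   # assumed DRAM access time

print("Via Northbridge MCH:", fsb_hop_ns + dram_access_ns, "ns")  # 70 ns
print("On-die IMC:", dram_access_ns, "ns")                        # 50 ns
```

The real-world gap varies with chipset and memory, but the structural point holds: one fewer hop on every single memory access.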
Hand-in-hand with the IMC, AMD 64s have a massive L1 cache size advantage: 128 KB of L1 cache per core is a huge figure, and it contributes significantly to the efficiency of memory utilisation. The L1 and L2 caches remain identical to those of the Socket 939 processors, however; the L2 cache is either 512 KB or 1 MB depending on the model.
Execution Subsystem And Decoding:
Complex Decoders Galore
Let’s take a very brief look at what a decoder on a CPU is supposed to do. CPUs are bombarded with instructions, which can be, very simply, of two types-opcode and addresses, or operations and locations. The decoder on a CPU is tasked with deciphering these instructions and reducing the variance of their length-forming even or close-to-even sized packets, since x86 instructions can be between one and 15 bytes long. In the bargain, RISC-like instructions are created, which make the job of scheduling and execution that much easier.
Typically, the simple decoders are handed the task of working on the most frequently used x86 instructions, which are converted into micro-ops-typically a single micro-op per instruction. The complex decoders work on the heavier, more CISC-like instructions, and will produce more than one micro-op per operation. Along with the AMD 64 CPUs, older Athlon XPs and Intel's Pentium 4 and Pentium III CPUs all use this method of decoding.
AMD 64s, however, have three complex decoders. How do they stack up? There are two ways of decoding: a direct, easier path, and a vector path that is more suitable for certain complex instructions, but which may take longer. AMD 64s use both. Each of the complex decoders can perform both direct-path and vector-path decoding, but direct-path decoding is preferred because the resultant macro-ops are fewer in number. When it comes to decoding complex instructions, the K8 architecture definitely scores big time. However, x86 instructions are largely simple, and very complex instructions are handled by a Microcode Sequencer-another unit placed on the CPU just to handle instructions that are too complex for the decoders. Nonetheless, for instructions that are complex, but not so complex as to require the Sequencer to kick in, the three complex decoders aboard the AMD 64 processors are quite powerful.
AMD-Latency Vs. Bandwidth
Due to the IMC, and AMD's very architecture, their processors are more sensitive to changes in memory timings than to bandwidth (read: frequency) changes. This is why the AMD 64 939s initially performed much better than their AM2 equivalents: the latencies associated with DDR 2 were much higher.
It’s All About The Cache!
AMD 64s still offer double the Conroe's L1 cache: 128 KB per core, a massive 256 KB in total, while the Conroe's 64 KB per core is Intel's largest L1 offering in the desktop segment. The L2 cache bus on the Conroe is 256 bits wide; the AMD 64's is half as wide, at 128 bits. Intel still has double the L2 cache though (4 MB), and it's shared!
Double-Core Or Dual-Core?
Before the Core 2 Duo, Intel was basically integrating two discrete cores into a single package-double cores. AMD 64s were the first "true" dual-cores for the desktop. The cores on an X2 processor communicate with each other internally, via the Crossbar Switch; the cores on a Pentium D used the external FSB to communicate, which is much slower.
CACHING IS THE GAME
Closing Thoughts
Let's hope these promises convert into deliveries-they have thus far!