To get these applications performing at their best, it all boiled down to just one thing: CPUs needed to execute more instructions per second. The easiest way of doing this, of course, was to kick up the CPU’s clock frequency, which triggered the “MegaHertz (and later, the GigaHertz) Wars.” This was all very well for a while, but there was also the matter of executing more instructions in each cycle of that clock signal. Naturally, the way to do this is to have instructions run simultaneously; the concept is called Instruction Level Parallelism (ILP), and it goes on in your CPU right now. The processor collects instructions, sees which of them it can run in parallel, and then does so.
A look at the way the Cell is built
Now, your average computer program is a single-threaded application: put simply, it churns out a single stream of instructions, oblivious to the fact that there is a processor under the hood trying to figure out which of those instructions it can run simultaneously to increase performance. The result? Every new generation of processors delivered only a 10 to 20 per cent boost in performance: quite distressing.
But things are changing. The GigaHertz wars are over, and everyone has realised that the only way to squeeze out more performance is to have multiple program threads running at the same time: Thread Level Parallelism (TLP). The way to do this is to increase the number of processors doing the work: first with the powerful dual-processor workstations, and now today’s dual-core, tomorrow’s quad-core, and, in five years, eighty-core processors (at least, that’s what Intel tells us).
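To make the idea concrete, here is a minimal sketch of TLP using standard C++ threads. It isn’t from the Cell’s toolchain; the two workloads are invented stand-ins for the sort of independent jobs, say a physics step and an AI update, that a game might run side by side on separate cores:

```cpp
#include <cmath>
#include <iostream>
#include <thread>

// Stand-in workload #1: think "physics step".
double simulate_physics_step() {
    double x = 0;
    for (int i = 1; i < 1000000; ++i) x += std::sqrt(static_cast<double>(i));
    return x;
}

// Stand-in workload #2: think "AI update".
double update_ai_state() {
    double y = 0;
    for (int i = 1; i < 1000000; ++i) y += std::log(static_cast<double>(i));
    return y;
}

int main() {
    double physics = 0, ai = 0;

    // Two independent instruction streams; on a dual-core chip they run
    // at the same time instead of taking turns on one core.
    std::thread t1([&] { physics = simulate_physics_step(); });
    std::thread t2([&] { ai = update_ai_state(); });
    t1.join();
    t2.join();

    std::cout << physics + ai << '\n';
}
```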
While Intel and AMD have chosen to have identical cores on their chips, IBM, as usual, decided to do things a little differently.
The Cell has been optimised to work with parallel workloads, and few things can be broken into as many parallel processes as games. Rendering frames requires processing millions of pixels, and this task can be broken into as many parallel tasks as one wants. Ditto physics and AI calculations. The Cell’s obvious leaning towards better physics and AI means game scenarios can get more crowded (the more objects you place in a scene, the more complex the physics calculations become) without any loss in performance. In fact, the Cell is even capable of taking up the job of the GPU, though industry experts suggest that it won’t be replacing the dedicated GPU just yet.
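As a rough illustration of how naturally pixel work splits up, here’s a hedged sketch: the frame layout, the brighten() operation and the worker count are all assumptions made up for this example, but the pattern, one independent slice of the frame per core, is exactly the sort of data parallelism the Cell’s SPEs are built for:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Brighten one slice of the frame; every pixel is independent of every other.
void brighten(std::vector<uint8_t>& pixels, size_t lo, size_t hi) {
    for (size_t i = lo; i < hi; ++i)
        pixels[i] = static_cast<uint8_t>(std::min(pixels[i] + 16, 255));
}

int main() {
    std::vector<uint8_t> frame(1280 * 720, 100);  // one greyscale frame
    const size_t workers = 8;                     // e.g. eight SPE-like cores
    const size_t chunk = frame.size() / workers;

    std::vector<std::thread> pool;
    for (size_t w = 0; w < workers; ++w) {
        size_t lo = w * chunk;
        size_t hi = (w == workers - 1) ? frame.size() : lo + chunk;
        // Each worker gets its own slice: no waiting, no coordination.
        pool.emplace_back(brighten, std::ref(frame), lo, hi);
    }
    for (auto& t : pool) t.join();
}
```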
The Cell’s main purpose in the PS3 is to ensure that there are no bottlenecks whatsoever, so obscene amounts of memory bandwidth are the order of the day. You no doubt remember Rambus, the people who came up with RDRAM: a technology that was undoubtedly superior to DDR-SDRAM, but never really took off. Cell sports Rambus’ new XDR memory controller, which gives the processor a memory bandwidth of 25.6 GBps, more than twice that of any PC processor. This also comes very close to the 32 GBps that GPUs get from their memory controllers, so the Cell won’t bottleneck the GPU the way PC processors do.
But what’s so special about these processors that makes them so fast? The secret to Cell’s performance is a…
…Return To Innocence
The Cell’s cores are all in-order cores: a design that hasn’t been seen since the days of the old Pentium processors. Processors since then have been out-of-order processors (no, not in the “doesn’t work” sense), and in the world of general-purpose processing (office applications and so on), it’s these out-of-order processors that deliver better performance. So why did IBM take Cell back into the Dark Ages? Let’s understand the difference between in-order and out-of-order execution first. Consider these instructions:
1. A = B + C
2. D = A + E
3. X = Y + Z
You’ll notice that instruction 2 has to be executed after 1, because it depends on the resulting value of A. Instruction 3, however, is independent of the other two. An in-order processor, as the name suggests, will execute these instructions in the order that it receives them. If all the data it needs is readily available in the processor’s cache, then the instructions proceed swiftly, and all is well.
Suppose, now, that the value of B isn’t in the cache, but C is. It’s only a matter of four or five CPU clock cycles for it to fetch C, but to get to B, it has to first search the cache and encounter a “cache miss,” following which it has to access the system’s main memory to get its data, resulting in a delay of a couple of hundred CPU cycles.
All this time, instruction 3 has to wait its turn, even though the idle CPU cycles wasted hunting down B could have been used to process it and get the job over with. This is where out-of-order processors come in. They use an “Instruction Window” (something like a buffer where incoming instructions are stored), within which they look for instructions that can be run independently, and process them in parallel, so a cache miss isn’t such a disaster.
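Here are the same three instructions written out as C++ code (a generic illustration, nothing Cell-specific), with comments marking where an in-order core stalls and where an out-of-order core finds useful work. The variable names mirror the list above; the values are arbitrary:

```cpp
// In-order versus out-of-order (OoO) execution on a dependency chain.
int compute(int B, int C, int E, int Y, int Z) {
    int A = B + C;  // (1) if B misses the cache, this stalls for ~200 cycles
    int D = A + E;  // (2) depends on A: must wait for (1) on any core
    int X = Y + Z;  // (3) independent: an OoO core runs it during the stall,
                    //     while an in-order core sits idle until (1) completes
    return D + X;
}
```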
A microscopic view of the Cell: the black “bands” you see on either side are the SPEs; the PPE occupies the top left
Put simply, the difference is the same as that between shopping for items in the order they’re written on your list, and picking up whichever items on the list you can see, irrespective of order. Out-of-order processors have been immensely successful in multi-tasking environments (the most common scenario for PC use) because they receive instructions from multiple programs, all independent of each other.
Coming back to our original question: why does the Cell forgo such obvious advantages to go with the in-order approach?
Firstly, apart from the dire consequences of a cache miss, in-order cores perform quite well. Secondly, and more importantly, in-order processors are simple: without the added circuitry that enables out-of-order execution, the transistor count of an in-order core is low, which is how IBM is able to fit nine cores on that little chip. To get around the cache-miss hassle, IBM has resorted to a neat trick.
We mentioned before that each SPE has its own local store rather than a cache. This is because, unlike cache memory, which has its caching logic hard-wired into it, the local store is accessible to the programmer. What this means is that the onus is now on the programmer (or rather, the compiler) to ensure that any data the SPE needs is available in this local store exactly when it needs it, minimising delays, or at the very least, making them predictable. Note that this applies only to the SPEs; the PPE still has a traditional L1 cache, and gets its performance from the speeds the Cell will run at: between 3.2 and 4 GHz.
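The usual trick for feeding such a software-managed local store is double buffering: fetch the next block of data while computing on the current one, so the memory delay hides behind useful work. The sketch below is a generic illustration, not the real Cell SDK: dma_get() and dma_wait() are hypothetical stand-ins, stubbed with a plain memcpy so the code actually runs:

```cpp
#include <cstddef>
#include <cstring>
#include <iostream>
#include <vector>

const size_t BLOCK = 4096;
float buffer[2][BLOCK];  // two buffers standing in for the SPE's local store

// Hypothetical DMA engine, stubbed as a synchronous memcpy so this runs.
// On real hardware the fetch would start in the background and return at once.
void dma_get(float* dst, const float* src, size_t n, int /*tag*/) {
    std::memcpy(dst, src, n * sizeof(float));
}
void dma_wait(int /*tag*/) {}  // real hardware: block until transfer 'tag' is done

float process(const float* data, size_t n) {  // the actual computation
    float sum = 0;
    for (size_t i = 0; i < n; ++i) sum += data[i];
    return sum;
}

// Double buffering: start fetching block i+1, then compute on block i.
// The delay is hidden behind useful work, or at the very least predictable.
float stream(const float* main_memory, size_t total_blocks) {
    float total = 0;
    dma_get(buffer[0], main_memory, BLOCK, 0);  // prefetch the first block
    for (size_t i = 0; i < total_blocks; ++i) {
        int cur = static_cast<int>(i & 1), nxt = cur ^ 1;
        if (i + 1 < total_blocks)  // kick off the next transfer early...
            dma_get(buffer[nxt], main_memory + (i + 1) * BLOCK, BLOCK, nxt);
        dma_wait(cur);                         // ...wait only for the current block
        total += process(buffer[cur], BLOCK);  // compute overlaps the transfer
    }
    return total;
}

int main() {
    std::vector<float> memory(8 * BLOCK, 1.0f);
    std::cout << stream(memory.data(), 8) << '\n';  // prints 32768
}
```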
Raw speed isn’t the only thing behind the Cell, though. It does pack another really…
…Big Picture
Cell isn’t just a processor: ’tis but a mere piece in the game for World Domination (TM). Cell’s software is compiled into little “apulets,” which will distribute themselves to all available Cell processors, be it on the same board, the same LAN, or even over the Internet. The result is a massive grid computer, where every idle Cell you’re connected to becomes a possible candidate to offload computing on. Imagine gaming on your PS3 while your Cell-enabled HDTV and PDA crunch numbers for a research lab in one corner of the Earth trying to find a cure for cancer!
So will we ever see the Cell in our desktops? Quite possibly, but a few things might throw a spanner or two into the machinery. Firstly, there’s the PPE. While it’s way ahead of the competition for gaming, it will still lose out to current processors when it comes to general-purpose applications. However, throwing more PPEs onto the chip could well turn that around. Secondly, the Cell absolves itself of a lot of the responsibilities of regular processors, instead offloading them onto the developers who will be writing the compilers for it. The insane task of writing a compiler that can intelligently exploit the Cell and all its features might just put programmers off, and that would be the end of the Cell’s PC prospects.
The only people willing to code at that level are game developers, so game consoles, at least, will see a lot more of the Cell. Sony and Toshiba plan to release Cell-based HDTVs and PDAs in the near future, and the Cell will probably conquer the living room before it does the desktop.