# AMD Graphics Core Next: The architecture of future AMD GPUs!



## Jaskanwar Singh (Jun 17, 2011)

here you go - 

AFDS: The architecture of future AMD GPUs! - Graphics Cards - HardWare.fr


----------



## vickybat (Jun 17, 2011)

*This* will clear things up further.

This is the *translated* part.

Nice find, JAS. Well done.


----------



## Jaskanwar Singh (Jun 17, 2011)

thanks batman


----------



## mukherjee (Jun 18, 2011)

Thanks vicky!


----------



## Skud (Jun 18, 2011)

Nice find. And thanks Vicky for the Guru3D link. Already posted some links here:-

*www.thinkdigit.com/forum/graphic-c...ds-southern-islands-track-release-2011-a.html

And some update from Anandtech:-

AnandTech - AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute


----------



## vickybat (Jun 18, 2011)

^^ Nice links. The AnandTech link is very detailed; I'll go through it thoroughly.


----------



## Cilus (Jun 18, 2011)

Guys, here is my analysis of the new AMD GPU architecture. It took me more than 3 hours. Sorry it is a little lengthy; please have a little patience and finish the reading. God, I had to read my old computer architecture books and the architecture of the Cray supercomputer for it. Please share your valuable feedback.

_____________________________________________________________

Well, as per my understanding, the architecture of the new AMD GPU actually uses the same concept as the famous *Cell processor* currently used in the PS3. There we have a *PPE, or PowerPC Processing Element*, a simple in-order 64-bit microprocessor based on the PowerPC architecture, responsible for basic scheduling, branch prediction and allocating work to the lower units, plus a series of *SPEs, or Synergistic Processing Elements*, which are simple vector processors with their own local store but lacking the extra hardware for branch prediction and memory read/write. It is the duty of the PPE to schedule tasks among the SPEs and to handle reads/writes from main memory, using standard memory management techniques like paging, indexing and virtual memory.
SPEs are capable of very fast execution of instructions stored in their own local store (256 KB each), and they are interconnected by a 256-bit bus, so one SPE can access the data or instructions of another if required. These SPEs are basically vector units, and with a series of them the Cell processor can perform a vector operation over a whole set of data in a single cycle through parallel computing, compared to the one-by-one execution of a scalar unit or the instruction-level parallelism of current superscalar, out-of-order, heavily pipelined microarchitectures where all the hardware logic is present in a single die. This approach reduces the complexity of each element in Cell (the prediction and scheduling is done by the PPE, while the raw parallel processing power sits in the SPEs), resulting in a lower transistor count and enabling it to run at very high speed.
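The PPE/SPE split described above can be sketched as a toy Python model, with the control core doing only scheduling and the vector workers doing only arithmetic (all names here are illustrative, not a real Cell API):

```python
# Toy model of the Cell-style split: a control core ("PPE") partitions a job
# and hands chunks to simple vector workers ("SPEs"), which only crunch the
# data placed in their local store. Illustrative only, not a real Cell API.

def ppe_dispatch(data, num_spes):
    """PPE role: split the work into per-SPE chunks (scheduling only)."""
    chunk = (len(data) + num_spes - 1) // num_spes
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def spe_run(local_store):
    """SPE role: raw arithmetic on its local store, no control logic."""
    return [x * x for x in local_store]

def cell_square(data, num_spes=4):
    results = []
    for local in ppe_dispatch(data, num_spes):   # PPE schedules...
        results.extend(spe_run(local))           # ...SPEs compute
    return results

print(cell_square([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 4, 9, 16, 25, 36, 49, 64]
```

The point of the split is visible even in the toy: `ppe_dispatch` holds all the control logic, while `spe_run` is pure, branch-free number crunching.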

If you compare it to the coming generation of AMD's architecture, you will find a lot of similarity. The instruction fetch logic, control and decoding logic, branch prediction logic etc. are handled by the scalar unit, as these operations don't involve any vector work; they just use some algorithm to detect the next instruction to fetch or which branch a loop should take. One striking point is that the GPU memory controller uses x86-64 memory management, which is a revolution, as to date no dedicated GPU has used x86-64 memory management. Instead they use vector memory management: to perform a parallel operation on a set of data, they fetch the set of memory locations containing that data into registers and divide the work into blocks of instructions, which can be called a thread. So a thread is basically a group of related instructions required to perform a task.
Now comes the vector unit, consisting of 4 modules. These parts are similar to the SPEs of the Cell processor, with one difference: the whole block shares a single instruction and data cache. Again, each of the 4 vector modules is divided into geometry processing (vertex shader and pixel shader) and tessellation units.
The whole thing is considered a Compute Engine, and there are several compute engines available, each connected by an interconnect bus. Compute engines are not the same as VLIW and are far more independent, as each unit now has its own fetch, decode and branch logic handled by the SCALAR unit, vector processing units for parallel work, and a cache for storing instructions and data.

For programming GPUs, nVidia has offered its CUDA library for quite some time now. The nVidia architecture relies heavily upon TLP, or thread-level parallelism. For graphics processing a GPU works more like a SIMD model: for a single task, say transforming a set of pixels to their complement, the GPU can perform the operation in a single cycle by applying it over all the pixels simultaneously. For implementing TLP it relies heavily upon hardware logic and hardware branch prediction, which increases the hardware cost. The other disadvantage is that programmers don't have access to, or knowledge of, the internals of the GPU, which restricts how far you can optimize: whatever you implement through programming logic (multi-threading, a high degree of parallelism, rescheduling the execution order of instructions), you can't be sure everything will actually run inside the GPU the same way.
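The pixel-complement example above is the classic SIMD picture: one operation applied across a whole data set at once. A small sketch using NumPy, whose vectorized operations are a convenient CPU-side stand-in for a GPU applying one instruction to many elements:

```python
# SIMD vs scalar, using the pixel-complement example from the post.
# NumPy's vectorized ops play the role of the GPU's wide SIMD hardware.
import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# Scalar model: one pixel at a time, one "cycle" each.
complement_scalar = [255 - int(p) for p in pixels]

# SIMD model: a single operation over the whole set at once.
complement_simd = 255 - pixels

print(complement_simd.tolist())   # [255, 191, 127, 0]
```

Both produce the same result; the difference is that the SIMD form expresses the whole transformation as one instruction over the data set, which is exactly what the GPU hardware exploits.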

To date AMD has used a VLIW architecture. VLIW, on the other hand, relies heavily upon the software compiler for rescheduling, branching etc. The advantage is that it removes the need for costly hardware logic to do the above-mentioned tasks, resulting in reduced cost, smaller die size and lower power consumption.
The main problem with that architecture is that to maximize utilization of all 5/4 slots, the software compiler needs to be highly optimized to rearrange the code so that it creates blocks of instructions (threads) which are very loosely coupled or independent, which is not an easy task. The other issue is that the missing hardware logic inside the shader cores slows down processing, mainly when executing GPGPU code, which may contain complex instructions that need to be broken down into multiple simple instructions to execute in parallel. It is very hard in VLIW to move an instruction ahead at execution time, as there is no runtime scheduling logic present at the hardware level.
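The bundle-packing problem described above can be made concrete with a toy "compiler" that packs independent instructions into 4-slot bundles; any instruction depending on a result produced inside the current bundle forces a new, partly empty bundle (the instruction format is invented for illustration):

```python
# Toy static VLIW scheduler: the compiler packs instructions into fixed-width
# bundles ahead of time. Dependencies inside a bundle are illegal, so they
# force a bundle break and leave slots idle. Invented format, for illustration.

def pack_bundles(instrs, slots=4):
    """instrs: list of (name, deps) pairs; returns the packed bundles."""
    bundles, current = [], []
    for name, deps in instrs:
        names_in_current = {n for n, _ in current}
        # A full bundle, or a dependency on a result produced in the current
        # bundle, forces the compiler to close it and start a new one.
        if len(current) == slots or names_in_current & set(deps):
            bundles.append(current)
            current = []
        current.append((name, deps))
    if current:
        bundles.append(current)
    return bundles

prog = [("a", []), ("b", []), ("c", ["a"]), ("d", []), ("e", ["c"])]
print(len(pack_bundles(prog)))   # 3 bundles for 5 instructions: 7 of 12 slots idle
```

Five instructions end up spread over three 4-slot bundles, which is exactly the under-utilization problem the compiler has to fight in VLIW.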

The main problem with all current GPUs is memory management. As all of you are aware, even the highest-end GPUs hardly use all the memory allocated to them. The reason is that GPU memory management is completely different from CPU memory management. The main logic here is to group the data on which the same operation needs to be performed into contiguous memory locations and fetch the whole block at a time to operate over the whole set. For very complex and wide data this approach is not efficient.
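The contiguous-grouping point above can be illustrated by counting memory transactions for a hypothetical memory bus that fetches 4 aligned elements per transaction; scattered addresses blow up the transaction count:

```python
# Why GPUs want data in contiguous blocks: count the memory transactions
# needed to fetch a set of addresses, assuming a hypothetical bus that moves
# one aligned block of 4 elements per transaction.

def transactions(addresses, bus_width=4):
    """Each transaction fetches one aligned block of `bus_width` addresses."""
    blocks = {addr // bus_width for addr in addresses}
    return len(blocks)

contiguous = list(range(0, 16))       # 16 neighbouring elements
scattered = list(range(0, 64, 4))     # 16 elements strided far apart

print(transactions(contiguous))   # 4 transactions
print(transactions(scattered))    # 16 transactions
```

Same amount of useful data, four times the memory traffic once the layout stops being contiguous; this is the efficiency cliff the post describes for "complex and wide" data.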

*The new AMD architecture's advantage:* The Compute Engine, as described by AMD, is almost an independent processor, having its own fetch, decode, scalar and branch units for fetching, decoding and scheduling, and a 4-module vector unit for raw processing power through parallel execution.
The 1st advantage is moving from static, software-based compiler scheduling to hardware-based execution-time scheduling. With compiler scheduling, if it turns out at runtime that some rescheduling could avoid a cycle penalty or expose more parallelism, nothing can be done, as there is no runtime hardware scheduler and the whole execution follows a predefined path.

The 2nd thing is executing multiple groups, each containing a set of instructions. In the ideal scenario all of them are independent and can run in a single compute unit. But if there are dependencies among them, VLIW has to wait until the dependent block is completed and the result is available before it can start executing the next block. Here, much like Intel's HT, if block A is waiting for a resource in compute unit C1, C1 can fetch another block of instructions (thread) assigned to it and start processing that instead. This is called *Simultaneous Multi-Threading, or SMT*. One caveat: the instructions inside a particular block cannot be executed in that fashion; they need to be executed in order.
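The SMT behaviour described above can be sketched with a toy cycle count: alone, a stalled thread leaves the compute unit idle; with a second thread, the stall is hidden behind the other thread's work (the cycle numbers are made up):

```python
# Toy SMT cycle model. Each thread is a list of ops: 'c' = compute (1 cycle),
# 'w' = wait on a resource. Stall length and the perfect-overlap assumption
# are invented purely to show the idea.

def run_cycles(threads, stall=4):
    """Cycles to finish all threads on one compute unit."""
    if len(threads) == 1:
        # Single-threaded: every stall burns idle cycles.
        return sum(1 if op == "c" else stall for op in threads[0])
    # SMT-ish model: assume each stall is fully hidden behind the other
    # thread's compute work, so the unit is never idle unnecessarily.
    busy = sum(op == "c" for t in threads for op in t)
    stalls = sum(op == "w" for t in threads for op in t)
    return max(busy, stalls * stall)

a = ["c", "w", "c", "c"]            # thread with one stall
b = ["c", "c", "c", "c"]            # stall-free thread
print(run_cycles([a]))       # 7 cycles for thread A alone
print(run_cycles([a, b]))    # 7 cycles for both threads together
```

Two threads finish in the same 7 cycles that one thread needed alone, because the second thread's compute ops fill the slot the stall would have wasted.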

The 3rd thing is more efficient multi-threading performance. Each CU can now process a thread more efficiently due to the presence of the hardware units. VLIW can also multi-thread, but not as efficiently, due to the lack of hardware logic and its dependence on the compiler.

The 4th thing is writing code. As the scheduling task is no longer required of the compiler, the programmer can write much more dynamic code, specify the execution pattern and optimize code to run in parallel more easily, being sure that at execution time the hardware logic will take care of those things.

*The 5th and revolutionary thing is using x86-64 memory management.* This is a well-proven and highly efficient memory management scheme, and programmers can directly write highly multi-threaded, optimized application code, as most compilers know how to take advantage of x86 memory management. If you are designing highly parallel, multi-threaded code, x86 memory management can take care of it easily and make sure the execution path is what you intended, not something rescheduled by the compiler. It also greatly increases memory usage efficiency through the use of the standard virtual memory concept.
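The virtual memory concept mentioned above can be reduced to its core idea in a few lines: a page table maps a flat virtual address space onto arbitrary physical frames (page size and table contents here are invented):

```python
# Minimal virtual-to-physical translation, the core of the virtual memory
# idea behind x86-64 memory management. Page size and mappings are made up.

PAGE_SIZE = 4096
page_table = {0: 7, 1: 3, 2: 12}    # virtual page number -> physical frame

def translate(vaddr):
    """Map a virtual address to its physical address via the page table."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[vpage]       # a real MMU would raise a page fault on a miss
    return frame * PAGE_SIZE + offset

print(translate(5000))   # page 1, offset 904 -> 3*4096 + 904 = 13192
```

Because the program only ever sees the flat virtual side, the hardware is free to place, share or swap the physical pages however is most efficient, which is the memory-utilization win the post points at.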

6th is the scalar unit: the scalar unit is responsible for looping, branching, prediction etc. *But it can also execute completely independent single instructions, so those instructions don't need to be sent to the vector unit, saving valuable SIMD time.*

Another thing is communication between multiple CUs when they are executing different threads. If CU1, executing thread T1, requires some input from thread T2 being executed in CU2, CU2 can share that data through the L1 cache shared among all the CUs. This is somewhat similar to a *superscalar* architecture, where each CU is basically an execution unit and the superscalar core has multiple execution units under it. AMD's ACE, or Asynchronous Compute Engine, enables these features and gives the GPU *out-of-order execution* capability: if it is found that executing in a different order than the defined one can improve performance, the ACE can prioritize tasks and allow them to be executed in a different order than the one in which they were received.


----------



## Skud (Jun 18, 2011)

Still Hebrew for me, but at least your analysis is clearer to me than the online sites'.

Simply superb!


----------



## ico (Jun 19, 2011)

*My analysis:* This is Fermi done right, and better.

Why? Read my second analysis.

*Second analysis:* This takes us much closer to real 'fusion'.

The future beyond Trinity, Bulldozer + VLIW4, looks interesting. It has been an enormous year for AMD.


----------



## Skud (Jun 19, 2011)

Hopefully they can cap the year with some really nice products.


----------



## Cilus (Jun 19, 2011)

Ico, it is not that easy to implement general-purpose computing alongside gaming performance. What AMD has shown is actually only the general processing capability of their upcoming architecture, not how good it is going to be in gaming. An example is Fermi, which has better general processing capability than any AMD GPU but in gaming simply can't beat AMD, even with its superior architecture. It is very early to predict anything.


----------



## comp@ddict (Jun 19, 2011)

I saw a chart where even the high-end Bulldozer Enhanced (next-gen Bulldozer) will have DX11-capable GPUs.

So AMD will use GPU cores to enhance CPU functions wherever the GPU is more efficient? An amazing plan, I think; hats off to AMD and their Fusion concept.


----------



## Jaskanwar Singh (Jun 20, 2011)

Cilus, excellent writeup.


----------



## Cilus (Jun 20, 2011)

^^Thanks buddy. Let the other members have a look at it and share their opinions. BTW, I'm planning to write another one on OOO, or Out-Of-Order Execution.


----------



## Tenida (Jun 26, 2011)

Cilus Bhai, excellent work. Keep it up.


----------



## tkin (Jun 26, 2011)

@Cilus, wow, nice work there dude, and I read it, ALL of it. It's a very detailed analysis and lines up with pretty much what I learned in comp architecture and advanced OS this year. One thing is for sure: they are planning to decrease dependencies and hazards in the pipeline, with more advanced branch prediction like CPUs have, and also a multilevel shared cache.

Now while all of this looks good in theory, in reality the entire implementation will depend heavily on the application. Just like on PCs, the apps control the threading performance, and apps have to be explicitly written to use multi-core features. So my question is: since games are ported from consoles to PC, will any developer completely rewrite their game engine (this has to be done in the engine core) to use this feature? If AMD pulls the entire thing off in software I'll applaud, but it's almost impossible. Take for example: can anyone run a 2-threaded application in Windows across 4 threads without rewriting the core? I guess they are trying to achieve true fusion, a general-purpose core for both the CPU and the GPU together.

One thing is for sure, this will require a lot of research. Even Larrabee could not pull it off (and Intel puts more money into research than AMD makes). I'll say 2015 at least?

But if done right, we could finally see real-time ray tracing in games (and one more step towards achieving true photorealism; then photon mapping and we are set).


----------



## Cilus (Jun 27, 2011)

Tkin, actually AMD is moving away from software dependence to hardware branch prediction. The VLIW architecture is all about software compiler scheduling and relies completely upon the application's code optimization and AMD's driver optimization. Although it is good in gaming, in GPGPU processing VLIW is behind nVidia's TLP-based architecture.
That's why they are changing the architecture to scalar + vector units with hardware prediction logic. As a result, even if the code path is not optimized at the software level, at runtime the hardware can reschedule the execution path to yield better performance. Even if the code path is optimized, hardware prediction can still find a better execution path while executing the code, enhancing performance, a feature completely missing in the VLIW 5/4 design. But they need to make sure that while increasing general processing performance, gaming performance is not hampered, if not increased.


----------



## tkin (Jun 27, 2011)

Cilus said:


> Tkin, actually AMD is moving away from software dependence to hardware branch prediction. The VLIW architecture is all about software compiler scheduling and relies completely upon the application's code optimization and AMD's driver optimization. Although it is good in gaming, in GPGPU processing VLIW is behind nVidia's TLP-based architecture.
> That's why they are changing the architecture to scalar + vector units with hardware prediction logic. As a result, even if the code path is not optimized at the software level, at runtime the hardware can reschedule the execution path to yield better performance. Even if the code path is optimized, hardware prediction can still find a better execution path while executing the code, enhancing performance, a feature completely missing in the VLIW 5/4 design. But they need to make sure that while increasing general processing performance, gaming performance is not hampered, if not increased.


This is slowly starting to look like Fermi (better, actually). If they can reschedule at the hardware level that would be great (a true CPU), but that would complicate the logic and increase cost as well as power consumption. Maybe at 22nm.


----------



## comp@ddict (Jun 27, 2011)

^ *20nm*, correction.


----------



## rchi84 (Jun 27, 2011)

Now, for the hard part: can AMD convert it all into a decently performing package?

I remember when the details of the 29xx series came out. It sounded like something from the future, with 512-bit memory, a tessellator and the works. It didn't translate all that well into performance for two generations, until the 4xxx launch.


----------



## comp@ddict (Jun 27, 2011)

Well, Fermi was the same, and it bombed until fixed.

But AMD won't make that mistake. They always make a mainstream GPU to compete at the high end, a.k.a. the HD 4870, HD 5850 and, in the future, most probably the HD 7950.


----------



## ico (Jun 27, 2011)

well, I wouldn't really want it to be completely like Fermi.

See, the methodologies of AMD and nVidia at the moment are completely different. AMD uses VLIW, which is instruction-level parallelism (shaders are grouped in fives or fours and work together) and requires optimization through the compiler to get the best performance. nVidia uses thread-level parallelism, which doesn't really require much optimization at the compiler level. Now, the point is, VLIW utilized correctly is better than nVidia's approach, but with VLIW under-utilization is a huge issue; nVidia's approach is easier, to be precise. That's why AMD cut down from VLIW5 to VLIW4, hugely reducing the transistor count while still managing a minor gain in performance.
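The VLIW5-to-VLIW4 trade-off above is simple arithmetic: if typical shader code fills only about 3.4 slots per bundle on average (an often-quoted ballpark, assumed here purely for illustration), a narrower bundle wastes fewer slots:

```python
# Slot utilization for VLIW5 vs VLIW4, using an assumed average of 3.4
# filled slots per bundle for typical shader workloads (illustrative figure).

avg_filled = 3.4
util = {width: avg_filled / width for width in (5, 4)}

for width, u in util.items():
    print(f"VLIW{width}: {u:.0%} of slots doing useful work")
# VLIW5: 68% of slots doing useful work
# VLIW4: 85% of slots doing useful work
```

Under that assumption, dropping the rarely used fifth slot trades almost no issue width for a large jump in how much of the hardware is actually busy.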

Now why does AMD use VLIW? VLIW shaders are very small (compared to nVidia's), so AMD can pack in as many as they want and clock them high while keeping thermals/power in check. nVidia had huge worries with Fermi on the GTX 465/470/480 and only got it right with the GTX 460 and the GTX 500 series.

Hoping AMD won't have the same trouble.


----------



## tkin (Jun 27, 2011)

ico said:


> well, I wouldn't really want it to be completely like Fermi.
> 
> See, the methodologies of AMD and nVidia at the moment are completely different. AMD uses VLIW, which is instruction-level parallelism (shaders are grouped in fives or fours and work together) and requires optimization through the compiler to get the best performance. nVidia uses thread-level parallelism, which doesn't really require much optimization at the compiler level. Now, the point is, VLIW utilized correctly is better than nVidia's approach, but with VLIW under-utilization is a huge issue; nVidia's approach is easier, to be precise. That's why AMD cut down from VLIW5 to VLIW4, hugely reducing the transistor count while still managing a minor gain in performance.
> 
> Now why does AMD use VLIW? VLIW shaders are very small (compared to nVidia's), so AMD can pack in as many as they want and clock them high while keeping thermals/power in check. nVidia had huge worries with Fermi on the GTX 465/470/480 and only got it right with the GTX 460 and the GTX 500 series.


The only problem is that compiler scheduling has its limits; the performance gain is not linear with the increase in shader count. Also, with the tessellation engine separated from the shaders there may be a bottleneck, something Fermi overcame. One thing I'd like to see is for them to keep producing high-performance GPUs. nVidia is moving to mobiles (lucrative, I'd say) and AMD is slowly moving to Fusion; I hope this does not kill the GPU market.


----------



## ico (Jun 27, 2011)

tkin said:


> performance gain is not linear with the increase in shader count


This isn't really of much concern, as AMD's current approach and nVidia's approach are different. If AMD were following nVidia's approach and it wasn't linear, then you'd have a point. AFAIK it is linear in AMD's case too, if you don't look at VLIW5 and 4 at once.


----------



## ico (Jul 25, 2011)

HD 7000 will come at least 4-5 months before Kepler arrives. Southern Islands taped out in February whereas Kepler taped out in July.



			
AMD said:

> We also passed several critical milestones in the second quarter as we prepare our next-generation 28-nanometer graphics family. We have working silicon in-house and remain on track to deliver the first members of what we expect will be another industry-leading GPU family to market later this year. We expect to be at the forefront of the GPU industry's transition to 28-nanometer.


Coming soon.


----------



## Piyush (Jul 25, 2011)

this is gonna be big


----------



## Skud (Jul 25, 2011)

Nice update, ico.

So BD and HD 7000 and 990FX - what name is AMD gonna give to this platform?


----------



## vickybat (Jul 25, 2011)

The next-gen GPUs are really going to be something. Though nVidia is silent about its Kepler architecture, AMD has gone ahead and shown its compute-engine-based architecture and has done away with the older VLIW-based designs. The architecture looks promising and actually has x86-64-based computational abilities as well as shader processing, very similar to a Cell processor.

But the real question was left open, i.e. will it be good enough to render lifelike in-game graphics and be a step above the current crop of GPUs?

The answer is YES, and AMD was again the first to disclose this and what we can expect from next-gen GPUs. According to AMD, the next-gen Xbox, termed the Xbox 720, will have the ability to render in-game graphics just like the movie "AVATAR".

That's right, and the microprocessor giant also said that the A.I. and physics capabilities of the next-gen hardware will allow every pedestrian in a game such as Grand Theft Auto to have a 'totally individual mentality,' meaning no more mob mentality.

We all know that AMD makes the GPU for the Xbox, and this time it might get the latest 28nm compute-engine-based GPUs. So I think it's a clear indication of what we might expect on the PC side as well.

*Source*


----------

