# Reasons for Bulldozer's Poor Performance on Windows 7



## Cilus (May 7, 2012)

Well, after all the recent interest shown here in the reasons for Bulldozer's poor performance, I thought it was time for a detailed explanation. In this article I have only discussed the Windows 7 scheduling problems with Bulldozer modules, but there are other reasons too. One can be traced to foul play by Intel, which dropped support for advanced instruction sets like FMA4 and XOP, but those are kept for another article. Hope you'll like it.

*A Little Detail about the Module in Bulldozer:*
As you guys know, the Bulldozer architecture is actually a module-based architecture where each module consists of two discrete integer cores, one dual-issue (two instructions can be issued in a single clock cycle) floating point execution core, a dual-issue fetch/decode logic unit and a shared L2 cache. The L1 cache arrangement is unique here: both cores share the L1 instruction cache, but each integer core inside a module has its own L1 data cache.
This approach is used to utilize both cores of a module even in a lightly threaded environment. As they share the L1 instruction cache, *both of them can simultaneously fetch and execute instructions of a single thread currently present in the cache. This improves resource utilization since you don't need two threads to keep the two cores busy; two cores can work on a single thread.* Although a single floating point unit might not look so impressive, empirical studies have shown that most of the instructions assigned to a CPU core in a given time slice are actually integer instructions. So AMD tried to shrink the die size and power consumption by duplicating only the most-used units rather than doubling everything.

*Similarities and Dissimilarities with SMT or HT:*
This module design has a lot of similarities with Intel's Hyper-Threading design, where each physical core is able to process instructions from two different threads simultaneously (not by switching, as explained in most reviews). But there are plenty of dissimilarities too.

The basic concept of HT is this: if a single core can process four instructions in a single clock cycle (by overlapping their execution in the pipeline) but fewer than 4 instructions, say 2, are available from one thread at clock cycle C1, it can take another two instructions from a second thread, if available, to process in the same cycle C1. As a result all the CPU resources are utilized to their fullest capability. But remember, an HT-enabled core doesn't have two separate execution units; the two threads just share different portions of the same core's execution unit. For example, while the decode unit is decoding the 1st instruction, the fetch unit can start fetching the 2nd instruction from the instruction queue.
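The slot-filling idea can be sketched in a toy model (my own simplification in Python; the `WIDTH` constant and `issue` function are illustrative, not how the hardware is actually organized):

```python
# Toy model of SMT issue-slot filling: a 4-wide core lets a second
# thread use whatever issue slots the first thread leaves empty.
WIDTH = 4

def issue(ready_t1, ready_t2):
    """ready_tN: instructions thread N has ready this cycle.
    Returns (issued_from_t1, issued_from_t2)."""
    from_t1 = min(ready_t1, WIDTH)
    from_t2 = min(ready_t2, WIDTH - from_t1)  # leftover slots only
    return from_t1, from_t2

print(issue(2, 3))  # (2, 2): T2 fills the two slots T1 left empty
```

With HT disabled (or only one runnable thread), the two leftover slots in that cycle simply go unused.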

So Intel Hyper-Threading basically exploits instruction-level parallelism (ILP), whereas a Bulldozer module has two integer cores, each of which can execute one thread at a time, so two threads run simultaneously, hence a thread-level parallelism (TLP) improvement. Each integer core can also use its pipeline to overlap the execution of multiple instructions of a single thread.
HT won't help if only one thread is running on a physical core. On the other hand, each core of a Bulldozer module can work on the instructions of a single thread and execute multiple instructions at a time, because of the presence of two physical integer cores inside a module. Here all the instructions present in the thread are accessible to the whole module.

*Turbo Core 2.0:*
It is like Intel Turbo Boost but with some extra functionality. Unlike Intel, where the dynamic clock frequency change is per core, in Bulldozer it is per module. So the two cores of a module cannot run at different speeds, but one module can run at a different speed than the other modules on the processor die. Bulldozer can also cut power completely from unused modules, resulting in better power management. And here two physical integer cores of a module run at the boosted speed, not a single physical core as on an Intel processor.

*Windows 7 performance issues with Bulldozer architecture:*
After reading all of the above details, it looks like the Bulldozer design has solid performance potential, but it actually fails to deliver in the current Windows 7 environment and is plagued by all the problems it promised to improve on: low instruction-level parallelism performance, not-so-good multi-threaded performance, high power consumption, etc. Let's discuss the root causes of the issues.

Before starting the explanation, let's be clear about our assumptions:
Consider a dual-module Bulldozer processor, say the FX-4100. We have two modules here, M1 and M2. M1 has two integer cores, say C1 and C2, and one floating point unit, say F1. Similarly, M2 has two integer cores, C3 and C4, and F2 as its floating point unit.
To Windows 7 there is no M1 or M2; it sees four completely independent cores: C1, C2, C3 and C4. It is also not aware that there are only two floating point execution units instead of four. Here is a tabular representation:



| *Cores* | *Int Core 1* | *Int Core 2* | *FPU* | *Ideal View* |
|---|---|---|---|---|
| *Module 1* | C1 | C2 | F1 | M1 (C1, C2, F1) |
| *Module 2* | C3 | C4 | F2 | M2 (C3, C4, F2) |
| *Windows 7 View* | C1, C2, C3 and C4 as four separate cores; F1 and F2 not visible | | | |

*Multi-Threaded Performance with Two Independent Threads:*
Suppose we have two independent threads, T1 and T2, running concurrently at some point in time.

*Ideal Case:* The ideal scheduling here is for T1 to be assigned to M1 and T2 to M2. Then C1 and C2 can both operate on the integer instructions of T1, and F1 will handle the floating point instructions of T1. The same is true for T2, with C3, C4 and F2 operating simultaneously on the integer and floating point instructions of T2. As a result the execution of T1 and T2 will be very fast, as they utilize a very large amount of CPU resources.

*Windows 7 Scheduling:* As Win 7 has no idea about M1 and M2, it might assign T1 and T2 to C1 and C2 respectively, i.e. to the two cores of the same module M1. Now T1 and T2 are actually fighting for CPU resources: C1 works only on T1, C2 only on T2, and both T1 and T2 share the single FPU F1, while module M2 just sits idle. So this actually decelerates execution and gives very poor CPU resource utilization.

*Performance with Multiple Dependent Threads:*
Consider two threads T1 and T2 that are highly related, or tightly coupled. Execution of T2 needs 90% of the instructions of T1 to be completed, and they also share the data space in which they operate. For simplicity, imagine T1 has instructions to calculate the areas of different geometric figures and T2 has the instructions to calculate the price of covering those figures with iron sheet.

*Ideal Case:* Now C1 and C2 of M1 actually share data and instructions, and they can communicate with each other very fast as they sit inside a single module. So T1 and T2 should be assigned to C1 and C2 respectively. While C2 executes whatever independent instructions are present in T2, C1 can finish executing the instructions in T1 that T2 requires. As a result there will be no pause or stall in execution: by the time C2 starts executing the dependent instructions of T2, the relevant parts of T1 have already been completed by C1 and made available to C2.
It also reduces the number of memory accesses, as T1 and T2 share a common data space. Consider two instructions, C = 2^5 + 10 from T1 and E = C + 5 from T2. After calculating the value of C, the value of E can be calculated directly inside the CPU, since both threads run inside a single module. So you basically save one memory write (writing C back to RAM) and one read (reading the value of C while calculating E).

*Windows 7 Scheduling:* As Windows 7 is not aware of M1 and M2, it might assign the interdependent threads T1 and T2 to cores of different modules, for example T1 to C1 (of module M1) and T2 to C4 (of module M2). As a result C4 needs to wait until C1 finishes executing T1 and saves the results to main memory. So: poor parallel processing and more memory accesses, resulting in slower execution.

*Turbo Core Performance in an Interdependent Multi-threaded Environment:* Just look at the scenario mentioned above, i.e. case 2.

*Ideal Case:* As the two interdependent threads T1 and T2 have been assigned to cores C1 and C2 of module M1, the power-management logic can completely cut power from module M2 and increase the clock frequency of M1, increasing execution speed while reducing power consumption.

*Windows 7 Scheduling:* As discussed in case 2, Windows 7 may assign the dependent threads to cores of different modules rather than to the cores of a single module. As a result, at any particular time both M1 and M2 are running but neither is being utilized properly. So power cannot be cut from either module, preventing Turbo Core from being activated.

*Turbo Core Performance in a Single-Threaded Environment:*
Suppose we are working in a single-threaded environment where, apart from the Windows processes, one heavy thread has priority and is operating on a large set of data. This kind of thread needs multiple iterations of CPU time slices to execute completely (in each iteration the thread is allocated a time slice on the CPU, after which it has to release the CPU for other tasks). In each iteration some of the instructions get executed or partially executed and the values are saved to main memory. In the next iteration the thread resumes execution from the point where it released the CPU in the previous iteration. In our case, consider T1 to be the large single thread operating on a large data set.

*Ideal Case:* Here T1 should be allocated to only one module, say M1, in each iteration, so that the other module, M2, can be turned off and M1 can hit the maximum Turbo Core frequency, speeding up single-threaded execution significantly. *So if T1 needs 10 iterations to complete, it should be assigned to module M1 in all 10 iterations for Turbo Core to stay active long enough to produce a noticeable performance improvement.*

*Windows 7 Scheduling:* As Windows 7 is not aware of modules M1 and M2 and only sees C1, C2, C3 and C4 as four independent cores, each iteration of T1 may be assigned to any of the four cores. Consider the case where T1 is assigned to C1 (of module M1) in the 1st iteration, to C3 (of module M2) in the 2nd, to C2 (of module M1) in the 3rd, and so on. As a result all the modules stay busy during the execution of thread T1 without any performance improvement, which stops Turbo Core from being activated. It also indirectly increases power consumption, as no module is ever turned off.
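As a toy illustration of this (the core-to-module map is from the example above; the model itself is mine, not how Windows tracks cores), counting which modules a thread's iterations touch shows why core hopping keeps both modules awake:

```python
# Toy model: map each core to its module, then check which modules
# must stay powered on given where a thread's iterations were scheduled.
MODULE_OF = {"C1": "M1", "C2": "M1", "C3": "M2", "C4": "M2"}

def modules_touched(cores_per_iteration):
    """Return the set of modules that must remain powered on."""
    return {MODULE_OF[core] for core in cores_per_iteration}

# Windows 7 style: iterations of T1 hop across cores of both modules.
hopping = modules_touched(["C1", "C3", "C2", "C4"])
# Module-aware style: every iteration lands on module M1's cores.
pinned = modules_touched(["C1", "C2", "C1", "C2"])

print(len(hopping))  # 2 -> neither module can be powered down, no Turbo Core
print(len(pinned))   # 1 -> M2 can be switched off, M1 can hit max Turbo
```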

*WHAT SHOULD BE DONE IN WINDOWS 8*
Bulldozer modules should be treated more like SMT/HT-enabled cores, with some extra tweaks. The OS needs to understand that one Bulldozer module is not a complete dual core, but also that it has more CPU resources available than an HT-enabled core.

Suppose we have 4 threads, T1, T2, T3 and T4, waiting in the thread queue to be picked up by the processor. Among them, T1 and T2 are loosely coupled, i.e. they have very little dependency between them. T3 and T4 are dependent threads: T3 depends on T1, and T4 depends on T2.
How should the scheduler assign these threads to a dual-module (M1 and M2) quad-core processor (cores C1, C2, C3 and C4)?
First it should pick the most independent threads from the thread queue and fetch T1 and T2. Then the OS should assign T1 to C1 of module M1 and T2 to C3 of module M2, so that each thread gets two integer cores and one FPU for processing. The reason is explained in case 1.
Next it should fetch the remaining two threads, T3 and T4, and analyze their level of dependency on other threads. As T3 depends on T1, it should be assigned to C2 of module M1. Similarly, T4 should be assigned to C4 of module M2. The reasons are explained in case 2.

After this scheduling, M1 has T1 and T3, two dependent threads, and M2 has T2 and T4, again two dependent threads. Now the two modules work on data that is independent of each other, and inside each module we have two dependent threads that can share data and instructions, resulting in faster execution.
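The two-step policy just described can be sketched roughly as follows (a hypothetical illustration in Python; the module map, thread names and the `deps` structure are my own, not any real Windows 8 API):

```python
# Hypothetical module-aware scheduler sketch for the dual-module example.
MODULES = {"M1": ["C1", "C2"], "M2": ["C3", "C4"]}

def schedule(threads, deps):
    """threads: queue order; deps: {dependent: thread it depends on}.
    Returns {thread: (module, core)}. Assumes one independent thread
    per module and one dependent partner each (enough for the sketch)."""
    assignment = {}
    free_modules = list(MODULES)
    # Step 1: spread independent threads across different modules, so
    # each gets two integer cores and a whole FPU to itself (case 1).
    for t in [t for t in threads if t not in deps]:
        m = free_modules.pop(0)
        assignment[t] = (m, MODULES[m][0])
    # Step 2: put each dependent thread on the second core of the
    # module already running the thread it depends on (case 2).
    for t in [t for t in threads if t in deps]:
        m = assignment[deps[t]][0]
        assignment[t] = (m, MODULES[m][1])
    return assignment

plan = schedule(["T1", "T2", "T3", "T4"], {"T3": "T1", "T4": "T2"})
print(plan)  # T1/T3 end up sharing M1, T2/T4 sharing M2
```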

Suppose module M2 finishes executing T2 and T4 before T1 and T3 finish on module M1. As the dependent threads T1 and T3 have been placed inside the single module M1, M1 does not need to wait for any other module (here M2) to send data to it. So the CPU logic can completely cut power to module M2 and enable Turbo Core to increase the speed of module M1. As a result T1 and T3 will execute far faster.


----------



## RiGOD (May 7, 2012)

^^Oh man, this is heartbreaking. Such a promising architecture, and see how Windows 7 ruined the party. BTW, I read this stuff on a discussion forum:



Spoiler



There is most definitely a Windows 7 AMD FX software patch in the works. *By most estimates the AMD Bulldozer FX is underperforming by 40-70% in most Windows 7 benchmarks.* By forcing Windows 7 to recognize 8 CPU cores, a huge performance hit has happened. The Bulldozer FX-8xxx design really isn't 8 cores; it's a 4-core CPU with an extra integer pipeline on each core. If the FX-8xxx series scales according to the 4- and 6-core Bulldozer designs, then there is a serious bug in Windows 7 that is crippling the FX-8150's performance.

*The one thing that is for-sure here is that every hardware review website rushed to be the first to publish an AMD FX-8150 review, they all used the same generic benchmarks and NONE did any real world computing*. The game is fixed, the big-dog spreads around the most ad-dollars.



Any truth in the second para?


----------



## fz8975 (May 7, 2012)

nice info thx
where did you get that from ??


----------



## amjath (May 7, 2012)

so there is still hope for better performance because of those patches.

BTW can we lower power consumption because of these patches? [noob question!!! ]


----------



## RiGOD (May 7, 2012)

^^You missed the last para it seems.



Spoiler



Suppose module M2 finishes executing T2 and T4 before T1 and T3 finish on module M1. As the dependent threads T1 and T3 have been placed inside the single module M1, M1 does not need to wait for any other module (here M2) to send data to it. *So the CPU logic can completely cut power to module M2* and enable Turbo Core to increase the speed of module M1. As a result T1 and T3 will execute far faster.


----------



## Cilus (May 7, 2012)

> nice info thx
> where did you get that from ??



I am a computer engineer and have some understanding of how a CPU works. The content of the article has been gathered from the in-depth architecture details that were published (not only the module/core thing, but all the other details too), the Windows 7 problems described in various articles, and my personal understanding.

amjath, the hotfixes for Bulldozer in Windows 7 didn't work well; here we need a complete revisit, and patches can't solve the problems. If Windows 8 works properly with BD modules then yes, there are chances of reduced power consumption. But that's an educated guess based on the various analyses; I can't guarantee it.

RiGOD, what you have posted is actually reflected in my article, but I can't say by how much Win 7 is holding BD back. Could you give me the link where you got that info? Let me have a look at it.


----------



## amjath (May 7, 2012)

i did not understand on first read. I get it now after 5 reads  :


----------



## gopi_vbboy (May 7, 2012)

I have fx 6100


----------



## Cilus (May 7, 2012)

^^ Then I guess you're waiting like hell for the arrival of Windows 8


----------



## gopi_vbboy (May 7, 2012)

^^ No, but I bought it for gaming 2 months back and am planning to get a GPU to have a decent rig. Now this analysis is a shock.


----------



## Cilus (May 7, 2012)

FX 6100 is pretty much okay for your need, just add a good GPU and there won't be any problem.


----------



## amjath (May 7, 2012)

my friend would be very happy seeing this thread cause he owns a Bulldozer. He was pretty sad on seeing the bad performance on the Windows 8 CP.

So are you sure it'll be okay in Windows 8?

*off topic*
BTW how is the Sabertooth performing? He is looking for an upgrade.


----------



## RiGOD (May 7, 2012)

@Cilus : Here you go buddy

Source 1
Source 2
Source 3

BTW I've got a question. Windows 7 was released way back in 2009 and BD in 2011; I mean, didn't AMD have enough time to realise that such a widely used OS would respond to their architecture in this manner? Or did they see it coming and run into trouble even after knowing it? I read somewhere that development of BD started in 2007 (not sure how legit that is); is that the reason?



amjath said:


> my friend would be very happy seeing this thread cause he own a bulldozer. He was pretty sad on seeing a bad performance on Windows 8 CP.
> 
> So r u sure it ll be okay in Windows 8



Windows is just a part of the story



Spoiler



Well, after all the recent interest shown here in the reasons for Bulldozer's poor performance, I thought it was time for a detailed explanation. In this article I have only discussed the Windows 7 scheduling problems with Bulldozer modules, *but there are other reasons too*. One can be traced to *foul play by Intel, which dropped support for advanced instruction sets like FMA4 and XOP*, but those are kept for another article. Hope you'll like it.


----------



## Cilus (May 8, 2012)

Actually the performance of a CPU doesn't follow any pre-defined rules; instead it follows empirical patterns, i.e., patterns or trends deduced from the analysis of statistical data. Since BD is a completely new design, it had no statistical data available. There are certain issues on AMD's side too. They relied too much on feedback from peers who tested BD, and used machines to optimize the design rather than human input.
That's the reason AMD terminated a lot of peers after BD's poor performance, because of the wrong info provided by them.


----------



## AcceleratorX (May 8, 2012)

AFAIK the Windows 7 "core parking" feature also plays a role in hindering Bulldozer's performance, effectively disabling parts of the pipeline when it shouldn't.

This possibly means Bulldozer may perform better under Vista, which doesn't have this feature (not sure about XP, since that kernel wasn't optimized for multi-core in general). I'm not sure if anyone has run a test to see whether this theory is true.


----------



## zyberon (May 8, 2012)

so is FX a bad buy for a Windows PC??


----------



## Tech_Wiz (May 8, 2012)

Windows people should release an SP2 with the Bulldozer thing fixed. i5-2500K at 12k and FX-4100 at 6k.
If a patch can improve performance by 30%+ then BD will be an instant hit.


----------



## gopi_vbboy (May 8, 2012)

Why didn't AMD consider these things before releasing Bulldozer? Hope at least now AMD engineers will coordinate with Microsoft to fix it in an SP2.

@Cilus, so does Linux have any upper hand in performance compared with Win7?


----------



## RiGOD (May 8, 2012)







Cilus said:


> Actually the performance of a CPU doesn't follow any pre-defined rules; instead it follows empirical patterns, i.e., patterns or trends deduced from the analysis of statistical data. Since BD is a completely new design, it had no statistical data available. There are certain issues on AMD's side too. *They relied too much on feedback from peers who tested BD*, and used machines to optimize the design rather than human input.
> That's the reason AMD terminated a lot of peers after BD's poor performance, because of the wrong info provided by them.






Suicidal, I'd say. Worst way ever of optimising the performance of a much-awaited and highly hyped product.

@gopi_vbboy : Check this post.


----------



## tarey_g (May 8, 2012)

Bought a 1090T for 8600/- for a friend last week and saved money. It was hard to find in Pune; everyone is stocking Bulldozer these days.
A win for people who read up when a new architecture is released.


----------



## amjath (May 8, 2012)

gopi_vbboy said:


> Why doesn't AMD consider these things before releasing bulldozer.Hope Atleast now AMD engineers should coordinate with Microsoft to fix it on SP2.
> 
> @Cilus, So linux has any upper hand on performance compared with Win7 ?



[Phoronix] AMD FX-8150 Bulldozer On Ubuntu Linux Review


----------



## vickybat (May 8, 2012)

Read the article fully and it explains everything. So the things to look out for are efficient use of Turbo Core and handling threads with dependent and independent instructions accordingly.

So it all boils down to one thing: the OS should be aware of Bulldozer's modular design in order to assign dependent and independent threads properly. As explained above, dependent threads get a single module so they can share instructions in a common data space, whereas independent threads should be assigned to different modules rather than different execution units, so they get to utilize execution resources to the fullest.

Doing so will solve the Turbo Core issues as well. Even without looking deep into how capable the Bulldozer execution units (integer and float) are, a proper view by the OS will definitely help a lot in efficient resource utilization and thus increase performance.

When I first read about Bulldozer's architecture in several articles, it really sounded impressive and a serious break from conventional designs. But it was plagued by the current OS's view of it. Let's hope Win 8 finally recognizes Bulldozer's modular design and puts it to proper use by allocating threads accordingly, as explained in the above article by *cilus*.

Great article and a real simple read, buddy.


----------



## RiGOD (May 8, 2012)

And don't forget that the review sites made it even worse for buyers by bashing the BDs by every means possible, without even mentioning the OS's inability to utilise the advanced architecture to the fullest.

Even the most reputed technology discussion forums were flooded with Intel fanboys and AMD haters (even inactive members came outta nowhere just to bash) passing noob comments about BDs. But check the user reviews of the FX-8120 at Newegg, man; it has a 5-egg rating and almost every customer is satisfied with the performance even with the OS glitches.

Anyway, I'll say one thing: if Windows 8 makes way for flawless functioning of the architecture and results in at least 20% improved performance and reduced load power consumption, the price/performance ratio will make it a killer VFM product and a worthy opponent to the current hot seller.

BTW check this.


----------



## gopi_vbboy (May 8, 2012)

Does Turbo Core activation cause more heat/power consumption?


----------



## Cilus (May 8, 2012)

amjath said:


> [Phoronix] AMD FX-8150 Bulldozer On Ubuntu Linux Review



Just finished the review; really a good one. It clearly shows that even without any optimization pack, the BD FX-8150 is ahead of the 2500K in all multi-threaded benchmarks. With the optimization packs, I think Linux users have a clear and big advantage when using Bulldozer modules.


----------



## gopi_vbboy (May 8, 2012)

Cilus said:


> Just finished the review; really a good one. It clearly shows that even without any optimization pack, the BD FX-8150 is ahead of the 2500K in all multi-threaded benchmarks. With the optimization packs, I think Linux users have a clear and big advantage when using Bulldozer modules.



Thanks for reading and letting us know that.


----------



## amjath (May 8, 2012)

I recommended Bulldozer to my friend who is an architect and uses multi-threaded applications: Revit Architecture, Photoshop and 3ds Max, all running at the same time.


----------



## sukesh1090 (May 8, 2012)

@cilus,
very good article bro. Keep it up. Rep added.
BTW, if I am right, there are some mistakes on AMD's side too behind this poor performance and high power consumption, and the big problem is that even PD or the other successors of BD can't fix some of the problems. To fix them they would have to redevelop the whole architecture, which is not a choice AMD is considering right now. So with PD and the others we can only expect some of the problems to get fixed.


----------



## Cilus (May 8, 2012)

No man, for sorting out power consumption a lot of techniques have been implemented in Piledriver, like the resonant clock mesh for running at higher frequency; you don't need the whole architecture to be changed.

Look at the original Phenom and Phenom II architectures: Phenom was an absolute disaster for AMD, but Phenom II is, I think, the best VFM processor till date.


----------



## RiGOD (May 8, 2012)

^^Seems like every time AMD introduces a new processor it fails, and its revision is awesome: like Phenom II after Phenom, and PD (let's wait and see) after BD.


----------



## Cilus (May 8, 2012)

It is because, although they have considerably less manpower as well as money compared to Intel, they have always tried to deliver something out of the box, which actually helps the CPU industry move forward. Sometimes it's an instant hit (like Athlon 64, the 1st x86-64 CPU), sometimes a failure like the original Phenom.

Below are a few of the innovations introduced by AMD, and you'll get a clear picture of how important they were:

- They were the 1st company to provide separate L1 data and instruction caches, reducing cache contention.
- They were the 1st company to produce an on-die memory controller, with Athlon 64. Intel only got that technique with their Nehalem processors, more than 4 years later.
- They were the ones to introduce the 1st dual-core (Athlon X2) and quad-core (Phenom) processors where all the cores are implemented inside a single die. Intel's C2Q is just two Core 2 Duo processors packed inside a single package.
- They are the pioneers of *serial, point-to-point linking* between the CPU and memory instead of the old FSB. A 3 GHz P4 used to run on a 533 MHz FSB... With Athlon 64, using HyperTransport 1.0, AMD was able to provide about 6 times the memory bandwidth. Intel only got that with Nehalem; Intel's QPI (QuickPath Interconnect) is essentially a modified version of HyperTransport.
- Also don't forget the AMD Llano APU, where the whole southbridge, including the PCI-E bus and USB 3.0/2.0, is fused inside the CPU die. Intel only got this capability with Ivy Bridge.


----------



## Omi (May 8, 2012)

Many people just label AMD products as cheap, backward, underperforming, etc. But we have to see the importance of AMD in the market.
It is the sole company giving competition to two of the top companies in their respective fields. *It is the company giving us choice*, keeping prices under control and giving speed to innovation. Competition is the key to innovation in the corporate world.

Hats off to AMD's effort; it's very hard to do something new when you don't have billions lying around and things are at stake. They have been doing it quite successfully.

@Cilus very nice article, so simple and lucid.

Windows 7 and the Intel-dominated compilers are not the only ones to blame here. AMD went a bit ahead of its time; single-threaded performance is not something that can be ignored. And despite having 8 physical cores, it is not that much ahead of the i5-2500K even in heavy multitasking apps. The potential is there, but they need to do a lot more.


----------



## sukesh1090 (May 8, 2012)

Cilus said:


> No man, for sorting out power consumption a lot of techniques have been implemented in Piledriver, like the resonant clock mesh for running at higher frequency; you don't need the whole architecture to be changed.
> 
> Look at the original Phenom and Phenom II architectures: Phenom was an absolute disaster for AMD, but Phenom II is, I think, the best VFM processor till date.



hope you are right. Maybe that is the reason AMD postponed PD by a year. About the VFM, yeah, you are absolutely right; that's the reason I bought a 955.



Cilus said:


> It is because although they have considerably lesser man power as well as money power when compared to Intel, they always tried to deliver something out of the box, which actually helps the CP industry to move forward. Sometimes they are instant hit (Like Athlon 64, the 1st X86-64 CPU), sometimes they are failure like the Phenom.
> 
> Find below the couple of innovations which are introduced by AMD and you'll get a clear picture that how important they were:-
> 
> ...



hey bro, you left BD off that list. The new module design, to me, looks like the future of processors; and look at Turbo Core and its consistency. BD's turbo speed is maintained without any drops, but Intel's turbo has more peaks and valleys than the Himalayas; it never stays in place for more than a couple of minutes, it always fluctuates.
BTW, aren't they the first company to introduce 64-bit technology, with Athlon 64? Some OSes still label 64-bit as "amd64", like Mac and even some Linux distributions.



> Many people just Label AMD products as cheap and backward, under performing etc etc. But We have to see the Importance of AMD in the market.
> It is the Sole company that is giving competition to 2 of the top companies in their fields. It is the company giving us choice, keeping prices under control and giving speed to innovation. Competition is the key to innovation in the Corporate world.


AMD is more than that, buddy. It has given us some of the best processors and technologies, without which we would still be in the stone age of processor development.


> Windows 7 or the Intel dominated compilers are not only to be blamed here. They went a bit more ahead of the time, Single threaded performance is not something that can be ignored. And despite having 8 physical cores it is not that much ahead of i52500k even in heavy multitasking apps. Potential is there but they need to do a lot more


You have to check that BD review on Linux from Phoronix, linked in the previous post. You will see BD racing ahead of the 2500K in all the multi-threaded benchmarks.


----------



## d6bmg (May 8, 2012)

So, you buying FX-8150?


----------



## sukesh1090 (May 9, 2012)

Who, me?? But the problem is what I would do with those 8 cores, and my mobo doesn't even support the 8150; it supports only the 8120.


----------



## Cilus (May 9, 2012)

Sukesh, thanks for pointing that out. Actually I just shared a couple of pieces of info which are not known to all.

Let's have some info about the FMA4 instruction; FMA stands for Fused Multiply-Add. For example, a = b*c + d corresponds to an x86 instruction, or macro-op. This instruction can't be executed directly and needs to be divided into multiple smaller instructions called micro-ops. So on a processor without FMA support:
1. The value of b*c is calculated.
2. It is then rounded off as per the required level of precision.
3. After that it is added to d, and the sum is rounded off again.
4. The final value is stored in a register and saved to the memory location of a. Rounding can happen again as per the data type of the operand a.
So four steps and up to three roundings are required, resulting in multiple CPU cycles being used and a reduced level of precision.

If you have an FMA-enabled processor, then the FPU will execute the instruction a = b*c + d instead of the integer unit. As the FPU supports FMA, b*c + d will be calculated in a single fused step and the result rounded off only once. So the advantages are:
1. Execution is faster.
2. Accuracy is higher, because only one rounding step is applied.
3. The FPU can produce more accurate results than the integer unit.

Now, AMD has implemented FMA4 in their BD architecture, while Intel is planning to introduce FMA3 in Haswell. None of Sandy Bridge, Sandy Bridge-E or Ivy Bridge has support for FMA.
In FMA4, four operands can be specified in CPU registers. So here the values of a, b, c and d can all be fetched from memory into CPU registers, and then the whole operation can be performed inside the CPU without any further memory access, resulting in a far better speed-up; as you know, memory access is a very slow process and the CPU can stall for hundreds of cycles waiting for it.

On the other hand, in FMA3 you can only place three operands in CPU registers. So b, c and d will be fetched, and the result of (b*c + d) will overwrite one of those registers. There is no fourth register to name a as the destination, so the CPU then has to write the result back to a's memory location separately, and any overwritten source value must be fetched from memory again if it is needed.

So it is clear to all that FMA4 is faster than FMA3.

Now here comes Intel's foul play. In their compiler instruction-support specification back in 2010, they showed FMA4 as a supported instruction set. But just before Bulldozer's release, with all its hype, Intel dropped support for FMA4 and adopted FMA3 instead.

As a result, none of the major software vendors have implemented FMA4 support. A couple of open-source programs, like the x264 encoder, do support it; just check the benchmarks. In the AMD-compiled x264 benchmark's 2nd pass, where the video actually gets converted, the FX-8150 is ahead of the 2600K even under Windows 7.

The good news is that AMD has signed a deal with Intel that from the next iteration of their processors, starting with Piledriver, FMA3 will be used, and Intel will support it in their compiler. So you could see a serious performance boost in Piledriver thanks to compiler support from all the major vendors once Intel is on board.


----------



## amjath (May 9, 2012)

^ Oh, I see why the 8150 outperforms the 2600K in video encoding benchmarks. Very well explained, Cilus bro.

I have a question. You said FMA4 is faster than FMA3, so why don't they keep using it instead of signing up for FMA3? Also, is it difficult to implement FMA4 support in software?

I can see Intel wants to make money by making us idiotic sheep, just like Apple is doing.


----------



## Cilus (May 9, 2012)

Intel processors are better suited to FMA3 because of the number and structure of their registers. Intel has 256-bit floating-point registers and can perform an operation on 256-bit data in one go, but has fewer registers, whereas AMD works on 128-bit units and needs to break a 256-bit operation into two 128-bit operations, while having relatively more registers available.
Now, in multimedia workloads like video conversion or editing, photo editing etc., you hardly ever operate on 256-bit data; instead you operate on large sets of smaller data. So you can basically store a larger number of small operands inside Bulldozer's FPU, and that's why FMA4 was implemented.

In Intel's case, although they can't store a large number of smaller operands inside the FPU registers, they can perform a 256-bit operation in a single cycle, compared to Bulldozer's two. That's probably why they stressed operating on large data items instead of storing a large number of small ones inside the FPU. So FMA3 is perfect for them.

Since AMD is going to support FMA3, we will see more optimized software, as Intel-based compilers will have support for FMA3. There is a chance that AMD might retain the current FMA4 compatibility along with the new FMA3 in their Piledriver architecture, making it a more future-proof and versatile solution.

Have a read here: Single Floating-Point Unit, AVX Performance, And L2 : AMD Bulldozer Review: FX-8150 Gets Tested


----------



## vickybat (May 9, 2012)

Maybe the architectural revision of Piledriver that we constantly talk about will include one of these changes. They might follow Intel's path of using fewer registers but working on wider data. This way they can use FMA3 more efficiently.

Logically, FMA4 just uses an extra register and thus saves CPU cycles. Considering an FMA4 instruction d = a*b + c, the same instruction would be implemented in FMA3 as c = a*b + c.

In the first case, after the computation on a, b and c, the result is saved in a different register, i.e. d. But in the second case, the result overwrites one of a, b or c, discarding the previous value.

There are some losses in CPU cycles with FMA3, but there are a few advantages to a 3-operand scheme. Fewer operands result in shorter code and thus simplify the implementation. It has been seen that the losses of FMA3 compared to FMA4 are negligible, a mere 1%. So, considering simplicity, FMA3 is the new standard.

But others say FMA4 is more flexible from a programmer's perspective.


----------



## Cilus (May 9, 2012)

Vicky, FMA4 does have other benefits apart from being more flexible from a programmer's point of view. Consider the following case:

const int c = 10;      // c is a constant with value 10
int d[10];             // d is an array of 10 integers
for (int a = 0; a < 10; a++)
{
    int b = a + 1;
    d[a] = a*b + c;
}

Now, if you're using FMA3, the implementation of each iteration will look like this:
The values of a, b and c are loaded into three CPU registers, say A, B and C, and the address of d[a] for the current iteration is held in an address register, say P.
C = A*B + C
C -> MEM[P]  // store the value of C in the memory location held in P
For the next iteration, load A, B and C again and continue the cycle. *Here you have to reload C, as its value has been overwritten, although c is a constant.*
Now if you look at it, the value of c needs to be read from memory (RAM) 10 times. But c is just a constant and its value stays the same throughout the life cycle of the program. So you are basically wasting 9 memory-read cycles, a very slow process, reading the same value over and over.

Now consider FMA4:

Load the values of a, b and c into A, B and C respectively; D is another register.
D = A*B + C
D -> MEM[P]
Load the values of a and b into A and B and continue the cycle.

*So here you're saving 9 precious memory reads, as c is a constant and stays in register C throughout the life cycle without reloading.* This is just one example; as the number of iterations increases, you save even more memory reads.

Basically, FMA4 can offer all the features of FMA3; it is only Intel who made their toolchain incompatible with FMA4, using legal means.


----------



## AcceleratorX (May 9, 2012)

As far as FMA is concerned, it's actually a legitimate issue... *but*... a processor has to show its performance in real-world applications as well, and not expect every application to be optimized for it.

In most real-world testing, BD is similar to the 2500K at best. For this reason, irrespective of compiler specifics, BD is a failure. NVIDIA, for example, faced similar issues with the GeForce FX series. They worked around it by introducing a new philosophy: optimize the hardware for the compiler and not the other way around. AMD needs to learn from this.

I have respect for AMD because it has put in significant effort to drive competition. But at the same time, their graphics division is aggressively working in a very NVIDIA-like fashion (the _old_ NVIDIA). AMD constantly disses NVIDIA instead of working around bugs (for example). When has NVIDIA done this? Even when the nForce/Intel fiasco broke out, NVIDIA weren't giving interview after interview about why the competitor sucks.

AMD needs to clean up its engineering and its general act. That being said, if they pull it off, they'll be _very_ competitive, CPUs or otherwise.


----------



## Cilus (May 9, 2012)

I think you are getting the picture wrong. I agree with you that AMD should have optimized Bulldozer for Windows 7, and rather than creating hype like "first 8-core in the world", they should have made sure the modules were properly detected by the OS.
But what I'm trying to point out is that some software won't run at its maximum performance on AMD or any other non-Intel CPU, because Intel deliberately stops its compilers from using the optimized code path on any non-Intel CPU, or simply pushes vendors to use only its optimization.

I guess you have missed the VIA Nano issue. It used to be around 8% faster than the Intel Atom CPUs, but it was supposed to be even faster as per VIA's claim. When a mod was used to change the CPUID of the VIA Nano to an Intel one, the performance advantage jumped to a whopping 30% over the comparable Intel Atom processor.

See, even when the software is optimized to run on specific hardware, Intel forcefully prevents it through its money power and influence; it is not that the hardware isn't already optimized for the software.

Check the Linux benchmarks of Bulldozer, where plenty of open-source and unbiased compilers are available: in more than 90% of cases the FX-8150 is ahead of the 2500K.

And regarding the software-optimization point, one has to take steps into the future, otherwise innovation will stop. If that were the argument, dual-core processors should never have been released at all, as Windows XP, the primary OS at that time, was never optimized for dual-core processors. It was only after the release of the Pentium D and Athlon X2 that Microsoft started releasing multi-core optimization patches for XP.


----------



## vickybat (May 9, 2012)

Cilus said:


> Spoiler
> 
> 
> 
> ...



Yup buddy, this case is fair enough. Since c is a constant, FMA3 takes a performance hit compared to FMA4, as it unnecessarily fetches the value of c from main memory nine extra times.

But consider a scenario where c is a variable and its value changes every iteration, or let's say it's a randomly generated number. Then won't FMA3's performance equal FMA4's, as both have to fetch the value of c the same number of times?

Won't such a scenario come up in program logic? Just asking.


----------



## AcceleratorX (May 9, 2012)

The thing with dual core is, well, Intel released the first dual cores at similar frequencies to the top single cores, which meant performance was roughly comparable either way. For this reason dual cores caught on quickly, and Microsoft's later patches improved performance further.

Now that you mention the VIA Nano issue, I remember this, and the issue was not just with VIA's CPUs. Even AMD CPUs had it (AFAIK the Intel compiler limited the instruction sets used when a non-Intel CPU was detected). However, the issue was resolved by simply switching to the Microsoft compiler (AFAIK the MS compiler does have FMA4 support now).

I didn't think it was a large enough issue, since I assumed it was easier for developers to use Microsoft's compiler for Windows and GCC for Linux than to resort to Intel's stuff (the developers I know do exactly this; in fact AVG, despite being partly owned by Intel, still uses the Microsoft compiler). IMO this issue never gained enough traction because AMD's processors were still fast enough despite being crippled by the Intel compiler, and VIA never had a significant market share.

I know FX is a good architecture limited by choice of compiler, but what I'm saying is that pointing fingers at the competitor will not help your bottom line. You gotta find another way, no matter what the odds are.


----------



## Cilus (May 9, 2012)

> I know FX is a good architecture limited by choice of compiler, but what I'm saying is that pointing fingers at the competitor will not help your bottom line. You gotta find another way, no matter what the odds are.



Is that so simple when Intel is your competitor?
They have spent 50 million on the compiler business and let developers use it for free. Also, in terms of features and optimization, Intel compilers are far superior to their competitors', even better than Microsoft's. Like it or not, most developers still rely on Intel compilers over Microsoft's. In fact, other CPU makers, including AMD, pay Intel to license its compilers.

I know GCC and Open64 are two open-source compilers with AVX and FMA support for AMD processors, and that's why the 8150 excels over the i5 2500K in Linux benchmarks. Check the review link given on the 1st page.

Most synthetic benchmark programs, like SiSoft Sandra and 3DMark 05/06, are built with Intel compilers and therefore favor Intel processors.

The latest x264 codec, developed with FMA4, AVX and XOP support (the rest of the SSE5 instructions, which are not part of Intel's specification), performs better on the 8150 than even the mighty 2600K in Windows 7, despite its poor thread scheduling. So isn't it clear that if proper, unbiased benchmarks are used, the FX series gets a big performance boost?
And whether VIA has market share or not, the VIA Nano modding shows Intel's foul play.


----------



## AcceleratorX (May 10, 2012)

Hmm... that is interesting, and I am aware Intel is less than ethical with some of its practices, but I still think AMD can only make the best of a bad situation rather than try to change it (because Intel will not allow that). That's why I'm saying they should try to optimize the architecture for the compiler next time around.

As for now, with BD only awareness will change things, and not being first to market has hurt AMD a lot. I'm hoping they make a better comeback with Piledriver.

There are many good things to say about AMD though - they are really pushing features in their chipsets. I mean, you only need to look at the features of an AMD platform compared to Intel's - Intel has been so slow in pushing SATA 6 Gbps and USB 3.0.


----------



## Cilus (May 10, 2012)

AMD is trying hard, buddy. They have paid an undisclosed amount to Intel to license FMA3 for Piledriver, to make sure Intel compilers will support an FMA3-optimized code path for AMD processors too. They are also going to use a resonant clock mesh generator to run their processors at 4 GHz+ without increasing power draw, and to increase IPC.


----------



## vickybat (May 10, 2012)

Here's a link that chronicles the *"instruction set war"* between AMD and Intel:

*Source*

I hope you'll like it, guys.


----------



## Cilus (May 10, 2012)

> Yup buddy this case is fair enough. Since c is a constant, FMA3 faced a performance hit compared to FMA4 as it unnecessarily fetched the value of c from main memory nine times.
> 
> But consider a scenario, where c is a variable and its value changes every iteration or lets say its a random generated number. Then won't FMA3's performance equals to FMA4 as both have to fetch value of c equal number of times?
> 
> Won't such a scenario be available in a program logic? Just asking.



That can be implemented in FMA4 too: since you can use a maximum of 4 registers, you can also use just 3. But the thing is, FMA3 uses destructive register mapping, where the operand value in one register gets overwritten with the result. The example I've given is just one of millions of possible scenarios. Consider another example:

for (int i = 0; i < 10; i++)
{
    int a = i*2, b = i*3, c = i + 2;
    int d = a*b + c;
    int j = c*a;
    printf("%d %d %d %d %d\n", a, b, c, d, j);
}

Here a, b, c and d are all variables, but the value of c is needed to calculate both d and j in each iteration.
Now, if you use FMA3's C = A*B + C, the value in the C register is overwritten and can't be used to calculate j. So you need to fetch it from memory again, one extra memory read in each of the 10 iterations, i.e. 10 extra reads.
On the other hand, in FMA4 you can combine the 4-register and 3-register forms:
D = A*B + C  (FMA4, using 4 registers)
C = A*C      (a destructive 3-operand-style multiply: the result overwrites C)

MOVE C to MEM[j]  // save the value of C to the variable j in memory
MOVE D to MEM[d]  // save D to the variable d in RAM

Now look at the advantages:
1. We are reusing the value of c, saving one memory read in each iteration.
2. Moving C to the variable j in memory and moving D to the variable d in memory can be performed independently, as they come from two separate registers and update two separate memory locations. So we also save roughly one memory-write operation's worth of time per iteration, as the two stores can overlap.

These advantages actually provide a very high degree of flexibility to the compiler as well as to programmers.


----------



## vickybat (May 10, 2012)

Cilus said:


> Spoiler
> 
> 
> 
> ...



Yes, here FMA4 is advantageous in calculating j = c*a each iteration, because that calculation requires the original value of c. In FMA3, using the C register to hold the value of a*b+c would have overwritten c's original value, leaving it unavailable for calculating j = c*a. But in FMA4, the separate D register means the compiler doesn't have to fetch the original value of c again each iteration as in FMA3, since it's already available in register C.

Definitely FMA4 has an advantage. Doubts cleared, buddy. This was a good test scenario, with c changing value every iteration. So we can conclude that in scenarios where we have to reuse values stored in registers, FMA4 comes in handy and saves excess memory reads/writes.


----------



## Tech_Wiz (May 12, 2012)

It looks like we may have to skip BD, hang on to Phenom II and jump to Piledriver directly, just like we stuck with XP, skipped Vista and jumped to Win 7.


----------



## RiGOD (May 13, 2012)

^^I guess you're right there. If Piledriver has better IPC, better power-management features and runs cooler than BD, then it'll be wise to wait till Q3.


----------



## sukesh1090 (May 13, 2012)

^^
What Q3, buddy? Piledriver has been announced for 2013, not 2012, so you have to wait longer.


----------



## Tech_Wiz (May 14, 2012)

Yeah, but Phenom II and Thuban still give enough performance to wait till then.

An OCed Phenom/Thuban pretty much matches or exceeds first-gen Core i processors in most cases; they are not bad at all to hang on to.


----------

