# Everything about Threading; Single Thread to Bulldozer



## Cilus (Sep 17, 2011)

Hi Guys, here is the article about Threading I promised earlier. It is the lengthiest article by me till date but the most elaborate and informative one too. I've tried to explain everything without jargon so that it can be understood by everyone. Please have patience and go along with it. I am sure you're gonna like it and believe me, it is gonna clear up a lot of things about threading. So here it begins:-

*Introduction: *
Multithreading is a technique implemented in all modern CPUs to run multiple threads simultaneously. It increases the parallel processing capability of a CPU, resulting in improved performance, by exploiting Thread Level Parallelism or TLP and maximizing CPU resource usage. The technique has roots going back to the 70s and has gone through a lot of new techniques, new concepts and hardware changes since. In this article we will cover different types of threading, from orthodox Multitasking to Hyper-threading to Multi-core processors.

*Instruction Level Parallelism, Out of Order Execution and Pipelining:*
Before we start the discussion of TLP, let's spend a few words on the basics of ILP or Instruction Level Parallelism. The most common techniques are Pipelining and Out of Order execution, and both are present in all modern processors. These techniques are implemented as hardware logic inside the CPU die and don't need the aid of the OS.

*Pipelining:* The CPU's execution core is divided into some logical units. For simplicity we can have 5 units, namely Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI) and Write Back Output (WO). Instruction Pipelining basically overlaps the execution of multiple instructions to increase the execution rate. Suppose at clock cycle C1 instruction i is fetched by FI; at C2 the instruction is decoded and the addresses of the operands on which the operation needs to be performed are retrieved by DI; at C3 the operands are fetched from either a CPU register or from memory by FO; at C4 the operation is performed by EI; and at C5 the result is written back to memory or a register. Now at C2 the FI unit is sitting idle…so it can start fetching instruction i+1. Similarly at C3 both FI and DI would otherwise be idle, so FI can start fetching instruction i+2 and DI can start decoding instruction i+1. This process goes on for all the units present. So executing instruction i takes five clock cycles, but i+1 finishes just one cycle later, at cycle 6, and i+2 at cycle 7. If pipelining were not there, finishing i to i+2 would require 15 clock cycles.
So from the above discussion it is very clear that the more instructions that can be overlapped, the more performance benefit will be achieved from Pipelining.
For simplicity we are assuming all the instructions are independent of each other.
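The cycle counts above can be checked with a tiny back-of-the-envelope calculation. Here is a toy Python sketch; the 5-stage split and the assumption of fully independent instructions are just this example's simplifications, not a real CPU model:

```python
def cycles(n_instructions, n_stages=5, pipelined=True):
    """Total clock cycles to finish n independent instructions."""
    if pipelined:
        # The first instruction takes n_stages cycles; each later
        # instruction finishes just one cycle after the previous one.
        return n_stages + (n_instructions - 1)
    # Without pipelining each instruction occupies the core fully.
    return n_stages * n_instructions

# Instructions i, i+1, i+2 through the 5-stage pipeline:
print(cycles(3))                    # 7 cycles with overlap
print(cycles(3, pipelined=False))   # 15 cycles without
```

This reproduces the article's numbers: 7 cycles with overlap versus 15 without, for three independent instructions.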

i51.tinypic.com/34imxiw.jpg

*Out of Order Execution*: In very simple words, an OOE architecture takes code that was written and compiled to be executed in a specific order, reschedules the sequence of instructions (if possible) so that they make maximum use of the processor resources, executes them, and then arranges them back in their original order so that the results can be written out to memory. To the programmer and the user, it looks as if an ordered, sequential stream of instructions went into the CPU and an identically ordered, sequential stream of computational results emerged. Only the CPU knows in what order the program's instructions were actually executed, and in that respect the processor is like a black box to both the programmer and the user.
*What are a Process, Context and Thread:*
When we run multiple programs in our computer, those programs are basically loaded into our main memory. *The more technical term for a program is Process.* Now a process may consist of different sub-modules, and execution of all of them signifies the successful completion of the process. A process has some unique fields by which the Operating System can uniquely identify it. *The collection of these parameters is called the Context.* The Context holds data to store the current state of the process. It contains the *CPU register values, the Program Counter, the instruction which needs to be executed in the next iteration, etc.*
*Thread:* Now, it is observed that if a process can be subdivided into small sub-processes with their own state, data and work, then the execution can be faster. The reason is that executing those small sub-programs is easier than executing one heavyweight process, because the sub-processes operate on small sets of data with fewer instructions. These small sub-processes are called Threads. So in main memory a process is again divided into multiple threads, each of which has its own data to operate on and a specific task to perform. Like the process, a Thread has its own Context, which contains the Context of the parent process plus additional information unique to that particular thread. A process always has at least one Thread; in that case it is a single-threaded application.
If multiple processes share a common method, then the thread created for the method can be placed once in main memory or RAM and shared by all the processes, rather than placing multiple copies of it. It improves memory utilization.
*Difference between Running and Executing:* When a process is loaded in main memory, it is said to be running. Several processes can be loaded into main memory, and we can say that several processes are running simultaneously.
But when one of the running processes gets CPU time and its instructions are actually being executed by the CPU, only then can we say that the process is executing.
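To make the process-vs-thread idea concrete, here is a minimal Python sketch: one process spawning three threads, each working on its own small slice of data while sharing the parent process's memory. The names (`worker`, `shared_results`) are made up for illustration:

```python
import threading

shared_results = []          # lives in the process's memory,
lock = threading.Lock()      # visible to every thread it spawns

def worker(name, data):
    total = sum(data)        # each thread operates on its own small data set
    with lock:               # threads share memory, so guard shared state
        shared_results.append((name, total))

# One process, three threads: each thread is a lightweight sub-task
# with its own context and its own slice of the work.
threads = [threading.Thread(target=worker,
                            args=(f"t{i}", range(i * 10, i * 10 + 10)))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_results))
```

Note how all three threads write into the same `shared_results` list: that is the "placed once in RAM and shared" point from the paragraph above.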

*Conventional Multithreading, A single core single threaded processor, CONTEXT SWITCHING*
Consider a single threaded processor running multiple processes simultaneously. Now since there is only one CPU available, only one program can be executed at any particular point. Even in our Pentium 4 days we used to run multiple applications like a web browser, Word documents and Windows Media Player simultaneously. It looks like a single CPU is executing all the processes simultaneously. Actually it is an illusion created by the Operating System by implementing a technique known as Time Sharing. Here the Operating System allocates the CPU to each of the running processes for a specific time period, which is known as a Time Slice. The time slice is not fixed for all the running processes; even a specific process can be given a different time slice value in a different run. During its allocated time slice, the instructions of a process are executed by the CPU. Inside the CPU, different ILP techniques like Pipelining and Out of Order execution can be used to maximize CPU resource utilization for the instructions of that particular process. *Once the time slice expires, the Operating System saves all the values of the CPU registers and the current state of the program to the Context of that process and writes it back to main memory.* Then it allocates another process to the CPU for the time slice allocated to it, and this process continues. *So when a process gets its next time slice, it can be restarted from the point where it was left in the last iteration.* This technique is known as *Context Switching*.
Here one of the most important factors is the value of the time slice. It has to be small enough that the user doesn't notice any degradation in the usability and performance of the running programs, and it has to be large enough that each program gets a sufficient amount of CPU time in which to get useful work done. All modern OSes can set a specific Time Slice value for each process depending upon the nature and complexity of the process.
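The time-slice juggling described above can be sketched as a toy round-robin scheduler in Python. The process names and work-unit counts are invented for the example; a real OS scheduler is of course far more sophisticated than this:

```python
from collections import deque

def round_robin(processes, time_slice):
    """processes: dict of name -> remaining work units.
    Returns the order in which processes finish, simulating
    the OS swapping contexts in and out every time slice."""
    queue = deque(processes.items())
    finished = []
    while queue:
        name, remaining = queue.popleft()    # load context, give it the CPU
        run = min(time_slice, remaining)     # execute for at most one slice
        remaining -= run
        if remaining:                        # slice expired: save context,
            queue.append((name, remaining))  # go to the back of the queue
        else:
            finished.append(name)            # process completed
    return finished

print(round_robin({"browser": 5, "player": 2, "editor": 4}, time_slice=3))
```

With a slice of 3 units, the short "player" job finishes first while the longer jobs keep cycling back through the queue: `['player', 'browser', 'editor']`.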

The initial approach to determining the time slice was to leave it to the process itself. Here the OS does not force any process to release the CPU, and this is known as *Cooperative Multitasking*. But if a process is not well optimized to handle resources, it may cause serious performance issues by not releasing the CPU, and other processes will be waiting for the CPU for an indefinite period of time.

To resolve the issue, *Preemptive Multitasking* was introduced, which forces a process to release the CPU once its time slice expires. It also uses Memory Protection to ensure that a process can only access its own memory space, not anyone else's.

*Our Hypothetical CPU:*

*Front End:* The front end of the CPU consists of fetching, decoding, Branch Prediction, the Program Counter, Out of Order rescheduling logic etc. These operations are basically scalar operations and are done in sequence. So the CPU front end is basically an In-Order unit where operations are performed in the same order as specified.
*Execution Core:* The execution core consists of the execution units and retire logic. Here the instructions are actually executed and each instruction reaches its end state. Pipelining is implemented inside the Execution core.

In our example, we have a *front end which can issue four instructions simultaneously*. In the case of the Execution core, *it has a 7-stage pipeline, i.e. one instruction needs to go through seven steps to be executed completely*. Look from *left to right inside the execution core*, not top to bottom, and you'll find a lot of similarities with the Pipelining diagram I explained earlier.

*CONTEXT SWITCHING In a Single Threaded CPU:*
Here the front end of the CPU can hold instructions for only one thread at a time. So when a time slice is allocated to a thread, it is fetched from main memory and placed in the CPU front end. All the instruction fetching, decoding and Out of Order rescheduling is performed over the instructions of the thread, and then they are allocated to the Execution Unit. Once the time slice expires, all the values from the execution unit and temporary results are written back to memory, the CONTEXT value of the thread is updated, and the next thread is loaded into the CPU front end.

i52.tinypic.com/vrex05.jpg
A single Threaded CPU, running one process at a time

In the above Diagram, *different colors indicate threads from different processes*. Currently inside Main memory, 4 different processes are running, but only the instructions from the red colored Process are getting executed by the CPU while other processes wait for their time slice to come.
Also, be sure to notice those *empty white boxes in the pipelines of each of the execution core's functional units. Those empty pipeline stages, or pipeline bubbles, represent missed opportunities for useful work*; they're execution slots where, for whatever reason, the CPU couldn't schedule any useful code to run, so they propagate down the pipeline empty. One of the main reasons for pipeline bubbles is that in practice there are lots of dependencies among the instructions of a thread, so all of them can't run in parallel; the pipeline needs to be stalled to wait for some other instruction to be completed, or at least partially completed, until some intermediate result is available.

The empty boxes in the front end indicate that the CPU cannot actually reach the maximum issue limit for a single thread, which is 4 in our example. The main reason is dependencies among the instructions of a thread, which remain even after the Out of Order rescheduling logic has done its job. On most CPU cycles it issues only 2 instructions in parallel.

*Advantages and Disadvantages:* Although from the above discussion it looks like the conventional Context-based multitasking concept is not good, that is not true. In terms of CPU resource utilization it is obviously not that efficient, but from the OS point of view it has a couple of advantages.

For switching from one thread to another, the CPU has to access main memory, which is a very slow process. Now for a single-threaded big process, it is not even possible to load the whole process inside the CPU cache or registers, and it needs multiple memory accesses, costing valuable CPU cycles. But the threads of a lightweight program are small, so *fetching them, and writing them back after their time slice has expired, is easier compared to a huge single-threaded process*. *It also improves memory allocation efficiency, since the whole process does not need to reside inside the memory; only the active threads of the process will do the job.*

*Superthreading with a Multithreaded Processor*

One technique to maximize CPU resource utilization, by running multiple threads concurrently and eliminating the waste of CPU resources associated with a single-threaded design, is known as Time Slice Multi-Threading or Super-threading, and a processor that uses this technique is known as a Multithreaded processor. These processors are capable of executing multiple threads at a time, although in most desktop processors the number of concurrent threads that can be executed is two.

i51.tinypic.com/73idna.jpg
A two way Superthreaded CPU, switching between two threads in each clock cycle

Here the front end is updated and can hold instructions of two threads simultaneously. Also, the number of white boxes here is fewer compared to a single threaded processor, as the front end is issuing instructions from two threads. The small arrows on the left hand side of the execution core indicate how instructions from two threads can be mixed up inside the execution core. *Look from left to right: a pipeline stage either has yellow boxes or red boxes, but not both,* indicating that in one CPU cycle instructions from only one thread can be issued.
The reason is that in a Super-threaded processor, a single pipeline stage can only execute instructions from one and only one thread. Let's explain it more clearly with the help of the diagram:
Here the front end can hold two threads simultaneously and issue 4 instructions to the execution core with its 7-stage functional unit pipelines. However, *in any given cycle all the issued instructions must come from only one of the two threads present in the front end, because each thread is still confined to a time slice. But here the Time Slice is just one CPU cycle.* Look at the diagram: the 1st line of the pipeline only has instructions from the yellow thread, and the 2nd line of the pipeline only has instructions from the red thread.
So instead of system memory containing multiple running threads that the OS swaps in and out of the CPU each time slice, *the CPU's front end now contains multiple executing threads and its issuing logic switches back and forth between them on each clock cycle as it sends instructions into the execution core.* It is a far faster approach, as switching among the threads present inside the front end is really quick, because the front end runs at the same speed as the CPU.

Multithreaded processors can help alleviate some of the latency problems brought on by *DRAM memory's slowness relative to the CPU.* For instance, consider the case of a multithreaded processor executing two threads, red and yellow. If the red thread requests data from main memory and this data isn't present in the cache, then this thread could stall for many CPU cycles while waiting for the data to arrive from slower main memory or RAM. In the meantime, however, the processor could execute the yellow thread while the red one is stalled, thereby keeping the pipeline full and getting useful work out of what would otherwise be dead cycles. I know you guys think this is Hyper-threading or Simultaneous Multi-Threading, but it is not.

*Barrel Processor:* One implementation of Super-threading is known as the *Barrel Processor, which can have n Thread registers and a single execution unit.* The execution unit is allocated to each of the thread registers for a fixed time, and this rotation continues. *So if you have a 700 MHz processor with 7 threads, it will roughly behave as seven (700/7) = 100 MHz processors.* It looks like there is no performance improvement, as a 700 MHz processor can perform 7 times faster than a single 100 MHz processor, so seven 100 MHz CPUs' processing power is roughly equivalent to one 700 MHz CPU. *But you guys are forgetting one thing….Memory latency.* Even today, armed with high speed DDR3 memory and an on-die memory controller, the CPU still spends more than 50% of its time just waiting for memory. In the 80s, CPUs used to wait more than 80% of the time. A barrel processor hides that memory access latency. *Here, when one thread is waiting for memory resources or dependencies, the CPU has another six threads to pick from to keep its execution unit busy while the front end is performing memory read activities.*
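The latency-hiding argument can be sanity-checked with a toy analytic model in Python. The stall numbers below are invented to echo the "CPU spends most of its time waiting for memory" point, not measured from any real hardware:

```python
def busy_fraction(stall_every, stall_len, n_threads):
    """Toy barrel-processor model: each thread computes for
    `stall_every` cycles, then waits `stall_len` cycles on memory.
    With n_threads rotating through the single execution unit,
    a stall only wastes cycles if no other thread is ready."""
    period = stall_every + stall_len
    # One thread alone keeps the execution unit busy for
    # stall_every out of every `period` cycles.
    single = stall_every / period
    # n threads can cover each other's stalls, up to 100% utilization.
    return min(1.0, n_threads * single)

# One thread: 2 busy cycles, then an 8-cycle memory stall -> 20% busy.
print(busy_fraction(stall_every=2, stall_len=8, n_threads=1))  # 0.2
# Seven threads taking turns cover all the stalls -> fully busy.
print(busy_fraction(stall_every=2, stall_len=8, n_threads=7))  # 1.0
```

So although each thread still runs at roughly a seventh of the clock, the execution unit itself goes from 20% busy to fully busy, which is exactly the latency the barrel design hides.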


While superthreading can help immensely in hiding memory access latencies, it does not, however, address the waste associated with poor instruction-level parallelism within individual threads. If the scheduler can find only two instructions in the red thread to issue in parallel to the execution unit on a given cycle, then the other two issue slots will simply go unused.

*Hyper-threading: the next step, Bringing ILP and TLP together*

Hyper-Threading, or Simultaneous Multi-Threading, brings ILP and TLP together. Simply put, Hyper-Threading removes the restriction of Superthreading that at any given clock cycle the front end can only issue instructions from one thread. Here, at any clock cycle, the front end can issue instructions from all the threads present in it. It takes advantage of the fact that the CPU execution core can't distinguish instructions coming from different threads, as instructions are stateless. It basically improves the ILP utilization to improve TLP. Look at the following diagrams:

i56.tinypic.com/2hrd4ip.png
a: Two way HT processor
i56.tinypic.com/2m4d1y8.jpg
b: Single Threaded Dual Core

Diagram a describes a 2-way Hyper-threaded processor, while diagram b is a single threaded dual core processor. If you look a little carefully, you'll find that if you merge the two CPU blocks of the dual core processor into one, it will look like the CPU block of the HT processor. That is the reason a two way Hyper-threaded CPU is viewed as two logical processors by the OS.
Let's discuss its advantage in more detail. Now the CPU front end contains two threads, Red and Yellow. Suppose our Out of Order execution logic has extracted all the Instruction Level Parallelism (ILP) from the red thread and can issue two instructions in parallel from the Red thread in an upcoming cycle. *This is a very common scenario, because research has shown that the average ILP that can be extracted from most code is about two instructions per cycle.*
*Now, as the Out of Order execution unit is aware that two instruction issue slots are still empty, it can start working on the yellow thread and fetch 2 instructions for the coming cycle which can be issued in parallel with the two instructions of the Red thread.* I think now you guys are getting the idea: *Hyper-Threading lets multiple threads be executed concurrently, not by switching between them*. So here a total of four instructions is issued from the front end, maximizing resource utilization. Otherwise those two slots would have gone to waste. Also, the execution unit, which has a seven-stage pipeline, now has 4 instructions that can be overlapped together, increasing the efficiency of the pipeline.
In either a single-threaded or multicore single-threaded processor design, the two leftover slots would just have to go unused for the reasons outlined above. But in the hyper-threaded design, those two slots can be filled with instructions from another thread. Hyper-threading, then, removes the issue bottleneck that has plagued previous processor designs. 
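The issue-slot argument boils down to a one-line toy model. The issue width of 4 and the per-thread ILP of 2 are just this article's example numbers:

```python
def issued_per_cycle(ilp_per_thread, n_threads, issue_width=4):
    """Toy model: each thread supplies about `ilp_per_thread`
    independent instructions per cycle; SMT lets the front end
    fill its issue slots from several threads at once."""
    return min(issue_width, ilp_per_thread * n_threads)

# One thread with typical ILP of 2: half of the 4 issue slots go empty.
print(issued_per_cycle(2, n_threads=1))  # 2
# Two threads (hyper-threading): the slots are filled from both threads.
print(issued_per_cycle(2, n_threads=2))  # 4
```

The `min` with the issue width captures the bottleneck: no matter how many threads are available, the front end can never issue more than its width per cycle.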

*OS Requirement to Implement Hyperthreading*

The minimum requirement for the OS to identify a Hyper-threaded processor is that it should be multi-core aware, so that the Hyper-threaded processor can be treated as a dual core processor and the OS can schedule two threads to it simultaneously. But for better performance, *the OS must know the difference between a true dual core and a Hyper-threaded processor.* Consider the following scenario with a dual core i5 processor, which the OS sees as 4 logical processor cores. Now suppose the OS has assigned two threads to two logical cores which belong to a single physical core. Since there is only one execution unit, this particular physical core will be very busy executing both threads while the other physical core sits idle, as both threads have to share the CPU resources of a single execution core. But if the OS is aware of physical cores vs. logical cores, then each of the two threads can be issued to a separate physical core, so that the instructions of each thread get more CPU resources, as each of them now has a separate execution unit.
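The scenario above can be sketched as a toy model in Python: threads placed on logical cores of the same physical core must share its single execution unit, so their work serializes. The thread names and work amounts are invented for illustration:

```python
def completion_time(work_units, assignment):
    """Toy model of an SMT-aware vs. naive scheduler.
    `assignment` maps thread -> physical core; threads on the same
    physical core share its one execution unit, so their work adds up.
    Returns cycles until every thread is done."""
    per_core = {}
    for thread, core in assignment.items():
        per_core[core] = per_core.get(core, 0) + work_units[thread]
    return max(per_core.values())   # the busiest core finishes last

work = {"red": 100, "yellow": 100}
# Naive OS: both threads land on two logical cores of physical core 0.
print(completion_time(work, {"red": 0, "yellow": 0}))   # 200 cycles
# SMT-aware OS: one thread per physical core, core 1 no longer idle.
print(completion_time(work, {"red": 0, "yellow": 1}))   # 100 cycles
```

Same hardware, same threads; only the placement changes, and the aware scheduler finishes in half the time.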

*Caching in SMT processor*
As you know, in non Hyper-threaded multi-core processors like the Phenom II, each core has its own instruction and data cache inside the die. They may share a large last-level cache like the L3 cache. The caches dedicated to each core, L1 (both data cache and instruction cache) and L2, stay private to that core, while the large shared L3 cache resides outside the individual cores but inside a single package. This has been done to reduce the extra hardware cost of handling the cache coherency problem. *A cache coherency problem occurs when multiple cores use a shared cache and a specific location X of the cache has been updated by Core1 before the old value can be read by Core2. Here Core2 will read the updated or new value of X where the intention was to read the initial value in location X. The main cache coherency problems are:-*
*i. Read Before Write:* Location X of the cache is read by CPU1 before it is updated by CPU2, where CPU1 is supposed to read the updated value.
*ii. Write Before Read:* Here location X is updated by CPU2 before CPU1 can read it, where CPU1 was supposed to read the initial value of X, not the updated one.
*iii. Write Before Write:* This is the most dangerous case. Here CPU1 updates location X after CPU2, but the value wanted is the updated value from CPU2. So the value that CPU2 has written to location X is lost and an erroneous value is left in its place.
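The Write-Before-Write case has a well-known software analogue: the lost update on a shared counter. Here is a small Python sketch where a lock plays the role of the hardware restriction logic by serializing the read-modify-write; the unsafe variant is only described in a comment, since whether it actually loses updates in practice is timing-dependent:

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        # `counter += 1` is really read -> modify -> write. Without
        # the lock, two threads can both read the same old value of
        # the shared location, and one update then overwrites the
        # other -- exactly the Write-Before-Write case above.
        with lock:           # serialize access to location "counter",
            counter += 1     # like the hardware restriction logic for X

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 20000: no updates lost
```

With the lock in place, both writers always see each other's latest value, so the final count is exactly 2 × 10,000.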

Normally, special restriction logic is implemented at the hardware level to prevent such issues, and since the L3 cache sits outside the individual cores, the extra hardware cost there is not much. But implementing the same logic for the caches inside each core is not a very cost-effective approach.

In SMT, the instruction and data caches are shared by both the logical cores, and each of them can access any data or instruction present inside the L1 cache. However, this also increases the possibility of cache coherency problems, as the cache is shared by both logical cores.

*Problems:* Here the cache is not aware of the threads, and unlike the processor scheduling queue, where the CPU can switch between threads, a single thread can *monopolize cache access completely, and there is no way to force it to cooperate intelligently with the other executing threads.* The processor will still issue *instructions from both threads*, but the instructions of the other thread *will starve for cache access and cannot be completed in time, resulting in CPU resources being locked up for an elongated period of time.* This means that, in a worst-case scenario where the two running threads have two completely different memory reference patterns (i.e. they're accessing two completely different areas of memory and sharing no data at all), the cache will begin thrashing as data for each thread is alternately swapped in and out and bus and cache bandwidth are maxed out. That is one of the main reasons why some applications perform more poorly on an HT-enabled processor than on a non-HT processor.
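The thrashing scenario can be reproduced with a toy LRU cache model in Python. The addresses and the 4-line cache size are invented; the point is only that interleaving two threads with disjoint working sets can turn an almost-perfect hit rate into a 100% miss rate:

```python
from collections import OrderedDict

def miss_rate(access_stream, cache_lines=4):
    """Toy LRU cache model: fraction of accesses that miss."""
    cache, misses = OrderedDict(), 0
    for addr in access_stream:
        if addr in cache:
            cache.move_to_end(addr)          # hit: mark most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > cache_lines:     # evict least recently used
                cache.popitem(last=False)
    return misses / len(access_stream)

# One thread looping over a working set that fits in the cache:
red = [0, 1, 2, 3] * 8
print(miss_rate(red))        # 0.125: only the 4 cold misses

# Two SMT threads with disjoint working sets, interleaved every cycle:
yellow = [10, 11, 12, 13] * 8
mixed = [a for pair in zip(red, yellow) for a in pair]
print(miss_rate(mixed))      # 1.0: each thread keeps evicting the other
```

The combined working set (8 lines) exceeds the cache (4 lines), so every access evicts a line the other thread is about to reuse: that is the thrashing described above.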

*Bulldozer: Module based architecture to enhance SMT*
I think you guys have already read my explanation of the Bulldozer core and how it handles multiple threads concurrently. Bulldozer modules have another trick to reduce the cache coherency problem. *Inside a module, both Bulldozer cores share a single instruction cache, but they each have a separate data cache dedicated to them.* I think now you get the idea... the instruction cache holds instructions from both threads, but it doesn't bring any *cache coherency issue, as instructions are stateless; it does not matter which instruction belongs to which thread.* So both cores can fetch instructions from the instruction cache to increase parallelism, *but all the operands and temporary results required or stored by an instruction being executed in one core can be kept in that core's own data cache, eliminating the chance of data coherency problems.* If the 2nd core needs data from the 1st core, the 1st core can send the proper data, as it already knows which data to return from its data cache.

Well, here I finish my lengthiest article. I think I have covered most of the things. Have a little patience to finish it, as it'll clear a lot of doubts about Threading.


----------



## MegaMind (Sep 18, 2011)

Great write-up, learnt a lot from this..


----------



## Cilus (Sep 18, 2011)

^^ It is not completely posted yet. I am still working on it, as posting from a Word document on the forum is not easy; all the styling is lost. Check it after another 15 min. I guess I'll be finished by then.


----------



## sumonpathak (Sep 18, 2011)

nice writeup...so far....you have to share them in other forums too dude


----------



## Cilus (Sep 18, 2011)

sumonpathak said:


> nice writeup...so far....you have to share them in other forums too dude



Will do that for sure. I am finished now, go through it and share your valuable feedback so that I can improve it.


----------



## Skud (Sep 18, 2011)

Superb write-up buddy.


----------



## gameranand (Sep 19, 2011)

Real nice article bro. Very informative and simple to understand.


----------



## asingh (Sep 19, 2011)

Really nice. Will give it a proper read once I get more time. Nice.


----------



## d6bmg (Sep 19, 2011)

Superb explanation. Rep +1 for your very very useful post.


----------



## Cilus (Sep 19, 2011)

Thanks guys... will add more content to it so that it can be more helpful. BTW, I guess now you guys have understood what Hyper-Threading or SMT is all about. It is not just switching between two threads present in the CPU front end, which is called Superthreading; it is all about executing instructions from both the threads simultaneously by overlapping them in the CPU pipeline. So SMT actually improves the ILP to improve TLP.

Now I guess you guys are getting the idea why a Bulldozer module shares a single Fetch-Decode unit or *Frontend* for both the cores inside a module. It helps both cores share the instructions from all the threads present in the CPU front end. If the fetch-decode engines were separate for each core of a module, then one core couldn't share the instructions present in the front end of another core, and sharing them would need special hardware logic to be implemented, which would increase cost, die size etc.


----------



## Tenida (Sep 19, 2011)

Nice writeup bro.


----------



## Jaskanwar Singh (Sep 23, 2011)

read a part of it(reading in parts )
excellent work cilus. rep+

BTW it should be made a sticky imo.


----------



## vickybat (Sep 29, 2011)

Really nice article. Read till context switching and its getting interesting. Will post further comments after finishing it.


----------



## avinandan012 (Oct 11, 2011)

quote :
*Inside a module both the Bulldozer core shares a single instruction cache but they have a separate Data Cache dedicated to each of them.*

That's a very good move from AMD , this will reduce CPU idle time waiting for DATA .


----------



## thetechfreak (Oct 11, 2011)

Awesome job Cilus. Just read a bit of the first couple paragraphs.
Very well written.


----------

