Behind The Picture

Ever wondered what makes graphics cards tick, and tick so well?

From the movies we watch to medical science and imaging, educational simulations to defence applications, and even manufacturing processes (Computer Aided Manufacturing), the impact that 3D rendering has had on the world is phenomenal. 3D accelerators (or graphics cards) have acted as catalysts for the growth of 3D modelling, rendering, and animation.

The year 1981, when IBM shipped the first PC video card, seems like ancient history. We’ve gone from single-colour video cards to cards that can work with millions of colours. The real graphics war for desktop PCs didn’t start till 1997, when 3dfx delivered Voodoo to the world with (then) powerful features like mip-mapping, Z-buffering and anti-aliasing. Rivals NVIDIA came up with their TNT and TNT2 cards following the Voodoo 2, and these were the first AGP-based cards.

Though it may seem that things have stagnated since those initial gigantic steps, nothing could be further from the truth. Enter NVIDIA’s GeForce 256 graphics card (August 31, 1999). It essentially had 4 pixel pipelines, each able to handle a single, half-precision (16-bit) pixel operation per clock. Core and memory speeds were 120/166 MHz, and with a peak output of 480 million pixels per second and 15 million triangles per second, the GeForce 256 was hard to beat. On November 8, 2006, seven scant years later, the eighth generation of the GeForce family was unveiled. Their latest 8800GT, a very value-oriented card, churns out 16.8 billion triangles per second, a difference comparable to that between the sun and the moon.

Regardless of the difference in specifications, there are certain characteristics common to all graphics cards, and these are what we’re going to look at. We’re also going to see how it all comes together. Let’s look at the individual parts of a graphics card, and what they contribute to the whole.

GPU Facts

At the heart of a graphics card is the GPU, or Graphics Processing Unit. Think of the GPU as a processor dedicated solely to graphics. Just as a CPU slots into a motherboard, a GPU is affixed to the PCB (Printed Circuit Board) of the graphics card; unlike a CPU, however, a GPU can never be removed from its PCB. Today’s GPUs can number-crunch enormous amounts of data and, for such parallel workloads, are more powerful than even the fastest quad-core CPU. NVIDIA’s 8800GTX GPU, for example, has 128 stream processors built in, each capable of handling one thread of data. The constant need for speed has led manufacturers to up transistor counts, while advances in technology have shrunk the die size of today’s GPU cores. ATI Technologies has already made the shift to 55 nm and NVIDIA has moved to a 65 nm process; previously, both vendors were using 90 nm and 80 nm fabrication processes. More transistors are being crammed onto a smaller GPU die, and the 8800GTX broke all barriers with 681 million transistors on its 90 nm die; the previous high was 276 million transistors, on ATI’s X1950XTX. The brand-new 8800GT has crossed the 700-million-transistor barrier, and at this rate we could see a billion transistors on a GPU die within a year.

So what has this increase in transistor count bought us? For one, the number of Shader units has risen steadily: from 4 pixel pipelines (PPs) on the GeForce FX series, to 16 PPs on the GeForce 6800, to 24 PPs on the GeForce 7800GTX, and finally 128 SPs (Stream Processors) on the 8800GTX. Similarly, colour precision has gone from 16 bits on the GeForce 4 to full 32-bit precision on the 8800GTX, and double precision (64-bit) is on the way. Increases in performance and precision have gone hand in hand with increases in programmability, with DX 8.1 giving way to DX 9, then DX 9.0c (Shader Model 3), and finally DX 10.
The characteristic properties of a GPU mainly depend on two variables:

1. The core speed: this is the rated frequency (in MHz or GHz) of the processing unit. A general rule of thumb is that the faster the core, the faster the GPU, but performance doesn’t depend on frequency alone. For example, the core on a GeForce 7900GTX runs at 650 MHz while the core on a GeForce 8800GTX runs at 575 MHz; in spite of this, the 8800GTX is more than twice as fast as the 7900GTX in terms of processing performance.

2. Pixel and Vertex Shaders: A Vertex Shader essentially takes a 3D scene made up of vertices and lines and draws a 2D image from it, which is displayed in the form of pixels on the screen. A Pixel Shader works further down the line, allowing textures and effects to be applied to that 2D image on a per-pixel basis. With the advent of DX 10, both major vendors, NVIDIA and ATI, have done away with the distinction between Pixel and Vertex Shaders altogether, choosing instead to design their architectures around a Unified Shader model. These unified Shader units are also called Stream Processors. The more Shader units aboard a GPU, the faster it will perform, since it can complete more operations (on pixel and vertex data) per clock cycle.

So there are two ways to increase the performance of a graphics core: increase the clock speed, or increase the count of Shader units. Current limitations of silicon yields mean that increasing clock speeds may not always be possible, or even feasible. Increasing the Shader unit count is easier to achieve, and the results are startling. Another reason for unifying Shaders is the move towards making GPUs less graphics-specific and more like general-purpose computing engines.
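
To see why raising the Shader count pays off, here is a hedged, back-of-the-envelope sketch in Python. The one-operation-per-unit-per-clock model is ours purely for illustration (real GPUs issue more than this, and not every clock is useful work); the unit counts are those quoted above, and the 8800GTX figure uses its Shader-domain clock, which runs well above the 575 MHz core clock.

```python
def peak_shader_ops(shader_units, clock_mhz, ops_per_clock=1):
    """Toy estimate: peak shader operations per second."""
    return shader_units * clock_mhz * 1_000_000 * ops_per_clock

# One op per unit per clock is an illustrative assumption, not a real spec.
print(peak_shader_ops(24, 650) / 1e9)    # GeForce 7900GTX: 24 pipelines at 650 MHz -> ~15.6 billion ops/s
print(peak_shader_ops(128, 1350) / 1e9)  # GeForce 8800GTX: 128 SPs, Shader domain near 1.35 GHz -> ~172.8 billion ops/s
```

Even with such a crude model, the gap between the two cards comes almost entirely from the jump in Shader unit count rather than from raw core frequency.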

How Does The Core Work?

Any GPU is tasked with creating an image from a description of a scene, much like an artist. This scene contains several kinds of data, one of which is the set of geometric primitives (vertices and lines) to be viewed in the scene. Also present is the lighting information for the scene, as well as information about the luminance qualities of the objects in it (by objects we mean the geometric primitive data), i.e. the way each object reflects light. Finally, information about the viewer’s perspective and orientation is also important. To keep things clear, this process of image synthesis has always been viewed as a pipeline consisting of a number of stages.

Stage 1: Input

The GPU is designed to work with simple vertices, and for this purpose every complex shape in a 3D scene is first broken down into triangles: the basic building blocks of any 3D model, however complex. All shapes, from rectangles and cubes to curved surfaces, are broken down into triangles. Here’s where the software comes in. Developers use one of the computer graphics libraries (OpenGL or Direct3D) to feed each triangle into the graphics pipeline, one vertex at a time, and the GPU assembles vertices into triangles as and when needed.
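
To make the "everything becomes triangles" idea concrete, here is a minimal, hypothetical sketch (plain Python rather than OpenGL or Direct3D) of a rectangle being split into the two triangles the pipeline would actually receive:

```python
# A rectangle defined by four corner vertices (x, y, z).
rectangle = [
    (0.0, 0.0, 0.0),  # bottom-left
    (1.0, 0.0, 0.0),  # bottom-right
    (1.0, 1.0, 0.0),  # top-right
    (0.0, 1.0, 0.0),  # top-left
]

def quad_to_triangles(quad):
    """Split a four-vertex quad into the two triangles the pipeline works with."""
    a, b, c, d = quad
    return [(a, b, c), (a, c, d)]

for triangle in quad_to_triangles(rectangle):
    print(triangle)  # each triangle is then fed in, one vertex at a time
```

Curved surfaces work the same way, only with many more, much smaller triangles.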

These days, thanks to the near-infinite Shader programming possibilities available, developers are able to give objects advanced light-based properties: for example, the way water in a pond interacts with light, or the sub-surface scattering of light by translucent materials of varying densities, like skin and hair.

Stage 2: Transformations

Every object in a 3D scene is defined in its own localised coordinate system, which specifies the positions of its vertices. To work with the scene as a whole, the GPU has to convert all the objects into a common coordinate system. At this point the GPU cannot perform complex modifications to the objects for fear of warping the triangles; simple operations like scaling and rotation can be done. By representing each vertex in homogeneous coordinates, the GPU can perform a single operation on triangles en masse. The output of this stage of the pipeline is a stream of triangles positioned in the GPU’s common coordinate system, oriented to the viewer’s perspective.
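
The "single operation on triangles en masse" is a matrix multiply applied to homogeneous (x, y, z, w) vertices. A minimal sketch, assuming a made-up transform that scales by 2 and shifts objects 5 units along X:

```python
# Homogeneous coordinates let scaling, rotation and translation all be
# expressed as one 4x4 matrix, applied identically to every vertex.
def apply_matrix(m, v):
    x, y, z, w = v
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] * w for r in range(4))

transform = [          # scale by 2, then translate +5 along X, as one matrix
    [2, 0, 0, 5],
    [0, 2, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 1],
]

triangle = [(0, 0, 0, 1), (1, 0, 0, 1), (0, 1, 0, 1)]  # w = 1 marks a position
print([apply_matrix(transform, v) for v in triangle])
```

Because every vertex goes through the same matrix, the GPU can batch this work across thousands of triangles at once.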

Stage 3: Lighting

By now, each triangle is placed in a common, global coordinate system, and the GPU can calculate each triangle’s colour on the basis of the lights in the scene. The lighting could be as simple as a single source (a light bulb in a room), or as complex as a street lit up by a row of streetlights. In the latter case, the GPU is tasked with summing up the effects of each light source and computing their overall effect on the colours and luminosity of the triangles in the scene. Today’s GPUs are also complex enough to colour surfaces on the basis of their physical surface properties (matte, gloss, and so on). Advanced lighting techniques like specular highlighting, diffuse lighting and occlusion are also possible.
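
Here is a hedged sketch of the "sum the effect of each light source" step, using simple Lambertian (diffuse) shading; the light directions and colours are made-up values, and real GPUs use far richer models than this:

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def diffuse(normal, light_dir, light_colour, surface_colour):
    """Lambert's law: brightness scales with the cosine of the angle to the light."""
    n_dot_l = max(0.0, sum(a * b for a, b in zip(normalize(normal), normalize(light_dir))))
    return tuple(lc * sc * n_dot_l for lc, sc in zip(light_colour, surface_colour))

def shade(normal, surface_colour, lights):
    """Sum the contribution of every light in the scene, clamped to 1.0."""
    total = (0.0, 0.0, 0.0)
    for light_dir, light_colour in lights:
        contribution = diffuse(normal, light_dir, light_colour, surface_colour)
        total = tuple(min(1.0, t + c) for t, c in zip(total, contribution))
    return total

# Two hypothetical streetlights shining on a red, upward-facing surface.
streetlights = [((0, 1, 0), (1.0, 0.9, 0.7)), ((1, 1, 0), (0.4, 0.4, 0.5))]
print(shade((0, 1, 0), (0.8, 0.2, 0.2), streetlights))
```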

Stage 4: Getting Perspective

The next stage of the pipeline projects the (now coloured) triangles onto a virtual plane, as viewed from the user’s perspective. These coloured triangles and their respective coordinates are finally ready to be turned into pixels.
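
The projection boils down to a perspective divide: vertices farther from the viewer are pulled in towards the centre of the image. A minimal sketch, assuming a pinhole-style camera with an arbitrary focal length:

```python
def project(vertex, focal_length=1.0):
    """Project a camera-space vertex (x, y, z) onto a 2D image plane.
    Dividing by z is what makes distant objects appear smaller."""
    x, y, z = vertex
    return (focal_length * x / z, focal_length * y / z)

near_corner = (1.0, 1.0, 2.0)   # close to the viewer
far_corner = (1.0, 1.0, 10.0)   # same offset, five times farther away
print(project(near_corner))  # (0.5, 0.5)
print(project(far_corner))   # (0.1, 0.1) -- much nearer the centre of view
```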

Stage 5: Rasterization

Rasterization is, quite simply, the process of converting a vertex representation into a pixel representation. Here, an image described in a vector format is converted into a raster image consisting of dots, or pixels. In this stage, each pixel can be treated separately, and the GPU handles all the pixels in parallel; this is one of the reasons manufacturers upped the Shader unit count on GPUs, to speed up this parallel processing.
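
A minimal, assumption-laden sketch of the idea: walk every pixel in a triangle's bounding box and keep the ones whose centres fall inside it. The edge-function test used here is one common way of deciding coverage, not a description of any particular GPU's rasterizer:

```python
def edge(a, b, p):
    """Signed-area test: which side of the edge a->b the point p lies on."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])

def rasterize(tri, width, height):
    """Return the integer pixel coordinates covered by a 2D triangle."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    covered = []
    for y in range(max(0, int(min(ys))), min(height, int(max(ys)) + 1)):
        for x in range(max(0, int(min(xs))), min(width, int(max(xs)) + 1)):
            p = (x + 0.5, y + 0.5)  # sample at the pixel centre
            w0 = edge(tri[1], tri[2], p)
            w1 = edge(tri[2], tri[0], p)
            w2 = edge(tri[0], tri[1], p)
            # Inside if all edge tests agree in sign (either winding order).
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.append((x, y))
    return covered

print(rasterize([(1, 1), (8, 2), (4, 7)], 10, 10))
```

Every pixel produced by this loop is independent of the others, which is exactly why the work parallelises so well across many Shader units.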

Stage 6: Texturing

While pixels are already coloured by this stage, additional textures sometimes need to be applied for added realism. This is a cosmetic process, similar to the make-up used by ramp models, in which the image is draped in additional textures to add a further layer of detail and believability. These textures are stored in the GPU’s memory, where each pixel calculation can access them. This is where memory bandwidth and the quantity of memory play a vital role, which is why graphics cards with more memory generally perform better.
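
A tiny sketch of the idea: each pixel looks up a colour in a texture held in video memory, here modelled as a nested Python list and sampled nearest-neighbour using (u, v) coordinates between 0 and 1 (real GPUs add filtering, mip-mapping and compression on top of this):

```python
# A 2x2 checkerboard "texture" standing in for an image held in video memory.
texture = [
    [(255, 255, 255), (0, 0, 0)],
    [(0, 0, 0), (255, 255, 255)],
]

def sample(texture, u, v):
    """Nearest-neighbour lookup: map (u, v) in [0, 1] to a stored texel."""
    height = len(texture)
    width = len(texture[0])
    x = min(width - 1, int(u * width))
    y = min(height - 1, int(v * height))
    return texture[y][x]

print(sample(texture, 0.1, 0.1))  # white texel
print(sample(texture, 0.9, 0.1))  # black texel
```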

Stage 7: Hidden Surfaces

In all 3D scenes, some objects are in plain view while others are obscured behind them. It’s not as simple as writing each pixel to memory; in that case the most recently written triangle would simply overwrite whatever was there, so (for example) a chair placed behind a table could end up displayed while the table itself stayed hidden. This stage involves sorting triangles from back to front, and this is where a GPU’s depth buffer comes in. The depth buffer stores the distance of each pixel from the viewer, and before writing a pixel to the screen the GPU checks whether the pixel occupying that position is the closest to the user; only if it is, is it sent to the monitor.
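
The depth-buffer check described above can be sketched as follows; this is an assumption-level model of the logic, not a driver implementation, and the "table" and "chair" colours are stand-ins:

```python
def make_depth_buffer(width, height, far=float("inf")):
    # Every position starts out "infinitely far away".
    return [[far] * width for _ in range(height)]

def write_pixel(frame, depth, x, y, colour, z):
    """Z-buffer test: only the fragment closest to the viewer at (x, y) survives."""
    if z < depth[y][x]:
        depth[y][x] = z
        frame[y][x] = colour

frame = [["black"] * 4 for _ in range(4)]
depth = make_depth_buffer(4, 4)
write_pixel(frame, depth, 1, 1, "table", z=2.0)  # the table is nearer the viewer
write_pixel(frame, depth, 1, 1, "chair", z=5.0)  # the chair behind it is rejected
print(frame[1][1])  # "table"
```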

Memory

VRAM is present on the PCB of the graphics card and is as important to it as system RAM is to the CPU. Very basically, VRAM feeds the GPU with raw data and passes on the results after the GPU has worked on that data. VRAM is also sometimes called the ‘frame buffer’. In the past three years or so, the importance of VRAM has risen sharply, mainly because of the visual revolution taking place: users want as realistic an experience as possible while playing 3D games. In fact, games are the numero uno reason for the advances in graphics card performance over the years. From 32 MB of VRAM on the GeForce 256 DDR to 768 MB on the 8800GTX, the quantity of memory has increased more than 20-fold, in tandem with the improvements to the GPU core.

Not only has the quantity of video memory gone up, but the type of memory has changed too. Current graphics cards use GDDR3 memory, and although GDDR4 is waiting in the wings, it hasn’t seen much adoption yet.

The bus width of memory has also gone up, from the 64 bits of yore to 256-bit and even 384-bit buses (GeForce 8800GTX). All this speed is needed, too, as today’s games demand more than 50 times the resources of games released eight years ago. The ultimate goal of manufacturers is to raise the memory bandwidth figure of the graphics card. This is done to eliminate bottlenecks at the memory, since for much of the time the core has to wait for the much slower memory to supply it with fresh data. Here’s how you calculate the memory bandwidth of your video card (the same formula works for system memory too):

An 8800GTX has memory effectively clocked at 1,800 MHz, and this memory is 384 bits wide. The formula for total bandwidth is (clock speed × memory bus width) / 8 = (1,800 × 384) / 8 = 86,400 megabytes per second, or 86.4 GB/s (gigabytes per second).

To compare this figure with the fastest system memory, based on PC2-8500 DDR2 RAM (1,066 MHz): bandwidth = (1,066 × 64) / 8 = 8,528 MB/s, or about 8.5 GB/s.
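
The same formula as a small Python helper, using decimal megabytes and gigabytes to match the figures above:

```python
def memory_bandwidth_gbps(effective_clock_mhz, bus_width_bits):
    """Peak bandwidth = effective clock (MHz) * bus width (bits) / 8, reported in GB/s."""
    mb_per_second = effective_clock_mhz * bus_width_bits / 8
    return mb_per_second / 1000  # decimal gigabytes per second

print(memory_bandwidth_gbps(1800, 384))  # 8800GTX video memory: 86.4 GB/s
print(memory_bandwidth_gbps(1066, 64))   # PC2-8500 DDR2 system memory: ~8.5 GB/s
```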

As you can see, current-generation cards offer ten times the bandwidth of the fastest system memory, and with GDDR4 this figure is set to cross 100 GB/s. This enormous amount of bandwidth is used to feed the data-hungry GPU with enough raw data to keep it busy, which is no easy task considering that the GPU has an internal bandwidth of over 180 GB/s and minimal latencies. Very simply, VRAM is a buffer for the GPU: data fed to the GPU goes via VRAM, which holds it until the GPU’s pipelines need feeding again. VRAM holds vertex data, perspective data, lighting information and textures, but it cannot always hold every texture; what doesn’t fit is kept in the PC’s main memory, and the GPU accesses those textures directly through DMA (Direct Memory Access), with no CPU intervention. The resultant output of the GPU is also buffered in a part of the VRAM that is utilised as the ‘frame buffer’.

RAMDAC

The RAMDAC, or Random Access Memory Digital-to-Analog Converter, is the component of a video card responsible for converting digital images into analog signals that can be displayed by an analog monitor such as a CRT. The RAMDAC may be a separate chip on the graphics card’s PCB (as on the 8800GTX, for example) or may be integrated into the GPU die. The RAMDAC’s ability to support higher refresh rates depends on its internal data transfer rate (which depends on its speed, in MHz) and the colour precision in use. However, the concept of flicker applies only to CRTs, and with the advent of DVI (Digital Visual Interface) on most LCD monitors and graphics cards, the RAMDAC’s days are numbered. LCD and plasma screens work entirely in the digital domain and do not require a RAMDAC; if they have an analog input, the signal is simply converted back to digital after the RAMDAC has produced an analog signal from the original digital one. This double conversion is lossy, to say the least.
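
To see why the RAMDAC’s rated speed limits the refresh rates on offer, here is a rough estimate of the pixel clock a given CRT mode demands. The flat 25 per cent blanking overhead is our own simplifying assumption, not a VESA timing figure:

```python
def required_pixel_clock_mhz(width, height, refresh_hz, blanking_overhead=1.25):
    """Approximate RAMDAC pixel clock (MHz) needed to drive a CRT display mode."""
    return width * height * refresh_hz * blanking_overhead / 1e6

print(required_pixel_clock_mhz(1600, 1200, 85))  # ~204 MHz
print(required_pixel_clock_mhz(2048, 1536, 75))  # ~295 MHz
```

Against numbers like these, a typical 400 MHz RAMDAC has plenty of headroom at common resolutions but starts to run out at the very top end.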

Video BIOS

Like a motherboard’s BIOS, the VGA BIOS is a chip containing the firmware (hardwired low-level instructions) that controls the graphics card’s operation and allows the computer’s other components and software to interface with the card. It contains essential information about the graphics card, such as its core and memory clock speeds, memory timings, voltages, and even thermal thresholds (for warnings and shutdowns). Most graphics cards these days come with a re-flashable BIOS, which can improve performance or favourably affect stability. If the BIOS is flashed incorrectly, however, the card can be permanently damaged.

Outputs

Connectivity might seem a tiny part of a graphics card, but without standards you wouldn’t see any of that brilliance actually working. The following three connectors are common to nearly every graphics card:

HD-15: Commonly known as a D-Sub, or VGA connector. It is an analog connector with 15 pins and was designed for CRT displays. It has its issues: electrical noise (being an analog signal), image distortion, and sampling errors (due to scaling).
DVI: A digital standard developed for LCD and plasma panels. It carries the display’s native resolution without compression of any sort and avoids the drawbacks of the D-Sub connector.

S-Video: This interface was designed with other devices in mind that may need to connect to a display card, such as DVD players, large screens, game consoles, video recorders, and so on.

Interface

Let’s face it: graphics cards wouldn’t be the speed demons they are today if they couldn’t interface with PCs over high-bandwidth ports. A GPU works in tandem with the CPU, Northbridge (or Memory Controller Hub) and system memory, and its only means of communication happens to be that little slot it’s plugged into. The specifications of that little slot have changed a lot over the past few years, so much so that the latest interface isn’t even compatible with the cards that came before it. From IBM’s 16-bit ISA interface running at 8 MHz to today’s PCI Express, we’ve come a long way. Currently, graphics cards utilise the PCI Express interface (also called PCIe or PCI-E). A PCIe 1.1 x16 slot has a maximum bandwidth of 8 GBps in total, consisting of 16 parallel lanes, each capable of a maximum throughput of 250 MBps in each direction. PCIe 2.0, which has just debuted with the latest ATI Radeon 3xxx series and NVIDIA 8xxx series, doubles this to 16 GBps.
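
A quick sketch of where those PCIe figures come from; the per-lane rates are the published ones, and "total" here simply counts both directions at once:

```python
def pcie_bandwidth_gbps(lanes, per_lane_mbps, count_both_directions=True):
    """Aggregate PCI Express throughput in decimal GB/s."""
    per_direction = lanes * per_lane_mbps / 1000
    return per_direction * (2 if count_both_directions else 1)

print(pcie_bandwidth_gbps(16, 250))  # PCIe 1.1 x16: 8.0 GB/s across both directions
print(pcie_bandwidth_gbps(16, 500))  # PCIe 2.0 x16: 16.0 GB/s across both directions
```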

The Shape Of Things To Come

DX 10 wasn’t the big affair it was made out to be, and we’re not going to say “see, we told you so!” Both manufacturers, NVIDIA and ATI, have realised the folly of developing graphics cards for a standard that has yet to be properly implemented, and the fact that hardly anyone uses Windows Vista as a gaming platform doesn’t help. NVIDIA has just released the first card of their new 9xxx series: yes, the GeForce 9600GT has finally arrived. Based on a 65 nm manufacturing process, it’s a mid-range offering meant to take over the reins from the capable 8600GT. Rumours abound of a mighty 9800GTX, but the company itself is tight-lipped about specifications. One thing we do know is that these cards support DX 10, which is nothing new. ATI has already released a host of products, namely the 3xxx series, comprising the high-end Radeon 3870, the slightly slower Radeon 3850, and the mid-range Radeon 3650. But the specifications of these newcomers are hardly as revolutionary as the 8800GTX’s were at its launch. It seems we’ve reached another plateau, and the real meat may arrive once DX 10 matures as a platform and application support picks up. Till then, we’re sitting pretty, literally.

Michael Browne
Digit.in