Cypress architecture
We've now seen AMD's aims and their strategy to achieve them, so rather than keeping you on tenterhooks any longer, let's put you out of your misery and lift the lid on AMD's brand new architecture. After that we'll break it down into its component parts so that we can discuss each aspect of it in detail.

So, there it is in all its glory: Cypress, RV870, the powerhouse behind the Radeon HD 5800 series, call it what you will. The first thing you might notice is how similar this architecture's basic configuration is to AMD's offerings of recent generations, and you'd be absolutely right to think that. Although it boasts plenty of improvements (as we'll discuss shortly), and despite the changes required to bring about full DirectX 11 support, the basic architecture of the Radeon HD 5800 series is very heavily based upon that of its predecessor, the RV770 GPU. Not that that's a bad thing by any stretch of the imagination.
If you're looking for more information on DirectX 11 and what new functionality it brings to the table specifically, then you can read our own DirectX 11 overview right here.
Starting with the basics, the Radeon HD 5800 series is built on AMD's second-generation 40 nanometre GPU (the Radeon HD 4770 was their first 40 nanometre part), with a die size of 334 mm² against the 263 mm² of RV770's 55 nanometre die, a 1.27x increase in area. That pales in comparison to the increase in transistor count, however: the Radeon HD 4870's 956 million transistors are dwarfed by the 2.15 billion employed by the Radeon HD 5800 series, roughly 2.25 times as many.
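To put those two growth figures side by side, the short sketch below simply reworks the die sizes and transistor counts quoted above (it uses the Radeon HD 4870's 956 million as the RV770 figure, as the text does). The dictionary layout and function names are just for illustration.

```python
# Figures quoted in the text: die area in mm², transistor count in millions.
RV770 = {"area_mm2": 263, "transistors_m": 956}
CYPRESS = {"area_mm2": 334, "transistors_m": 2150}


def ratio(new: dict, old: dict, key: str) -> float:
    """Growth factor of one metric from the old die to the new one."""
    return new[key] / old[key]


def density(die: dict) -> float:
    """Transistor density in millions of transistors per mm²."""
    return die["transistors_m"] / die["area_mm2"]


area_growth = ratio(CYPRESS, RV770, "area_mm2")
transistor_growth = ratio(CYPRESS, RV770, "transistors_m")

print(f"Die area grows {area_growth:.2f}x")
print(f"Transistor count grows {transistor_growth:.2f}x")
print(f"Density: {density(RV770):.2f} -> {density(CYPRESS):.2f} million per mm²")
```

In other words, the move from 55 to 40 nanometres lets AMD pack roughly 1.8 times as many transistors into each square millimetre, which is how a 1.27x larger die can hold 2.25x the transistors.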
Graphics engine

As you might expect, we start our more in-depth discussion at the beginning. Rendering data passed through by the architecture's command processor arrives at the Radeon HD 5800 series' dual rasterisers (previous architectures used a single rasteriser), which increase the amount of data that can be fed to the rest of the GPU.
This portion of the core also houses the tessellation unit, which has finally become a required part of the GPU as of DirectX 11. This is what AMD are now referring to as their sixth-generation tessellator, descended from technology that has featured both on previous AMD graphics boards and in the GPU of Microsoft's Xbox 360. Most of the major changes to this particular tessellation unit relate to DirectX 11, with the introduction of capabilities to process hull shaders (which effectively calculate the level of tessellation required before passing it to the tessellator itself) and domain shaders (which output the final tessellated data as vertices to be handled as normal by the rest of the rendering process). On top of this, the Radeon HD 5800 series tessellator features a new algorithm designed to reduce the artifacts that can otherwise appear during the tessellation process.
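The division of labour between hull shader, tessellator and domain shader can be sketched in a few lines. This is a deliberately simplified model, not AMD's or Microsoft's algorithm: the distance-based factor in `hull_stage` is a hypothetical heuristic, and the uniform barycentric grid in `tessellator` stands in for the real DirectX 11 tessellator's more elaborate partitioning patterns. The domain stage here is a plain linear blend, where a real domain shader might evaluate a curved surface or apply displacement.

```python
def hull_stage(distance: float, max_factor: int = 16) -> int:
    """Hypothetical hull-shader logic: pick a tessellation factor,
    subdividing more finely the closer the patch is to the camera."""
    factor = int(max_factor / max(distance, 1.0))
    return max(1, min(max_factor, factor))


def tessellator(factor: int) -> list[tuple[float, float, float]]:
    """Simplified fixed-function step: emit barycentric (u, v, w) points
    on a uniform grid covering a triangle domain."""
    points = []
    for i in range(factor + 1):
        for j in range(factor + 1 - i):
            k = factor - i - j
            points.append((i / factor, j / factor, k / factor))
    return points


def domain_stage(uvw, corners):
    """Domain-shader step: turn one barycentric point into a vertex position
    by blending the patch's three corner positions."""
    u, v, w = uvw
    # zip(*corners) walks the x, y and z components of all three corners.
    return tuple(u * a + v * b + w * c for a, b, c in zip(*corners))


corners = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
factor = hull_stage(distance=4.0)  # 16 / 4 gives a factor of 4
points = tessellator(factor)
vertices = [domain_stage(p, corners) for p in points]
print(factor, len(vertices))
```

With a factor of 4 the triangle is split into a grid of 15 vertices; halving the distance doubles the factor, which is the sort of level-of-detail decision the hull shader exists to make before the fixed-function unit does the actual subdivision.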
The geometry shader in this engine also receives some incremental performance improvements, while OpenGL rendering gets specific enhancements to line rendering performance and clipping speed, alongside the introduction of 12-bit sub-pixel precision.
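As a rough illustration of what "12-bit sub-pixel precision" means numerically, the sketch below snaps a screen-space coordinate to a fixed-point grid with 12 fractional bits, giving 4096 addressable positions inside each pixel. It assumes a conventional round-to-nearest fixed-point snap; how AMD's hardware actually stores coordinates internally isn't described here, and the function names are hypothetical.

```python
SUBPIXEL_BITS = 12
STEPS_PER_PIXEL = 1 << SUBPIXEL_BITS  # 4096 positions within each pixel


def snap_to_subpixel(coord_px: float) -> int:
    """Convert a coordinate in pixels to fixed point with SUBPIXEL_BITS
    fractional bits, rounding to the nearest step (Python's round()
    breaks exact ties towards even)."""
    return round(coord_px * STEPS_PER_PIXEL)


def from_fixed(fixed: int) -> float:
    """Convert a fixed-point coordinate back to pixels."""
    return fixed / STEPS_PER_PIXEL


x = 10.3
snapped = snap_to_subpixel(x)
error = abs(from_fixed(snapped) - x)
# The snapping error can never exceed half a step, i.e. 1/8192 of a pixel.
print(snapped, error <= 0.5 / STEPS_PER_PIXEL)
```

The finer the grid, the less a vertex can drift from its true position, which matters most for long, thin primitives such as the lines that OpenGL workstation applications draw in large numbers.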
Finally for this section of the GPU core, and again in keeping with AMD's focus on full DirectX 11 support, the new architecture supports "pull model interpolation". This lets the graphics board use its Stream Processors for interpolation via some new instructions, which in turn reduces the amount of fixed-function hardware required by this part of the pipeline, with little in the way of a performance hit.
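The idea behind pull model interpolation is that the shader itself asks for a vertex attribute to be evaluated at a position of its choosing (a multisample location or an offset, say) instead of only receiving a value pre-interpolated at the pixel centre. In HLSL this surfaces as Shader Model 5 intrinsics such as `EvaluateAttributeAtSample`. The sketch below is a plain barycentric interpolation routine of the kind that work boils down to when it runs as ordinary arithmetic on the Stream Processors; the triangle, attribute values and sample offset are made-up illustrations.

```python
Point = tuple[float, float]


def barycentric(p: Point, a: Point, b: Point, c: Point) -> tuple[float, float, float]:
    """Barycentric weights of point p relative to triangle abc."""
    def edge(p0: Point, p1: Point, q: Point) -> float:
        # Signed (doubled) area of triangle p0, p1, q.
        return (p1[0] - p0[0]) * (q[1] - p0[1]) - (p1[1] - p0[1]) * (q[0] - p0[0])

    area = edge(a, b, c)
    return edge(b, c, p) / area, edge(c, a, p) / area, edge(a, b, p) / area


def evaluate_attribute(p: Point, tri, values) -> float:
    """'Pull'-style evaluation: interpolate a per-vertex attribute at any
    position the shader requests, not just at the pixel centre."""
    wa, wb, wc = barycentric(p, *tri)
    va, vb, vc = values
    return wa * va + wb * vb + wc * vc


tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
brightness = (0.0, 1.0, 0.5)   # hypothetical per-vertex attribute

pixel_centre = (1.5, 1.5)
sample_offset = (0.25, -0.25)  # e.g. one multisample position
centre_value = evaluate_attribute(pixel_centre, tri, brightness)
sample_value = evaluate_attribute(
    (pixel_centre[0] + sample_offset[0], pixel_centre[1] + sample_offset[1]),
    tri, brightness)
print(centre_value, sample_value)
```

Because this is nothing more than a handful of multiplies and adds, it maps naturally onto general-purpose shader hardware, which is why moving it there lets the dedicated interpolation logic shrink.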
Stream Processors

Now we get to the real meat of AMD's new architecture: the Stream Processors (or Thread Processors, as AMD tend to call them). The most important number here for most of you is that a full Radeon HD 5800 series core features twice as many Stream Processors as RV770, up from 800 on that architecture to 1600 here. This change also sees the number of SIMD units in the graphics core doubled, from ten SIMD units of eighty Stream Processors each in RV770 to twenty SIMDs, again housing eighty Stream Processors each.
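The headline counts follow directly from the SIMD figures quoted above. Since each core houses five Stream Processors, eighty Stream Processors per SIMD works out to sixteen cores per SIMD; that sixteen is derived here rather than stated in the text.

```python
SP_PER_CORE = 5                          # five Stream Processors per core
SP_PER_SIMD = 80                         # eighty Stream Processors per SIMD
CORES_PER_SIMD = SP_PER_SIMD // SP_PER_CORE  # derived: 16 cores per SIMD


def total_stream_processors(simd_units: int) -> int:
    """Stream Processor count for a GPU built from identical SIMD units."""
    return simd_units * CORES_PER_SIMD * SP_PER_CORE


rv770 = total_stream_processors(10)
cypress = total_stream_processors(20)
print(rv770, cypress)
```

Doubling the SIMD count while leaving each SIMD's internal layout alone is exactly what doubles the total.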
The general layout of this superscalar architecture is the same as in AMD's previous generation, with the architecture further broken down into cores, each of which houses five Stream Processors. Four of these are standard units that can each handle a single 32-bit floating point instruction per clock cycle, while the fifth Stream Processor in each core can handle special functions (SIN, COS and EXP, for example) as well as a 32-bit floating point MAD (multiply-add) per clock. Each group of five Stream Processors also has its own dedicated branching unit, as well as general purpose registers to help keep them fed with all of the data they need. The GPU's Ultra-Threaded Dispatch Processor decides what work is passed to each SIMD, before each core's own logic schedules that work across its particular subset of Stream Processors.
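One way to picture the five-wide core is as five slots per clock: four that take ordinary operations and a fifth that can take either an ordinary operation or a special function. The toy model below packs a stream of operations into such bundles. It is a simplified illustration under stated assumptions, not AMD's scheduler: it treats every operation as independent (a real shader compiler must respect data dependencies), and the operation names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical op names; only the fifth unit can run these special functions.
SPECIAL_OPS = {"SIN", "COS", "EXP", "LOG", "RSQ"}
SLOTS_PER_CORE = 5  # four standard units plus one special-function unit


@dataclass
class Bundle:
    """The work one five-wide core issues in a single clock (simplified)."""
    ops: list[str] = field(default_factory=list)

    def has_special(self) -> bool:
        return any(op in SPECIAL_OPS for op in self.ops)

    def accepts(self, op: str) -> bool:
        if len(self.ops) == SLOTS_PER_CORE:
            return False
        # A special function needs the fifth unit, so at most one per bundle.
        # Plain ops can use any free unit, including an unclaimed fifth one.
        return op not in SPECIAL_OPS or not self.has_special()


def pack(ops: list[str]) -> list[Bundle]:
    """Greedily pack ops, in order, into five-wide bundles.

    Assumes every op is independent of the others.
    """
    bundles: list[Bundle] = []
    current = Bundle()
    for op in ops:
        if not current.accepts(op):
            bundles.append(current)
            current = Bundle()
        current.ops.append(op)
    if current.ops:
        bundles.append(current)
    return bundles


mixed = pack(["MUL", "ADD", "MAD", "SIN", "ADD", "COS", "MUL"])
print([b.ops for b in mixed])
# Three special functions in a row cannot share a clock:
print(len(pack(["SIN", "COS", "EXP"])))
```

The second case shows why keeping all five slots busy depends so heavily on the instruction mix: a run of special functions leaves four of the five units idle on every clock.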
As well as increasing the sheer number of Stream Processors available to this new GPU, AMD have made a number of other improvements. These range from the necessary inclusion of DirectX 11's bit-level operations (bit counts, inserts, extracts and the like) through to more performance-centric decisions aimed at increasing the architecture's IPC (Instructions Per Clock), such as the ability to co-issue a MUL and a dependent ADD in a single clock, and fused multiply-add capabilities. Another addition along those lines is a SAD (Sum of Absolute Differences) instruction which, when leveraged, has the potential to offer speed improvements of up to 12x, largely for GPGPU tasks such as video encoding. Although this particular instruction isn't exposed by DirectX, an OpenCL extension does allow it to be used, so it'll be interesting to see if and how developers take it up once OpenCL applications start to roll out.
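The three kinds of operation mentioned above are easy to illustrate in software. The sketch below uses hypothetical helper names (they are not the HLSL intrinsics or the OpenCL extension's functions) and Python's 64-bit floats rather than the GPU's 32-bit ones, but the behaviour carries over. It shows bit counting, extraction and insertion on 32-bit values; how a fused multiply-add, which rounds only once, can differ from a separate multiply followed by an add; and a SAD-based block match of the kind a video encoder runs when searching for motion between frames.

```python
from fractions import Fraction

MASK32 = 0xFFFFFFFF


# --- Bit-level operations on 32-bit values ---
def count_bits(x: int) -> int:
    """Number of set bits in a 32-bit value."""
    return bin(x & MASK32).count("1")


def bitfield_extract(x: int, offset: int, width: int) -> int:
    """Extract `width` bits starting at bit `offset` (unsigned)."""
    return (x >> offset) & ((1 << width) - 1)


def bitfield_insert(base: int, insert: int, offset: int, width: int) -> int:
    """Replace `width` bits of `base` at `offset` with the low bits of `insert`."""
    mask = ((1 << width) - 1) << offset
    return ((base & ~mask) | ((insert << offset) & mask)) & MASK32


# --- Separate multiply-then-add versus fused multiply-add ---
def mul_then_add(a: float, b: float, c: float) -> float:
    product = a * b      # the product is rounded here...
    return product + c   # ...and the sum is rounded again here


def fused_mul_add(a: float, b: float, c: float) -> float:
    # Exact rational arithmetic, then a single rounding back to float.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# --- Sum of Absolute Differences, the core of block-matching motion search ---
def sad(block_a: list[int], block_b: list[int]) -> int:
    return sum(abs(x - y) for x, y in zip(block_a, block_b))


def best_match(target: list[int], candidates: list[list[int]]) -> int:
    """Index of the candidate block with the lowest SAD against `target`."""
    return min(range(len(candidates)), key=lambda i: sad(target, candidates[i]))


a, b, c = 1 + 2.0**-30, 1 - 2.0**-30, -1.0
print(mul_then_add(a, b, c), fused_mul_add(a, b, c))
print(best_match([10, 20, 30, 40], [[0, 0, 0, 0], [11, 19, 30, 41], [10, 20, 30, 50]]))
```

In the multiply-add case the exact product is 1 - 2^-60, which rounds to exactly 1.0 before the addition, so the separate version returns 0.0 while the fused version keeps the tiny -2^-60 result. A real encoder evaluates SAD over many candidate blocks per frame, which is why a dedicated instruction for it can pay off so handsomely.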
All of this adds up to a vast amount of processing power, giving a full Radeon HD 5870 configuration 2.7 TeraFLOPS to play with in normal single precision (against 1.2 TeraFLOPS for the Radeon HD 4870), and 544 GigaFLOPS when handling 64-bit double precision data.
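Those peak figures can be reconstructed from the Stream Processor counts, assuming each Stream Processor retires one MAD (counted as two floating point operations) per clock, and assuming the reference core clocks of 850 MHz for the Radeon HD 5870 and 750 MHz for the Radeon HD 4870, which aren't given in this section. For double precision, the sketch assumes one double-precision MAD per five-wide core per clock, an assumption that happens to reproduce the 544 GigaFLOPS figure.

```python
def peak_gflops(units: int, clock_mhz: int, flops_per_clock: int = 2) -> float:
    """Theoretical peak: every unit retires one MAD (2 FLOPs) each clock."""
    return units * flops_per_clock * clock_mhz / 1000


hd5870_sp = peak_gflops(1600, 850)       # single precision, 850 MHz assumed
hd4870_sp = peak_gflops(800, 750)        # single precision, 750 MHz assumed
# Double precision: assume one DP MAD per five-wide core (1600 / 5 = 320 cores).
hd5870_dp = peak_gflops(1600 // 5, 850)
print(hd5870_sp, hd4870_sp, hd5870_dp)
```

These are theoretical ceilings; real shaders only approach them when every slot in every core is kept busy on every clock.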