CPU Evolution - VLIW CPUs

What is the real advantage of VLIW CPUs?

CPU performance can be increased in two ways. The first is to increase clock speed. However, silicon technology has its limits: to raise the clock speed, the CPU must be manufactured with smaller transistors. At the moment (Jan 2000) the feature size is 0.18 µm, and the clock speed limit for 0.18 µm technology is somewhere around 800 MHz. Although 2000 MHz is not far away (2002 - 2003), doubling the clock speed does not mean a 100% increase in performance, mostly because of the slow memory bus (800 MHz Rambus RAM is no panacea - Rambus modules are only 16 bits wide, unlike 64-bit PC100/PC133 SDRAM DIMMs). So there is a second, no less tricky way: increasing the number of instructions the CPU can execute in a single clock cycle. First of all, this requires additional functional units, which add a large number of transistors and certainly do not help to reach top MHz. And having many functional units does not guarantee that many instructions will actually execute simultaneously.
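To see why doubling the clock does not double performance, here is a rough back-of-the-envelope sketch. It assumes that run time splits into a part that scales with the CPU clock and a part spent waiting on memory that does not scale at all. The function name and the memory fractions are hypothetical, chosen only for illustration, and are not measurements of any real CPU.

```python
def speedup_from_clock(memory_fraction: float, clock_factor: float) -> float:
    """Overall speedup when only the CPU-bound share of run time gets faster.

    memory_fraction: assumed share of the original run time spent waiting
                     on memory (hypothetical, not a measured value).
    clock_factor:    how many times faster the CPU clock becomes.
    """
    cpu_fraction = 1.0 - memory_fraction
    # The CPU-bound part shrinks with the clock; the memory-bound part stays.
    new_time = cpu_fraction / clock_factor + memory_fraction
    return 1.0 / new_time


# Doubling the clock with different assumed memory-wait shares.
for memory_fraction in (0.0, 0.25, 0.5):
    print(memory_fraction, round(speedup_from_clock(memory_fraction, 2.0), 2))
```

Under these assumptions, a program that spends a quarter of its time waiting on memory gets only 1.6 times faster from a doubled clock, and one that waits half the time gets about 1.33 times faster.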

Imagine a sequence of 8 CPU instructions (A, B, C, D, E, F, G, H) and a CPU capable of executing 4 instructions per clock cycle (which most modern RISC CPUs can theoretically do). The CPU cannot execute instruction B if it depends on the result of A. But it can execute instruction C if C does not depend on A or B. This is called out-of-order execution. Furthermore, the CPU often cannot execute instructions sequentially because of branches (which occur roughly every 6 instructions) that interrupt sequential execution. Let's assume that, depending on the result of C, either D or E will be executed (the simplest example: if x is greater than y, then z = 1, otherwise z = 2). In this case the CPU will try to guess the result of C and will execute D or E accordingly. This is called branch prediction. So, ideally, our virtual CPU will execute the 8-instruction sequence in 2 clock cycles: cycle 1 - A+C+D+F, cycle 2 - B+G+H; E is not used this time. In real life this most likely will not happen. Why?

Branch prediction may be wrong. Then the CPU has to flush its pipelines, load E and execute it.
Certain instructions cannot be executed in parallel even if they do not depend on each other. Most RISC CPUs, for example, can execute only 2 integer operations at once.
In the worst case, instructions or data have to be loaded from RAM, which can take up to 100 clock cycles.

All modern CPUs have very complicated units for 1) rearranging instructions at run time for the most effective out-of-order execution and 2) branch prediction.
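The ideal two-cycle schedule from the 8-instruction example can be reproduced with a toy model of such a reordering unit. This is only a sketch: the greedy rule below is a simplification, and the dependencies of G on F and of H on D are assumptions made here so that the example works out (the text above only states that B depends on A). The names `schedule`, `PROGRAM` and `DEPS` are hypothetical.

```python
def schedule(instructions, deps, width):
    """Greedy issue model: each cycle, take up to `width` instructions, in
    program order, whose producers all finished in an earlier cycle."""
    done = set()
    pending = list(instructions)
    cycles = []
    while pending:
        # Readiness is judged against results available before this cycle.
        issued = [i for i in pending if deps.get(i, set()) <= done][:width]
        if not issued:
            raise ValueError("cyclic dependency, nothing can be issued")
        cycles.append(issued)
        done.update(issued)
        pending = [i for i in pending if i not in issued]
    return cycles


# E is left out: the branch C is assumed to be predicted toward D, so D is
# issued speculatively without waiting for the result of C.
PROGRAM = ["A", "B", "C", "D", "F", "G", "H"]
# Only B -> A comes from the text; G -> F and H -> D are assumed here.
DEPS = {"B": {"A"}, "G": {"F"}, "H": {"D"}}

for width in (4, 2):
    print(width, schedule(PROGRAM, DEPS, width))
```

With 4 instructions per cycle the model gives A+C+D+F followed by B+G+H, as in the text. Limiting it to 2 instructions per cycle, in the spirit of the "only 2 integer operations at once" restriction, stretches the same sequence to 4 cycles.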

So, how do VLIW CPUs overcome those limitations? Pretty smartly. First of all, the VLIW compiler packs several RISC instructions that can be executed in parallel into one long instruction. Second, a VLIW CPU eliminates branch prediction errors by executing all branch outcomes; after evaluating the branch, it throws away the unneeded results. A VLIW CPU has more functional units and registers than a CISC or RISC CPU but does not need instruction reordering and branch prediction units. Third, a VLIW CPU has a feature called speculative loading: the compiler inserts speculative load and check instructions to avoid idle CPU cycles while data is being loaded from slow memory. In theory, VLIW CPUs have a lower transistor count and scale better than RISC CPUs.
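The "execute all branch outcomes, then throw away the unneeded result" idea can be modelled in a few lines, using the x, y, z example from earlier. This is a conceptual sketch only, not a description of how any particular VLIW chip works: Python runs everything sequentially, and the function names are made up for illustration.

```python
def with_branch(x, y):
    """Conventional form: the CPU must guess which way the branch goes."""
    if x > y:       # C: the branch condition
        z = 1       # D
    else:
        z = 2       # E
    return z


def both_outcomes(x, y):
    """Branch-free form: compute the condition and both results (work that
    independent functional units could do side by side), then keep one."""
    condition = int(x > y)   # C, evaluated as 1 or 0
    z_if_true = 1            # D, computed regardless of the condition
    z_if_false = 2           # E, computed regardless of the condition
    # Select the needed result; the other one is simply discarded.
    return condition * z_if_true + (1 - condition) * z_if_false


print(both_outcomes(5, 3), both_outcomes(3, 5))
```

Both functions always return the same value. The second one does more work in total, but there is nothing to guess, so there is no misprediction to recover from.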

Let's draw an analogy with a mechanical process.

Imagine robots that take boxes from a conveyor and put them into containers. The CISC robot has to deal with boxes of different shapes and sizes. In contrast, the RISC robot is supplied only with boxes of the same shape and size. The CISC robot with a RISC core has a special device that splits boxes into pieces, which in turn must all be the same size and shape. The VLIW robot receives small boxes already glued together to fit the container. Now think: which robot will have the top performance and the least complex mechanics (taking into account that all containers have the same capacity)?

Does RISC really fall behind?

In the future, maybe, but not right now.

Let's face reality: the Alpha RISC CPU is still the fastest single-chip CPU in the world.
With an estimated price tag of $5,000.00, the Intel/HP Itanium is not for the ordinary PC.
Itanium will run x86 software in emulation mode (projected performance when running x86 software is the same as a Pentium III/500). Native EPIC software will be scarce at first.
Native Itanium software requires very complex compilers. Complex software usually means more hard-to-find bugs.
Advanced RISC CPUs like the IBM Power3 (but not PowerPC, unfortunately) have a feature similar to speculative loading (prefetch instructions).
RISC CPUs have some hidden potential: there are optimization techniques that help the CPU. The first is called instruction scheduling. The compiler (a special program that translates source code written in a programming language into CPU instructions) can arrange instructions for better parallel execution. As a side effect, the program becomes somewhat CPU dependent. The second is called dynamic optimization based on profiling. A special programming tool analyzes the program at run time, traps branches and records them in a log file. Later this log file is used to generate hints that explicitly tell the CPU which branch outcome is most likely to occur. The result depends heavily on the data used during profiling. The effect of instruction scheduling and dynamic optimization was tested using the BYTEmark test suite, which is available as source code (please refer to my optimization page for more info).
Not all tasks really need performance beyond that of an old 200 MHz CPU. A word processor or e-mail client will surely remain usable without a 1500 MHz speed demon for the foreseeable future. Only software that performs heavy calculations, such as scientific, CAD, graphics, multimedia, video and sound programs, cries out for speed. This is why Motorola built the Velocity Engine, optimized especially for this kind of task, into the PowerPC G4. Real-world tests show that a Power Macintosh G4/450 MHz is still faster than a Windows PC running an AMD Athlon/700 MHz in most Photoshop tasks.
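The profiling-based optimization mentioned in the list above (record branches at run time, then turn the log into hints) can be sketched roughly as follows. The log format, the branch names and the majority-vote rule are all assumptions made for illustration; real profiling tools use their own formats and heuristics.

```python
from collections import Counter


def record_profile(branch_log):
    """Count taken / not-taken outcomes per branch.

    branch_log: iterable of (branch_id, taken) pairs, standing in for the
                log file written during a profiling run (hypothetical format).
    """
    counts = {}
    for branch_id, taken in branch_log:
        counts.setdefault(branch_id, Counter())[taken] += 1
    return counts


def make_hints(counts):
    """Hint True ("likely taken") when a branch was taken at least as often
    as not during profiling; a simple majority vote, assumed here."""
    return {branch: c[True] >= c[False] for branch, c in counts.items()}


# Made-up profiling run: a loop branch that is almost always taken and an
# error check that never fires.
log = [("loop", True)] * 9 + [("loop", False)] + [("error_check", False)] * 5
print(make_hints(record_profile(log)))
```

The hints are only as good as the profiling run: if real data made the error check fire often, the "not taken" hint generated here would point the wrong way, which is exactly why the result depends so much on the data used during profiling.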

Below you can find SPEC performance numbers for some currently shipping (January 2001) and sampled CPUs (higher scores mean better performance). For sampled CPUs (Itanium, Sun UltraSPARC III, Alpha 21364) the SPEC numbers are estimates.

How do you get the most out of current RISC? Do benchmark scores really represent real-world performance? Let's review some optimization & benchmarking tricks. Continued...