OpenPA.net

PA-RISC CPU Architecture

Overview

More in-depth explanations of some technologies and features present in PA-RISC processors.

Floating Point Unit (FPU)

The Floating Point Unit is an assist processor logically added to a system to improve the performance of floating-point operations. It can sit on a separate chip (e.g. PA-7000) or be integrated onto the central CPU die (all later PA-RISC CPUs). The FPU executes special floating-point instructions to perform arithmetic on its own set of independent registers (register file) and to move data between these registers and the lower levels of the system's memory hierarchy. The FPU execution stage is pipelined. All PA-RISC FPUs contain thirty-two 64-bit registers, which can also be used as sixty-four 32-bit registers or sixteen 128-bit registers.
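The three register views can be illustrated with a small model. This is only a sketch: the class and method names are hypothetical, and the big-endian layout (the first 32-bit half of a 64-bit register being the most significant one) is an assumption, not something stated above.

```python
import struct

class FPRegisterFile:
    """Toy model: thirty-two 64-bit registers backed by one 256-byte buffer."""

    def __init__(self):
        self.raw = bytearray(32 * 8)

    def write64(self, n, value):
        # Store a 64-bit value into register n (0..31), big-endian (assumed).
        struct.pack_into(">Q", self.raw, n * 8, value)

    def read64(self, n):
        # View the file as thirty-two 64-bit registers.
        return struct.unpack_from(">Q", self.raw, n * 8)[0]

    def read32(self, n):
        # View the same storage as sixty-four 32-bit registers (0..63).
        return struct.unpack_from(">I", self.raw, n * 4)[0]

    def read128(self, n):
        # View the same storage as sixteen 128-bit registers (0..15).
        hi, lo = struct.unpack_from(">QQ", self.raw, n * 16)
        return (hi << 64) | lo

regs = FPRegisterFile()
regs.write64(2, 0x1111111122222222)
regs.write64(3, 0x3333333344444444)
```

In this model, 64-bit register 2 appears as 32-bit registers 4 and 5, and 64-bit registers 2 and 3 together form 128-bit register 1.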


Memory and I/O Controller (MIOC)

The Memory and I/O Controller in the PA-7100LC and PA-7300LC processors integrates both the DRAM/cache controller and the I/O controller on the processor die. It is very similar on both CPUs; not much was changed in the transition from the 7100LC to the 7300LC.

The integrated memory controller requires only buffers and DRAM modules to build up the complete memory subsystem. A wide range of programmable options is implemented:

On the PA-7300LC the memory controller also includes the SLC (Second Level Cache Controller). It provides an optional L2 cache, ranging from 32KB to 8MB (on almost all systems with a PA-7300LC, 2MB of L2 was used). It shares the data bus with the DRAM subsystem, so it has the same width (64/128-bit) and same optional SEDC error control.


Translation Lookaside Buffer (TLB)

The Translation Lookaside Buffer is a hardware structure that performs virtual-to-physical memory address translations. The TLB takes a virtual page number and returns the corresponding physical page number. The PA-7000 is the last PA-RISC processor to use separate instruction and data (I/D) TLBs; all later PA 1.1 and 2.0 CPUs use a combined TLB structure.

Hitachi’s PA-RISC 1.1 derivatives also used split TLBs:

Most interestingly, the older PA-RISC 1.0 processors (pre-PA-7000) have huge TLBs (even by today’s standards):

The TLB memory on these earlier CPUs was implemented mostly off-chip/off-die via separate memory (SRAM) chips.
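The basic lookup (virtual page number in, physical page number out) can be sketched as follows. This is a simplified illustration rather than a description of any PA-RISC implementation: it assumes 4 KB pages, ignores PA-RISC space identifiers and access rights, and uses a hypothetical dictionary as the page table and a simple oldest-first replacement policy.

```python
PAGE_SHIFT = 12                     # 4 KB pages (assumed)
PAGE_MASK = (1 << PAGE_SHIFT) - 1

class TLB:
    """Toy combined TLB: caches vpn -> ppn mappings."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}           # dicts keep insertion order

    def translate(self, vaddr, page_table):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
        hit = vpn in self.entries
        if not hit:
            # TLB miss: fetch the mapping from the (hypothetical) page table.
            ppn = page_table[vpn]   # a KeyError here would model a page fault
            if len(self.entries) >= self.capacity:
                # Evict the oldest entry to make room.
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpn] = ppn
        return (self.entries[vpn] << PAGE_SHIFT) | offset, hit

page_table = {0x12345: 0x00ABC}
tlb = TLB(capacity=2)
first = tlb.translate(0x12345678, page_table)    # miss, then filled
second = tlb.translate(0x12345678, page_table)   # hit
```

The page offset (the low 12 bits here) passes through unchanged; only the page number is translated.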

Translation process

TLB miss handling implementations


Block Translation Lookaside Buffer (BTLB)

Like the TLB, the BTLB provides virtual-to-physical address translations. However, the BTLB maps large address ranges rather than single pages as the TLB does. These large address ranges are called block translations, hence the name Block Translation Lookaside Buffer. Block translations are useful for virtual address ranges that do not get paged in or out.

BTLBs were only implemented on 32-bit PA-RISC processors (PA-7x00). The 64-bit versions instead implement variable page sizes, so any TLB entry can map more than 4 KB.
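A block translation can be thought of as a (virtual base, physical base, size) triple that covers a whole address range. The following sketch uses hypothetical names and example addresses, and makes no claim about how real BTLB entries are encoded or how many of them exist:

```python
from typing import NamedTuple

class BlockEntry(NamedTuple):
    vbase: int   # first virtual address of the block
    pbase: int   # first physical address of the block
    size: int    # block length in bytes

def btlb_lookup(entries, vaddr):
    """Return the physical address for vaddr, or None if no block covers it."""
    for e in entries:
        if e.vbase <= vaddr < e.vbase + e.size:
            return e.pbase + (vaddr - e.vbase)
    return None  # fall back to the ordinary per-page TLB

# One 1 MB block (example values, not taken from a real system).
blocks = [BlockEntry(vbase=0x10000000, pbase=0x00400000, size=0x100000)]
```

A single entry like this replaces the 256 separate 4 KB page entries that the same range would otherwise occupy in the TLB.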


Superscalar execution

A superscalar processor implementation decodes, dispatches and executes multiple instructions per cycle if the dependencies between the instructions permit, that is, if the instruction stream contains independent instructions. Superscalarity is most easily gained from a decoupled floating point unit (FPU), which executes floating-point operations independently of the (integer) ALU. More complicated variations allow parallel load/store operations, integer calculations and so on; these need a more complex CPU design that analyzes the instructions and branches.

Every PA-RISC processor from the PA-7100 upwards implements superscalar execution. Instructions that proceed together through the execution pipeline form a bundle; this is called instruction bundling. Superscalar execution is functionally transparent to software: the effects of any given instruction are the same whether it was executed as part of a bundle or alone. Bundling rules are applied at run-time by the hardware, so optimal performance is only reached when the instructions are ordered such that the processor can use its full superscalar potential.

Several kinds of restrictions are placed upon the instruction bundling:

For bundling purposes, all instructions are divided into classes:

PA-RISC superscalar instruction classes
Class   Description
FLOP    Floating point operation
LDST    Loads and stores
FLEX    Integer ALU
MM      Shifts, extracts, deposits
NUL     Might nullify successor
BV      BV, BE
BR      Other branches
FSYS    FTEST and FP status/exception
SYS     System control instructions

PA-7100LC/PA-7300LC superscalar capabilities

These are 2-way superscalar processor implementations with two integer ALUs and one FPU. Notably, only one of the two ALUs is capable of handling loads, stores and shifts.

Allowed bundles

PA-7100LC/PA-7300LC allowed instruction bundles
First (older) instruction   Second (younger) instruction
FLOP                      + LDST/FLEX/MM/NUL/BV/BR
LDST                      + FLOP/FLEX/MM/NUL/BR
FLEX                      + FLOP/LDST/FLEX/MM/NUL/BR/FSYS
MM                        + FLOP/LDST/FLEX/FSYS
NUL                       + FLOP
SYS                         Never bundled

Besides these bundles, LDST + LDST bundles are also possible under certain circumstances. These are then called double word load/store.
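The bundle table above can be encoded directly as a lookup. The sketch below is an illustration only: the function names are hypothetical, the pairing is a simple greedy in-order scan, and it deliberately ignores data dependencies, control-flow restrictions and the special LDST + LDST case, all of which the real hardware takes into account.

```python
# Allowed (older -> younger) class pairs, transcribed from the table above.
ALLOWED = {
    "FLOP": {"LDST", "FLEX", "MM", "NUL", "BV", "BR"},
    "LDST": {"FLOP", "FLEX", "MM", "NUL", "BR"},
    "FLEX": {"FLOP", "LDST", "FLEX", "MM", "NUL", "BR", "FSYS"},
    "MM":   {"FLOP", "LDST", "FLEX", "FSYS"},
    "NUL":  {"FLOP"},
    # SYS is never bundled, so it has no entry.
}

def can_bundle(older, younger):
    """True if the class pair appears in the allowed-bundles table."""
    return younger in ALLOWED.get(older, set())

def bundle(stream):
    """Greedily pair adjacent instruction classes, in program order."""
    groups, i = [], 0
    while i < len(stream):
        if i + 1 < len(stream) and can_bundle(stream[i], stream[i + 1]):
            groups.append((stream[i], stream[i + 1]))
            i += 2
        else:
            groups.append((stream[i],))
            i += 1
    return groups

unordered = bundle(["LDST", "LDST", "FLOP", "FLOP"])   # 3 issue groups
reordered = bundle(["LDST", "FLOP", "LDST", "FLOP"])   # 2 issue groups
```

In this model the same four instructions need three issue groups in one order and only two in the other, which shows why instruction ordering matters for reaching the full superscalar potential.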

Data dependencies

Several kinds of instructions cannot be bundled together because of inter-instruction data dependencies:

Control Flow

PA-7200 superscalar capabilities

This is a 2-way superscalar processor implementation. It has two integer ALUs and one FPU. Similar to the PA-7100LC, shift-merge and test condition units are not duplicated in the second ALU. To support the superscalar capabilities one additional write port and two additional read ports were added to the general registers (GR*).

Allowed bundles

PA-7200 allowed instruction bundles
First (older) instruction   Second (younger) instruction
FLOP                      + LDST/FLEX/MM/NUL/BV/BR
LDST                      + FLOP/FLEX/MM/NUL/BR
FLEX                      + FLOP/LDST/FLEX/MM/NUL/BR/FSYS
MM                        + FLOP/LDST/FLEX/FSYS
NUL                       + FLOP


Multimedia Acceleration eXtensions (MAX-1 and MAX-2)

MAX-1 (32-bit)

The original multimedia extensions were proposed for and later introduced in the PA-7100LC processor. The aim was to enable workstations with this CPU to provide real-time MPEG video decompression and playback at a rate of 30 frames/second without the need for a special DSP (digital signal processing) chip.

The design process for the PA-7100LC processor (in the early to mid-1990s) was the first to include multimedia benchmarks when analyzing optimizations for the instruction set design.

The actual implementation was achieved by introducing a very small set of SIMD-MIMD[1] instructions that operate on bundled subword data. Since these instructions use the same data paths and execution units within the processor as the normal instructions, the term intrinsic signal processing (ISP) was coined. Sticking to conventional RISC principles, the design team decided against adding complex special-purpose instructions and opted for a small, elegant reuse of the existing processing facilities, which were merely modified to understand the new, packed subword data.

In 1994, the extensions were included in the final PA-7100LC product and as such were the first SIMD[1] instructions found in a general-purpose microprocessor. Less than 0.2 percent of the silicon area had to be used for these additions and modifications, while they allowed a very significant performance boost in affected applications. For example, the then high-end 735/99 workstation running at 99 MHz with 512 KB cache achieved 18.7 fps in MPEG decompression benchmarks, while the new, lower-clocked 712 workstation at 60 MHz with 64 KB cache achieved 26 fps. The MAX-1 multimedia instructions include: parallel add, parallel subtract, parallel shift left & add (i.e. multiplication by an integer), parallel shift right & add (i.e. division), parallel average.

  1. Single Instruction Multiple Data (SIMD), Multiple Instruction Multiple Data (MIMD); see for example the Wikipedia articles on SIMD and MIMD.
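To make the idea of subword operations concrete, the following sketch models two of the listed operations on a 32-bit word holding two 16-bit subwords. The function names are hypothetical, results simply wrap modulo 2^16, and the average truncates; the overflow and rounding behaviour of the real MAX-1 instructions is not modelled here.

```python
LANE_MASK = 0xFFFF   # one 16-bit subword

def unpack2(word):
    """Split a 32-bit word into its two 16-bit subwords (high, low)."""
    return [(word >> 16) & LANE_MASK, word & LANE_MASK]

def pack2(lanes):
    """Pack two subwords back into one 32-bit word, wrapping each to 16 bits."""
    return ((lanes[0] & LANE_MASK) << 16) | (lanes[1] & LANE_MASK)

def parallel_shift_left_add(a, shift, b):
    """Per subword: (a << shift) + b."""
    return pack2([(x << shift) + y for x, y in zip(unpack2(a), unpack2(b))])

def parallel_average(a, b):
    """Per subword: (a + b) / 2, truncated (a simplification)."""
    return pack2([(x + y) >> 1 for x, y in zip(unpack2(a), unpack2(b))])

# Multiply both subwords by 5 with one shift-and-add: x * 5 == (x << 2) + x.
word = pack2([3, 100])
times5 = parallel_shift_left_add(word, 2, word)
avg = parallel_average(pack2([10, 200]), pack2([20, 100]))
```

The shift-and-add form shows why this instruction can stand in for multiplication by small integer constants without a dedicated multiplier.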

MAX-2 (64-bit)

With the introduction of the new 64-bit PA-RISC 2.0 architecture in 1996, HP unveiled a new set of multimedia-oriented instructions aimed at using the processor’s resources more effectively for subword data. The basic components of contemporary multimedia data were often represented as 8, 12 or 16-bit integers, for example audio samples and pixel color depth. Doing arithmetic on data of this length would waste a considerable amount of the processor’s execution capacity: a simple addition of 16-bit data would use only one quarter of the 64-bit wide integer unit’s datapath. To remedy this, MAX allows packing these subword data into larger words near the processor’s natural word width (64-bit on PA-RISC 2.0 processors) and applying parallel instructions to them. An example would be four 16-bit additions performed by the 64-bit adder on four packed 16-bit subwords.
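The example of four 16-bit additions in one 64-bit operation can be emulated in software. The sketch below is not the MAX-2 instruction itself: it uses a well-known masking trick so that a single wide addition never carries from one subword into the next, and it wraps on overflow. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
HIGH = 0x8000800080008000   # top bit of each 16-bit subword
LOW = 0x7FFF7FFF7FFF7FFF    # remaining 15 bits of each subword

def pack4(lanes):
    """Pack four 16-bit values into one 64-bit word, first value highest."""
    word = 0
    for v in lanes:
        word = (word << 16) | (v & 0xFFFF)
    return word

def unpack4(word):
    """Split a 64-bit word into four 16-bit values, highest first."""
    return [(word >> s) & 0xFFFF for s in (48, 32, 16, 0)]

def packed_add16(a, b):
    """Four independent 16-bit additions using one 64-bit addition."""
    # Add the low 15 bits of every subword; the sums cannot spill over
    # into the neighbouring subword.
    low_sum = (a & LOW) + (b & LOW)
    # Restore the top bit of each subword: a15 XOR b15 XOR carry-in.
    return (low_sum ^ ((a ^ b) & HIGH)) & MASK64

total = packed_add16(pack4([1, 2, 3, 0xFFFF]), pack4([10, 20, 30, 1]))
```

The last subword overflows (0xFFFF + 1) and wraps to 0 without disturbing its neighbour, which is the property that makes packed arithmetic on a shared adder safe.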

The basic functionality of the earlier 32-bit MAX-1 was carried over and four more instructions were added for MAX-2. Additionally, thanks to the wider integer registers (now 64-bit), twice as many subwords can be processed per cycle, doubling the effective speed of these multimedia instructions. The MAX-2 multimedia instructions include: parallel add, parallel subtract, parallel shift left & add (i.e. multiplication by an integer), parallel shift right & add (i.e. division), parallel average, and, new in MAX-2, parallel shift right, parallel shift left, mix and permute.

MAX-2 debuted in silicon in 1996 on the PA-8000 processor and was featured on all subsequent PA-RISC 2.0 processors (PA-8x00). In contrast to contemporary multimedia extensions, MAX-2 required very little die space (0.1 percent on the PA-8000).

References

Accelerating Multimedia with Enhanced Microprocessor (PDF, 2.4MB)
Discussion of the MAX-1 instructions. Ruby Lee, April 1995, IEEE Micro, Volume 15 Number 2.
64-bit and Multimedia Extensions in the PA-RISC 2.0 Architecture (PDF, 66KB)
New features of the 64-bit PA-RISC 2.0 architecture and overview on the MAX introduced with it. Ruby Lee and Jerry Huck, 1996, Hewlett-Packard Company.
Subword Parallelism with MAX-2 (PDF, 1.5MB)
Discussion of the MAX-2 instructions. Ruby Lee, August 1996, IEEE Micro, Volume 16 Number 4.
