



Spring 2009 Prof. Hyesoon Kim





The ARM9 Family-High Performance Microprocessors for Embedded Applications







# **Nintendo DS/DSi/DS Lite**

























### **Hardware Picture**



microcontroller were covered by a metal shielding plate.







# **Hardware Specifications**

- Dual TFT LCD screens
- CPUs
  - ARM7TDMI (33 MHz)
  - ARM946E-S (67 MHz)
- Main memory: 4MB RAM
  - VRAM: 656 KiB
- 2D graphics
  - Up to 4 backgrounds
- 3D graphics









### **ARM7/ARM9**

- Both can be running code at the same time.
- The ARM7 is the only CPU that controls the touch screen.
  - Interrupt based









# **Brief History of ARM**

- ARM is short for Advanced RISC Machines Ltd.
  - Founded in 1990; owned by Acorn, Apple, and VLSI
- Known as a computer manufacturer (Acorn) before becoming ARM
- ARM is one of the most widely licensed processor architectures
- Used especially in portable devices due to low power consumption and reasonable performance (MIPS/watt)







### **ARM7 and ARM7 TDMI**

- ARM7: 3-stage pipeline, 16 32-bit registers, 32-bit instruction set
- TDMI
  - T: Thumb instruction set
  - D: Debug interface
  - M: Multiplier (hardware)
  - I: EmbeddedICE (in-circuit emulation hardware)
  - The most commonly used ARM7 variant









## **ARM7 TDMI**

- 32/16-bit RISC
- 32-bit ARM instruction set
- 16-bit Thumb instruction set
- 3-stage pipeline
- Very small die size and low power
- Unified bus interface
   (the 32-bit data bus carries both instructions and data)









## **Thumb Instruction Decode**










### **Thumb Instruction**



- Instruction compression to save I-cache/memory accesses
- Most instructions use only the low 8 registers (R0-R7)
- 3 operands → 2 operands (the destination doubles as a source)
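The compression idea can be sketched in a few lines. The example below is a minimal illustration, not the actual decoder logic: it expands one Thumb format (MOV/CMP/ADD/SUB with an 8-bit immediate) into the equivalent 32-bit ARM data-processing encoding. The function name `expand_thumb_imm8` and the table `ARM_OPCODE` are made up for this sketch.

```python
# Illustrative only: expands one Thumb format (MOV/CMP/ADD/SUB Rd, #imm8)
# into the equivalent 32-bit ARM data-processing encoding.

# ARM data-processing opcodes (bits 24..21) for the four Thumb operations.
ARM_OPCODE = {
    0b00: 0b1101,  # MOV
    0b01: 0b1010,  # CMP
    0b10: 0b0100,  # ADD
    0b11: 0b0010,  # SUB
}


def expand_thumb_imm8(halfword: int) -> int:
    """Return the 32-bit ARM encoding for a Thumb 'op Rd, #imm8' halfword."""
    if (halfword >> 13) != 0b001:
        raise ValueError("not a Thumb MOV/CMP/ADD/SUB immediate instruction")
    op = (halfword >> 11) & 0b11
    rd = (halfword >> 8) & 0b111   # 3-bit field: only R0-R7 are encodable
    imm8 = halfword & 0xFF

    cond = 0b1110                  # these Thumb instructions always execute (AL)
    rn = 0 if op == 0b00 else rd   # 2-operand form: Rd is also the first source
    dest = 0 if op == 0b01 else rd # CMP updates only the flags
    return ((cond << 28) | (1 << 25) | (ARM_OPCODE[op] << 21) | (1 << 20)
            | (rn << 16) | (dest << 12) | imm8)
```

For example, the Thumb halfword `0x3001` (`ADD R0, #1`) expands to the ARM word `0xE2900001` (`ADDS R0, R0, #1`): the 3-bit register field and the implied source operand are what make the 16-bit form possible.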









### Thumb...

- Code is compiled to either native ARM code or Thumb code
  - A separate state lets Thumb use the full 16-bit opcode space
  - The T bit in the Current Program Status Register (CPSR) selects Thumb or native ARM state
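As a small sketch of how the CPSR selects the instruction set: the T bit is bit 5 of the CPSR, and the low five bits hold the processor mode. The helper names below are hypothetical and exist only for this illustration.

```python
T_BIT = 1 << 5      # CPSR bit 5: 1 = Thumb state, 0 = ARM state
MODE_MASK = 0x1F    # CPSR bits 4..0: processor mode


def instruction_state(cpsr: int) -> str:
    """Name of the instruction set selected by the CPSR T bit."""
    return "Thumb" if cpsr & T_BIT else "ARM"


def instruction_width_bytes(cpsr: int) -> int:
    """Thumb instructions are 2 bytes wide, ARM instructions 4 bytes."""
    return 2 if cpsr & T_BIT else 4


def processor_mode(cpsr: int) -> int:
    """Raw mode field, e.g. 0x10 for User mode and 0x13 for Supervisor."""
    return cpsr & MODE_MASK
```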









## **ARM Instruction Set**

- All ARM instructions are conditional
- BX (Branch and eXchange): branches and can switch between ARM and Thumb state
- Link register (subroutine link register)
  - R14 receives the return address when a Branch with Link (BL or BLX) instruction is executed
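The interplay of BL, BX, and R14 can be modeled roughly as follows. This is a toy model under simplifying assumptions: `pc` holds the address of the instruction being executed (the real PC reads ahead because of the pipeline), only the ARM-state form of BL is modeled, and the class `Cpu` is invented for this sketch.

```python
LR = 14  # R14 is the link register


class Cpu:
    def __init__(self):
        self.regs = [0] * 16
        self.pc = 0          # address of the instruction being executed
        self.thumb = False   # models the CPSR T bit

    def bl(self, target: int) -> None:
        """ARM-state Branch with Link: save the return address in R14."""
        self.regs[LR] = self.pc + 4   # instruction after the 4-byte BL
        self.pc = target

    def bx(self, rm: int) -> None:
        """Branch and eXchange: bit 0 of register Rm selects the new state."""
        addr = self.regs[rm]
        self.thumb = bool(addr & 1)   # 1 = continue in Thumb, 0 = ARM
        self.pc = addr & ~1           # bit 0 is not part of the address


cpu = Cpu()
cpu.pc = 0x8000
cpu.bl(0x9000)          # call a subroutine; R14 now holds 0x8004
cpu.bx(LR)              # return: bit 0 of R14 is clear, so stay in ARM state
```

Returning with `BX R14` rather than copying R14 into the PC is what lets a subroutine return correctly to a caller in either state.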









## ARM9

- 5-stage pipeline
- I-cache and D-cache
- Floating point support with the optional VFP9-S coprocessor
- Enhanced 16 x 32-bit multiplier capable of single cycle MAC operations
- The ARM946E-S processor supports ARM's real-time trace technology










## ARM946E-S

- Embedded Core with Flexible Cached Memory System & DSP Instruction Set Extensions
- Memory Protection Unit (MPU) supporting all major RTOSes: VxWorks, pSOS
- Flexible instruction and data cache sizes
- Process technology: 180 nm → 90 nm
- Imaging products
  - Printers, digital cameras
- Networking systems
- Automotive control









## ARM9

- ARM7 3-stage → ARM9 5-stage pipeline
  - Increase clock frequency

#### ARM7TDMI Pipeline Operation

| Fetch             | Decode                                                           | Execute                                      |
|-------------------|------------------------------------------------------------------|----------------------------------------------|
| Instruction Fetch | Convert Thumb to ARM<br>Main Decode<br>Register Address Decode   | Register Read<br>Shifter<br>ALU<br>Writeback |

#### ARM9TDMI Pipeline Operation

| Fetch             | Decode                                                                                                                            | Execute        | Memory             | Writeback                                        |
|-------------------|-----------------------------------------------------------------------------------------------------------------------------------|----------------|--------------------|--------------------------------------------------|
| Instruction Fetch | ARM Decode, Reg. Address Decode, Register Read<br>(in parallel with)<br>Thumb Decode, Reg. Address Decode, Register Read          | Shifter<br>ALU | Data Memory Access | ALU Result<br>and / or<br>Load Data<br>Writeback |







# **ARM9** Pipeline

- ARM7: Thumb instructions are converted to ARM form in the first half-phase of the decode stage
- ARM9: ARM and Thumb instructions are decoded in parallel
- ARM7: The ALU (arithmetic and logic unit) is active all the time
- ARM9: The arithmetic and logic units are partitioned to save power
- ARM9: Forwarding paths
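A back-of-the-envelope sketch of why the deeper pipeline pays off: on an ideal, stall-free pipeline, `n` instructions take `stages + n - 1` cycles, so the extra fill latency is negligible next to the higher clock rate. The clock values are the DS figures from the hardware specification slide; the stall-free assumption and the function names are simplifications made for this sketch.

```python
def ideal_cycles(n_instructions: int, stages: int) -> int:
    """Cycles to run n instructions on a stall-free pipeline."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs `stages` cycles; each later one adds a cycle.
    return stages + n_instructions - 1


def run_time_us(n_instructions: int, stages: int, clock_mhz: float) -> float:
    """Run time in microseconds at the given clock (cycles / MHz = us)."""
    return ideal_cycles(n_instructions, stages) / clock_mhz


arm7_us = run_time_us(1000, stages=3, clock_mhz=33)   # ARM7TDMI in the DS
arm9_us = run_time_us(1000, stages=5, clock_mhz=67)   # ARM946E-S in the DS
print(f"ARM7: {arm7_us:.2f} us, ARM9: {arm9_us:.2f} us")
```

Real code does stall (for example on a load followed immediately by a use of the loaded value), which is where the ARM9 forwarding paths help.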







### **ARM9TDMI Datapath**



- Three register read ports and two write ports
  - 2 read ports: execution unit
  - 1 read port: store data, read during the execute stage (no latch)


# ARM946E-S











# **AHB (Advanced High-performance Bus)**

- single clock edge operation (rising edge)
- unidirectional (nontristate) buses
- burst transfers
- split transactions:
  - Request (address) and Reply (data)



- single-cycle bus master handover

(Figure: AHB request lines and response lines)





- ETM (Embedded Trace Macrocell)
- Provides debugging and trace facilities
- Captures processor state both before and after a specific event
- Can be configured by software
- Helps code development
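The "before and after an event" capture can be pictured as a rolling window that is frozen when a trigger fires. The sketch below is purely conceptual and says nothing about the real ETM programming interface; the class `TraceBuffer` and its parameters are invented for illustration.

```python
from collections import deque


class TraceBuffer:
    """Keeps the last `pre` samples before a trigger and `post` samples after."""

    def __init__(self, pre: int, post: int):
        self.history = deque(maxlen=pre)  # rolling window before the trigger
        self.post = post
        self.captured = None              # filled in once the trigger fires
        self.remaining = 0

    def record(self, pc: int, trigger: bool = False) -> None:
        if self.captured is None:
            if trigger:
                # Freeze the pre-trigger window and start post-trigger capture.
                self.captured = list(self.history) + [pc]
                self.remaining = self.post
            else:
                self.history.append(pc)
        elif self.remaining > 0:
            self.captured.append(pc)
            self.remaining -= 1


trace = TraceBuffer(pre=2, post=2)
for pc in range(0, 40, 4):
    trace.record(pc, trigger=(pc == 16))  # trigger when execution reaches 16
```

After the loop, `trace.captured` holds the two addresses before the trigger, the trigger address itself, and the two addresses after it.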



Realview trace









# ARM10E

- A Jazelle Technology-enhanced 32-bit RISC
  - Jazelle Technology: executes Java bytecode in hardware
- Supports 64-bit data
- 16-bit fixed-point DSP instructions to enhance the performance of many signal processing algorithms
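To make the fixed-point bullet concrete, here is a small software model of a 16 x 16 multiply followed by a saturating 32-bit accumulate, the kind of operation the ARMv5TE DSP extensions (for example SMULxy and QADD) provide in hardware. The Q15 format and all function names are assumptions chosen for this sketch; it does not reproduce the exact semantics of any single instruction.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def saturate32(x: int) -> int:
    """Clamp to the signed 32-bit range instead of wrapping around."""
    return max(INT32_MIN, min(INT32_MAX, x))


def to_int16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mac16(acc: int, a: int, b: int) -> int:
    """acc + a*b for signed 16-bit a and b, with a saturating accumulator."""
    return saturate32(acc + to_int16(a) * to_int16(b))


def q15_dot(xs, ys) -> int:
    """Dot product of two Q15 vectors, returned in Q15."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac16(acc, x, y)   # each product is in Q30
    return acc >> 15             # scale Q30 back to Q15
```

In Q15, `0x4000` represents 0.5, so `q15_dot([0x4000], [0x4000])` gives `0x2000`, which is 0.25.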











### **ARM11 MPCore**

- Multicore
- A fully coherent data cache
- High memory bandwidth (1.3 GB/s)









# **ARM Architecture Feature Comparisons**

| Feature                        | ARM9E™           | ARM10E™          | Intel®<br>XScale™ | ARM11™                       |
|--------------------------------|------------------|------------------|-------------------|------------------------------|
| Architecture                   | ARMv5TE(J)       | ARMv5TE(J)       | ARMv5TE           | ARMv6                        |
| Pipeline Length                | 5                | 6                | 7                 | 8                            |
| Java Decode                    | (ARM926EJ)       | (ARM1026EJ)      | No                | Yes                          |
| V6 SIMD Instructions           | No               | No               | No                | Yes                          |
| MIA Instructions               | No               | No               | Yes               | Available as coprocessor     |
| Branch Prediction              | No               | Static           | Dynamic           | Dynamic                      |
| Independent<br>Load-Store Unit | No               | Yes              | Yes               | Yes                          |
| Instruction Issue              | Scalar, in-order | Scalar, in-order | Scalar, in-order  | Scalar, in-order             |
| Concurrency                    | None             | ALU/MAC,<br>LSU  | ALU, MAC,<br>LSU  | ALU/MAC,<br>LSU              |
| Out-of-order completion        | No               | Yes              | Yes               | Yes                          |
| Target<br>Implementation       | Synthesizable    | Synthesizable    | Custom chip       | Synthesizable and Hard macro |
| Performance Range              | Up to 250MHz     | Up to 325MHz     | 200MHz –<br>>1GHz | 350MHz -<br>>1GHz            |







# **Nintendo Wii**















Table 2-1. PowerPC 750CL Microprocessor Block Diagram









## PowerPC 750CL

- Fetch: four instructions per clock
  - Process one branch per cycle
  - 512-entry branch history table
- Dispatch:
  - 2 instructions
- Load/store unit
  - Store gathering
  - Single cycle load/store
  - 32 KB L1 cache, 256 KB L2 cache
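A branch history table of this size can be sketched as an array of saturating counters indexed by the branch address. Only the 512-entry size comes from the slide; the 2-bit counters, the indexing by word address, and the initial counter value are assumptions for this sketch, and the class name is invented.

```python
class BranchHistoryTable:
    """Direct-mapped table of 2-bit saturating counters (assumed design)."""

    def __init__(self, entries: int = 512):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries   # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # instructions are 4 bytes wide

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

With two bits per entry, a single mispredicted loop exit does not flip the prediction for the next run of the loop. Branches whose addresses differ by a multiple of 512 words share an entry and can disturb each other.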









# **Mobile 3D Graphics Benchmarks**











### **Announcements**

- Design review:
  - Bring all your designs
  - Be ready to explain your design
- Friday (Report Due)
- Final exam: comprehensive (Wed 8:00 AM)
- Course-Instructor Opinion Survey (CIOS)

