



Fall 2011 Prof. Hyesoon Kim



#### Thanks to Prof. Loh & Prof. Prvulovic



# PARALLEL PROGRAMMING & HARDWARE IMPLEMENTATION

Georgia College of Tech Computing



## Lock Example (Review)







Computing

## **Synchronization**

- Shared counter/sum update example
  - Use a mutex variable for mutual exclusion
  - Only one processor can own the mutex
    - Many processors may call lock(), but only one will succeed (others block)
    - The winner updates the shared sum, then calls unlock() to release the mutex
    - Now one of the others gets it, etc.
  - But how do we implement a mutex?
    - As a shared variable (1 owned, 0 free)



Computing

#### Locking

- Releasing a mutex is easy
  Just set it to 0
- Acquiring a mutex is not so easy
  - Easy to spin waiting for it to become 0
  - But when it does, others will see it, too
  - Need a way to *atomically* see that the mutex is 0 *and* set it to 1

# **Atomic Read-Update Instructions**

- Atomic exchange instruction
  - E.g., EXCH R1,78(R2) will swap content of register R1 and mem location at address 78+R2
  - To acquire a mutex, 1 in R1 and EXCH
    - Then look at R1 and see whether mutex acquired
    - If R1 is 1, mutex was owned by somebody else and we will need to try again later
    - If R1 is 0, mutex was free and we set it to 1, which means we have acquired the mutex
- Other atomic read-and-update instructions
  - E.g., Test-and-Set





# LL & SC Instructions

- Atomic instructions OK, but specialized - E.g., SWAP can not atomically inc a counter
- Idea: provide a pair of linked instructions
- A load-linked (LL) instruction
  - Like a normal load, but also remembers the value in a special "link" register
- A store-conditional (SC) instruction
  - Like a normal store, but fails if its value is not the same as that in the link register
  - Returns 1 if successful, 0 on failure
- Writes by other processors snooped
  - If value matches link register, clear link register





## Using LL & SC

#### Swap R4 w/ 0(R1)

#### Atomic Exchange

| swap: | mov  | R3, R4   |
|-------|------|----------|
|       | 11   | R2,0(R1) |
|       | SC   | R3,0(R1) |
|       | beqz | R3,swap  |
|       | mov  | R4,R2    |





## **Simple Swap's problem**

• MSI protocol :

Acquire a lock

- P1 SWAP "1" mem[A]
- -P2 SWAP "1" mem[A]
- P3 SWAP "1" mem[A]
- -P2 SWAP "1" mem[A]
- P3 SWAP "1" mem[A]
- P2 SWAP "1" mem[A]

So many invalidations!!



# **Implementing Locks**

- A simple swap (or test-and-set) works
   But causes a lot of invalidations
  - Every write sends an invalidation
  - Most writes redundant (swap 1 with 1)
- More efficient: test-and-swap
  - Read, do swap only if 0
    - Read of 0 does not guarantee success (not atomic)
    - But if 1 we have little chance of success
  - Write only when good chance we will succeed



| try: | mov  | R3,#1    |   | try: | mov  | R3,#1    |
|------|------|----------|---|------|------|----------|
|      | 11   | R2,0(R1) | N |      | 11   | R2,0(R1) |
|      | SC   | R3,0(R1) |   |      | bnez | R2,try   |
|      | bnez | R2,try   |   |      | SC   | R3,0(R1) |
|      | beqz | R3,try   |   |      | beqz | R3,try   |





College of

Computing

# Using LL & SC

Swap R4 w/ 0(R1)

Test if 0(R1) is zero, set to one

#### Atomic Exchange

| swap: | mov  | R3, R4   |
|-------|------|----------|
|       | 11   | R2,0(R1) |
|       | SC   | R3,0(R1) |
|       | beqz | R3,swap  |
|       | mov  | R4,R2    |

#### Atomic Test&Set

| t&s: | mov  | R3,1     |
|------|------|----------|
|      | 11   | R2,0(R1) |
|      | SC   | R3,0(R1) |
|      | bnez | R2,t&s   |
|      | beqz | R3,t&s   |

Georgia Tech

#### Atomic Add to Shared Variable

| upd: | 11   | R2,0(R1) |
|------|------|----------|
|      | add  | R3,R2,R4 |
|      | SC   | R3,0(R1) |
|      | beqz | R3,upd   |



College of Computing

ſech

# Large-Scale Systems: Locks

- Contention even with test-and-test-and-set
  - Every write goes to many, many spinning procs
  - Making everybody test less often reduces contention for high-contention locks but hurts for low-contention locks
  - Solution: exponential back-off
    - If we have waited for a long time, lock is probably highcontention
    - Every time we check and fail, double the time between checks
      - Fast low-contention locks (checks frequent at first)
      - Scalable high-contention locks (checks infrequent in long waits)
  - Special hardware support



Computing

#### **Barrier Synchronization**

- All must arrive before any can leave
  Used between different parallel sections
- Uses two shared variables
  - A counter that counts how many have arrived
    - atomically dec a counter
  - A flag that is set when the last processor arrives



#### **Over using locks**



Acquire lock While (node) { If (node->data > 10) node->data = new\_data; node = node->next



**P1** 

P2

Release lock





#### **Transactional Memory**

- A group of load and store instructions to execute in an atomic way.
- Trying to avoid locks
- Hardware transactional memory
- Begin/End
- Commit or Abort
- Commits: All of the loads and stores appear to have run atomically with respect to other transactions
- How? Keep a record (in a hardware or in a memory)



#### **Review Question**

- MSI protocol : cache block size is 4B
  - -P1 LDB mem[A]
  - -P2 STB mem[A]
  - -P3 LDB mem[A+1]
  - -P1 STB mem[A+2]
  - -P2 STB mem[A]
  - -P3 LDB mem[A+1]



# **MEMORY CONSISTENCY**



College of Computing



## **Memory Consistency**

- Coherence is about order of accesses to the same address
- Consistency is about order of accesses to different addrs

| Program ord                                               | er | Exe                         | cutio | n or                 | der |                             |
|-----------------------------------------------------------|----|-----------------------------|-------|----------------------|-----|-----------------------------|
| <u>Proc 1</u> <u>Proc</u><br>ST 1->D LD F<br>ST 1->F LD D | 2  | <u>Proc</u><br>ST D<br>ST F |       | Prod<br>LD I<br>LD I | C   | Proc 2 can reorder<br>loads |

Georgia

Computing

- Possible outcomes on Proc 2 in program order
   (F,D) can be (0,0), (0,1), (1,1)
- Execution order can also give (1,0)
  - Exposes instruction reordering to programmer
  - We need something that makes sense intuitively

Computing

# **Memory Consistency**

- But first: why would we want to write code like that?
  - If we never actually need consistency, we don't care if it isn't there
- Example: flag synchronization
  - Producer produces data and sets flag
  - Consumer waits for flag to be 1, then reads data

| Source                           | e Code                        | Assembler Code                |                        |   |  |  |
|----------------------------------|-------------------------------|-------------------------------|------------------------|---|--|--|
| <u>Proc 1</u><br>D=Val1;<br>F=1; | <u>Proc 2</u><br>while(F==0); | <u>Proc 1</u><br>ST D<br>ST F | <u>Proc 2</u><br>wait: | - |  |  |

- Now we can have the following situation
  - Proc 2 has cache miss on F, predicts the branch not taken, reads
  - Proc 1 writes D, writes F
  - Proc 2 gets F, checks, sees 1, verifies branch is correctly predicted
    Georgia



# **Sequential Consistency (SC)**

- The result of any execution should be the same as if the accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved
  - The same interleaved order seen by everybody





# Hardware support for SC

- Simple implementation
  - A processor issues next access only when its previous access is complete
  - Write completion acknowledgement, invalidation acknowledgment
  - Sloooow!
- A better implementation
  - Processor issues accesses as it sees fit, but detects and fixes potential violations of sequential consistency

Georgia

Tech

College of Computing





#### **Relaxed Consistency Models**



# **Relaxed Consistency Models**

- Two kinds of memory accesses
   Data accesses and synchronization accesses
- Synchronization accesses determine order
  - Data accesses ordered by synchronization
  - Any two of accesses to the same variable in two different processes, such that at least one of the accesses is a write, are always ordered by synchronization operations

College of

Computing

Georgia

- Good performance even in simple implementation
  - Sequential consistency for sync accesses
  - Data accesses can be freely reordered (except around sync accesses)



(റ്റത്തി കുറ്റത്തി

Computing

Georgia

# **Relaxed Consistency Models**

- Data Races
  - When data accesses to same var in different procs are not ordered by sync
  - Most programs are data-race-free (so relaxed consistency models work well)
- There are many relaxed models
  - Weak Consistency, Processor Consistency, Release Consistency, Lazy Release Consistency
  - All work just fine for data-race-free programs
  - But when there are data races, more relaxed models  $\Rightarrow$  weirder program behavior



#### **Data Race?**

 Two concurrent accesses to a shared location, at least one of them for writing.
 – BUG!!

