CS8803J - High Performance Communication (Spring 2002)
February 11, 2002
TOPIC: Project 1
Most submissions were not finished => the project was harder than intended.
Why?
- Poor Intel documentation
- Trial and error in an unfamiliar language
TOPIC: Project 2
Given that, when should Project 2 be due?
- Week of Feb 24th - March 2nd.
TOPIC: Ken went to HPCA-8
More importantly, Ken went to the Network Processor Workshop at HPCA.
Major sources of discussion at the workshop:
- Network Processors as line cards?
- Multithreaded versus synchronous execution?
- The need for benchmarks
- Problem of programmability (see Project 1)
Dr. Bill Dally gave a keynote based in part on his discussions with
Avici:

Interesting points about Avici:
- Output Buffer: With 10Gbps channels, the output buffer needs to be
approximately 400MB to hold 300 ms of traffic in case of transient bad
routing information.
- Input Buffer: Designed to buffer one packet from each port.
With 500 ports, and 64 channels per port, and 64KB packets, the input buffer
should be approximately 2GB in the worst case. Actual implementations
likely have 0.5 to 1.5GB total.
- 3-D torus interconnect greatly resembles that of the Cray T3D and T3E.
- Designed to be implemented on a line card. Line cards are
limited to about 200W each, based on industry limits of about 8kW per rack.
- 10Gbps line cards today cost around $200,000.
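The two buffer figures above can be checked with quick arithmetic. This is only a back-of-envelope sketch, not anything presented in the keynote; it assumes decimal units for link rate (1 Gbps = 10^9 bits/s) and 64KB = 64 * 1024 bytes, and all variable names are made up for illustration.

```python
# Back-of-envelope check of the Avici buffer sizes quoted in the notes.
# Assumptions (not from the notes): 1 Gbps = 10**9 bits/s, 64KB = 64 * 1024 bytes.

LINK_RATE_BPS = 10 * 10**9      # one 10Gbps channel
HOLD_TIME_MS = 300              # traffic to absorb during bad routing info

# bits sent in 300 ms, converted to bytes
output_buffer_bytes = LINK_RATE_BPS * HOLD_TIME_MS // 1000 // 8

PORTS = 500
CHANNELS_PER_PORT = 64
MAX_PACKET_BYTES = 64 * 1024    # one worst-case packet per channel

input_buffer_bytes = PORTS * CHANNELS_PER_PORT * MAX_PACKET_BYTES

print(f"output buffer: {output_buffer_bytes / 10**6:.0f} MB")
print(f"input buffer:  {input_buffer_bytes / 10**9:.2f} GB")
```

Under these assumptions the output buffer comes to 375 MB (rounded up to "approximately 400MB" in the notes) and the worst-case input buffer to about 2.1 GB.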
Tangent about the Cisco "Toaster" Network Processor:

- 16 processors used as 4 pipelines of 4 stages each.
- Draws about 15W.
- Uses the synchronized execution model as opposed to the multithreaded one.
- 16 network processors working with packets offset by one cycle for each
row.
- No multithreading, so going to SRAM means all 4 pipelines block.
- Register data is passed along, left to right, in the 4-stage pipeline
along with the packet.
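A rough way to picture one 4-stage row with register hand-off is the sketch below. It is a toy model, not Toaster code: the stage functions (parse, lookup, rewrite, emit), the Context class, and the register names are all hypothetical, and it models only a single row advancing in lock step. The one-cycle offset between rows and SRAM stalls are not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """A packet plus the register data that travels with it (hypothetical)."""
    packet: bytes
    regs: dict = field(default_factory=dict)

# Hypothetical per-stage work; each stage reads registers left by earlier stages.
def parse(ctx):   ctx.regs["dst"] = ctx.packet[0]
def lookup(ctx):  ctx.regs["port"] = ctx.regs["dst"] % 4
def rewrite(ctx): ctx.regs["ttl"] = ctx.packet[1] - 1
def emit(ctx):    ctx.regs["done"] = True

STAGES = [parse, lookup, rewrite, emit]

def run_row(packets):
    """Run one row in lock step; return (finished contexts, cycles used)."""
    slots = [None] * len(STAGES)          # slots[i] is the context at stage i
    pending = [Context(p) for p in packets]
    finished, cycles = [], 0
    while pending or any(s is not None for s in slots):
        # Every context moves one stage to the right; a new packet enters.
        slots = [pending.pop(0) if pending else None] + slots[:-1]
        # All occupied stages do their work in the same cycle.
        for stage, ctx in zip(STAGES, slots):
            if ctx is not None:
                stage(ctx)
        cycles += 1
        # The context that just completed the last stage leaves the row.
        if slots[-1] is not None:
            finished.append(slots[-1])
            slots[-1] = None
    return finished, cycles

finished, cycles = run_row([bytes([10, 64]), bytes([7, 32]), bytes([1, 2])])
print(cycles, finished[0].regs)
```

Because every stage advances together, a stall in any one stage would hold up the whole row, which is the cost of the synchronized model noted above.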
The idea of passing the register data along is similar to the idea in the
next Intel IXP. That design will not only have register banks for each
microengine, but also shared registers that sit between adjacent microengines.
This allows programmers to form pipelines out of the microengines. Ken
speculates the next IXP should run well above 1GHz.
TOPIC: Click
Click is a language for describing network processing in an object-oriented way.
It is a box-and-pointer language (the terms elements and connections
are used instead, though). Under the hood is simple C++. This is a
frequently referenced paper among NP researchers.

Interesting design decisions:
- No buffering between stages. The connections are really just
procedure calls, so data is passed instantaneously. You can, however,
add a "queue" element for manual buffering.
- Graph is static and can only be changed by reloading the graph text file.
- Both push and pull models are implemented. The quintessential
example is the queue: upstream elements push packets into it and
downstream elements pull them out. Though this sounds like basic
queue operation, many languages would implement only a push or a pull model,
but not both.
- Wakeup scheme: Click is run with various threads that check each
'awake' element to see if it has work to do. Elements are awoken when
they are a possible "next" stage after another element begins execution.
They go to sleep after they finish execution. Thus even with thousands
of elements, the scheduling thread does not need to poll elements that have no
work to do.
- Only packets are passed between elements, so any inter-element data is
appended to the packet.
- Global data and variables are highly discouraged, though (as I understand
it) they are allowed.
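The push/pull and packet-annotation points above can be sketched in a few lines. This is an illustration in Python, not Click's C++ API: the class names (Element, Tagger, Queue, Sink), the `annotations` dict, and the `run_once` method are all hypothetical. The wakeup scheme and the static graph file are not modeled.

```python
from collections import deque

class Packet:
    def __init__(self, data):
        self.data = data
        self.annotations = {}    # inter-element data rides along with the packet

class Element:
    def __init__(self):
        self.next = None         # downstream neighbor (push direction)
        self.prev = None         # upstream neighbor (pull direction)

    def connect(self, downstream):
        self.next, downstream.prev = downstream, self
        return downstream        # allows a.connect(b).connect(c)

    def process(self, packet):
        return packet            # default: pass through unchanged

    def push(self, packet):
        # A connection is just a call: no buffering between stages.
        packet = self.process(packet)
        if packet is not None and self.next is not None:
            self.next.push(packet)

    def pull(self):
        packet = self.prev.pull() if self.prev is not None else None
        return self.process(packet) if packet is not None else None

class Tagger(Element):
    def process(self, packet):
        packet.annotations["length"] = len(packet.data)
        return packet

class Queue(Element):
    """Explicit buffering: pushed into from upstream, pulled from downstream."""
    def __init__(self, capacity):
        super().__init__()
        self.capacity, self.items, self.drops = capacity, deque(), 0

    def push(self, packet):
        if len(self.items) >= self.capacity:
            self.drops += 1      # full queue: drop the packet
        else:
            self.items.append(packet)

    def pull(self):
        return self.items.popleft() if self.items else None

class Sink(Element):
    def __init__(self):
        super().__init__()
        self.received = []

    def run_once(self):
        packet = self.pull()     # the sink drives the pull side
        if packet is not None:
            self.received.append(packet)
        return packet is not None

tagger, queue, sink = Tagger(), Queue(capacity=2), Sink()
tagger.connect(queue).connect(sink)
for data in (b"a", b"bb", b"ccc"):
    tagger.push(Packet(data))    # push side: source -> tagger -> queue
while sink.run_once():           # pull side: sink <- queue
    pass
print(len(sink.received), queue.drops)
```

The queue is the only place where the push side and the pull side meet, so it is also the only place a packet can wait.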
Results:
- Runs as a kernel process under x86 Linux.
- 90% as fast as the standard Linux routing process using two 100BT line
cards.
- The Click process doesn't fit in a 32KB I-Cache like the Linux process
does. Ken believes the 10% difference is accounted for almost solely by the
added I-Cache misses.
Is this useful?
- Well, just like with the IXP ACE system, we still have no notion of
software pipelining.
- This does help the issue of programmability without a significant
performance hit.
- Not limited to x86 Linux PCs. Implementable with any NP.
Author: Peter Sassone