In this paper, we empirically evaluate fundamental design
trade-offs among the most recent multicore processors
and accelerator technologies. Our primary aim is to aid
application designers in better mapping their software to
the most suitable architecture, with an additional goal of
influencing future computing system design. We specifically
examine five architectures, based on the Intel quad-core
Harpertown processor, the AMD quad-core Barcelona
processor, the Sony-Toshiba-IBM Cell Broadband Engine
processors (both the first-generation chip and the second-generation
PowerXCell 8i), and the NVIDIA Tesla C1060
GPU. We illustrate the software implementation process on
each platform for a set of widely used kernels from computational
statistics that are simple to reason about; measure
and analyze the performance of each implementation; and
discuss the impact of different architectural design choices
on each implementation.