NAX: Near-Data Approximate Computing

Amir Yazdanbakhsh, Jake Sacks, Choungki Song, Pejman Lotfi-Kamran, Hadi Esmaeilzadeh, and Nam Sung Kim
workshop Workshop on Approximate Computing (AC) co-located with ESWEEK'16 | October 6, 2016.

Abstract

This paper aims to devise an in-DRAM acceleration architecture integrated with conventional 2D DRAM for GPU-based computing systems. We utilize the neural transformation, which leverages the approximability of applications to transform complex, hot regions of code into simple operations. This transformation enables us to integrate rather simple acceleration logic that implements these simple operations while supporting a diverse set of applications. This allows many accelerators to be tightly and inexpensively integrated within DRAM to exploit its high internal bandwidth. The applications’ tolerance to approximation enables us to further simplify the circuit by utilizing an approximate MAC unit, allowing even tighter and cheaper integration. Evaluation with a diverse set of applications shows that NAX yields 2.0× speedup and 3.0× better energy efficiency compared to an accelerated GPU system. These benefits are achieved with a ≈4% area overhead to the GDDR5 chip.
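
A minimal Python sketch of one plausible approximate-MAC scheme, operand truncation; the scheme and the truncation width are our illustration, not necessarily the circuit NAX uses:

```python
def approx_mac(acc, a, b, trunc_bits=4):
    """Approximate multiply-accumulate: zero the low `trunc_bits`
    bits of each operand before multiplying, trading a small error
    for a much smaller multiplier array (illustrative only)."""
    a_t = (a >> trunc_bits) << trunc_bits
    b_t = (b >> trunc_bits) << trunc_bits
    return acc + a_t * b_t

exact, approx = 0, 0
for a, b in [(1000, 37), (255, 129), (4096, 5)]:
    exact += a * b
    approx = approx_mac(approx, a, b)
print(exact, approx)  # small relative error, far cheaper hardware
```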

AXBENCH: A Multi-Platform Benchmark Suite for Approximate Computing

Amir Yazdanbakhsh, Divya Mahajan, Pejman Lotfi-Kamran, and Hadi Esmaeilzadeh
journal Design and Test, special issue on Computing in the Dark Silicon Era, IEEE | May, 2016.

Abstract

As we enter the dark silicon era, the benefits from classical transistor scaling are diminishing. The current paradigm of microprocessor design falls significantly short of the historical cadence of performance improvements. To address these challenges, there is a need to go beyond traditional approaches and explore unconventional paradigms in the computing landscape. One such paradigm is approximate computing, which embraces imprecision and relaxes the traditional abstraction of “near-perfect” accuracy across the system stack. Approximate computing promises to deliver significant performance and energy efficiency gains when small losses of quality are permissible. As approximate computing attracts more attention, having a general, diverse, and representative set of benchmarks to evaluate different approximation techniques becomes necessary. In this paper, we introduce AXBENCH, a general, diverse, and representative set of benchmarks for CPUs, GPUs, and hardware design. We judiciously select and develop each benchmark to cover a diverse set of domains such as financial analysis, machine vision, medical imaging, machine learning, data analytics, scientific computation, signal processing, image processing, robotics, and compression. Furthermore, to enable a wide range of studies, each benchmark is paired with three different input datasets. AXBENCH also provides the necessary annotations to mark the approximable regions of code and the application-specific quality metric to assess the output quality of each application.
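
As a concrete (hypothetical) example of the kind of application-specific quality metric AxBench ships, an image benchmark might report relative root-mean-square error; the function below is our stand-in, not AxBench's actual code:

```python
import math

def rmse_quality_loss(exact, approx, max_val=255):
    """Relative RMSE between an exact and an approximate output,
    normalized to [0, 1]. A hypothetical stand-in for one of
    AxBench's per-application quality metrics."""
    assert len(exact) == len(approx)
    mse = sum((e - a) ** 2 for e, a in zip(exact, approx)) / len(exact)
    return math.sqrt(mse) / max_val

print(rmse_quality_loss([10, 200, 30], [12, 198, 33]))
```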

Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration

Divya Mahajan, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, and Hadi Esmaeilzadeh
conference The 43rd International Symposium on Computer Architecture (ISCA) | June 18-22, 2016.

Abstract

Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator, improving performance and efficiency, or run on the precise core, maintaining quality. In this paper we introduce MITHRA, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. MITHRA seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes the benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural network based. To understand the efficacy of these mechanisms, we compare them with an ideal but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows similar speedup but improves energy efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that MITHRA performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
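
A minimal sketch of the table-based classifier idea, hash the invocation's inputs, look up a learned quality-loss estimate, and compare it against the software-tuned knob; the table size, hash, and update rule are our assumptions:

```python
class TablePredictor:
    """Sketch of a MITHRA-style table-based classifier. Each entry
    holds a learned quality-loss estimate for invocations hashing
    to it; the software-tuned knob is the loss tolerated per
    invocation."""

    def __init__(self, size=4096):
        self.size = size
        self.table = [0.0] * size

    def _index(self, inputs):
        return hash(tuple(inputs)) % self.size

    def train(self, inputs, observed_loss):
        i = self._index(inputs)
        # exponential moving average of observed quality loss
        self.table[i] = 0.9 * self.table[i] + 0.1 * observed_loss

    def approximate(self, inputs, knob):
        # True -> delegate to the accelerator; False -> precise core
        return self.table[self._index(inputs)] <= knob

p = TablePredictor()
p.train([3, 7], observed_loss=0.02)
print(p.approximate([3, 7], knob=0.05))  # True: predicted loss under knob
```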

GRATER: An Approximation Workflow for Exploiting Data-Level Parallelism in FPGA Acceleration

Atieh Lotfi, Abbas Rahimi, Amir Yazdanbakhsh, Hadi Esmaeilzadeh, and Rajesh K. Gupta
conference Design, Automation and Test in Europe (DATE) | March, 2016.

Abstract

Modern applications including graphics, multimedia, web search, and data analytics not only can benefit from acceleration, but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade quality of results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators, which are inherently subject to many resource constraints. To better utilize the FPGA resources, we devise GRATER, an automated design workflow for FPGA accelerators that leverages imprecise computation to increase data-level parallelism and achieve higher computational throughput. The core of our workflow is a source-to-source compiler that takes in an input kernel and applies a novel optimization technique that selectively reduces the precision of the kernel’s data and operations. Reducing the precision shrinks the area required to synthesize the kernels on the FPGA, allowing a larger number of operations and parallel kernels to be integrated within the fixed area of the FPGA. The larger number of integrated kernels provides more hardware contexts to better exploit data-level parallelism in the target applications. To effectively explore the design space of approximate kernels, we exploit a genetic algorithm to find a subset of safe-to-approximate operations and data elements and then tune their precision levels until the desired output quality is achieved. GRATER is a purely software technique and does not require any changes to the underlying FPGA hardware. We evaluate GRATER on a diverse set of data-intensive OpenCL benchmarks from the AMD SDK. The synthesis results on a modern Altera FPGA show that our approximation workflow yields 1.4×–3.0× higher throughput with less than 1% quality loss.
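
The genetic search over per-variable precision levels can be sketched as follows; the encoding, the fitness function, and the `demo_quality` placeholder are our simplifications of GRATER's actual workflow:

```python
import random

def tune_precisions(kernel, quality_of, n_vars, max_bits=32,
                    target_loss=0.01, pop=20, gens=50):
    """GRATER-style search (our simplification): evolve per-variable
    bit-widths, preferring candidates that shrink total width while
    keeping quality loss under `target_loss`."""
    def fitness(widths):
        if quality_of(kernel, widths) > target_loss:
            return float('inf')      # infeasible candidate
        return sum(widths)           # smaller total width ~ less area

    population = [[random.randint(8, max_bits) for _ in range(n_vars)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness)
        survivors = population[:pop // 2]
        children = []
        for _ in range(pop - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(n_vars)
            child = a[:cut] + b[cut:]                 # crossover
            i = random.randrange(n_vars)              # mutation
            child[i] = max(1, min(max_bits, child[i] + random.choice((-1, 1))))
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)

# Toy demo: pretend quality collapses if any width drops below 12 bits.
demo_quality = lambda kernel, widths: 0.05 if min(widths) < 12 else 0.0
print(tune_precisions(None, demo_quality, n_vars=4))
```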

TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning

Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi Esmaeilzadeh
conference IEEE Symposium on High Performance Computer Architecture (HPCA) | March, 2016.

  Distinguished Paper Award

Abstract

A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4× and 2.9× average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57×, 20.2×, and 33.4× higher performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.
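
TABLA's central abstraction, a fixed stochastic-gradient-descent solver parameterized only by the task's gradient function, can be pictured in Python terms (TABLA's real input is a high-level specification language; the analogy below is ours):

```python
def sgd(gradient, w, data, lr=0.05, epochs=200):
    """Fixed solver: identical for every learning task."""
    for _ in range(epochs):
        for x, y in data:
            g = gradient(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# The only task-specific code is the gradient of the objective.
# Here: least-squares linear regression (our example, not TABLA syntax).
def linreg_gradient(w, x, y):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

data = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0)]
print(sgd(linreg_gradient, [0.0, 0.0], data))  # approaches [1.0, 2.0]
```

Swapping in a logistic-regression or SVM gradient changes nothing else, which is exactly the commonality the template-based accelerator exploits.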

RFVP: Rollback-Free Value Prediction with Safe to Approximate Loads

Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry
conference High Performance Embedded Architectures and Compilers (HiPEAC) | January, 2016.

Abstract

This paper aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique, called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP does not check for or recover from load value mispredictions, hence avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by enabling the execution to continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops some fraction of load requests which miss in the cache after predicting their values. Dropping requests reduces memory bandwidth contention by removing them from the system. The drop rate is a knob to control the tradeoff between performance/energy efficiency and output quality. Our extensive evaluations show that RFVP, when used in GPUs, yields significant performance improvement and energy reduction for a wide range of quality loss levels. We also evaluate RFVP’s latency benefits for a single core CPU. The results show performance improvement and energy reduction for a wide variety of applications with less than 1% loss in quality.
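
A minimal behavioral sketch of rollback-free value prediction with request dropping; the last-value predictor and the data structures are our simplifications of the paper's design:

```python
import random

class RFVP:
    """On a cache miss to a safe-to-approximate load, return a
    predicted value immediately and, with probability `drop_rate`,
    never send the request to memory. A last-value-per-PC predictor
    stands in for the paper's actual predictor."""

    def __init__(self, drop_rate=0.5):
        self.drop_rate = drop_rate
        self.last_value = {}   # pc -> last value observed

    def load(self, pc, addr, cache, memory):
        if addr in cache:
            value = cache[addr]
            self.last_value[pc] = value
        else:
            value = self.last_value.get(pc, 0)   # predict, keep executing
            if random.random() >= self.drop_rate:
                cache[addr] = memory[addr]       # fetch trains the predictor
                self.last_value[pc] = memory[addr]
            # dropped requests never reach memory: bandwidth saved
        return value                              # no check, no rollback

cache, memory = {}, {0x10: 42}
p = RFVP(drop_rate=1.0)   # always drop: first miss yields the default 0
print(p.load(pc=0, addr=0x10, cache=cache, memory=memory))
```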

Neural Acceleration for GPU Throughput Processors

Amir Yazdanbakhsh, Jongse Park, Hardik Sharma, Pejman Lotfi-Kamran, and Hadi Esmaeilzadeh
conference The 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-48), Waikiki, HI, USA | December, 2015.

Abstract

Graphics Processing Units (GPUs) can accelerate diverse classes of applications, such as recognition, gaming, data analytics, weather prediction, and multimedia. Many of these applications are amenable to approximate execution. This application characteristic provides an opportunity to improve GPU performance and efficiency. Among approximation techniques, neural accelerators have been shown to provide significant performance and efficiency gains when augmenting CPU processors. However, the integration of neural accelerators within a GPU processor has remained unexplored. GPUs are, in a sense, many-core accelerators that exploit large degrees of data-level parallelism in the applications through the SIMT execution model. This paper aims to harmoniously bring neural and GPU accelerators together without hindering SIMT execution or adding excessive hardware overhead. We introduce a low overhead neurally accelerated architecture for GPUs, called NGPU, that enables scalable integration of neural accelerators for a large number of GPU cores. This work also devises a mechanism that controls the tradeoff between the quality of results and the benefits from neural acceleration. Compared to the baseline GPU architecture, cycle-accurate simulation results for NGPU show a 2.4× average speedup and a 2.8× average energy reduction within a 10% quality loss margin across a diverse set of benchmarks. The proposed quality control mechanism retains a 1.9× average speedup and a 2.1× energy reduction while reducing the degradation in the quality of results to 2.5%. These benefits are achieved with less than 1% area overhead.

RFVP: Rollback-Free Value Prediction with Safe to Approximate Loads

Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry
journal Transactions on Architecture and Code Optimization (TACO), ACM | December, 2015.

Abstract

This paper aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique, called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP does not check for or recover from load value mispredictions, hence avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by enabling the execution to continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops some fraction of load requests which miss in the cache after predicting their values. Dropping requests reduces memory bandwidth contention by removing them from the system. The drop rate is a knob to control the tradeoff between performance/energy efficiency and output quality. Our extensive evaluations show that RFVP, when used in GPUs, yields significant performance improvement and energy reduction for a wide range of quality loss levels. We also evaluate RFVP’s latency benefits for a single core CPU. The results show performance improvement and energy reduction for a wide variety of applications with less than 1% loss in quality.

Mitigating the Memory Bottleneck with Approximate Load Value Prediction

Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry
journal Design and Test, IEEE | December, 2015.

Abstract

This paper aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth and long access latency. Our approach exploits the inherent error resilience of a wide range of applications through an approximation technique, called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP does not check for or recover from load value mispredictions, hence avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates long memory access latencies by enabling the execution to continue without stalling for these accesses. To mitigate the limited off-chip bandwidth, RFVP drops a fraction of load requests which miss in the cache after predicting their values. The drop rate then becomes a knob to control the tradeoff between performance/energy efficiency and output quality. Our extensive evaluations show that RFVP, when used in GPUs, yields significant performance improvements and energy reductions for a wide range of quality loss levels.

TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning

Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi Esmaeilzadeh
technical report SCS Technical Report | GT-CS-15-07 | Georgia Institute of Technology | September, 2015.

Abstract

A growing number of commercial and enterprise systems increasingly rely on compute-intensive machine learning algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. To accommodate the needs of machine learning algorithms, Field Programmable Gate Arrays (FPGAs) provide a promising path forward and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long design cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for machine learning algorithms, we develop TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as stochastic optimization problems. Therefore, a learning task becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function. The gradient solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework for accelerating this class of learning algorithms. With TABLA, the developer uses a high-level language to only specify the learning model as the gradient of the objective function. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization. We use TABLA to generate accelerators for ten different learning tasks that are implemented on a Xilinx Zynq FPGA platform. We rigorously compare the benefits of the FPGA acceleration to both multicore CPUs (ARM Cortex A15 and Xeon E3) and to many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 15.0× and 2.9× average speedup over the ARM and the Xeon processors, respectively. These accelerators provide 22.7×, 53.7×, and 30.6× higher performance-per-Watt compared to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.

Bridging Analog Neuromorphic and Digital von Neumann Computing

Amir Yazdanbakhsh and Bradley Thwaites
talk Qualcomm® Innovation Fellowship Winners Day | September, 2015.

Abstract

This proposal aims to develop an end-to-end solution—from circuit level through programming model—that enables integration of various analog neuromorphic computing models within the conventional digital von Neumann framework with no disruptive changes to traditional programming languages.

Approximate Computing in Memory Subsystem

Amir Yazdanbakhsh, David G. Stork, and Craig Hampel
talk Intern Review Final Presentations at Rambus® | August, 2015.

Axilog: Abstractions for Approximate Hardware Design and Reuse

Divya Mahajan, Kartik Ramkrishnan, Rudra Jariwala, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Anandhavel Nagendrakumar, Abbas Rahimi, Hadi Esmaeilzadeh, and Kia Bazargan
journal IEEE Micro, special issue on Alternative Computing Designs and Technologies | May, 2015.

Abstract

Relaxing the traditional abstraction of “near-perfect” accuracy in hardware design can yield significant gains in efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions and synthesis tools that can systematically incorporate approximation in hardware design. We define Axilog, a set of language extensions for Verilog, that provides the necessary syntax and semantics for approximate hardware design and reuse. Axilog enables designers to safely relax the accuracy requirements in the design, while keeping the critical parts strictly precise. Axilog is coupled with a Safety Inference Analysis that automatically infers the safe-to-approximate gates and connections from the annotations. The analysis provides formal guarantees that the safe-to-approximate parts of the design are in strict accordance with the designer’s intentions. We devise two synthesis flows that leverage Axilog’s framework for safe approximation: one by relaxing the timing requirements and the other through gate resizing. We evaluate Axilog using a diverse set of benchmarks that gain 1.54× average energy savings and 1.82× average area reduction with 10% output quality loss. The results show that the intuitive nature of the language extensions coupled with the automated analysis enables safe approximation of designs even with thousands of lines of code.
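
The Safety Inference Analysis can be pictured as a backward reachability pass over the netlist: everything in the transitive fan-in of an output the designer keeps precise must stay precise, and the rest is safe to approximate. A toy rendering in Python (the data structures are ours, not Axilog's algorithm):

```python
def infer_safe_to_approximate(fanin, precise_outputs):
    """fanin: dict gate -> list of driver gates (the netlist).
    Gates reachable backward from any precise output must remain
    precise; all other gates may be approximated."""
    precise = set()
    stack = list(precise_outputs)
    while stack:
        g = stack.pop()
        if g in precise:
            continue
        precise.add(g)
        stack.extend(fanin.get(g, []))
    all_gates = set(fanin) | {d for ds in fanin.values() for d in ds}
    return all_gates - precise

netlist = {'out1': ['g1'], 'out2': ['g2'], 'g1': ['a'], 'g2': ['a', 'b']}
print(infer_safe_to_approximate(netlist, ['out1']))
# -> {'out2', 'g2', 'b'} may be approximated; 'out1', 'g1', 'a' stay precise
```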

Prediction-Based Quality Control for Approximate Accelerators

Divya Mahajan, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, and Hadi Esmaeilzadeh
workshop Workshop on Approximate Computing Across the System Stack (WACAS) co-located with ASPLOS 2015 | March, 2015.

Abstract

Approximate accelerators are an emerging type of accelerator that trade output quality for significant gains in performance and energy efficiency. Conventionally, the approximate accelerator is always invoked in lieu of a frequently executed region of code (e.g., a function in a loop). However, always invoking the accelerator results in a fixed degree of error that may not be desirable. Our core idea is to predict whether each individual accelerator invocation will lead to an undesirable quality loss in the final output. We therefore design and evaluate predictors that only leverage information local to that specific potential invocation. If the predictor speculates that a large quality degradation is likely, it directs the core to run the original precise code instead. We use neural networks as an alternative prediction mechanism for quality control that also provides a realistic reference point to evaluate the effectiveness of our table-based predictor. Our evaluation comprises a set of benchmarks with diverse error behavior. For these benchmarks a table-based predictor with eight tables each of size 0.5KB achieves 2.6× average speedup and 2.8× average energy reduction with a 5% error requirement. The neural predictor yields 4% and 17% larger performance and energy gains, respectively. On average, an idealized oracle predictor with prior knowledge about all invocations achieves only 26% more performance and 37% more energy benefits compared to the table-based predictor.

Online and Operand-Aware Detection of Failures Utilizing False Alarm Vectors

Amir Yazdanbakhsh, David Palframan, Azadeh Davoodi, Nam Sung Kim, and Mikko Lipasti
conference Great Lakes Symposium on VLSI (GLSVLSI) | May, 2015.

Abstract

This work presents a framework which detects online and at operand level of granularity all the vectors which excite already-diagnosed failures in combinational modules. These vectors may be due to various types of failure which may even change over time. Our framework is flexible, with the ability to update the vectors in the future. Moreover, the ability to detect failures at operand level of granularity can be useful to improve yield, for example by not discarding those chips containing failing and redundant computational units (e.g., two failing ALUs) as long as they are not failing at the same time. The main challenge in realization of such a framework is the ability for on-chip storage of all the (test) cubes which excite the set of diagnosed failures, e.g., all vectors that excite one or more slow paths or defective gates. The number of such test cubes can be enormous even after applying various minimization techniques, thereby making it impossible for on-chip storage and online detection. A major contribution of this work is to significantly minimize the number of stored test cubes by inserting only a few but carefully-selected “false alarm” vectors. As a result, a computational unit may be misdiagnosed as failing for a given operand; however, we show such cases are rare and the chip can safely continue to be used, i.e., our approach ensures that none of the true-positive failures are missed.
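
The online check can be pictured as matching an operand against stored ternary test cubes, where don't-care bits are what the false-alarm insertion exploits to shrink the stored set; this toy version is our illustration:

```python
def matches(cube, operand_bits):
    """A test cube is a string over {'0','1','-'} ('-' = don't care).
    An operand excites a diagnosed failure iff some stored cube
    matches it."""
    return all(c == '-' or c == b for c, b in zip(cube, operand_bits))

# Minimized cube set, possibly augmented with false-alarm vectors
# so fewer, wider cubes cover all true failing operands.
cubes = ['1-0-', '01--']

def failing(operand, width=4):
    bits = format(operand, '0{}b'.format(width))
    return any(matches(c, bits) for c in cubes)

print(failing(0b1000), failing(0b0111))  # True, True
```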

Neural Acceleration for GPU Throughput Processors

Amir Yazdanbakhsh, Jongse Park, Hardik Sharma, Pejman Lotfi-Kamran, and Hadi Esmaeilzadeh
technical report SCS Technical Report | GT-CS-15-05 | Georgia Institute of Technology | September, 2015.

Abstract

General-purpose computing on graphics processing units (GPGPU) accelerates the execution of diverse classes of applications, such as recognition, gaming, data analytics, weather prediction, and multimedia. Many of these applications are amenable to approximate execution. This application characteristic provides an opportunity to improve the performance and efficiency of GPGPU. Recent work has shown significant gains with neural approximate acceleration for CPU workloads. This work studies the effectiveness of neural approximate acceleration for GPU workloads. As applying CPU neural accelerators to GPUs leads to high area overhead, we define a low overhead neurally accelerated architecture for GPGPUs that enables scalable integration of neural acceleration across the large number of GPU cores. We also devise a mechanism that controls the tradeoff between the quality of results and the benefits from neural acceleration. We evaluate this design on a modern GPU architecture using a diverse set of benchmarks. Compared to the baseline GPGPU architecture, the cycle-accurate simulation results show 2.4× average speedup and 2.8× average energy reduction with 10% quality loss across all benchmarks. The quality control mechanism retains 1.9× average speedup and 2.1× energy reduction while reducing the quality degradation to 2.5%. These benefits are achieved with approximately 1.2% area overhead.

A Wireless Neural Recording SoC and Implantable Microsystem Integration

Lian Duan, Tao Wang, Siwei Wang, and Amir Yazdanbakhsh
technical report SCS Technical Report | GT-CS-15-06 | Georgia Institute of Technology | September, 2015.

Abstract

An integrated 4-channel wireless neural recording system architecture is proposed. The system is designed to detect extracellular action potentials in the brain. A highly power-efficient front-end signal-processing chain, a spike detector, an analog-to-digital converter, an on-chip power management system, and an FSK transmitter are designed and implemented in a 0.5 μm CMOS process.

Axilog: Language Support for Approximate Hardware Design

Amir Yazdanbakhsh, Divya Mahajan, Bradley Thwaites, Jongse Park, Anandhavel Nagendrakumar, Sindhuja Sethuraman, Kartik Ramkrishnan, Nishanthi Ravindran, Rudra Jariwala, Abbas Rahimi, Hadi Esmaeilzadeh, and Kia Bazargan
conference Design, Automation and Test in Europe (DATE) | March, 2015.

Abstract

Relaxing the traditional abstraction of “near-perfect” accuracy in hardware design can lead to significant gains in energy efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions that can systematically incorporate approximation in hardware design. We introduce Axilog, a set of language annotations that provides the necessary syntax and semantics for approximate hardware design and reuse in Verilog. Axilog enables the designer to relax the accuracy requirements in certain parts of the design, while keeping the critical parts strictly precise. Axilog is coupled with a Relaxability Inference Analysis that automatically infers the relaxable gates and connections from the designer’s annotations. The analysis provides formal safety guarantees that approximation will only affect the parts that the designer intended to approximate, referred to as relaxable elements. Finally, the paper describes a synthesis flow that approximates only the relaxable elements. Axilog enables applying approximation in the synthesis process while abstracting away the details of approximate synthesis from the designer. We evaluate Axilog, its analysis, and the synthesis flow using a diverse set of benchmark designs. The results show that the intuitive nature of the language extensions coupled with the automated analysis enables safe approximation of designs even with thousands of lines of code. Applying our approximate synthesis flow to these designs yields, on average, 54% energy savings and 1.9× area reduction with 10% output quality loss.

Rollback-Free Value Prediction with Approximate Loads

Bradley Thwaites, Gennady Pekhimenko, Amir Yazdanbakhsh, Jongse Park, Girish Mururu, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry
conference The 23rd International Conference on Parallel Architecture and Compiler Techniques (PACT'14) | August 24-27, 2014.

Abstract

This paper demonstrates how to utilize the inherent error resilience of a wide range of applications to mitigate the memory wall—the discrepancy between core and memory speed. We define a new microarchitecturally-triggered approximation technique called rollback-free value prediction. This technique predicts the value of safe-to-approximate loads when they miss in the cache without tracking mispredictions or requiring costly recovery from misspeculations. This technique mitigates the memory wall by allowing the core to continue computation without stalling for long-latency memory accesses. Our detailed study of the quality trade-offs shows that with a modern out-of-order processor, an average 8% (up to 19%) performance improvement is possible with 0.8% (up to 1.8%) average quality loss on an approximable subset of SPEC CPU 2000/2006.

Customized Pipeline and Instruction Set Architecture for Embedded Processing Engines

Amir Yazdanbakhsh, Mostafa E. Salehi, and Sied Mehdi Fakhraie
journal The Journal of Supercomputing, Springer | February, 2014.

Abstract

Custom instructions potentially improve execution speed and code compression of embedded applications. However, more efficient custom instructions need a higher number of simultaneous registerfile accesses. Larger registerfiles are more power hungry and have complex forwarding interconnects. Therefore, due to the limited ports of the base processor registerfile, the size and efficiency of custom instructions are generally limited. Recent research has focused on overcoming this limitation with innovative architectural techniques supplemented by customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available registerfile data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose registerfile accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.

Implementation-aware Selection of the Custom Instruction Set for Extensible Processors

Amir Yazdanbakhsh, Mehdi Kamal, Sied Mehdi Fakhraie, Ali Afzali-Kusha, Saeed Safari, and Massoud Pedram
journal Microprocessors and Microsystems (MICPRO), Elsevier | June, 2014.

Abstract

This paper presents an approach for incorporating the effect of various logic synthesis options and logic level implementations into the custom instruction (CI) selection for extensible processors. This effect translates into the availability of a piecewise continuous spectrum of delay versus area choices for each CI, which in turn influences the selection of the CI set that maximizes the speedup per area cost (SPA) metric. The effectiveness of the proposed approach is evaluated by applying it to several benchmarks and comparing the results with those of a conventional technique. We also apply the methodology to the existing serialization algorithms aimed at relaxing register file constraints in multi-cycle custom instruction design. The comparison shows considerable improvements in the speedup per area compared to the custom instruction selection algorithms under the same area-budget constraint.

General-Purpose Code Acceleration with Limited-Precision Analog Computation

Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger
conference The 41st International Symposium on Computer Architecture (ISCA'14) | June 14-18, 2014.

  Honorable Mention in IEEE Micro Top Picks

   Nominated for CACM Research Highlights

Abstract

As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate “approximable” code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of approximation-tolerant, emerging applications including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.
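
The effect of restricted-range encoding and limited precision can be sketched by inserting a quantization step around every value in the neural model's forward pass; the range and bit width below are our assumptions, not the paper's measured circuit parameters:

```python
import math

def quantize(x, bits=8, lo=-4.0, hi=4.0):
    """Clamp and quantize to the restricted value range an analog
    datapath can encode (range and width are illustrative)."""
    x = max(lo, min(hi, x))
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return lo + round((x - lo) / step) * step

def neural_region(inputs, weights_h, weights_o):
    """Forward pass standing in for an approximable code region
    after the neural transformation; every value passes through
    quantize() to mimic limited-precision analog evaluation."""
    hidden = [math.tanh(quantize(sum(w * i for w, i in zip(ws, inputs))))
              for ws in weights_h]
    return quantize(sum(w * h for w, h in zip(weights_o, hidden)))

print(neural_region([0.5, -1.0],
                    [[0.3, -0.2], [0.8, 0.1]],
                    [1.0, -0.5]))
```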

Bridging Analog Neuromorphic and Digital von Neumann Computing

Amir Yazdanbakhsh, Bradley Thwaites, Hadi Esmaeilzadeh, and Doug Burger
talk The 6th Qualcomm Innovation Fellowship (QInF'14) | March 24, 2014.

Abstract

This proposal aims to develop an end-to-end solution—from circuit level through programming model—that enables integration of various analog neuromorphic computing models within the conventional digital von Neumann framework with no disruptive changes to traditional programming languages.

Toward General-Purpose Code Acceleration with Analog Computation

Amir Yazdanbakhsh, Renée St. Amant, Bradley Thwaites, Jongse Park, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger
workshop The 1st Workshop on Approximate Computing Across the System Stack (WACAS) co-located with ASPLOS'14 | March 2, 2014.

Abstract

We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate “approximable” code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of emerging applications.

Bio-Accelerators: Bridging Biology and Silicon for General-Purpose Computing

Bradley Thwaites, Amir Yazdanbakhsh, Jongse Park, and Hadi Esmaeilzadeh
workshop Wild and Crazy Ideas (WACI) co-located with ASPLOS'14 | March 2, 2014.

Abstract

In our past work we described offloading an approximable region of code to a fast and efficient neural processing unit made, naturally, in silicon. Here we envision a similar system, but instead using the most powerful neural network of all: a biological neural network as the accelerator.

Comprehensive Circuit Failure Prediction for Logic and SRAM using Virtual Aging

Amir Yazdanbakhsh, Raghuraman Balasubramanian, Tony Nowatzki, and Karthikeyan Sankaralingam
journal IEEE Micro, special series on Harsh Chips | September, 2014.

Abstract

This paper develops a comprehensive technique for predicting failures in the field for many-core processors, addressing wear-out in harsh environments for both logic and SRAM. We develop the following three principles. Virtually aging a processor by momentarily reducing its voltage exposes wear-out failures. We then use sampled redundancy to capture logic wear-out failures, since their underlying fault model is delay faults, making them poorly suited to test-vector based techniques. Wear-out on SRAMs, on the other hand, decreases their noise margin, and the end result can be effectively modeled as a stuck-at fault, thus allowing asymmetric checkers like BIST to work effectively. Our design, comprising two components (Aged-SDMR and Aged-AsymChk), has a simple implementation and delivers low complexity, low overheads, and high accuracy. In addition to ensuring no corruptions or missed errors from wear-out failures, our full system predicts failures within 0.4 days for logic and within milliseconds for SRAM after their appearance. Furthermore, compared to SRAMs protected with ECC only and decommissioned on first failure, we extend lifetime by 14 months on average.

Methodical Approximate Hardware Design and Reuse

Amir Yazdanbakhsh, Bradley Thwaites, Jongse Park, and Hadi Esmaeilzadeh
workshop The 1st Workshop on Approximate Computing Across the System Stack (WACAS) co-located with ASPLOS'14 | March 2, 2014.

Abstract

Design and reuse of approximate hardware components— digital circuits that may produce inaccurate results —can potentially lead to significant performance and energy improvements. Many emerging error-resilient applications can exploit such designs provided approximation is applied in a controlled manner. This paper provides the design abstractions and semantics for methodical, modular, and controlled approximate hardware design and reuse. With these abstractions, critical parts of the circuit still carry the strict semantics of traditional hardware design, while flexibility is provided. We discuss these abstractions in the context of synthesizable register transfer level (RTL) design with Verilog. Our framework governs the application of approximation during the synthesis process without involving the designers in the details of approximate synthesis and optimization. Through high-level annotations, our design paradigm provides high-level control over where and to what degree approximation is applied. We believe that our work forms a foundation for practical approximate hardware design and reuse.

Low Energy Hardening of Combinatorial Logic using Standard Cells and Residue Codes

Michel D. Sika, Amir Yazdanbakhsh, Bradley Kiddie, Jonathan Ahlbin, Michael Bajura, Michael Fritze, John Damoulakis, and John Granacki
conference 39th Annual Government Microcircuit Applications & Critical Technology (GOMACTech) | March 2014.

Abstract

A novel low-power method for hardening combinatorial logic against Single Event Upsets (SEU), based on Residue Arithmetic Codes (RAC) and implemented with standard logic cells, is presented. Simulations and analysis validate the effectiveness of embedded RAC logic at detecting and correcting single-event upsets (SEUs) in digital arithmetic logic units at low voltage. Simulations show that compared to conventional redundancy-based Radiation Hardening by Design (RHBD) methods such as Triple Module Redundancy (TMR), RAC-hardened digital arithmetic logic units require 157% less energy per bit processed, 1.42× less propagation delay, and up to 4× less area.
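
The residue-checking idea itself is compact enough to sketch: compute the result and its small modular residue in parallel and flag a mismatch. A mod-3 toy model in Python (our illustration, not the paper's gate-level design):

```python
def rac_add(a, b, mod=3):
    """Residue-checked addition: the sum and the modular residues
    are computed independently; a mismatch flags an upset. Mod-3
    catches any single-bit flip, since 2**k mod 3 is never 0."""
    s = a + b                              # main (possibly upset) datapath
    check = (a % mod + b % mod) % mod      # cheap parallel residue datapath
    return s, (s % mod) == check

print(rac_add(123, 456))                   # (579, True)
flipped = (123 + 456) ^ 0b100              # inject a single-bit upset
print(flipped % 3 == ((123 % 3 + 456 % 3) % 3))  # False -> detected
```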

Applying Residue Arithmetic Codes to Combinational Logic to Reduce Single Event Upsets

Michel D. Sika, Amir Yazdanbakhsh, Bradley Kiddie, Jonathan Ahlbin, Michael Bajura, Michael Fritze, John Damoulakis, and John Granacki
conference Radiation Effects on Components and Systems (RADECS) | September 2013.

Abstract

Mitigating Single Event Upsets (SEU) in combinatorial logic is conventionally accomplished through redundancy-based Radiation Hardening By Design (RHBD) methods such as Triple Module Redundancy (TMR). A hardening technique based on residue arithmetic codes (RAC) is proposed as a lower-overhead alternative for detecting and correcting SEUs in arithmetic logic units. Simulations and analyses at the 45nm node show that RAC detects over 99% of faults with 2.6× less area and 157% less energy than TMR.

A New Merit Function for Custom Instruction Selection Under an Area Budget Constraint

Mehdi Kamal, Amir Yazdanbakhsh, Hamid Noori, Ali Afzali-Kusha, and Massoud Pedram
journal Design Automation for Embedded Systems (DAEM), Springer | September, 2013.

Abstract

This paper presents a new merit function for the custom instruction selection phase of the design flow of application-specific instruction-set processors (ASIPs) in the presence of an area budget constraint. In contrast to nearly all of the previously proposed approaches, where the ratio of the ASIP speed to layout area is used as a merit function to select the candidate custom instructions (CIs), we show that a merit function based on normalized cycle saving and area can result in better CI selections in terms of the achievable speedup under a given area budget for both greedy and branch-and-bound techniques. The efficacy of the proposed approach is assessed by comparing the results of using the proposed and conventional merit functions for different benchmarks. The comparison points toward an average (maximum) speed enhancement of 3.65% (27.4%) for the proposed merit function compared to the conventional merit functions.
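
Greedy CI selection under an area budget is easy to sketch; the conventional speedup-per-area ratio appears as the sort key below, and the paper's normalized merit function would simply replace it (the candidate numbers are made up):

```python
def select_cis(candidates, area_budget):
    """Greedy CI selection under an area budget. Each candidate is
    (name, cycles_saved, area). The merit used here, cycles saved
    per unit area, is the conventional ratio; an alternative merit
    function would slot in at the `key=` below."""
    chosen, used = [], 0
    for name, saved, area in sorted(candidates,
                                    key=lambda c: c[1] / c[2],
                                    reverse=True):
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen

cis = [('ci_mac', 120, 30), ('ci_crc', 90, 40), ('ci_shift', 25, 5)]
print(select_cis(cis, 40))   # ['ci_shift', 'ci_mac'] under this merit
```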

Online and Operand-Aware Detection of Failures by Utilizing False Alarm Vectors

Amir Yazdanbakhsh, David Palframan, Azadeh Davoodi, Nam Sung Kim, and Mikko Lipasti
workshop The 22nd International Workshop on Logic and Synthesis (IWLS-22) co-located with DAC'13 | June 7, 2013.

Abstract

This work presents a framework which detects online and at operand level of granularity all the vectors which excite already-diagnosed failures in combinational modules. These vectors may be due to various types of failure which may even change over time. Our framework is flexible, with the ability to update the vectors in the future. Moreover, the ability to detect failures at operand level of granularity can be useful to improve yield, for example by not discarding those chips containing failing and redundant computational units (e.g., two failing ALUs) as long as they are not failing at the same time. The main challenge in realization of such a framework is the ability for on-chip storage of all the (test) cubes which excite the set of diagnosed failures, e.g., all vectors that excite one or more slow paths or defective gates. The number of such test cubes can be enormous even after applying various minimization techniques, thereby making it impossible for on-chip storage and online detection. A major contribution of this work is to significantly minimize the number of stored test cubes by inserting only a few but carefully-selected “false alarm” vectors. As a result, a computational unit may be mis-diagnosed as failing for a given operand; however, we show such cases are rare and the chip can safely continue to be used, i.e., our approach ensures that none of the true-positive failures are missed.

Instruction Set Architectural Guidelines for Embedded Packet-Processing Engines

Mostafa E. Salehi, Sied Mehdi Fakhraie, and Amir Yazdanbakhsh
journal Journal of Systems Architecture, Vol. 58 | March, 2012.

Abstract

This paper presents instruction set architectural guidelines for improving general-purpose embedded processors to optimally accommodate packet-processing applications. Similar to other embedded processors such as media processors, packet-processing engines are deployed in embedded applications, where cost and power are as important as performance. In this domain, the growing demands for higher bandwidth and performance, besides the ongoing development of new networking protocols and applications, call for flexible power- and performance-optimized engines. The instruction set architectural guidelines are extracted from an exhaustive simulation-based, profile-driven quantitative analysis of different packet-processing workloads on 32-bit versions of two well-known general-purpose processors, ARM and MIPS. This extensive study reveals the main performance challenges and tradeoffs in evolving such general-purpose processors to optimally accommodate packet-processing functions in future switching-intensive applications. The architectural guidelines include the types of instructions, branch offset size, displacement and immediate addressing modes for memory access along with the effective sizes of these fields, data types of memory operations, and new branch instructions. The effectiveness of the proposed guidelines is evaluated with the development of a retargetable compilation and simulation framework. Developing the HDL model of the optimized base processor for networking applications and using a logic synthesis tool, we show improvements in area, power, delay, and performance per watt.

Dynamic Soft Error Hardening via Joint Body Biasing and Dynamic Voltage Scaling

Farshad Firouzi, Amir Yazdanbakhsh, Hamed Dorosti, and Sied Mehdi Fakhraie
conference 14th Euromicro Conference on Digital System Design (DSD) | August 2011.

Abstract

Shrinking feature sizes, reduced voltages, and higher transistor counts of nano-scale silicon chips challenge designers in terms of performance, power consumption, and reliability. This paper investigates the effect of simultaneous use of dynamic voltage and frequency scaling (DVFS) and body biasing (BB) on power consumption, reliability, and performance. An analytical model of reliability as a function of body bias voltage, supply voltage, and frequency is proposed. We derive a three-dimensional optimization problem by exploiting the proposed reliability model in conjunction with power consumption and performance models. The resulting problem is solved using widely used geometric optimization to identify the optimal supply voltage and body bias voltage, and is then validated using accurate simulation. Afterwards, we demonstrate how this joint energy-performance-reliability optimization method can be used in adaptive reliability-aware power management systems. Finally, we show that combined soft-error-aware BB and DVFS reduces power consumption by about 30% compared to reliability-aware DVFS alone for the same level of reliability and performance constraints.
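
Schematically, the joint optimization has the following shape; this rendering is our paraphrase of the problem statement, not the paper's exact model:

```latex
% Choose supply voltage V_dd, body bias V_bb, and frequency f to
% minimize power under reliability and performance constraints
% (schematic form; the paper's model terms may differ).
\begin{align*}
\min_{V_{dd},\,V_{bb},\,f}\quad
  & P_{\mathrm{dyn}}(V_{dd}, f) + P_{\mathrm{leak}}(V_{dd}, V_{bb})\\
\text{s.t.}\quad
  & R(V_{dd}, V_{bb}, f) \ge R_{\min},\\
  & f \ge f_{\min}, \qquad f \le f_{\max}(V_{dd}, V_{bb}).
\end{align*}
```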

Customized High-Performance and Low-Power Wireless Sensor Network (WSN)

Amir Yazdanbakhsh, Sied Mehdi Fakhraie, Mostafa E. Salehi, Hamed Dorosti, and Alireza Mazraei-Farahani
patent Iran Patent (No. 79179) | Sep. 2011.

Computer System Memory Failure Prediction

Amir Yazdanbakhsh, Raghuraman Balasubramanian, Anthony Nowatzki, and Karthikeyan Sankaralingam
patent US P150070US01 | filed January 2015 (pending).

Energy-Aware Design Space Exploration of RegisterFile for Extensible Processors

Amir Yazdanbakhsh, Mehdi Kamal, Mostafa E. Salehi, Hamid Noori, and Sied Mehdi Fakhraie
conference The 10th International Conference on Embedded Computer Systems: Architecture, Modeling and Simulation (SAMOS) | July 19-22, 2010.

Abstract

This paper describes an energy-aware methodology that identifies custom instructions for critical code segments, given the available data bandwidth constraint between custom logic and a base processor. Our approach enables designers to optionally constrain the number of input and output operands for custom instructions to reach the acceptable performance considering the energy dissipation of the registerfile. We describe a design flow to identify promising area, performance, and power tradeoffs. We study the effect of custom instruction I/O constraints and registerfile input/output (I/O) ports on overall performance and energy usage of the registerfile. Our experiments show that, in most cases, the solutions with the highest performance are not identified with relaxed I/O constraints. Results for packet-processing benchmarks covering cryptography and lookup applications are shown, with speed-ups between 25% and 40%, and energy reduction between 20% and 30%.
