Project

Design Space Exploration of Deep Learning Accelerators

Over the past decade, deep learning has reshaped the landscape of Artificial Intelligence, driven by the evolution of hardware computing capability, algorithmic improvements, and the ever-growing volume of data. Fueled by a wide range of use cases across industries such as computer vision, natural language processing, healthcare, finance, manufacturing, and robotics, deep learning has gained enormous momentum in both development and deployment.

In deployment, processing deep learning models fast and efficiently is challenging because of their computationally intensive, data-intensive, and diverse nature. At the same time, this efficiency is critical across many use cases, especially in resource-constrained scenarios such as mobile and IoT devices. As a result, numerous specialized hardware accelerators have been built that exploit the intrinsically highly parallelizable compute patterns of deep learning models.

These accelerators usually consist of an array of multiply-and-accumulate units for parallel computing, a memory hierarchy for feeding and storing data, and a pre-defined scheduling controller for orchestrating the computation and data movement. Since processing efficiency is tightly coupled to these design components, constructing them carefully is crucial. However, because the design spaces of these components are vast and intertwined, and the traditional digital design flow is too slow to evaluate every single design option, it is difficult to move beyond an ad-hoc design paradigm toward a globally optimal solution.
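To make this concrete, the sketch below (illustrative only, not taken from the thesis) shows the kind of loop nest such an accelerator parallelizes: an output-stationary matrix multiply, in which each multiply-and-accumulate unit keeps its partial sum in a local register while operands are streamed from the memory hierarchy by the scheduling controller.

    import numpy as np

    def mac_array_matmul(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
        """Output-stationary schedule: each (m, n) output stays in a local
        accumulator while its operands stream past, mirroring what a
        pre-defined scheduling controller orchestrates in hardware."""
        M, K = weights.shape
        K2, N = activations.shape
        assert K == K2
        outputs = np.zeros((M, N))
        for m in range(M):            # spatially unrolled over PE rows
            for n in range(N):        # spatially unrolled over PE columns
                acc = 0.0             # local accumulator register
                for k in range(K):    # temporal loop: operands fetched from memory
                    acc += weights[m, k] * activations[k, n]
                outputs[m, n] = acc
        return outputs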

To address this limitation, early-phase design space exploration is required. This thesis contributes to this goal by, firstly, systematically identifying the design space of deep learning accelerators at different levels of abstraction and, secondly, insightfully exploring these single and joint design spaces. At each abstraction level, this thesis focuses on different exploration parameters:

At the multiply-and-accumulate unit and array level, this thesis studies low-precision and variable-precision computing datapaths, which have been shown to bring substantial latency and energy benefits for resource-constrained systems, with no or only minor algorithmic accuracy loss. This work constructs systematic taxonomies for precision-scalable multiply-and-accumulate units and arrays after identifying the different design options. These taxonomies depict the skeleton of the design spaces, not only covering the existing state-of-the-art precision-scalable designs but also uncovering new, unexplored architectural options. The different design options are then thoroughly benchmarked with the traditional digital synthesis flow, revealing interesting tradeoffs and design insights.
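As a small, hedged illustration of what precision scalability means at the datapath level (the structure below is a generic decomposition, not a specific circuit from the taxonomies), one 8-bit multiplication can be composed from four 4-bit sub-multiplications; a precision-scalable unit can either fuse its sub-units this way for full precision or let them operate independently on four low-precision operand pairs.

    def mul8_from_mul4(a: int, b: int) -> int:
        """Compose one unsigned 8-bit multiply from four 4-bit multiplies."""
        a_hi, a_lo = a >> 4, a & 0xF
        b_hi, b_lo = b >> 4, b & 0xF
        # Four 4-bit partial products, shifted and summed ("full-precision" mode);
        # running the four multipliers separately instead yields 4x the
        # throughput on low-precision operands.
        return ((a_hi * b_hi) << 8) + ((a_hi * b_lo + a_lo * b_hi) << 4) + (a_lo * b_lo)

    assert mul8_from_mul4(173, 90) == 173 * 90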

Moving one step up in abstraction, to the single-core accelerator level, we combine the multiply-and-accumulate array with a memory hierarchy, together with various mapping and scheduling possibilities. This thesis builds two fast, high-level architecture-mapping design space exploration frameworks: ZigZag and DeFiNES. ZigZag focuses on single-layer mapping, while DeFiNES extends ZigZag to support depth-first scheduling. Thanks to deep learning's deterministic computing pattern, the built-in analytical cost models enable these frameworks to estimate the energy and latency breakdown of processing a deep learning model on a customized accelerator in milliseconds to seconds, paving the way toward fast architecture and mapping search and optimization. Several model validation experiments and multiple case studies demonstrate the reliability and capabilities of these frameworks.
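The core idea behind such analytical cost models can be sketched in a few lines. The toy model below is only in the spirit of these frameworks (it is not ZigZag's or DeFiNES's actual API, and the energy constants are invented): because a deep learning layer's operation and memory-access counts are fully deterministic, energy follows from weighted access counts and latency from the compute/bandwidth bottleneck.

    # Assumed per-operation energies in picojoules (illustrative only).
    ENERGY_PJ = {"mac": 0.5, "reg": 0.1, "sram": 5.0, "dram": 100.0}

    def layer_cost(macs: int, accesses: dict[str, int], peak_macs_per_cycle: int,
                   dram_bw_words_per_cycle: float) -> tuple[float, float]:
        # Energy: every MAC and every memory access at each hierarchy level
        # contributes a fixed, known cost.
        energy = macs * ENERGY_PJ["mac"] + sum(
            accesses[level] * ENERGY_PJ[level] for level in accesses)
        # Latency: whichever of compute or off-chip traffic is the bottleneck.
        latency = max(macs / peak_macs_per_cycle,
                      accesses.get("dram", 0) / dram_bw_words_per_cycle)
        return energy, latency

    # Example: a 1 GMAC layer with mostly on-chip traffic.
    print(layer_cost(macs=10**9,
                     accesses={"reg": 3 * 10**9, "sram": 10**8, "dram": 10**7},
                     peak_macs_per_cycle=1024,
                     dram_bw_words_per_cycle=8.0))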

Recently, ever-growing model diversity and size have been driving deep learning accelerator design to the multi-core level, which combines several accelerator cores with a network-on-chip, together with a massive number of scheduling and layer-core allocation possibilities. At this level, the complexity of the design space increases further, yet this thesis provides a framework, Stream, to tackle it systematically. Stream is a high-level multi-core accelerator modeling and design space exploration framework built upon ZigZag. It can explore different core architectures, core-to-core communication topologies, layer-core allocations, and fine-grained layer-fused scheduling, supporting various deep learning models. Stream paves the way for fast and systematic multi-core deep learning accelerator design and workload deployment.
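To give a feel for the layer-core allocation part of this design space, the toy snippet below greedily assigns each layer to the core that finishes it fastest in isolation. All core names, throughputs, and overheads are invented, and this is not Stream's algorithm: Stream additionally models inter-core communication and fine-grained layer fusion.

    # (layer name, MAC count) and (MACs per cycle, offload overhead in cycles).
    layers = [("conv1", 90_000_000), ("dwconv2", 200_000), ("fc3", 50_000)]
    cores = {"big_array": (512, 20_000), "small_simd": (32, 500)}

    def greedy_allocate(layers, cores):
        def cycles(macs, core):
            throughput, overhead = cores[core]
            return macs / throughput + overhead
        # A real exploration would also weigh core-to-core transfers and the
        # overlap between computation and communication.
        return {name: min(cores, key=lambda c: cycles(macs, c))
                for name, macs in layers}

    print(greedy_allocate(layers, cores))  # the small layers land on the small core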

It is important to note that the creation of these different frameworks followed a similar three-step methodology: firstly, identify the different design options and construct a unified design representation that covers them; secondly, build the cost models on top of this unified representation; lastly, automatically generate different design candidates and feed them to the cost models. In this way, the loop is closed, and the design space exploration can be conducted automatically.
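This closed loop fits in a few lines of illustrative Python. The sketch below is an assumption-laden toy (the candidate parameters, cost constants, and objective are made up, not the frameworks' real interfaces): candidates are generated from a unified representation, scored by a cost model, and the best one is retained.

    from itertools import product

    def toy_cost_model(candidate):
        # Placeholder analytics with invented constants: more PEs cut latency
        # but raise energy; a larger SRAM reduces costly off-chip traffic.
        latency = 1e9 / candidate["pe_array"]
        energy = candidate["pe_array"] * 0.5 + 1e6 / candidate["sram_kb"]
        return energy, latency

    def explore(pe_counts=(64, 256, 1024), sram_sizes_kb=(64, 256)):
        best = None
        for pes, sram in product(pe_counts, sram_sizes_kb):
            candidate = {"pe_array": pes, "sram_kb": sram}   # step 1: generate
            energy, latency = toy_cost_model(candidate)      # step 2: evaluate
            score = energy * latency                         # energy-delay product
            if best is None or score < best[0]:              # step 3: select
                best = (score, candidate)
        return best

    print(explore())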

In summary, this thesis aims to clearly introduce the vast design space of deep learning accelerators at the different abstraction levels and to thoroughly explain how high-level design space exploration frameworks can be built to rapidly offer design insights and guidelines. By releasing the developed frameworks as open source, we pass on the taxonomy, modeling, and exploration methodologies applied in this thesis to future researchers.

Date: 31 Aug 2018 → 24 Aug 2023
Keywords: machine learning, embedded processor
Disciplines: Sensors, biosensors and smart sensors; Other electrical and electronic engineering; Nanotechnology; Design theories and methods
Project type: PhD project