
Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?

by Jaemin Kang on 2020-09-16 13:24:59

Date: 2020. 09. 21 (Mon) 14:00 Locate: EB5. 533 Presenter: Jaemin Kang Title: Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? Author: Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Gee Hock Ong, Yeong Tat Liew, Krishnan Srivatsan, Duncan Moss, Suchit Subhaschandra, Guy Boudoukh Abstract: Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used for accelerating DNNs. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not offer the performance of today’s GPUs on DNNs. In this paper, we look at upcoming FPGA technology advances, the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming ... Continue reading →


Performance, Design, and Autotuning of Batched GEMM for GPUs

by Sujin Kim on 2020-09-10 17:59:44

Date: 2020. 09. 14 (Mon) 14:00-16:00 Locate: EB5. 533 Presenter: Sujin Kim Title: Performance, Design, and Autotuning of Batched GEMM for GPUs Author: Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra (Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, Oak Ridge, USA; University of Manchester, UK) Abstract: The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, there becomes a need to have a high performance GEMM kernel for a batch of small matrices. Such kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK ... Continue reading →
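To make the term "batched GEMM" in the abstract above concrete, here is a minimal reference sketch in plain Python. It only illustrates the semantics (one independent small matrix product per batch entry) and is not the paper's GPU kernel or its autotuning scheme; the function names `gemm` and `batched_gemm` are hypothetical.

```python
def gemm(a, b):
    """Plain matrix product C = A * B for matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def batched_gemm(batch_a, batch_b):
    """Apply gemm independently to each (A, B) pair in the batch.

    On a GPU these independent products would be spread over thread blocks;
    this sequential loop only shows what result is being computed.
    """
    return [gemm(a, b) for a, b in zip(batch_a, batch_b)]


# Two small 2x2 problems processed as one batch.
batch_a = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
batch_b = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]
result = batched_gemm(batch_a, batch_b)
```

Because every pair in the batch is independent, the work can be distributed freely, which is what makes small-matrix batches a reasonable target for a dedicated kernel.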


ZeroQ: A Novel Zero Shot Quantization Framework

by Jinse Kwon on 2020-08-07 17:51:18

Date : 2020. 08. 11 (Tue) 16:00 Locate : EB5. 607 Title : ZeroQ: A Novel Zero Shot Quantization Framework Author : Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer (Peking University, University of California, Berkeley) Abstract : Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we propose ZeroQ, a novel zero-shot quantization framework to address this. ZeroQ enables mixed-precision quantization without any access to the ... Continue reading →


NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems

by Sihyeong Park on 2020-08-04 09:27:13

Date : 2020. 08. 24 (Tue) 16:00 Locate : EB5. 607 Title : NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems Authors : Soroush Bateni and Cong Liu, University of Texas at Dallas Abstract : Deep neural networks (DNNs) used in computer vision have become widespread techniques commonly used in autonomous embedded systems for applications such as image/object recognition and tracking. The stringent space, weight, and power constraints seen in such systems impose a major impediment for practical and safe implementation of DNNs, because they have to be latency predictable while ensuring minimum energy consumption and maximum accuracy. Unfortunately, exploring this optimization space is very challenging because (1) smart coordination has to be performed among system- and application-level solutions, (2) layer characteristics should be taken into account, and more importantly, (3) when multiple DNNs exist, a consensus on system ... Continue reading →


A Hardware-Software Blueprint for Flexible Deep Learning Specialization

by Jinse Kwon on 2020-07-24 17:55:37

Date : 2020. 07. 28 (Tue) 10:00 Locate : EB5. 607 Title : A Hardware-Software Blueprint for Flexible Deep Learning Specialization Author : Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy (University of Washington, Shanghai Jiao Tong University) Abstract : Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, two-level ISA, and a JIT compiler. The two-level ISA is ... Continue reading →


Harmonizing Performance and Isolation in Microkernels with Efficient Intra-kernel

by Sihyeong Park on 2020-07-16 11:13:41

Date: 2020. 07. 21 (Tue) 10:00-12:00 Locate: EB5. 607 Presenter: Sihyeong Park Title: Harmonizing Performance and Isolation in Microkernels with Efficient Intra-kernel Isolation and Communication Author: Jinyu Gu, Xinyue Wu, Wentai Li, Nian Liu, Zeyu Mi, Yubin Xia, and Haibo Chen, Shanghai Jiao Tong University Abstract: This paper presents UnderBridge, a redesign of traditional microkernel OSes to harmonize the tension between messaging performance and isolation. UnderBridge moves the OS components of a microkernel between user space and kernel space at runtime while enforcing consistent isolation. It retrofits Intel Memory Protection Key for Userspace (PKU) in kernel space to achieve such isolation efficiently and design a fast IPC mechanism across those OS components. Thanks to PKU’s extremely low overhead, the inter-process communication (IPC) roundtrip cost in UnderBridge can be as low as 109 cycles. We have designed and implemented a new ... Continue reading →


Towards Efficient Model Compression via Learned Global Ranking

by Jinse Kwon on 2020-07-11 16:39:39

Date : 2020. 07. 14 (Tue) 10:00 Locate : EB5. 533 Presenter : Jinse Kwon Title : Towards Efficient Model Compression via Learned Global Ranking   Author : Ting-Wu Chin, Ruizhou Ding, Cha Zhang, Diana Marculescu   Abstract : Pruning convolutional filters has demonstrated its effectiveness in compressing ConvNets. Prior art in filter pruning requires users to specify a target model complexity (e.g., model size or FLOP count) for the resulting architecture. However, determining a target model complexity can be difficult for optimizing various embodied AI applications such as autonomous robots, drones, and user-facing applications. First, both the accuracy and the speed of ConvNets can affect the performance of the application. Second, the performance of the application can be hard to assess without evaluating ConvNets during inference. As a consequence, finding a sweet-spot between the accuracy and speed via filter pruning, which needs to ... Continue reading →


Classifying images with LeNet-5 model

by Sujin Kim on 2020-06-19 11:55:08

Date: 2020. 06. 19 (Fri) 15:00 Locate: EB5. 533 Presenter: Sujin Kim Title: Classifying images with LeNet-5 model Article source: Continue reading →


Improving Deep Learning Training Speed Using CPU/GPU Task Partitioning on Heterogeneous Devices

by Donghee Ha on 2020-06-01 15:57:38

Date: 2020. 06. 05 (Fri) 17:30 Locate: EB5. 533 Presenter: Donghee Ha Title: Improving Deep Learning Training Speed Using CPU/GPU Task Partitioning on Heterogeneous Devices Continue reading →


Optimizing CNN Model Inference on CPUs

by Guest on 2020-05-25 20:53:59

Date : 2020. 05. 29 (Fri) 15:00 Locate : EB5. 533 Presenter : Seungmin Jeon Title : Optimizing CNN Model Inference on CPUs Author : Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang, Amazon Abstract : The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gain to a large number of users. To improve the performance of CNN inference on CPUs, current approaches like MXNet and Intel OpenVINO usually treat the model as a graph and use the high-performance libraries such as Intel MKL-DNN to implement the operations of the graph. While achieving reasonable performance on individual operations from the off-the-shelf libraries, this solution makes it inflexible to conduct optimizations at the graph level, as the local operation-level optimizations are predefined. Therefore, it is restrictive and misses the opportunity to ... Continue reading →


Online personalization of cross-subjects based activity recognition models on wearable devices

by Jinyoung Choi on 2020-05-15 18:18:30

Date : 2020. 05. 22 (Fri) 15:00 Locate : EB5. 533 Presenter : Jinyoung Choi Title : Online personalization of cross-subjects based activity recognition models on wearable devices Author : Timo Sztyler, Heiner Stuckenschmidt (University of Mannheim, Germany) Abstract : Human activity recognition using wearable devices is an active area of research in pervasive computing. In our work, we address the problem of reducing the effort for training and adapting activity recognition approaches to a specific person. We focus on the problem of cross-subjects based recognition models and introduce an approach that considers physical characteristics. Further, to adapt such a model to the behavior of a new user, we present a personalization approach that relies on online and active machine learning. In this context, we use online random forest as a classifier to continuously adapt the model without keeping the already seen data available and an active ... Continue reading →


Execution Model to Reduce the Interference of Shared Memory in ARINC 653 Compliant Multicore RTOS

by Jihun Bae on 2020-05-11 17:10:34

Date: 2020. 05. 15 (Fri) 15:00 Locate: EB5. 533 and Zoom Presenter: Jihun Bae Title: Execution Model to Reduce the Interference of Shared Memory in ARINC 653 Compliant Multicore RTOS Author: Sihyeong Park, Mi-Young Kwon, Hoon-Kyu Kim, Hyungshin Kim Abstract: Multicore architecture is applied to contemporary avionics systems to deal with complex tasks. However, multicore architectures can cause interference by contention because the cores share hardware resources. This interference reduces the predictable execution time of safety-critical systems, such as avionics systems. To reduce this interference, methods of separating hardware resources or limiting capacity by core have been proposed. Existing studies have modified kernels to control hardware resources. Additionally, an execution model has been proposed that can reduce interference by adjusting the execution order of tasks without software modification. Avionics systems require several rigorous software ... Continue reading →


Improving Deep Learning Training Speed Using CPU/GPU Task Partitioning on Heterogeneous Devices

by Donghee Ha on 2020-05-04 22:11:09

Date: 2020. 01. 20 (Mon) 15:00 Locate: EB5. 533 Presenter: Donghee Ha Title: Improving Deep Learning Training Speed Using CPU/GPU Task Partitioning on Heterogeneous Devices Continue reading →


Learning both Weights and Connections for Efficient Neural Networks

by Juwon You on 2020-04-27 20:05:49

Date: 2020. 05. 01 (Fri) 15:00 Locate: EB5. 533 Presenter: Juwon You Title: Learning both Weights and Connections for Efficient Neural Networks Author: Song Han, Jeff Pool, John Tran, William J. Dally (Stanford Univ, NVIDIA, NVIDIA, Stanford Univ & NVIDIA) Abstract: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the ... Continue reading →
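The prune step of the three-step method in the abstract above can be sketched in plain Python. This is a rough illustration only: it assumes "unimportant" means a small absolute weight value, which the excerpt does not state, and the helper names `prune_by_magnitude`, `apply_mask`, and `sparsity` are hypothetical. The train and retrain steps are left out; `apply_mask` only shows how pruned connections could be kept at zero after a retraining update.

```python
def prune_by_magnitude(weights, threshold):
    """Zero every weight whose absolute value is below threshold.

    Assumption for this sketch: importance is judged by magnitude.
    Returns the pruned weights and a boolean mask of kept connections.
    """
    mask = [[abs(w) >= threshold for w in row] for row in weights]
    pruned = [
        [w if keep else 0.0 for w, keep in zip(row, mask_row)]
        for row, mask_row in zip(weights, mask)
    ]
    return pruned, mask


def apply_mask(weights, mask):
    """Re-apply the mask after a retraining update so pruned connections stay removed."""
    return [
        [w if keep else 0.0 for w, keep in zip(row, mask_row)]
        for row, mask_row in zip(weights, mask)
    ]


def sparsity(mask):
    """Fraction of connections removed by pruning."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


# A toy 2x3 weight matrix standing in for one trained layer.
weights = [[0.8, -0.05, 0.3], [0.01, -0.6, 0.02]]
pruned, mask = prune_by_magnitude(weights, threshold=0.1)
```

In this toy case half of the six weights fall below the threshold and are removed; a retraining loop would call `apply_mask` after each update so the surviving weights can compensate without the pruned ones coming back.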


ADMM-NN: An Algorithm-Hardware Co-Design Framework...

by Jinse Kwon on 2020-04-17 17:00:06

Date : 2020. 04. 24 (Fri) 15:00 Locate : EB5. 533 Presenter : Jinse Kwon Title : ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers   Author : Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, Yanzhi Wang (Northeastern University, Syracuse University, SUNY University at Buffalo, University of Southern California)   Abstract : To facilitate efficient embedded and hardware implementations of deep neural networks (DNNs), two important categories of DNN model compression techniques: weight pruning and weight quantization are investigated. The former leverages the redundancy in the number of weights, whereas the latter leverages the redundancy in bit representation of weights. However, there lacks a systematic framework of joint weight pruning and quantization of DNNs, thereby limiting the available model compression ratio. Moreover, the computation ... Continue reading →