The Indirect Convolution Algorithm

by Sujin Kim on 2020-12-01 12:50:19

Date: 2020. 12. 01 (Tue) 15:00
Location: EB2. 527
Presenter: Sujin Kim
Title: The Indirect Convolution Algorithm
Author: Marat Dukhan
Abstract: Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions provided by highly optimized BLAS libraries. Convolutions with 1x1 kernels can be directly represented as a GEMM call, but convolutions with larger kernels require a special memory layout transformation - im2col or im2row - to fit the GEMM interface. The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of the im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution algorithm does not reshuffle the data to fit the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the ...
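The indirection-buffer idea described above can be sketched in NumPy. This is a toy illustration, not the paper's implementation: the function name and shapes are my own, and row indices stand in for the raw pointers a real kernel would store.

```python
import numpy as np

def indirect_conv2d(x, w):
    """Toy indirect convolution: x is (H, W, C), w is (K, K, C, M), stride 1, no padding."""
    H, W, C = x.shape
    K, _, _, M = w.shape
    OH, OW = H - K + 1, W - K + 1
    rows = x.reshape(H * W, C)       # the input viewed as a matrix of pixel rows
    w_mat = w.reshape(K * K * C, M)  # the filters as a GEMM operand

    # Indirection buffer: for each output pixel, K*K indices of the input rows
    # it reads. A real implementation stores pointers; indices play that role.
    indir = np.empty((OH * OW, K * K), dtype=np.intp)
    for oy in range(OH):
        for ox in range(OW):
            patch = [(oy + ky) * W + (ox + kx) for ky in range(K) for kx in range(K)]
            indir[oy * OW + ox] = patch

    # "GEMM" over lazily gathered rows: no full im2col copy of the input is built.
    out = np.empty((OH * OW, M))
    for i in range(OH * OW):
        out[i] = rows[indir[i]].reshape(-1) @ w_mat
    return out.reshape(OH, OW, M)
```

The key contrast with im2col is that the indirection buffer holds O(OH * OW * K * K) small references instead of a materialized O(OH * OW * K * K * C) patch matrix, and it can be reused across inputs of the same shape.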


OSDI 2020 - Review

by Jihun Bae on 2020-11-23 09:38:45

Date: 2020. 11. 23 (Mon) 14:00
Location: EB5. 507
Presenter: Jihun Bae
Title: OSDI 2020 - Review
  - Caladan: Mitigating Interference at Microsecond Timescales
  - LinnOS: Predictability on Unpredictable Flash Storage with a Light Neural Network
Author: Joshua Fried, Mingzhe Hao
Abstract: The conventional wisdom is that CPU resources such as cores, caches, and memory bandwidth must be partitioned to achieve performance isolation between tasks. Both the widespread availability of cache partitioning in modern CPUs and the recommended practice of pinning latency-sensitive applications to dedicated cores attest to this belief. In this paper, we show that resource partitioning is neither necessary nor sufficient. Many applications experience bursty request patterns or phased behavior, drastically changing the amount and type of resources they need. Unfortunately, partitioning-based systems ...


Smartphone and Smartwatch-Based Biometrics using Activities of Daily Living

by Jinyoung Choi on 2020-11-11 15:57:59

Date: 2020. 11. 16 (Mon) 14:00
Location: EB5. 507
Presenter: Jinyoung Choi
Title: Smartphone and Smartwatch-Based Biometrics using Activities of Daily Living
Author: Gary M. Weiss, Kenichi Yoneda, and Thaier Hayajneh (Department of Computer and Information Science, Fordham University, Bronx, NY 10458 USA)
Abstract: Smartphones and smartwatches, which include powerful sensors, provide a readily available platform for implementing and deploying mobile motion-based behavioral biometrics. However, the few studies that utilize these commercial devices for motion-based biometrics are quite limited in terms of the sensors and physical activities that they evaluate. In many such studies, only the smartwatch accelerometer is utilized and only one physical activity, walking, is investigated. In this study we consider the accelerometer and gyroscope sensors on both the smartphone and smartwatch, and determine which combination of sensors ...


Extending Model Checking with Dynamic Analysis

by Jihun Bae on 2020-11-02 02:56:17

Date: 2020. 11. 02 (Mon) 14:00
Location: EB5. 527
Presenter: Jihun Bae
Title: Extending Model Checking with Dynamic Analysis
Author: Alex Groce and Rajeev Joshi
Abstract: In model-driven verification a model checker executes a program by embedding it within a test harness, thus admitting program verification without the need to translate the program, which runs as native code. Model checking techniques in which code is actually executed have recently gained popularity due to their ability to handle the full semantics of actual implementation languages and to support verification of rich properties. In this paper, we show that combination with dynamic analysis can, with relatively low overhead, considerably extend the capabilities of this style of model checking. In particular, we show how to use the CIL framework to instrument code in order to allow the SPIN model checker, when verifying C programs, to check additional properties, simulate ...


Constraint-Aware Importance Estimation for Global Filter Pruning under Multiple Resource Constraints

by Jinse Kwon on 2020-10-13 15:39:32

Date: 2020. 10. 26 (Mon) 14:00
Location: EB5. 507
Presenter: Jinse Kwon
Title: Constraint-Aware Importance Estimation for Global Filter Pruning under Multiple Resource Constraints
Author: Yu-Cheng Wu, Chih-Ting Liu, Bo-Ying Chen, Shao-Yi Chien (NTU IoX Center, National Taiwan University; Graduate Institute of Electronic Engineering, National Taiwan University)
Abstract: Filter pruning is an efficient way to structurally remove the redundant parameters in convolutional neural networks, which at the same time reduces the computation, memory storage, and transfer cost. Recent state-of-the-art methods globally estimate the importance of each filter based on its impact on the loss and iteratively remove those with smaller values until the pruned network meets some resource constraint, such as the commonly used number (or ratio) of filters left. However, when there is a more practical constraint like the total number of FLOPs, ...
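The greedy remove-until-the-budget-fits loop described in the abstract can be sketched as follows. This is a simplified illustration under my own assumptions (a flat list of per-filter scores and costs), not the paper's constraint-aware estimator.

```python
def prune_to_flops(filters, budget):
    """Globally prune filters until a FLOPs budget is met.

    filters: list of (layer_id, importance, flops) tuples, one per filter.
    Returns the kept filters and the remaining total FLOPs.
    """
    total = sum(flops for _, _, flops in filters)
    # Rank all filters globally, most important first.
    kept = sorted(filters, key=lambda t: t[1], reverse=True)
    # Repeatedly drop the least-important filter until the budget is satisfied.
    while total > budget and kept:
        _, _, flops = kept.pop()
        total -= flops
    return kept, total
```

The abstract's point is that this simple loop is well defined when the constraint is a filter count, but becomes subtle when filters in different layers contribute very different FLOPs per unit of importance, which is what a constraint-aware importance estimate must account for.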


A Coordinated Tiling and Batching Framework for Efficient GEMM on GPUs

by Juwon You on 2020-10-11 15:36:02

Date: 2020. 10. 12 (Mon) 14:00
Location: EB5. 507
Presenter: Juwon You
Title: A Coordinated Tiling and Batching Framework for Efficient GEMM on GPUs
Author: Xiuhong Li, Yun Liang, Shengen Yan, Liancheng Jia, Yinghan Li (Center for Energy-efficient Computing and Applications, School of EECS, Peking University; SenseTime Incorporation)
Abstract: General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix size is large and there are a sufficient number of tiles and enough workload for each tile. However, in many real-world applications, especially in the deep learning domain, the matrix size is small. To this end, prior work ...
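The tiling idea the abstract builds on can be made concrete with a minimal sketch: the output is partitioned into TILE x TILE blocks, each computable independently, which is how tiles map onto GPU thread blocks. This is a NumPy illustration of the partitioning only, not the paper's framework.

```python
import numpy as np

def tiled_gemm(A, B, tile=32):
    """Compute C = A @ B one TILE x TILE output block at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, tile):          # each (i, j) block is independent,
        for j in range(0, N, tile):      # like one GPU thread block
            for k in range(0, K, tile):  # accumulate partial products over K
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

The abstract's problem setting follows directly from this structure: with small matrices there are too few (i, j) blocks to occupy all of a GPU's streaming multiprocessors, which motivates batching many small GEMMs together.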


Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?

by Jaemin Kang on 2020-09-16 13:24:59

Date: 2020. 09. 21 (Mon) 14:00
Location: EB5. 533
Presenter: Jaemin Kang
Title: Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?
Author: Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Gee Hock Ong, Yeong Tat Liew, Krishnan Srivatsan, Duncan Moss, Suchit Subhaschandra, Guy Boudoukh
Abstract: Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used for accelerating DNNs. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not offer the performance of today's GPUs on DNNs. In this paper, we look at upcoming FPGA technology advances and the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming ...


Performance, Design, and Autotuning of Batched GEMM for GPUs

by Sujin Kim on 2020-09-10 17:59:44

Date: 2020. 09. 14 (Mon) 14:00-16:00
Location: EB5. 533
Presenter: Sujin Kim
Title: Performance, Design, and Autotuning of Batched GEMM for GPUs
Author: Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra (Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, Oak Ridge, USA; University of Manchester, UK)
Abstract: The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, there becomes a need for a high-performance GEMM kernel for a batch of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher-level LAPACK ...
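The batched-GEMM interface itself is simple to sketch: one call multiplies an entire batch of small matrices, which on a GPU amortizes the per-kernel launch overhead that would dominate if each tiny GEMM were issued alone. Here NumPy's batched matmul stands in for a tuned batched kernel; the function name is my own.

```python
import numpy as np

def batched_gemm(A, B):
    """A: (batch, m, k), B: (batch, k, n) -> C: (batch, m, n).

    One batched call replaces `batch` separate small GEMM calls.
    """
    assert A.shape[0] == B.shape[0] and A.shape[2] == B.shape[1]
    return np.matmul(A, B)
```

The design and autotuning questions the paper studies live below this interface: how to shape tiles and register blocking so that each small problem in the batch still runs near peak.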


ZeroQ: A Novel Zero Shot Quantization Framework

by Jinse Kwon on 2020-08-07 17:51:18

Date: 2020. 08. 11 (Tue) 16:00
Location: EB5. 607
Title: ZeroQ: A Novel Zero Shot Quantization Framework
Author: Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer (Peking University; University of California, Berkeley)
Abstract: Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we propose ZeroQ, a novel zero-shot quantization framework to address this. ZeroQ enables mixed-precision quantization without any access to the ...
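For background, the uniform-quantization primitive that frameworks like ZeroQ build on can be sketched in a few lines. This shows the generic scale/round/clip scheme only, not ZeroQ's distilled-data sensitivity analysis or its mixed-precision assignment.

```python
import numpy as np

def quantize(w, bits=8):
    """Symmetric uniform quantization of a weight tensor to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to floats for (simulated) inference."""
    return q.astype(np.float32) * scale
```

Retraining-based methods tune weights so this rounding loses little accuracy; the zero-shot setting must pick scales (and per-layer bit widths) without any training data, which is the gap ZeroQ targets.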


NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems

by Sihyeong Park on 2020-08-04 09:27:13

Date: 2020. 08. 24 (Tue) 16:00
Location: EB5. 607
Title: NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems
Authors: Soroush Bateni and Cong Liu, University of Texas at Dallas
Abstract: Deep neural networks (DNNs) used in computer vision have become widespread techniques commonly used in autonomous embedded systems for applications such as image/object recognition and tracking. The stringent space, weight, and power constraints seen in such systems impose a major impediment to the practical and safe implementation of DNNs, because they have to be latency-predictable while ensuring minimum energy consumption and maximum accuracy. Unfortunately, exploring this optimization space is very challenging because (1) smart coordination has to be performed among system- and application-level solutions, (2) layer characteristics should be taken into account, and more importantly, (3) when multiple DNNs exist, a consensus on system ...


A Hardware-Software Blueprint for Flexible Deep Learning Specialization

by Jinse Kwon on 2020-07-24 17:55:37

Date: 2020. 07. 28 (Tue) 10:00
Location: EB5. 607
Title: A Hardware-Software Blueprint for Flexible Deep Learning Specialization
Author: Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy (University of Washington, Shanghai Jiao Tong University)
Abstract: Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, a two-level ISA, and a JIT compiler. The two-level ISA is ...


Harmonizing Performance and Isolation in Microkernels with Efficient Intra-kernel Isolation and Communication

by Sihyeong Park on 2020-07-16 11:13:41

Date: 2020. 07. 21 (Tue) 10:00-12:00
Location: EB5. 607
Presenter: Sihyeong Park
Title: Harmonizing Performance and Isolation in Microkernels with Efficient Intra-kernel Isolation and Communication
Author: Jinyu Gu, Xinyue Wu, Wentai Li, Nian Liu, Zeyu Mi, Yubin Xia, and Haibo Chen, Shanghai Jiao Tong University
Abstract: This paper presents UnderBridge, a redesign of traditional microkernel OSes to harmonize the tension between messaging performance and isolation. UnderBridge moves the OS components of a microkernel between user space and kernel space at runtime while enforcing consistent isolation. It retrofits Intel Memory Protection Keys for Userspace (PKU) in kernel space to achieve such isolation efficiently and designs a fast IPC mechanism across those OS components. Thanks to PKU's extremely low overhead, the inter-process communication (IPC) roundtrip cost in UnderBridge can be as low as 109 cycles. We have designed and implemented a new ...


Towards Efficient Model Compression via Learned Global Ranking

by Jinse Kwon on 2020-07-11 16:39:39

Date: 2020. 07. 14 (Tue) 10:00
Location: EB5. 533
Presenter: Jinse Kwon
Title: Towards Efficient Model Compression via Learned Global Ranking
Author: Ting-Wu Chin, Ruizhou Ding, Cha Zhang, Diana Marculescu
Abstract: Pruning convolutional filters has demonstrated its effectiveness in compressing ConvNets. Prior art in filter pruning requires users to specify a target model complexity (e.g., model size or FLOP count) for the resulting architecture. However, determining a target model complexity can be difficult for optimizing various embodied AI applications such as autonomous robots, drones, and user-facing applications. First, both the accuracy and the speed of ConvNets can affect the performance of the application. Second, the performance of the application can be hard to assess without evaluating ConvNets during inference. As a consequence, finding a sweet spot between accuracy and speed via filter pruning, which needs to ...


Classifying images with the LeNet-5 model

by Sujin Kim on 2020-06-19 11:55:08

Date: 2020. 06. 19 (Fri) 15:00
Location: EB5. 533
Presenter: Sujin Kim
Title: Classifying images with the LeNet-5 model
Article source:
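As background for the talk topic, the classic LeNet-5 layer shapes can be written down directly, assuming the common 32x32 grayscale-input variant of the model; the parameter counts below follow from the shapes alone.

```python
# LeNet-5 layer shapes and per-layer parameter counts (weights + biases),
# assuming the standard 32x32x1 input and 10 output classes.
layers = [
    ("conv1", 5 * 5 * 1 * 6 + 6),    # 5x5 conv, 1 -> 6 channels
    ("conv2", 5 * 5 * 6 * 16 + 16),  # 5x5 conv, 6 -> 16 channels
    ("fc1",   400 * 120 + 120),      # 16 * 5 * 5 = 400 flattened inputs
    ("fc2",   120 * 84 + 84),
    ("fc3",   84 * 10 + 10),         # 10-way classification head
]

total = sum(params for _, params in layers)  # 61,706 trainable parameters
```

The pooling layers between the convolutions are omitted here because (in the max/average-pooling form) they add no trainable parameters.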


Improving Deep Learning Training Speed via CPU/GPU Workload Partitioning on Heterogeneous Devices

by Donghee Ha on 2020-06-01 15:57:38

Date: 2020. 06. 05 (Fri) 17:30
Location: EB5. 533
Presenter: Donghee Ha
Title: Improving Deep Learning Training Speed via CPU/GPU Workload Partitioning on Heterogeneous Devices