Jeff Dean (Google Senior Fellow) – Systems and Machine Learning Symbiosis (Mar 2018)


Chapters

00:00:17 Machine Learning Systems for Training and Inference
00:10:43 Cloud Tensor Processing Unit Enables Faster Training of Machine Learning Models
00:14:21 Exploring Machine Learning's Impact on Computer Systems and Hardware
00:21:32 Machine Learning and Computer Systems
00:32:13 Optimizing Machine Learning Training with Large Batch Sizes
00:34:14 Hardware and Software Co-evolutions in Machine Learning

Abstract

The Evolution of Computing: Deep Learning, Google’s TPUs, and the Future of Machine Learning Hardware

The relentless advancement of deep learning has ushered in a new era in computing, marked by the insatiable demand for more powerful computational systems. This evolution has been significantly shaped by Google’s development of custom machine learning hardware, particularly the Tensor Processing Unit (TPU), designed for efficient deep learning model training and inference. These developments highlight a future where machine learning is not just a tool for software, but an integral component of hardware design, promising substantial improvements in computational speed, efficiency, and the ability of systems to adapt to changing algorithms and models.

The Demand for Enhanced Computational Power

Deep learning models require substantial computing power, a demand that has outpaced traditional single-thread performance improvements. This gap has spurred the need for hardware optimized for deep learning’s key characteristics: tolerance for reduced precision and reliance on a small set of dense linear algebra operations. The slowdown of Moore’s Law in recent years has coincided with deep learning’s rapidly growing appetite for computation, and training and inference for these models can demand even more computational power than traditional HPC applications. As a result, companies like Google have built specialized hardware such as TPUs, focusing initially on inference and later expanding to model training. TPUs stand out for their ability to perform linear algebra operations efficiently, backed by a high-bandwidth memory system and the ability to interconnect many chips into powerful pods.
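
To make the reduced-precision point concrete, here is a minimal NumPy sketch (not from the talk) comparing a 16-bit matrix multiply against the 32-bit reference; float16 stands in for the TPU’s bfloat16 format, which NumPy lacks:

    import numpy as np

    # Deep learning tolerates reduced precision: a 16-bit matrix multiply
    # stays close to the 32-bit result while halving memory traffic.
    # (float16 is a portable stand-in here for the TPU's bfloat16.)
    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256)).astype(np.float32)
    b = rng.standard_normal((256, 256)).astype(np.float32)

    full = a @ b
    reduced = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

    rel_err = np.abs(full - reduced).max() / np.abs(full).max()
    print(f"max relative error at 16-bit precision: {rel_err:.4f}")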

Hardware Acceleration for Machine Learning

Machine learning hardware customization offers exciting prospects for optimizing the performance of large-scale machine learning models. Specialization and focus on specific operations can unleash creativity in computer architecture, beyond traditional Moore’s Law improvements.

The Role of TPUs in Accelerating Deep Learning

Google’s TPUs, especially the second generation geared toward training, are pivotal in accelerating deep learning. These specialized chips, with their matrix multiply units and high-bandwidth memory, cater specifically to the heavy computational requirements of training large models on extensive datasets. Their impact is measurable in drastically reduced training times: for instance, training a search ranking model dropped from 132 hours to 9 hours.

Applying Machine Learning to Data Structures and Algorithms

Research is exploring the use of machine learning to replace traditional algorithms and data structures. Learned models can approximate the behavior of B-trees, hash functions, Bloom filters, and more, with potential for improved performance and space utilization.
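
A toy sketch of the learned-index idea (the linear model and error-bound search below are illustrative assumptions, not a production design): a model predicts roughly where a key sits in a sorted array, and a small local search within the model’s worst-case error keeps lookups exact.

    import bisect
    import numpy as np

    # A linear model predicts where a key sits in a sorted array; we
    # record its worst-case error on the data and finish each lookup
    # with a bounded local search, so results stay exact.
    keys = np.sort(np.random.default_rng(1).uniform(0, 1e6, 100_000))
    positions = np.arange(len(keys))

    slope, intercept = np.polyfit(keys, positions, deg=1)  # fit pos ~ a*key + b
    predicted = np.clip(slope * keys + intercept, 0, len(keys) - 1).astype(int)
    max_err = int(np.abs(predicted - positions).max())

    def lookup(key):
        guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
        lo = max(0, guess - max_err)
        hi = min(len(keys), guess + max_err + 1)
        i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
        return i if i < len(keys) and keys[i] == key else None

    print(lookup(keys[42_000]))  # -> 42000, found via the model's guess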

High-Level Programming and Accessibility

To complement the hardware advancements, Google has also developed high-level programming abstractions in TensorFlow, allowing the same programs to run on CPUs, GPUs, and TPUs with minimal changes. This development not only simplifies the programming process but also enhances scalability and accessibility. TPUs have been made available in Google Cloud, with beta access to Cloud TPU accelerators, democratizing access to this powerful technology.
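
As a hedged illustration of that portability (using today’s tf.distribute API rather than the 2018-era one; the no-argument TPUClusterResolver form is an assumption that depends on the environment), the same model definition can run on CPU, GPU, or TPU with only the strategy selection changing:

    import tensorflow as tf

    # The same model definition runs on CPU, GPU, or TPU; only the
    # distribution strategy changes based on what hardware is found.
    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
    except (ValueError, tf.errors.NotFoundError):
        strategy = tf.distribute.get_strategy()  # default: CPU or one GPU

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="sgd",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    # model.fit(...) now runs unchanged on whichever device was found.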

Machine Learning for Heuristic Optimization

Many areas of computer systems, including compilers, networking, and operating systems, rely on heuristics for decision-making. Machine learning can be employed to learn and adapt these heuristics, resulting in improved performance and adaptability to specific usage patterns.
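
One minimal way to picture a learned heuristic (an illustrative sketch, not a technique from the talk) is an epsilon-greedy bandit that tries several candidate policies and gravitates toward whichever measures best on the live workload:

    import random

    # Replace a fixed heuristic with a learner that tries candidate
    # policies and favors the one with the best observed reward.
    class HeuristicLearner:
        def __init__(self, heuristics, epsilon=0.1):
            self.heuristics = heuristics
            self.epsilon = epsilon
            self.counts = [0] * len(heuristics)
            self.rewards = [0.0] * len(heuristics)

        def choose(self):
            if random.random() < self.epsilon:      # explore occasionally
                return random.randrange(len(self.heuristics))
            means = [r / c if c else float("inf")   # untried -> try first
                     for r, c in zip(self.rewards, self.counts)]
            return max(range(len(means)), key=means.__getitem__)

        def update(self, i, reward):
            self.counts[i] += 1
            self.rewards[i] += reward

    # e.g. candidate cache-eviction or scheduling policies
    learner = HeuristicLearner(["lru", "fifo", "random"])
    i = learner.choose()
    # ... run the chosen heuristic, measure its performance ...
    learner.update(i, reward=1.0)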

The Challenges and Future of Machine Learning Hardware

Despite these advancements, the rapidly evolving field of machine learning presents significant challenges in hardware design. The unpredictable nature of future algorithmic requirements and the long development timelines of Application-Specific Integrated Circuits (ASICs) like TPUs create a dynamic environment where hardware must be both flexible and future-proof. To address this, research is focusing on new algorithmic ideas and hardware architectures that can handle low-precision training and inference, explore sparsity and embeddings, and increase batch sizes while maintaining accuracy.
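
On the large-batch point, a common recipe is the linear learning-rate scaling rule with gradual warmup (from Goyal et al.’s large-minibatch SGD work; the baseline batch size and learning rate below are illustrative assumptions):

    # Scale the learning rate with batch size, warming up gradually so
    # early updates with the large rate do not diverge.
    base_batch, base_lr = 256, 0.1

    def scaled_lr(batch_size, step, warmup_steps=500):
        lr = base_lr * batch_size / base_batch          # linear scaling rule
        warmup = min(1.0, (step + 1) / warmup_steps)    # gradual warmup
        return lr * warmup

    print(scaled_lr(8192, step=499))  # 3.2: 32x batch -> 32x learning rate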

Opportunities for Machine Learning in Compiler Optimization

Compiler optimizations such as instruction scheduling, register allocation, and loop nest parallelization have traditionally relied on hand-tuned heuristics; machine learning techniques can instead learn these decisions from measured performance.
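
As a toy illustration of the compiler case (the features and weights below are invented for the example, not learned from real measurements), a cost model trained offline on measured runtimes could score candidate loop unroll factors in place of a hand-written rule:

    # A learned cost model scores candidate unroll factors; the compiler
    # picks the cheapest instead of applying a fixed heuristic.
    def features(unroll, trip_count, body_insts):
        return [unroll, trip_count / unroll, body_insts * unroll]

    # Pretend these weights were fit offline from measured runtimes.
    weights = [0.5, 1.0, 0.02]

    def predicted_cost(feats):
        return sum(w * f for w, f in zip(weights, feats))

    candidates = [1, 2, 4, 8]
    best = min(candidates,
               key=lambda u: predicted_cost(features(u, trip_count=1024,
                                                     body_insts=12)))
    print("chosen unroll factor:", best)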

Applying Machine Learning to Computer Systems

An emerging trend is the application of machine learning to improve the performance of computer systems themselves. This includes using machine learning for resource management, scheduling, and security in operating systems and compilers. Machine learning is also being explored as a replacement for traditional algorithms and data structures, promising faster performance and better resource utilization.

Learning in Networking and Operating Systems

Machine learning can be applied to network decisions like backoff strategies and window size adjustments, as well as operating system tasks such as buffer cache management and file system prefetching.
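
For instance, a first-order Markov model over the observed access trace (a simplified sketch, assuming block-level accesses; not a design from the talk) can predict the next block to prefetch where fixed sequential readahead would miss a repeating non-sequential pattern:

    from collections import Counter, defaultdict

    # Learn block-to-block transitions from the access trace and
    # prefetch the most likely successor instead of always block+1.
    transitions = defaultdict(Counter)

    def observe(prev_block, next_block):
        transitions[prev_block][next_block] += 1

    def predict_next(block):
        seen = transitions[block]
        return seen.most_common(1)[0][0] if seen else block + 1  # readahead fallback

    trace = [1, 2, 7, 1, 2, 7, 1, 2, 9]
    for a, b in zip(trace, trace[1:]):
        observe(a, b)

    print(predict_next(2))  # -> 7: a learned, non-sequential pattern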

The Interplay of Hardware and Algorithms

The future of machine learning hardware is intricately tied to its algorithms. As hardware designs evolve, they must provide the primitives needed to compute both current and plausible future algorithms efficiently, a co-evolution that demands a balance between flexibility for experimentation and the efficiency of specialized designs. The development of automated placement techniques and high-level programming abstractions is equally critical: it relieves programmers of low-level optimization concerns, raising the level of abstraction so they can focus on model design and implementation.

Conclusion

The integration of machine learning into hardware design signifies a transformative phase in computing. As this field continues to evolve, it holds the promise of creating systems that are not only faster and more efficient but also capable of adapting to the ever-changing landscape of machine learning algorithms and models. This synergy between hardware and software is poised to unlock new frontiers in computational capability and efficiency, shaping the future of technology.


Notes by: MatrixKarma