Jeff Dean (Google Senior Fellow) – Exciting Directions for ML Models and the Implications for Computing Hardware (Sep 2023)
Chapters
00:03:24 Implications of Machine Learning Model Trends for Computer Hardware
Exciting Directions for ML Models: ML models have revolutionized computer capabilities, enabling tasks like image/speech recognition, language understanding, and text-to-image generation. Recently, generative models have emerged, allowing users to input text or descriptions and generate realistic images or code. Interactive conversational systems like Bard enable back-and-forth discussions and can generate Python code and explanations.
Fine-tuning General Models: Fine-tuning general models for specific tasks has shown promising results. Med-PaLM achieved a 67% score on a medical board exam after fine-tuning, and the follow-up Med-PaLM 2 scored 86.5%.
Multimodal Models: Multimodal models can process various input modalities (images, text, speech, code) and generate different kinds of outputs. PaLI, for example, can generate alt text describing images, while related models can create images from textual descriptions.
Trends and Implications for Computer Architects: The talk focuses on trends in ML models and their implications for computer architects. The goal is to design ML hardware and deploy it quickly to keep up with the rapidly evolving field of ML. The discussion explores ways to deliver significant increases in compute capacity and efficiency to advance ML further.
00:09:36 Machine Learning Models: Sparsity, Adaptivity, and Dynamic Neural Networks
Sparse Computation: Sparse models have different pathways that are adaptively activated as needed, making them both more efficient and more accurate. Which pieces of the overall model to use is learned during training, so different parts specialize for different inputs. Because only a small percentage of a very large model is touched per input, sparse models combine improved responsiveness with higher accuracy. Modern hardware supports various granularities of sparsity, from coarse-grained (large modules activated or not) to fine-grained (sparsity within a single vector or tensor).
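To make the routing idea concrete, here is a minimal Python/NumPy sketch of top-k expert gating of the kind used in mixture-of-experts models; the function names, shapes, and toy sizes are illustrative assumptions, not the implementation described in the talk.

```python
import numpy as np

def top_k_gating(x, gate_weights, k=2):
    """Route each token to its top-k experts via a learned softmax gate.

    x:            [num_tokens, d_model] token activations
    gate_weights: [d_model, num_experts] learned gating matrix
    Returns expert indices and renormalized routing weights per token.
    """
    logits = x @ gate_weights                        # [num_tokens, num_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over experts
    top_idx = np.argsort(-probs, axis=-1)[:, :k]     # top-k experts per token
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    top_p /= top_p.sum(axis=-1, keepdims=True)       # renormalize over the chosen k
    return top_idx, top_p

# Toy usage: 8 tokens, 16-dim activations, 4 experts, top-2 routing.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
gate_w = rng.normal(size=(16, 4))
experts, weights = top_k_gating(x, gate_w, k=2)
```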
Computational Balance in Sparse Models: Traditional sparse models use the same size and structure for each expert, with computational balance achieved through equal computation and flow of examples to each expert. All-to-all shuffle performance across accelerators is crucial for all sparse models to quickly route data between different parts of the models.
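For the balance requirement, a common approach in the mixture-of-experts literature is an auxiliary load-balancing loss that nudges the router toward sending equal traffic to every expert; the sketch below is illustrative of that general technique, not a mechanism detailed in the talk.

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignments, num_experts):
    """Auxiliary loss that pushes the router toward equal traffic per expert.

    router_probs:       [num_tokens, num_experts] softmax routing probabilities
    expert_assignments: [num_tokens] index of the expert each token was sent to
    The loss is smallest when both the dispatched fraction and the mean router
    probability are uniform (1/num_experts) across experts.
    """
    # Fraction of tokens actually dispatched to each expert.
    counts = np.bincount(expert_assignments, minlength=num_experts)
    dispatch_frac = counts / len(expert_assignments)
    # Mean router probability assigned to each expert.
    prob_frac = router_probs.mean(axis=0)
    # Scaled dot product, as in common mixture-of-experts formulations.
    return num_experts * float(np.dot(dispatch_frac, prob_frac))
```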
Dynamic Computational Costs: Instead of fixed computational costs, varying the computational cost of different pieces of the model can improve efficiency. Smaller experts with less computation can handle simpler examples, while larger experts with more computation can handle difficult situations. Experts may have different computational structures, from simple one-layer models to complex multi-layered structures.
Mapping Dynamic Computational Costs onto Hardware: Mapping larger and computationally expensive experts onto more chips can optimize hardware resource allocation. For example, a complex expert might be allocated to 16 chips, while a simpler expert might be allocated to only one chip.
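A hypothetical helper like the following could translate per-expert compute estimates into chip counts; the proportional-allocation rule, names, and numbers are assumptions made purely for illustration.

```python
def allocate_chips(expert_flops, total_chips):
    """Assign chips to experts roughly in proportion to their compute cost.

    expert_flops: estimated FLOPs per invocation for each expert
    total_chips:  accelerator chips available for the expert layer
    Assumes total_chips >= number of experts; each expert gets at least one
    chip, and leftover chips go to the largest fractional remainders.
    """
    total = sum(expert_flops)
    raw = [f / total * total_chips for f in expert_flops]
    alloc = [max(1, int(r)) for r in raw]
    leftovers = total_chips - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:max(0, leftovers)]:
        alloc[i] += 1
    return alloc

# Toy usage: one heavyweight expert alongside three lighter ones, 20 chips.
print(allocate_chips([16e12, 1e12, 1e12, 2e12], 20))  # -> [16, 1, 1, 2]
```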
00:13:35 Trends and Considerations for Computer Architects and System Builders in Machine Learning
Key Takeaways for Computer Architects and System Builders: Connectivity of accelerators, bandwidth, and latency are crucial for training large models using many chips. Scale matters for both training and inference. Sparse models strain memory capacity and demand efficient routing. ML software must facilitate expressing complex models beyond regular dense computations. Power, sustainability, and reliability are key considerations.
Trends in Machine Learning Models: Moving away from separate models for different tasks towards single models that generalize across many tasks. Transitioning from dense models to more efficient sparse models. Expanding beyond single modality models to models that can handle multiple modalities as inputs and outputs.
CO2 Emissions in Machine Learning Training: Misinformation about the CO2 emissions of machine learning training is prevalent. Measured data shows that emissions from training a single NLP model, using efficient hardware and a more efficient Transformer variant discovered through architecture search, are significantly lower than earlier published estimates.
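The standard way such emissions are estimated multiplies the energy drawn by the accelerators by the data center's power usage effectiveness (PUE) and the grid's carbon intensity; the sketch below uses made-up numbers purely to show the arithmetic, not figures from the talk.

```python
def training_co2e_kg(num_chips, avg_chip_power_w, train_hours,
                     pue=1.1, grid_kg_co2e_per_kwh=0.08):
    """Estimate training emissions: energy drawn * data-center overhead * grid intensity.

    All default values here are illustrative placeholders.
    """
    energy_kwh = num_chips * avg_chip_power_w / 1000 * train_hours
    return energy_kwh * pue * grid_kg_co2e_per_kwh

# e.g. 512 chips averaging 300 W for 100 hours in a low-carbon region.
print(f"{training_co2e_kg(512, 300, 100):.0f} kg CO2e")
```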
System Design Considerations for Gen-AI: Need to change how we think about system design end-to-end to meet the challenges and opportunities of Gen-AI. Focus on system goodput, considering the entire system's efficiency and effectiveness, not just individual components. Design systems that can adapt to changing model architectures and resource requirements dynamically. Consider power, sustainability, and reliability as key design constraints.
00:20:52 Redefining Computing Infrastructure for Machine Learning
Key Challenges in Computing Infrastructure for Machine Learning: Rapidly increasing demands on computing power, reliability, and carbon dioxide equivalent (CO2e) emissions require new design approaches.
Exponential Growth in Model Parameters: The number of parameters in dense models has grown roughly 10x per year over the past several years, leading to a surge in computation costs.
Conventional Wisdom Thrown Out the Window: Traditional assumptions about general-purpose compute being sufficient for all computations no longer hold true.
TPUs and Specialized Hardware Innovations: TPUs and other specialized hardware architectures have emerged to meet the unique demands of machine learning.
TPU Innovations: High-bandwidth synchronous interconnect for direct connectivity between computing units. Liquid cooling for improved power efficiency. Specialized data representations for optimized performance. Optical circuit switching for reliable large-scale computer connectivity. Hardware for efficient scatter-gather operations and dense matrix multiplications. High-bandwidth memory stacked on top of compute units for low latency and high bandwidth.
Overall Impact: Specialized hardware architectures like TPUs have enabled 10x to 100x improvements in system efficiency, considering performance, power, and cost.
Conclusion: Machine learning has revolutionized computing infrastructure, leading to radically different systems compared to traditional general-purpose compute.
00:24:34 Accelerated Computing: Overcoming Challenges and Achieving the Next 100x
TPU Architecture Iteration at Google: Google has consistently improved its TPU architecture over the past eight years. TPUv1 was designed for internal inference, while TPUv2 introduced a high-bandwidth interconnect for communication among hundreds of chips. TPUv3 and the recently announced TPUv4 consist of thousands of TPUs operating synchronously on large-scale problems. Google Cloud Next may reveal more about the TPU roadmap.
Accelerated Computing Progress: Accelerated computing has made remarkable strides in the past decade, surpassing general-purpose compute by a factor of 100. This acceleration has enabled breakthrough advancements in AI, capturing widespread attention. The computational power and data processing capabilities achieved today were unimaginable a decade ago.
Challenges and Future Needs: Current progress in accelerated computing is insufficient to meet future demands. The rapid increase in parameters per dense model requires computational improvements beyond what accelerated computing alone has delivered. To sustain AI advancements, another 100x improvement over today's baseline is necessary.
Focus on System Efficiency: This presentation focuses on optimizing system efficiency in terms of power, reliability, and carbon dioxide equivalents. Traditional metrics like chip performance are inadequate for evaluating overall system performance.
00:27:47 Designing Computing Systems for Efficiency and Sustainability
Headlines vs. Real Performance: Headline numbers like SPECint or maximum FLOPS don't reflect the actual performance of computations running synchronously across hundreds of thousands of chips. The MLPerf benchmark focuses on absolute performance at a given system size, ignoring system costs, carbon emissions, efficiency, and power consumption.
Perf/TCO Metric: Google’s Perf/TCO metric evaluates system goodness by considering both benchmark performance (numerator) and total cost of ownership (TCO) (denominator). TCO includes capital expenditures (chip, board, enclosure, memory, accelerator, network, optics, racks, data center, space, and power) and operating expenditures (data center provisioning costs and electricity costs).
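A simplified Perf/TCO calculation might look like the following; the cost breakdown, parameter names, and figures are illustrative assumptions, not Google's actual accounting.

```python
def perf_per_tco(benchmark_perf, capex, lifetime_years,
                 annual_power_kwh, electricity_cost_per_kwh,
                 annual_dc_provisioning_cost):
    """Perf/TCO: benchmark performance divided by total cost of ownership.

    capex (chips, boards, memory, network, optics, racks, data center) is
    amortized over the system lifetime; opex covers electricity and
    data-center provisioning.
    """
    annual_capex = capex / lifetime_years
    annual_opex = (annual_power_kwh * electricity_cost_per_kwh
                   + annual_dc_provisioning_cost)
    return benchmark_perf / (annual_capex + annual_opex)

# Illustrative: 1e6 benchmark "units"/yr, $2M capex over 4 years,
# 500 MWh/yr at $0.06/kWh, $100k/yr provisioning.
print(perf_per_tco(1e6, 2_000_000, 4, 500_000, 0.06, 100_000))
```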
Hidden Assumptions and Room for Improvement: Perf/TCO assumes sufficient data center capacity, fixed data center capacity costs, accurate attribution of consumed power to individual workloads, and accurate performance representation of present and future workloads. Google is evolving its considerations to address these assumptions and improve the metric.
00:33:35 Optimizing Machine Learning Performance and Energy Efficiency
Introduction: The traditional metric of performance per chip is no longer sufficient for evaluating computing systems. Factors such as reliability, power consumption, and carbon dioxide emissions are becoming increasingly important.
System Performance per Average Watt: The metric of performance per average watt encourages the efficient utilization of available power capacity. It is important to consider the entire system, not just the peak performance of individual components.
System Performance per Carbon Dioxide Emissions: The metric of system performance per carbon dioxide emissions accounts for the environmental impact of building and operating data centers. Carbon dioxide emissions should be minimized throughout the lifecycle of the infrastructure.
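Taken together, the two metrics can be written as simple ratios; the sketch below is a naive rendering with hypothetical inputs, not a standardized methodology.

```python
def system_efficiency_metrics(throughput, avg_system_power_w,
                              operational_kg_co2e, embodied_kg_co2e):
    """Two system-level metrics discussed in the talk, computed naively.

    perf_per_avg_watt: throughput divided by *average* (not peak) system power,
                       rewarding full use of provisioned power capacity.
    perf_per_co2e:     throughput divided by lifecycle emissions, including the
                       embodied carbon of building the hardware and data center.
    """
    return {
        "perf_per_avg_watt": throughput / avg_system_power_w,
        "perf_per_co2e": throughput / (operational_kg_co2e + embodied_kg_co2e),
    }
```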
Optimizing Power Consumption: It is possible to improve throughput and reduce power consumption by understanding the power characteristics of jobs and making adjustments accordingly. System-level scheduling optimizations can also lead to significant efficiency improvements.
Cell-Wide Control Plane: A cell-wide control plane can be used to manage power consumption and optimize scheduling. High-power jobs can be spread across multiple busbars to avoid overloading any single busbar.
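As a toy illustration of busbar-aware placement (not the production control plane), a greedy scheduler can assign the most power-hungry jobs first to the least-loaded busbar; all names and limits below are assumptions.

```python
def place_jobs_on_busbars(job_power_w, num_busbars, busbar_limit_w):
    """Greedy placement: put each job (highest power first) on the least-loaded busbar.

    Returns a per-busbar list of job indices, or raises if a job cannot fit
    without exceeding a busbar's power limit.
    """
    loads = [0.0] * num_busbars
    placement = [[] for _ in range(num_busbars)]
    for job in sorted(range(len(job_power_w)), key=lambda j: -job_power_w[j]):
        target = min(range(num_busbars), key=lambda b: loads[b])
        if loads[target] + job_power_w[job] > busbar_limit_w:
            raise RuntimeError("job does not fit under the busbar power limit")
        loads[target] += job_power_w[job]
        placement[target].append(job)
    return placement

# Toy usage: five jobs spread across three busbars capped at 50 kW each.
print(place_jobs_on_busbars([40_000, 35_000, 10_000, 8_000, 5_000], 3, 50_000))
```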
Reliability and Silent Data Corruption: Reliability is becoming increasingly important as machine learning workloads run synchronously across thousands of compute nodes. Silent data corruption is a growing challenge that affects both CPUs and accelerated compute chips.
00:44:25 Challenges in Scaling Stochastic Gradient Descent
Challenges of Silent Data Corruption in Large-Scale Computing: Amin Vahdat highlights the issue of certain compute elements producing incorrect results non-deterministically, leading to silent data corruption (SDC). This becomes particularly problematic in synchronous stochastic gradient descent across thousands of elements, as a single erroneous result can spread and corrupt the entire computation.
Monitoring Gradient Norm to Detect SDC: Vahdat explains that they have implemented monitoring to capture the gradient norm, which can indicate SDC. However, similar spikes in the norm can occur during normal operation, making it challenging to differentiate between normal behavior and SDC.
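One simple way to implement such monitoring is to track a running mean and variance of the global gradient norm and flag large deviations; this sketch is an assumption about how it could be done, not the exact mechanism Vahdat described.

```python
import math

class GradNormMonitor:
    """Flag training steps whose global gradient norm spikes far above recent history.

    Keeps an exponential moving average and variance of the norm; a step is
    suspicious when it exceeds mean + threshold * std. Spikes also occur in
    normal training, so a flag is a trigger for further checks, not proof of SDC.
    """
    def __init__(self, decay=0.99, threshold=6.0):
        self.decay, self.threshold = decay, threshold
        self.mean, self.var, self.steps = 0.0, 0.0, 0

    def update(self, grad_norm):
        self.steps += 1
        if self.steps == 1:
            self.mean = grad_norm
            return False
        delta = grad_norm - self.mean
        self.mean += (1 - self.decay) * delta
        self.var = self.decay * self.var + (1 - self.decay) * delta * delta
        return grad_norm > self.mean + self.threshold * math.sqrt(self.var + 1e-12)
```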
Rapid Checkpointing and Restart Mechanisms: To address the challenges of SDC, the infrastructure is designed to support rapid checkpointing of the computation. This allows them to restart the computation from a known good state if SDC is suspected or a system failure occurs.
Validating SDC and Restarting from Good Checkpoints: Sophisticated mechanisms are in place to validate whether SDC has occurred by rerunning the computation on hot spares. If SDC is confirmed, the affected elements are removed from the computation using optical circuit switching, and the computation is restarted from the last known good checkpoint.
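The overall control flow might resemble the sketch below, where every callback (run_step, save_checkpoint, restore_checkpoint, and the cluster methods) is a hypothetical placeholder standing in for real training and infrastructure code.

```python
def train_with_sdc_recovery(state, steps, monitor, cluster,
                            run_step, save_checkpoint, restore_checkpoint):
    """Sketch of SDC-tolerant training control flow; all callbacks are hypothetical.

    run_step(state, devices)           -> (new_state, grad_norm)
    save_checkpoint(state)             -> checkpoint handle
    restore_checkpoint(handle)         -> state
    cluster.recheck_on_spares(step_fn) -> True if the suspect result reproduces
    cluster.evict_suspect_devices()    -> reroute around bad chips (e.g. via OCS)
    """
    last_good = save_checkpoint(state)
    for step in range(steps):
        state, grad_norm = run_step(state, cluster.active_devices)
        if monitor.update(grad_norm):
            # Suspicious spike: rerun the step on hot spares to validate.
            if not cluster.recheck_on_spares(run_step):
                cluster.evict_suspect_devices()        # drop silently-failing chips
                state = restore_checkpoint(last_good)  # roll back to known-good state
                continue
        if step % 100 == 0:
            last_good = save_checkpoint(state)         # frequent, rapid checkpoints
    return state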
Performance Trade-Offs and Future Directions: Vahdat emphasizes that these measures to mitigate SDC can lead to lost goodput and reduced headline chip performance. The community should consider trade-offs between lower headline performance and more reliable detection and handling of silent data corruption.
00:47:54 Accelerating Chip Design with Machine Learning
Design Challenges in Hardware Development: Machine learning (ML) can be used to design specialized hardware faster and more efficiently. Current industry best practice for developing a chip accelerator takes about three years. The chip design phase is time-consuming, and ML can be applied to accelerate it.
Architectural Exploration and Synthesis: ML can be used to automate architectural exploration and synthesis of high-level to low-level designs. ML can help explore design space choices, such as cache size, memory bandwidth, and DRAM channels. It can also consider compiler transformations to optimize hardware utilization.
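A minimal sketch of such exploration, assuming a stand-in evaluate function (e.g., a simulator or learned cost model) and an illustrative parameter space; the parameter names and values are invented for the example.

```python
import random

def explore_design_space(evaluate, trials=200, seed=0):
    """Random search over a few accelerator design parameters.

    `evaluate` maps a design point to estimated performance; more sophisticated
    searches (Bayesian optimization, learned cost models) follow the same shape.
    """
    rng = random.Random(seed)
    space = {
        "l2_cache_mib":   [8, 16, 32, 64],
        "mem_bw_gbps":    [400, 800, 1600, 3200],
        "dram_channels":  [2, 4, 8],
        "systolic_array": [(128, 128), (256, 256)],
    }
    best, best_perf = None, float("-inf")
    for _ in range(trials):
        design = {k: rng.choice(v) for k, v in space.items()}
        perf = evaluate(design)
        if perf > best_perf:
            best, best_perf = design, perf
    return best, best_perf

# Toy usage with a fake cost model that rewards bandwidth and cache size.
best = explore_design_space(lambda d: d["mem_bw_gbps"] + 10 * d["l2_cache_mib"])
print(best)
```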
Performance Gains with Customized Hardware: Customizing hardware design parameters and compilers for specific models can yield significant performance improvements. A hypothetical TPUv3-like system with optimized compiler settings showed performance gains over the baseline. Further customization for a mix of workloads can lead to even higher performance.
Verification and Placement: ML can be used to speed up verification by generating test coverage with a small set of human-authored tests. Reinforcement learning can be applied to quickly generate high-quality placement and routing decisions. This method has been shown to find better solutions than human experts in less time.
Conclusion: ML has the potential to revolutionize hardware design by accelerating the development process and enabling the creation of more specialized and efficient hardware for various applications.
00:55:11 Innovation in Chip Design and Deployment for Machine Learning
Opportunities for ML in Chip Design and Manufacturing: ML capabilities are rapidly advancing, enabling fundamental changes in the computing community. ML models are becoming increasingly dynamic, with evolving structures. Focus should be on system goodput rather than chip headline performance. Power, CO2e efficiency, and SDCs are crucial metrics to measure and improve.
Shorter Timelines for Chip Design and Deployment: Shorter timelines are essential to adapt quickly to the changing ML landscape. ML automation can help streamline the design process.
Challenges in ML Automation for EDA: Challenges include data availability, physics-aware modeling, and human expert involvement. Separately, liquid cooling offers potential benefits in performance, TCO, and reliability, but it adds complexity to system deployment, requiring specialized teams and infrastructure.
Abstract
Revolutionizing Computing: The Future of Machine Learning and System Design
Machine learning (ML) is transforming the landscape of computing, necessitating radical changes in hardware design and system architecture. This article delves into the latest trends and challenges in ML, such as the shift towards more dynamic, sparse, and efficient models, the implications for computer architects, and the urgent need for scalable, sustainable systems. It highlights the evolving metrics for machine learning workloads, including power efficiency, carbon dioxide emissions, and reliability. The role of accelerated chip design and system optimization in adapting to these rapid advancements is also explored, emphasizing the importance of a holistic approach to system design that prioritizes throughput, sustainability, and adaptability.
—
Exciting Directions for ML Models:
Machine learning models have led to significant advancements in computers’ capabilities in areas like image classification, speech recognition, and language understanding. Recent developments have enabled the generation of images, text, and speech from textual descriptions, which broadens their application spectrum. The emergence of conversational systems that can generate interactive discussions and Python code is a testament to their educational utility. There’s a promising trend in fine-tuning general models for specific tasks, such as medical exams, indicating a surge in domain-specific applications. The rise of multimodal models capable of handling diverse inputs and outputs is another notable advancement.
Implications for Computer Architects and ML Hardware Design:
There’s a growing focus on delivering significant increases in compute capacity and efficiency to keep pace with ML advancements. This necessitates the design of ML hardware that can accommodate the rapidly evolving models. The fast-paced development in ML thus demands a swift deployment of corresponding hardware solutions.
Sparsity in Machine Learning Models:
Sparse models, characterized by adaptively activated pathways, offer higher efficiency, greater capacity, and improved accuracy. Understanding the difference between coarse-grained and fine-grained sparsity is crucial. Modern hardware’s support for sparsity is now a vital factor for computer architects to consider.
Adaptive Computation and Dynamically Changing Neural Networks:
Adaptive computation allows for varying computational costs, allocating more resources to more complex examples, thus enhancing efficiency. The emerging capability of neural networks to continuously adapt their structure and parameters represents a significant development in the field.
Pathways System:
The Pathways system is designed for dynamic resource management, facilitating hardware addition or removal during runtime and managing communication across multiple network transports. This system is a crucial development in handling dynamic computing needs.
Model Trends:
The trend in machine learning is shifting towards single models capable of handling multiple tasks. There’s a move towards sparse models for increased efficiency and an emphasis on handling various inputs and outputs within a single model.
Key Takeaways for Architects:
For architects, the focus should be on the importance of connectivity, bandwidth, and latency in accelerators. The significance of scale for effective training and inference, the influence of sparse models on memory capacity and routing, and the need for user-friendly ML software for complex models are critical. Power efficiency, sustainability, and reliability are essential considerations in this context.
CO2 Emissions in Machine Learning Training:
Previous overestimations of CO2 emissions in machine learning training have been corrected with more accurate data. Advancements in transformer models have played a role in reducing carbon emissions.
System Design for Generative AI:
There is a pressing need to rethink system design end-to-end, focusing on overall system goodput rather than component-level optimization.
Exponential Growth in Model Parameters:
The number of parameters in dense models has been growing at an exponential rate, increasing the computation costs associated with these models.
Conventional Wisdom Thrown Out the Window:
The traditional belief that general-purpose compute is sufficient for all computations is no longer tenable in the current landscape of machine learning.
TPUs and Specialized Hardware Innovations:
The emergence of TPUs and other specialized hardware architectures to meet the unique demands of machine learning represents a significant shift. Innovations in TPU technology include high-bandwidth synchronous interconnects for connectivity between computing units, liquid cooling for enhanced power efficiency, specialized data representations for optimized performance, optical circuit switching for reliable large-scale computer connectivity, hardware tailored for efficient scatter-gather operations and dense matrix multiplications, and high-bandwidth memory stacked on compute units for low latency and high bandwidth.
The future of ML demands systems that are scalable, efficient, and holistic in design, with a focus on power, sustainability, and reliability.
Shifting the Focus of Computing Metrics from Headline Performance to System Performance per Watt and Carbon Dioxide Emissions:
The traditional metric of performance per chip is becoming obsolete, with factors such as reliability, power consumption, and carbon dioxide emissions becoming increasingly important. The metric of system performance per average watt encourages efficient utilization of available power capacity, and it’s crucial to consider the entire system, not just the peak performance of individual components. Similarly, the metric of system performance per carbon dioxide emissions emphasizes the environmental impact of data centers, urging minimization of carbon dioxide emissions throughout the lifecycle of the infrastructure.
Optimizing Power Consumption:
Improving throughput while reducing power consumption involves understanding the power characteristics of jobs and making appropriate adjustments. System-level scheduling optimizations can lead to significant efficiency improvements. A cell-wide control plane can manage power consumption and optimize scheduling, spreading high-power jobs across multiple busbars to avoid overloading.
Reliability and Silent Data Corruption:
As machine learning workloads run synchronously across thousands of compute nodes, reliability becomes a critical concern. Silent data corruption poses a growing challenge that affects both CPUs and accelerated compute chips.
Benchmarking and Design Targets:
In designing computing systems for ML, power, reliability, and carbon dioxide equivalents are key benchmarks to consider.
Rise in Computing Demand:
There has been a staggering 10x annual growth in model parameters, leading to a corresponding increase in computing demands.
Radical Shift from General-Purpose Compute:
The demands of ML are driving a fundamental shift in computing systems, distinctly different from traditional general-purpose computers.
Accelerated Computing Progress:
Advancements in accelerated computing have surpassed those in general-purpose compute, enabling various breakthroughs in the field.
Computational Needs and Future Challenges:
The increasing complexity of models and data demands computational capabilities that extend beyond current accelerated computing solutions.
System Optimization Focus:
The focus is now on optimizing for system throughput, power, reliability, and carbon footprint, a shift from previous approaches.
Current Metrics Limitations:
Traditional metrics, such as chip performance alone, are insufficient for a comprehensive evaluation of computing systems.
Headline Numbers vs. System Performance:
Common metrics often fail to represent the actual performance of computations across complex systems.
MLPerf Benchmark Limitations:
The MLPerf benchmark, which focuses on performance at a given system size, tends to neglect other critical factors like system costs, emissions, efficiency, and power consumption.
System Performance over TCO:
Google assesses system designs based on performance over the total cost of ownership, including both capital and operating expenditures.
Perf/TCO Assumptions and Limitations:
The assumptions of Perf/TCO regarding data center capacity and power attribution are being reevaluated to better align with current needs.
Evolving Considerations:
Google is evolving its considerations to address the limitations of Perf/TCO, reflecting a shift in evaluation criteria for computing systems.
TPU Architecture Evolution:
Over the past eight years, Google’s TPU architecture has evolved significantly, adapting to increasingly complex computational problems.
Perf/TCO: An Evolving Metric for Machine Learning Workloads:
Traditional performance metrics are proving insufficient for evaluating ML workloads, leading to the development of new metrics that account for reliability, power consumption, and carbon dioxide emissions.
System Performance per Average Watt:
There is an increased emphasis on system performance per average watt, encouraging better power utilization in computing systems.
System Performance per Carbon Dioxide Emissions:
The focus is also on system performance relative to carbon dioxide emissions, highlighting the need for environmentally sustainable computing solutions.
Exploring Silent Data Corruption in Compute Elements:
Silent data corruption (SDC) is a growing challenge in large-scale computing, leading to incorrect results and potentially corrupting entire computations. Monitoring gradient norms can indicate SDC, but differentiating it from normal behavior is challenging. Rapid checkpointing and restart mechanisms are utilized to mitigate the impact of SDC.
Machine Learning in Hardware Design:
Machine learning is increasingly being used to accelerate and improve the design of specialized hardware. This includes architectural exploration, synthesis, verification, placement optimization, and customization of hardware design parameters and compilers for specific models. The development of ML-based tools and methodologies is revolutionizing hardware design.
Opportunities for ML in Chip Design and Manufacturing:
Machine learning capabilities are rapidly advancing, enabling fundamental changes in the computing community. ML models are becoming increasingly dynamic with evolving structures. The focus should be on system throughput rather than chip headline performance. Metrics such as power, CO2e efficiency, and SDCs are crucial to measure and improve.
Shorter Timelines for Chip Design and Deployment:
Shorter timelines are essential to adapt quickly to the changing ML landscape. ML automation can help streamline the design process.
Challenges in ML Automation for EDA:
Challenges in ML automation for electronic design automation (EDA) include data availability, physics-aware modeling, and the involvement of human experts.