Andy Bechtolsheim (Arista Co-Founder) – Keynote (Jan 2020)
The Evolution of Cloud Data Centers: Disaggregated Storage and High-Speed Connectivity
—
Introduction
The landscape of cloud data centers is undergoing a transformative evolution, driven by the exponential growth of cloud computing and the surging demand for high-speed connectivity. Billions of devices now rely on cloud-hosted services, necessitating massive data center expansion. This article covers the pivotal advances shaping this domain: the rise of disaggregated storage, rapid Ethernet speed transitions, and the challenges and solutions emerging in network and storage technology, with the most significant developments and their implications presented first.
The Rise of Disaggregated Storage in Cloud Data Centers
Key Developments and Challenges
Flash and distributed storage systems have driven a surge in intra-data-center, server-to-server traffic. To support these data demands, cloud data centers are transitioning to speeds of up to 400 gigabits per second. They use leaf-spine network architectures, which keep server-to-server distances uniform and performance consistent. Data centers within a metropolitan area are also interlinked into clusters for redundancy and scalability.
Ethernet speed transitions have also been rapid, moving from 40 gigabit to 100 gigabit ports, driven by cloud adoption and the need for cost-effective technology. Advances in silicon, following Moore’s Law, continue to shape network silicon development, although it lags behind CPUs and GPUs. The dramatic decrease in both the cost and size of network switches has made data centers more efficient.
Implications for Storage and Connectivity
Disaggregated storage, which separates data storage from the applications that use it, improves reliability and scalability. Protocols such as RDMA over Converged Ethernet (RoCE) and NVMe over TCP (NVMe/TCP) play crucial roles here. Storage servers can already saturate 100 gigabit links, and keeping network utilization below 50% is vital to avoid latency issues. Adoption of 100 gigabit and faster NICs, including 200 gigabit, is driven mainly by economics: the cost must be balanced against the performance gain.
Silicon Photonics Optical Interface Evolution:
The optical interface landscape is undergoing a significant transition. The majority of current optics standards are based on 25 gig NRZ technology, which limits SFP, QSFP, and OSFP modules to 25 gigabits per lane. The industry is moving toward 100 gig single-laser optics (100 gig DR), which offer cost advantages over 50 gig optics. This will take SFP, QSFP, and OSFP modules from 25 gig to 100 gig per lane.
Additionally, 400 gig optics (DR4, FR4, LR4) and 800 gig optics (SR8, DR8, dual FR4, FR8) will be based on 100 gig lambda technology. Moving beyond 800 gig is harder: at 800 gig, the time a minimum-size packet occupies the wire (about 666 picoseconds) already lines up with the internal clock period of the switch silicon. Reaching 1,600 gigabit Ethernet would therefore require two parallel packet-processing pipelines, which creates scalability problems at the silicon level.
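To make that packet-timing argument concrete, here is a back-of-the-envelope sketch of how long a minimum-size Ethernet frame occupies the wire at different port speeds. The 64-byte minimum frame and the 20 bytes of preamble plus inter-frame gap are standard Ethernet figures; the ~666 picosecond number quoted above falls between the two columns at 800 gig.

```python
# Serialization time of a minimum-size Ethernet frame at various port speeds.
# "bare" counts only the 64-byte frame; "on the wire" adds the 8-byte preamble
# and 12-byte inter-frame gap that also consume link time.

MIN_FRAME_BYTES = 64
OVERHEAD_BYTES = 8 + 12  # preamble + inter-frame gap

for speed_gbps in (100, 400, 800, 1600):
    t_bare_ps = MIN_FRAME_BYTES * 8 / (speed_gbps * 1e9) * 1e12
    t_wire_ps = (MIN_FRAME_BYTES + OVERHEAD_BYTES) * 8 / (speed_gbps * 1e9) * 1e12
    print(f"{speed_gbps:>4} Gb/s: {t_bare_ps:6.0f} ps bare, {t_wire_ps:6.0f} ps on the wire")
```

At 800 gig a minimum-size packet takes well under a nanosecond, roughly one packet per internal clock cycle of the switch chip; at 1,600 gig the chip would have to handle two packets per cycle, hence the two parallel pipes.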
The Future of Cloud Data Center Networking
Emerging Technologies and Standards
Emerging technologies such as PCIe 4.0 and 5.0 significantly increase server I/O bandwidth, which matters for both AI and conventional workloads. Switch silicon such as the Tomahawk 3 chip has dramatically increased switching bandwidth, enabling more efficient data center configurations. Adopting 100 gigabit and 400 gigabit optics is essential to keep pace with these rising speed demands.
PCIe Generation Evolution:
PCIe 3.0 has been the standard for nearly a decade, offering 8 gigabits per second per lane. PCIe 4.0 doubles that to 16 gigabits per lane, enabling faster I/O, faster network interfaces, and better support for GPUs. PCIe 5.0 doubles it again to 32 gigabits per lane, providing even more I/O bandwidth for conventional servers and AI GPU servers.
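For a sense of scale, the sketch below converts those per-lane rates into approximate usable bandwidth for a x16 slot; the 128b/130b encoding overhead applies to all three generations, and the resulting figures are rounded.

```python
# Approximate usable bandwidth of a x16 slot for recent PCIe generations.
# PCIe 3.0/4.0/5.0 all use 128b/130b encoding, so usable rate ≈ raw GT/s * 128/130.

GENERATIONS = {"PCIe 3.0": 8, "PCIe 4.0": 16, "PCIe 5.0": 32}  # GT/s per lane
LANES = 16
ENCODING_EFFICIENCY = 128 / 130

for gen, gtps in GENERATIONS.items():
    per_lane_gbps = gtps * ENCODING_EFFICIENCY      # usable gigabits per lane
    slot_gbytes_per_s = per_lane_gbps * LANES / 8   # gigabytes per second, x16 slot
    print(f"{gen}: {per_lane_gbps:5.2f} Gb/s per lane, ~{slot_gbytes_per_s:4.1f} GB/s per x16 slot")
```

One consequence: a 200 gigabit NIC needs roughly 25 GB/s of host bandwidth, which a PCIe 4.0 x16 slot can supply but a PCIe 3.0 x16 slot (about 16 GB/s) cannot.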
Network Interconnect Efficiency:
Top-of-rack switches have evolved significantly, with the latest Tomahawk 3 chip offering four times the bandwidth of its predecessor. This makes it possible to interconnect a large number of servers at very high speed and low cost using copper cables.
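For scale, the sketch below converts the Tomahawk 3's 12.8 terabits per second of switching capacity into port counts at common speeds; the breakdowns are simple divisions for illustration, not specific product SKUs.

```python
# Port counts implied by a 12.8 Tb/s switch chip (e.g. Tomahawk 3) at common port speeds.
CHIP_CAPACITY_GBPS = 12_800

for port_speed_gbps in (400, 200, 100):
    ports = CHIP_CAPACITY_GBPS // port_speed_gbps
    print(f"{ports:>3} ports at {port_speed_gbps} Gb/s")
```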
Disaggregating Data Placement for Scalability:
Disaggregating data placement from the application is crucial in the cloud to ensure scalability and flexibility. The spine tier, the top tier of the data center network, determines total throughput and can reach multiple petabits per second in large data centers. Any server can reach any other server within a few microseconds through this fabric, enabling efficient remote data access.
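A back-of-the-envelope latency budget shows where the "few microseconds" comes from; the cable length and per-hop switch latency below are illustrative assumptions, not figures from the talk.

```python
# Rough one-way server-to-server latency across a leaf-spine fabric.
# All numbers are illustrative assumptions.

FIBER_NS_PER_M = 5        # ~5 ns per meter (speed of light in fiber)
PATH_LENGTH_M = 300       # assumed cable distance across a large data center hall
HOPS = 3                  # leaf -> spine -> leaf
SWITCH_LATENCY_NS = 500   # assumed per-hop switching latency

one_way_ns = PATH_LENGTH_M * FIBER_NS_PER_M + HOPS * SWITCH_LATENCY_NS
print(f"one-way fabric latency ≈ {one_way_ns / 1000:.1f} µs")
```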
Storage Protocols and Performance:
RDMA over Converged Ethernet (RoCE) has been widely adopted for disaggregated storage, with two approaches to congestion management: priority flow control (PFC) and explicit congestion notification (ECN). PFC guarantees that no packets are dropped, which gives optimal performance, but it does not scale well in larger networks. ECN-based congestion control is the newer approach; it avoids the need for PFC, but its scalability is still being evaluated.
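To illustrate the ECN-based approach in the abstract, here is a minimal sketch of a sender that backs off multiplicatively when its acknowledgments carry congestion marks and otherwise probes upward additively. It is a simplified AIMD loop for intuition only; production RoCEv2 deployments use DCQCN, which is considerably more elaborate.

```python
# Minimal ECN-driven rate-adaptation sketch (illustrative only, not DCQCN).

def adjust_rate(rate_gbps: float, ecn_marked: bool,
                line_rate_gbps: float = 100.0,
                decrease_factor: float = 0.5,
                increase_step_gbps: float = 5.0) -> float:
    """Return the sender's new rate after one feedback interval."""
    if ecn_marked:
        # The switch marked packets instead of dropping them: back off multiplicatively.
        return max(rate_gbps * decrease_factor, 1.0)
    # No congestion feedback: probe for more bandwidth additively.
    return min(rate_gbps + increase_step_gbps, line_rate_gbps)

# Example: congestion feedback arrives during intervals 3-5, then clears.
rate = 100.0
for interval in range(12):
    rate = adjust_rate(rate, ecn_marked=3 <= interval <= 5)
    print(f"interval {interval:2d}: {rate:5.1f} Gb/s")
```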
NVMe over TCP is a newer standard that offers advantages in reliability and ease of deployment. It leverages TCP's reliable transport layer to guarantee packet ordering and retransmission. Implementations are under way, and hardware optimizations are being explored for further performance improvements.
Performance Considerations:
Storage servers can saturate 100 gig links, and conventional servers can saturate at least 50 gig interfaces. Network utilization should be kept below 50% to avoid latency and congestion issues. Simulations using NS3 have shown that increasing network interface speeds improves overall performance.
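The 50% guideline can be motivated with a textbook queueing argument: in an M/M/1 model, mean delay grows as 1/(1 − utilization), so it doubles at 50% load and climbs steeply past 80-90%. The sketch below tabulates that relation; it is a simple model for intuition, not a reproduction of the NS3 studies.

```python
# Relative queueing delay versus link utilization (M/M/1 model).
# Mean time in system is proportional to 1 / (1 - rho), normalized to an idle link.

for rho in (0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95):
    delay_factor = 1.0 / (1.0 - rho)
    print(f"utilization {rho:4.0%}: delay x{delay_factor:4.1f}")
```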
Challenges and Considerations
With the trend towards fewer CPU sockets, the relevance of NUMA-aware networking is diminishing, but it still plays a role in maximizing memory bandwidth. Emerging interfaces like CXL and Gen Z could impact networking architectures, particularly for applications benefiting from large shared memory environments. Studies conducted by Ariel Handel and Pallavi Shripali at Facebook highlight the importance of simulations in optimizing network performance.
Future of Server Networking:
Disaggregated Storage:
Disaggregated storage using RoCE-style or NVMe over TCP protocols is now functional in large cloud networks, though it requires careful tuning to achieve optimal performance. Easy-to-use disaggregated storage is expected to become widely available in the near future.
Two-Layer Networks:
Traditional two-layer core/edge network designs are problematic. Leaf-spine architectures built from larger switches can support a very large number of servers. Each additional layer introduces more problems, and down-speed conversion (stepping from faster to slower links) can cause network issues. Overprovisioning the network is recommended to minimize these issues.
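A rough sizing sketch shows how far a two-tier leaf-spine fabric reaches; the switch radix, uplink count, and oversubscription below are illustrative assumptions, not a specific design.

```python
# How many servers a two-tier leaf-spine fabric can reach (illustrative numbers).

LEAF_PORTS = 128        # ports per leaf switch
SPINE_PORTS = 128       # ports per spine switch
UPLINKS_PER_LEAF = 32   # leaf ports reserved for spine uplinks (one per spine)

downlinks_per_leaf = LEAF_PORTS - UPLINKS_PER_LEAF
oversubscription = downlinks_per_leaf / UPLINKS_PER_LEAF
max_leaves = SPINE_PORTS              # each leaf consumes one port on every spine
servers = max_leaves * downlinks_per_leaf

print(f"{oversubscription:.0f}:1 oversubscription, up to {servers} servers in two tiers")
```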
Performance Considerations:
Network speed is limited by TCP throughput and by the capabilities of smart NICs; RDMA can offer higher throughput than TCP. Multiple NICs are often used to boost performance, and dual 25 gig NICs are common, providing 25-50 gig per server. Servers typically have ample M.2 flash modules for high I/O capacity. The network stack must be optimized to deliver the required performance.
CXL and Gen Z Impact:
CXL and Gen Z interfaces may impact networking architecture. Applications that benefit from large CXL memory environments may emerge. Persistent memory and large memory configurations may enable new possibilities. However, this is unlikely to change the current cloud computing landscape.
Speed and Performance Improvements:
Moving from 100 gig to 200 gig NICs can improve performance, but quantifying the improvement depends on the workload profile and many other factors. NUMA-aware networking within servers is influenced by changing CPU trends, and future chips may combine what are now multiple sockets into a single package for better interconnectivity.
Economic Trade-offs:
Network bandwidth comes at a cost and must be balanced against the other components. Cloud providers aim to find the optimal spending ratio between storage, servers, and network: the goal is to avoid network bottlenecks without overspending on bandwidth that doesn't improve server performance. Currently the sweet spot is around 25-50 gig per server, with 3:1 oversubscription at the top-of-rack (ToR) switch. As CPU core counts and flash speeds increase, network speed is adjusted accordingly.
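As a concrete example of how the 3:1 figure arises, consider a hypothetical rack of 48 servers with 25 gig NICs and four 100 gig uplinks; the port counts are illustrative, not a quoted configuration.

```python
# Rack-level arithmetic behind a 3:1 ToR oversubscription ratio (illustrative numbers).

SERVERS_PER_RACK = 48
SERVER_NIC_GBPS = 25
UPLINKS = 4
UPLINK_SPEED_GBPS = 100

downlink_gbps = SERVERS_PER_RACK * SERVER_NIC_GBPS   # 1200 Gb/s toward the servers
uplink_gbps = UPLINKS * UPLINK_SPEED_GBPS            # 400 Gb/s toward the spine
print(f"downlink {downlink_gbps} Gb/s, uplink {uplink_gbps} Gb/s, "
      f"oversubscription {downlink_gbps / uplink_gbps:.0f}:1")
```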
Conclusion
The evolution of cloud data centers is marked by a relentless pursuit of higher speed, efficiency, and scalability. Disaggregated storage, rapid advancements in network technology, and the adoption of high-speed connectivity standards are reshaping how data centers operate and evolve. As the industry continues to innovate, challenges related to cost, scalability, and technological integration remain at the forefront. However, with ongoing research and development, the future of cloud data centers looks poised to meet the ever-growing demands of the digital world.
Notes by: WisdomWave