This post collects notes on ECMP (Equal-Cost Multi-Path) and packet spraying in computer networks.

ECMP

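In a routing protocol, when several paths are available toward a destination and their cost metrics are equal, the router computes a hash over the packet header and uses that hash value to pin the flow to one fixed path as its next hop.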
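The hash-based choice described above can be sketched roughly as follows. This is a minimal illustration, not how any particular router implements it: the `FiveTuple` type, the `ecmp_next_hop` function, and the use of SHA-256 are assumptions made for the example (real devices typically use a hardware-friendly hash such as a CRC).

```python
import hashlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    """Header fields that identify a flow (hypothetical helper type)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # e.g. 6 for TCP, 17 for UDP


def ecmp_next_hop(flow: FiveTuple, next_hops: Sequence[str]) -> str:
    """Pick one next hop for a flow by hashing its header fields.

    SHA-256 is an assumption chosen for determinism in this sketch.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the hash to an index into the list of equal-cost next hops.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    hops = ["S1", "S3"]  # two equal-cost next hops (names are illustrative)
    flow = FiveTuple("10.0.0.1", "10.0.1.9", 40000, 443, 6)
    # Every packet of the same flow maps to the same next hop.
    print(ecmp_next_hop(flow, hops), ecmp_next_hop(flow, hops))
```

Because the hash depends only on header fields, all packets of a flow follow the same path and arrive in order; the cost is that two large flows can hash onto the same path.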

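ECMP is a flow-based load-balancing strategy. When a router finds multiple best paths to the same destination address, it updates its routing table with multiple entries for that destination, one per next hop. Traffic can then be forwarded over all of these paths at once, which increases the available bandwidth. ECMP is supported by many routing protocols, for example OSPF, IS-IS, EIGRP, and BGP.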

A network without ECMP enabled cannot make full use of its path resources. For example, as shown in the figure below:

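Suppose the path from S0 to Server is S0-S1-S2-S4 (the orange path in the figure). Even though another equal-cost path exists (the blue path in the figure), the router still forwards data over the orange path every time. Only when that path becomes congested does it select a different one.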

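With ECMP enabled, both paths can be used at the same time for flow-based load balancing. For example, the flow from host A to Server takes the orange path, and the flow from host B to Server takes the blue path.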

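ECMP can select paths in several ways:

  • Hashing: for example, choose a path for each flow based on a hash of the source IP address
  • Round-robin: flows are assigned to the available paths in turn
  • Path weights: flows are assigned according to path weights, so a path with a larger weight receives more flows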
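The round-robin and weight-based strategies in the list above could look roughly like this. It is only a sketch: the class names `RoundRobinSelector` and `WeightedSelector` are hypothetical, and weighted random choice is one assumed way to realize "more weight, more flows" (other schemes such as weighted round-robin would also fit the description).

```python
import itertools
import random
from typing import Dict, Hashable, Sequence


class RoundRobinSelector:
    """Assign each new flow to the next path in turn (hypothetical helper)."""

    def __init__(self, paths: Sequence[str]) -> None:
        self._cycle = itertools.cycle(paths)
        self._assigned: Dict[Hashable, str] = {}

    def path_for(self, flow_id: Hashable) -> str:
        # A flow keeps its path once assigned, so its packets stay in order.
        if flow_id not in self._assigned:
            self._assigned[flow_id] = next(self._cycle)
        return self._assigned[flow_id]


class WeightedSelector:
    """Assign new flows to paths with probability proportional to weight."""

    def __init__(self, weights: Dict[str, int], seed: int = 0) -> None:
        self._paths = list(weights)
        self._weights = [weights[p] for p in self._paths]
        self._rng = random.Random(seed)  # seeded only to make runs repeatable
        self._assigned: Dict[Hashable, str] = {}

    def path_for(self, flow_id: Hashable) -> str:
        if flow_id not in self._assigned:
            self._assigned[flow_id] = self._rng.choices(
                self._paths, weights=self._weights, k=1
            )[0]
        return self._assigned[flow_id]


if __name__ == "__main__":
    rr = RoundRobinSelector(["orange", "blue"])
    print([rr.path_for(f) for f in ["A", "B", "C", "A"]])

    ws = WeightedSelector({"orange": 3, "blue": 1})
    picks = [ws.path_for(i) for i in range(4000)]
    print(picks.count("orange"), picks.count("blue"))
```

Both selectors remember the path chosen for each flow, which keeps the balancing at flow granularity rather than packet granularity.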

ECMP’s Dilemma in AI Workloads

Datacenters often use Clos networks as their underlying fabric, which provide multiple equal-cost paths between any source and destination. To distribute traffic across these paths, ECMP is the most widely used load balancing (LB) mechanism; it determines the path of a flow by hashing the 5-tuple in the packet header. ECMP works well for traditional workloads, where millions of flows ensure a relatively even distribution of traffic. However, unlike traditional datacenter workloads, AI training workloads exhibit traffic patterns that are fundamentally mismatched with ECMP’s design:

  1. Small number of flows: In an AI training job, each node establishes very few connections, as communication is only required with a limited set of peers.
  2. Large flow sizes: The flow sizes typically range from several MBs to hundreds of MBs.
  3. Bursty traffic: AI training is an inherently synchronized process, where most nodes enter the communication phase almost simultaneously. This synchronization results in a bursty traffic pattern, with large volumes of data being exchanged in a short period of time.

These flow characteristics (i.e., few in number, large in size) result in a high ECMP collision rate, as the small number of flows can’t be evenly distributed across available paths, leading to severe performance degradation in AI training workloads.
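To get a feel for why few flows collide so often, the following sketch computes two simple quantities under an idealized assumption that each flow is hashed independently and uniformly onto one of the paths. The function names are hypothetical, and real hash behavior and topologies will differ.

```python
from math import perm


def prob_no_collision(num_flows: int, num_paths: int) -> float:
    """Probability that every flow lands on a distinct path.

    Assumes independent, uniform hashing (an idealization).
    """
    if num_flows > num_paths:
        return 0.0
    # Ordered ways to pick distinct paths, divided by all possible assignments.
    return perm(num_paths, num_flows) / num_paths ** num_flows


def expected_idle_paths(num_flows: int, num_paths: int) -> float:
    """Expected number of paths that receive no flow at all."""
    return num_paths * (1 - 1 / num_paths) ** num_flows


if __name__ == "__main__":
    print(f"{prob_no_collision(8, 8):.4%}")
    print(f"{expected_idle_paths(8, 8):.2f}")
```

Under this model, 8 large flows spread over 8 equal-cost paths avoid every collision only about 0.24% of the time, and on average about 2.75 of the 8 paths carry nothing while others carry two or more flows.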

Packet Spraying

Packet spraying is a high-performance network load-balancing technique that distributes the individual packets of a flow across multiple available network paths rather than pinning the flow to a single path. By using all paths, it achieves high link utilization and reduces tail latency, which is particularly beneficial for data center AI/ML workloads and bursty traffic.

Key details about packet spraying:

  • Benefits: It provides high throughput, mitigates congestion, and is effective for short-lived, latency-sensitive traffic.
  • Techniques:
    • Random Packet Spraying (RPS): Randomly assigns packets to paths to balance load.
    • Deterministic Packet Spraying: Uses a counter to spread packets evenly across paths.
    • Adaptive Packet Spraying (APS): Dynamically steers packets away from congested paths based on real-time feedback.
  • Challenges: The main drawback is potential out-of-order packet delivery, which can trigger unnecessary retransmissions.
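The three techniques and the reordering problem listed above can be illustrated with a toy model. Everything here is an assumption made for the sketch: the function names are hypothetical, "adaptive" is reduced to picking the path with the shortest known queue, and packet `i` is assumed to be sent at time `i` over a path with a fixed delay.

```python
import random
from typing import List, Sequence


def random_spray(num_packets: int, num_paths: int, seed: int = 0) -> List[int]:
    """Random spraying: each packet picks a path uniformly at random."""
    rng = random.Random(seed)  # seeded only to make runs repeatable
    return [rng.randrange(num_paths) for _ in range(num_packets)]


def round_robin_spray(num_packets: int, num_paths: int) -> List[int]:
    """Deterministic spraying: a packet counter cycles through the paths."""
    return [seq % num_paths for seq in range(num_packets)]


def adaptive_spray(num_packets: int, queue_depths: Sequence[int]) -> List[int]:
    """Adaptive spraying (simplified): send each packet to the shortest queue.

    queue_depths stands in for real-time congestion feedback.
    """
    depths = list(queue_depths)
    choices = []
    for _ in range(num_packets):
        path = min(range(len(depths)), key=depths.__getitem__)
        depths[path] += 1  # the chosen queue grows by one packet
        choices.append(path)
    return choices


def arrival_order(path_choices: Sequence[int], path_delay: Sequence[int]) -> List[int]:
    """Sequence numbers in arrival order, with packet i sent at time i."""
    arrivals = [(seq + path_delay[path], seq) for seq, path in enumerate(path_choices)]
    return [seq for _, seq in sorted(arrivals)]


if __name__ == "__main__":
    choices = round_robin_spray(4, 2)
    print(choices)                         # [0, 1, 0, 1]
    print(arrival_order(choices, [0, 5]))  # [0, 2, 1, 3]
    print(adaptive_spray(4, [5, 0, 2]))    # [1, 1, 1, 2]
```

In the example, path 1 is five time units slower than path 0, so packet 2 overtakes packet 1. A receiver that treats the gap as loss would request a retransmission that is not actually needed.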

Summary

  • ECMP is a flow-granularity load-balancing technique
  • Packet spraying is a packet-granularity load-balancing technique; the granularity is finer, but it introduces out-of-order packet delivery

References:

  1. Unlocking ECMP Programmability for Precise Traffic Control (NSDI ’25)
  2. 数据中心网络高可用技术:ECMP (Data Center Network High-Availability Techniques: ECMP)
  3. ECMP网络的介绍(转) (An Introduction to ECMP Networks, repost)
  4. Equal-Cost Multi-Path Routing (ECMP)
  5. Enabling Packet Spraying over Commodity RNICs with In-Network Support (APNet ’25)
  6. On the Impact of Packet Spraying in Data Center Networks (INFOCOM ’13)