
MPLS in the DC

nabil.netapp
Level 1

Greetings,

I am looking for suggestions on multitenancy options for a small DC; the design is VRF-heavy (about 400-500 VRFs).  The hardware setup is simple: Agg/ToR using the Nexus 3232C.  I am trying to keep the VLANs in the ToR and route between the Agg and ToR layers; the tricky part is the path isolation between the two layers.  I would rather not use a sub-interface for each VRF (a sketch of that approach is below the diagram).  I am thinking MPLS or BGP-LU, but I am looking for suggestions on alternative approaches if there are any.

        Cloud
          |
  Agg1 ------- Agg2
   |  \       /  |
   |    \   /    |
   |      X      |
   |    /   \    |
   |  /       \  |
  ToR1 ------- ToR2
          |
       Tenants
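
For reference, the per-VRF sub-interface approach I would rather avoid looks roughly like this on every ToR-to-Agg uplink (NX-OS style; VRF names, VLAN IDs and addresses are just placeholders), repeated for each of the 400-500 VRFs:

    vrf context TENANT-001
    !
    interface Ethernet1/49
      no switchport
    interface Ethernet1/49.101
      description TENANT-001 uplink to Agg1
      encapsulation dot1q 101
      vrf member TENANT-001
      ip address 10.1.101.2/30
      no shutdown

With ~500 VRFs that means ~500 sub-interfaces and ~500 point-to-point subnets per uplink, which is exactly the churn I am hoping MPLS or BGP-LU would remove.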

Thanks.

1 Reply

Adam Vitkovsky
Level 3

Hi,

All your options in a nutshell:

Fabric: IP or MPLS

Overlay boundary*: ToR (switch) or Hypervisor (vTEP/vPE/…)

Overlay type: VXLAN, native MPLS, or MPLS over GRE or IP-in-IP tunnels

*In order to talk to non-virtualized workloads you may need a boundary to the physical world in the DC, so if you haven't virtualized 100% of your DC workloads, you still need the overlay boundary at the ToR device.
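
For example, the "native MPLS overlay with the boundary at the ToR" option boils down to classic MPLS L3VPN PE configuration on the ToR. A minimal IOS-style sketch (whether a given ToR platform supports this is a separate question, see the notes on merchant silicon below; ASN, RD/RT values, interface names and neighbor addresses are placeholders):

    vrf definition TENANT-001
     rd 65000:101
     address-family ipv4
      route-target export 65000:101
      route-target import 65000:101
    !
    interface GigabitEthernet0/1
     description uplink to Agg1
     mpls ip                          ! LDP toward the Agg layer - no per-VRF sub-interfaces needed
    !
    router bgp 65000
     neighbor 10.255.0.1 remote-as 65000
     address-family vpnv4
      neighbor 10.255.0.1 activate    ! one VPNv4 session carries routes for all tenants

The point is that tenant separation rides on the VPN label, so the ToR-Agg links stay plain routed interfaces.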

The way I see it, there are two camps on the DC fabric:

1) The DC fabric itself doesn't really need to be complex. It just needs to be a pure and simple IP fabric capable of ECMP (ECMP is inherently FRR) - AT&T doesn't even have QoS configured on these underlay switches. It's merely a "fat pipe" (as fat as your Clos L2 stride, a.k.a. the number of spine switches) between each pair of end nodes.

And then all the complexity (multi-tenancy) is in the overlays; the overlay can start at the ToR switch or at the hypervisor.
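
A camp-1 underlay really can be that small. A minimal NX-OS-style sketch of one ToR's side of it, assuming eBGP per the usual Clos practice (ASNs and addresses are made up):

    feature bgp
    !
    router bgp 65101
      router-id 10.255.1.1
      address-family ipv4 unicast
        maximum-paths 16              ! ECMP across all Agg/spine uplinks
      neighbor 10.0.1.1
        remote-as 65001
        description Agg1
        address-family ipv4 unicast
      neighbor 10.0.1.5
        remote-as 65001
        description Agg2
        address-family ipv4 unicast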

2) Folks in camp 1) assume the DC fabric is non-blocking, since that's how it works in router fabrics, but routers achieve that with QoS, clever arbitration and "sort of per-packet" load-sharing.

A DC fabric doesn't work like this, however: there is no arbitration, just QoS at the edge, and there is no per-packet load-sharing, only per-flow load-sharing at most, which may lead to congestion on some of the fabric paths (two elephant flows hashed onto the same path cause congestion for real-time mice flows). This is where one needs traffic-engineering capabilities in the fabric itself, to be able to distribute elephant flows around mice flows in a non-blocking fashion.
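
To make the traffic-engineering point concrete: with MPLS in the fabric you can pin a known elephant flow to an explicit LSP instead of leaving it to the ECMP hash. An IOS-style RSVP-TE sketch, assuming MPLS-TE and the IGP TE extensions are already enabled on the fabric links (tunnel number, path name and addresses are placeholders):

    interface Tunnel100
     ip unnumbered Loopback0
     tunnel mode mpls traffic-eng
     tunnel destination 10.255.0.4                                 ! far-end ToR
     tunnel mpls traffic-eng path-option 1 explicit name AVOID-SPINE1
    !
    ip explicit-path name AVOID-SPINE1
     exclude-address 10.255.0.2      ! keep this LSP off the spine already carrying another elephant flow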

If the DC fabric understands MPLS, it is no longer just a passive pipe; it can actively participate in DC routing by peering with the fabric edge (the vPE can now peer with the ToR switch).

The complexity level remains the same (same multi-tenancy, still constrained to the edge), but now we have better control over how traffic is transported across the fabric and can make better use of its resources.
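
The BGP-LU mentioned in the original post is one common way to do this peering: the ToR (or the vPE below it) exchanges labeled loopbacks with the fabric, so label switching can extend end to end without an IGP/LDP relationship to every edge device. An IOS-style sketch (ASN and addresses are placeholders):

    router bgp 65000
     neighbor 10.0.2.2 remote-as 65000
     address-family ipv4
      network 10.255.1.1 mask 255.255.255.255   ! advertise the local loopback...
      neighbor 10.0.2.2 activate
      neighbor 10.0.2.2 send-label              ! ...with an MPLS label attached (RFC 3107 BGP-LU)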

In order to hide the complexity for DevOps and reduce the cost of the solution:

>     Managing MPLS is hard. So, hide the complexity.

>    The MPLS control plane is expensive. So, separate it from the hardware.

>    “Edge MPLS” is more expensive than “core MPLS”. So, move it to the server.

>    “Core MPLS” is not yet commodity. So, use a different tunneling technology.

SDN, and the Model-Driven Service Abstraction Layer (MD-SAL) in particular, will hide the underlying complexity from DevOps and allow for "point and click" provisioning, while NFV will drive down the cost.

Especially the cost involved with scaling: holding many routes and keeping complex control-plane state is much cheaper in x86 systems based on COTS hardware. That's why many SPs are looking into NFV for their iBGP infrastructure (virtual route reflectors) and a number of other use cases.
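
The vRR case is a good illustration because functionally nothing changes: it is ordinary route-reflector configuration, just running on an x86 VM instead of a router. An IOS-style sketch (ASN and client address are placeholders):

    router bgp 65000
     neighbor 10.255.1.1 remote-as 65000
     neighbor 10.255.1.1 update-source Loopback0
     address-family vpnv4
      neighbor 10.255.1.1 activate
      neighbor 10.255.1.1 route-reflector-client   ! the vRR holds the full VPNv4 state in cheap x86 memory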

>   Up until now, these IP-based encapsulations have served their purpose well because we typically only needed two levels of hierarchy: a transport tunnel and one piece of meta-data. 

>   As an example, the transport tunnel could identify the server and the meta-data could identify the virtual machine. 

>   For the meta-data, we could use a GRE key or a few other IP-based tricks such as using multiple loopbacks as tunnel end-points.

>   It has become clear over time that it is useful to have more than two levels of hierarchy, i.e. to have multiple levels of encapsulation.  MPLS is perfectly suited for this because it supports MPLS label stacks.
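
To make the quoted two-level example concrete, here is an IOS-style GRE sketch in which the tunnel endpoints identify the server and the GRE key stands in for the per-VM meta-data (addresses and key value are placeholders):

    interface Tunnel10
     ip address 192.168.10.1 255.255.255.252
     tunnel source Loopback0
     tunnel destination 192.0.2.10   ! "transport tunnel" = which server
     tunnel key 1001                 ! "meta-data" = which VM behind that server

Once you want more than those two levels, a stack of MPLS labels does the same job without inventing a new field for each extra level.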

However, one has to be aware of the limitations of these DC switches with regard to MPLS.

These switches are very cheap (much cheaper than PEs, or P routers for that matter) because they are built around merchant silicon, and that introduces some performance and scaling limitations.

>   Today’s merchant silicon has some limitations that must be carefully taken into account when using deeper label stacks. 

>   For example, Broadcom Trident 2 silicon can only push 3 MPLS labels. 

>   Such limitations can be circumvented by doing some of the processing (e.g. deep MPLS label pushes) in the servers rather than the switches.

These performance and scalability limitations have to be taken into consideration when selecting whitebox switches as well.

With regard to the "MPLS all the way down to the hypervisor" (vPE/vRouter) approach:

There are also some scaling limits to consider here. Each vPE/vRouter has to be addressable at least by its router-id IP address, and if you have tens of thousands of vPEs in the DC, the IGP used there (or in the core, if a common IGP is shared across DC, core and aggregation) may be outgrown by the sheer number of nodes on the MPLS network.

However, Tier-1 ISPs like AT&T, and several years later some of the mobile operators pushing MPLS all the way down to the RAN, faced the same problem, and rest assured there is a time-proven solution to address this particular scaling issue.

adam

netconsultings.com

::carrier-class solutions for the telecommunications industry::
