My OTV Take
Posted: April 1, 2013
After my recent DFW VMUG presentation where I spoke on the topic, a friend emailed me and asked what I thought about OTV.
“You mentioned that you were against OTV. Curious on your take on this, as we are using it across two datacenters using N7K, UCS, NetApp and VMware.”
I’d like to share my response to him here.
Please don’t get me wrong. If one is forced to implement a Layer 2 Data Center Interconnect (DCI), OTV is probably the best solution. Sometimes, L2 connectivity between data centers is a functional requirement – perhaps even a constraint. In these cases, one should look at the benefits and risks of implementing an L2 DCI and then make an informed decision on whether they should continue with such a deployment. Should they choose to deploy OTV, someone needs to accept the risks associated with OTV in its current implementation.
For instance, one problem that still exists with OTV in its current form is a traffic trombone should a VM move from one DC to the other, say from Data Center A to Data Center B. OTV solves only half the traffic trombone problem through the use of FHRP Isolation. Basically, only outbound traffic is path optimized. This is to say that the VM will see its default gateway exist in Data Center B and *not* Data Center A after the vMotion. External clients with active connections to the VM as it is vMotioned (which is the whole point of a live migration – to keep active connections) from Data Center A to Data Center B will still send traffic to Data Center A. This causes external client traffic to traverse the WAN in order to reach the VM in Data Center B.
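To make the half-solved trombone concrete, here is a minimal Python sketch of the two traffic directions after the move. It is a toy model under my own assumptions, not anything OTV exposes: the function names, the hop labels, and the single "DCI" hop are all hypothetical, and the only thing it tracks is whether a path has to cross the interconnect.

```python
# Toy model of the traffic trombone. All names and hop labels are
# hypothetical; this only tracks whether a path crosses the DCI.

def inbound_path(ingress_dc, vm_dc):
    """Path of external client traffic toward the VM."""
    path = ["client", f"edge-{ingress_dc}"]
    if ingress_dc != vm_dc:
        path.append("DCI")  # traffic entered the wrong DC and crosses the WAN
    path.append(f"vm@{vm_dc}")
    return path


def outbound_path(vm_dc, gateway_dc):
    """Path of traffic leaving the VM through its default gateway."""
    path = [f"vm@{vm_dc}"]
    if gateway_dc != vm_dc:
        path.append("DCI")  # gateway lives in the other DC
    path += [f"gateway-{gateway_dc}", "client"]
    return path


if __name__ == "__main__":
    # VM has been vMotioned from A to B; existing clients still enter at A.
    # With FHRP Isolation the VM uses the gateway local to B.
    print("inbound :", " -> ".join(inbound_path("A", "B")))
    print("outbound:", " -> ".join(outbound_path("B", "B")))
    # Without FHRP Isolation the VM could still be using the gateway in A.
    print("outbound, no FHRP Isolation:", " -> ".join(outbound_path("B", "A")))
```

In this model FHRP Isolation removes the DCI hop from the outbound path only. The inbound path keeps it for as long as clients continue to enter through Data Center A.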
The other problem with OTV is that a broadcast storm in one data center, say, Data Center A, can spill over into Data Center B, provided that the VLAN in which the storm occurs is stretched across the WAN using the overlay. Don’t mistake this for the STP Isolation that OTV offers. STP Isolation says that STP BPDUs will not traverse the overlay. When an STP BPDU reaches the OTV interface, it will be blocked from traversing the overlay. STP Isolation does *not* block ordinary broadcast frames, so it does nothing to contain a storm. One must implement storm control or similar technologies to protect against these broadcast storms.
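The difference between STP Isolation and broadcast containment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how an OTV edge device is implemented: the function name and the frame labels are hypothetical, and real storm control is rate-based (typically a percentage of interface bandwidth per interval), whereas this sketch uses a simple per-interval frame count.

```python
# Simplified sketch: which frames from Data Center A reach Data Center B
# over a stretched VLAN. Names are hypothetical; the storm limit is a plain
# frame count per interval, which is a simplification of real storm control.

def frames_crossing_overlay(frames, storm_limit=None):
    """Return the frames that cross the overlay during one interval."""
    crossed = []
    broadcasts_seen = 0
    for kind in frames:
        if kind == "bpdu":
            continue  # STP Isolation: BPDUs never traverse the overlay
        if kind == "broadcast":
            broadcasts_seen += 1
            if storm_limit is not None and broadcasts_seen > storm_limit:
                continue  # storm control: drop broadcasts over the limit
        crossed.append(kind)
    return crossed


if __name__ == "__main__":
    storm = ["bpdu"] * 3 + ["broadcast"] * 1000 + ["unicast"] * 5
    print("no storm control  :", len(frames_crossing_overlay(storm)))
    print("with storm control:", len(frames_crossing_overlay(storm, storm_limit=50)))
```

With only STP Isolation in place, every broadcast in the sample interval crosses to the other site. The limit is what keeps the storm local, at the cost of also dropping legitimate broadcasts once it is exceeded.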
Some easy solutions to the traffic trombone problem would be to cold migrate the VM, reboot it once it’s in the far DC, or somehow kill all the current external connections once the VM is in the far DC. That way, new connections will enter Data Center B in the first place, assuming your load balancers/DNS servers are aware of the move. This problem can be solved somewhat easily without much downtime at the expense of losing active connections during the move from DC A to DC B.
The one big argument that has existed for decades, and will likely continue, is that Layer 2 switching does not scale well. And when you force one Layer 2 domain (read: VLAN) to exist in two data centers at the same time, you’ve created a single fault domain where there used to be two. Now, many folks will tell you that they’ve been using OTV for over a year and a half with no problems. Those same people will say that they’ve turned off Spanning Tree in their LAN because they’ve been burned by it in the past and they haven’t had any problems so far. These are accidents waiting to happen. Someone plugs a cable into the wrong port or fat-fingers something at the CLI, a NIC starts flapping, or the next new software bug shows up. These problems always happen. When they do, they’ll not only take out a VLAN in Data Center A, but they’ll also take out that same VLAN in Data Center B.
Now, again, none of this changes the fact that sometimes one is forced to implement an L2 DCI. In these cases, just be sure to document your objections and be a good soldier. Make sure the decision makers understand the risks and accept them. Then implement and move on.
To say I’m against OTV just isn’t accurate. I agree with smarter dudes than I that one should be sure that a Layer 3 solution wouldn’t work better before settling for an L2 DCI. There are also some up-and-coming technologies that show promise in solving the traffic trombone, such as LISP. As a note, beware that as of today VXLAN is *not* an L2 DCI.
The January 2012 OTV primer from Cisco (http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DCI/whitepaper/DCI3_OTV_Intro_WP.pdf) states their broadcast control policy as such (speaking to my broadcast storm statement above):
“Broadcast Policy Control
…OTV will provide additional functionality such as broadcast suppression, broadcast white-listing, and so on, to reduce the amount of overall Layer 2 broadcast traffic sent across the overlay. Details will be provided upon future functional availability. (emphasis mine)”
The way I read this, such broadcast control is not implemented yet. Now, a January 2012 document is getting pretty dated, but I’m not aware of newer documentation that states otherwise.