Replacing a Cisco Nexus 2224 Fabric Extender

Posted: January 14, 2012
So my team and I got a call to swing by a customer’s site on our way to another job. They told us half the ports went bad on a FEX and we were to install the replacement that just arrived onsite. In this post, I’ll explain how to replace the FEX (which is trivial) and more importantly how to verify that it’s working after installation.
The configuration consisted of two Nexus 5020s and two 2224 Fabric Extenders. Each FEX was connected only to its parent 5020; there was no cross-connection between the 5020s and the FEXs. The ESXi 4.1 servers and the NetApp storage used the FEXs for management and vMotion traffic. NIC teaming was not used at the servers; the management and vMotion VMkernel ports were configured to route based on originating virtual port ID. The NetApp controllers were configured to use LACP port-channeling as well as a virtual port channel (vPC). So there was full redundancy at the Nexus layer, because everything downstream was connected to both switches in the pair. There was no fear of losing connectivity during the replacement, although redundancy would be lost for several minutes.
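For reference, a FEX association on the parent 5020 looks roughly like this. This is a hedged sketch, not the customer's actual config: the FEX number 170 matches the show fex output later in this post, but the fabric interface numbers and description are hypothetical.

! Sketch of a typical FEX association on a Nexus 5020.
! FEX number 170 matches the show fex output below;
! the uplink interfaces (Eth1/17-18) are hypothetical.
fex 170
  description FEX-170
!
interface Ethernet1/17-18
  switchport mode fex-fabric
  fex associate 170

Because the FEX identity and its port configuration live on the parent 5020, a replacement chassis cabled to the same fabric ports inherits this configuration automatically.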
As I mentioned, the physical replacement of the FEX is trivial. All the cables were already labeled according to their destination port on the FEX. We removed the power cables from the FEX. At this point, anyone connected to the vSphere Client would see network-redundancy alarms pop on the management and vMotion vSwitches. We disconnected all the Ethernet cables and the 10GbE Twinax uplinks to the 5020, un-racked the FEX, moved the rails to the replacement FEX, and re-racked it. Plugging the power cables back in, we could see the power supply LEDs turn green and the status LED turn amber.
Verify Image Download
We stayed logged into the parent 5020 during the whole process. When the power plugs were initially pulled, running a show interface status command showed that the FEX ports had, of course, disappeared. When we first plugged in the replacement FEX and the status LED was amber, the same command showed the same results.
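Since FEX host interfaces are named after the FEX number (Ethernet170/1/1, Ethernet170/1/2, and so on for FEX 170), a quick way to check whether they have disappeared or returned is to filter the interface listing. A sketch, assuming FEX number 170 as in the output below:

Nexus-5020# show interface status | include Eth170

While the FEX is unpowered or still booting, this returns nothing; once the FEX is online, the full list of its host ports comes back.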
I hadn’t replaced a production FEX before. My previous datacenter implementations called for an initial configuration and setup of both the 5020s and FEXs. I knew that the FEX downloaded its image from its parent 5020, but I wasn’t sure exactly how a replacement would work. Was there any configuration to it at all? Surely it wasn’t as easy as a simple rip-and-replace. So I hit up the Cisco Support Forums and asked. Here’s the thread I started: https://supportforums.cisco.com/thread/2124831.
As we were logged into the parent 5020 and watching the status LED blink green, I obsessively kept running the show fex command. This was very useful, although at first I was kind of worried. I only saw this!
Nexus-5020# show fex
It didn’t show anything! But, I kept running show fex over and over, and after a minute or two, it started to show some output. It said it was downloading the image. Excellent!
Nexus-5020# show fex
  FEX           FEX                FEX               FEX
Number     Description           State             Model          Serial
170        FEX-170          Image Download     N2K-2224TPGE     xxxxxxxxxxx
Several minutes later, the same command showed the FEX State as Offline; then, when the status LED turned solid green, it showed Online. A show interface status command showed the previous configuration of the FEX ports: the FEX uplinks to the 5020 and the host ports all showed connected.
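For completeness, here is a sketch of what the final show fex output looks like once the FEX comes up, with the same fields as the earlier output and the serial masked as before:

Nexus-5020# show fex
  FEX           FEX                FEX               FEX
Number     Description           State             Model          Serial
170        FEX-170               Online         N2K-2224TPGE     xxxxxxxxxxx

If you want more detail at this point, show fex 170 detail reports the fabric interface state and the pinning of host ports to uplinks.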
To test failover and properly verify that the replacement FEX is functional, start a continuous ping to several different hosts that connect through the FEXs, then pull a host connection to the FEX that was not replaced. This forces traffic onto the FEX that was just replaced, if it wasn’t using it already. You can then reconnect the host links to both FEXs and pull the other link to verify full connectivity in both directions. You should lose at most one ping during each failover, if you lose any at all. In our case, since the vMotion network ran through the FEXs, we could also vMotion a test VM while watching a continuous ping to it, pulling a single uplink at a time. Again, at most one ping should be lost.
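The continuous pings above can be run from any workstation or jump host; the target addresses here are hypothetical placeholders, not the customer's actual hosts:

# Windows: -t pings continuously until stopped with Ctrl+C
ping -t 10.0.0.10

# Linux/macOS: ping is continuous by default; stop with Ctrl+C
ping 10.0.0.10

Watching several of these side by side (one per host behind the FEXs) makes it obvious whether a pulled link costs zero pings, one ping, or more than the single drop you should expect.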
So that’s all there was. A simple hardware replacement, running some commands to verify the system re-configured itself, then running some simple ping tests. It couldn’t be easier.