Reviewing NetApp Shelf Faults
Posted: March 11, 2014 | Author: VirtuallyMikeB | Filed under: NetApp, Storage | Tags: (SHELF_FAULT) WARNING, an autosupport message was generated, fault, ha group notification, NetApp, netapp shelf fault, shelf fault, shelf_fault, warning
With the arrival of Spring days away, I’m getting the fever to get moving and share some great content. I’m excited to be putting out some NetApp-related posts that I think people will find useful. I’ve installed several new NetApp systems for clients recently and these posts should help them start managing, monitoring, and configuring their systems correctly from the start. Or perhaps you’ve had NetApp in your environment for some time but have had questions about getting insight into your systems or some “best practices.” These posts should get you started and answer some of those questions.
A coworker recently received an alert from NetApp AutoSupport and didn’t know what to do with it. Since the alert itself wasn’t detailed enough to act on, I thought I’d share this for those who also receive these less-than-ideal alerts.
The original message read,
It says there’s a shelf fault warning but gives no additional information. If we go to MyAutoSupport online, we see the same message. These generic shelf fault messages mean that there’s a hardware fault in something like a fan, power supply, or temperature sensor. At this point, we have two options. We can either run some commands from the CLI to get more information or we can open a case with NetApp. I’ll tell you, though, if you call NetApp, they’re going to have you run these commands anyway, in addition to physically visiting the NetApp to see if you’ve lost power or if it’s obviously too hot or too cold. Since we’re CLI ninjas, we’ll open an SSH session with NetApp-B and see what we can see.
We want to run the “environment” command to gather more info.
NetApp-B> environment
Usage: environment status |
     [status] [shelf [<adapter>[.<shelf-number>]]] |
     [status] [shelf_log] |
     [status] [shelf_stats] |
     [status] [shelf_power_status] |
     [status] [chassis [all | list-sensors |]]
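If you would rather collect this output from a script than from a terminal window, the shelf form of the usage line is easy to assemble programmatically. The Python below is only a rough sketch: the function names, the `root` login, and the assumption that the controller accepts a command passed non-interactively to the `ssh` client are mine, not something confirmed by the system shown here.

```python
import subprocess


def environment_command(adapter=None, shelf=None):
    """Build an 'environment status' command string.

    Mirrors the usage text: status [shelf [<adapter>[.<shelf-number>]]].
    With no arguments it returns the plain 'environment status' command.
    """
    if shelf is not None and adapter is None:
        raise ValueError("a shelf number needs an adapter, e.g. 0a")
    command = "environment status"
    if adapter is not None:
        command += " shelf " + adapter
        if shelf is not None:
            command += "." + str(shelf)
    return command


def ssh_argv(host, command, user="root"):
    """Argument list for the OpenSSH client (user and host are assumptions)."""
    return ["ssh", f"{user}@{host}", command]


def capture(host, command, path, user="root"):
    """Run the command over SSH and save whatever it prints to a text file."""
    result = subprocess.run(ssh_argv(host, command, user),
                            capture_output=True, text=True, check=True)
    with open(path, "w") as handle:
        handle.write(result.stdout)
    return path
```

For example, `capture("NetApp-B", environment_command("0a", "00"), "shelf_0a_00.txt")` would ask for shelf 00 on adapter 0a and write the result to a hypothetical file named `shelf_0a_00.txt`.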
In particular, we want to run “environment status.” The lines you want to pay attention to are the ones that list the installed elements and end in “with error:”. These lines identify what hardware is installed and whether there are any errors. For instance, perhaps there are 4 power supplies installed and 1 of them has an error. In the output below, no errors are found.
NetApp-B> environment status
Sensor Name        State      Current   Critical  Warning   Warning   Critical
                              Reading   Low       Low       High      High
------------------------------------------------------------------------------
In Flow Temp       normal     40 C      0 C       10 C      49 C      55 C
Out Flow Temp      normal     46 C      0 C       10 C      62 C      67 C
CPU Temp Margin    normal     -43 C     --        --        -5 C      0 C
CPU VCC            normal     902 mV    708 mV    746 mV    1348 mV   1425 mV
CPU VTT            normal     1105 mV   931 mV    989 mV    1212 mV   1261 mV
CPU 1.05V          normal     1057 mV   892 mV    940 mV    1154 mV   1202 mV
CPU 1.5V           normal     1493 mV   1270 mV   1348 mV   1649 mV   1726 mV
CPU 1.8V           normal     1818 mV   1535 mV   1625 mV   1973 mV   2064 mV
10G 1.0V           normal     989 mV    853 mV    902 mV    1096 mV   1154 mV
USB 5.0V           normal     5030 mV   4252 mV   4495 mV   5491 mV   5759 mV
PCH_3.3V           normal     3307 mV   2798 mV   2973 mV   3625 mV   3800 mV
SASS 1.0V          normal     999 mV    853 mV    902 mV    1096 mV   1154 mV
SASS 1.2V          normal     1183 mV   1018 mV   1076 mV   1319 mV   1377 mV
STBY 1.8V          normal     1804 mV   1532 mV   1619 mV   1978 mV   2066 mV
STBY 2.5V          normal     2480 mV   2121 mV   2246 mV   2745 mV   2870 mV
STBY 5.0V          normal     4957 mV   4252 mV   4495 mV   5491 mV   5759 mV
Power Good         OK
AC Power Fail      OK
Bat 1.5V           normal     1522 mV   1277 mV   1341 mV   1651 mV   1728 mV
Bat 8.0V           normal     8000 mV   --        --        8600 mV   8700 mV
Bat Curr           normal     0 mA      --        --        800 mA    900 mA
Bat Run Time       normal     152 hr    76 hr     78 hr     --        --
Bat Temp           normal     32 C      0 C       10 C      55 C      64 C
Charger Curr       normal     0 mA      --        --        2200 mA   2300 mA
Charger Volt       normal     8200 mV   --        --        8600 mV   8700 mV
SP Status          IPMI_HB_OK
PSU1               PRESENT
PSU1 5V            normal     5150 mV   --        --        --        --
PSU1 12V           normal     12220 mV  --        --        --        --
PSU1 5V Curr       normal     5030 mA   --        --        --        --
PSU1 12V Curr      normal     9410 mA   --        --        --        --
PSU1 Fan 1         normal     3900 RPM  --        --        --        --
PSU1 Fan 2         normal     3670 RPM  --        --        --        --
PSU1 Inlet Temp    normal     32 C      5 C       10 C      50 C      55 C
PSU1 Hotspot Temp  normal     40 C      5 C       10 C      65 C      70 C
PSU2               PRESENT
PSU2 5V            normal     5150 mV   --        --        --        --
PSU2 12V           normal     12300 mV  --        --        --        --
PSU2 5V Curr       normal     4800 mA   --        --        --        --
PSU2 12V Curr      normal     6640 mA   --        --        --        --
PSU2 Fan 1         normal     3820 RPM  --        --        --        --
PSU2 Fan 2         normal     3600 RPM  --        --        --        --
PSU2 Inlet Temp    normal     32 C      5 C       10 C      50 C      55 C
PSU2 Hotspot Temp  normal     40 C      5 C       10 C      65 C      70 C
PSU3               PRESENT
PSU3 5V            normal     5110 mV   --        --        --        --
PSU3 12V           normal     12140 mV  --        --        --        --
PSU3 5V Curr       normal     4530 mA   --        --        --        --
PSU3 12V Curr      normal     6640 mA   --        --        --        --
PSU3 Fan 1         normal     3900 RPM  --        --        --        --
PSU3 Fan 2         normal     3600 RPM  --        --        --        --
PSU3 Inlet Temp    normal     32 C      5 C       10 C      50 C      55 C
PSU3 Hotspot Temp  normal     40 C      5 C       10 C      65 C      70 C
PSU4               PRESENT
PSU4 5V            normal     5150 mV   --        --        --        --
PSU4 12V           normal     12220 mV  --        --        --        --
PSU4 5V Curr       normal     7180 mA   --        --        --        --
PSU4 12V Curr      normal     7340 mA   --        --        --        --
PSU4 Fan 1         normal     3670 RPM  --        --        --        --
PSU4 Fan 2         normal     3600 RPM  --        --        --        --
PSU4 Inlet Temp    normal     33 C      5 C       10 C      50 C      55 C
PSU4 Hotspot Temp  normal     41 C      5 C       10 C      65 C      70 C
PSU_FAN            OK
Ambient Temp       normal     23 C      0 C       5 C       40 C      42 C
Backplane Temp     normal     32 C      5 C       10 C      50 C      55 C
Module A Temp      normal     44 C      5 C       10 C      55 C      60 C
Module B Temp      normal     44 C      5 C       10 C      55 C      60 C
Board Backup Temp  NORMAL
Usbmon Status      OK
Usbmon Pres        PRESENT

Environment for channel 0a
Number of shelves monitored: 1 enabled: yes
Environmental failure on shelves on this channel? no

Channel: 0a
Shelf: 0
SES device path: local access: 0a.00.99
Module type: IOM6E; monitoring is active
Shelf status: normal condition
SES Configuration, shelf 0:
  logical identifier=0x50050cc102049db8
  vendor identification=NETAPP
  product identification=DS4246
  product revision level=0131
Vendor-specific information:
  Product Serial Number: SHJMS0000004AAF
Status reads attempted: 389820; failed: 0
Control writes attempted: 1140; failed: 0
Shelf bays with disk devices installed:
  23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
  with error: none
Power Supply installed element list: 1, 2, 3, 4; with error: none
Power Supply information by element:
  [1] Serial number: PMW8256200CF4AF Part number: 0082562-20 Type: 9C Firmware version: 0311 Swaps: 0
  [2] Serial number: PMW8256200E8A16 Part number: 0082562-20 Type: 9C Firmware version: 0311 Swaps: 0
  [3] Serial number: PMW8256200CF4EB Part number: 0082562-20 Type: 9C Firmware version: 0311 Swaps: 0
  [4] Serial number: PMW8256200AB63F Part number: 0082562-20 Type: 9C Firmware version: 0311 Swaps: 0
Voltage Sensor installed element list: 1, 2, 3, 4, 5, 6, 7, 8; with error: none
Shelf voltages by element:
  [1] 5.15 Volts Normal voltage range
  [2] 12.18 Volts Normal voltage range
  [3] 5.15 Volts Normal voltage range
  [4] 12.34 Volts Normal voltage range
  [5] 5.11 Volts Normal voltage range
  [6] 12.22 Volts Normal voltage range
  [7] 5.15 Volts Normal voltage range
  [8] 12.26 Volts Normal voltage range
Current Sensor installed element list: 1, 2, 3, 4, 5, 6, 7, 8; with error: none
Shelf currents by element:
  [1] 5030 mA Normal current range
  [2] 9210 mA Normal current range
  [3] 4600 mA Normal current range
  [4] 6560 mA Normal current range
  [5] 4410 mA Normal current range
  [6] 6640 mA Normal current range
  [7] 7180 mA Normal current range
  [8] 7180 mA Normal current range
Cooling Unit installed element list: 1, 2, 3, 4, 5, 6, 7, 8; with error: none
Cooling Units by element:
  [1] 3900 RPM
  [2] 3600 RPM
  [3] 3820 RPM
  [4] 3600 RPM
  [5] 3970 RPM
  [6] 3600 RPM
  [7] 3670 RPM
  [8] 3670 RPM
Temperature Sensor installed element list: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12; with error: none
Shelf temperatures by element:
  [1] 23 C (73 F) (ambient) Normal temperature range
  [2] 32 C (89 F) Normal temperature range
  [3] 32 C (89 F) Normal temperature range
  [4] 40 C (104 F) Normal temperature range
  [5] 32 C (89 F) Normal temperature range
  [6] 40 C (104 F) Normal temperature range
  [7] 32 C (89 F) Normal temperature range
  [8] 40 C (104 F) Normal temperature range
  [9] 33 C (91 F) Normal temperature range
  [10] 41 C (105 F) Normal temperature range
  [11] 44 C (111 F) Normal temperature range
  [12] 44 C (111 F) Normal temperature range
Temperature thresholds by element:
  [1] High critical: 42 C (107 F); high warning: 40 C (104 F)
      Low critical: 0 C (32 F); low warning: 5 C (41 F)
  [2] High critical: 55 C (131 F); high warning: 50 C (122 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [3] High critical: 55 C (131 F); high warning: 50 C (122 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [4] High critical: 70 C (158 F); high warning: 65 C (149 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [5] High critical: 55 C (131 F); high warning: 50 C (122 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [6] High critical: 70 C (158 F); high warning: 65 C (149 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [7] High critical: 55 C (131 F); high warning: 50 C (122 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [8] High critical: 70 C (158 F); high warning: 65 C (149 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [9] High critical: 55 C (131 F); high warning: 50 C (122 F)
      Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [10] High critical: 70 C (158 F); high warning: 65 C (149 F)
       Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [11] High critical: 60 C (140 F); high warning: 55 C (131 F)
       Low critical: 5 C (41 F); low warning: 10 C (50 F)
  [12] High critical: 60 C (140 F); high warning: 55 C (131 F)
       Low critical: 5 C (41 F); low warning: 10 C (50 F)
ES Electronics installed element list: 1, 2; with error: none
ES Electronics reporting element: 2
ES Electronics information by element:
  [1] Serial number: 9404744698 Part number: 111-00846+D0 CPLD version: 14 Swaps: 0
  [2] Serial number: 9404754978 Part number: 111-00846+D0 CPLD version: 14 Swaps: 0
Enclosure element list: 1; with error: none
Enclosure information:
  [1] WWN: 50050cc102049db8 Shelf ID: 00
      Serial number: SHJMS0000004AAF Part number: 111-01137+B0
      Midplane serial number: BPS0961627G41DC Midplane part number: 096
SAS connector attached element list: 1, 3; with error: none
SAS cable information by element:
  [1] Internal connector
  [2] Vendor: <N/A> (disconnected)
      Type: <N/A> <N/A> <N/A> ID: <N/A> Swaps: 0
      Serial number: <N/A> Part number: <N/A>
  [3] Internal connector
  [4] Vendor: <N/A> (disconnected)
      Type: <N/A> <N/A> <N/A> ID: <N/A> Swaps: 0
      Serial number: <N/A> Part number: <N/A>
ACP installed element list: 1, 2; with error: none
ACP information by element:
  [1] MAC address: 00:A0:98:31:CE:5A
  [2] MAC address: 00:A0:98:31:C6:F2
Processor Complex attached element list: 1, 2; with error: none
SAS Expander Module installed element list: 1, 2; with error: none
SAS Expander master module: 1

Shelf mapping (shelf-assigned addresses) for channel 0a:
  Shelf 0: 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
The output above is from a FAS2240-4 with no external disk shelves and so is relatively short. If you have many shelves (and many paths) and you know which one has the error, you can run the environment status command with a focus on a particular shelf and path. This makes the output in an SSH session a bit leaner and easier to read. You may also want to increase PuTTY’s scrollback buffer so you can scroll back up, or dump the output to a text file and search from there.
environment status shelf 0a.00
Environment for channel 0a
Number of shelves monitored: 1 enabled: yes
Environmental failure on shelves on this channel? no
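If you do dump the output to a text file, the searching can be scripted. The Python below is a minimal sketch of that idea: it scans saved output for the “with error:” lines and reports any whose value is something other than “none”. The function name is hypothetical, and the sketch assumes the saved file keeps the controller’s original line breaks and that a faulted element shows up as a value other than “none” after “with error:”.

```python
def find_faults(text):
    """Return (description, errors) pairs for lines that report an error.

    A line such as
        Power Supply installed element list: 1, 2, 3, 4; with error: none
    is treated as healthy; anything other than 'none' after 'with error:'
    is reported.
    """
    faults = []
    for line in text.splitlines():
        head, marker, tail = line.partition("with error:")
        if not marker:
            continue  # this line does not carry an error summary
        errors = tail.strip()
        if errors.lower() != "none":
            # Drop the trailing '; ' left over from the element list.
            faults.append((head.strip(" ;"), errors))
    return faults
```

To use it, read the saved file into a string (for example with `open("environment_status.txt").read()`, where the file name is only a placeholder) and pass it to `find_faults`. An empty list means every element list reported “with error: none”.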
When you identify the source of the fault, you can remediate by adjusting your A/C (I’ve had to deflect cold air before because it got too cold), replacing power supplies or fans, or simply plugging the power cable back in.
I was told that on these types of models, if the SP is on the same VLAN as the data, it causes the system to generate these messages even if no real fault is detected. I have this system running cDOT 9, which we upgraded from 7-Mode. When we were on 7-Mode we never had a shelf fault; moving to cDOT seemed to cause this more often.