Reviewing NetApp Shelf Faults


With the arrival of Spring days away, I’m getting the fever to get moving and share some great content.  I’m excited to be putting out some NetApp-related posts that I think people will find useful.  I’ve installed several new NetApp systems for clients recently and these posts should help them start managing, monitoring, and configuring their systems correctly from the start.  Or perhaps you’ve had NetApp in your environment for some time but have had questions about getting insight into your systems or some “best practices.”  These posts should get you started and answer some of those questions.


A coworker recently received an alert from NetApp AutoSupport and didn’t know what to do with it.  Since the alert itself wasn’t detailed enough to act on, I thought I’d share this for those who also receive these less-than-ideal alerts.

The original message read,

[image: screenshot of the AutoSupport alert message]

It says there’s a shelf fault warning but gives no additional information.  If we go to MyAutoSupport online we see the same message.  These generic shelf fault messages mean that there’s a hardware fault involving something like a fan, power supply, or temperature sensor.  At this point, we have two options.  We can either run some commands from the CLI to get more information or we can open a case with NetApp.  I’ll tell you though, if you call NetApp, they’re going to have you run these commands anyway, in addition to physically visiting the NetApp to see if you’ve lost power or if it’s obviously too hot or too cold.  Since we’re CLI ninjas, we’ll open an SSH session with NetApp-B and see what we can see.

We want to run the “environment” command to gather more info.

NetApp-B> environment
Usage: environment status |
                   [status] [shelf [<adapter>[.<shelf-number>]]] |
                   [status] [shelf_log] |
                   [status] [shelf_stats] |
                   [status] [shelf_power_status] |
                   [status] [chassis [all | list-sensors |]]

In particular, we want to run “environment status”.  The lines you want to pay attention to are the ones that identify what hardware is installed and whether there are any errors, for example “Power Supply installed element list: 1, 2, 3, 4; with error: none”.  For instance, perhaps there are 4 power supplies installed and 1 of them has an error.  In the output below, there are no errors found.

NetApp-B> environment status
Sensor Name              State          Current    Critical     Warning     Warning    Critical
                                        Reading       Low         Low         High       High
-------------------------------------------------------------------------------------------------
In Flow Temp             normal            40 C         0 C        10 C        49 C        55 C
Out Flow Temp            normal            46 C         0 C        10 C        62 C        67 C
CPU Temp Margin          normal           -43 C        --          --          -5 C         0 C
CPU VCC                  normal           902 mV      708 mV      746 mV     1348 mV     1425 mV
CPU VTT                  normal          1105 mV      931 mV      989 mV     1212 mV     1261 mV
CPU 1.05V                normal          1057 mV      892 mV      940 mV     1154 mV     1202 mV
CPU 1.5V                 normal          1493 mV     1270 mV     1348 mV     1649 mV     1726 mV
CPU 1.8V                 normal          1818 mV     1535 mV     1625 mV     1973 mV     2064 mV
10G 1.0V                 normal           989 mV      853 mV      902 mV     1096 mV     1154 mV
USB 5.0V                 normal          5030 mV     4252 mV     4495 mV     5491 mV     5759 mV
PCH_3.3V                 normal          3307 mV     2798 mV     2973 mV     3625 mV     3800 mV
SASS 1.0V                normal           999 mV      853 mV      902 mV     1096 mV     1154 mV
SASS 1.2V                normal          1183 mV     1018 mV     1076 mV     1319 mV     1377 mV
STBY 1.8V                normal          1804 mV     1532 mV     1619 mV     1978 mV     2066 mV
STBY 2.5V                normal          2480 mV     2121 mV     2246 mV     2745 mV     2870 mV
STBY 5.0V                normal          4957 mV     4252 mV     4495 mV     5491 mV     5759 mV
Power Good                                  OK
AC Power Fail                               OK
Bat 1.5V                 normal          1522 mV     1277 mV     1341 mV     1651 mV     1728 mV
Bat 8.0V                 normal          8000 mV       --          --        8600 mV     8700 mV
Bat Curr                 normal             0 mA       --          --         800 mA      900 mA
Bat Run Time             normal           152 hr       76 hr       78 hr       --          --
Bat Temp                 normal            32 C         0 C        10 C        55 C        64 C
Charger Curr             normal             0 mA       --          --        2200 mA     2300 mA
Charger Volt             normal          8200 mV       --          --        8600 mV     8700 mV
SP Status                               IPMI_HB_OK
PSU1                                    PRESENT
PSU1 5V                  normal          5150 mV       --          --          --          --
PSU1 12V                 normal         12220 mV       --          --          --          --
PSU1 5V Curr             normal          5030 mA       --          --          --          --
PSU1 12V Curr            normal          9410 mA       --          --          --          --
PSU1 Fan 1               normal          3900 RPM      --          --          --          --
PSU1 Fan 2               normal          3670 RPM      --          --          --          --
PSU1 Inlet Temp          normal            32 C         5 C        10 C        50 C        55 C
PSU1 Hotspot Temp        normal            40 C         5 C        10 C        65 C        70 C
PSU2                                    PRESENT
PSU2 5V                  normal          5150 mV       --          --          --          --
PSU2 12V                 normal         12300 mV       --          --          --          --
PSU2 5V Curr             normal          4800 mA       --          --          --          --
PSU2 12V Curr            normal          6640 mA       --          --          --          --
PSU2 Fan 1               normal          3820 RPM      --          --          --          --
PSU2 Fan 2               normal          3600 RPM      --          --          --          --
PSU2 Inlet Temp          normal            32 C         5 C        10 C        50 C        55 C
PSU2 Hotspot Temp        normal            40 C         5 C        10 C        65 C        70 C
PSU3                                    PRESENT
PSU3 5V                  normal          5110 mV       --          --          --          --
PSU3 12V                 normal         12140 mV       --          --          --          --
PSU3 5V Curr             normal          4530 mA       --          --          --          --
PSU3 12V Curr            normal          6640 mA       --          --          --          --
PSU3 Fan 1               normal          3900 RPM      --          --          --          --
PSU3 Fan 2               normal          3600 RPM      --          --          --          --
PSU3 Inlet Temp          normal            32 C         5 C        10 C        50 C        55 C
PSU3 Hotspot Temp        normal            40 C         5 C        10 C        65 C        70 C
PSU4                                    PRESENT
PSU4 5V                  normal          5150 mV       --          --          --          --
PSU4 12V                 normal         12220 mV       --          --          --          --
PSU4 5V Curr             normal          7180 mA       --          --          --          --
PSU4 12V Curr            normal          7340 mA       --          --          --          --
PSU4 Fan 1               normal          3670 RPM      --          --          --          --
PSU4 Fan 2               normal          3600 RPM      --          --          --          --
PSU4 Inlet Temp          normal            33 C         5 C        10 C        50 C        55 C
PSU4 Hotspot Temp        normal            41 C         5 C        10 C        65 C        70 C
PSU_FAN                                     OK
Ambient Temp             normal            23 C         0 C         5 C        40 C        42 C
Backplane Temp           normal            32 C         5 C        10 C        50 C        55 C
Module A Temp            normal            44 C         5 C        10 C        55 C        60 C
Module B Temp            normal            44 C         5 C        10 C        55 C        60 C
Board Backup Temp                       NORMAL
Usbmon Status                               OK
Usbmon Pres                             PRESENT
        Environment for channel 0a
        Number of shelves monitored: 1  enabled: yes
        Environmental failure on shelves on this channel? no

        Channel: 0a
        Shelf: 0
        SES device path: local access: 0a.00.99
        Module type: IOM6E; monitoring is active
        Shelf status: normal condition
        SES Configuration, shelf 0:
         logical identifier=0x50050cc102049db8
         vendor identification=NETAPP
         product identification=DS4246
         product revision level=0131
        Vendor-specific information:
         Product Serial Number: SHJMS0000004AAF
        Status reads attempted: 389820; failed: 0
        Control writes attempted: 1140; failed: 0
        Shelf bays with disk devices installed:
          23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
          with error: none
        Power Supply installed element list: 1, 2, 3, 4; with error: none
        Power Supply information by element:
          [1] Serial number: PMW8256200CF4AF  Part number: 0082562-20
              Type: 9C
              Firmware version: 0311  Swaps: 0
          [2] Serial number: PMW8256200E8A16  Part number: 0082562-20
              Type: 9C
              Firmware version: 0311  Swaps: 0
          [3] Serial number: PMW8256200CF4EB  Part number: 0082562-20
              Type: 9C
              Firmware version: 0311  Swaps: 0
          [4] Serial number: PMW8256200AB63F  Part number: 0082562-20
              Type: 9C
              Firmware version: 0311  Swaps: 0
        Voltage Sensor installed element list: 1, 2, 3, 4, 5, 6, 7, 8; with error: none
        Shelf voltages by element:
          [1] 5.15 Volts Normal voltage range
          [2] 12.18 Volts Normal voltage range
          [3] 5.15 Volts Normal voltage range
          [4] 12.34 Volts Normal voltage range
          [5] 5.11 Volts Normal voltage range
          [6] 12.22 Volts Normal voltage range
          [7] 5.15 Volts Normal voltage range
          [8] 12.26 Volts Normal voltage range
        Current Sensor installed element list: 1, 2, 3, 4, 5, 6, 7, 8; with error: none
        Shelf currents by element:
          [1] 5030 mA Normal current range
          [2] 9210 mA Normal current range
          [3] 4600 mA Normal current range
          [4] 6560 mA Normal current range
          [5] 4410 mA Normal current range
          [6] 6640 mA Normal current range
          [7] 7180 mA Normal current range
          [8] 7180 mA Normal current range
        Cooling Unit installed element list: 1, 2, 3, 4, 5, 6, 7, 8; with error: none
        Cooling Units by element:
          [1] 3900 RPM
          [2] 3600 RPM
          [3] 3820 RPM
          [4] 3600 RPM
          [5] 3970 RPM
          [6] 3600 RPM
          [7] 3670 RPM
          [8] 3670 RPM
        Temperature Sensor installed element list: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12; with error: none
        Shelf temperatures by element:
          [1] 23 C (73 F) (ambient)  Normal temperature range
          [2] 32 C (89 F)  Normal temperature range
          [3] 32 C (89 F)  Normal temperature range
          [4] 40 C (104 F)  Normal temperature range
          [5] 32 C (89 F)  Normal temperature range
          [6] 40 C (104 F)  Normal temperature range
          [7] 32 C (89 F)  Normal temperature range
          [8] 40 C (104 F)  Normal temperature range
          [9] 33 C (91 F)  Normal temperature range
          [10] 41 C (105 F)  Normal temperature range
          [11] 44 C (111 F)  Normal temperature range
          [12] 44 C (111 F)  Normal temperature range
        Temperature thresholds by element:
          [1] High critical: 42 C (107 F); high warning: 40 C (104 F)
              Low critical:  0 C (32 F); low warning: 5 C (41 F)
          [2] High critical: 55 C (131 F); high warning: 50 C (122 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [3] High critical: 55 C (131 F); high warning: 50 C (122 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [4] High critical: 70 C (158 F); high warning: 65 C (149 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [5] High critical: 55 C (131 F); high warning: 50 C (122 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [6] High critical: 70 C (158 F); high warning: 65 C (149 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [7] High critical: 55 C (131 F); high warning: 50 C (122 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [8] High critical: 70 C (158 F); high warning: 65 C (149 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [9] High critical: 55 C (131 F); high warning: 50 C (122 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [10] High critical: 70 C (158 F); high warning: 65 C (149 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [11] High critical: 60 C (140 F); high warning: 55 C (131 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
          [12] High critical: 60 C (140 F); high warning: 55 C (131 F)
              Low critical:  5 C (41 F); low warning: 10 C (50 F)
        ES Electronics installed element list: 1, 2; with error: none
        ES Electronics reporting element: 2
        ES Electronics information by element:
          [1] Serial number: 9404744698  Part number: 111-00846+D0
              CPLD version: 14  Swaps: 0
          [2] Serial number: 9404754978  Part number: 111-00846+D0
              CPLD version: 14  Swaps: 0
        Enclosure element list: 1; with error: none
        Enclosure information:
          [1] WWN: 50050cc102049db8  Shelf ID: 00
              Serial number: SHJMS0000004AAF  Part number: 111-01137+B0
              Midplane serial number: BPS0961627G41DC  Midplane part number: 096
        SAS connector attached element list: 1, 3; with error: none
        SAS cable information by element:
          [1] Internal connector
          [2] Vendor: <N/A> (disconnected)
              Type: <N/A> <N/A>  <N/A>  ID: <N/A>  Swaps: 0
              Serial number: <N/A>  Part number: <N/A>
          [3] Internal connector
          [4] Vendor: <N/A> (disconnected)
              Type: <N/A> <N/A>  <N/A>  ID: <N/A>  Swaps: 0
              Serial number: <N/A>  Part number: <N/A>
        ACP installed element list: 1, 2; with error: none
        ACP information by element:
          [1] MAC address: 00:A0:98:31:CE:5A
          [2] MAC address: 00:A0:98:31:C6:F2
        Processor Complex attached element list: 1, 2; with error: none
        SAS Expander Module installed element list: 1, 2; with error: none
        SAS Expander master module: 1

        Shelf mapping (shelf-assigned addresses) for channel 0a:
          Shelf   0:  23  22  21  20  19  18  17  16  15  14  13  12  11  10   9

The output above is from a FAS2240-4 with no external disk shelves, so it is relatively short.  If you have many shelves (and many paths) and you know which one has the error, you can limit the environment status command to a particular shelf and path.  This makes the output in an SSH session a bit leaner and easier to read.  You may also want to increase PuTTY’s scrollback buffer so you can scroll back up, or dump the output to a text file and search from there.

environment status shelf 0a.00
        Environment for channel 0a
        Number of shelves monitored: 1  enabled: yes
        Environmental failure on shelves on this channel? no
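
If you do dump the output to a text file, the search can be scripted.  What follows is only a rough Python sketch, not a NetApp tool: it scans saved “environment status” text for the “element list: ...; with error: ...” lines and reports any that say something other than “none”, along with the channel and shelf they belong to.  The function name shelf_faults is made up for illustration, and the sketch assumes that a faulted line puts the failing element numbers where “none” appears in the healthy output shown in this post.

```python
import re

# Matches lines such as:
#   "Power Supply installed element list: 1, 2, 3, 4; with error: none"
#   "SAS connector attached element list: 1, 3; with error: none"
#   "Enclosure element list: 1; with error: none"
ELEMENT_LINE = re.compile(
    r"^\s*(?P<kind>.+?) (?:installed |attached )?element list: "
    r"(?P<present>[^;]*); with error: (?P<errors>.*?)\s*$"
)
CHANNEL_LINE = re.compile(r"^\s*Channel: (\S+)")
SHELF_LINE = re.compile(r"^\s*Shelf: (\S+)")


def shelf_faults(text):
    """Return (channel, shelf, component, errors) for every element list
    whose 'with error' field is anything other than 'none'.

    Assumption: a fault shows up as element numbers in place of 'none'.
    """
    channel = shelf = None
    faults = []
    for line in text.splitlines():
        if m := CHANNEL_LINE.match(line):
            channel = m.group(1)   # remember which channel we are in
        elif m := SHELF_LINE.match(line):
            shelf = m.group(1)     # remember which shelf we are in
        elif m := ELEMENT_LINE.match(line):
            if m.group("errors") != "none":
                faults.append((channel, shelf, m.group("kind"), m.group("errors")))
    return faults
```

Read the saved file into a string and pass it to shelf_faults().  For the healthy output in this post it should come back with an empty list.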

When you identify the source of the fault, you can remediate by adjusting your A/C (I’ve had to deflect cold air before because it got too cold), replacing power supplies or fans, or simply plugging the power cable back in.
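
For temperature problems in particular, it can help to see how close each chassis sensor is to its warning thresholds.  The Python sketch below is an illustration under assumptions, not an official parser: it reads the sensor rows at the top of saved “environment status” output and computes the distance from each reading to the nearest warning threshold.  It assumes the column order shown in the table header above (critical low, warning low, warning high, critical high), and the names parse_row, headroom, and temperature_headroom are made up for this example.

```python
import re

# One sensor row: name, state, reading + unit, then four thresholds in the
# order critical low, warning low, warning high, critical high.
# "--" means the sensor has no threshold in that column.
ROW = re.compile(
    r"^(?P<name>\S.*?)\s{2,}(?P<state>[a-z_]+)\s+(?P<reading>-?\d+) (?P<unit>\S+)"
    r"(?P<limits>(?:\s+(?:--|-?\d+ \S+)){4})\s*$"
)


def parse_row(line):
    """Parse one sensor row into a dict, or return None for any other line."""
    m = ROW.match(line)
    if not m:
        return None
    limits = [None if t == "--" else int(t)
              for t in re.findall(r"--|-?\d+", m.group("limits"))]
    crit_lo, warn_lo, warn_hi, crit_hi = limits
    return {"name": m.group("name"), "state": m.group("state"),
            "reading": int(m.group("reading")), "unit": m.group("unit"),
            "crit_lo": crit_lo, "warn_lo": warn_lo,
            "warn_hi": warn_hi, "crit_hi": crit_hi}


def headroom(row):
    """Distance from the reading to the nearest warning threshold.

    Returns None if the sensor has no warning thresholds; a negative
    value means the reading is already outside the warning band.
    """
    gaps = []
    if row["warn_lo"] is not None:
        gaps.append(row["reading"] - row["warn_lo"])
    if row["warn_hi"] is not None:
        gaps.append(row["warn_hi"] - row["reading"])
    return min(gaps) if gaps else None


def temperature_headroom(text):
    """Return (headroom, sensor name) for temperature sensors, tightest first."""
    rows = (parse_row(line) for line in text.splitlines())
    temps = [(headroom(r), r["name"]) for r in rows
             if r and r["unit"] == "C" and headroom(r) is not None]
    return sorted(temps)
```

Using the “In Flow Temp” row from the output above (reading 40 C, high warning 49 C), headroom() gives 9, meaning that sensor is 9 degrees away from a warning.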


One Comment on “Reviewing NetApp Shelf Faults”

  1. pirhounix says:

    I was told on these types of models, if the SP is on the same Vlan as the data, it causes the system to generate these messages even if there is no real fault detected. I have this system running Cdot 9, which we had upgraded from 7-mode. When we were on 7-mode we never had a shelf fault, moving to Cdot seemed to cause this more often.

