Check hard disk with SMART

Unlike other server providers on the market, we replace hard disks as soon as they show signs of a possible defect. We do not wait until a disk damages a software RAID array or fails completely before a technician replaces it. We therefore recommend that our customers regularly check whether the hard disks in the RAID array show abnormal SMART values.

 

Reading SMART values of a hard disk

Reading the SMART values of a hard disk is very easy under Linux. Either boot the server into the SysrescueCD boot image using our boot mode, or install the smartmontools package on Debian / Ubuntu. The installation command is:
apt-get install smartmontools
Once smartmontools is installed, you can use the command:
smartctl -a /dev/sda
to read out the SMART values of the corresponding hard disk. Depending on how many hard disks the server has, replace sda with sdb, sdc, and so on.
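
If the server has several disks, the command above can be wrapped in a small loop. The following is only a sketch: the function name read_smart_all and the SMARTCTL override variable are hypothetical helpers introduced here for illustration, and the device names are examples.

```shell
#!/bin/sh
# Print the full SMART report for every device passed as an argument.
# SMARTCTL is a hypothetical override (not part of smartmontools) that
# lets you point the function at another binary or a test stub.
read_smart_all() {
    for dev in "$@"; do
        echo "=== $dev ==="
        "${SMARTCTL:-smartctl}" -a "$dev"
    done
}

# Example call (needs root and real disks):
#   read_smart_all /dev/sda /dev/sdb
```

Passing the device list explicitly avoids guessing which sdX names exist on a given server.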

 

The output could then look like this:

 

SSD
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   ---   ---   ---    Pre-fail  Always       -       14597175146359
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0032   ---   ---   ---    Old_age   Always       -       5861434915543908352
 12 Power_Cycle_Count       0x0032   ---   ---   ---    Old_age   Always       -       353380
171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       5
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       3
181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always       -       30 (Lifetime Min/Max 30/30)
195 Program_Failure_Blk_Ct  0x001c   ---   ---   ---    Old_age   Offline      -       14597175146616
196 Erase_Failure_Blk_Ct    0x0033   ---   ---   ---    Pre-fail  Always       -       91236
201 Write_Commands_Tot_Ct   0x001c   ---   ---   ---    Old_age   Offline      -       14597175146616
204 Bad_Block_Full_Flag     0x001c   ---   ---   ---    Old_age   Offline      -       14597175146616
230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       2721
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       3727
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       3727
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       4102


HDD
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   095   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail  Always       -       449237574
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       47975
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       41
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   034   045    Old_age   Always   In_the_past 37 (Lifetime Min/Max 32/46)
194 Temperature_Celsius     0x0022   037   066   000    Old_age   Always       -       37 (0 21 0 0)
195 Hardware_ECC_Recovered  0x001a   056   053   000    Old_age   Always       -       19103120
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
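
Tables like the ones above are long, so it can help to filter them down to a few attributes. The sketch below assumes the ten-column layout shown above (RAW_VALUE in column 10); the function name critical_attrs and the choice of attributes are illustrative assumptions, not an official list.

```shell
#!/bin/sh
# Read a smartctl attribute table on stdin and print "NAME RAW_VALUE"
# for a few attributes that often point to a failing disk.
# Assumes the column layout shown above: field 2 = name, field 10 = raw value.
critical_attrs() {
    awk '$1 ~ /^[0-9]+$/ &&
         $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect)$/ {
             print $2, $10
         }'
}

# Example call:
#   smartctl -a /dev/sda | critical_attrs
```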

 

Meaning of the SMART values

The individual SMART attributes are explained below. Those that can indicate a hard disk defect are marked in red.

 

RAW_READ_ERROR_RATE

Number of read errors that have occurred so far. This value does not necessarily indicate a hard disk defect: manufacturers subject some hard disks to several additional tests before delivery, so the value can already be very large on a newly delivered disk, for example with Seagate hard disks, even though the disk is perfectly fine.

 

REALLOCATED_SECTOR_CT

Each hard disk has a certain number of spare sectors. If a sector becomes defective, the disk ideally replaces it with one of these spare sectors. If this value increases significantly, this can be a first sign of a hard disk defect.
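
Whether a value is increasing only becomes visible when two readings are compared. The sketch below assumes you have saved two smartctl reports to files at different times; the function names raw_of and attr_growth and the file names are hypothetical.

```shell
#!/bin/sh
# Extract the RAW_VALUE (column 10) of one attribute from a saved report.
raw_of() {   # usage: raw_of ATTRIBUTE_NAME FILE
    awk -v name="$1" '$2 == name { print $10 }' "$2"
}

# Compare an older and a newer saved report for one attribute.
attr_growth() {   # usage: attr_growth ATTRIBUTE_NAME OLD_FILE NEW_FILE
    old=$(raw_of "$1" "$2")
    new=$(raw_of "$1" "$3")
    if [ -z "$old" ] || [ -z "$new" ]; then
        echo "$1 not found in both reports"
        return 1
    fi
    if [ "$new" -gt "$old" ]; then
        echo "$1 increased from $old to $new"
    else
        echo "$1 did not increase ($old -> $new)"
    fi
}

# Example call:
#   smartctl -a /dev/sda > smart-today.txt
#   attr_growth Reallocated_Sector_Ct smart-lastweek.txt smart-today.txt
```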

 

SPIN_UP_TIME

Time in milliseconds until the hard disk reaches full rotation speed. A low normalized value (that is, a long spin-up time) indicates possible damage to the motor. This can also show up as slow loading times for programs.

 

START_STOP_COUNT

Number of times the hard disk motor has started.

 

SEEK_ERROR_RATE

This value describes the number of positioning errors of the read/write head of the hard disk. If it increases, this can be a sign of a gradually developing hard disk defect.

 

POWER_ON_HOURS

Total operating time of the hard disk so far, measured in hours. The unit depends on the manufacturer, however; some manufacturers measure the operating time in minutes. The value can become critical once the MTBF specified by the manufacturer is exceeded. Normally, however, the MTBF is a value for optimal conditions that is rarely reached in reality.
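
To get a feeling for the raw value, it can be converted into years. The sketch below assumes the raw value is in hours and a year has 365 days; the function name hours_to_years is hypothetical.

```shell
#!/bin/sh
# Convert a Power_On_Hours raw value into years of continuous operation.
# Assumes the disk reports hours (some manufacturers use other units).
hours_to_years() {   # usage: hours_to_years HOURS
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / (24 * 365) }'
}

# Example with the HDD shown above:
#   hours_to_years 47975    # prints 5.5
```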

 

SPIN_RETRY_COUNT

Number of repeated attempts to spin up the hard disk. If this value increases, this is also a sign of a gradually developing hard disk defect.

 

POWER_CYCLE_COUNT

Number of times the hard disk has been powered on. This value is normally very high for Green hard disks, since they reduce their speed when the disk is little used. However, no Green hard disks are installed in our server systems, since their life expectancy is too low under continuous operation.

 

WEAR_LEVELING_COUNT

This value only occurs with SSDs. It gives the number of erase cycles on a single memory block. If this value is very high, the SSD should be replaced.

 

PROGRAM_FAIL_CNT_TOTAL

Number of failed write (program) operations on the SSD. The value increases when writing to the SSD fails or when the flash memory of the SSD is exhausted.

 

ERASE_FAIL_COUNT_TOTAL

This value reflects the number of failed erase operations on the SSD.

 

HIGH_FLY_WRITES

Indicates how often the read/write head was outside its normal operating range. This value should always be 0; any other value also indicates a hard disk defect.

 

AIRFLOW_TEMPERATURE_CEL

Ambient temperature of the hard disk. This can be somewhat higher for our server systems, since in the data center environment the ambient temperature can be quite high due to the number of server systems. A value like (Lifetime Min/Max 32/46) is quite normal, depending on how many hard disks are mounted on top of each other in the case. If the value is significantly higher, this can indicate a fan defect in the case. This should then be checked by a technician.

 

REPORTED_UNCORRECT / HARDWARE_ECC_RECOVERED

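REPORTED_UNCORRECT is the number of UNC errors that could not be corrected by the error correction code (ECC). HARDWARE_ECC_RECOVERED is the number of errors that the ECC did correct.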

 

OFFLINE_UNCORRECTABLE

Number of bad sectors detected during an offline scan of the hard disk. Such scans start, for example, when the hard disk is idle, and are comparable to a self-test.

 

CURRENT_PENDING_SECTOR

Number of unstable sectors waiting to be re-read or reallocated. In the optimal case this value is also 0.

 

TEMPERATURE_CELSIUS

Temperature of the hard disk. Whether this value is critical depends on the type of hard disk, since some hard disks have a higher operating temperature and some a lower one. However, the operating temperature of the hard disk should not exceed 45 °C.
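
The 45 °C limit can be checked automatically. The sketch below assumes the raw value starts with the current temperature, as in the sample output above; the function name hot_disks is hypothetical, and the limit can be changed by passing another number.

```shell
#!/bin/sh
# Read a smartctl attribute table on stdin and print every temperature
# attribute whose raw reading (column 10) is above the limit.
# The default limit of 45 follows the recommendation above.
hot_disks() {   # usage: smartctl -A /dev/sda | hot_disks [LIMIT]
    awk -v limit="${1:-45}" \
        '$2 ~ /Temperature/ && $10 + 0 > limit + 0 { print $2, $10 }'
}
```

No output means that no temperature attribute is above the limit.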

 

BAD_BLOCK_FULL_FLAG

Indicates the number of bad blocks. In the optimal case this value is 0. How this value behaves on SSDs can depend on the manufacturer, however.

 

MEDIA_WEAROUT_INDICATOR

Indicates the remaining lifetime of the SSD.

 

TOTAL_LBAS_WRITTEN

Total number of 512 byte sectors written during the lifetime of the device.

 

TOTAL_LBAS_READ

Total number of 512 byte sectors read during the lifetime of the device.
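
Taking the 512-byte unit stated above as an assumption, the two LBA counters can be converted into bytes with simple arithmetic. The function name lbas_to_bytes is hypothetical, and the sample numbers are the raw values from the SSD output above.

```shell
#!/bin/sh
# Convert an LBA counter into bytes, assuming 512 bytes per LBA
# as described above (the unit may differ between manufacturers).
lbas_to_bytes() {   # usage: lbas_to_bytes LBA_COUNT
    echo $(( $1 * 512 ))
}

# Examples with the SSD values shown above:
#   lbas_to_bytes 3727    # Total_LBAs_Written -> 1908224
#   lbas_to_bytes 4102    # Total_LBAs_Read    -> 2100224
```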

 

UDMA_CRC_ERROR_COUNT

Number of transmission errors during data transfer. A high value indicates a cable or SATA controller defect.

 

MULTI_ZONE_ERROR_RATE

Number of errors when writing data to the hard disk. If the value increases, it may indicate a defect on the surface of the magnetic platter.

 

DATA_ADDRESS_MARK_ERRS

Reflects the number of incorrect or invalid data addresses on the disk. This value should also be 0.

 

Check hard disk on RAID controller

Some of our customers use a hardware RAID controller. In our new servers we exclusively install RAID controllers from Adaptec. With the Adaptec Storage Manager installed, you can read out the error log of the RAID controller with the command:
/usr/StorMan/arcconf GETLOGS 1 DEVICE

In the optimal case the result looks like this:

Controllers found: 1
<ControllerLog controllerID="0" type="0" time="1388047007" version="3" tableFull="false">
</ControllerLog>

Command completed successfully.
If the controller has logged errors, the result looks like this instead:
Controllers found: 1
<ControllerLog controllerID="0" type="0" time="1388047103" version="3" tableFull="false">
    <driveErrorEntry smartError="false" vendorID="Hitachi " serialNumber="MSK4215H1EUG7G" wwn="0000000000000000" deviceID="0" productID="HDS72105" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="32" mediumErrors="0" smartWarning="0" />
    <driveErrorEntry smartError="true" vendorID="" serialNumber="Z2A68BHK" wwn="0000000000000000" deviceID="0" productID="ST350041" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="0" smartWarning="0" />
    <driveErrorEntry smartError="false" vendorID="" serialNumber="9QM1C4VC" wwn="0000000000000000" deviceID="1" productID="ST350082" numParityErrors="0" linkFailures="4" hwErrors="0" abortedCmds="64" mediumErrors="0" smartWarning="0" />
</ControllerLog>

Command completed successfully.

 

In the optimal case, the values numParityErrors="0", linkFailures="0", hwErrors="0", abortedCmds="0", mediumErrors="0" and smartWarning="0" are all 0. For us, any value other than 0 is already a sign of a hard disk defect. Such a defect can cause the RAID array to become inconsistent, Linux to switch to read-only mode, or the RAID array to go offline during operation.

 

To avoid data loss, we therefore replace hard disks that show a value other than 0 in the controller log.
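
The check for non-zero counters can also be scripted. The sketch below assumes the log format shown above, with one driveErrorEntry element per line; the function name flag_drives is hypothetical, and the attribute names are taken from the sample log.

```shell
#!/bin/sh
# Read an arcconf device log on stdin and print, per drive, the serial
# number followed by every counter that is not 0 (or smartError="true").
# Drives with all counters at 0 produce no output.
flag_drives() {
    awk '
    # Return the value of attribute "name" on this line, or "" if absent.
    function attr(line, name,    re, s) {
        re = name "=\"[^\"]*\""
        if (match(line, re) == 0) return ""
        s = substr(line, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", s)   # drop everything up to the opening quote
        sub(/"$/, "", s)        # drop the closing quote
        return s
    }
    /<driveErrorEntry/ {
        bad = ""
        if (attr($0, "smartError") == "true") bad = bad " smartError=true"
        n = split("numParityErrors linkFailures hwErrors abortedCmds mediumErrors smartWarning", keys, " ")
        for (i = 1; i <= n; i++) {
            v = attr($0, keys[i])
            if (v != "" && v + 0 != 0) bad = bad " " keys[i] "=" v
        }
        if (bad != "") print attr($0, "serialNumber") ":" bad
    }'
}

# Example call:
#   /usr/StorMan/arcconf GETLOGS 1 DEVICE | flag_drives
```

The serial number is printed first because it is the value our technicians need to identify the disk.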

 

For a 3Ware RAID controller the commands look like this (tw_cli has to be installed first):

 

tw_cli info

 

Shows information about the RAID controller and which RAID array is configured.

 

tw_cli info c0

 

Shows the status of the RAID array and whether a disk has failed.

 

smartctl -a -d 3ware,0 /dev/twa0

 

Reads out the SMART values of the hard disk on port 0.

 

smartctl -a -d 3ware,1 /dev/twa0

 

Reads out the SMART values of the hard disk on port 1.
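
The two smartctl calls above differ only in the port number, so they can be generated in a loop. This is a sketch: the function name smart_3ware_ports and the SMARTCTL override variable are hypothetical, and /dev/twa0 is the example device used above.

```shell
#!/bin/sh
# Print the SMART report for ports 0 .. NUMBER_OF_PORTS-1 of a 3Ware controller.
# SMARTCTL is a hypothetical override (not part of smartmontools) that
# lets you point the function at another binary or a test stub.
smart_3ware_ports() {   # usage: smart_3ware_ports NUMBER_OF_PORTS [DEVICE]
    ports=$1
    ctrl=${2:-/dev/twa0}
    port=0
    while [ "$port" -lt "$ports" ]; do
        echo "=== port $port ==="
        "${SMARTCTL:-smartctl}" -a -d "3ware,$port" "$ctrl"
        port=$((port + 1))
    done
}

# Example call for a controller with two disks:
#   smart_3ware_ports 2
```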

 

Have a hard disk exchanged

If one of your hard disks has a defect, please send us the serial number of either the defective or the intact hard disk, so that our technicians can replace the correct disk free of charge. For RAID controllers, the serial number is already listed in the controller log; for a software RAID you can find it with the command:
smartctl -i /dev/sda
If you do not know how to restore a software RAID after replacing the hard disk, we will be happy to assist you in the recovery process.
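
To pull only the serial number out of the smartctl -i output, a small filter is enough. The sketch assumes the output contains a line starting with "Serial Number:"; the function name disk_serial is hypothetical.

```shell
#!/bin/sh
# Read `smartctl -i` output on stdin and print only the serial number.
disk_serial() {   # usage: smartctl -i /dev/sda | disk_serial
    awk -F': *' '/^Serial Number:/ { print $2 }'
}
```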

 
