Unlike other server providers on the market, we replace hard disks as soon as they show signs of a possible defect. We do not wait until a hard disk damages a software RAID array or fails completely before a technician exchanges it. We therefore recommend that our customers regularly check whether the hard disks in their RAID array show conspicuous SMART values.
Reading SMART values of a hard disk
Reading the SMART values of a hard disk is very easy under Linux. You only have to boot the server into the SysrescueCD boot image using our boot mode, or install the smartmontools package on Debian / Ubuntu. The command for the installation is:
apt-get install smartmontools
Once smartmontools is installed, you can use the command:
smartctl -a /dev/sda
to read out the SMART values of the corresponding hard disk. Depending on how many hard disks are installed in the server, replace sda with sdb, sdc, and so on.
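If the server has several hard disks, the same command can be run in a loop. The following is only a minimal sketch: the function name smart_report_all and the SMARTCTL override variable are made up for this example and are not part of smartmontools.

```shell
# Sketch: run "smartctl -a" for every device passed as an argument.
# SMARTCTL is a hypothetical override variable (an assumption of this
# example) so that the command can be swapped out for a dry run.
SMARTCTL="${SMARTCTL:-smartctl}"

smart_report_all() {
    for dev in "$@"; do
        echo "=== $dev ==="
        "$SMARTCTL" -a "$dev"
    done
}

# Example call (needs root privileges):
#   smart_report_all /dev/sda /dev/sdb
```

Passing the device names explicitly keeps the loop from touching devices that are not hard disks.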
The output could then look like this:
SSD
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   ---   ---   ---    Pre-fail  Always       -       14597175146359
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0032   ---   ---   ---    Old_age   Always       -       5861434915543908352
 12 Power_Cycle_Count       0x0032   ---   ---   ---    Old_age   Always       -       353380
171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       5
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       3
181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always       -       30 (Lifetime Min/Max 30/30)
195 Program_Failure_Blk_Ct  0x001c   ---   ---   ---    Old_age   Offline      -       14597175146616
196 Erase_Failure_Blk_Ct    0x0033   ---   ---   ---    Pre-fail  Always       -       91236
201 Write_Commands_Tot_Ct   0x001c   ---   ---   ---    Old_age   Offline      -       14597175146616
204 Bad_Block_Full_Flag     0x001c   ---   ---   ---    Old_age   Offline      -       14597175146616
230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       2721
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       3727
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       3727
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       4102
HDD
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   095   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail  Always       -       449237574
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       47975
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       41
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   034   045    Old_age   Always   In_the_past 37 (Lifetime Min/Max 32/46)
194 Temperature_Celsius     0x0022   037   066   000    Old_age   Always       -       37 (0 21 0 0)
195 Hardware_ECC_Recovered  0x001a   056   053   000    Old_age   Always       -       19103120
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
Meaning of the SMART values
Below you will find the meaning of the individual SMART attributes. The descriptions point out which values can indicate a hard disk defect.
RAW_READ_ERROR_RATE
Number of read errors that have occurred so far. However, this value does not necessarily indicate a hard disk defect. Hard disk manufacturers subject some hard disks to several additional tests before delivery, and the raw value is encoded in a manufacturer-specific way. Seagate hard disks, for example, can show a very large number straight from delivery although the hard disk is completely in order.
REALLOCATED_SECTOR_CT
Each hard disk has a certain number of spare sectors. If a sector of the hard disk becomes defective, the disk ideally remaps it to one of these spare sectors, and this counter increases. If this value increases significantly, this can be a first sign of a hard disk defect.
SPIN_UP_TIME
Time in milliseconds until the hard disk reaches full rotation speed. A low normalized value (that is, a long spin-up time) indicates possible damage to the motor. This can also show up as slow loading times for programs.
START_STOP_COUNT
Number of times the hard disk motor has started.
SEEK_ERROR_RATE
This value describes the number of positioning errors of the read/write head of the hard disk. If this value increases, this can be a sign of a creeping hard disk defect.
POWER_ON_HOURS
Total operating time of the hard disk so far, measured in hours. The unit depends on the manufacturer, however; some manufacturers measure the operating time in minutes. The value can become critical if the MTBF specified by the manufacturer is exceeded. Normally, however, the MTBF is a figure for optimal conditions that is rarely reached in reality.
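To get a feeling for this value, the following sketch converts the Power_On_Hours raw value from the HDD sample output (47975) into years. The function name hours_to_years is made up for this example, and it assumes that the drive really counts hours rather than minutes.

```shell
# Sketch: convert a Power_On_Hours raw value into years of operation.
# Assumption: the raw value is in hours (1 year = 24 * 365 = 8760 hours).
hours_to_years() {
    LC_ALL=C awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 8760 }'
}

hours_to_years 47975   # raw value taken from the HDD sample output
```

For the sample value this comes to roughly five and a half years of continuous operation.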
SPIN_RETRY_COUNT
Number of repeated attempts the hard disk needed to spin up. If this value increases, this is also a sign of a creeping hard disk defect.
POWER_CYCLE_COUNT
Number of times the hard disk has been powered on. This value is normally very high for Green hard disks, since these reduce their speed when the hard disk is rarely in use. However, no Green hard disks are installed in our server systems, since the life expectancy of such a hard disk is too low under continuous operation.
WEAR_LEVELING_COUNT
This value only occurs with SSDs. It gives the number of erase cycles on a single memory block. If this value is very high, the SSD should also be replaced.
PROGRAM_FAIL_CNT_TOTAL
Number of failed write (program) operations on the SSD. The value increases when writing to the SSD fails or when the flash memory of the SSD is exhausted.
ERASE_FAIL_COUNT_TOTAL
This value reflects the number of failed erase operations on the SSD.
HIGH_FLY_WRITES
Counts how often the read/write head was outside its normal operating range while writing. This value should always be 0. Otherwise it also indicates a hard disk defect.
AIRFLOW_TEMPERATURE_CEL
Ambient temperature of the hard disk. This can be somewhat higher for our server systems, since in the data center environment the ambient temperature can be quite high due to the number of server systems. A value like (Lifetime Min/Max 32/46) is quite normal, depending on how many hard disks are mounted on top of each other in the case. If the value is significantly higher, this can indicate a fan defect in the case. This should then be checked by a technician.
REPORTED_UNCORRECT / HARDWARE_ECC_RECOVERED
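Reported_Uncorrect is the number of UNC errors, that is, errors that could not be corrected by the error correction code. Hardware_ECC_Recovered is the number of errors that the error correction code was able to correct.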
OFFLINE_UNCORRECTABLE
Number of bad sectors detected during an offline scan of the hard disk. These scans run, for example, when the hard disk is idle, comparable to a self-test.
CURRENT_PENDING_SECTOR
The number of unstable sectors waiting to be rechecked or reallocated. In the optimal case this value is also 0.
TEMPERATURE_CELSIUS
Temperature of the hard disk. Whether this value is critical or not depends on the type of hard disk, since some hard disks have a higher operating temperature and some a lower one. However, the operating temperature of the hard disk should not exceed 45 degrees Celsius.
BAD_BLOCK_FULL_FLAG
Indicates the number of bad blocks. In the optimal case this value is 0. How this value behaves on SSDs can depend on the manufacturer, however.
MEDIA_WEAROUT_INDICATOR
Indicates the remaining lifetime of the SSD hard disk.
TOTAL_LBAS_WRITTEN
Total number of 512 byte sectors written during the lifetime of the device.
TOTAL_LBAS_READ
Total number of 512 byte sectors read during the lifetime of the device.
UDMA_CRC_ERROR_COUNT
Number of transmission errors during data transfer. A high value indicates a cable or SATA controller defect.
MULTI_ZONE_ERROR_RATE
Number of errors when writing data to the hard disk. If the value increases, it may indicate a defect on the surface of the magnetic disk.
DATA_ADDRESS_MARK_ERRS
Reflects the number of incorrect or invalid data addresses on the disk. This value should also be 0.
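The attributes described above as "should be 0" can be filtered out of the smartctl output automatically. The following is a minimal sketch, not a complete health check: the function name flag_nonzero_attributes and the choice of watched attributes are assumptions made for this example, and the filter expects the attribute table in the column layout shown in the sample output.

```shell
# Sketch: print watched attributes whose raw value is not 0.
# Reads the attribute table (e.g. from "smartctl -A /dev/sda") on stdin.
# The watch list is an assumption for illustration; adjust it as needed.
flag_nonzero_attributes() {
    awk '
        BEGIN {
            watch["Reallocated_Sector_Ct"] = 1
            watch["Current_Pending_Sector"] = 1
            watch["Offline_Uncorrectable"] = 1
            watch["High_Fly_Writes"] = 1
            watch["Data_Address_Mark_Errs"] = 1
        }
        # Attribute rows start with a numeric ID. Field 2 is the attribute
        # name, field 10 is the first word of the RAW_VALUE column.
        $1 ~ /^[0-9]+$/ && ($2 in watch) && ($10 + 0 != 0) {
            printf "%s raw=%s\n", $2, $10
        }
    '
}

# Example call (needs root privileges):
#   smartctl -A /dev/sda | flag_nonzero_attributes
```

Applied to the SSD sample output above, this would report Reallocated_Sector_Ct with a raw value of 1. No output means that none of the watched attributes deviates from 0.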
Check hard disk on RAID controller
Some of our customers use a hardware RAID controller. In our new servers we exclusively install RAID controllers from Adaptec. With the Adaptec Storage Manager installed, you can use the command:
/usr/StorMan/arcconf GETLOGS 1 DEVICE
to read out the error memory of the RAID controller. In the optimal case the result looks like this:
Controllers found: 1
<ControllerLog controllerID="0" type="0" time="1388047007" version="3" tableFull="false">
</ControllerLog>
Command completed successfully.
If errors have been logged for one or more hard disks, the result looks like this instead:
Controllers found: 1
<ControllerLog controllerID="0" type="0" time="1388047103" version="3" tableFull="false">
<driveErrorEntry smartError="false" vendorID="Hitachi " serialNumber="MSK4215H1EUG7G" wwn="0000000000000000" deviceID="0" productID="HDS72105" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="32" mediumErrors="0" smartWarning="0" />
<driveErrorEntry smartError="true" vendorID="" serialNumber="Z2A68BHK" wwn="0000000000000000" deviceID="0" productID="ST350041" numParityErrors="0" linkFailures="0" hwErrors="0" abortedCmds="0" mediumErrors="0" smartWarning="0" />
<driveErrorEntry smartError="false" vendorID="" serialNumber="9QM1C4VC" wwn="0000000000000000" deviceID="1" productID="ST350082" numParityErrors="0" linkFailures="4" hwErrors="0" abortedCmds="64" mediumErrors="0" smartWarning="0" />
</ControllerLog>
Command completed successfully.
The values numParityErrors="0", linkFailures="0", hwErrors="0", abortedCmds="0", mediumErrors="0" and smartWarning="0" should all be 0 in the optimal case. If a value deviates from 0 here, we already regard this as a sign of a hard disk defect. Such a defect can result in the RAID array becoming inconsistent, Linux switching to read-only mode, or the RAID array going offline during operation.
To avoid data loss, we therefore exchange hard disks as soon as they show a value other than 0 in the controller log.
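The check for values other than 0 can also be scripted. The following is a minimal sketch under some assumptions: the function name flag_drive_errors is made up for this example, the log is expected in the format shown above, and treating smartError="true" as a finding is an addition of this sketch. It relies on grep -o, which GNU and BSD grep provide.

```shell
# Sketch: list drives whose error counters in the controller log are not 0.
# Reads the log (e.g. from "arcconf GETLOGS 1 DEVICE") on stdin.
flag_drive_errors() {
    grep -o '<driveErrorEntry[^>]*>' | awk '
        # Return the value of the XML attribute "name" in the current line,
        # or an empty string if the attribute is missing.
        function attr(name,    pat, s) {
            pat = " " name "=\"[^\"]*\""
            if (!match($0, pat)) return ""
            s = substr($0, RSTART, RLENGTH)
            sub(/^[^"]*"/, "", s)
            sub(/"$/, "", s)
            return s
        }
        {
            n = split("numParityErrors linkFailures hwErrors abortedCmds mediumErrors smartWarning", keys, " ")
            bad = ""
            for (i = 1; i <= n; i++) {
                v = attr(keys[i])
                if (v + 0 != 0) bad = bad " " keys[i] "=" v
            }
            # Assumption of this sketch: also report smartError="true".
            if (attr("smartError") == "true") bad = bad " smartError=true"
            if (bad != "") print attr("serialNumber") ":" bad
        }
    '
}

# Example call:
#   /usr/StorMan/arcconf GETLOGS 1 DEVICE | flag_drive_errors
```

Each output line starts with the serial number of the affected hard disk, followed by the counters that deviate from 0. An empty result corresponds to the optimal case.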
For a 3Ware RAID controller the commands look like this (tw_cli has to be installed first):
tw_cli info
Shows information about the RAID controller and which RAID array is configured.
tw_cli info c0
Shows the status of the RAID array and whether a disk has failed.
smartctl -a -d 3ware,0 /dev/twa0
Reads out the SMART values of the hard disk on port 0.
smartctl -a -d 3ware,1 /dev/twa0
Reads out the SMART values of the hard disk on port 1.
Have a hard disk exchanged
If one of your hard disks has a defect, please tell us the serial number of the defective hard disk, or alternatively of the hard disk that is still intact, so that our technicians can identify the right disk and exchange it free of charge. For RAID controllers the serial number of the hard disk is already listed in the log; for a software RAID you can find it out with the command:
smartctl -i /dev/sda
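If only the serial number is needed, it can be filtered out of this output. The following is a minimal sketch; the function name serial_from_info is made up for this example, and it assumes that the output contains a line beginning with "Serial Number:" (or "Serial number:").

```shell
# Sketch: print only the serial number from "smartctl -i" output on stdin.
# Assumption: the output has a line such as "Serial Number:    XXXXXXXX".
serial_from_info() {
    awk -F': *' '/^Serial [Nn]umber:/ { print $2 }'
}

# Example call (needs root privileges):
#   smartctl -i /dev/sda | serial_from_info
```

The printed value can then be passed on in the exchange request.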
If you do not know how to restore a software RAID after replacing the hard disk, we will be happy to assist you in the recovery process.