A RAID's ability to survive failures of its hard drives
relies on two things:
- Built-in redundancy of the storage. The RAID has some extra space in it,
and the end-user capacity of a fault-tolerant array is always less than
the combined capacity of its member disks.
- Diligence of people to provide additional redundant storage should the built-in
reserve fail.
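To make the capacity cost of built-in redundancy concrete, here is a minimal sketch that computes the usable capacity of an array of identical disks. The RAID levels and the function name `usable_capacity` are assumptions chosen for illustration; the text above does not name specific levels.

```python
def usable_capacity(level: str, disks: int, disk_size_tb: float) -> float:
    """Return the end-user capacity (in TB) of an array of identical disks.

    Hypothetical helper for illustration; covers only a few common levels.
    """
    if level == "RAID1":
        if disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        # Every disk holds a full copy, so one disk's worth is usable.
        return disk_size_tb
    if level == "RAID5":
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        # One disk's worth of space is spent on parity.
        return (disks - 1) * disk_size_tb
    if level == "RAID6":
        if disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        # Two disks' worth of space is spent on parity.
        return (disks - 2) * disk_size_tb
    if level == "RAID10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID10 needs an even number of disks, 4 or more")
        # Half of the disks mirror the other half.
        return (disks // 2) * disk_size_tb
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    combined = 4 * 4.0
    usable = usable_capacity("RAID5", 4, 4.0)
    print(f"combined: {combined} TB, usable: {usable} TB")
```

With four 4 TB disks, for example, a RAID5 layout leaves three disks' worth of space to the user and spends the fourth on redundancy.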
Once redundancy is lost because of the first drive failure,
human intervention is needed to correct the problem and
restore the required level of redundancy.
Redundancy must be restored quickly; otherwise there is no point
in having the RAID at all.
However, you cannot know when to act unless you monitor the array and the disks
often enough.
- Regularly check the SMART status on the drives using appropriate software.
With a software RAID, use SpeedFan or HDDLife.
With a hardware RAID, use the vendor-supplied monitoring software.
- Use so-called "scrubbing" whenever possible.
Scrubbing reads all the data on the array during idle periods or
on a predefined schedule. This makes it possible to discover newly developed
bad sectors on the drives before they are encountered in actual use.
The data can then be relocated away from the unreliable spots, or the disk can be replaced.
- Any unexplained drop in throughput may indicate a problem
with one of the hard drives. With certain systems
(like a QNAP NAS based on Linux MD RAID), an imminent disk failure may manifest
itself as the unit "stalling" long before the drive is declared dead
(see QNAP story).
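On a Linux MD system, the first two checks in the list above could be scripted roughly as follows. This is a sketch under stated assumptions, not a replacement for the tools named above: it assumes the `smartctl` utility from smartmontools is used to obtain the health text (via `smartctl -H <device>`), and that the array is exposed through the MD driver's sysfs file `sync_action`. The function names and the `sysfs_root` parameter are hypothetical and exist only for this example.

```python
import re
from pathlib import Path


def smart_health_ok(smartctl_output: str) -> bool:
    """Interpret the text printed by `smartctl -H <device>`.

    ATA drives print "... self-assessment test result: PASSED";
    SCSI/SAS drives print "SMART Health Status: OK".
    """
    match = re.search(
        r"(?:self-assessment test result|SMART Health Status):\s*(\S+)",
        smartctl_output,
    )
    if match is None:
        raise ValueError("no SMART health line found in the output")
    return match.group(1) in ("PASSED", "OK")


def request_scrub(array: str, sysfs_root: str = "/sys/block") -> Path:
    """Ask the Linux MD driver to read-check the whole array.

    Writing "check" to <sysfs_root>/<array>/md/sync_action starts the scan.
    On a real system this needs root privileges.
    """
    action_file = Path(sysfs_root) / array / "md" / "sync_action"
    action_file.write_text("check\n")
    return action_file


if __name__ == "__main__":
    sample = "SMART overall-health self-assessment test result: PASSED\n"
    print("healthy" if smart_health_ok(sample) else "replace the disk")
```

Running such a script from a scheduler (for example, a SMART check daily and a scrub weekly) is one way to make sure the array is looked at often enough; the intervals are a judgment call and depend on the array size and load.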