This section is about life with a software RAID system, that is, communicating with the arrays and tinkering with them.

Note that when it comes to manipulating md devices, you should always remember that you are working with entire filesystems. So, although there could be some redundancy to keep your files alive, you must proceed with caution.

Firstly: mdadm has an excellent 'monitor' mode which will send an email when a problem is detected in any array (more about that later).

Of course, the standard log and stat files will record more details about a drive failure.

/var/log/messages always fills screens with tons of error messages, no matter what happened. But when it comes to a disk crash, huge numbers of kernel errors are reported:

    kernel: scsi0 channel 0 : resetting for second half of retries.
    kernel: SCSI bus is being reset for host 0 channel 0.
    kernel: scsi0: Sending Bus Device Reset CCB #2666 to Target 0
    kernel: scsi0: Bus Device Reset CCB #2666 to Target 0 Completed
    kernel: scsi : aborting command due to timeout : pid 2649, scsi0, channel 0, id 0, lun 0 Write (6) 18 33 11 24 00
    kernel: scsi0: Aborting CCB #2669 to Target 0
    kernel: SCSI host 0 channel 0 reset (pid 2644) timed out - trying harder
    kernel: scsi0: CCB #2669 to Target 0 Aborted
    kernel: scsi0: Resetting BusLogic BT-958 due to Target 0
    kernel: scsi0: *** BusLogic BT-958 Initialized Successfully ***

Most often, disk failures look like these,

    kernel: sidisk I/O error: dev 08:01, sector 1590410
    kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
    kernel: hde: read_intr: error=0x10

Your array needs redundancy to be able to survive a disk failure.

If you want to simulate a drive failure, you can just unplug the drive. If your hardware does not support disk hot-unplugging, you should do this with the power off. (If you are interested in testing whether your data can survive with one disk fewer than the usual number, there is no point in being a hot-plug cowboy here. Take the system down, unplug the disk, and boot it up again.)

Look in the syslog, and look at /proc/mdstat to see how the RAID is doing. Did it work? Did you get an email from the mdadm monitor?

Faulty disks should appear marked with an (F) if you look at /proc/mdstat. Also, users of mdadm should see the device state as faulty.

When you've re-connected the disk again (with the power off, of course, remember), you can add the "new" device to the RAID again with the mdadm --add command.

You can also simulate a drive failure without unplugging things. Running the command

    mdadm --manage --set-faulty /dev/md1 /dev/sdc2

should be enough to fail the disk /dev/sdc2 of the array /dev/md1.

You should see something like the first line below in your system's log. Something like the second line will appear if you have spare disks configured.

    kernel: raid1: Disk failure on sdc2, disabling device.
    kernel: md1: resyncing spare disk sdb7 to replace failed disk

Checking /proc/mdstat will show the degraded array. If there was a spare disk available, reconstruction should have started.

Now you've seen how it goes when a device fails. Let's fix things up.

First, we will remove the failed disk from the array. Note that mdadm cannot pull a disk out of a running array. For obvious reasons, only faulty disks can be hot-removed from an array (even stopping and unmounting the device won't help - if you ever want to remove a 'good' disk, you have to tell the array to put it into the failed state first).

Now we have a /dev/md1 which has just lost a device.
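Faulty members are marked with an (F) in /proc/mdstat, so the check can be scripted as well as eyeballed. What follows is only a minimal sketch: the function name list_faulty and the sample text are assumptions made up for illustration, not part of mdadm, and the sample is not output captured from a real machine. The function reads mdstat-style text on standard input, which lets you try it on saved text before pointing it at a live system.

```shell
#!/bin/sh
# list_faulty (hypothetical helper): print "<array> <device>" for every
# member marked (F) in mdstat-style text read from standard input.
list_faulty() {
    awk '
        /^md[0-9]+ :/ {
            array = $1
            for (i = 2; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    # Strip the "[n](F)" suffix, leaving the bare device name.
                    sub(/\[[0-9]+\].*$/, "", dev)
                    print array, dev
                }
            }
        }
    '
}

# Hypothetical sample text, written by hand in the usual mdstat layout.
sample='Personalities : [raid1]
md1 : active raid1 sdb7[2] sdc2[1](F) sda2[0]
      32000 blocks [2/1] [U_]
unused devices: <none>'

printf '%s\n' "$sample" | list_faulty    # prints: md1 sdc2
```

On a running system the same function would be fed the real file, as in list_faulty < /proc/mdstat. It only reports what the kernel has already flagged. It does not replace mdadm's monitor mode.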