Browny's Grotto :: http://www.clan-elite.info/

Replacing a faulty drive in an mdadm managed RAID5 array

Taken from elsewhere for my own notes:

1. Use "mdadm --manage /dev/md0 -r /dev/sdd2" to remove the member that was marked as faulty from the array. Name the member partition as it appears in /proc/mdstat (here sdd2), not the whole disk.

2. Power down and replace the drive with a good drive.

3. Power up and set the partition table on the new drive to match that of the other drives in the array. Here we used "sfdisk -d /dev/sda | sfdisk /dev/sdd".

4. Add the matching partition on the new drive to the array: "mdadm --manage /dev/md0 -a /dev/sdd2".

5. Sit back and wait for the recovery to happen. You can "cat /proc/mdstat" to watch its progress; you should see something like:

Bash Script:
Personalities : [raid5]
md0 : active raid5 sdd2[4] sdc2[2] sdb2[1] sda2[0]
731985408 blocks level 5, 256k chunk, algorithm 2 [4/3] [UUU_]
[===>.................] recovery = 19.7% (48253056/243995136) finish=59.1min speed=55184K/sec
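If you'd rather script the waiting than eyeball it, something like the following could pull the numbers out of that output. This is only a sketch: the function names are made up for these notes, and it assumes the mdstat layout shown above (a "recovery = NN.N%" field and a member-status field such as [UUU_]).

```shell
# Print the recovery percentage from mdstat-style text on stdin.
# Prints nothing if there is no "recovery = NN.N%" field.
mdstat_recovery_percent() {
    sed -n 's/.*recovery = *\([0-9.]*\)%.*/\1/p'
}

# Exit 0 if the member-status field (e.g. [UUU_]) contains an underscore,
# which marks a member that is missing or still rebuilding.
mdstat_is_degraded() {
    grep -q '\[[U_]*_[U_]*\]'
}

# On a live Linux system you would feed it the real file:
#   mdstat_recovery_percent < /proc/mdstat
```

Both functions read stdin, so they can be tried against a saved copy of the output before pointing them at /proc/mdstat.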
To identify the failed drive, so that the right physical disk gets pulled, list the identification details of each drive:
Bash Script:
emerge --quiet sdparm
sdparm -i /dev/sd*
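An alternative to sdparm, not from the original notes: on most udev-based Linux systems /dev/disk/by-id holds symlinks named after each drive's model and serial number, which can be matched against the sticker on the disk. A rough sketch, where list_disk_ids is a made-up helper name:

```shell
# Print "link-name -> device" for every symlink in a directory.
# Defaults to /dev/disk/by-id (assumed to be populated by udev).
list_disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue          # skip anything that is not a symlink
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
```

Run it with no argument on the RAID box and look for the line that points at the faulty device.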
If you want to take the opportunity of a failed drive to increase the size of your array, refer to this:

http://linux-raid.osdl.org/index.php/Growing
Another backup of a useful thread.

How to back up your hard drive (the type of format doesn't matter) using dd.

Boot into rescue mode from the install media (generally "linux rescue"); otherwise enter rescue mode manually: Linux Recovery

Make sure you have not booted from the hard drive you are backing up, and that none of its partitions are mounted.

Now use any combination of dd, ssh or rsh, gzip or bzip2 to back up the drive (I recommend ssh over rsh; however, ssh is generally not available in rescue mode, whereas rsh is):

You can back up the whole drive (if you have enough space on your destination system) as follows (this method also grabs the MBR):

Bash Script:
dd if=/dev/sda | rsh user@dest "gzip -9 >20030220-backup-sda.dd.gz"


Or, if bandwidth is short, compress before sending:
Bash Script:
dd if=/dev/sda | gzip -c9 | rsh remuser@remhost 'cat >whateveryoulike.gz'


A restore using this method would be as follows:

Bash Script:
rsh user@dest "cat 20030220-backup-sda.dd.gz | gunzip" | dd of=/dev/sda
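Before trusting this on a real disk, it may be worth rehearsing the same dd and gzip pipeline on a scratch file. The sketch below does that locally (no rsh), restoring to a temporary file and comparing it with the original; roundtrip_ok is a hypothetical name, not something from the original thread.

```shell
# Push a source through dd | gzip, restore it to a temporary file,
# and compare the result with the original.
# Exit status 0 means the restored copy is byte-identical.
roundtrip_ok() {
    src=$1
    work=$(mktemp -d) || return 1
    dd if="$src" 2>/dev/null | gzip -c9 > "$work/image.dd.gz"
    gunzip -c "$work/image.dd.gz" | dd of="$work/restored.img" 2>/dev/null
    cmp -s "$src" "$work/restored.img"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

For example, "roundtrip_ok /path/to/some/test.img && echo identical" on any ordinary file.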


To back up individual partitions, be sure to grab the MBR, because it contains the partition table, as well as any partitions you want to back up:

Bash Script:
dd if=/dev/sda bs=512 count=1 | rsh user@dest "cat - > 20030220-backup-mbr.dd"
dd if=/dev/sda1 | rsh user@dest "gzip -9 > 20030220-backup-sda1.dd.gz"
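A quick sanity check for the saved MBR file: a valid MBR sector ends with the boot signature bytes 0x55 0xAA at offsets 510 and 511. The helper below (has_mbr_signature is a made-up name) is just a sketch of that check; it says nothing about whether the partition table itself is sensible.

```shell
# Succeed if the file's first sector ends with the MBR boot signature
# (bytes 0x55 0xAA at offsets 510 and 511).
has_mbr_signature() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

It would be run on the destination machine against the 512-byte backup file, e.g. "has_mbr_signature 20030220-backup-mbr.dd && echo looks like an MBR".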


A restore goes as follows. Be sure to restore the MBR, reboot, then restore the other partitions.

Bash Script:
rsh user@dest "cat 20030220-backup-mbr.dd" | dd of=/dev/sda
# reboot to re-read the partition table (come back into rescue mode)
rsh user@dest "cat 20030220-backup-sda1.dd.gz | gunzip" | dd of=/dev/sda1


Where you do the compression depends on which machine is faster and how fast your network is. You can compress before sending over the network, but if this machine is much slower than the server you are sending to, it may be better to send the uncompressed data and compress it as it arrives at the destination. Keep in mind that uncompressed data takes longer to transfer than compressed data, and that the network speed matters too (10 Mbit vs. 100 Mbit). Use your best judgement.
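One rough way to put numbers on that judgement: in a streamed pipeline, compressing locally only helps if the local gzip can consume data faster than the network could carry it raw. The sketch below encodes just that rule. The function name and the sample throughput figures are hypothetical, and it assumes the destination machine is never the bottleneck.

```shell
# Decide where to run gzip for a streamed backup.
#   $1  local compression throughput, kB/s of input consumed (measure it yourself)
#   $2  usable network throughput, kB/s
# Prints "local" when the sender compresses faster than the link carries
# raw data, otherwise "remote".
where_to_compress() {
    if [ "$1" -gt "$2" ]; then
        echo local
    else
        echo remote
    fi
}

where_to_compress 5000 1250     # 10 Mbit link (about 1250 kB/s)
where_to_compress 5000 12500    # 100 Mbit link (about 12500 kB/s)
```

With a sender that can gzip about 5000 kB/s, the 10 Mbit case favours compressing before sending and the 100 Mbit case favours compressing on arrival.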
Following the instructions in one of the links above, I managed to render parts of my RAID5 partition unreadable after growing it:
Bash Script:
mdadm /dev/md0 --grow --size=max
After doing this I would receive the following errors when trying to check and expand the filesystem:
Bash Script:
e2fsck 1.39 (29-May-2006)
The filesystem size (according to the superblock) is 366285952 blocks
The physical size of the device is 195697024 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort<y>? yes
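To compare e2fsck's block counts with the KiB figures mdadm prints, they need converting. The sketch below assumes a 4 KiB filesystem block size, which the output above does not actually state; blocks_to_kib is a made-up helper.

```shell
# Convert a block count to KiB for a given block size in bytes.
blocks_to_kib() {
    echo $(( $1 * $2 / 1024 ))
}

blocks_to_kib 195697024 4096    # physical size of the device, per e2fsck
blocks_to_kib 366285952 4096    # filesystem size, per the superblock
```

Under that assumption the physical size works out to 782788096 KiB, the same figure mdadm reports as "Array Size" in the detail output further down, while the superblock expects 1465143808 KiB.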
Bash Script:
ls /mnt/raid/
ls: cannot access /mnt/raid/old_drives: Input/output error
ls: cannot access /mnt/raid/docs: Input/output error
ls: cannot access /mnt/raid/iso+vm: Input/output error
ls: cannot access /mnt/raid/music: Input/output error
...
(some files and folders still accessible)
Bash Script:
e2fsck -f /dev/md0

...

Error reading block 205980086 (Invalid argument) while doing inode scan.  Ignore error<y>? yes

Force rewrite<y>? 

/dev/md0: e2fsck canceled.

/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****

/dev/md0: ********** WARNING: Filesystem still has errors **********
The solution is to use mdadm --grow again to return the array to its original size, then back everything up, then recreate the array.

Step #1, find out how big the array should be:
Bash Script:
morgul ~ # cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 sda1[0] sdc1[2] sdb1[1]
      2930271744 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>
morgul ~ # mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Sep 11 02:00:29 2006
     Raid Level : raid5
     Array Size : 782788096 (746.52 GiB 801.58 GB)
  Used Dev Size : 1465135872 (1397.26 GiB 1500.30 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Dec 24 15:43:29 2008
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

           UUID : e2c34576:3fcb9849:534480be:98638fa7
         Events : 0.15459491

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
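Step #2, grow (or shrink) the array back to the size specified by "Array Size". Note that --size sets the amount of space used on each member device, not the total size of the array. Be careful! The change should happen instantaneously: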
Bash Script:
morgul ~ # mdadm /dev/md0 --grow --size=782788096
morgul ~ # cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 sda1[0] sdc1[2] sdb1[1]
      1565576192 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>
I was then able to mount the RAID without error to back up the data.

Weirdly, once it was fixed I could see that the mdadm detail report for the problem array had "backwards" numbers for Array Size and Used Dev Size. Here is the output of the "healthy" array, with the "unhealthy" one already printed above:
Bash Script:
morgul ~ # mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Sep 11 02:00:29 2006
     Raid Level : raid5
     Array Size : 1565576192 (1493.05 GiB 1603.15 GB)
  Used Dev Size : 782788096 (746.52 GiB 801.58 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Dec 24 16:06:53 2008
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

           UUID : e2c34576:3fcb9849:534480be:98638fa7
         Events : 0.15459622

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
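The numbers in the healthy report can be cross-checked by hand: a RAID5 array gives up one member's worth of space to parity, so the usable size is (devices - 1) x Used Dev Size. A small sketch of that arithmetic, with raid5_array_size being a made-up helper name:

```shell
# Usable RAID5 capacity: one member's worth of space holds parity.
#   $1  number of raid devices
#   $2  used size per device, in KiB
raid5_array_size() {
    echo $(( ($1 - 1) * $2 ))
}

raid5_array_size 3 782788096     # healthy report
raid5_array_size 3 1465135872    # Used Dev Size from the unhealthy report
```

The first call gives 1565576192, matching the healthy Array Size. The second gives 2930271744, which is the block count /proc/mdstat showed before the fix, so the unhealthy array really was running with the larger per-device size.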