How do I recover a volume after the repair process silently failed?


All Replies

  • Mijzelf
    Mijzelf Posts: 2,598  Guru Member
    Maybe. Did you reboot already? You can check whether the new headers describe the same array type, offset, and blocksize as before:
    mdadm --examine /dev/sd[abcd]3

    Further, don't trust the firmware. Have a look at the kernel's view of the array:
    cat /proc/mdstat
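    If you want to compare the relevant fields of the four headers at a glance, a simple grep over the examine output should do (the field names below are the ones mdadm prints; adjust the pattern if your version labels them differently):
    mdadm --examine /dev/sd[abcd]3 | grep -E '/dev/sd|Array UUID|Raid Level|Data Offset|Chunk Size|Events|Array State'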

  • bjorn
    bjorn Posts: 8  Freshman Member
    Oops! My apologies, Mijzelf. Following the reboot, here are the mdadm outputs. I'm partially back up with the degraded volume:
    ~ # mdadm --examine /dev/sda3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
               Name : NAS540:2  (local to host NAS540)
      Creation Time : Tue Dec  1 11:31:20 2020
         Raid Level : raid5
       Raid Devices : 4
    
     Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
         Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
      Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 5b44e4bd:e37142b2:23d26a4d:cb281462
    
        Update Time : Wed Dec  2 13:18:45 2020
           Checksum : 61fc7575 - correct
             Events : 122
    
             Layout : left-symmetric
         Chunk Size : 64K
    
       Device Role : Active device 0
       Array State : A.AA ('A' == active, '.' == missing)
    ~ # mdadm --examine /dev/sdb3
    /dev/sdb3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : ad82b6f7:6aacc5f3:c7a86a8b:25240df4
               Name : NAS540:2  (local to host NAS540)
      Creation Time : Thu Jul 27 14:12:32 2017
         Raid Level : raid5
       Raid Devices : 4
    
     Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
         Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
      Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 0a5c35b6:3bd8a182:5030b8be:51bbe238
    
        Update Time : Thu Oct 22 22:21:39 2020
           Checksum : 77ae1fd - correct
             Events : 47
    
             Layout : left-symmetric
         Chunk Size : 64K
    
       Device Role : Active device 1
       Array State : AAAA ('A' == active, '.' == missing)
    ~ # mdadm --examine /dev/sdc3
    /dev/sdc3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
               Name : NAS540:2  (local to host NAS540)
      Creation Time : Tue Dec  1 11:31:20 2020
         Raid Level : raid5
       Raid Devices : 4
    
     Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
         Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
      Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : ebcb668f:f6c07008:9efd8d2f:ad7314ad
    
        Update Time : Wed Dec  2 13:18:45 2020
           Checksum : b6d0c276 - correct
             Events : 122
    
             Layout : left-symmetric
         Chunk Size : 64K
    
       Device Role : Active device 2
       Array State : A.AA ('A' == active, '.' == missing)
    ~ # mdadm --examine /dev/sdd3
    /dev/sdd3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
               Name : NAS540:2  (local to host NAS540)
      Creation Time : Tue Dec  1 11:31:20 2020
         Raid Level : raid5
       Raid Devices : 4
    
     Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
         Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
      Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 460883bd:f423662f:25ca9304:9fe9a52e
    
        Update Time : Wed Dec  2 13:18:50 2020
           Checksum : 6279a450 - correct
             Events : 124
    
             Layout : left-symmetric
         Chunk Size : 64K
    
       Device Role : Active device 3
       Array State : A.AA ('A' == active, '.' == missing)
    
    and
    ~ # cat /proc/mdstat
    Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md3 : inactive sdb3[1](S)
          3902886912 blocks super 1.2
    
    md2 : active raid5 sda3[0] sdd3[3] sdc3[2]
          11708660160 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
    
    md1 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[4]
          1998784 blocks super 1.2 [4/4] [UUUU]
    
    md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[4]
          1997760 blocks super 1.2 [4/4] [UUUU]
    
    unused devices: <none>
    
    It looks like the missing drive is the one I'm still hoping to recover; the amelia-1 share shows as `Lost`.

    Anything left to try?

    I really appreciate your effort.

  • Mijzelf
    Mijzelf Posts: 2,598  Guru Member
    The headers look good and the array is up. Seeing that 'Lost' changed to 'Disabled', I think you only have to enter the Shares menu to enable them.
    As far as the firmware knows, you put in 3 disks containing a new volume, so the 'old' shares are no longer available.
  • bjorn
    bjorn Posts: 8  Freshman Member
    I've been able to get a decent amount off the drive once I enabled them. (It's very, very slow and still obviously beeping.) After this completes, are there any diagnostics I can run to fully assess each disk and make sure all are in working order before resetting it?
  • Mijzelf
    Mijzelf Posts: 2,598  Guru Member
    edited December 2020
    It's very, very slow and still obviously beeping.

    About beeping

    buzzerc -s && mv /sbin/buzzerc /sbin/buzzerc.old

    will stop the buzzer and remove the possibility for the firmware to start it again, until the next reboot.

    About slowness: it shouldn't be much slower than before. The array is degraded, which means roughly one out of three blocks has to be recalculated from the parity, but the NAS can do that at 1 GB/sec, so it's hardly noticeable. The box should do 75-100 MB/sec for big files (which, of course, is still about 37 hours for 10 TB).
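
    If you want a rough number for the sequential read speed of the degraded array, you could time a raw, read-only pass over the md device (assuming `time` is available in your shell; the count of 64 x 16M is 1 GiB, so dividing 1 GiB by the elapsed time gives the throughput):
    time dd if=/dev/md2 of=/dev/null bs=16M count=64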

    After this completes, are there any diagnostics I can run to fully assess each disk

    That's complicated. The SMART values of the disks can tell if the disks themselves 'feel healthy'.
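
    Assuming smartctl is available on the box (it's part of smartmontools; depending on the firmware you may have to install it, and disks behind a SATA bridge may need -d sat), reading those values looks roughly like this:
    smartctl -H -A /dev/sda
    Repeat for /dev/sdb, /dev/sdc and /dev/sdd; -H gives the overall health verdict and -A dumps the attribute table.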

    It is possible that SMART says the disks are completely healthy, while a disk will still be dropped if you try to add a 4th disk. The reason is aging of the data. A modern harddisk has a very high data density: a single bit is a few square nm, and so only a few dozen magnetic atoms. Ideally those atoms are all oriented the same way, so a clear 0 or 1 can be read. But due to thermal noise, over time some atoms can lose their orientation, blurring the signal. At some point it's no longer possible to tell whether it's a 0 or 1. Because of this the sector has some extra bits to restore a few unreadable bits, but sometimes that isn't enough, and the sector is unreadable. The disk will try several times to read the sector, because the positioning of the head is not 100% reproducible, so a new read may pull in some other, possibly readable, atoms.

    Eventually the disk will report an I/O error: the sector is not readable. The RAID manager will then drop the disk. But this disk is otherwise perfectly healthy. One sector is not readable, yet you could simply write new data to it, if only you knew what to write.

    The solution is to 'resilver' the disk, which means reading each sector and writing it back. This way all atoms are re-oriented and good for years. (It is possible that the sector which caused your problem hasn't been written to since it left the factory. If you succeed in copying all data, you have proven that the problem sector is not in use by the filesystem.) Modern filesystems like ZFS and BTRFS have built-in functions for this, but unfortunately the software RAID used here hasn't, AFAIK.

    For really unusable sectors the disk has a number of spare sectors which can replace them. In the SMART values there is an entry for that: "Reallocated Sectors Count". The raw value is the number of replaced sectors; the percentage indicates the amount of spare sectors left.
    To find out if a disk is still trustworthy, you should make a note of the raw value and percentage, overwrite the complete disk, and check whether those values changed much. If not, all sectors are readable again (you just re-oriented all the atoms) and there was no significant number of hard-failing sectors.
    Unfortunately this will kill your data. I'm not aware of any way to resilver the data on the disk reliably. A naive way is:
    dd if=/dev/md2 of=/dev/md2 bs=16M
    This will copy all data from md2 back to md2 in blocks of 16M. That should resilver the whole surface, but unfortunately it will stop at the first read error. And it's dangerous to do while the filesystem is mounted, as you could overwrite pending changes with older data.

    dd if=/dev/zero of=/dev/md2 bs=16M
    will write zeros to md2. It will also possibly overwrite pending changes, but as the filesystem is destroyed anyway, that doesn't matter. But make sure to read the SMART data before and after.
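
    A possible way to capture the relevant SMART attribute before and after the overwrite (again assuming smartctl is present; the attribute is usually listed as Reallocated_Sector_Ct, and the /tmp file names here are just an example):
    smartctl -A /dev/sda | grep -i reallocated > /tmp/sda_smart_before
    # ... run the dd overwrite above ...
    smartctl -A /dev/sda | grep -i reallocated > /tmp/sda_smart_after
    diff /tmp/sda_smart_before /tmp/sda_smart_after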


