A Case of Data Rescue After a PERC RAID5 Array Went Offline

TL;DR:

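I'm no miracle worker: a RAID5 that has truly lost two disks is very hard to save. Replace failed disks promptly.
Back up, back up, back up!
ZFS is a blessing for data recovery: compression, snapshots, and clones all proved very useful.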

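A couple of days ago, after a power outage and switching operation at school, 杰哥 told me that a RAID5 VD on a Dell server in the machine room had a failed disk, and asked me to replace it when I had time (note: it had actually started reporting errors two years ago). But when I logged in to the server today, the directory backed by this RAID5 was already returning Input/output error. A quick look around: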

harry@raid-server:~$ mount | grep sda
/dev/sda1 on /data type ext4 (rw,relatime,errors=remount-ro)

harry@raid-server:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       366G  233G  115G  68% /
/dev/sda1        11T  1.6T  8.8T  15% /data

harry@raid-server:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sdb      8:16   0 446.6G  0 disk
├─sdb1   8:17   0 372.6G  0 part /
├─sdb2   8:18   0     1K  0 part
└─sdb5   8:21   0    74G  0 part
sdc      8:32   0  10.9T  0 disk

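The original sda had vanished from the system and reappeared as sdc with the same size (accessing it also returned EIO), while the stale mount was still lingering. dmesg was full of disk I/O errors from megaraid_sas and ata, plus filesystem errors from ext4. This filesystem held some fairly valuable data, so this was real trouble.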

Data recovery process

Forcing the RAID online and exporting a disk image

I checked the disks with perccli (the Dell OEM build of storcli; output excerpted) and found very bad news:

harry@raid-server:~$ sudo perccli64 /call/eall/sall show all
...
----------------------------------------------------------------
DG/VD TYPE    State Access Consist Cache Cac sCC       Size Name
----------------------------------------------------------------
0/0    RAID1  OptL   RW    No      NRWTD -   0FF 446.625 GB
1/1    RAID5  OfLn   RW    No      NRWTD -   0FF  10.914 TB
----------------------------------------------------------------
...
------------------------------------------------------------------------------
EID:Slt DID State DG      Size Intf Med SED PI SeSz           Model    Sp Type
------------------------------------------------------------------------------
32:0    0 Onln    0 446.625 GB SATA SSD N    N  512B SSDSC2KB480G8R      U -
32:1    1 Onln    0 446.625 GB SATA SSD N    N  512B SSDSC2KB480G8R      U -
32:4    4 Onln    1   3.637 TB SATA HDD N    N  512B ST4000NM0033-9ZM170 U -
32:5    5 Failed  1   3.637 TB SATA HDD N    N  512B ST4000NM0033-9ZM170 U -
32:6    6 Failed  1   3.637 TB SATA HDD N    N  512B ST4000NM0033-9ZM170 U -
32:7    7 Onln    1   3.637 TB SATA HDD N    N  512B ST4000NM0033-9ZM170 U -
------------------------------------------------------------------------------

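This was serious: two disks (slots 5 and 6) of the four-disk RAID5 had been marked Failed by the controller. Just as I had concluded that all was lost and was about to start manually recovering some of the data from elsewhere, the SMART data gave me some hope: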

harry@raid-server:~$ sudo smartctl -a /dev/sdc -d megaraid,5
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
Raw_Read_Error_Rate     0x010f   061   061   ---    Pre-fail  Always       -       7418009
Spin_Up_Time            0x0103   092   092   ---    Pre-fail  Always       -       0
Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       138
Reallocated_Sector_Ct   0x0133   021   021   ---    Pre-fail  Always       -       13048
Seek_Error_Rate         0x000f   090   061   ---    Pre-fail  Always       -       1116579259
Power_On_Hours          0x0032   027   027   ---    Old_age   Always       -       64444
Spin_Retry_Count        0x0013   100   100   ---    Pre-fail  Always       -       0
Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       138
End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
Reported_Uncorrect      0x0032   095   095   ---    Old_age   Always       -       5
Command_Timeout         0x0032   100   095   ---    Old_age   Always       -       4295032838
High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
Airflow_Temperature_Cel 0x0022   069   052   ---    Old_age   Always       -       31 (Min/Max 30/36)
G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       132
Load_Cycle_Count        0x0032   056   056   ---    Old_age   Always       -       89153
Temperature_Celsius     0x0022   031   048   ---    Old_age   Always       -       31 (0 14 0 0 0)
Hardware_ECC_Recovered  0x001a   015   005   ---    Old_age   Always       -       7418009
Reallocated_Event_Count 0x0032   000   000   ---    Old_age   Always       -       13113
Current_Pending_Sector  0x0012   080   080   ---    Old_age   Always       -       3336
Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       39850 (101 109 0)
Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       5038718942
Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       1089162748411

harry@raid-server:~$ sudo smartctl -a /dev/sdc -d megaraid,6
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
Raw_Read_Error_Rate     0x010f   081   063   ---    Pre-fail  Always       -       143337086
Spin_Up_Time            0x0103   093   092   ---    Pre-fail  Always       -       0
Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       102
Reallocated_Sector_Ct   0x0133   100   100   ---    Pre-fail  Always       -       0
Seek_Error_Rate         0x000f   091   060   ---    Pre-fail  Always       -       1395588958
Power_On_Hours          0x0032   018   018   ---    Old_age   Always       -       72016
Spin_Retry_Count        0x0013   100   100   ---    Pre-fail  Always       -       0
Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       100
End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
Reported_Uncorrect      0x0032   094   094   ---    Old_age   Always       -       6
Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
Airflow_Temperature_Cel 0x0022   068   052   ---    Old_age   Always       -       32 (Min/Max 31/37)
G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       97
Load_Cycle_Count        0x0032   041   041   ---    Old_age   Always       -       119629
Temperature_Celsius     0x0022   032   048   ---    Old_age   Always       -       32 (0 16 0 0 0)
Hardware_ECC_Recovered  0x001a   021   003   ---    Old_age   Always       -       143337086
Reallocated_Event_Count 0x0032   000   000   ---    Old_age   Always       -       21845
Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       39054 (140 124 0)
Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       5257600282
Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       1497495161858

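The SMART data of both disks is in terrible shape (besides the attributes above, there is also a pile of error log entries), but disk 6 at least looks salvageable (Reallocated_Sector_Ct and Current_Pending_Sector are both 0). So I forced disk 6 online: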

harry@raid-server:~$ sudo perccli64 /c0/e32/s6 set online

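Fortunately the command succeeded, and sdc became readable! That was already very lucky: had the array refused to come online, I would have had to image each physical disk separately and try to reassemble the RAID before even starting on the filesystem.

Even so, I certainly did not dare to mount the filesystem, not even read-only. I immediately found another server running ZFS with enough free space, created a new dataset, and mounted it on this server over NFS. Then I used ddrescue to image sdc: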

ddrescue -d -D -n -r0 -v --ask -S -c 2048 /dev/sdc /mnt/backup/data/raid.img /mnt/backup/data/raid.map

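Where:

The img file is the resulting image; the map file records the read status, so an interrupted run can be resumed.
-d, -D: use direct I/O (O_DIRECT) for both the source and the destination, bypassing the kernel cache. When creating an image every read and write happens only once, so there is no point polluting the cache; caching also makes the speed statistics inaccurate.
-n, -r0: skip the scraping phase and do not retry bad blocks, so that the easily readable data is copied out first.
-v, --ask: show verbose progress and ask for confirmation of the parameters before starting.
-S: write a sparse file. Always turn this on; otherwise a huge amount of space is wasted.
-c 2048: set the I/O granularity to 2048 sectors, i.e. 1 MiB.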

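After smoothly reading about 6 TB of data (without hitting a single bad block), ddrescue got stuck and made no further progress. At that point dmesg showed a new, enormous flood of errors: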

[  +0.011405] megaraid_sas 0000:e1:00.0: 252132 (832085413s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0x20/s4) at e8d85fbe
[  +0.011081] megaraid_sas 0000:e1:00.0: 252133 (832085413s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0x20/s4) at e8d85fbe
[  +0.011220] megaraid_sas 0000:e1:00.0: 252134 (832085413s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0x20/s4) at e8d85fbe

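A bolt from the blue! Disk 4, which had been in good shape, was now in trouble too. When I tried to read its SMART information again, I got no response at all (although perccli still showed it as normal). So either the controller had entered some incorrect state, or this disk had died completely as well. Before trying to reboot the machine and continue, I first made some recovery attempts.

Recovering files from an incomplete disk image

On the receiving side, we could already see the partial raid.img file. It is sparse (all-zero regions are not actually allocated), and my ZFS also had compression enabled, so the space actually used is much smaller. The RAID is about 11 TB (10.9 TiB) and we had copied out 6 TB; after sparsification the image is 1.54 TB (this is ZFS's logical used), and after ZSTD compression it occupies only 1.3 TB (this is ZFS's used).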

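To be safe, before doing anything else I took a snapshot and cloned it into a new dataset. That way I could write freely while attempting recovery and simply roll back if something went wrong; and if the original RAID5 became readable again, I could continue writing the image on top of this snapshot.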
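The snapshot-and-clone step can be sketched roughly as follows. The dataset and snapshot names are hypothetical (the post does not give them), and the `run` helper is an illustrative addition: by default it only prints each command, so nothing touches a pool unless DRY_RUN=0 is set.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Freeze the current state of the image dataset, then create a writable
# clone of that snapshot for destructive experiments (fsck, repairs, ...).
snapshot_and_clone() {
    src="$1"    # dataset holding raid.img (hypothetical name below)
    tag="$2"    # snapshot name
    work="$3"   # writable clone used for recovery attempts
    run zfs snapshot "${src}@${tag}"
    run zfs clone "${src}@${tag}" "${work}"
}

snapshot_and_clone mypool/rescue partial-6tb mypool/rescue-work
```

Experiments then happen only inside the clone; the original dataset keeps receiving the ddrescue output untouched.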

Since the disk uses a GPT partition table, gdisk can be used to analyze it:

# gdisk raid.img
GPT fdisk (gdisk) version 1.0.10

Warning! Disk size is smaller than the main header indicates! Loading
secondary header from the last sector of the disk! You should use 'v' to
verify disk integrity, and perhaps options on the experts' menu to repair
the disk.
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

Warning! Error 25 reading partition table for CRC check!
Warning! One or more CRCs don't match. You should repair the disk!
Main header: OK
Backup header: ERROR
Main partition table: OK
Backup partition table: ERROR

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: damaged

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************

Command (? for help): p
Disk raid.img: 23438819295 sectors, 10.9 TiB
Sector size (logical): 512 bytes
Disk identifier (GUID): [REDACTED]
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 23438819294
Partitions will be aligned on 2048-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     23438819294   10.9 TiB    8300

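Unsurprisingly it is reported as damaged, because the image is incomplete, but the partition is still visible. So I tried mounting from the start of the partition: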

# losetup -o 1048576 -r /dev/loop1 raid.img
# mount -o ro /dev/loop1 /mnt/recovery
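The 1048576 passed to losetup -o is simply the partition's start sector from the gdisk listing multiplied by the logical sector size. A tiny sketch of that arithmetic (the function name is only for illustration):

```shell
# Byte offset of a partition = start sector * logical sector size.
part_offset_bytes() {
    echo $(( $1 * $2 ))
}

# Start sector 2048 and 512-byte sectors, both taken from the gdisk output.
part_offset_bytes 2048 512
```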

But the mount failed. The log said:

[May14 22:23] loop1: detected capacity change from 0 to 11719479296
[  +3.289364] mount: attempt to access beyond end of device
[May14 22:24] EXT4-fs (loop1): bad geometry: block count 2929852155 exceeds size of device (1464934912 blocks)
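The numbers in this log can be cross-checked by hand. Assuming the filesystem uses 4096-byte blocks (not stated in the post, but it is the value that makes the kernel's figures agree), the 11719479296 sectors of loop1 are exactly the 1464934912 blocks ext4 complains about, roughly half of the 2929852155 blocks it expects. The sketch below redoes that conversion and also computes how long the image file has to be to reach the end of the partition shown by gdisk; the function names are illustrative.

```shell
SECTOR_BYTES=512      # logical sector size reported by gdisk
FS_BLOCK_BYTES=4096   # assumed ext4 block size (not stated in the post)

# Convert a count of 512-byte sectors into filesystem blocks.
sectors_to_fs_blocks() {
    echo $(( $1 * SECTOR_BYTES / FS_BLOCK_BYTES ))
}

# Image length in bytes needed so that the given last sector lies inside the file.
image_bytes_through_sector() {
    echo $(( ($1 + 1) * SECTOR_BYTES ))
}

sectors_to_fs_blocks 11719479296          # capacity of loop1 in the log above
image_bytes_through_sector 23438819294    # end sector of partition 1 in gdisk
```

The second value is the kind of size one would hand to truncate -s; subtracting the 1 MiB partition offset and dividing by 512 gives 23438817247 sectors.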

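This is easy to fix: I ran truncate -s on the image to extend it to the position corresponding to the end of the partition (the extension is sparse too, so it completes instantly). On the next attempt the mount succeeded! But running ls on the root directory produced a pile of errors: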

# ls -alih /mnt/recovery
ls: cannot access 'harry': Bad message
ls: cannot access 'docker': Bad message
...
total 121G
        2 drwxrwxrwx 24 root root 4.0K May 30  2025 .
  1831425 drwxr-xr-x  5 root root 4.0K May 14 22:18 ..
        ? d?????????  ? ?    ?       ?            ? docker
151519233 drwx--x--x 14 root root 4.0K Mar 26  2021 docker.old
        ? d?????????  ? ?    ?       ?            ? harry
 31195137 drwxr-xr-x 10 root root 4.0K Nov  2  2020 k8s
       11 drwxrw-rw-  2 root root  16K Aug  3  2020 lost+found
  5439489 drwxr-xr-x 10 1006 1006 4.0K Jul 26  2024 static

# dmesg -H 
[May14 22:31] loop1: detected capacity change from 0 to 23438817247
[May14 22:32] EXT4-fs (loop1): mounted filesystem [REDACTED] ro without journal. Quota mode: none.
[  +2.797028] EXT4-fs error (device loop1): ext4_lookup:1819: inode #229244929: comm ls: iget: checksum invalid
[  +0.001432] EXT4-fs error (device loop1): ext4_lookup:1819: inode #335085569: comm ls: iget: checksum invalid
[  +0.001535] EXT4-fs error (device loop1): ext4_lookup:1819: inode #252116993: comm ls: iget: checksum invalid
[  +0.099194] EXT4-fs error (device loop1): ext4_lookup:1819: inode #262864897: comm ls: iget: checksum invalid
[  +0.017124] EXT4-fs error (device loop1): ext4_lookup:1819: inode #342818817: comm ls: iget: checksum invalid
[  +0.023974] EXT4-fs error (device loop1): ext4_lookup:1819: inode #187039745: comm ls: iget: checksum invalid
[  +0.019762] EXT4-fs error (device loop1): ext4_lookup:1819: inode #265945089: comm ls: iget: checksum invalid

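Apparently the inodes these directory entries point to, or related metadata, were not within the range of the image copied so far, so they could not be accessed. Unfortunately, my own directory (harry) was among them. Claude suggested using debugfs's inode_dump to look at the inode contents, however: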

# debugfs /dev/loop1
debugfs 1.47.2 (1-Jan-2025)
/dev/loop1: Block bitmap checksum does not match bitmap while reading allocation bitmaps
debugfs:  inode_dump 229244929
inode_dump: Filesystem not open

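So that did not work. I then ran e2fsck -fy against the image, and it did repair a large number of problems. On the next mount, ext4 no longer reported those errors, but those directories were gone for good; correspondingly, the space used on the filesystem shrank to under 1 TB, which shows that fsck really did delete a lot.

Recovering files from a nearly complete disk image

What now? Before reaching for photorec as a last resort, and with nothing left to lose, I force-rebooted the server holding the RAID5 disks. To my surprise, after the reboot the array was usable again (disk 5 was still failed, of course). So I resumed ddrescue and waited patiently.

My original plan was to rerun ddrescue without -n -r0 after the first run finished, so that it would retry the parts it had failed to read. But when I got up the next day, it was stuck in the backwards pass and had triggered similar megaraid_sas errors again. At that point there was a little over 1 MB of non-trimmed data and 53 MB of non-tried data; everything else had been read successfully, and the size now matched. Considering that our disk is actually very sparse, I mounted this image directly. To my surprise there was not a single error, and even e2fsck found nothing wrong. With that, the rescue was declared finished. We copied out the useful data and restored most of the services; the rest stays on this mirrored (RAID1) ZFS pool as an archive (and the backup of last resort).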
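The two-pass plan described above can be sketched as a pair of invocations that share one mapfile. The paths and flags are the ones from the command earlier in the post; the `run` helper and the function names are illustrative additions, and by default the script only prints the commands instead of running them.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

SRC=/dev/sdc
IMG=/mnt/backup/data/raid.img
MAP=/mnt/backup/data/raid.map

# Pass 1: grab everything that reads cleanly; no scraping, no retries.
first_pass() {
    run ddrescue -d -D -n -r0 -v --ask -S -c 2048 "$SRC" "$IMG" "$MAP"
}

# Pass 2: same image and mapfile, but without -n -r0, so ddrescue revisits
# only the areas the mapfile still marks as unfinished.
second_pass() {
    run ddrescue -d -D -v --ask -S -c 2048 "$SRC" "$IMG" "$MAP"
}

first_pass
second_pass
```

Because the mapfile records what has already been rescued, the second invocation does not reread the good areas.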

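Miscellaneous

Diagnosing abnormal ZFS performance

When I first used ddrescue to read the image from the RAID device to the remote NFS share, I/O performance was very poor: zpool iostat reported only about 40 MB/s. Since this pool is a mirror of two disks, the I/O on each disk should match that of the pool. HDD IOPS are poor, but in principle the I/O that ddrescue generates should be sequential most of the time, so there was no good reason for it to be this slow.

So I tried the following:

zfs set sync=disabled: zpool write speed rose to 50 MB/s;
ddrescue -c 2048 (setting the I/O granularity to 2048 sectors, i.e. 1 MiB): write speed rose to 60 MB/s;
zfs set recordsize=1M, plus the NFS mount options rsize=1048576,wsize=1048576: speed rose to 75 MB/s (although the gain may not have come from this; see below).

These did bring some improvement, but the result was still far from the performance I expected (roughly saturating the bandwidth of the target disks).

First, check whether ddrescue is really writing sequentially: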

# strace -p $(pgrep ddrescue) -e t=write
write(4, "[REDACTED]"..., 1048576) = 1048576
write(1, "\r\33[A\33[A\33[A\33[A\33[A\33[A     ipos:   "..., 95) = 95
write(1, "     opos:    7974 GB, non-scrap"..., 76) = 76
write(1, "non-tried:    4026 GB,  bad-sect"..., 76) = 76
write(1, "  rescued:    7973 GB,   bad are"..., 76) = 76
write(1, "pct rescued:   66.44%, read erro"..., 76) = 76

That looks fine. On the ZFS side, however, the I/O pattern became:

# zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
mypool      2.88T  13.5T    172    587   692K  42.8M
mypool      2.88T  13.5T    239    643   959K  81.2M
mypool      2.88T  13.5T    210    560   842K  70.1M
mypool      2.88T  13.5T    237    582   949K  72.9M
mypool      2.88T  13.5T    204    532   819K  70.0M
(several lines of roughly 75 MB/s writes omitted)
mypool      2.88T  13.5T    231    596   925K  76.7M
mypool      2.88T  13.5T    240    656   961K  83.2M
mypool      2.88T  13.5T    212    535   849K  67.3M
mypool      2.88T  13.5T    214    599   857K  79.2M
mypool      2.88T  13.5T    220    635   881K  80.0M
mypool      2.88T  13.5T    130    427   622K  39.7M
mypool      2.88T  13.5T      0  12.2K      0   295M
mypool      2.88T  13.5T      0  11.5K      0   282M
mypool      2.88T  13.5T    205    709   822K  72.3M
mypool      2.88T  13.5T    282    778  1.10M  97.3M
mypool      2.88T  13.5T    269    744  1.05M  94.1M
mypool      2.88T  13.5T    238    643   955K  81.3M
mypool      2.88T  13.5T    215    551   863K  71.8M
mypool      2.88T  13.5T    230    610   923K  79.0M
mypool      2.88T  13.5T    205    559   822K  71.9M
mypool      2.88T  13.5T    237    639   950K  81.8M
... (the pattern above repeats)

I pasted this into Claude for analysis, and it said that recordsize=1M had not actually taken effect:

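Typical row: 227 read / 674 write / 910K / 84.9M

Read  IO size: 910K / 227  ≈ 4.0 KB   ← exactly the same as before
Write IO size: 84.9M / 674 ≈ 126 KB   ← still 128K, not 1MB!

The full flow for writing one 128KB data block:

1. Read the parent indirect block from disk (~4KB)   ← the read IOPS you see
2. Update the block pointer in ARC
3. COW-write the new data block (128KB)              ← the write IOPS you see
4. COW-write the new indirect block (~4KB)           ← counted in write IOPS

Where does the loss go?
Every 128KB data write also requires an extra 4KB random read
HDD random reads: ~100-150 IOPS ≈ roughly 200-230 IOPS of overhead
These random reads directly eat into the disk's seek time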

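This does make a lot of sense. But the actual read-to-write ratio here (slightly above 1:3) is still far higher than in the scenario it describes (a 4K read for every 132K written), so what are all those extra 4K read I/Os doing? This time I realized the problem without asking an LLM: the dataset receiving the image had inherited a dedup setting that was originally enabled to save space in home directories, and these I/Os were reads of the pool's DDT hash table before each write. In our scenario (finishing the image as quickly as possible), dedup has little value.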
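An inherited setting like this shows up in the SOURCE column of zfs get. The sketch below filters the tab-separated output of `zfs get -H -o name,value,source dedup` down to the datasets where dedup is not off. The helper name is illustrative, and the sample input is hypothetical data that only shows the shape of the output; it is not taken from the actual pool.

```shell
# Print datasets whose dedup property is not "off", together with the
# source of the setting. Expects tab-separated name/value/source on stdin.
dedup_enabled() {
    awk -F '\t' '$2 != "off" { printf "%s\tdedup=%s\t(%s)\n", $1, $2, $3 }'
}

# Real use would look like this (pool name as in the iostat output):
#   zfs get -r -H -t filesystem -o name,value,source dedup mypool | dedup_enabled

# Hypothetical sample input:
printf 'mypool\ton\tlocal\nmypool/rescue\ton\tinherited from mypool\nmypool/scratch\toff\tlocal\n' \
    | dedup_enabled
```

A dataset listed as "inherited from ..." never had dedup set on it directly, which is exactly the kind of setting that is easy to overlook.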

So I paused the transfer, turned off dedup, and recreated the whole dataset with zfs send | zfs recv. Then I resumed the transfer:

              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
mypool      3.27T  13.1T      0  12.3K      0   302M
mypool      3.27T  13.1T      0  12.5K      0   308M
mypool      3.27T  13.1T      0  12.6K      0   312M
mypool      3.27T  13.1T      0  12.5K      0   309M
mypool      3.27T  13.1T      0  12.5K      0   310M
mypool      3.27T  13.1T      0  12.5K      0   308M
mypool      3.27T  13.1T      0  12.2K      0   298M

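This looks much healthier! Each write I/O is still only about 24K, but the writes are now clearly sequential, so they make good use of the HDD bandwidth. Sure enough, dedup is as always the root of all evil for ZFS performance, especially on spinning disks.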
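The per-I/O sizes quoted in this section are just bandwidth divided by operations per second. A small sketch of that calculation follows; it assumes the K and M suffixes printed by zpool iostat are both powers of 1024, and the function name is illustrative.

```shell
# Average I/O size in KiB = bandwidth (KiB/s) / operations per second.
avg_io_kib() {
    awk -v bw="$1" -v ops="$2" 'BEGIN { printf "%.1f\n", bw / ops }'
}

# Reads during the slow phase: 910K at 227 ops.
avg_io_kib 910 227

# Writes after dedup was turned off: 308M at 12.5K ops.
avg_io_kib "$((308 * 1024))" 12800
```

The first call gives roughly 4 KiB per read and the second roughly 24 to 25 KiB per write, in line with the figures discussed above.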
