dd, bs= and why you should use conv=fsync

This story starts with me having to simulate a faulty disk device for testing. The Linux Kernel Device mapper is a good solution for this, so i created a faulty device with a simple file backend:

 truncate -s 1G /tmp/baddisk
 losetup /dev/loop2 /tmp/baddisk
 dmsetup create baddisk << EOF 
    0 6050 linear /dev/loop2 0
    6050 155 error
    6205 2090947 linear /dev/loop2 6205 
 EOF

These commands setup a new device on /dev/mapper/baddisk with 1GB of size. Starting from sector 6050, there are 155 faulty sectors, where any write and read operation should cause I/O errors.

I went on and used dd to write to the device, as the first faulty block should start around 3MB, i used the following command:

  dd if=/dev/zero of=/dev/mapper/baddisk bs=4096 count=1000
  4096000 bytes (4.1 MB, 3.9 MiB) copied, 0.0107267 s, 382 MB/s

To my surprise, the command succeeded. I assumed some error in my setup and after re-creating the device mapper target, i tried again. This time with the following command:

 dd if=/dev/zero of=/dev/mapper/baddisk
 dd: writing to '/dev/mapper/baddisk': Input/output error
 3096576 bytes (3.1 MB, 3.0 MiB) copied, 0.0238947 s, 130 MB/s

Nice, the device behaves as expected! While taking notes in another terminal and switching back and forth workspaces, i issued the following command again:

  dd if=/dev/zero of=/dev/mapper/baddisk bs=4096 count=1000
  4096000 bytes (4.1 MB, 3.9 MiB) copied, 0.0107267 s, 382 MB/s

What? It succeeded writing 4.1MB of data to a faulty segment of the disk which should clearly fail! This was strange, but still, after many attempts with this command writing to the complete device until it got end of space, no I/O errors were reported by dd.

Looking at the dmesg output, the kernel correctly reported errors with the underlying device:

 Buffer I/O error on dev dm-3, logical block 757, lost async page write
 [..]

And running badblocks on the device also correctly reported them.

Why does dd not report this error?

The difference between the commands is the used block size, so i assumed some caching beeing the cause for this situation, or maybe dd opening the file with different flags like O_DIRECT or O_SYNC if smaller block sizes are used?

I straced the dd command and the openat/write and close functions behaved exacly the same, this time i used a 5MB block size for simpler debugging:

 openat(AT_FDCWD, "/dev/mapper/baddisk", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5242880) = 5242880
 close(1)                                = 0

The strace output shows that the succeeding command opens the file without any notable difference to the command writing with 512 bytes block size. The write and close functions return with no error whatsoever. dd simply does not notice the data loss while writing to the storage!

Making dd use the O_DIRECT flag during file open, or the O_SYNC option catches the error:

 dd if=/dev/zero of=/dev/mapper/baddisk bs=5M oflag=direct
 dd: error writing '/dev/mapper/baddisk': Input/output error

What is the reason for this? I assume dd, with its standard block size of 512 bytes does not the hit the linux kernels buffered I/O. And with bigger block sizes, the I/O becomes buffered, async, and as it stands, dd as user space application does not validate the write operation (using fsync) to notice errors during buffered I/O operations by default.

This leads us to my next finding: “the linux fsync() gate” that dates back to 2018, starting with the following question on stackoverflow:

Writing programs to cope with I/O errors causing lost writes on Linux

And resulting LWN articles:

Improved block-layer error handling

PostgreSQL’s fsync() surprise

which provide great insight into the linux kernels error handling and how these errors are upstreamed to user space applications while writing data to faulty devices.

Long story short: If one uses dd with a bigger block size (>= 4096), be sure to use either the oflag=direct or conv=fsync option to have proper error reporting while writing data to a device. I would prefer conv=fsync, dd will then fsync() the file handle once and report the error, without having the performance impact which oflag=direct has.

dd if=/dev/zero of=/dev/mapper/baddisk bs=4096 count=1500 conv=fsync
dd: fsync failed for '/dev/mapper/baddisk': Input/output error
Written on January 6, 2021