Detecting a Corrupt Solid State Hard Drive
February 11, 2014

Filed under: Performance Monitoring

Corrupted data is hard to detect since there is usually no direct indication in the logs. Often, by the time you notice corruption it is already too late: the machine no longer boots, or a command that is essential for normal operation suddenly stops working. Corruption itself is hard to prevent, especially on flash storage. That makes it all the more important to ensure that the storage device works well together with the rest of the hardware, and to find corruption as quickly as possible so that it can be fixed before it’s too late.

This article focuses on flash memory such as Compact Flash (CF) and Secure Digital (SD) cards used to run a Linux system.

Not all flash devices play nicely with the other hardware components in your computer. That doesn’t necessarily mean the CF card is defective; it might work fine in a different system. The behaviour can depend on the speed rating (DMA mode) used to access the CF, the firmware revision, or the batch number. For example, we had a CF card from SanDisk (SDCFH-004G) that didn’t run properly at UDMA 4, but only for cards with firmware revision HDX 7.07. We had no problems with the same model on HDX 6.03, and we also saw no issues with HDX 7.07 when operating at UDMA 2.
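
One way to see which firmware revision a card reports and which DMA/UDMA mode the kernel has actually negotiated is hdparm; /dev/sda below is just a placeholder for your CF device:

# Print the drive's identification data; the output includes the
# "Firmware Revision" line and the supported DMA/UDMA modes, with
# the currently selected mode marked by an asterisk.
hdparm -I /dev/sda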

So you can’t tell from the model number alone whether a flash device is suitable for your purpose. Most consumer-grade flash devices are only tested in digital cameras, not in computer systems. If your use case requires reliable data retention, industrial flash devices are the safer choice, but they are far more expensive than consumer-grade devices. The choice is yours.

Test Flash Memory for Corruption

So let’s say you’ve purchased a new CF card. What could you do to test it with your existing hardware?

First we’ll check for obvious data corruption by running a simple MD5 checksum test. The idea is to write a huge file (300-500MB) to the CF a couple of times and compare the checksums afterwards.
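
If you don’t already have a suitable file at hand, you can create one; the sketch below generates a hypothetical /root/hugeFile of roughly 400MB from random data (path and size are just examples):

# Create a ~400MB test file filled with random data
dd if=/dev/urandom of=/root/hugeFile bs=1M count=400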

These are the commands (make sure to run as root):

mkdir -p /root/wr_test
cd /root/wr_test
for A in $(seq 1 5); do
    dd if=/root/hugeFile of=file$A;
done
md5sum $(find -type f)
sync
echo 3 > /proc/sys/vm/drop_caches
md5sum $(find -type f)

The first two lines create a new directory and change to it:

mkdir -p /root/wr_test
cd /root/wr_test

Then a for-loop copies the file /root/hugeFile (replace with your own file) five times into the newly created directory:

for A in $(seq 1 5); do
    dd if=/root/hugeFile of=file$A;
done

Get the MD5 hashes of the files (this will read the files from the file system cache):

md5sum $(find -type f)

The next step is important. Since we’re interested in the data on the disk and not in the data from the system cache, we need to tell the kernel to write the cache to disk, which may take a while:

sync

After that, drop the file system caches that were just written to disk:

echo 3 > /proc/sys/vm/drop_caches

Finally, run the same MD5 command as above; you will see pretty quickly whether the CF passed the simple MD5 test:

md5sum $(find -type f)

In case of corruption one or more copied files will have different checksums.

For example on a good CF the result will look like this:

e0657151e780c19f6b6588174af094c3  ./file1
e0657151e780c19f6b6588174af094c3  ./file2
e0657151e780c19f6b6588174af094c3  ./file3
e0657151e780c19f6b6588174af094c3  ./file4
e0657151e780c19f6b6588174af094c3  ./file5

And on a corrupted CF the result will look like this:

d7cb89ccee796dde6691d8dd55a36085  ./file1
33d4c92031ec8db990fc67b28a1e8222  ./file2
ddae153b86d83e2e20f2e2361add7656  ./file3
82c631a102135ba50b92f836070cc064  ./file4
d63b325a60bfc989046de3f8273c637f  ./file5
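
If you prefer to have the comparison done automatically instead of comparing the hashes by eye, the same test can be wrapped up with md5sum -c. This is only a sketch; the checksum file /root/before.md5 is an arbitrary name and is kept outside the test directory so it doesn’t end up in the file list:

cd /root/wr_test
md5sum $(find -type f) > /root/before.md5    # hashes taken from the page cache
sync
echo 3 > /proc/sys/vm/drop_caches
md5sum -c /root/before.md5 || echo "CF failed the MD5 test"   # re-read from disk and compare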

Detect Existing Corruption on Flash Memory

After you’ve made sure that simple write corruption is not an issue on your CF, you should also put something in place that detects existing corruption as quickly as possible. We decided to use AIDE (Advanced Intrusion Detection Environment, http://aide.sourceforge.net/) because it’s a very lightweight file integrity checker and very intuitive to use. AIDE computes hashes for a given set of files and reports when a file has changed.
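
On Debian-based systems (which the /var/lib/dpkg path in the configuration below suggests) AIDE is available as a regular package; on other distributions the package name may differ:

apt-get install aide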

After installation we define in aide.conf which files and directories are included in or excluded from the check. We also tell AIDE to check only MD5 hashes, which is good enough to find corruption. Our modifications in aide.conf look like this:

Corruption = md5
# Include paths and files
/bin        	Corruption
/boot       	Corruption
/etc        	Corruption
/lib        	Corruption
/opt        	Corruption
/root       	Corruption
/sbin       	Corruption
/usr        	Corruption
/var/lib/dpkg/  Corruption

For a full list of config options please refer to the aide.conf(5) manual page.
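
Before initializing the database it can be worth letting AIDE parse the modified configuration once; AIDE offers a config check mode for this (the option may be spelled -D or --config-check depending on the version):

aide -c /etc/aide/aide.conf --config-check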

Next we initialize the AIDE database and move the newly created database into place:

aide -c /etc/aide/aide.conf -i
mv /var/lib/aide/aide.db.new /var/lib/aide/aide.db

Finally we create a script which performs a file system check using AIDE. Note that we need to sync and flush the file cache so that AIDE reads all data from disk:

#!/bin/bash
sync
echo 3 > /proc/sys/vm/drop_caches
aide -c /etc/aide/aide.conf -C >>/var/log/aide.log 2>&1

The results will be written to /var/log/aide.log. If you want, you can run this script from a daily cron job and set up some kind of notification, such as an email, in case data corruption is detected.
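
As a sketch of such a notification, the check script could mail the relevant part of the log whenever AIDE reports changes; AIDE exits with a non-zero status in that case. The script location (/etc/cron.daily/aide-check), the recipient and the availability of a mail command are assumptions here, so adapt them to your setup:

#!/bin/bash
# Hypothetical /etc/cron.daily/aide-check: flush caches, run the AIDE check,
# and mail the tail of the log if AIDE reported any changes (non-zero exit).
sync
echo 3 > /proc/sys/vm/drop_caches
if ! aide -c /etc/aide/aide.conf -C >>/var/log/aide.log 2>&1; then
    tail -n 50 /var/log/aide.log | mail -s "AIDE: possible corruption on $(hostname)" root
fi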

Conclusion

With AIDE you get a simple yet powerful tool to actively search for file corruption before the corruption finds you. Of course, in an ideal world there is no corruption, but consumer-grade flash memory is usually designed for use in digital cameras, so in real life we have to deal with the unexpected.

  • Neil

    I’ve recently encountered a situation that seems similar, involving the same model of Sandisk CF cards, the same firmware versions (7.07 is bad, while 6.03 is good), and what seems like similar behaviour: at low speeds, things seem to be okay, but issues occur at higher rates.

    In my case, the controller doesn’t do DMA, so it’s PIO0 (seems stable) and whatever Linux sets the PIO level to while preparing the cards for use. At higher speeds, with a small number of cards, it seems that persistent corruption occurs, resulting in ATA timeouts and command failures at mount-time.

    If you still have notes from your investigation, is there any chance you might be willing and able to share your findings or comment on what led you to conclude that the DMA level was directly responsible for the problems you experienced?