Corrupted data is hard to detect because there is usually no direct indication in the logs. By the time you notice corruption it is often too late: the machine no longer boots, or a command that is essential for normal operation suddenly stops working. Corruption itself is hard to prevent, especially on flash storage. That makes it all the more important to make sure the storage device works well with the rest of the hardware, and to find corruption as quickly as possible so that it can be fixed before it’s too late.
This article focuses on running a Linux system from flash memory such as CompactFlash (CF) and Secure Digital (SD) cards.
Not all flash devices play nicely with the other hardware components in your computer. This doesn’t mean that the CF is defective; it might work fine in a different system. The cause could be related to the speed rating (DMA mode) used to access the CF, the firmware revision, or the batch number. For example, we had a CF from SanDisk (SDCFH-004G) that didn’t run properly at UDMA 4, but only on cards with firmware revision HDX 7.07. We didn’t have problems with the same model on HDX 6.03, and we also didn’t encounter issues with HDX 7.07 when operating at UDMA 2.
So you can’t really tell from the model number alone whether a flash device is suitable for your purpose. Most consumer grade flash devices are only tested in digital cameras, not in computer systems. If your use case requires reliable data retention, go with industrial flash devices to be on the safe side. Industrial flash devices are far more expensive than consumer grade ones, though. The choice is yours.
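To see which firmware revision and DMA mode a particular card is using, the identification data printed by hdparm can help. The following is a minimal sketch, assuming hdparm is installed, the card is attached as an ATA device (a card in a USB reader will usually not answer), and /dev/sdX stands in for your device node. The two function names are made up for illustration.

```shell
#!/bin/bash
# Helpers that pick two fields out of `hdparm -I` output read from stdin.
# The function names are illustrative and not part of hdparm.

# Print the firmware revision, for example "HDX 7.07".
firmware_revision() {
    sed -n 's/^[[:space:]]*Firmware Revision:[[:space:]]*//p' | sed 's/[[:space:]]*$//'
}

# Print the DMA mode currently selected; hdparm marks it with an asterisk.
active_dma_mode() {
    grep -o '\*[um]dma[0-9]' | tr -d '*'
}

# Example (run as root, /dev/sdX is a placeholder):
# info=$(hdparm -I /dev/sdX)
# printf '%s\n' "$info" | firmware_revision
# printf '%s\n' "$info" | active_dma_mode
```

Recording both values next to the test results makes it easier to tell later whether a failure follows the firmware revision or the transfer mode.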
Test Flash Memory for Corruption
So let’s say you’ve purchased a new CF card. What could you do to test it with your existing hardware?
First we’ll check for obvious data corruption by running a simple MD5 checksum test. The idea is to write a huge file (300-500 MB) to the CF several times and compare the checksums of the copies afterwards.
These are the commands (make sure to run as root):
mkdir -p /root/wr_test
cd /root/wr_test
for A in $(seq 1 5); do dd if=/root/hugeFile of=file$A; done
md5sum $(find -type f)
sync
echo 3 > /proc/sys/vm/drop_caches
md5sum $(find -type f)
The first two lines create a new directory and change to it:
mkdir -p /root/wr_test
cd /root/wr_test
Then a for-loop copies the file /root/hugeFile (replace with your own file) five times into the newly created directory:
for A in $(seq 1 5); do dd if=/root/hugeFile of=file$A; done
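If no suitable large file is at hand, one can be generated from random data. This is a minimal sketch, assuming /dev/urandom is available and a head that accepts -c; the function name make_test_file is made up here.

```shell
#!/bin/bash
# Create a file of SIZE_MB mebibytes filled with random data.
# Usage: make_test_file PATH SIZE_MB
make_test_file() {
    local path=$1 size_mb=$2
    head -c "$((size_mb * 1024 * 1024))" /dev/urandom > "$path"
}

# Example: a 400 MB source file for the copy loop above
# make_test_file /root/hugeFile 400
```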
Get the MD5 hash from the files (this will read the files from the file system cache):
md5sum $(find -type f)
The next step is important. Since we’re interested in the data on the disk and not in the data from the system cache, we need to tell the kernel to write the cache to disk, which may take a while:
sync
After that, wipe out the file system cache, whose contents have now been written to disk:
echo 3 > /proc/sys/vm/drop_caches
Finally run the same MD5 command as above and you will notice pretty soon if the CF passed the simple MD5 test:
md5sum $(find -type f)
In case of corruption one or more copied files will have different checksums.
For example on a good CF the result will look like this:
e0657151e780c19f6b6588174af094c3 ./file1
e0657151e780c19f6b6588174af094c3 ./file2
e0657151e780c19f6b6588174af094c3 ./file3
e0657151e780c19f6b6588174af094c3 ./file4
e0657151e780c19f6b6588174af094c3 ./file5
And on a corrupted CF the result will look like this:
d7cb89ccee796dde6691d8dd55a36085 ./file1
33d4c92031ec8db990fc67b28a1e8222 ./file2
ddae153b86d83e2e20f2e2361add7656 ./file3
82c631a102135ba50b92f836070cc064 ./file4
d63b325a60bfc989046de3f8273c637f ./file5
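Comparing five hashes by eye is easy enough, but the check can also be scripted. The following is a minimal sketch, assuming the copies are named file1, file2 and so on inside one directory, as in the loop above; the function name check_copies is hypothetical. It compares every copy with the checksum of the source file instead of comparing the copies with each other.

```shell
#!/bin/bash
# Compare the MD5 checksum of every file* copy in DIR with that of SRC.
# Prints one line per mismatching copy; returns non-zero if any copy differs.
# Usage: check_copies SRC DIR
check_copies() {
    local src=$1 dir=$2 want got f status=0
    want=$(md5sum < "$src" | cut -d' ' -f1)
    for f in "$dir"/file*; do
        [ -f "$f" ] || continue
        got=$(md5sum < "$f" | cut -d' ' -f1)
        if [ "$got" != "$want" ]; then
            echo "MISMATCH: $f"
            status=1
        fi
    done
    return "$status"
}

# Example, after sync and dropping the caches:
# check_copies /root/hugeFile /root/wr_test && echo "all copies match"
```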
Detect existing Corruption on Flash Memory
After you’ve made sure that simple write corruption is not an issue on your CF you should also put something in place that will detect existing corruption as quickly as possible. We decided to use AIDE (Advanced Intrusion Detection Environment, http://aide.sourceforge.net/) for our purpose since it’s a very lightweight file integrity checker and very intuitive to use. The way AIDE works is that it computes hashes for given files and tells the user when a file has changed.
After installation we define in aide.conf which files and directories are included in or excluded from checking. We also tell AIDE to compute only MD5 hashes, which is good enough to find corruption. Our modifications in aide.conf look like this:
Corruption = md5

# Include paths and files
/bin Corruption
/boot Corruption
/etc Corruption
/lib Corruption
/opt Corruption
/root Corruption
/sbin Corruption
/usr Corruption
/var/lib/dpkg/ Corruption
For a full list of config options please refer to the aide.conf(5) manual page.
Next we initialize the AIDE database:
aide -c /etc/aide/aide.conf -i
mv /var/lib/aide/aide.db.new /var/lib/aide/aide.db
Finally, we create a script that checks the file system using AIDE. Note that we need to sync and flush the file cache first so that AIDE reads all data from disk:
#!/bin/bash
sync
echo 3 > /proc/sys/vm/drop_caches
aide -c /etc/aide/aide.conf -C >>/var/log/aide.log 2>&1
The results will be written to /var/log/aide.log. If you want you can run this script from a daily cron job and set up some kind of notification like an email in case data corruption was detected.
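For the notification, the exit status of the check is a convenient trigger. According to the aide(1) manual page, a check exits with 0 when nothing differs; otherwise the status is the sum of 1 for new files, 2 for removed files and 4 for changed files, while values from 14 upward signal errors. The following is a minimal sketch along those lines, assuming a working mail command and a placeholder address; the function name describe_aide_status is made up for illustration.

```shell
#!/bin/bash
# Translate the exit status of `aide -C` into a short description.
# Bit values follow the aide(1) manual page: 1 = new, 2 = removed, 4 = changed.
describe_aide_status() {
    local rc=$1 msg=""
    if [ "$rc" -eq 0 ]; then
        echo "no differences"
        return
    fi
    if [ "$rc" -ge 14 ]; then
        echo "aide error (status $rc)"
        return
    fi
    [ $((rc & 1)) -ne 0 ] && msg="$msg new"
    [ $((rc & 2)) -ne 0 ] && msg="$msg removed"
    [ $((rc & 4)) -ne 0 ] && msg="$msg changed"
    echo "files:$msg"
}

# Example use inside the check script (the address is a placeholder):
# aide -c /etc/aide/aide.conf -C >>/var/log/aide.log 2>&1
# rc=$?
# [ "$rc" -eq 0 ] || describe_aide_status "$rc" | mail -s "AIDE report" admin@example.com
```

For corruption the "changed" case is the interesting one, since a corrupted file keeps its path but no longer matches its stored hash.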
With AIDE you get a simple and yet powerful tool to actively search for file corruption before the corruption finds you. Of course in an ideal world there is no corruption but consumer grade Flash Memory is usually designed for use in digital cameras and therefore in real life we have to deal with the unexpected.