pyFileFixity/README.rst at master · lrq3000/pyFileFixity


Why does data become corrupted over time? For one reason: entropy. Entropy refers to the universal tendency of systems to become less ordered over time. Data corruption is exactly that: disorder in the order of bits. In other words: the Universe hates your data.

Long-term storage is thus a very difficult topic: it's like fighting death (in this case, the death of data). Because of entropy, data will eventually fade away through various silent errors such as bit rot or cosmic rays. pyFileFixity aims to provide tools to detect data corruption, and also to fight it with repair tools.

The only solution is a long-known engineering principle, the one that makes bridges and planes safe: add some redundancy.

There are only two ways to add redundancy:

- The simple way is to duplicate the object (also called replication), but for data storage this eats up a lot of space and is not optimal. However, if storage is cheap, this is a good solution, as it is much faster than encoding with error correction codes. For replication to work, at least 3 duplicates are necessary at all times, so if one fails, it must be replaced as soon as possible. As sailors say: "Either bring 1 compass or 3 compasses, but never two, because then you won't know which one is correct if one fails." Indeed, with 3 duplicates, if you frequently monitor their integrity (e.g., with hashes), then when one fails you can simply do a majority vote: the bit value given by 2 of the duplicates is probably correct.
- The second way, the best tool ever invented to recover from data corruption, is error correction codes (forward error correction): a way to smartly produce redundant codes from your data so that you can later repair your data using these additional pieces of information. That is, an ECC generates n blocks for a file cut into k blocks (with k < n), and can then rebuild the whole file from (at least) any k blocks among the n available. In other words, you can correct up to (n-k) erasures, meaning corrupted blocks whose positions are already known. Error correcting codes can also locate and repair errors automatically (fully automatic data repair for you!), but at the cost that you can then only correct (n-k)/2 errors.
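To make the "any k of n blocks" idea concrete, here is a minimal sketch of the simplest possible erasure code: a single XOR parity block, so n = k + 1 and exactly one erasure can be repaired. This is only an illustration under that assumption, not the code pyFileFixity uses; the function names `xor_blocks`, `encode` and `recover` are hypothetical, and a real tool would rely on a stronger code such as Reed-Solomon to tolerate more than one lost block.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the k data blocks followed by one parity block (n = k + 1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(coded_blocks):
    """Rebuild the k data blocks when at most one block is None (an erasure)."""
    missing = [i for i, block in enumerate(coded_blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can only repair one erasure")
    repaired = list(coded_blocks)
    if missing:
        present = [block for block in coded_blocks if block is not None]
        # XOR of all surviving blocks equals the missing block.
        repaired[missing[0]] = xor_blocks(present)
    return repaired[:-1]  # drop the parity block


data = [b"abcd", b"efgh", b"ijkl"]   # k = 3 blocks of the file
coded = encode(data)                 # n = 4 blocks stored
coded[1] = None                      # one block is lost, position known
restored = recover(coded)
```

Note that the position of the lost block has to be known here, which is exactly what makes it an erasure rather than an error.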
Error correction can seem a bit magical, but for a reasonable intuition, it can be seen as a way to average out the corruption error rate: each bit still has the same chance of being corrupted, but since more bits now represent the same data, the overall chance of losing that data is lower.
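The majority vote over 3 duplicates is the easiest place to see this intuition at work. The sketch below is illustrative only: the names `majority_vote` and `bit_loss_probability` are hypothetical, and the probability formula assumes each copy of a bit is corrupted independently with the same probability p.

```python
def majority_vote(a, b, c):
    """Rebuild data from 3 replicas by taking, for every bit, the value
    that at least 2 of the 3 copies agree on."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("replicas must have the same length")
    # A bit is set in the result if it is set in at least two replicas.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


def bit_loss_probability(p):
    """Chance that the vote gives the wrong bit: 2 or 3 copies corrupted,
    assuming independent corruption with probability p per copy."""
    return 3 * p**2 * (1 - p) + p**3


original = b"hello world"
damaged = bytearray(original)
damaged[0] ^= 0b00000100             # flip one bit in one replica
voted = majority_vote(original, bytes(damaged), original)
```

With p = 0.01, `bit_loss_probability(p)` is about 0.0003: each stored bit is just as fragile as before, but the data it represents is lost far less often.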

archive · backup

November 28, 2025 at 4:44:31 AM UTC

https://github.com/lrq3000/pyFileFixity/blob/master/README.rst#the-problem-of-long-term-storage
