wallabag is a self-hostable application for saving web pages: Save and classify articles. Read them later. Freely.
While 300 TB comprising roughly 86 million music files, which the group claims represent about 99.6 percent of Spotify’s listens, is a vast amount of audio, it falls well short of the platform’s full catalog. Anna’s Archive says Spotify contains around 256 million tracks in total, meaning the audio files it archived cover only about a third of the catalog, with the remaining tracks represented only in metadata rather than preserved as music files.
By skipping the musical chaff in Spotify's catalog, the Anna's Archive team is apparently content to let those less popular songs languish, despite its stated aim of not focusing only on the most popular artists.
📂🛡️ Suite of tools for file fixity (data protection for long-term storage⌛) using redundant error-correcting codes, hash auditing, and duplication with majority vote, all in pure Python🐍
Here are some tools with a similar philosophy to pyFileFixity, which you can use if they better fit your needs, either as a replacement for pyFileFixity or as a complement (pyFileFixity can always be used to generate an ECC file):
With the above strategy, you should be able to preserve your data for as long as you can actively curate it. If you want more robustness against accidents, or against the risk that 2 copies get corrupted within 5 years, you can make more copies, preferably on LTO cartridges, though additional hard drives also work.
For more information on how to cold store LTO cartridges, read the "Caring for Cartridges" instructions (pp. 32-33) of this user manual. For HP LTO6 drives, Matthew Millman made an open-source command-line tool for advanced LTO manipulations on Windows: ltfscmd.
If you cannot afford an LTO drive, you can use external hard drives instead, as they are less expensive to start with, but then your curation routine should run more frequently (i.e., a small checkup every 2-3 years and a big checkup every 5 years).
Why are data corrupted with time? One sole reason: entropy. Entropy is the universal tendency of systems to become less ordered over time, and data corruption is exactly that: disorder creeping into the order of your bits. In other words: the Universe hates your data.
Long-term storage is thus a very difficult topic: it's like fighting death (in this case, the death of data). Indeed, because of entropy, data will eventually fade away through various silent errors such as bit rot or damage from cosmic rays. pyFileFixity aims to provide tools to detect any data corruption, and also to fight it by providing repair tools.
The only solution is to apply a long-established engineering principle, the same one that makes bridges and planes safe: add redundancy.
There are only 2 ways to add redundancy:
the simple way is to duplicate the object (also called replication). For data storage this eats up a lot of space and is not optimal, but if storage is cheap it is a good solution, since it is much faster than encoding with error-correcting codes. For replication to work, at least 3 duplicates must exist at all times, so that if one fails, it can be replaced as soon as possible. As sailors say: "Bring 1 compass or 3 compasses, but never 2, because if they disagree you won't know which one is correct." Indeed, with 3 duplicates whose integrity you monitor frequently (e.g., with hashes), if one fails you can simply take a majority vote: the bit value given by 2 of the 3 duplicates is probably correct (see the sketch after this list).
the second way, and the optimal tool ever invented to recover from data corruption, is error-correcting codes (forward error correction): a way to smartly compute redundant codes from your data so that you can later repair the data using these additional pieces of information. Concretely, an ECC expands a file cut into k blocks into n blocks (with k < n), and can then rebuild the whole file from (at least) any k of the n blocks. In other words, you can correct up to (n-k) erasures (corruptions at known locations). Error-correcting codes can also detect where the errors are and repair them automatically (fully automatic data repair for you!), but at the cost that you can then only correct (n-k)/2 errors.
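To make the first (replication) approach concrete, here is a minimal, hypothetical Python sketch of a bytewise majority vote across three copies; it is a toy, while pyFileFixity's duplication-with-majority-vote tooling mentioned above is the full-featured version of the same idea:

```python
# Bytewise majority vote across duplicates (assumes equal-length copies).
def majority_vote(copies):
    recovered = bytearray()
    for column in zip(*copies):                        # i-th byte of every copy
        recovered.append(max(set(column), key=column.count))
    return bytes(recovered)

copies = [b"hello world", b"hellX world", b"hello worXd"]  # 2 copies corrupted
print(majority_vote(copies))                           # b'hello world'
```

With 3 copies, a corruption in any single copy is outvoted 2 to 1 at that position; with only 2 copies, a mismatch tells you something is wrong but not which copy to trust.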
Error correction can seem a bit magical, but a reasonable intuition is that it averages out the corruption rate: on average, each individual bit is just as likely to be corrupted as before, but since more bits jointly represent the same data, the overall chance of losing that data drops.
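For the error-correcting-code route, here is a minimal sketch using the third-party reedsolo Python package (an illustration only, not pyFileFixity's own codec, and assuming a recent reedsolo version whose decode returns a 3-tuple). With n - k = 10 ECC bytes, it corrects up to 5 byte errors at unknown positions, matching the (n-k)/2 bound above:

```python
# Reed-Solomon demo with the `reedsolo` package (pip install reedsolo).
from reedsolo import RSCodec

rsc = RSCodec(10)                        # n - k = 10 ECC bytes per chunk
encoded = rsc.encode(b"long term storage")

corrupted = bytearray(encoded)
for i in (0, 5, 9):                      # corrupt 3 bytes at unknown positions
    corrupted[i] ^= 0xFF

# Recent reedsolo versions return (message, message+ecc, errata positions).
decoded, _, _ = rsc.decode(corrupted)
assert bytes(decoded) == b"long term storage"   # up to (n-k)/2 = 5 errors fixed
```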
pyFileFixity provides a suite of open-source, cross-platform, easy-to-use and easy-to-maintain (readable code) tools to protect and manage data for long-term storage/archival, and also to test the performance of any data protection algorithm.
The project is written in pure Python to meet those criteria; Cythonized extensions are available for the core routines to speed up encoding/decoding, but a pure-Python implementation always remains available as the specification, so as to allow long-term replication.
Here is an example of what pyFileFixity can do:
It's impossible to guarantee a long timeframe because of entropy (also called death!). Digital data decays and dies, just like any other thing in the universe. But the decay can be slowed down.
There's currently no fail-proof, scientifically proven way to guarantee 30+ years of cold data archival. Some projects aim to do that, like the Rosetta Disk project of the Long Now Foundation, although they are still very costly and have a low data density (about 50 MB).
In the meantime, you can use scientifically proven, resilient optical media for cold storage, such as HTL-type Blu-ray Discs like Panasonic's, or archival-grade DVD+R like Verbatim Gold Archival, and keep them in air-tight boxes in a cool spot (avoid high temperatures) and out of the light.
Also be REDUNDANT: make multiple copies of your data (at least 4), compute hashes to check regularly that everything is alright, and every few years rewrite your data onto new disks. Also, use a lot of error-correcting codes; they will allow you to repair your corrupted data!
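As a sketch of the hash-checking part (the manifest format and function names here are made up for illustration; pyFileFixity ships its own integrity generator/checker with richer metadata):

```python
# Record SHA-256 digests of every file once, then re-audit them later.
import hashlib, json
from pathlib import Path

def sha256_of(path):
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

def make_manifest(root, manifest):
    digests = {str(p.relative_to(root)): sha256_of(p)
               for p in Path(root).rglob("*") if p.is_file()}
    Path(manifest).write_text(json.dumps(digests, indent=2))

def check_manifest(root, manifest):
    digests = json.loads(Path(manifest).read_text())
    return [name for name, digest in digests.items()
            if not (Path(root) / name).is_file()
            or sha256_of(Path(root) / name) != digest]      # missing or changed
```

Run the check on each copy periodically; a file flagged on one copy can be restored from a copy where it is not flagged (or repaired by majority vote as above).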
Long-Term Archival
Even with the best currently available technologies, digital data can only be cold-stored for a few decades (about 20 years). In the long run you cannot rely on cold storage alone: you need to set up a methodology for your data archiving process that ensures your data can be retrieved in the future (even across technological changes) and that minimizes the risk of loss. In other words, you need to become the digital curator of your data, repairing corruption when it happens and creating new copies when needed.
There are no foolproof rules, but here are a few established curating strategies, and in particular a magical tool that will make your job easier:
In 2025, the next archivist may well be among the most important appointments President Donald Trump will make. Why? Because the federal government’s ability to function transparently, legally, and accountably now depends almost entirely on fixing a catastrophic, decades-long failure in electronic records management — a failure that no agency other than the National Archives and Records Administration (NARA), which the archivist heads, has the authority to repair.
Most Americans have no idea how bad the situation is. And that’s not their fault. If the government had been managing its electronic records as required by law, we would all have access to the information needed to understand how decisions are made, money is spent, crimes are investigated, and power is used.
Instead, we now have more than two decades of abject information chaos — a level of dysfunction that threatens the very foundations of democratic governance. NARA’s dysfunction has damaged transparency, as seen with Jan. 6, Russiagate, and Arctic Frost, to name a few. On oversight, it has contributed to the Pentagon’s inability to account for trillions of missing taxpayer dollars. With cybersecurity, it has resulted in the loss of more than 25 million classified electronic records, as demonstrated by the Office of Personnel Management’s data breach. Lastly, it has made it difficult to hold anyone accountable for federal health agencies’ misconduct.
In 1997, NARA endorsed the Defense Department’s DoD 5015.2-certified electronic records repositories as the official solution for managing federal electronic records. The problem? Those systems were designed by professional records managers who — through no fault of their own — had little to no understanding of electronic information management. The result was predictable: applications that were compliant on paper but fatally flawed in practice.
Federal agencies spent millions purchasing these certified systems. Yet not a single agency ever successfully deployed one in a production environment. The reasons are detailed in a stunning investigative report by the Epoch Times, which chronicles how these failures have cost taxpayers billions, compromised national security, and endangered the lives of innocent Americans. But the bigger story is this: because the DoD 5015.2 systems never worked, federal agencies never managed electronic records in accordance with the Federal Records Act at all.
If agencies and vendors fail to demonstrate a solution’s compliance with these requirements, NARA can reject it as a suitable solution for managing agency information. Thus, NARA effectively became the government’s default IT regulator — a role for which it was neither trained nor equipped.
As a result, the archivist of the United States, a position once considered ceremonial, suddenly became responsible for overseeing the digital infrastructure of the entire federal government.
In February, President Trump fired the previous archivist, historian Colleen Shogan. Given her lack of technical experience, her support for her predecessor’s participation in the FBI’s raid on Mar-a-Lago, and her questionable political independence, I fully supported that decision.
But it has left a vacuum at a time when NARA desperately needs leadership with vision, technical expertise, and the ability to rebuild trust across partisan divides. The president has not yet nominated a replacement.
The person who steps into this role will carry responsibility for ensuring the U.S. government can function in the digital age. If the next archivist fails, the consequences will cascade through every policy domain — from national security to public health to economic oversight.
The failure of federal electronic records management has already cost billions of dollars, jeopardized transparency, and eroded public trust. It has allowed agencies to operate in the shadows, shielded from accountability by systems too broken to track what they do.
The archivist of the United States is now the guardian of every recorded action of our government — and therefore the guardian of the public’s right to know.
Magnetic Tapes
Sub-Zero Cold Storage for the Permanent Preservation of Photographs, Motion Picture Films, Books, Newspapers, Manuscripts, and Historical Artifacts
Remember, this transfer that you (or I) are about to undertake may be the last time (and hopefully the best time) that the original is transferred. Here are some suggestions:
- Make at least two masters and a listening copy. Keep one set of masters off-site.
- Make straight transfers before processing. Save these, as better noise-processing algorithms may be available in the future.
- Save a good portion of the noise footprint on the tape without other signal information for later noise-reduction processing.
- At first, worry less about a final product than getting a good, clean transfer with as few artifacts as possible.
- Above all, listen… are you getting the best transfer you think you can?
This checklist is not a complete guideline. It contains only those items that experience and testing show will have an immediate or severe effect on magnetic tape. Failure to adhere to the items on this list may cause premature loss or deterioration of magnetic tapes and should be considered misuse of the medium. These are minimum handling requirements that summarize good practices.
If the restoration/preservation reformatting is for an institutional client, then the first transfers should be as unprocessed as possible — at least the initial copies that are archived should be done that way. The main reason for this is that processing algorithms will always get better, and they may hide information that is useful to future researchers — information that today we consider “noise.”
I am conservative when setting audio levels for transfers because there is no way of knowing the loudest portion of the signal in advance. So I generally transfer at 24 bits and then raise or normalize the level prior to dithering down to 16 bits for the distribution copy. If I’m working on music, I will generally archive the 88,200 or 96,000 samples/s files before the normalization.
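A minimal numpy sketch of that level workflow, assuming float samples in [-1, 1] from the 24-bit transfer (the author works in Samplitude, so the function here is hypothetical and only illustrates the signal math of normalize, dither, quantize):

```python
import numpy as np

def normalize_and_dither_to_16bit(x, headroom_db=0.3):
    """x: float samples in [-1, 1]; returns the int16 distribution copy."""
    peak = float(np.max(np.abs(x)))
    if peak == 0.0:
        return np.zeros(x.shape, dtype=np.int16)
    x = x * (10 ** (-headroom_db / 20) / peak)      # normalize, leave slight headroom
    rng = np.random.default_rng()
    lsb = 1.0 / 32768.0                             # one 16-bit LSB
    tpdf = (rng.random(x.shape) - rng.random(x.shape)) * lsb   # triangular dither
    y = np.clip(x + tpdf, -1.0, 1.0)
    return np.round(y * 32767.0).astype(np.int16)   # quantize to 16 bits
```

Dithering before the bit-depth reduction trades a tiny amount of broadband noise for freedom from quantization distortion, which is why it comes after normalization and last in the chain.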
Processing should generally be done on a copy. The exception to this, in my mind, is private clients who want the best possible copy of their parents’ wedding or some other important event. If applied conservatively, noise reduction and equalization will be appreciated by these clients, and most of them won’t care a bit that it’s been processed. I keep the unprocessed files on my servers until I am sure the client is happy with the processed version.
As to what to use, there are a wide variety of options available. At the high-end, this falls into the category of “remastering” rather than simple restoration and I’m sure there are options that I’m not aware of.
As a first step, I am very pleased with the basic capabilities built into Samplitude. In addition to that, I use Algorithmix Noise Free Pro as well as the Sound Laundry suite. Really tough projects can often be improved by the filters in Diamond Cut 6 Live/Forensics and most of the filters are available in the lower-priced Diamond Cut 6. Diamond Cut and their main dealer, Tracertek, often run sales which was how I upgraded to Live/Forensics.
Other products with excellent reputations are Cedar Cambridge, Quadriga Audio Cube, and many others. Listening to and discussing with other users via one or more of the mailing lists listed here is very useful.
Often a tape comes in for restoration that has been poorly wound or poorly stored. Here is an example:
(image: a cinched tape)
One of the interesting things about this particular tape was it had been recently wound on a constant-tension professional machine prior to shipping to me.
We think that the entire tape had not been rewound, allowing the higher-tension wind to compress the inner core slightly and cause this cinching. After transferring the tape (which did not show much ill effect from its cinching), we still found it difficult to get the tape to wind smoothly on the reel.
Therefore, our current suggestion is: if you find a tape like this, do not rewind it to try to clear up the cinching unless you are also ready to transfer the tape, as there is no guarantee that it can be wound better after unwinding.
The Permanence and Care of Color Photographs: Traditional and Digital Color Prints, Color Negatives, Slides, and Motion Pictures
by Henry Wilhelm, with contributing author Carol Brower
In 2006, I wrote a blog post (here) called “Let Sleeping Tapes Lie: What to do with poorly wound tapes”. For years, tape experts have been suggesting that it is not as good an idea to rewind tapes as was originally thought. This was partially based on the fact that most rewinding in archives was done on the oldest, junkiest machines so as to not wear out the good machines. Unless rewinding is done on high-quality tape transports, it is indeed counter-productive.
Here is a list of my picks of free and low-cost software tools. I am sticking with Samplitude Professional for audio, and Adobe Photoshop and Adobe Photoshop Lightroom for photos and graphics. The other alternatives, however, are wide open.
The International Association of Sound and Audiovisual Archives (IASA) has released its landmark Guidelines on the Production and Preservation of Digital Audio Objects as a free web (HTML) edition, available at http://www.iasa-web.org/tc04/audio-preservation.
At some point, this tape was played on a 1/4-track machine that injected hum onto the left channel. Here’s what the magnetic viewer showed:
(image: magnetic viewer view of the tape's recorded tracks)
At the very top we can see a remnant of the left channel material, then the 120-Hz bars (62.5 mil spacing), then the remainder of the left channel material. In the middle is the guard band and at the bottom, the right channel.