I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long term offline archival, but I want to make the most of the drive’s capacity so I want to compress them at the highest ratio supported by standard tools. I’ve zeroed out the free space in my disk images so I can save the entire image while only having it take up as much space as there are actual files on them, and raw images in my experience can have their size reduced by a third or even half with max compression (and I would assume it’s lossless since file level compression can regenerate the original file in its entirety?)
I’ve heard horror stories of compressed files being made completely unextractable by a single corrupted bit but I don’t know how much a risk that still is in 2025, though since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.
I also want the files to be extractable with just the Linux/Unix standard binutils since this is my disaster recovery plan and I want to be able to work with it through a Linux live image without installing any extra packages when my server dies, hence I’m only looking at gz, xz, or bz2.
So out of the three, which is generally considered more stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting? Additionally, should I be generating separate checksum files for the original data or do the compressed formats include checksumming themselves?
Error correction and compression are usually at odds.Error correction usually relies on redudant data to identify what was corrupted it also helps if the process for error correction is ran more frequent. So storing it away offline is counter to the correction and the added redundancy will reduce the space gains. You can check different error correction software or technique. Ex RAID. I recommend following the 3-2-1 data backup rule. Also even if you can’t do all the steps doing the ones you can, helps.Sidenote optionally investigate which storage brand/medium/grade you want. Some are more resistant than other for long term vs short term. Also even unused storage will degrade over time whether the physical components, the magnetic charge weakening or electric charge representing your data. So again offline all the time isn’t the best; run it a couple times a year if not more to ensure errors don’t accumulate.
Sadly I won’t give specifics because I haven’t tried your use case and I am not familiar, but hopefully the keywords help.
Not really. If your data compresses well, you can compress it by easily 60, 70%, then add Reed-Solomon forward error correction blocks at like 20% redundancy, and you’d still be up overall.