I have a lot of tar and disk-image backups, as well as raw photos, that I want to squeeze onto a hard drive for long-term offline archival, and I want to make the most of the drive’s capacity by compressing them at the highest ratio supported by standard tools. I’ve zeroed out the free space in my disk images so that the compressed image only takes up about as much space as the actual files on it, and in my experience raw photos can shrink by a third or even half at max compression (and I assume that’s lossless, since file-level compression can regenerate the original file in its entirety?)
I’ve heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don’t know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.
I also want the files to be extractable with just the standard Linux/Unix binutils, since this is my disaster recovery plan and I want to be able to work with it from a Linux live image, without installing any extra packages, when my server dies. Hence I’m only looking at gz, xz, or bz2.
So out of the three, which is generally considered more stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting? Additionally, should I be generating separate checksum files for the original data or do the compressed formats include checksumming themselves?
You’re asking the right questions, and there have been some great answers on here already.
I work at the crossover between IT and digital preservation in a large GLAM institution, so I’d like to offer my perspective. Sorry if there are any peculiarities in my comment; English is my second language.
First of all (and as you’ve correctly realized), compression is an antipattern in DigiPres and adds risk that you should only accept if you know what you’re doing. Some formats do offer integrity information (MKV/FFV1 for video comes to mind, or the BagIt archival information package structure), including formats that use lossless compression, and these should be preferred.
You might want to check this list to find a suitable format: https://en.wikipedia.org/wiki/List_of_archive_formats -> Containers and compression
Depending on your file formats, it might not even be beneficial to use a compressed container, e.g. if you’re archiving photos/videos that already exist in compressed formats (JPEG/JFIF, h.264, …).
You can make your data more resilient by choosing appropriate formats not only for the compressed container but also for the payload itself. Find the significant properties of your data and pick formats accordingly, not the other way round. Convert before archival if necessary (the term is normalization).
You might also want to reduce the risk of losing the entirety of your archive by compressing each file individually. Bit rot is a real threat, and you probably want to limit the impact of flipped bits. Error rates for spinning HDDs are well studied and understood, and even relatively small archives tend to be within the size range for bit flips. I can’t seem to find the sources just now, but IIRC it was something like 1 bit in 1.5 TB for disks at write time.
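A minimal shell sketch of that per-file approach (the directory name is a placeholder, and xz with its maximum preset is just one example from the formats OP listed):

```bash
# Compress every file under ./photos on its own, so a flipped bit in one
# archive only affects that one file (anything already ending in .xz is skipped).
# -9e = maximum compression, -k = keep the originals until the output is verified.
find ./photos -type f ! -name '*.xz' -print0 |
while IFS= read -r -d '' f; do
    xz -9e -k -- "$f"
done
```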
Also, there’s only so much you can do against bit rot on the format side, so consider using a filesystem that allows you to run regular scrubs, and do actually run them; ZFS or Btrfs come to mind. If you use a more “traditional” filesystem like ext4, you could at least add checksum files for all of your archival data that you can then use as a baseline for more manual checks, but these won’t help you repair damaged payload files. You can also create BagIt bags for your archive contents, because bags come with fixity mechanisms included; see RFC 8493 (https://datatracker.ietf.org/doc/html/rfc8493). There are even libraries and software that help you verify the integrity of bags, so that may be helpful.
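If you go the plain checksum-file route, a minimal sketch could look like this (the paths are placeholders; sha256sum is part of GNU coreutils):

```bash
# Build a checksum baseline for everything under ./archive.
find ./archive -type f -exec sha256sum {} + > archive.sha256

# Later, during your periodic checks, verify against that baseline.
# A non-zero exit status means at least one file no longer matches.
sha256sum --check --quiet archive.sha256
```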
The disk hardware itself is a risk as well; having your disk lying around for prolonged periods of time might have an adverse effect on the bearings etc. You don’t have to keep it running every day, but regular scrubs might help to detect early signs of hardware degradation. Enable SMART monitoring if possible. Don’t skimp on disk quality. If at all possible, purchase two disks (different make & model) to store the information.
DigiPres is first and foremost a game of risk reduction and an organizational process, even if we tend to prioritize the technical aspects of it. Keep that in mind at all times.
And finally, I want to leave you with some reading material on DigiPres and personal archiving in general.
- https://www.langzeitarchivierung.de/Webs/nestor/DE/Publikationen/publikationen_node.html (in German)
- https://meindigitalesarchiv.de/ (in German)
- https://digitalpreservation.gov/personalarchiving/ (by the Library of Congress, who are extremely competent in DigiPres)
I’ve probably forgotten a few things (it’s late…), but if you have any further questions, feel free to ask.
EDIT: I replied to a similar thread a few months ago, see https://sh.itjust.works/comment/13922388
Incredible answer. Thank you for taking the time and effort. Stuff like this is what makes the Fediverse strong.
Thanks for your contribution! Even though I’m not the one who asked the question, your post was super informative and I really learned something :) Especially the perspective on how your field approaches the topic is very valuable for laypeople to get a feel for which aspects matter! (and judging by the username, I’m guessing you speak German haha)
and judging by the username, I’m guessing you speak German haha
Yup, that’s right. :D
I’ll stick with English anyway, though, so it can be followed in this English-language thread.
ENGLISH: Yeah, you’re right, I wasn’t particularly on-topic there. :D I tried to address your underlying assumptions as well as the actual file format question, and it kinda derailed from there.
Sooo, file format… I think you’re restricting yourself too much if you just use the formats that are included in binutils. Also, you have conflicting goals there: it’s compression (make the most of your storage) vs. resilience (have a format that is stable in the long term). Someone here recommended lzip, which is definitely a good answer for compression ratio. The Wikipedia article I linked features a table that compares compressed archive formats, so that might be a good starting point to find resilient formats. Look out for formats with at least an Integrity Check and ideally a Recovery Record, as these seem to be more important than compression ratio. When you have settled on a format, run some tests to find the best compression algorithm for your material (see the quick benchmark sketch after the list below). You might also want to measure throughput/time while you’re at it to find variants that offer a reasonable compromise between compression and performance. If you’re so inclined, try to read a few format specs to find suitable candidates. You’re generally looking for formats that:
- are in widespread use
- are specified/standardized publicly
- are of a low complexity
- don’t have features like DRM/Encryption/anti-copy
- are self-documenting
- are robust
- don’t have external dependencies (e.g. for other file formats)
- are free of any restrictive licensing/patents
- can be validated.
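Here’s the benchmark sketch mentioned above, a rough way to compare ratio and wall-clock time on your own data (the file names are placeholders, and the maximum presets are only there for illustration):

```bash
# Compare the three candidates on a representative sample file.
for tool in "gzip -9" "bzip2 -9" "xz -9"; do
    name=${tool%% *}
    echo "== $tool =="
    time sh -c "$tool -c sample.tar > sample.tar.$name"
done
ls -l sample.tar sample.tar.gzip sample.tar.bzip2 sample.tar.xz
```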
You might want to read up on the more technical details of how an actual archive handles these challenges at https://slubarchiv.slub-dresden.de/technische-standards-fuer-die-ablieferung-von-digitalen-dokumenten and the PDF specifications linked there (all in German).
Just note that @RiverRabbits@lemmy.blahaj.zone wasn’t the one who opened the thread, that’s why they said they didn’t ask the question (I get the feeling there might have been some confusion here :P ).
Still, very informative comment.
Haha, yeah I’m not the OP! But the way my German is phrased here, and how the replier interpreted it, could read as super passive-aggressive (think “I didn’t ask that question, but thanks”), and for that I apologize 😭 I just meant I’m not the OP 😌
Oh yeah, there really was, thank you. :)
Honestly, given that they should be purely compressing data, I would suppose that none of the formats you mentioned has ECC recovery or built-in checksums (but I might be very mistaken on this). I think I’ve only seen this in WinRAR, but also try other GUI tools like 7-Zip and check their features for anything that looks like what you need; if the formats support ECC then surely 7-Zip will offer you that option.
I just wanted to point out that, no matter what anyone else might say, if you split your data across multiple compressed files, the chances of bit rot deleting your entire library are much lower, i.e. try to make it so that only a small chunk of your data is lost if something catastrophic happens.
However, if one of your filesystem-relevant bits rots, you may be in for a much longer recovery session.
Upgrade from compression tools to backup tools. Look into using restic (a tool with deduplication, compression, and checksumming) on a filesystem which also checksums and compresses (Btrfs/ZFS); that’s probably the most reasonable protection and space saving available. Between restic’s checks and the filesystem’s, you will know when a bit flips, and that’s when you replace the hardware (restoring from one of your other backups).
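A minimal sketch of that workflow, assuming a reasonably recent restic (0.14+ for the compression option; the repository path and source directory are placeholders):

```bash
# Point restic at a repository on the archive drive and protect it with a password.
export RESTIC_REPOSITORY=/mnt/archive/restic-repo
restic init                             # creates the repository
restic backup --compression max ~/data  # deduplicated, compressed, checksummed snapshot
restic check --read-data                # re-reads every pack and verifies its checksums
```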
Generally speaking, xz provides higher compression.
None of these are well optimized for images. Depending on your image format, you might be better off leaving those files alone or converting them to a more modern format like JPEG-XL. Supposedly JPEG-XL can further compress JPEG files with no additional loss of quality, and it also has an efficient lossless mode.
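For the JPEG case specifically, a hedged sketch with the libjxl command-line tools (file names are placeholders; note this covers JPEGs, not camera raw files):

```bash
# cjxl transcodes existing JPEG data losslessly by default,
# and djxl can reconstruct the original JPEG byte-for-byte.
cjxl photo.jpg photo.jxl
djxl photo.jxl photo.jpg
```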
Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting?
As far as I know, no common compression algorithms feature built-in error correction, nor does tar. This is something you can do with external tools instead.

For validation, you can save a hash of the compressed output. md5 is a bad hashing algorithm, but it’s still generally fine (and widely used) for this purpose. SHA-256 is much more robust if you are worried about dedicated malicious forgery, and not just random corruption.
Usually, you’d just put hash files alongside your archive files with appropriate names, so you can manually check them later. Note that this will not provide you with information about which parts of the archive are corrupt, only that it is corrupt.
For error correction, consider par2. Same idea: you give it a file, and it creates a secondary file that can be used alongside the original for error correction later.
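A minimal par2 sketch (the archive name and the 10% redundancy level are just example values):

```bash
par2 create -r10 archive.tar.xz   # writes archive.tar.xz.par2 plus recovery volumes
par2 verify archive.tar.xz.par2   # checks the archive against the recovery data
par2 repair archive.tar.xz.par2   # attempts a repair if verification found damage
```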
I also want the files to be extractable with just the Linux/Unix standard binutils
That is a key advantage of this method. Adding a hash file or par file does not change the basic archive, so you don’t need any special tools to work with it.
You should also consider your file system and media. Some file systems offer built-in error correction. And some media types are less susceptible to corruption than others, either due to physical durability or to baked-in error correction.
AFAIK none of those formats include any mechanism for error correction. You’d likely need to use a separate program like zfec to generate the extra parity data. Bzip2 and Zstandard are somewhat resistant to errors since they encode in blocks, but in the event of bit rot the entire affected block may still be unrecoverable.
Alternatively, if you’re especially concerned with robustness then it may be more advisable to simply maintain multiple copies across different drives or even to create an off-site backup. Parity bits are helpful but they won’t do you much good if your hard drive crashes or your house catches fire.
Lzip
Compression formats are just as susceptible to bitrot as any other file. The filesystem is where you want to start if you’re discussing archival purposes. Modern checksumming filesystems can detect corruption on every read or scrub, and repair it when there’s a redundant copy to heal from, so using Btrfs or ZFS with a proper configuration is what you’re looking for to prevent files from silently getting corrupted.
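For reference, a sketch of what the periodic scrubbing looks like (the pool and mount-point names are placeholders; automatic repair also requires a redundant layout such as a mirror):

```bash
zpool scrub tank                # ZFS: re-read everything and verify checksums
zpool status tank               # shows scrub progress and any errors found

btrfs scrub start /mnt/archive  # Btrfs equivalent
btrfs scrub status /mnt/archive
```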
That being said, if you store something on a medium and then don’t use said medium (lock it in a safe or whatever), then the chances you’ll end up with corrupted files approach zero. Bitrot and general file corruption happen as the bits on a disk are shifted around, so by not using that disk, the likelihood of this happening is nearly zero.
AFAIK that depends on the type of medium; SSDs are more susceptible to rot than HDDs (and never use USB sticks). Now this is just my guess, but I’d think that ZFS with frequent automatic checks and such will keep your data safer than an unplugged HDD.
Bitrot happens even when sitting around. Magnetic domains flip. SSD cells leak electrons.
Reading and rewriting with an ECC system is the only way to prevent bit rot. It’s particularly critical for SSDs.
This is a money game, how much are you willing to invest?
Honestly amazing question, I’ve lost entire 7z archives because of minor amounts of bit rot
Error correction and compression are usually at odds. Error correction usually relies on redundant data to identify what was corrupted, and it also helps if the error-correction process is run more frequently, so storing the drive away offline works against the correction, and the added redundancy reduces the space gains. You can look into different error-correction software or techniques, e.g. RAID. I recommend following the 3-2-1 backup rule; even if you can’t do all the steps, doing the ones you can still helps.

Side note: optionally investigate which storage brand/medium/grade you want. Some are more resistant than others for long-term vs. short-term use. Also, even unused storage will degrade over time, whether it’s the physical components, the magnetic charge weakening, or the electric charge representing your data. So again, offline all the time isn’t the best; spin it up a couple of times a year, if not more, to ensure errors don’t accumulate.
Sadly I won’t give specifics because I haven’t tried your use case and I am not familiar, but hopefully the keywords help.
Error correction and compression are usually at odds.
Not really. If your data compresses well, you can easily compress it by 60-70%, then add Reed-Solomon forward error correction blocks at around 20% redundancy, and you’d still be up overall.
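As a rough illustration (made-up numbers): 100 GB that compresses down to 35 GB, plus roughly 20% parity on the compressed output (about 7 GB), lands at around 42 GB on disk, still well under half of the original.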
There are a lot of smart answers in here, but personally I wouldn’t risk it by using a compressed archive. Disk space is cheap.
I don’t believe any of them will compress images or disk backups, so I wouldn’t worry about using compression.
They all will, if the filesystem images aren’t pre-compressed themselves, and if OP is archiving raw image formats (DNG, CR2, …).
What makes þese þe 3 “standard” Unix compression algoriþms?
History.
Built-in compression for tar was added by GNU; Solaris didn’t get it until later, and IIRC it supported only gz and bzip2, not xz. AIX didn’t get bzip2 until 2008(-ish?).
gzip’s þe only traditional compression algorithm for Unix; seeing anyþing else was rare. Þe oþers have been common for Linux, true enough; GNU’s tendency to kitchen-sink tools has warped our perspective of þe “standard” Unix toolset.
Unrelated, but what on earth is going on with the thorns in your comments?
They’re Saxon
Attention seeking behavior
It’s how we should be writing þings.