Hacker News

I'd argue that bzip2 is a better example of a compression algorithm which no one needs anymore.

Considering these features:

  * Compression ratio
  * Compression speed
  * Decompression speed
  * Ubiquity
And considering these methods:

  * lzop
  * gzip
  * bzip2
  * xz
You get spectrums like this:

  * Ratio:    (worse) lzop  gzip bzip2  xz  (better)
  * C.Speed:  (worse) bzip2  xz  gzip  lzop (better)
  * D.Speed:  (worse) bzip2  xz  gzip  lzop (better)
  * Ubiquity: (worse) lzop   xz  bzip2 gzip (better)
So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.


You can easily apply the same argument to xz here, by introducing something rarer with an even better compression ratio (e.g. zpaq6+). Now xz isn't the best at anything either.

But despite zpaq being public domain, few people have heard of it, and the Debian package is ancient, so the ubiquity argument really does count for something after all.


"This package has been orphaned, but someone intends to maintain it. Please see bug number #777123 for more information"

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777123 - get in touch with the new owner of the package if you're interested. It's probably on their Never Ending Open Source To Do List.


No, xz (on a particular level setting) is both faster than bzip2 and provides better compression ratio, but zpaq is just slower.


>No, xz (on a particular level setting) is both faster than bzip2 and provides better compression ratio, but zpaq is just slower.

Are you implying that xz out-compresses zpaq? Can you supply a benchmark?

Here's one from me - http://mattmahoney.net/dc/text.html - showing a very significant compression ratio advantage to zpaq.


No, where did you find this implication in my comment? What I meant is that:

* xz is faster than bzip2 and provides better compression ratio [than bzip2]

* zpaq is slower [than bzip2 and provides better compression ratio than bzip2]

But it looks like I'm mistaken? It seems zpaq can actually be both faster than bzip2 and give a better compression ratio, can't it?


> So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.

Two points:

(1) It's very, very easy for the best solution to a problem not to simultaneously be the best along any single dimension. If you see a spectrum where each dimension has a unique #1 and all the #2s are the same thing, that #2 solution is pretty likely to be the best of all the solutions. Your hypothetical example does actually make a compelling argument that bzip2 is useless, but that's not because it doesn't come in #1 anywhere; it's because it comes in behind xz everywhere. (Except ubiquity, but that's likely to change pretty quickly in the face of total obsolescence.)

(2) lzop, in your example, is technically "the best at something". But that something is compression and decompression speed, and if your only goal is to optimize those you can do much better by not using lzop (0 milliseconds to compress and decompress!). So that's actually a terrible hypothetical result for lzop.

Heck, zero compression easily wins three of your four categories.
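
To make that concrete, the null compressor is just a byte-for-byte copy; `cat` stands in for it in this sketch (file names are made up):

```shell
# "Zero compression": cat copies bytes untouched, so "compression" and
# "decompression" run at raw I/O speed, and the tool exists everywhere.
printf 'some data worth keeping' > input
cat input > output       # "compress"
cat output > restored    # "decompress"
cmp input restored && echo "round-trip OK"
```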


Zero compression is very often the correct choice these days.


No, even when speed matters, lz4 is sometimes the best answer. I wrote a data sync that worked over a 100 Mbps WAN, and running lz4 on the serialised data transferred far faster than the raw data. It's not just the network, either: you can often process data faster too (especially on spinning disks), since the reduction in disk I/O can in some cases speed up the processing itself.
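
A local sketch of that pattern, assuming the lz4 CLI is installed (directory and file names are made up): serialise, compress the stream in flight, then decompress and unpack, exactly as the two ends of a sync would.

```shell
# Source side: tar the data and compress the stream with lz4.
mkdir -p src && printf 'payload' > src/file.txt
tar -cf - src | lz4 -c > stream.lz4

# Destination side: decompress the stream and unpack it.
mkdir -p dest && lz4 -dc stream.lz4 | tar -xf - -C dest

cmp src/file.txt dest/src/file.txt && echo "sync OK"
```

Over a real WAN the middle file would be replaced by the network pipe (e.g. through ssh), which is where the bandwidth savings pay off.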


Being second-best on ratio and ubiquity is still pretty handy for serving files. It's compress once, decompress on somebody else's machine, so neither compression speed nor decompression speed matters much. Ratio saves you money, and ubiquity means people can actually use the file.


> It's compress-once, decompress on somebody else's machine, so neither of those matter.

Last week, there was a drive mount that was filling up at roughly 30Gb/hr. The contents of that mount were used by the web application. Deletion was not an option. Something that compressed quickly was needed. And on the retrieval end, when the web app needs to do decompression, seconds matter.


I found lz4 to be the best for general purpose analysis, it increased the throughput of my processing 10x compared to bz2. Then if you're working with very large files you can use the splittable version of lz4, 4mc, which also works as a Hadoop InputFormat. I just wish they would switch the Common Crawl archives to lz4.

I should probably mention the compression ratio was slightly worse than bz2 (maybe 15% larger archive) but for the 10x increase in throughput I didn't really mind that much. I could actually analyze my data from my laptop!


If I'm actually doing something with my data, gzip -1 beats out lz4 for streaming: gzip -1 can usually keep up with the slower of the in/out sides, gets a higher compression ratio than lz4, and compresses (though doesn't decompress) faster than lz4hc.


I just tested this on my laptop, using the first 5 million JSONLines of /u/stuck_in_the_matrix's Reddit dataset (~4.6GB).

For compression lz4 took ~22 seconds (~210 MB/s) and I got ~30% compression, gzip -1 took ~56 seconds (~80 MB/s) and I got ~22% compression.

For decompression lz4 gave me 500MB/s while gunzip gave me 300MB/s.

Commands used:

    lz4 -cd RS_full_corpus.lz4 | pv | head -5000000 | gzip -1 > test.gz

    gunzip -c test.gz | pv > /dev/null


    lz4 -cd RS_full_corpus.lz4 | pv | head -5000000 | lz4

    lz4 -cd stdin.lz4 | pv >/dev/null


Interesting; on a mix of source and binaries (archived fully-built checkouts) gzip -1 outperformed lz4 in compression ratio.


No, you're correct: gzip -1 outperformed lz4 in my test in compression ratio too. I don't know why I typed "30% compression" instead of "a compression ratio of 30%." Sorry about that.


FYI, this cool little project — https://code.google.com/p/miniz/ — implements faster gzip compression at level 1.


Last time I checked, the lz4 Python library did not have streaming decompression support. That will be a problem for larger files like Common Crawl if you aren't planning to pre-decompress before processing.


It's not a problem for me since I mostly use Java. However, you can probably just pipe in your data from the lz4 CLI, then feed that input stream to whatever Python parser you're using, and you should be fine.

The biggest problem is finding a parser that can do 600MB/s streaming parsing. If you use a command-line parser, don't try jq, even with GNU parallel.
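
The pipe approach suggested above might look like this, assuming the lz4 CLI is installed (the file name and the one-liner "parser" are made up for the demo):

```shell
# Create a small JSONLines sample and compress it.
printf '{"a":1}\n{"a":2}\n{"a":3}\n' > corpus.jsonl
lz4 -f corpus.jsonl corpus.jsonl.lz4

# Let the lz4 CLI do the decompression and stream plain text into any
# line-oriented parser over stdin; the Python side needs no lz4 binding.
lz4 -dc corpus.jsonl.lz4 | python3 -c 'import sys; print(sum(1 for _ in sys.stdin))'
# prints 3
```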


Being the best at something does not necessarily make it the best choice for most situations, as this example trivially shows. Assume that for each of the four measured aspects there is a program that is the best at it, but that in the other three aspects it is orders of magnitude worse than the best in each. Now consider another program which is best at nothing, but is 95% of the way to the best in every aspect. It's never best in any aspect, but it's clearly a good choice for many, if not most, situations.


Doesn't bzip2 have a concurrent mode that those others don't?


bzip2 can take advantage of any number of CPU cores when compressing.


Bzip2 doesn't handle multiple cores as far as I'm aware, but tools such as pbzip2 can. I wrote about this some time ago: https://hackercodex.com/guide/parallel-bzip-compression/

That said, parallel XZ is even better: https://github.com/vasi/pixz



I don't believe that's true, though there are multiple projects that offer that feature: lbzip2.org and compression.ca/pbzip2, to name a couple.


Really? How? My bzip2 has no option for it, and when I tested it, it stuck to one CPU. xz, on the other hand, has:

  -T, --threads=NUM   use at most NUM threads; the default is 1; set to 0
                      to use the number of processor cores
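
So with a recent enough xz (>= 5.2.0), parallel compression is just a flag; a quick sketch (the file is a stand-in):

```shell
head -c 1000000 /dev/zero > big.tar   # stand-in file for the demo
xz -T0 -k -f big.tar                  # -T0: one worker per core; -k keeps the original
xz -dc big.tar.xz | wc -c             # back to 1000000 bytes
```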


Side note: `xz` only got the -T option in a stable release less than 12 months ago (5.2.0, released 2014-12-21), so it hasn't made it into every distro yet.


The xz installed on my system carries a rather promising -T option, but then this text below it:

> Multithreaded compression and decompression are not implemented yet, so this option has no effect for now.


I believe you're looking for pbzip2, the parallel bzip2 file compressor. It has replaced plain bzip2 as my go-to compression tool.


lbzip2 is pretty good


pbzip2 output isn't universally readable by third-party bz2 decompressors (Hadoop, for example).


If you'd included the "zip" format in your analysis, gzip would not be the best at something anymore.


I use bzip2 purely for sentimental reasons.



