Warning: The Content in this Article May be Inaccurate
Readers have reported that this story contains information that may not be accurate.
252 Comments
- inactive, on 10/12/2007, -4/+81Yeah, I'm going to compress my data with something called "KGB" Archiver. And tomorrow, I'm going to start using the CIA Email Client and the FBI Calendar Server.
- RichyFreeway, on 10/12/2007, -3/+46Ran netstat -a with the extraction running and I have a connection going out to static.fmpub.net, stop the extraction and it goes away...
Anyone else want to confirm this? - maverick999, on 10/12/2007, -5/+35Let's hope M$ doesn't see that you just posted this online...
- NiteMayr, on 10/12/2007, -2/+25Hey for those who are extracting it... are you watching your network connection while it does it?
- kflasch, on 10/12/2007, -6/+28I wouldn't trust downloading and opening this.
- illynova, on 10/12/2007, -18/+37Wow. This is REALLY amazing. It took my 3.15MB set of files (just a whole bunch of executables... gtalk, 7zip.exe, etc) and.... turned them into an archive 3.10MB in size!
Amazing. Winzip set to maximum compression turns the SAME archive into a 1.8MB archive.
Get lost. - fortytl, on 10/12/2007, -1/+20It's working on a machine with the network cable unplugged.
- ascheinberg, on 10/12/2007, -3/+19You may not understand how archiving works. Some types of files compress better than others, because they have compression built into the file type algorithm.
Just because your set of exe's didn't work too well doesn't mean the program can't do super compression on other file types. - shayne_sweeney, on 10/12/2007, -2/+18It totally works .... however it takes hours to extract large archives. On my dual-core 2.8ghz system it was going to take 4hrs to uncompress. --- This will probably not change file transfers as we know it on the Internet/Usenet circuit.
I disabled my network connection --- easier than using TCPView, which is too much for a simple task. It's getting all the data from the 1.44mb file, probably a set of algorithms that generates the exact byte data as before.
Nowadays space is cheap so factoring the time it takes to decompress, it's not worth its weight when compared to a $50 250GB HDD that could store minimally compressed images with ease.
Looking forward to diving into the source code. --- Anybody find a technical DOC on it explaining their ideas?
- jpwilmsn, on 10/12/2007, -4/+18To the people complaining about the hardware requirements - you don't get it. Come back in a couple of years and you will find your complaints about requiring a 'whopping' 2GB of RAM to be silly - it will be substandard very shortly. Increasing the power of hardware is GOING to happen, there are no questions about that. The point is that this is a much better compression algorithm than what we are using now - especially (apparently) for large files. Think about the implications of THAT. Eventually, the power it takes to decompress this in real time would very likely be available on a network card. Think about the ability to download gigs and gigs of data compressed so it takes mere megs over the internet. As we become more and more wired (plus the emergence of IPTV, VOIP, etc), there is going to be more and more stress on our network backbones. Compression can and will be a huge part of the expansion of the internet.
I'll read through the site and digg if appropriate, but I think you guys are missing the boat on this one... - Jugalator, on 10/12/2007, -0/+14"It 's a hoax. If it can compress already compressed files (jpg for instance), then can it compress itself? Can the 400MB iso -> 1.4MB -> 4.9KB -> 0.01B (that's less than 1 bit)???"
Wow, you have no idea how *any* archiver works. - merreborn, on 10/12/2007, -0/+13That's silly. By that argument, winzip can't possibly work either.
But it does.
- tazamore, on 10/12/2007, -1/+12FOR ALL OF YOU WHO ASSUME OPEN SOURCE == TRUSTWORTHY
Consider:
1. "caitlyn" has never posted anything on digg before this
2. Anyone can have an anonymous open source project on SourceForge
3. The binary version that everyone is downloading and testing does not have to be built from the same source code that is posted there.
4. What honest person would invite everyone to download Microsoft Office ISO from them?
5. The compression claims are too good to be true.
6. I'd be very worried why this program is trying to connect to external servers as already reported.
I'm not saying closed source is any more secure. Hell no. I'm just saying that the fact that this claims to be an open source project doesn't make it honest.
- pentomino, on 10/12/2007, -0/+10Why is MS Office the test file? Why not something, I don't know, LEGAL? Like a Linux installation? KnoppMyth?
- ZachPruckowski, on 10/12/2007, -1/+11Because hardware capability is increasing faster than download speed? And bandwidth is expensive?
- segosa, on 10/12/2007, -1/+10Took me 8 hours, although I guess my PC isn't "average". (3GHz, 1GB RAM)
- segosa, on 10/12/2007, -1/+10I'm not an expert on the algorithm but I'd say it works especially well on extremely large files because it finds large patterns, then patterns inside those. Although, that's a guess. I'm just saying it's probably been written with extremely large files in mind, if anything.
- DJB31st, on 10/12/2007, -0/+9the program is clearly real... the office claim and digg post clearly fake
- danpsmith, on 10/12/2007, -0/+8You obviously have no idea how a compression algorithm works. Let's try a little example:
If you shorthand something, say you have 18 0s in a row, then 3 1s, for example. You can put a zero bit, and then denote the number 18 in the file to show how many 0s you replaced, and now you've shortened the file a bit in a way that can be uncompressed again; it's shorthand. Now, let's take that same file we just compressed, and try it again: there's a 0 with a binary 18 behind it. Wow, not as compressible, eh? The compression yields less results because it's already compressed, obviously. There isn't an exact "factor" that you compress stuff by, it depends on the contents of the file. But I'm sure you already knew that....
- miclill, on 10/12/2007, -0/+8The tests they provide are not that promising but I always wanted to have MS-Office. ;-)
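danpsmith's shorthand example above is essentially run-length encoding. A minimal sketch of it follows, assuming the "file" is just a string of '0' and '1' characters; the helper names `rle_encode` and `rle_decode` are made up for illustration and have nothing to do with KGB's actual code.

```python
from itertools import groupby


def rle_encode(symbols: str) -> list[tuple[str, int]]:
    """Collapse each run of identical symbols into a (symbol, run_length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(symbols)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (symbol, run_length) pairs back into the original string."""
    return "".join(symbol * count for symbol, count in pairs)


# First pass: 18 zeros followed by 3 ones collapse into two pairs.
original = "0" * 18 + "1" * 3
first_pass = rle_encode(original)
assert rle_decode(first_pass) == original

# Second pass: write the pairs out as text and try to compress that again.
flattened = "".join(f"{symbol}{count}" for symbol, count in first_pass)
second_pass = rle_encode(flattened)

print(first_pass)        # two pairs for 21 symbols
print(flattened)         # the already-compressed form
print(len(second_pass))  # more pairs than before: nothing left to squeeze
```

The second pass has no long runs to work with, which is the point of the comment: output that is already compressed looks much less regular than the input did.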
- killtherat, on 10/12/2007, -2/+10..."Shirley this is more compression than the Shannon Limit allows?"
Yes it is... And don't call me Shirley - dodd, on 10/12/2007, -0/+8this office thing is fake. most files inside are empty, check forums here: http://forum.infojama.pl/viewtopic.php?t=77533&start=15 . it's in polish, but total commander screenshot is self-explanatory.
- mkrygeri, on 10/12/2007, -1/+9YAY. I knew I'd have a use for all those floppies I've kept from '95!
- vann, on 10/12/2007, -0/+8No, this is no scam. I'd wager most people here just don't understand data compression.
Anyhow, I looked at the source, and KGB is using a predictive arithmetic coding technique. This is one of the more advanced (and expensive) entropy encoding techniques, but for a given model it produces near-optimal (in the information theoretic sense of the word) results. The problem is producing a good model. If your model is trained on Office then you get great encoding for things that are similar to Office, but crap encoding for things sufficiently different.
It appears KGB tries to alleviate this by making a few passes over the data to be compressed in order to bootstrap its model using several constituent submodels, and weighting them accordingly. That it doesn't work well for X or Y file is a practical, not a theoretical defect. It also says nothing about its overall performance. I mean, in some cases gzip compresses a file better than bzip2, and in other cases the opposite is true. That doesn't mean gzip sucks or bzip2 sucks -- you can't write a compression algorithm which compresses everything equally well.
Of course, being as we'd use this for practical things, a practical defect is nothing to sneeze at. So it's not a scam, although its use is obviously going to be limited if it can't outperform faster compression schemes.
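To make the "the problem is producing a good model" point concrete, here is a rough sketch that estimates how many bits an ideal entropy coder would spend under two simple adaptive models. This is not KGB's model. The function `ideal_bits`, the Laplace-style smoothing, and the sample data are all assumptions chosen for illustration. An arithmetic coder driven by the same predictions would land close to these totals.

```python
import math
from collections import defaultdict


def ideal_bits(data: bytes, order: int) -> float:
    """Sum of -log2(p) over the data, where p comes from an adaptive model
    that predicts each byte from the `order` bytes before it."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> byte -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        # Add-one smoothing over the 256 possible byte values.
        p = (counts[context][symbol] + 1) / (totals[context] + 256)
        bits -= math.log2(p)
        # Update the model only after "coding" the symbol, as a decoder would.
        counts[context][symbol] += 1
        totals[context] += 1
    return bits


sample = b"abc" * 200  # highly predictable once two bytes of context are known
order0_bits = ideal_bits(sample, order=0)
order2_bits = ideal_bits(sample, order=2)

print(f"raw size:      {8 * len(sample)} bits")
print(f"order-0 model: {order0_bits:.0f} bits")
print(f"order-2 model: {order2_bits:.0f} bits")
```

The coder is the same in both cases; only the model changes, and the better-fitting model needs fewer bits. On data that does not match the model's assumptions the advantage disappears, which fits the mixed results people are reporting here.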
Others are saying KGB is actually downloading stuff. There's nothing that seems to indicate that in the source, but it's not like I made a serious study of it. If that's the case then, sure, this thing is a joke. Doesn't really look like it, though, just looks like a really expensive compression algorithm.
- readme, on 10/12/2007, -1/+9I call shens. I remember back in like 1992 I downloaded a "fractal compressor" that compressed a 10MB file to like 200k. The demo actually worked in that you saw the file shrink to 200k and you could un-compress it with no loss. It was really a trojan that simply moved the files to a hidden directory when you compressed and moved them back when you uncompressed. The payload was some virus.
- lebel, on 10/12/2007, -2/+10Mathematically speaking, it is impossible to reduce a 430M ISO archive of executables and random data into a floppy sized diskette. I mean, unless you're dealing with a huge amount of repetition in the data, which I would find HIGHLY surprising, considering the type of data usually found on a software distribution.
Secondly, having the said program connect to a remote host while doing said "extractions" fires off sooooo many alarms in my head that I'm amazed people are digging this story.
Sheesh, kids these days are so naive. Get a book on modern compression techniques and come back. Grab a few aspirins for the headaches if need be.
- bradleybuda, on 10/12/2007, -1/+8The thing is, you can't have this kind of universal, non-lossy compression. ANY compression algorithm will shrink some files, and expand others. The reason compression is helpful is that most of the types of files we commonly exchange (text, images, etc.) will usually be shrunk, and the kind of files that get expanded are files with 'random' data that we wouldn't compress anyway.
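The claim that no lossless compressor can shrink every file comes down to counting. A small sketch of that counting follows. The `toy_compress` function is a made-up stand-in that always saves one bit, not any real archiver.

```python
from itertools import product

# Count the 20-bit inputs and every possible strictly shorter output.
inputs_20_bits = 2 ** 20
outputs_up_to_19_bits = sum(2 ** length for length in range(20))  # lengths 0..19
print(inputs_20_bits, outputs_up_to_19_bits)  # outputs are one short of inputs


def all_bitstrings(length: int) -> list[str]:
    """Every string of '0'/'1' characters with exactly `length` symbols."""
    return ["".join(bits) for bits in product("01", repeat=length)]


def toy_compress(bits: str) -> str:
    """Hypothetical 'compressor' that always saves one bit by dropping the last."""
    return bits[:-1]


# Group the 3-bit inputs by their compressed form and look for clashes.
by_output: dict[str, list[str]] = {}
for original in all_bitstrings(3):
    by_output.setdefault(toy_compress(original), []).append(original)

collisions = {out: ins for out, ins in by_output.items() if len(ins) > 1}
print(collisions)  # every output is shared by two inputs, so extraction is ambiguous
```

Any other rule that shortens every input hits the same wall, because there are fewer short strings than long ones.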
This is easy to prove. Assume that there is a compression algorithm that can compress ANY file by at least one bit. So take the set of all possible files that contain 20 bits. Our algorithm can compress all of these files to a maximum size of 19 bits. Here's the problem - there are 2^20 = 1,048,576 unique files that can be composed of 20 bits. However, there are only 2^20 - 1 = 1,048,575 unique files of 19 bits or fewer (counting every length from 0 to 19), and just 2^19 = 524,288 of exactly 19 bits. So we have the 'pigeonhole problem' - our 1,048,576 files have to compress into fewer distinct outputs, so at least two inputs end up with the same compressed file. When we extract that file, we can only get back one of them. This is a contradiction, and applies for any non-lossy compression algorithm - it's a basic consequence of information theory. Compression is very useful, and there's certainly room for improvement, but there's a lower limit - mathematics forbids us from being that clever.
- sirber, on 10/12/2007, -1/+8TCP sirber13:3848 static.fmpub.net:http CLOSE_WAIT
while compressing
- Roger, on 10/12/2007, -0/+7That ISO test is probably fake.
http://kgbarchiver.sourceforge.net/en/tests.php
The actual tests do show some improvements over other formats, but nothing too drastic. - lebel, on 10/12/2007, -1/+8OK, here's my guess of what this is: Someone found a buffer overflow in KGB, and when you're running it, you're actually launching something that does something really nasty in the background (thus, the connection to static.fmpub.net).
Nice social engineering to boot. Those script kiddies are becoming really crafty, using digg.com as a platform for malware.
If you go to the sourceforge.net site, never do you see those outrageous compression rates posted. All of them are reasonable compression ratios and whatnot. Now, you have some nameless Digg.com poster coming here with some outrageous claims of compression rate, posting a link to some shady file upload site, and everybody's running around with their heads cut off running said compression program.
I'm telling you all first, here. - segosa, on 10/12/2007, -8/+15You know, before saying it's *****/impossible/etc, perhaps you should try it out.
- jfish, on 10/12/2007, -1/+8Humm, it's probably not written to use more than one core / processor, as with most other applications. Read above where the guy with the AMD x2 was running it and it was only using 1 core. The fact you did not know that the software had to be written to use more than 1 core, makes me wonder why you have access to this machine.
- oringo, on 10/12/2007, -5/+11Just curious, does the KGB install to ~430MB in size? I can write a program that can compress the office iso to 1bit in size if my program is 430MB in size.
- NiteMayr, on 10/12/2007, -10/+15I think it works by downloading the file(s) from the internet.
- IcedZ, on 10/12/2007, -2/+7read up.. people disconnected ethernet.
- TellarHK, on 10/12/2007, -0/+5That's because a lot of ISO images that are used solely for copy protection defeats are built with blank space for all but the relevant virtual tracks and sectors that relate to the copy protection scheme in use. Basically, it has the SecuROM required bits, but the rest of it is one big string of -nothing-. And the simplest pattern there is to compress is 00000000 00000000 00000000 etcetera.
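The effect TellarHK describes is easy to check with an ordinary compressor. A quick sketch using Python's standard `zlib` module (DEFLATE, not KGB's algorithm) follows. The one-megabyte size is an arbitrary choice for the demo.

```python
import os
import zlib

size = 1_000_000

zeros = bytes(size)       # one big string of nothing
noise = os.urandom(size)  # random bytes: no patterns for the compressor to find

zeros_packed = len(zlib.compress(zeros, 9))
noise_packed = len(zlib.compress(noise, 9))

print(f"zeros: {size} -> {zeros_packed} bytes")
print(f"noise: {size} -> {noise_packed} bytes")
```

The all-zero block shrinks to a tiny fraction of its size while the random block does not shrink at all. A mostly empty ISO would therefore look spectacular in any compression test.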
- linuxwebguy, on 10/12/2007, -1/+6@merreborn
"Cool, you can download 430 meg/minute? That's a constant 6 megabit/second. I want your connection!"
Um..... You're bad at math.
The file is 430 MB (megabytes). Downloading that file in a minute would result in 7.16 MB/s or megabytes per second. However, there are 8 bits in a byte. Downloading a 430 MB file in one minute would require a 57.3 Mbps (megabit per second) connection. That is faster than a T3 WAN connection.
Yes, I'd like that too, but wouldn't want to pay for it. - RichyFreeway, on 10/12/2007, -2/+7I think it's looking pretty likely that this is some sort of hoax. It definitely appears to be downloading the files rather than extracting them. Perhaps it's a hacked archive file?
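linuxwebguy's arithmetic above can be rechecked in a few lines. This is only a sketch of the unit conversion. It treats 1 MB as 1,000,000 bytes for simplicity, and the T3 figure is the nominal 44.736 Mbps line rate.

```python
def required_mbps(size_megabytes: float, seconds: float) -> float:
    """Megabits per second needed to move `size_megabytes` in `seconds`."""
    bits_per_byte = 8
    return size_megabytes * bits_per_byte / seconds


FILE_MB = 430
T3_MBPS = 44.736  # nominal T3 line rate

megabytes_per_second = FILE_MB / 60
megabits_per_second = required_mbps(FILE_MB, 60)

print(f"{megabytes_per_second:.2f} MB/s")
print(f"{megabits_per_second:.1f} Mbps (T3 is {T3_MBPS} Mbps)")
```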
- HarryHunt, on 10/12/2007, -0/+5Here's how you could (theoretically) achieve insane compression:
1) Calculate a hash value of the file (e.g. using MD5)
2) Create a stream in memory that has exactly the same size as the original file
3) Fill that stream with all zeros
4) Calculate the hash value of that stream and see if it's identical to the one you calculated for the original file
5) If it's not identical, go to #3 and use a different configuration of bits (you'd have to try all possible configurations --> 2^(size of file in bits). For a 100KB file, which is 819,200 bits, that means 2^819,200 different combinations, a number with roughly 246,600 decimal digits.)
6) If it is identical and the stream is not identical to the original file, increment a variable and continue with the next configuration of bits
7) If the configuration of bits is identical to that in the original file, you're done
8) Save the hash, the file size and the counter variable to a file -> this is your archive
To decompress the file, you'd have to repeat the steps taken during compression, except the counter variable tells you when to stop.
This "algorithm" has a lot of room for optimization, but because the complexity grows exponentially with the file size, i don't think there's any real use for this, even in the future.
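The brute-force scheme above can be tried on a toy scale. The sketch below assumes a two-byte "file" and a hash truncated to a single byte so that collisions actually show up. The names `short_hash`, `compress`, and `decompress` are made up for the demo, and the enumeration order (all byte strings in lexicographic order) is just one possible choice.

```python
import hashlib
from itertools import product

HASH_BYTES = 1  # assumption: truncate MD5 so a tiny search space still collides


def short_hash(data: bytes) -> bytes:
    return hashlib.md5(data).digest()[:HASH_BYTES]


def candidates(size: int):
    """Every possible file of `size` bytes, starting from all zeros."""
    for combo in product(range(256), repeat=size):
        yield bytes(combo)


def compress(data: bytes) -> tuple[bytes, int, int]:
    """Return (hash, size, counter); counter = earlier candidates sharing the hash."""
    target = short_hash(data)
    counter = 0
    for candidate in candidates(len(data)):
        if candidate == data:
            return target, len(data), counter
        if short_hash(candidate) == target:
            counter += 1
    raise AssertionError("unreachable: the original is always among the candidates")


def decompress(target: bytes, size: int, counter: int) -> bytes:
    """Replay the search and stop at the (counter + 1)-th candidate with the hash."""
    for candidate in candidates(size):
        if short_hash(candidate) == target:
            if counter == 0:
                return candidate
            counter -= 1
    raise ValueError("archive does not match any candidate")


archive = compress(b"hi")
print(archive)
assert decompress(*archive) == b"hi"
```

The round trip works, but the counter has to tell the original apart from every other file with the same hash. By a counting argument, for an n-bit file and an h-bit hash that takes about n - h bits on average, so the hash plus the counter come out about as large as the file itself.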
As for KGB archiver: the name alone rings all my alarm bells and keeps me from downloading it (not to mention install it). The results some of the people here mentioned sound like those of a below average entropy encoder. The office thing has to be a fake and the screenshots posted here seem to prove that.
I think using some common sense one can safely say that it's not possible to compress a (real) 400 meg file to 1.4 megs using any method other than one involving brute force (which it would take years to compress/decompress a 400 meg file). - mentor, on 10/12/2007, -2/+7Shirley this is more compression than the Shannon Limit allows?
- panique, on 10/12/2007, -1/+6Since it was linked via digg, do we call it "Trojan Entry Vector 2.0"?
- inactive, on 10/12/2007, -2/+7http://img514.imageshack.us/img514/8977/a11uv.png
http://img514.imageshack.us/img514/4057/a22fb.png
http://img514.imageshack.us/img514/7169/a32bx.png
http://img514.imageshack.us/img514/4774/a48is.png - ThinkFr33ly, on 10/12/2007, -0/+5There are certainly ways of producing highly compressed archives, sometimes as high as the percentage in this case. They typically involve one of two things:
1.) Files whose contents have large amounts of redundant or repeating data
2.) Files whose contents have data that follows patterns that can be generated algorithmically
An example of #1 would be an XML document. XML documents are very easily compressible since there is a lot of repeated and redundant data. (""... most of that is repeated. There are only 9 unique characters.) [EDIT: damnit, Digg ripped out my XML. Bastards!]
An example of #2 is a bit harder to describe. In most cases, using information theory, it's possible to create algorithms that can reproduce huge sets of seemingly random data without loss. These algorithms can be stored in place of the actual data and take up a tiny fraction of the original data's space. So yes, it's possible to represent large amounts of data with far fewer bits than would be required if all you were doing was compressing that data.
But these algorithms only apply to certain kinds of data sets. It would be nearly impossible to create an algorithm that recreates the bits that make up MS Office.
So I think this is probably *****. Or if it's not, it would only work for this one data set. - anastrophe, on 10/12/2007, -0/+4what an incredible waste of time - literally. look at 7zip's results. excellent, and it doesn't take days to do. and it's free. sheesh.
- DigeratiPrime, on 10/12/2007, -0/+4good idea and then we could verify the md5 sums to check its legitimacy.
- redcard, on 10/12/2007, -0/+4Prediction. This is the "Fractal Compression" scheme put out on the net. When you run the compressor, it will create a checksum and will then put that inside a file (filled with other random junk) . It will then take the big file , and upload it on the website. When you "uncompress" , you're simply pulling the other file back down from the website.
Now, add this to an algorithm that already works.. and.. bingo. Further, you're being given the file by the OP. Now, I know a number of you see that it's "working" when you unplug the cable.. well, guess what, progress bars are really hard to make, and if this indeed is a hoax, do you really expect them to print out "Can't get your file from the internet so as to make it look like it came from the compressed file. Exiting" or do you expect it to look like it's working?
And finally, a number of you are talking about having the source.. and that's spiffy too. But do you really have "THE" source for the binary that is being run on your computer?
Hmm. - tuxidomasx, on 10/12/2007, -1/+5the implications of this might be more serious than people think. how do people get caught sharing copyrighted info? they are caught in the act most of the time. riaa searches on limewire, or does a netstat while running bittorrent. and they pick a few ip addresses. its trivial.
if i can go to a public library and download a dvd in 5 seconds and stick it on my usb drive then i'm clear. i dont care how long it takes to decompress at home because i know i'm safe while i'm doing it.
downloading/uploading material for 10 hours is more suspect and usually requires a prolonged connection to the internet. i wouldnt care if it took 2 days to uncompress it-- as long as i'm not actively sharing, i'd feel better. heck, the decompression program could be HUGE-- 3, 4,5 gigabytes. who cares. space aint an issue. - drigz, on 10/12/2007, -4/+8Yeah, and for a good algorithm, it'll work well on many files.
Check out my compression algorithm - if the file is a trillion 'e's, it compresses to an 'e', everything else compresses to itself prepended by 'f'. And wow! It can compress a 1TB file to 1 byte!!!!!
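The joke algorithm really is lossless, which is what makes the point. A runnable sketch follows, with the trillion 'e's scaled down to a thousand so it fits in memory. `MAGIC_INPUT` and the function names are invented for the demo.

```python
MAGIC_LENGTH = 1_000  # stand-in for "a trillion"
MAGIC_INPUT = b"e" * MAGIC_LENGTH


def compress(data: bytes) -> bytes:
    """One special file shrinks to a single byte; everything else grows by one."""
    if data == MAGIC_INPUT:
        return b"e"
    return b"f" + data


def decompress(blob: bytes) -> bytes:
    if blob == b"e":
        return MAGIC_INPUT
    return blob[1:]  # strip the 'f' marker


for sample in (MAGIC_INPUT, b"", b"e", b"hello"):
    packed = compress(sample)
    assert decompress(packed) == sample
    print(len(sample), "->", len(packed))
```

One hand-picked input gets an enormous ratio and every other input gets one byte bigger. A single headline number therefore says nothing about how a compressor does on typical files.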
If they're gonna boast about their amazing algorithm, it should be able to beat winzip on something that common (executables).
- merreborn, on 10/12/2007, -1/+5a 1 gig stick and a 256 meg stick. Or two 512s and a 256.
1 280 megabytes = 1.25 gigabytes
- ElectroOverlord, on 10/12/2007, -0/+4Just started compressing my own Office 2003 ISO that is 569MB and it says about 8.5 hours to go. Told it to compress at Level (Good). Numbers look less impressive every few minutes. I am willing to experiment..but have hesitation so far.