I'm looking for an archiver program that can perform deduplication (dedupe) on the files being archived. Upon unpacking the archive, the software would restore any duplicate files it removed during compression.
So far I've found:
Anyone aware of any others?
This would probably be an awesome addition to 7-zip.
Answer
Almost all modern archivers do effectively this; the only difference is that they call it a "solid" archive, meaning all of the files are concatenated into a single stream before being fed to the compression algorithm. This differs from standard zip compression, which compresses each file individually and then adds each compressed file to the archive.
7-Zip, by its very nature, effectively achieves de-duplication this way. When building a solid archive, 7-Zip sorts the files by type and name, so two files of the same type and content end up side by side in the stream going to the compressor. The compressor then sees a lot of data it has seen very recently, and those two files compress far better than they would if compressed one by one.
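The effect is easy to demonstrate without 7-Zip itself. The following is a minimal sketch using Python's `zlib` as a stand-in for the archiver's compressor: two identical "files" of incompressible random data are compressed separately (zip-style) and as one concatenated stream (solid-style). In the solid stream, the second copy is within the compressor's back-reference window, so it shrinks to almost nothing.

```python
import random
import zlib

# Two identical 10 KB "files" of pseudo-random (hence incompressible) data.
random.seed(0)
file_a = bytes(random.getrandbits(8) for _ in range(10_000))
file_b = file_a

# Zip-style: compress each file independently and sum the sizes.
separate = len(zlib.compress(file_a, 9)) + len(zlib.compress(file_b, 9))

# Solid-style: concatenate first, then compress the single stream.
# The second copy becomes a back-reference to the first.
solid = len(zlib.compress(file_a + file_b, 9))

print(f"separate: {separate} bytes, solid: {solid} bytes")
```

Note that deflate's window is only 32 KB, so this trick stops working once the duplicate is further than that from the original; 7-Zip's LZMA uses dictionaries of many megabytes, which is why its solid mode deduplicates across far larger archives.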
Linux has shown similar behaviour for a long time through the prevalence of the ".tgz" format (or ".tar.gz", to use its full form), since tar simply merges all the files into a single stream (albeit without sorting or grouping them) before compressing the result with gzip. What this misses is the sorting 7-Zip does, which may slightly decrease efficiency, but it is still a lot better than blobbing together a lot of individually compressed files the way zip does.
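The same comparison can be made with the formats themselves. This is a small sketch using Python's standard `tarfile` and `zipfile` modules: two identical files are packed as a `.tar.gz` (one gzipped stream) and as a `.zip` (each file deflated on its own), and the solid tarball comes out noticeably smaller.

```python
import io
import random
import tarfile
import zipfile

# Two identical 8 KB files of pseudo-random data.
random.seed(1)
payload = bytes(random.getrandbits(8) for _ in range(8_000))
files = {"a.bin": payload, "b.bin": payload}

# .tar.gz: tar merges the files into one stream, gzip compresses that stream,
# so the second file back-references the first.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
tgz_size = len(tar_buf.getvalue())

# .zip: each file is deflated independently, so the duplicate is paid for twice.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)
zip_size = len(zip_buf.getvalue())

print(f".tar.gz: {tgz_size} bytes, .zip: {zip_size} bytes")
```

With random payloads the `.zip` is roughly twice the size of the `.tar.gz`, since zip stores both copies in full while the tarball stores the second copy as back-references.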