Creating reproducible ZIP archives
March 2015 (282 Words, 2 Minutes)
I am currently working on an app that loads data in relatively small incremental payloads from the server side for offline use. Payload files are in ZIP format. The app cannot assume it has the correct file (the file might be missing, partially downloaded, or simply changed by the developer), so it compares the expected checksum against the checksum of the local file. If there is a mismatch, the file is (re)downloaded.
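As a rough sketch of that check, here is what the comparison could look like in shell. The function names (sha256_of, needs_download) are made up for illustration, and SHA-256 is an assumption: the checksum algorithm the app actually uses isn't important here.

```shell
#!/bin/sh
# Hypothetical helpers, not the app's real code.
# Assumption: the checksum is SHA-256, printed as a hex string.

sha256_of() {
    # Linux usually ships sha256sum; OS X ships shasum.
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 "$1" | cut -d ' ' -f 1
    fi
}

# Exit status 0 means "this file must be (re)downloaded".
needs_download() {
    file=$1
    expected=$2
    # A missing file always needs downloading.
    [ -f "$file" ] || return 0
    # A partial or changed file has a different checksum.
    [ "$(sha256_of "$file")" != "$expected" ]
}
```

It would be called along the lines of `needs_download payload.zip "$expected" && echo "fetching payload.zip"`, where the expected value comes from the server.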
During development, I regenerate the payload files many times and, when the time comes, I release the “golden” versions into production. Obviously, I insist on reproducible builds, and this includes the bundle files: if the data that went into those files hasn’t changed, there should be absolutely no reason why the generated bundle should differ by a single bit. After all, if the file has a new checksum, the app will download it again (even if the data is the same). I could devise a more elaborate scheme based on some kind of versioning, but it would be more complex, more error-prone, and wouldn’t add much benefit to the user. Plus, it’s nice to know that your data is guaranteed to be the same.
Turns out, things are not so easy when it comes to ZIP files. Both OS X and Ubuntu offer a command-line zip program, which by default sticks gobs of metadata into the ZIP archive, including timestamps. Thus, no two zips are alike (well, maybe they will be if run within the same second, but I didn’t bother to check).
Unfortunately, there is no way to tell zip to include only the most critical metadata (file names and sizes) and ignore everything else. It does offer the -X (aka --no-extra) option to strip some attributes, but this doesn’t include the file timestamps.
As a result, the only way I have found to generate consistently reproducible ZIP archives from the same set of files is to force the modification time of the files to some well-known value and then zip them without any extra attributes. More specifically:
touch -t 201212210314.16 . myfiles.*
zip -X -q myarchive.zip myfiles.*
This works consistently (and portably) on OS X and Linux, but of course there is no guarantee it will stay that way: a new zip version might slightly change the algorithm (hopefully not, considering how pervasive ZIP is). At least for now, this seems to be the only way to perform reproducible, portable ZIP archive generation.