Creating reproducible ZIP archives

I am currently working on an app that loads data from the server in relatively small incremental payloads for offline use. Payload files are in ZIP format, and the app decides whether it has the correct file (it might not: the file could be missing, partially downloaded, or simply changed by the developer) by comparing the local file’s checksum against the expected one. If the checksums do not match, the file is (re)downloaded.

During development, I regenerate the payload files many times, and when the time comes, I release the “golden” versions into production. Obviously, I insist on reproducible builds, and this includes the bundle files: if the data that went into those files hasn’t changed, there is no reason for the generated bundle to differ by even a single bit. After all, if the file has a new checksum, the app will download it again (even if the data is the same). I could devise a more elaborate scheme based on some kind of versioning, but it would be more complex, more error-prone, and wouldn’t add much benefit for the user. Plus, it’s nice to know that your data is guaranteed to be the same.

Image credit: New Line Cinema.

It turns out things are not so easy when it comes to ZIP files. Both OS X and Ubuntu ship a command-line zip program that, by default, stores gobs of metadata in the ZIP archive, including timestamps. Thus, no two zips are alike (well, maybe they would be if created within the same second, but I didn’t bother to check).

Unfortunately, there is no way to tell zip to include only the most critical metadata (file names and sizes) and ignore everything else. It does offer the -X (aka --no-extra) option to strip some extra attributes, but this does not remove the file modification timestamps.

As a result, the only approach that seems to generate consistently reproducible ZIP archives from the same set of files is to force the modification time of the files to some well-known value and zip them without any extra attributes. More specifically:

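# Pin the current directory and the input files to 2012-12-21 03:14:16 (local time)
touch -t 201212210314.16 . myfiles.*
# -X: store no extra attributes; -q: quiet
zip -X -q myarchive.zip myfiles.*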

This works consistently (and portably) on OS X and Linux, but of course there is no guarantee it will stay that way: a new zip version might slightly change the algorithm (hopefully not, considering how pervasive ZIP is). But at least for now, this seems to be the only way to perform reproducible, portable ZIP archive generation.
