Extracting the original datetime from Canon and Olympus photos
Table of Contents
Over the years, like many families these days, we have accumulated hundreds of GBs of photos. A few weeks ago I thought that writing an application to archive all photos to my home NAS could be a funny pet project: the goal is to deduplicate and archive photos (still in the making, will write about it once it’s done). As part of that, I also decided to extract the photos’ original date by parsing their Exif headers, in order to archive them by date.
While there are already many libraries and tools to deal with Exif (first and foremost, Phil Harvey’s excellent ExifTool and its bindings for several programming languages), I decided to write my own parser out of interest: since most of my photos have been shot with Canon and Olympus cameras, I decided to start with Canon’s CR2 and Olympus’ ORF file formats.
A couple of days later, after making my way back to the surface from a rabbit hole of poorly-documented specifications and manufacturer-specific tags, this quote came to my mind:
What you learn after dealing with that (*) is that […] you look at the peole who have been there before you, and you look at their code, and you thank them very much for making it open source, and you give them credit, and you take what they have made, and you put it in your program, and you never ever look at it again. Because that way lies madness.
(*) “timezones”, in the original quote.
Tom Scott, The Problem with Time & Timezones
What follows below is a brief guide about extracting metadata from TIFF files. The code is now available here, for reference.
Exif #
Exchangeable Image File Format (Exif) is a standard format for photo and audio files, used by digital camera manufacturers since its initial definition in 1998: the specification describes a standard set of information, as well as a MakerNote
tag, which allows camera manufacturers to add custom metadata to the file (such as non-standard camera settings). The Exif specification builds upon the TIFF file format, and both CR2 and ORF happen to be TIFF-like formats, so I can start from there.
The first hurdle is that manufacturers typically do not share information about their proprietary file formats, so I have to dig for information. Below is an amazing visual representation of the CR2 format by Laurent Clévy, for which I cannot thank him enough:
A TIFF file is made of several parts:
- an initial block containing image file headers,
- four image file directories (IFDs), and finally
- the image data itself
Exif tags are contained in the first IFD (EXIF Sub-IFD, in the picture), so that’s my goal.
Since fields are found at given offsets (predetermined or variable, depending on the situation), I need to read each file from the beginning. I’m dealing with RAW files, whose size is easily in the range of tens of MBs, and I’m only interested in few bytes of their metadata, so reading the whole file is not a sensible choice: reading it in chunks using an io.ReadSeeker
seems to be most appropriate here.
Parsing algorithm #
Here is how parsing works:
- Start reading the file from its beginning
- Read the byte ordering field from the TIFF header
- Validate the TIFF’s magic number
- Read the offset of the first IFD
- Seek the offset of the Exif sub-IFD
- Sequentially read its contents until I find the relevant fields, identified by their entry IDs
TIFF header #
This is the first block of the file, containing the information required to read the file contents, such as the byte ordering (aka endianness), a control field called magic number (more on that below), and finally the offset of the first IFD. Here is an example of a TIFF header:
4949 2a00 10000000
Byte ordering #
0x4949
(ASCII: II
, for Intel) indicates that the file is written using little-endian byte order. In case of big-endian byte order, the value would be 0x4d4d
(ASCII: MM
, for Motorola).
This field is fundamental as it determines which flavor of binary.ByteOrder
(from Go’s encoding/binary
package) to use when reading the rest of the data, e.g.:
import (
"encoding/binary"
)
byteOrder := binary.LittleEndian
buffer := []byte{0x49, 0x20}
value := byteOrder.Uint16(buffer)
Magic number #
This is a control field to verify that the file conforms to the TIFF format: its expected value is 0x2a00
.
Note: ORF uses a custom value here: 0x524F
. CR2 uses the standard TIFF value, instead.
First IFD offset #
0x10000000
indicates the offset of the start of the first IFD: since the Exif is contained in it, I read this value and point my ReadSeeker
there.
IFDs #
Each IFD is structured as follows:
- number of IFD entries (2 bytes, unsigned short)
- sequence of IFD entries (12 bytes for each entry)
- offset of the next IFD (4 bytes, unsigned long)
The following struct represents an entry:
type Entry struct {
ID uint16
DataType uint16
Length uint32
Value uint32
}
An ID
is a unique, unsigned short value that allows me to understand which entry is in front of me (e.g. 0x8769
identifies the Exif entry). Some IDs are standard and some are manufacturer-specific, so I might have to take that into account. Thankfully, a very extensive list of IDs is available on the ExifTool website.
The DataType
field describes the type of data represented by the entry: for example, the dateTimeOriginal
field is represented by a string, whose corresponding DataType
value is 2
. A complete list of the TIFF types is available in the Appendix.
The Length
field specifies the length, in bytes, of the value of this field.
The Value
field provides the actual value of the entry or the offset to read the value from: this field can only store 4 bytes of data, so any value that requires more than that cannot be written directly in it; in that case, the entry will be marked with an unsigned long DataType
(value: 4
), this field will contain the offset to the actual value, and the Length
field will indicate how many bytes to read. At the end of each IFD there is a data area where any such values will be stored.
The dateTimeOriginal
field itself is an example of such a field, since it requires 20 bytes.
Reading IFD entries #
Back to my use case: I need to read IFD#0 entries until I find the Exif entry (id: 0x8769
), then I’ll skip to its offset and find the entries I’m interested into.
Besides the dateTimeOriginal
field, some manufacturers/cameras may also provide an offsetTimeOriginal
field containing the timezone string (e.g. Europe/Amsterdam
), which would obviously be very useful to have: since I’m going to search for entries in IFDs in a couple of occasions, it make sense to extract that logic into a common function.
Two considerations drive its implementation:
- IFD entries are written according to the natural order of their IDs (*)
- IFDs may contain up to thousands of entries
(*) Disclaimer: I haven’t found any official documentation supporting this, this is based on what I have seen in examples and confirmed by testing my code against my photos.
Since I’m searching for dateTimeOriginal
(ID: 0x9003
) and offsetTimeOriginal
(ID: 0x9011
), I can stop reading as soon as I have found both or the ID of the next entry is greater than 0x9011
.
Finally: returning the original datetime #
If I have found both entries, I can read their values by skipping to the respective offsets and then return a time.Time
in its original timezone using time.LoadLocation(...)
; if I have only dateTimeOriginal
, I’ll have to assume it to be in my timezone; if I haven’t found any of them, tough luck :)
Appendix #
TIFF data types #
Value | Description |
---|---|
1 | unsigned byte |
2 | ASCII string terminated with NUL character |
3 | unsigned short (2 bytes) |
4 | unsigned long (4 bytes) |
5 | unsigned rational: one unsigned long for the numerator and one for the denominator |
6 | signed byte |
7 | sequence of unsigned bytes |
8 | signed short (2 bytes) |
9 | signed long (4 bytes) |
10 | signed rational (2 longs) |
11 | float, IEEE format (2 bytes) |
12 | double, IEEE format (4 bytes) |
Credits #
Laurent Clévy’s Understanding what is stored in a Canon RAW .CR2 file, how and why blog post was a fundamental starting point to dig into the CR2 file format.
Muhammad Imran’s metadata-extractor was also a very useful resource to cross-check implementation details such as tag IDs.
Last but not least, Dustin Oprea’s go-exif is a very comprehensive library to parse and manipulate Exif blocks, complemented by a set of libraries that delve into the specifics of several file formats, see list here.