Archiving and deduplicating media files

Context #

Like many other families, we have accumulated hundreds of gigabytes of photos and videos over the years, stored on a multitude of external drives. I eventually bought a NAS for home use and moved everything there, but the files are still scattered around and there’s a lot of duplication.

There are of course plenty of applications on the market that can manage an archive of media, but this looked like (and was, in fact) an interesting pet project, so I started working on it last April.

I had already attempted to tackle this problem a couple of years ago by writing a CLI to identify and remove duplicates in a directory (see go-dedup), but this time I decided to change my approach: instead of deduplicating files in an existing directory (the archive), I deduplicate them as they are imported into the archive.

While working on this I went down the rabbit hole of extracting a photo’s original datetime, and I later extracted the results of that deep dive into the tiff-parser library. After several iterations, I feel ark is now in a usable (albeit still rough around the edges) state, so I’m making it publicly available.

Important: the code does not perform any destructive operation, but I feel nonetheless compelled to point out that its usage is at your own risk.

Requirements #

The NAS (or whatever device will store the media archive) needs to be able to run Docker/OCI containers.

Sequence diagram #

The diagram below, generated with text-diagram, describes the happy flow of the upload of a single file. For brevity’s sake, error situations are omitted: any error aborts the flow.

                         +---------+                +---------+                +-----+ +-----+
                         | Client  |                | Server  |                | DB  | | HDD |
                         +---------+                +---------+                +-----+ +-----+
                              |                          |                        |       |
                              | Compute file hash        |                        |       |
                              |------------------        |                        |       |
                              |                 |        |                        |       |
                              |<-----------------        |                        |       |
                              |                          |                        |       |
                              | Send file metadata       |                        |       |
                              |------------------------->|                        |       |
                              |                          |                        |       |
                              |                          | Does it exist          |       |
                              |                          |----------------------->|       |
         -------------------\ |                          |                        |       |
         | alt: file exists |-|                          |                        |       |
         |------------------| |                          |                        |       |
                              |                          |                        |       |
                              |                          |                    Yes |       |
                              |                          |<-----------------------|       |
                              |                          |                        |       |
                              |      File already exists |                        |       |
                              |<-------------------------|                        |       |
                              |                          |                        |       |
                              | Skip file                |                        |       |
                              |----------                |                        |       |
                              |         |                |                        |       |
                              |<---------                |                        |       |
----------------------------\ |                          |                        |       |
| else: file does not exist |-|                          |                        |       |
|---------------------------| |                          |                        |       |
----------------------------\ |                          |                        |       |
| loop: for each file chunk |-|                          |                        |       |
|---------------------------| |                          |                        |       |
                              |                          |                        |       |
                              | Send file chunk          |                        |       |
                              |------------------------->|                        |       |
                              |                          |                        |       |
                              |                          | Store file chunk       |       |
                              |                          |-----------------       |       |
                              |                          |                |       |       |
                              |                          |<----------------       |       |
                              |                          |                        |       |
                              |                       OK |                        |       |
                              |<-------------------------|                        |       |
                 -----------\ |                          |                        |       |
                 | end loop |-|                          |                        |       |
                 |----------| |                          |                        |       |
                              |                          |                        |       |
                              |                          | Atomically write file  |       |
                              |                          |------------------------------->|
                              |                          |                        |       |
                              |                          |                        |    OK |
                              |                          |<-------------------------------|
                              |                          |                        |       |
                              |                          | Store file metadata    |       |
                              |                          |----------------------->|       |
                              |                          |                        |       |
                              |                          |                     OK |       |
                              |                          |<-----------------------|       |
                              |                          |                        |       |
                              |                       OK |                        |       |
                              |<-------------------------|                        |       |
                      ------\ |                          |                        |       |
                      | end |-|                          |                        |       |
                      |-----| |                          |                        |       |
                              |                          |                        |       |

Decision log #

Definition of file equality #

I will consider two files f1 and f2 equal if and only if hash(f1.contents) == hash(f2.contents). I’m not going to take into account other attributes (such as file name, datetime, etc.) as they may lead to false positives.

Hash function #

I’m going to compute file hashes by digesting a file’s contents with the Go implementation of the BLAKE3 cryptographic hash function: I have chosen it over other hash functions because of its performance.

Architecture #

The system will have two components:

a server, running on the NAS
a client, running on a machine connected to the same Wi-Fi network as the NAS (or on the NAS itself)

Server and client will communicate over gRPC in order to leverage HTTP/2 streaming capabilities.

Separation of responsibilities #

The server is responsible for:

providing an API to import files
identifying (and discarding) duplicates
archiving files by creation date

The client is responsible for:

scanning directories for media files
computing each file’s hash
sending files to the server

I have decided to assign the responsibility of computing file hashes to the client because:

I can then leverage gRPC’s streaming capabilities to send only the file’s hash and metadata first, which makes for a very small payload. If the server determines that the file is a duplicate, it cancels the request and the file contents are never transferred over the network. Transferring media files over a network is very time-consuming, so this is a big win
hashing is a compute-heavy operation, and laptops/PCs have far more compute power than the average NAS on the market, so it runs faster on those devices

Network communication #

Instead of vanilla gRPC, I’m going to use the connect-go library, to explore its capabilities.

Clients will authenticate their requests with JWTs, leveraging connect-go interceptors. This is purely to experiment with the library, since the underlying assumption is that server and client communicate over a private network.

Storage #

Since a file’s hash defines its identity, I only need to store that to make things work. At the same time, having file metadata (such as its path and creation datetime) might come in handy, so I’ll also store those. All in all, a key-value store looks like the best fit.

Side note: I initially used BoltDB, but later switched to Redis (running in a Docker container on my NAS) to improve performance.

Metrics #

I’m going to report key performance metrics to Prometheus to measure the system’s performance. The Prometheus server will run in a Docker container on my NAS.

File archiving #

Files are archived in a directory hierarchy based on their creation date (year/month/day). The creation date is extracted, whenever possible, from the file’s EXIF header: when that is not possible (either because the file type is not supported or there is no EXIF data), the file modification time is used as a fallback.

EXIF can currently be parsed from TIFF-like files (TIFF, CR2, ORF, and more), HEIC, and JPEG.

Usage #

Deploy server #

I’m using GitHub Actions to publish Docker images of the server for every release, so you can grab one from GitHub Packages. The docker directory provides an example docker-compose.yaml that you can use to deploy the server along with Prometheus and Redis: please note that you’ll likely have to tweak it, since it’s quite basic. You’ll also need to copy the secrets directory and prometheus.yml. In the end, the directory structure on your NAS should look something like:

/
|_ data
  |_ archive      # where your media files will be stored
  |_ prometheus   # where prometheus will persist its database
  |_ redis        # where redis will persist its database
|_ secrets
  |_ server       # contains sensitive data
|_ prometheus.yml # configuration for prometheus

The configuration mounts 3 volumes to store data for server, Prometheus, and Redis respectively: those directories must already exist on your host machine, otherwise the deployment will fail. Please note that the ark_data volume determines where your archive of media files will be stored.

You also need to tweak the secret file that is used to securely inject sensitive data into the server container.

Of course, the ark server doesn’t necessarily need to run inside a Docker container: because of limitations in my NAS, that is just the simplest way for me to get it running there.

Run client #

First you’ll need to build the client with

make build-client

Now you need to configure the client using environment variables, for example using an .env file.

Finally, you can run it with

source .env

bin/client --from /your/folder/with/photos