Fredrik Erlandsson

PhD in Computer Science, data scientist at Consid AB.

Organizing photos with Elodie

2021-05-09

We have a fairly large photo archive, and the biggest problem is that it is completely unorganized. A few days ago I found the organizer Elodie, a simple Python script that reads EXIF data from photos and organizes them into folders based on date and geo-location. Elodie also computes SHA256 checksums of all photos to make sure that no duplicates exist.
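
The duplicate check boils down to hashing every file and skipping any checksum that has already been seen. A minimal sketch of the idea (not Elodie's actual code, and /source/photos is just an example path):

import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    # Stream the file through SHA256 so large photos don't have to fit in memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

hash_db = {}  # checksum -> first path seen with that checksum
for photo in Path('/source/photos').rglob('*.jpg'):
    checksum = sha256sum(photo)
    if checksum in hash_db:
        print(f'duplicate: {photo} and {hash_db[checksum]}')
    else:
        hash_db[checksum] = str(photo)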

The following ini-file suits me best. Note that I've modified Elodie to also include village, hamlet and county, since Sweden is quite rural and a lot of my photos didn't map to a town or city, which is Elodie's default; a rough sketch of that fallback is shown after the config. I've also modified Elodie to allow filenames to contain camera make and model, as I find it quite useful to know which camera (or rather phone) took the picture.

[MapQuest]
key=<my key>

[Exclusions]
synology_folders=@eaDir
thumbs=.thumbnails
Thumbnails=Thumbnails
Previews=Previews

[File]
date=%Y-%m-%d_%H-%M-%S
name=%date %title %camera_model %original_name.%extension

[Directory]
date=%Y-%m-%b
location=%default, %state
full_path=%date/%album|%location|Unknown Location
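
For reference, the place-name fallback I added behaves roughly like this (a minimal sketch with an illustrative address dict; the key names are assumptions, not Elodie's actual code):

def place_name(address):
    # Prefer the most specific place name available, falling back towards county.
    for key in ('city', 'town', 'village', 'hamlet', 'county'):
        if address.get(key):
            return address[key]
    return 'Unknown Location'

place_name({'hamlet': 'Tving', 'county': 'Blekinge'})  # -> 'Tving'

With the [Directory] settings above, a geotagged photo from May 2021 then ends up under something like 2021-05-May/Tving, Blekinge/.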

Start the Docker container.

# docker run \
 --name elodie \
 --entrypoint /bin/sh \
 --volume /volume2/Media/GoogleTakeout/:/source/gto \
 --volume /volume2/photo/:/source/photos \
 --volume /volume2/Media/tmp/:/destination \
 --volume /volume2/Media/tmp/hash.json:/elodie/hash.json:rw \
 --volume /volume1/pictures/location.json:/elodie/location.json:rw \
 --volume /volume1/pictures/config.ini:/elodie/config.ini:ro \
 --tty --interactive --rm \
 elodie

This drops you into a shell inside the container, from which Elodie can be run.

# ./elodie.py import \
 --destination /destination/ \
 --debug \
 "/source/gto/takeout/fer/Takeout/Google Photos/" \
 /source/photos/ \
 | tee -a /destination/elodie.log

Fix missing hashes

My initial import with Elodie lost some (~75,000) hashes. This is more or less what I did to compute the missing ones.

Find all files:

find /destination/ -name '*.jpg' -not -path '*eaDir*' > /destination/files

Compare the computed hashes from Elodie in hash.json with the actual files.

import json

# Load Elodie's hash database (checksum -> path) and the list of files on disk.
with open('hash.json', 'r') as hash_json, open('files', 'r') as files:
    hash_db = json.load(hash_json)
    all_files = files.readlines()

# Filenames that already have a hash in hash.json.
file_list = {f.split('/')[-1] for f in hash_db.values()}
# Files on disk whose names are not in hash.json (trailing newlines stripped).
files_to_check = [f.strip() for f in all_files if f.strip().split('/')[-1] not in file_list]

# Write a NUL-separated list so it can be fed to xargs -0.
with open('files_to_check', 'w') as f:
    f.write('\0'.join(files_to_check))

Running this interactively showed 79,853 files without a hash in hash.json.
Then compute the sha256sum with:

$ xargs -0 sha256sum 0</destination/files_to_check | pv -l -s 79853 1>checked_files
79.9k 1:28:22 [15.1 /s] [=========================================================>] 100%

Convert the output in checked_files to JSON and merge it with the original hash.json file.

# gawk 'BEGIN {print "{";} {if (NR>1) {printf ","; } printf "\""$1"\": \""; for(i=2;i<=NF;i++) printf "%s",$i (i==NF?"\""ORS:OFS); } END {print "}";}' /destination/checked_files > /destination/checked_files.json
# jq '. | length' /destination/checked_files.json
79853
# jq -s add /destination/checked_files.json /destination/hash.json > /destination/hash_merged.json
# jq '. | length' /destination/hash_merged.json
104586
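
If gawk and jq are not at hand, the same conversion and merge can be done in Python. This is a sketch assuming the standard sha256sum output format (hash, two spaces, path) and paths without embedded double quotes:

import json

# Parse sha256sum output ("<hash>  <path>") into a checksum -> path dict.
checked = {}
with open('/destination/checked_files') as fh:
    for line in fh:
        checksum, _, path = line.rstrip('\n').partition('  ')
        checked[checksum] = path

# Merge with Elodie's hash database; existing hash.json entries win on conflict,
# matching the jq -s add merge above.
with open('/destination/hash.json') as fh:
    hash_db = json.load(fh)
checked.update(hash_db)

with open('/destination/hash_merged.json', 'w') as fh:
    json.dump(checked, fh, indent=2)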

Compute sha256sums of all source files.

# find /source/pictures/ -iname '*.jpg' -not -path '*eaDir*' -not -path '*Thumbnails*' -not -path '*Previews*' > /source/pictures/all_files
# cat /source/pictures/all_files | tr '\n' '\0' | xargs -r0 sha256sum \
  | pv -l -s `wc -l </source/pictures/all_files` \
  | gawk 'BEGIN {print "{";} {if (NR>1) {printf ","; } printf "\""$1"\": \""; for(i=2;i<=NF;i++) printf "%s",$i (i==NF?"\""ORS:OFS); } END {print "}";}' \
  > /source/pictures/all_files.sha256sum
 105k 2:31:25 [11.6 /s]  [========================================================>] 100%

Then compare hash.json with the checksums of all files on disk, and save the ones that are missing to missing.json:
>>> import json
>>> with open('hash.json', 'r') as hash_json, open('/volume2/pictures/all_files.sha256sum', 'r') as all_files:
...     hash_db = json.loads(hash_json.read())
...     files = json.loads(all_files.read())
...
>>> len(hash_db)
235001
>>> len(files)
63640
>>> len(set(files) - set(hash_db))
2243
>>> missing = {k: files[k] for k in (set(files) - set(hash_db))}
>>> with open('missing.json', 'w') as missing_file:
...   json.dump(missing, missing_file)
...

I then matched the remaining files on filename, with whitespace normalized to dashes, to narrow down which ones were actually missing from the destination:

>>> import re
>>> import json
>>> import itertools
>>> with open('hash.json.bak', 'r') as hash_json, open('/volume2/pictures/all_files.sha256sum', 'r') as all_files:
...     hash_db = json.loads(hash_json.read())
...     files = json.loads(all_files.read())
...
>>> whitespace_regexp = re.compile(r'\s+')
>>> missing_hashes = {k: files[k] for k in (set(files) - set(hash_db))}
>>> dest_filenames = {name.split(' ')[-1].lower(): k for k, name in missing_hashes.items()}
>>> src_filenames = {re.sub(whitespace_regexp, '-', name.split('/')[-1].lower()): k for k, name in files.items()}
>>> missing = {k: src_filenames[k] for k in (set(src_filenames) - set(dest_filenames))}
>>> {missing[key]: (key, files[missing[key]]) for key in itertools.islice(missing, 5)}
{'2ca9c6bee405278d8267797bd35b242d33680ed69cbeb52af15104b518188503': ('2014-11-16-11.02.50.jpg', '/source/pictures/Osorterat/G-drive Photos/Camera Uploads/2014-11-16 11.02.50.jpg'), '4213df72b5d0ed47b3621a45c991506a9a7d564c9c99ce010f703e25b5c54da7': ('2019-08-21-08.15.11.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-08-21 08.15.11.jpg'), '1f9c20cb98435fdb570f1ed6c2d0466e2543e2366344b5026531ef67613ad1de': ('2019-06-14-09.42.03.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-06-14 09.42.03.jpg'), '9974d4f881750bc11965b9637378cb6a7ab62eec326b3d0f8c8866fc8615c70b': ('2019-07-08-12.06.32.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-07-08 12.06.32.jpg'), 'c354ad1a330a4d8e7365f32448e13602dfde7b9576cf56c77549d5083b825aa6': ('120.jpg', '/source/pictures/Gamla bilder/bilder/Bilder Från Kåren/20011103Fiat lordag/120.JPG')}
>>> len(missing)
1811
>>> [files[missing[key]] for key in itertools.islice(missing, 5)]
['/source/pictures/Osorterat/G-drive Photos/Camera Uploads/2014-11-16 11.02.50.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-08-21 08.15.11.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-06-14 09.42.03.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-07-08 12.06.32.jpg', '/source/pictures/Gamla bilder/bilder/Bilder Från Kåren/20011103Fiat lordag/120.JPG']
>>> missed_files = '\0'.join([files[missing[key]] for key in missing])
>>> with open('missed_files_new', 'w') as mf:
...     mf.write(missed_files)
... 
134109
Finally, this container runs Elodie in watch mode and picks up new photos from our Synology Drive folders:

# docker run \
    --name=elodie-fer \
    --env="PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
    --env="LANG=C.UTF-8"     --env="GPG_KEY=E3FF2839C048B25C084DEBE9B2699568" \
    --env="PYTHON_VERSION=3.9.1"     --env="PYTHON_PIP_VERSION=20.3.3" \
    --env="PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/5f38681f7f5872e4032860b54e9cc11cf0374932/get-pip.py" \
    --env="PYTHON_GET_PIP6a0b13826862f33c13b614a921d36253bfa1ae779c5fbf569876f3585057e9d2" \
    --env="PYTHONUNBUFFERED=1"   \
    --env="PUID=1026" \
    --env="PGID=241695"   \
    --env="SOURCE=/source" \
    --env="DESTINATION=/destination/" \
    --env="ELLICATION_DIRECTORY=/elodie"   \
    --env="DEFAULT_COMMAND=watch"     --network "bridge"    \
    --volume="/volume1/photo/loc.json:/elodie/location.json:rw" \
    --volume="/volume1/photo/hash.json:/elodie/hash.json:rw"    \
    --volume="/volume1/photo/config.ini:/elodie/config.ini:ro" \
    --volume="/volume1/photo:/destination:rw"    \
    --volume="/volume1/homes/jer/Drive/Moments/:/source/jer:rw" \
    --volume="/volume1/homes/fer/Drive/Moments/:/source/fer:rw"   \
    --entrypoint="/entrypoint.sh" \
    --log-driver="db" --rm  --restart="no"     -t     -i     --privileged   \
    elodie:fer
tags: photos, elodie, docker, python