We have quite a large photo archive, but the biggest problem is that it is totally unorganized. A few days ago I found the organizer Elodie, a simple Python script that reads EXIF data from photos and organizes them into folders based on date and geo-location. Elodie also computes SHA256 sums of all photos to make sure that no duplicates exist.
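The dedup part is the reason hash.json shows up everywhere below: every photo is hashed, and a photo whose checksum is already in the database is skipped. A minimal sketch of that idea (my own illustration, not Elodie's code):

import hashlib
import json

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB chunks so large photos never sit in memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def is_duplicate(path, hash_db):
    # hash_db maps checksum -> already imported file path, like Elodie's hash.json.
    return sha256_of(path) in hash_db

with open('hash.json') as fh:
    hash_db = json.load(fh)
print(is_duplicate('some-photo.jpg', hash_db))  # 'some-photo.jpg' is a made-up example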
The following ini-file suits me best. Note that I've modified Elodie to include village, hamlet and county, since Sweden is quite rural and a lot of my photos didn't map to a town or city, which is Elodie's default. I've also modified Elodie to allow filenames to contain camera make and model, as I find it quite useful to know which camera (or rather phone) took the picture. A sketch of the place-name fallback follows the config below.
[MapQuest]
key=<my key>
[Exclusions]
synology_folders=@eaDir
thumbs=.thumbnails
Thumbnails=Thumbnails
Previews=Previews
[File]
date=%Y-%m-%d_%H-%M-%S
name=%date %title %camera_model %original_name.%extension
[Directory]
date=%Y-%m-%b
location=%default, %state
full_path=%date/%album|%location|Unknown Location
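The place-name part of my modification is essentially just a longer fallback chain when picking a name out of the reverse-geocoding result. A minimal sketch, assuming a flat MapQuest-style address dict rather than Elodie's actual geolocation code:

def place_name(address):
    # address is assumed to look like a MapQuest reverse-geocode result,
    # e.g. {'hamlet': 'Ekeby', 'county': 'Uppsala', 'state': 'Uppsala'}.
    # Fall back through progressively coarser names so rural photos still get a folder.
    for key in ('city', 'town', 'village', 'hamlet', 'county'):
        if address.get(key):
            return address[key]
    return 'Unknown Location'

The 'Unknown Location' default matches the full_path fallback in the config above.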
Start the Docker container.
# docker run \
--name elodie \
--entrypoint /bin/sh \
--volume /volume2/Media/GoogleTakeout/:/source/gto \
--volume /volume2/photo/:/source/photos \
--volume /volume2/Media/tmp/:/destination \
--volume /volume2/Media/tmp/hash.json:/elodie/hash.json:rw \
--volume /volume1/pictures/location.json:/elodie/location.json:rw \
--volume /volume1/pictures/config.ini:/elodie/config.ini:ro \
--tty --interactive --rm \
elodie
Run Elodie inside the container.
# ./elodie.py import \
--destination /destination/ \
--debug \
"/source/gto/takeout/fer/Takeout/Google Photos/" \
/source/photos/ \
| tee -a /destination/elodie.log
My initial import with Elodie lost some (~75,000) hashes. This is more or less what I did to compute the missing hashes.
Find all files:
# find /destination/ -name '*.jpg' -not -path '*eaDir*' > /destination/files
Compare the computed hashes from Elodie in hash.json with the actual files.
>>> import json
>>> with open('hash.json', 'r') as hash_json, open('files', 'r') as files:
... hash_db = json.load(hash_json)
... all_files = files.readlines()
...
>>> file_list = {f.split('/')[-1] for f in hash_db.values()}
>>> files_to_check = [f[:-1] for f in all_files if f.split('/')[-1][:-1] not in file_list]
>>> len(files_to_check)
79853
>>>
>>> with open('files_to_check', 'w') as f:
... f.write( '\0'.join(files_to_check))
...
Then compute the sha256sum with:
$ xargs -0 sha256sum 0</destination/files_to_check | pv -l -s 79853 1>checked_files
79.9k 1:28:22 [15.1 /s] [=========================================================>] 100%
Convert the output checked_files to JSON and merge it with the original hash.json file.
# gawk 'BEGIN {print "{";} {if (NR>1) {printf ","; } printf "\""$1"\": \""; for(i=2;i<=NF;i++)printf "%s",$i (i==NF?"\""ORS:OFS); } END {print "}";}' checked_files > /destination/checked_files.json
# jq '. | length' /destination/checked_files.json
79853
# jq -s add /destination/checked_files.json /destination/hash.json > /destination/hash_merged.json
# jq '. | length' /destination/hash_merged.json
104586
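For reference, the gawk-plus-jq step can also be done in a few lines of Python; a sketch that assumes sha256sum's '<hash>  <path>' output format and that checked_files ended up under /destination/:

import json

def sha256sum_to_dict(lines):
    # Turn sha256sum output ('<hash>  <path>' per line) into a checksum -> path dict.
    result = {}
    for line in lines:
        line = line.rstrip('\n')
        if line:
            checksum, path = line.split(None, 1)
            result[checksum] = path
    return result

with open('/destination/checked_files') as fh:
    checked = sha256sum_to_dict(fh)

with open('/destination/hash.json') as fh:
    # Same effect as `jq -s add checked_files.json hash.json`: hash.json wins on conflicts.
    merged = {**checked, **json.load(fh)}

with open('/destination/hash_merged.json', 'w') as fh:
    json.dump(merged, fh)

Unlike the gawk one-liner, split(None, 1) keeps any repeated spaces inside a path intact.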
Compute the sha256sum of all files.
# find /source/pictures/ -iname '*.jpg' -not -path '*eaDir*' -not -path '*Thumbnails*' -not -path '*Previews*' > /source/pictures/all_files
# cat /source/pictures/all_files | tr '\n' '\0' | xargs -r0 sha256sum | pv -l -s `wc -l </source/pictures/all_files` | gawk 'BEGIN {print "{";} {if (NR>1) {printf ","; } printf "\""$1"\": \""; for(i=2;i<=NF;i++)printf "%s",$i (i==NF?"\""ORS:OFS); } END {print "}";}' > /source/pictures/all_files.sha256sum
105k 2:31:25 [11.6 /s] [========================================================>] 100%
>>> import json
>>> with open('hash.json', 'r') as hash_json, open('/volume2/pictures/all_files.sha256sum', 'r') as all_files:
... hash_db = json.loads(hash_json.read())
... files = json.loads(all_files.read())
...
>>> len(hash_db)
235001
>>> len(files)
63640
>>> len(set(files) - set(hash_db))
2243
>>> missing = {k: files[k] for k in (set(files) - set(hash_db))}
>>> with open('missing.json', 'w') as missing_file:
... json.dump(missing, missing_file)
...
>>>
>>> import re
>>> import json
>>> import itertools
>>> whitespace_regexp = re.compile(r'\s+')  # so '2014-11-16 11.02.50.jpg' becomes '2014-11-16-11.02.50.jpg'
>>> with open('hash.json.bak', 'r') as hash_json, open('/volume2/pictures/all_files.sha256sum', 'r') as all_files:
... hash_db = json.loads(hash_json.read())
... files = json.loads(all_files.read())
...
>>> missing_hashes = {k: files[k] for k in (set(files) - set(hash_db))}
>>> dest_filenames = {name.split(' ')[-1].lower(): k for k, name in missing_hashes.items()}
>>> src_filenames = {re.sub(whitespace_regexp, '-', name.split('/')[-1].lower()): k for k, name in files.items()}
>>> missing = {k: src_filenames[k] for k in (set(src_filenames) - set(dest_filenames))}
>>> {missing[key]: (key, files[missing[key]]) for key in itertools.islice(missing, 5)}
{'2ca9c6bee405278d8267797bd35b242d33680ed69cbeb52af15104b518188503': ('2014-11-16-11.02.50.jpg', '/source/pictures/Osorterat/G-drive Photos/Camera Uploads/2014-11-16 11.02.50.jpg'), '4213df72b5d0ed47b3621a45c991506a9a7d564c9c99ce010f703e25b5c54da7': ('2019-08-21-08.15.11.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-08-21 08.15.11.jpg'), '1f9c20cb98435fdb570f1ed6c2d0466e2543e2366344b5026531ef67613ad1de': ('2019-06-14-09.42.03.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-06-14 09.42.03.jpg'), '9974d4f881750bc11965b9637378cb6a7ab62eec326b3d0f8c8866fc8615c70b': ('2019-07-08-12.06.32.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-07-08 12.06.32.jpg'), 'c354ad1a330a4d8e7365f32448e13602dfde7b9576cf56c77549d5083b825aa6': ('120.jpg', '/source/pictures/Gamla bilder/bilder/Bilder Från Kåren/20011103Fiat lordag/120.JPG')}
>>> len(missing)
1811
>>> [files[missing[key]] for key in itertools.islice(missing, 5)]
['/source/pictures/Osorterat/G-drive Photos/Camera Uploads/2014-11-16 11.02.50.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-08-21 08.15.11.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-06-14 09.42.03.jpg', '/source/pictures/Mobile Photos/Fredrik/2019-07-08 12.06.32.jpg', '/source/pictures/Gamla bilder/bilder/Bilder Från Kåren/20011103Fiat lordag/120.JPG']
>>> missed_files = '\0'.join([files[missing[key]] for key in missing])
>>> with open('missed_files_new', 'w') as mf:
... mf.write(missed_files)
...
134109
Finally, the full docker run command for the container as it now runs on the Synology:
# docker run \
--name=elodie-fer \
--env="PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
--env="LANG=C.UTF-8" --env="GPG_KEY=E3FF2839C048B25C084DEBE9B2699568" \
--env="PYTHON_VERSION=3.9.1" --env="PYTHON_PIP_VERSION=20.3.3" \
--env="PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/5f38681f7f5872e4032860b54e9cc11cf0374932/get-pip.py" \
--env="PYTHON_GET_PIP6a0b13826862f33c13b614a921d36253bfa1ae779c5fbf569876f3585057e9d2" \
--env="PYTHONUNBUFFERED=1" \
--env="PUID=1026" \
--env="PGID=241695" \
--env="SOURCE=/source" \
--env="DESTINATION=/destination/" \
--env="ELLICATION_DIRECTORY=/elodie" \
--env="DEFAULT_COMMAND=watch" --network "bridge" \
--volume="/volume1/photo/loc.json:/elodie/location.json:rw" \
--volume="/volume1/photo/hash.json:/elodie/hash.json:rw" \
--volume="/volume1/photo/config.ini:/elodie/config.ini:ro" \
--volume="/volume1/photo:/destination:rw" \
--volume="/volume1/homes/jer/Drive/Moments/:/source/jer:rw" \
--volume="/volume1/homes/fer/Drive/Moments/:/source/fer:rw" \
--entrypoint="/entrypoint.sh" \
--log-driver="db" --rm --restart="no" -t -i --privileged \
elodie:fer