@mrenouf
Forked from Daenyth/simdir
Created November 12, 2010 15:38
Fuzzy directory match using simhash
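#!/bin/bash
# Parameters passed to simhash (-s and -f) below.
shingle_size=4
feature_count=1024

# Both arguments must be existing directories.
if [[ ! -d $1 || ! -d $2 ]]; then
    echo "Usage: $0 <dir1> <dir2>" >&2
    exit 1
fi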
# Remove stale .sim files so we don't simhash earlier simhash output.
# -f keeps rm quiet when no .sim files exist yet.
rm -f -- "$1"/*.sim "$2"/*.sim
simhash -f "$feature_count" -s "$shingle_size" -w "$1"/*
simhash -f "$feature_count" -s "$shingle_size" -w "$2"/*
# TODO: If a certain match threshold is met, then exclude the target
# file from all other match attempts. This would only work properly
# assuming a one-to-one mapping, so it should be optional.
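for A in "$1"/*.sim; do
    match=""
    maxresult=0
    for B in "$2"/*.sim; do
        # Skip self-comparison (possible when both arguments name the same directory).
        if [[ $A != "$B" ]]; then
            # (( )) is integer-only, so this assumes simhash -c prints an integer score.
            result=$(simhash -c "$A" "$B")
            if (( result > maxresult )); then
                maxresult=$result
                match=$B
            fi
        fi
    done
    # Strip the .sim suffix with %; # would only remove a leading ".sim".
    echo "$maxresult ${A%.sim} --> ${match%.sim}"
done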
# clean up after ourselves
rm -f -- "$1"/*.sim "$2"/*.sim
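
The TODO in the script describes excluding a target once it has matched well enough, which only makes sense for a one-to-one mapping. One possible way to do that is sketched below; it is not part of the original gist. The function name `one_to_one`, the `threshold` argument, and the `<score> <source> <target>` input format are all hypothetical, and the sketch assumes integer scores and names without whitespace. It works on precomputed score lines rather than calling simhash itself.

```shell
#!/bin/bash
# Sketch only: greedy one-to-one matching over precomputed scores.
# Reads lines "<score> <source> <target>" on stdin (hypothetical format)
# and prints each accepted pair as "<score> <source> --> <target>".
one_to_one() {
    local threshold=$1      # hypothetical cutoff; lower-scoring pairs are ignored
    local used_src=" " used_dst=" "
    local score src dst
    # Highest scores first, so strong pairs are claimed before weaker ones.
    sort -k1,1nr | while read -r score src dst; do
        [ "$score" -ge "$threshold" ] || continue
        # Skip any pair whose source or target is already taken.
        case $used_src in *" $src "*) continue ;; esac
        case $used_dst in *" $dst "*) continue ;; esac
        used_src="$used_src$src "
        used_dst="$used_dst$dst "
        echo "$score $src --> $dst"
    done
}
```

To use something like this with the script, the inner loop would print every `result`, `A`, `B` triple instead of tracking only the maximum, and the collected lines would be piped into `one_to_one`. The greedy choice is simple but not guaranteed to give the best overall assignment.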