Created
March 19, 2023 17:44
-
-
Save fatih/0b377d277767ed8cb290a5d8627afe0e to your computer and use it in GitHub Desktop.
Poly file layer
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Poly File Layer | |
Idea: A single file content to represent a set of identical or similar files. | |
* Files can be templated. Files generated by the template are identical in the PFL (Poly File Layer) | |
* Files that have common lines, can be linked and marked as a Poly File. | |
Prior Art | |
* [ ] Read the paper | |
* Extracting a Unified Directory Tree to Compare Similar Software Products: https://sel.ist.osaka-u.ac.jp/lab-db/betuzuri/archive/1012/1012.pdf | |
* Fast Search in Hamming Space with Multi-Index Hashing: http://www.cs.toronto.edu/~norouzi/research/papers/multi_index_hashing.pdf | |
Finding Similar Items: | |
* http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf | |
* http://www.mmds.org | |
* RWS-Diff: Flexible and Efficient Change Detection in Hierarchical Data: | |
https://db.in.tum.de/~finis/papers/RWS-Diff.pdf | |
Notes | |
Locality Sensitivity Hasing: https://en.wikipedia.org/wiki/Locality-sensitive_hashing | |
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134 | |
https://medium.com/engineering-brainly/locality-sensitive-hashing-explained-304eb39291e4 | |
https://security.stackexchange.com/questions/44743/is-there-a-hash-algorithm-that-can-identify-similar-files-or-strings | |
https://softwareengineering.stackexchange.com/a/266741 | |
Data deduplication is also often called "record linkage", so you may want to also use that as a search term when researching this. | |
There is an article on the Eventbrite engineering blog that explains how you could greatly reduce the number of file comparisons by using Multi Index Locality Sensitive Hashing . In short, you create a special kind of hash value whereby similar documents will have hash values that are close by. You can then compare potentially similar documents byte for byte because the number of documents you have to compare to is a much smaller set. | |
String similarity | |
TODO: | |
* [ ] Jaro-Winker Similarity (watch from youtube) | |
* [ ] Ratcliff-Obershelp Similarity | |
Types: | |
* Edit distance based | |
* Hamming distance: computed by overlaying one string over another and finding the places where the strings vary. | |
* Con: arrow vs arow's normalization score is 0.4 | |
* Levenshtein distance: computed by finding the number of edits which will transform one string to another. | |
* Jaro Winkler: computed by giving high scores to two strings, if they contain same characters, but within a certain distance from one another, AND the order of the matching characters is same. | |
* Pro: provides a better outcome for arrow vs arow. Normalization score is 0.96 | |
* Token based | |
* Jaccard index: the formulae is to find the number of common tokens and divide it by the total number of unique tokens. numerator/denomintor is intersection (common tokens)/ union (unique tokens) | |
* Sorensen-Dice: the logic is to find the common tokens, multiple by 2, and divide it by the total number of tokens present by comining both sets. | |
* Sequence based | |
* Ratcliff-Obershelp similarity | |
https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment