
@mustafa-zidan
Last active September 24, 2018 07:12
Code challenge solution for Finway
import argparse

# from nltk.corpus import stopwords
# stop_words = set(stopwords.words('english'))


def word_frequency(file_name):
    frequency = {}
    with open(file_name, "r") as lines:
        for line in lines:
            # Each line is trimmed and lower-cased; stemming and
            # stop-word removal are still to be done.
            for word in line.strip().lower().split():
                # if word not in stop_words:
                count = frequency.get(word, 0)
                frequency[word] = count + 1
    for word in frequency:
        print(word, frequency[word])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='calculate the frequencies of words in a file')
    parser.add_argument('--input-files', help='input file to calculate the frequency of')
    args = parser.parse_args()
    word_frequency(args.input_files)
mustafa-zidan commented Sep 23, 2018

This was the fastest solution to implement within the timeframe. There is a lot to improve, and here is what needs to be done from my perspective:

  • We need to remove all the stop words, which can be done with an external library such as NLTK.
  • The words should be split with a regex instead of plain whitespace, to handle cases like -, (2d) or sound".[1]
  • I thought of using map-reduce to speed up the frequency calculation, but I am not sure how the GIL would behave in such a case. I'd prefer a proper distributed computing platform like Spark.
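The first two improvements above could be sketched roughly as follows. This is a minimal, hypothetical version: the regex pattern and the tiny STOP_WORDS set are stand-ins of my own choosing (the real stop-word list would come from NLTK's stopwords corpus).

```python
import re

# Small stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}


def word_frequency(text):
    frequency = {}
    # Keep only alphanumeric runs (plus apostrophes), so tokens like
    # "-", "(2d)" or a trailing quote no longer produce spurious words.
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in STOP_WORDS:
            frequency[word] = frequency.get(word, 0) + 1
    return frequency


print(word_frequency("the sound of (2d) art - and the 2d art"))
```

Reading the whole file into memory is fine for small inputs; for large files the same function could be fed line by line as in the original script.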
