@santhoshtr
Created February 28, 2020 10:38
Malayalam corpus cleanup script
# Miscellaneous cleanup for a Malayalam text corpus
# Usage: sed -i -f corpora-cleanup.sed corpus/*.txt
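# Chillu normalization: consonant + virama + ZWJ (U+200D) -> atomic chillu
s/ന്\xE2\x80\x8D/ൻ/g
s/ള്\xE2\x80\x8D/ൾ/g
s/ല്\xE2\x80\x8D/ൽ/g
s/ര്\xE2\x80\x8D/ർ/g
s/ണ്\xE2\x80\x8D/ൺ/g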
# Remove ZWNJ (U+200C) at end of words ($ matches end of line, so this assumes one word per line)
s/\xE2\x80\x8C$//
# Remove all remaining ZWJ (U+200D)
s/\xE2\x80\x8D//g
# Remove all soft hyphens (U+00AD)
s/\xC2\xAD//g
# Replace old au sign (U+0D4C) with new one (U+0D57)
s/ൌ/ൗ/g
# Common mistakes
s/പക്ഷെ/പക്ഷേ/g
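# Remove ZWNJ (U+200C) after vowel signs, keeping the vowel sign
s/ു\xE2\x80\x8C/ു/g
s/ി\xE2\x80\x8C/ി/g
s/ോ\xE2\x80\x8C/ോ/g
s/ാ\xE2\x80\x8C/ാ/g
# ഒ + ാ -> ഓ
s/ഒാ/ഓ/g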
# ൻറ -> ന്റ: ൻറെ anywhere, the other forms only at the end of a line
s/ൻറെ/ന്റെ/g
s/ൻറ്$/ന്റ്/g
s/ൻറും$/ന്റും/g
s/ൻറിൽ$/ന്റിൽ/g
# ുൻപോൾ -> ുമ്പോൾ
s/ുൻപോൾ/ുമ്പോൾ/g
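
Several of the rules above act on invisible characters (ZWJ, ZWNJ, soft hyphen), which are hard to verify by eye. The following is a rough Python sketch of the character-level rules only, written with explicit code points. It is not part of the sed script: the function name `normalize_line` is hypothetical, and the replacement targets (atomic chillus, the bare vowel sign, U+0D57, ഓ) are assumptions inferred from the comments in the script.

```python
import re

ZWJ = "\u200d"          # zero width joiner
ZWNJ = "\u200c"         # zero width non-joiner
SOFT_HYPHEN = "\u00ad"
VIRAMA = "\u0d4d"       # Malayalam sign virama (chandrakkala)

# Assumed mapping: consonant + virama + ZWJ -> atomic chillu
CHILLU = {
    "\u0d28": "\u0d7b",  # NA  -> CHILLU N
    "\u0d33": "\u0d7e",  # LLA -> CHILLU LL
    "\u0d32": "\u0d7d",  # LA  -> CHILLU L
    "\u0d30": "\u0d7c",  # RA  -> CHILLU RR
    "\u0d23": "\u0d7a",  # NNA -> CHILLU NN
}

# Vowel signs after which a ZWNJ is dropped: u, i, oo, aa
VOWEL_SIGNS = ("\u0d41", "\u0d3f", "\u0d4b", "\u0d3e")


def normalize_line(line: str) -> str:
    """Apply the character-level cleanup rules to one line of text."""
    # Chillu normalization must run before the remaining ZWJs are removed.
    for consonant, chillu in CHILLU.items():
        line = line.replace(consonant + VIRAMA + ZWJ, chillu)
    line = re.sub(ZWNJ + r"$", "", line)          # ZWNJ at end of line
    line = line.replace(ZWJ, "")                  # all remaining ZWJ
    line = line.replace(SOFT_HYPHEN, "")          # soft hyphens
    line = line.replace("\u0d4c", "\u0d57")       # old au sign -> new au sign
    for sign in VOWEL_SIGNS:
        line = line.replace(sign + ZWNJ, sign)    # keep the sign, drop ZWNJ
    line = line.replace("\u0d12\u0d3e", "\u0d13")  # O + AA sign -> OO
    return line
```

The word-level fixes (പക്ഷെ, ൻറെ, ുൻപോൾ) are left out of the sketch; they are plain string substitutions and are easier to read directly in the sed rules.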