Created
February 15, 2020 22:35
Save zaidalyafeai/3dab96776b570eaceee6f186a174b12a to your computer and use it in GitHub Desktop.
import re
import pyarabic.araby as araby

t = "blah blah"
t = araby.strip_tashkeel(t)  # remove tashkeel (diacritics)
t = re.sub(r'([-؟،.!;:])', r' \1 ', t)  # add spaces around the punctuation marks we keep
t = re.sub(r'([^\s\w\-؟،.!;:])+', '', t)  # remove all special characters except those marks
t = re.sub(r'[³ـ¼]', '', t)  # explicitly remove some special characters
t = re.sub(r'[a-zA-Z]', '', t)  # remove English letters ([a-zA-z] also matched [ \ ] ^ _ `)
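The same steps can be sketched as one reusable function. This is only an illustration under a few assumptions: `clean_text`, `keep_english`, and the `TASHKEEL` pattern are hypothetical names that do not appear in the gist; the regex range U+064B to U+0652 stands in for `araby.strip_tashkeel` so the block runs without PyArabic; and the final whitespace collapse is an extra step added here. The `keep_english` flag makes the English-letter removal optional, since that is the step questioned in the comment.

```python
import re

# Assumption: stand-in for araby.strip_tashkeel, covering the eight
# Arabic diacritics from FATHATAN (U+064B) to SUKUN (U+0652).
TASHKEEL = re.compile(r'[\u064B-\u0652]')


def clean_text(text, keep_english=True):
    """Hypothetical wrapper around the cleaning steps in the snippet."""
    text = TASHKEEL.sub('', text)                   # remove tashkeel
    text = re.sub(r'([-؟،.!;:])', r' \1 ', text)    # pad the kept punctuation with spaces
    text = re.sub(r'[^\s\w\-؟،.!;:]+', '', text)    # drop every other special character
    text = re.sub(r'[³ـ¼]', '', text)               # drop characters that \w still matches
    if not keep_english:
        text = re.sub(r'[a-zA-Z]', '', text)        # optionally drop English letters
    return re.sub(r'\s+', ' ', text).strip()        # extra step: collapse repeated spaces


print(clean_text("hello, world!"))                      # English letters kept
print(clean_text("hello, world!", keep_english=False))  # English letters removed
```

With the default `keep_english=True` the English letters survive, so whether to drop them becomes a per-call decision instead of a hard-coded last line.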
In the last line, you are removing English letters. We agreed not to remove them. What changed your mind?