Created June 3, 2012 14:58
strip entities (urls, hashtags, usernames) from a tweet
Note: tweets arrive as JSON on STDIN, one per line.
For each entity in a tweet's entities, grab the start and end positions. Because entities can appear in any order, collect the (start, end) pairs in a list. After collecting all of them, sort the list in reverse and cut each span out of the tweet text, working from the end backwards so that the remaining indices stay valid.
I'll clean this up and put it in a proper repo. It's some yak-shaving I needed to do for my latest data project.
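A minimal sketch of why the spans are cut in reverse order, using a hand-built string rather than a real tweet. The helper names `strip_span` and `strip_spans`, the sample text, and its hand-counted indices are illustrative assumptions and are not part of the script itself.

```python
def strip_span(text, start, end):
    """Return text with the half-open range [start, end) removed."""
    return text[:start] + text[end:]


def strip_spans(text, spans, reverse=True):
    """Remove each (start, end) span from text, in sorted order."""
    for start, end in sorted(spans, reverse=reverse):
        text = strip_span(text, start, end)
    return text


# Hypothetical tweet text; the indices were counted by hand.
text = "RT @bob: hi #yo http://t.co/x"
spans = [(12, 15), (3, 7), (16, 29)]  # hashtag, mention, URL, in arbitrary order

# Right to left: every span still points at the characters it was measured on.
print(repr(strip_spans(text, spans)))

# Left to right, for contrast: removing "@bob" shifts everything after it,
# so the later spans cut the wrong characters (the hashtag survives and
# the URL is only partly removed).
print(repr(strip_spans(text, spans, reverse=False)))
```

Sorting in reverse works here because the spans do not overlap, so cutting a later span never moves the characters an earlier span refers to.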
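#!/usr/bin/env python3
"""Strip entities (URLs, hashtags, usernames) from tweets read as JSON lines on STDIN."""
import json
import sys


def strip_items(text, start_pos, end_pos):
    """Return text with the characters in [start_pos, end_pos) removed."""
    return text[:start_pos] + text[end_pos:]


for itm in sys.stdin:
    line = itm.strip()
    if not line:
        continue  # skip blank lines between tweets
    data = json.loads(line)
    txt = data['text']
    tostrip = []
    print('-' * 50)
    print('original tweet: %s' % txt)
    # Entities are grouped by type; each group is a list of entity dicts.
    for ents in data['entities'].values():
        for ent in ents:
            try:
                start_pos, end_pos = ent['indices']
            except KeyError:
                # This entity carries no indices, so there is nothing to strip.
                print('entity without indices: %r' % (ent,), file=sys.stderr)
                continue
            tostrip.append((start_pos, end_pos))
    # Strip from the end of the text backwards so earlier indices stay valid.
    for span in sorted(tostrip, reverse=True):
        txt = strip_items(txt, *span)
    print('modified tweet: %s' % txt)
print('-' * 50)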
##### sample output
--------------------------------------------------
original tweet: RT @Aepul_Drama: RT @azlanR: #NowListening; Drama Band - Cerita Dia
modified tweet: RT : RT : ; Drama Band - Cerita Dia
--------------------------------------------------
original tweet: hooolis, me duele el ojo
modified tweet: hooolis, me duele el ojo
--------------------------------------------------
original tweet: gonna listen to some Adele and fall asleep now :) http://t.co/Ie1JfC8e
modified tweet: gonna listen to some Adele and fall asleep now :)
--------------------------------------------------
original tweet: Diseño de cuartos de baño Interiorismoonlinenet Vic: http://t.co/ZjOsbRie mailto:ventas@interiorismoonline... http://t.co/tSC6epL5
modified tweet: Diseño de cuartos de baño Interiorismoonlinenet Vic: mailto:ventas@interiorismoonline...
--------------------------------------------------
original tweet: Sans contenir, au contraire des cinq autres, le moindre signe ou allusion religieuse, Rocky 4 est de loin le plus mystique. #PolitiqueEtFoi
modified tweet: Sans contenir, au contraire des cinq autres, le moindre signe ou allusion religieuse, Rocky 4 est de loin le plus mystique.
--------------------------------------------------
original tweet: “@__LickMyChucks All About Dem COWBOYS”
modified tweet: “ All About Dem COWBOYS”
--------------------------------------------------
original tweet: #imsickof people wanting people to be 'real' but not being able to handle the truth !
modified tweet: people wanting people to be 'real' but not being able to handle the truth !