Created
March 21, 2016 08:36
Gist: hallvors/bef5957658f04315fef6
Using tldextract to remove www. safely and extract the domain name and its public suffix
import tldextract


def extract_domain_name(url):
    '''Extract the domain name from a given URL'''
    prefix_blacklist = ['www']
    parts = tldextract.extract(url)
    # We want to drop any prefixes mentioned in the blacklist.
    # They typically do not add information that's useful to
    # distinguish the "identity" of a specific site.
    # Sometimes the blacklisted prefix is only part of the subdomain,
    # for example when parsing www.mail.example.com, so filter the
    # subdomain label by label instead of comparing the whole string.
    labels = [label for label in parts.subdomain.split('.')
              if label and label not in prefix_blacklist]
    # Skip empty pieces so that a missing subdomain or suffix does not
    # leave a stray leading or trailing dot
    tail = [piece for piece in (parts.domain, parts.suffix) if piece]
    return '.'.join(labels + tail)
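
The prefix-dropping step can also be tried on its own, without installing tldextract or fetching the public suffix list. The sketch below is only an illustration under stated assumptions: `Parts` is a hypothetical stand-in that models just the `subdomain`, `domain` and `suffix` fields of the object `tldextract.extract()` returns, and `join_without_prefixes` is a made-up helper name, not part of the gist or of tldextract.

```python
from collections import namedtuple

# Hypothetical stand-in for the result of tldextract.extract();
# only the three fields used below are modelled.
Parts = namedtuple('Parts', ['subdomain', 'domain', 'suffix'])

PREFIX_BLACKLIST = ['www']


def join_without_prefixes(parts, blacklist=PREFIX_BLACKLIST):
    '''Rebuild a host name, dropping blacklisted subdomain labels.'''
    # Compare whole labels, so "www" is dropped but "wwwfoo" is kept
    labels = [label for label in parts.subdomain.split('.')
              if label and label not in blacklist]
    # Skip empty pieces to avoid stray leading or trailing dots
    tail = [piece for piece in (parts.domain, parts.suffix) if piece]
    return '.'.join(labels + tail)


print(join_without_prefixes(Parts('www', 'example', 'co.uk')))    # example.co.uk
print(join_without_prefixes(Parts('www.mail', 'example', 'com')))  # mail.example.com
print(join_without_prefixes(Parts('', 'example', 'com')))          # example.com
```

Filtering by label rather than by substring is a design choice in this sketch: it keeps subdomains that merely contain the letters "www" intact.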