@rivermont
Created May 3, 2018 17:09
Hopefully the best regex for matching URLs from text.
import re
# Matches most URLs written with ASCII letters, digits, and punctuation.
# Does not match non-ASCII characters, and only recognizes the
# top-level domains listed in the pattern.
expression = r'''((https?|ftp):\/\/)?(www\.)?((-|[0-9]|[A-Z]|[a-z])+\.)+(com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|fi|cn|br|be|at|info|pl|dk|cz|cl|hu|nz|il|ie|za|tw|kr|mx|gr|ar|co|ly|gl)(([\/]|[\-\?\+\(\)\=\_\&\%\#\.]|[0-9]|[A-Z]|[a-z])+)*'''
# Matches everything the pattern above does, plus IP addresses.
# Produces many false positives, though, such as '2.0'.
expression1 = r'''((https?|ftp):\/\/)?(www\.)?((-|[0-9]|[A-Z]|[a-z])+\.)+((?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])|com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|fi|cn|br|be|at|info|pl|dk|cz|cl|hu|nz|il|ie|za|tw|kr|mx|gr|ar|co|ly|gl)(([\/]|[\-\?\+\(\)\=\_\&\%\#\.]|[0-9]|[A-Z]|[a-z])+)*'''
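
The file defines the two patterns but does not show them in use. Below is a minimal sketch of one way to apply them. It rebuilds both patterns from shared pieces so the block runs on its own; the names HEAD, TLDS, OCTET, TAIL, URL_RE, URL_OR_IP_RE and find_matches are illustrative assumptions, not part of the original file.

```python
import re

# The two patterns from the gist, reassembled from shared pieces.
# All names in this block are illustrative, not from the original file.
HEAD = r'((https?|ftp):\/\/)?(www\.)?((-|[0-9]|[A-Z]|[a-z])+\.)+'
TLDS = ('com|org|edu|gov|uk|net|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es|mil|fi|'
        'cn|br|be|at|info|pl|dk|cz|cl|hu|nz|il|ie|za|tw|kr|mx|gr|ar|co|ly|gl')
OCTET = r'(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'
TAIL = r'(([\/]|[\-\?\+\(\)\=\_\&\%\#\.]|[0-9]|[A-Z]|[a-z])+)*'

URL_RE = re.compile(HEAD + '(' + TLDS + ')' + TAIL)                     # same as `expression`
URL_OR_IP_RE = re.compile(HEAD + '(' + OCTET + '|' + TLDS + ')' + TAIL)  # same as `expression1`


def find_matches(pattern, text):
    """Return every substring of `text` matched by the compiled `pattern`."""
    # finditer + group(0) yields whole matches. Because the patterns contain
    # several capture groups, re.findall would return tuples of groups instead.
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "Docs at https://www.example.com/path?q=1 and ftp://files.example.org/pub"
    print(find_matches(URL_RE, sample))

    # The IP-aware variant also picks up version-like numbers.
    noisy = "version 2.0 via http://192.168.0.1/admin"
    print(find_matches(URL_RE, noisy))        # no listed TLD, so nothing matches
    print(find_matches(URL_OR_IP_RE, noisy))  # includes the false positive '2.0'
```

One thing to keep in mind when using the results: the trailing path class includes '.', '(' and ')', so sentence punctuation directly after a URL is captured as part of the match (for example, "see example.com." yields "example.com."). Stripping trailing punctuation afterwards is one possible workaround.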