@trevery
Last active May 19, 2017 02:14
Python: using BeautifulSoup to collect all the URLs on a page
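from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Paths already seen, so that each page is fetched only once
pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    # Name the parser explicitly; otherwise bs4 warns and picks one itself
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only internal article links, whose href starts with /wiki/
    for link in bsObj.find_all("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                # Recurse into the new page; a long crawl will eventually
                # hit Python's recursion limit (1000 frames by default)
                getLinks(newPage)

# Start from the site root
getLinks("")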
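
Because getLinks calls itself for every new link, a long crawl will sooner or later exceed Python's recursion limit. One possible way around this, sketched below, is to keep the pending pages in a queue and cap the number of pages visited. This is not part of the original script: the names crawl, get_links, max_pages and fake_site are all hypothetical, and the link lookup is passed in as a function so the example runs without any network access.

```python
from collections import deque


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl starting at `start`.

    `get_links(path)` is an assumed callable that returns the paths
    linked from `path`. Returns the visited paths in visiting order.
    """
    seen = {start}          # every path ever queued
    order = []              # paths in the order they were visited
    queue = deque([start])  # paths waiting to be visited
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory link graph standing in for real pages
fake_site = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": [],
}

print(crawl("", lambda path: fake_site.get(path, [])))
# → ['', '/wiki/A', '/wiki/B', '/wiki/C']
```

To crawl real pages, get_links could wrap the urlopen and BeautifulSoup calls from the script above and return the matching href values. The max_pages cap keeps the crawl bounded regardless of how many links the site contains.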