@vpetersson
Last active October 8, 2019 13:54
Parse and dump a sitemap (using Python)
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
Based on http://www.craigaddyman.com/parse-an-xml-sitemap-with-python/
"""
from bs4 import BeautifulSoup
import requests

url = "https://www.domain.com/sitemap.xml"
get_url = requests.get(url)

if get_url.status_code == 200:
    # Name the parser explicitly; html.parser ships with Python.
    soup = BeautifulSoup(get_url.text, "html.parser")
    # Print every <loc> entry, one URL per line, to standard output.
    for loc in soup.find_all("loc"):
        print(loc.text)
else:
    print("Unable to fetch sitemap.")
@HQJaTu

HQJaTu commented Jul 11, 2018

The code above doesn't account for a <sitemap><loc> URL that has arguments (a query string) in it.

I created an improved version at https://gist.github.com/HQJaTu/cd66cf659b8ee633685b43c5e7e92f05 to address that issue. The obvious solution is to first parse the URL and then check its path part.
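
As a rough illustration of "parse the URL, then check the path part", here is a minimal sketch using the standard library's urllib.parse. It is not taken from the linked gist. The helper name looks_like_xml_sitemap and the choice to test for a ".xml" suffix are assumptions made for this example only.

```python
from urllib.parse import urlparse


def looks_like_xml_sitemap(url):
    """Return True if the path part of `url` ends in ".xml".

    Hypothetical helper: the ".xml" suffix test is an assumption for this
    sketch, not something the original gist or the comment specifies.
    """
    # urlparse splits the URL, so the query string ("?page=2") is kept
    # separate from the path ("/sitemap.xml").
    path = urlparse(url).path
    return path.lower().endswith(".xml")


sitemap_url = "https://www.domain.com/sitemap.xml?page=2"

# Checking only the path ignores the query arguments.
print(looks_like_xml_sitemap(sitemap_url))  # True

# A naive check on the whole string is thrown off by the query arguments.
print(sitemap_url.endswith(".xml"))  # False
```

Checking the parsed path rather than the raw string means a URL with arguments is still treated the same as its argument-free form.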

@dvir-cdsoft

Hi,
sorry for the question, I'm new to Python.
Where does the code dump the sitemap?
Do I need to add any code to write it to a file?
