One of the quickest ways to understand a website is to take a glance at its XML sitemap(s). You can usually find the XML sitemap or sitemap index quickly by visiting /robots.txt, since most robots.txt files include the URL (or URLs) of a website’s sitemap(s).
We’ll explore a couple of different ways to quickly pull this data together for easy export into a CSV, Excel file, or Google Sheets.
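If you want to script that discovery step too, here’s a minimal sketch that reads robots.txt and pulls out any Sitemap: lines (standard library only; example.com is a placeholder domain):
import urllib.request

# Hypothetical domain; swap in the site you're looking at.
robots_url = 'https://www.example.com/robots.txt'

with urllib.request.urlopen(robots_url) as resp:
    robots_txt = resp.read().decode('utf-8', errors='ignore')

# The Sitemap directive is case-insensitive, so normalize before matching.
sitemap_urls = [
    line.split(':', 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith('sitemap:')
]
print(sitemap_urls)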
Method 1: Load Sitemap Directly Into Pandas DataFrame
Requirements
pip install pandas
pip install lxml
Simple Way for Regular Sitemaps
import pandas as pd
If you are using pandas version 1.3 or greater, you have access to the pandas.read_xml function. Its first argument can be a file path or a URL (HTTP and HTTPS schemes are valid). In this example, we’ll pull the sitemap for Schema.org:
urls = pd.read_xml('https://schema.org/docs/sitemap.xml')
>>> urls.head()
loc lastmod
0 https://schema.org/3DModel 2021-07-07
1 https://schema.org/AMRadioChannel 2021-07-07
2 https://schema.org/APIReference 2021-07-07
3 https://schema.org/Abdomen 2021-07-07
4 https://schema.org/AboutPage 2021-07-07
As you can see, this is incredibly simple and straight to the point. It’s not perfect, though. For instance, if the sitemap is a sitemap index, the resulting DataFrame is a list of other sitemaps rather than the page URLs you’re after.
Handling Sitemap Index
Let’s pretend that’s the case. Sure, we could repeat this process for each one, but let’s take a different approach and handle it all at once with a simple function. The function catches an exception that needs to be imported, so at the top of your script, alongside pandas, add this import.
from lxml.etree import XMLSyntaxError
Now we can get to the function.
def extract_urls(sitemap, index=True):
    urls = pd.DataFrame()
    if index:
        sitemap_index_df = pd.read_xml(sitemap)
        for xml_sitemap in sitemap_index_df['loc'].tolist():
            try:
                urls = pd.concat([urls, pd.read_xml(xml_sitemap)])
            except XMLSyntaxError:
                print(xml_sitemap, 'unreadable.')
            except UnicodeEncodeError:
                print(xml_sitemap, 'unicode error.')
    else:
        urls = pd.read_xml(sitemap)
    return urls
I’ll walk you through the code to get an understanding of what we’re doing.
The first thing we do is set the variable urls to an empty DataFrame. Then we check whether the sitemap provided is a sitemap index or not. By default, we assume it is; with this method, it is up to you to know which kind you're passing in.
I’m going to jump down to line 12 for a second. This is the else branch for when index is False; in that case, the code does exactly what we wrote above with the Schema.org example.
Back up to line 3. If the sitemap we’re passing in is an index, we read the sitemap index into its own DataFrame and we’ll be iterating over that.
The for loop starting on line 5 iterates over each sitemap listed in the sitemap index, calling pd.read_xml() on each one and concatenating the results into a single DataFrame.
You’ll notice I’m using try and except statements. I won’t go into exception handling here, but these are the exceptions I’ve most commonly run across. I recommend adding any others you regularly see, and keeping a log of the failures so you can check them manually.
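If you’d rather have a log file to review than console output, one small variation on the loop inside the function (the failed_sitemaps.txt file name is just an example) is to collect the problem sitemaps in a list and write them out at the end:
failed = []  # sitemaps that couldn't be parsed

for xml_sitemap in sitemap_index_df['loc'].tolist():
    try:
        urls = pd.concat([urls, pd.read_xml(xml_sitemap)])
    except (XMLSyntaxError, UnicodeEncodeError):
        failed.append(xml_sitemap)

# After the loop, save the failures so you can check them manually.
if failed:
    with open('failed_sitemaps.txt', 'w') as f:
        f.write('\n'.join(failed))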
And that’s it. This simple function will take a sitemap or sitemap index and return a DataFrame of all the URLs found. Once you have the URLs in a DataFrame, you can easily export the file using df.to_csv() or df.to_excel().
>>> wordpress = extract_urls('https://wordpress.org/sitemap.xml', index=True)
>>> wordpress.head()
loc lastmod image
0 https://wordpress.org/about/ 2018-03-28T03:08:37Z NaN
1 https://wordpress.org/about/logos/ 2018-03-28T03:08:37Z NaN
2 https://wordpress.org/about/domains/ 2018-03-28T03:08:37Z NaN
3 https://wordpress.org/about/etiquette/ 2018-03-28T03:08:37Z NaN
4 https://wordpress.org/about/features/ 2018-03-28T03:08:37Z NaN
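Exporting from there is a one-liner. For example, to write the WordPress URLs above to a CSV or an Excel file (the file names are just examples; to_excel also needs an engine such as openpyxl installed):
wordpress.to_csv('wordpress_urls.csv', index=False)
wordpress.to_excel('wordpress_urls.xlsx', index=False)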
A quick note: sometimes you’ll see XML sitemap files listed in robots.txt that end with a .gz file extension – it might look something like example.com/sitemap.xml.gz. This is a gzip file extension indicating that the file has been compressed. Pandas automatically infers the compression type, so you shouldn’t have to do anything special for those.
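For example, reading a gzipped sitemap looks exactly the same as before (hypothetical URL):
urls = pd.read_xml('https://www.example.com/sitemap.xml.gz')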
A word of warning: the quick-and-dirty solution we just went over will not always work. Many sites have protections in place that return errors when you request the sitemap(s) directly via pandas.read_xml.
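Depending on the site and your pandas version, you can sometimes work around this without leaving pandas by sending a browser-like User-Agent; for HTTP(S) URLs, pandas forwards storage_options entries as request headers. A rough sketch (hypothetical URL and example user-agent string):
urls = pd.read_xml(
    'https://www.example.com/sitemap.xml',
    storage_options={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'},
)
If that still fails, Method 2 below gives you full control over the request.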
Method 1 Full Code
import pandas as pd
from lxml.etree import XMLSyntaxError


def extract_urls(sitemap, index=True):
    urls = pd.DataFrame()
    if index:
        sitemap_index_df = pd.read_xml(sitemap)
        for xml_sitemap in sitemap_index_df['loc'].tolist():
            try:
                urls = pd.concat([urls, pd.read_xml(xml_sitemap)])
            except XMLSyntaxError:
                print(xml_sitemap, 'unreadable.')
            except UnicodeEncodeError:
                print(xml_sitemap, 'unicode error.')
    else:
        urls = pd.read_xml(sitemap)
    return urls
Method 2: Make a Request & Parse the Data Separately
Requirements
pip install pandas
pip install requests
pip install beautifulsoup4
Using Requests to Get Sitemap
In the event that Method 1 raises HTTP Error 403: Forbidden, or another common HTTP 4XX or 5XX error, we can usually write some code to work around it. Let’s start by making a GET request to a sitemap that was returning a 403 error when passed directly to pandas. I’ll also import all of the modules we’ll be using for this method.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import gzip
>>> resp = requests.get('https://www.screamingfrog.co.uk/sitemap.xml')
>>> resp.ok
True
>>> resp.status_code
200
It would seem the request came through properly! One of the benefits of making the request separately is that we can also inspect the XML’s root element and identify whether it is a sitemap index or a regular sitemap. This will improve the experience when calling the function later, since you won’t have to define ahead of time whether it’s a sitemap index or not.
In a proper XML sitemap, the root element is one of two tags: sitemapindex or urlset. Let’s check which one our sitemap from above uses. To make this easier, we’ll leverage BeautifulSoup.
>>> soup = BeautifulSoup(resp.content, 'xml')
>>> soup.select('sitemapindex')
[]
>>> soup.select('urlset')
[<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.screamingfrog.co.uk/</loc>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://www.screamingfrog.co.uk/contact-us/</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
...
</urlset>]
Because we are selecting one of these two root elements, one select will always return an empty list and the other will return a list containing the XML tree. Knowing we’ll get one of these two results, we can test for them to construct an IF statement like we did in Method 1.
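In other words, the if statement can hinge directly on the result of the select; roughly:
if soup.select('sitemapindex'):
    # Non-empty list: we're looking at a sitemap index.
    ...
else:
    # Empty list: treat it as a regular urlset sitemap.
    ...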
But before we get into writing our function, there’s one last thing to do: create the DataFrame. Luckily, this is just as straightforward; we simply call pd.read_xml() on the response content.
>>> urls = pd.read_xml(resp.content)
>>> urls.head()
loc changefreq priority
0 https://www.screamingfrog.co.uk/ daily 1.0
1 https://www.screamingfrog.co.uk/contact-us/ daily 0.9
2 https://www.screamingfrog.co.uk/seo-spider/tut... daily 0.9
3 https://www.screamingfrog.co.uk/about/ daily 0.9
4 https://www.screamingfrog.co.uk/search-engine-... daily 0.9
Writing the Function
Now it’s time to put this all together to write the function.
def parse_sitemap(sitemap, **kwargs):
    urls = pd.DataFrame()
    resp = requests.get(sitemap, **kwargs)
    if not resp.ok:
        print(f'Unable to fetch sitemap. Request returned HTTP Response {resp.status_code}. Please check your input.')
        return None
    soup = BeautifulSoup(resp.content, 'xml')
    if soup.select('sitemapindex'):
        sitemaps = pd.read_xml(resp.content)
        for each_sitemap in sitemaps['loc'].tolist():
            resp = requests.get(each_sitemap, **kwargs)
            if resp.ok:
                urls = pd.concat([urls, pd.read_xml(resp.content)])
            else:
                print(f'Unable to fetch {each_sitemap}. Request returned HTTP Response {resp.status_code}.')
    else:
        urls = pd.read_xml(resp.content)
    return urls
Like last time, we start out by assigning urls
to an empty DataFrame. Then we get into making our HTTP request.
On line 4, before we go any further, we check that the request went through. If the HTTP request returned an error (typically a 4XX or 5XX status), the function prints that it was unable to fetch the sitemap, along with the status code, and returns None so you can check your input.
Assuming all goes well, we parse the content of the sitemap into a BeautifulSoup object and check whether we can select a sitemapindex root element. Remember, this will return either the matching results or an empty list. An empty list is a falsy value and a non-empty list is a truthy value. I might get into this in more detail in another article, but I wanted to mention it because beginners might be wondering why there’s no explicit comparison in the IF statement.
Anyway, if it is a sitemap index, we read the index into a DataFrame and then iterate over each sitemap it lists, making a request for each, turning the responses into DataFrames, and concatenating them together. You’ll notice that inside the FOR loop, I’m also checking that each request went through properly.
If the sitemap is a regular sitemap, we can simply turn that initial response into a DataFrame.
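Calling the function looks just like Method 1, except you no longer need to say up front whether the URL is a sitemap index. For example, reusing the Screaming Frog sitemap from earlier:
urls = parse_sitemap('https://www.screamingfrog.co.uk/sitemap.xml')
urls.head()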
What does **kwargs mean?
**kwargs stands for keyword arguments and is a little more advanced, but it’s functionality I wanted to include in this function for more advanced users. What we’ve written won’t be perfect 100% of the time, either; it will probably work for 80-90% of the sitemaps you might be pulling. For another 5-10% of them, you may need to pass different headers with your request. I often pass a Chrome-like user-agent, for example, which can help with 403 and 503 errors.
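Because those extra keyword arguments are passed straight through to requests.get, sending custom headers is as simple as this (the user-agent string and URL below are just examples):
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
}
urls = parse_sitemap('https://www.example.com/sitemap.xml', headers=headers)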
Handling Gzipped XML Sitemaps
What we’ve written can’t handle gzipped XML sitemaps like Method 1 can. We can easily fix that with a little help from gzip, a library built into Python. We already added the import at the top of the script with the rest of our modules:
import gzip
Then in our function, we need to add a way to check if the response is a gzip file. Again, we can normally tell this by the .gz file extension, but if we’re parsing a sitemap index, we might not be looking at the file extensions. If the response is gzip, we need to decompress it before parsing it.
if resp.headers['Content-Type'] == 'application/x-gzip':
    content = gzip.decompress(resp.content)
else:
    content = resp.content
Just like the response has a status_code and content, it has headers. The response headers tell us all sorts of information about the page and they come in key-value pairs. We’re specifically checking if the Content-Type
header is application/x-gzip
, indicating it is a gzip file. If it is, we call the decompress()
function from the gzip
built-in library. Being built-in, you don’t need to pip install it. We set the decompressed content to a variable we call content
. If it’s not compressed, we just set the same variable to resp.content
. That’s all it takes to handle .xml.gz sitemap files!
Don’t forget: for a sitemap index, we need to make this same check for each sitemap inside it. Whether or not the index itself is compressed tells us nothing about whether its child sitemaps are.
Method 2 Full Code
import pandas as pd
from bs4 import BeautifulSoup
import requests
import gzip


def parse_sitemap(sitemap, **kwargs):
    urls = pd.DataFrame()
    resp = requests.get(sitemap, **kwargs)
    if not resp.ok:
        print(f'Unable to fetch sitemap. Request returned HTTP Response {resp.status_code}. Please check your input.')
        return None
    # Decompress gzipped responses before parsing.
    if resp.headers['Content-Type'] == 'application/x-gzip':
        content = gzip.decompress(resp.content)
    else:
        content = resp.content
    soup = BeautifulSoup(content, 'xml')
    if soup.select('sitemapindex'):
        # Sitemap index: fetch and parse each child sitemap it lists.
        sitemaps = pd.read_xml(content)
        for each_sitemap in sitemaps['loc'].tolist():
            resp = requests.get(each_sitemap, **kwargs)
            if resp.ok:
                if resp.headers['Content-Type'] == 'application/x-gzip':
                    content = gzip.decompress(resp.content)
                else:
                    content = resp.content
                urls = pd.concat([urls, pd.read_xml(content)])
            else:
                print(f'Unable to fetch {each_sitemap}. Request returned HTTP Response {resp.status_code}.')
    else:
        # Regular sitemap: parse it directly.
        urls = pd.read_xml(content)
    return urls
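Since the same decompression check now appears twice, one optional tidy-up is to pull it into a small helper; here’s a sketch (the get_sitemap_content name is just a suggestion, not part of the code above):
def get_sitemap_content(url, **kwargs):
    """Fetch a URL and return its (decompressed) body, or None on an HTTP error."""
    resp = requests.get(url, **kwargs)
    if not resp.ok:
        print(f'Unable to fetch {url}. Request returned HTTP Response {resp.status_code}.')
        return None
    if resp.headers.get('Content-Type') == 'application/x-gzip':
        return gzip.decompress(resp.content)
    return resp.content
parse_sitemap can then call this once for the index and once per child sitemap, skipping any sitemap that comes back as None.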