Pulling XML Sitemaps Using Pandas

One of the quickest ways to understand a website is to take a glimpse at its XML sitemap(s). As a user, you can usually find the XML sitemap or sitemap index pretty quickly by visiting /robots.txt. Most robots.txt files will include the URL (or multiple URLs) of a website’s sitemap(s).
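For example, a robots.txt file will often contain one or more lines like the following (the URL here is purely illustrative):

Sitemap: https://www.example.com/sitemap_index.xml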

We’ll explore a couple of different ways to quickly pull this data together for easy export into a CSV, Excel file, or Google Sheets.

Method 1: Load Sitemap Directly Into Pandas DataFrame

Requirements

pip install pandas
pip install lxml

Simple Way for Regular Sitemaps

import pandas as pd

If you are using pandas version 1.3 or greater, you have access to the pandas.read_xml function. The first argument of the function can be a file path or a URL fetched over HTTP(S). In this example, we’ll be pulling the sitemap for Schema.org.

>>> urls = pd.read_xml('https://schema.org/docs/sitemap.xml')
>>> urls.head()

                                 loc     lastmod
0         https://schema.org/3DModel  2021-07-07
1  https://schema.org/AMRadioChannel  2021-07-07
2    https://schema.org/APIReference  2021-07-07
3         https://schema.org/Abdomen  2021-07-07
4       https://schema.org/AboutPage  2021-07-07

As you can see, this is incredibly simple and straight to the point. It’s not perfect, though. For instance, if the sitemap is a sitemap index, your resulting DataFrame will be a list of other sitemaps rather than the page URLs you’re after.

Handling Sitemap Index

Let’s pretend that’s the case. Sure, we could repeat this process for each one, but let’s take a different approach and handle it all at once with a simple function. In this function, we’ll be catching an exception that needs to be imported, so at the top of your script, where you imported pandas, add this import.

from lxml.etree import XMLSyntaxError

Now we can get to the function.

def extract_urls(sitemap, index=True):
    urls = pd.DataFrame()
    if index:
        sitemap_index_df = pd.read_xml(sitemap)
        for xml_sitemap in sitemap_index_df['loc'].tolist():
            try:
                urls = pd.concat([urls, pd.read_xml(xml_sitemap)])
            except XMLSyntaxError:
                print(xml_sitemap, 'unreadable.')
            except UnicodeEncodeError:
                print(xml_sitemap, 'unicode error.')
    else:
        urls = pd.read_xml(sitemap)
    return urls

I’ll walk you through the code to get an understanding of what we’re doing.

The first thing we do is set the variable urls to an empty DataFrame. Then we check whether the sitemap provided is a sitemap index or not. By default, we assume it is a sitemap index. With this method, it is up to you to know which one you’re passing in.

I’m going to jump down to the else branch for a second. This is what runs when index is False, in which case the code executes exactly what we wrote above with the Schema.org example.

Back up to the if branch. If the sitemap we’re passing in is an index, we read the sitemap index into its own DataFrame, and we’ll be iterating over that.

The for loop then goes over each sitemap in the list we pulled from the sitemap index and builds a full DataFrame by calling pd.read_xml() on each sitemap and concatenating the results together.

You’ll notice I’m using try and except statements. I won’t go into exception handling in detail here, but these two are the ones I’ve most commonly run across. I recommend adding any others you commonly see and keeping a log of failures to check manually, as sketched below.
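As a rough sketch of that logging idea (the function name extract_urls_with_log and the returned failures list are my own additions, not part of the function above), you could collect anything unreadable into a list and review it afterwards:

import pandas as pd
from lxml.etree import XMLSyntaxError

def extract_urls_with_log(sitemap_index):
    # Like extract_urls with index=True, but also returns the sitemaps that couldn't be parsed
    urls = pd.DataFrame()
    failures = []  # (sitemap_url, error_name) pairs to check manually later
    for xml_sitemap in pd.read_xml(sitemap_index)['loc'].tolist():
        try:
            urls = pd.concat([urls, pd.read_xml(xml_sitemap)])
        except (XMLSyntaxError, UnicodeEncodeError) as error:
            failures.append((xml_sitemap, type(error).__name__))
    return urls, failures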

And that’s it. The extract_urls function will take a sitemap or sitemap index and return a DataFrame of all the URLs found. Once you have the URLs in a DataFrame, you can easily export the file using df.to_csv() or df.to_excel().
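For example, assuming your DataFrame is called urls (the output filenames below are just placeholders, and the Excel export needs the openpyxl package installed):

urls.to_csv('sitemap_urls.csv', index=False)
urls.to_excel('sitemap_urls.xlsx', index=False)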

>>> wordpress = extract_urls('https://wordpress.org/sitemap.xml', index=True)
>>> wordpress.head()

                                      loc               lastmod  image
0            https://wordpress.org/about/  2018-03-28T03:08:37Z    NaN
1      https://wordpress.org/about/logos/  2018-03-28T03:08:37Z    NaN
2    https://wordpress.org/about/domains/  2018-03-28T03:08:37Z    NaN
3  https://wordpress.org/about/etiquette/  2018-03-28T03:08:37Z    NaN
4   https://wordpress.org/about/features/  2018-03-28T03:08:37Z    NaN

A quick note: sometimes the XML sitemap URLs listed in robots.txt end with a .gz file extension – it might look something like example.com/sitemap.xml.gz. This is a gzip file extension indicating that the file has been compressed. Pandas automatically infers the compression type, so you shouldn’t have to do anything special for those.
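For example, something like this (the URL is hypothetical) should behave exactly the same as reading an uncompressed sitemap:

urls = pd.read_xml('https://www.example.com/sitemap.xml.gz')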

A word of warning: the quick and dirty solution we just went over will not always work. Many sites have protections in place that will throw errors when you try to request the sitemap(s) directly via pandas.read_xml.

Method 1 Full Code

import pandas as pd
from lxml.etree import XMLSyntaxError


def extract_urls(sitemap, index=True):
    urls = pd.DataFrame()
    if index:
        sitemap_index_df = pd.read_xml(sitemap)
        for xml_sitemap in sitemap_index_df['loc'].tolist():
            try:
                urls = pd.concat([urls, pd.read_xml(xml_sitemap)])
            except XMLSyntaxError:
                print(xml_sitemap, 'unreadable.')
            except UnicodeEncodeError:
                print(xml_sitemap, 'unicode error.')
    else:
        urls = pd.read_xml(sitemap)
    return urls

Method 2: Make a Request & Parse the Data Separately

Requirements

pip install pandas
pip install lxml
pip install requests
pip install beautifulsoup4

Using Requests to Get Sitemap

In the event that Method 1 raises an HTTP Error 403: Forbidden, or another typical HTTP 4XX or 5XX error, we can usually write some code to accommodate that. Let’s first try making a GET request to a sitemap that was giving us a 403 error when passed directly to pandas. I’ll also import all of the modules we’ll be using for this method.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import gzip
>>> resp = requests.get('https://www.screamingfrog.co.uk/sitemap.xml')
>>> resp.ok

True

>>> resp.status_code

200

It would seem the request came through properly! One of the benefits of making the request separately is that we can also inspect the XML and identify whether it is a sitemap index or a regular sitemap. This will improve the experience when calling the function later, since you won’t have to specify ahead of time whether it’s a sitemap index or not.

In a proper XML sitemap, there are two possible root elements: sitemapindex and urlset. Let’s check which one our sitemap from above uses. To make this easier, we’ll be leveraging BeautifulSoup.

>>> soup = BeautifulSoup(resp.content, 'xml')
>>> soup.select('sitemapindex')

[]

>>> soup.select('urlset')

[<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
 <loc>https://www.screamingfrog.co.uk/</loc>
 <changefreq>daily</changefreq>
 <priority>1.0</priority>
 </url>
 <url>
 <loc>https://www.screamingfrog.co.uk/contact-us/</loc>
 <changefreq>daily</changefreq>
 <priority>0.9</priority>
 </url>
 <url>
 ...
 </urlset>]

Because we are selecting on one of these two root elements, one select will always return an empty list and the other will return the XML tree. Since we know we’ll get one of these two outcomes, we can test for them to construct an if statement like we did in Method 1.

But before we get into writing our function, there’s one last thing we have to do: create the DataFrame. Lucky for us, this is just as straightforward. We call pd.read_xml() on the response content.

>>> urls = pd.read_xml(resp.content)
>>> urls.head()

                                                 loc changefreq  priority
0                   https://www.screamingfrog.co.uk/      daily       1.0
1        https://www.screamingfrog.co.uk/contact-us/      daily       0.9
2  https://www.screamingfrog.co.uk/seo-spider/tut...      daily       0.9
3             https://www.screamingfrog.co.uk/about/      daily       0.9
4  https://www.screamingfrog.co.uk/search-engine-...      daily       0.9

Writing the Function

Now it’s time to put this all together to write the function.

def parse_sitemap(sitemap, **kwargs):
    urls = pd.DataFrame()
    resp = requests.get(sitemap, **kwargs)
    if not resp.ok:
        print(f'Unable to fetch sitemap. Request returned HTTP Response {resp.status_code}. Please check your input.')
        return None
    soup = BeautifulSoup(resp.content, 'xml')
    if soup.select('sitemapindex'):
        sitemaps = pd.read_xml(resp.content)
        for each_sitemap in sitemaps['loc'].tolist():
            resp = requests.get(each_sitemap, **kwargs)
            if resp.ok:
                urls = pd.concat([urls, pd.read_xml(resp.content)])
            else:
                print(f'Unable to fetch {each_sitemap}. Request returned HTTP Response {resp.status_code}.')
    else:
        urls = pd.read_xml(resp.content)
    return urls

Like last time, we start out by assigning urls to an empty DataFrame. Then we get into making our HTTP request.

After making the request, we check before going any further that it actually went through. If the HTTP request returned an error (typically a 4XX or 5XX status), the function prints that it was unable to fetch the sitemap along with the HTTP status code, and returns None so you can check your input.

Assuming all goes well, we parse the content of the sitemap into a BeautifulSoup object and check whether we can select the sitemapindex root element. Remember, this will either return the matching elements or an empty list. An empty list is a falsy value and a non-empty list is a truthy value. I might get into this in more detail in another article, but I wanted to mention it because beginners might wonder why I’m not adding an explicit comparison to the if statement.
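If you want to see that behavior for yourself in the interpreter:

>>> bool([])
False
>>> bool(['anything'])
True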

Anyway, if the value resolves to a sitemap index, we put the sitemap index into a DataFrame and then iterate over each sitemap, making requests, turning them into DataFrames, and concatenating them together. You’ll notice that inside the for loop I’m also checking that each request went through properly.

If the sitemap is just a regular sitemap, we can just turn that initial response into a DataFrame.
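Calling it looks something like this (re-using the Screaming Frog sitemap from earlier):

>>> urls = parse_sitemap('https://www.screamingfrog.co.uk/sitemap.xml')

This should return the same DataFrame we built by hand above, and the function will handle a sitemap index just as happily without you telling it which kind it is.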

What does **kwargs mean?

**kwargs stands for keyword arguments and is a little more advanced, but it’s functionality I wanted to include in this function for the more advanced users. What we’ve written isn’t perfect 100% of the time, either. It will probably work for 80-90% of the sitemaps you might be pulling. For another 5-10% of them, you might need to pass different headers into your request. For example, I often pass a Chrome-like user-agent, which can help in situations where I’m hitting 403 and 503 errors.
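Because the extra keyword arguments are passed straight through to requests.get(), a call like this (the user-agent string is just an illustration and the URL is hypothetical) will send a browser-like header with every request the function makes:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}
urls = parse_sitemap('https://www.example.com/sitemap.xml', headers=headers)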

Handling Gzipped XML Sitemaps

What we’ve written can’t handle gzipped XML sitemaps like Method 1 can. We can easily fix that with a little help from a built-in library called gzip. First, we’ll add the import to the top of the script.

import gzip

Then in our function, we need to add a check for whether the response is a gzip file. Again, we can normally tell by the .gz file extension, but if we’re parsing a sitemap index, we might not be looking at the file extensions. If the response is gzipped, we need to decompress it before parsing it.

    # .get() avoids a KeyError in case the server doesn't send a Content-Type header
    if resp.headers.get('Content-Type') == 'application/x-gzip':
        content = gzip.decompress(resp.content)
    else:
        content = resp.content

Just like the response has a status_code and content, it has headers. The response headers tell us all sorts of information about the response, and they come in key-value pairs. We’re specifically checking whether the Content-Type header is application/x-gzip, indicating a gzip file. If it is, we call the decompress() function from the built-in gzip library (being built-in, you don’t need to pip install it) and set the decompressed content to a variable we call content. If it’s not compressed, we just set the same variable to resp.content. That’s all there is to handling .xml.gz sitemap files!
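If you want to convince yourself of what decompress() is doing, a quick round trip in the interpreter makes it clear:

>>> compressed = gzip.compress(b'<urlset></urlset>')
>>> gzip.decompress(compressed)
b'<urlset></urlset>'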

Don’t forget: for a sitemap index, we need to make this same check for each sitemap inside of it. Just because a sitemap index is (or isn’t) compressed does not mean the child sitemaps are.

Method 2 Full Code

import pandas as pd
from bs4 import BeautifulSoup
import requests
import gzip

def parse_sitemap(sitemap, **kwargs):
    urls = pd.DataFrame()
    resp = requests.get(sitemap, **kwargs)
    if not resp.ok:
        print(f'Unable to fetch sitemap. Request returned HTTP Response {resp.status_code}. Please check your input.')
        return None
    if resp.headers.get('Content-Type') == 'application/x-gzip':
        content = gzip.decompress(resp.content)
    else:
        content = resp.content
    soup = BeautifulSoup(content, 'xml')
    if soup.select('sitemapindex'):
        sitemaps = pd.read_xml(content)
        for each_sitemap in sitemaps['loc'].tolist():
            resp = requests.get(each_sitemap, **kwargs)
            if resp.ok:
                if resp.headers.get('Content-Type') == 'application/x-gzip':
                    content = gzip.decompress(resp.content)
                else:
                    content = resp.content
                urls = pd.concat([urls, pd.read_xml(content)])
            else:
                print(f'Unable to fetch {each_sitemap}. Request returned HTTP Response {resp.status_code}.')
    else:
        urls = pd.read_xml(content)
    return urls
