One of the biggest advantages of using the Google Search Console API instead of the web interface is that you can pull far more rows of performance data at once. If you're used to the Search Console UI, you know it only shows up to 1,000 rows at a time, and you can only export the page you're currently looking at, which means you can end up stitching together a lot of files.
Using the Search Analytics method of the Google Search Console API, the default number of rows returned is 1,000, but in the body of the request you can set the rowLimit to a maximum of 25,000. Already, we're talking about a 25x gain in rows. Depending on the reports we run and the sites we work on, even 25,000 might not be enough. So, how do we pull all the rows? I'll walk you through that now!
Requirements
pip install pandas
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
Preparing the Python Script
To start, we’re going to need the code to authenticate the script and build the service for us to use. We have an entire article dedicated to building and explaining the code to connect to the Google Search Console API. You can either copy and paste this from a previous script or you can import the Google Search Console Python file we created before. For the purpose of this article, we’re going to act like we don’t have an importable file.
# Import Modules
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
import pandas as pd
# Define function to get authorization
def gsc_auth(scopes):
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', scopes)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', scopes)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    service = build('searchconsole', 'v1', credentials=creds)
    return service
# Authenticate API
scopes = ['https://www.googleapis.com/auth/webmasters']
service = gsc_auth(scopes)
If you have followed the other articles, this should look pretty familiar to you.
The next thing we're going to do is set a start_row variable to 0 and create an empty Pandas DataFrame.
start_row = 0
all_search_analytics_df = pd.DataFrame()
First 25,000 Rows
For now, we’re going to start exploring. What we’re going to be writing in this section will kinda be included in the final code, but there will be more to it. Let’s build a request body and make the Search Analytics API call.
sa_request = {
    "startDate": "2022-01-01",
    "endDate": "2022-03-15",
    "dimensions": [
        "QUERY",
        "PAGE"
    ],
    "rowLimit": 25000,
    "startRow": start_row
}
gsc_search_analytics = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                                        body=sa_request).execute()
Notice that we are setting the startRow to the variable we created earlier, start_row. The default value for startRow, if you don't include it in the request body, is 0. Being that this is the first 25,000 rows, we technically didn't need to include it.
>>> len(gsc_search_analytics['rows'])
25000
Great! Let’s append these rows to the DataFrame we created earlier.
all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])
There's one important thing I want to point out here. If you've used Pandas in the past, you might be used to appending rows with the .append method. I'm using pd.concat() instead of DataFrame.append() because the .append method was deprecated with the release of Pandas 1.4. .append will still work for now, but it will be removed in a future release, so it's best to make the switch when you can.
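To see the difference side by side, here's a quick, made-up illustration; the row data below is fabricated just for the example and isn't from a real API response:

import pandas as pd

# Fabricated rows shaped like a Search Analytics response
rows = [{"keys": ["example query", "/example-page/"], "clicks": 10,
         "impressions": 100, "ctr": 0.1, "position": 5.0}]

combined_df = pd.DataFrame()

# Deprecated since Pandas 1.4 (still works for now, but will be removed):
# combined_df = combined_df.append(pd.DataFrame(rows))

# Preferred going forward:
combined_df = pd.concat([combined_df, pd.DataFrame(rows)])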
Second 25,000 Rows
In order to pull the next 25,000 rows of data, we need to start where we left off. Let's set our start_row variable to 25,000 and point the request body's startRow value at start_row so it picks up the new value.
start_row += 25000
sa_request['startRow'] = start_row
Why 25000 instead of 25001? Simple, really. Our starting row was 0, not 1, so 25,000 rows of data run from row 0 to row 24,999. You can verify this yourself by setting a small rowLimit like 10, pulling the data, then setting the startRow to 10 and pulling again. You'll find that the data doesn't repeat.
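If you want to run that sanity check yourself, here's a minimal sketch. The variable names are just for the example, and I'm assuming the same service and property as above:

test_request = {
    "startDate": "2022-01-01",
    "endDate": "2022-03-15",
    "dimensions": ["QUERY", "PAGE"],
    "rowLimit": 10,
    "startRow": 0
}
first_ten = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                            body=test_request).execute()
test_request['startRow'] = 10
next_ten = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                           body=test_request).execute()
# The query/page combinations in the two pulls should not overlap
print([row['keys'] for row in first_ten['rows']])
print([row['keys'] for row in next_ten['rows']])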
Also take note that we don't have to completely rewrite the sa_request dictionary. We can set a new value by using the key. Everything else remains the same!
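As an aside, the same key-based update works for any field in the request body. We're not changing anything else here; the lines below are purely a hypothetical illustration of the pattern:

# Hypothetical only: any other key can be updated the same way
# sa_request['startDate'] = "2022-02-01"
# sa_request['endDate'] = "2022-02-28"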
With the new request body set, we can make another data pull.
gsc_search_analytics = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                                        body=sa_request).execute()
>>> len(gsc_search_analytics['rows'])
25000
Looks familiar. Let’s append it to the DataFrame.
all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])
By appending the new set of data to the DataFrame, we can start stitching together the report.
>>> all_search_analytics_df.shape
(50000, 5)
Nth 25,000 Rows
We could keep doing this until we reach the end, but how do we know how many rows are available in the report? Well, we don’t. Google doesn’t tell us how many rows are available. We are left to our own devices to figure that out.
Luckily, it’s pretty easy to determine when we’ve reached the end of the report. We’ll eventually run out of rows. When we are requesting up to 25,000 rows, for example, and we only get back 3,958 rows, we know we’ve hit the end and we can stop asking.
You might have experience with loops in Python already, which is a great start. If not, no big deal. I’ll be walking through some of the logic.
First, let’s rewind back to the first 25,000 rows.
start_row = 0
all_search_analytics_df = pd.DataFrame()
I'm intentionally skipping setting the request body's startRow key here because we'll be setting it inside the loop.
Because we don’t know how many API calls we’ll need to make, we’ll need to use a WHILE loop to get them all.
A WHILE loop basically says, “while a condition is true, keep looping”.
Let’s write the code, BUT DON’T RUN IT YET!
while True:
    sa_request['startRow'] = start_row
    gsc_search_analytics = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                                            body=sa_request).execute()
    all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])
    start_row += 25000
What this snippet of code is doing is setting the request body's startRow to the value assigned to start_row, making the request, appending the data to the DataFrame, adding 25,000 to start_row, and then looping over it all again and again.
Pretty great, right? One problem. We're stuck in an infinite loop. while True will never resolve to being False, so we'll be requesting rows of data long after we've run out of data to request. The kicker is that the Google Search Console API considers these requests to be valid, even if no rows are returned.
Luckily, we've already discussed the logic we need to know when our report is done – when the number of rows returned is less than the rowLimit, in this case, 25,000.
So, between appending to the DataFrame and adding 25,000 to start_row, we need to add a check to see how many rows we got back; if it's less than 25,000, we can break the loop.
if len(gsc_search_analytics['rows']) < 25000:
    break
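Slotted into place, the loop now looks like this:

while True:
    sa_request['startRow'] = start_row
    gsc_search_analytics = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                                            body=sa_request).execute()
    all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])
    # Fewer rows than we asked for means we've reached the end of the report
    if len(gsc_search_analytics['rows']) < 25000:
        break
    start_row += 25000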
In 99.99% of cases, a stat I just made up, this code is good to go. There's one potential bug lingering in it, though. Let's walk through a quick example.
>>> start_row = 1000000000
>>> sa_request['startRow'] = start_row
>>> gsc_search_analytics = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com', body=sa_request).execute()
>>> print(gsc_search_analytics)
{'responseAggregationType': 'byPage'}
I set my start_row to 1 billion, updated the request body, made the request, and then printed the resulting dictionary. Notice anything missing? gsc_search_analytics doesn't have a key called 'rows'. This will cause a problem when trying to create a DataFrame using gsc_search_analytics['rows'] and will result in a KeyError exception. The chances of your report having an exact multiple of the rows you're requesting – in our case, exactly 25,000, 50,000, 75,000, etc. – are exceedingly low, but why risk the error when we know we can write around it?
In our WHILE loop, we’re going to replace the line setting the DataFrame with a try-except clause.
try:
    all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])
except KeyError:
    break
Now, if we have, say, exactly 50,000 rows, on the third run of this loop we'll encounter the 'rows'-less request and break the loop instead of erroring out.
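If you'd rather not use a try-except at all, an equivalent alternative (just a sketch, not what the full code below uses) is to check for the 'rows' key before touching it:

# Alternative: bail out explicitly when the response has no 'rows' key
if 'rows' not in gsc_search_analytics:
    break
all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])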
Since this is where our script ends, erroring out isn't the worst thing in the world, but it's still not ideal. You definitely don't want to be erroring out if you're planning on downloading the CSV, pushing the data to Google Sheets, or creating a visualization. Always aim for errorless code!
And that’s it! You are now pulling all of the rows of data from the Google Search Console API Search Analytics request. We went through a little extra exploration to explain the why behind the code, so the full code is at the bottom, like always.
Small Caveat
Please be aware that WHILE loops can be dangerous. I’ve accidentally locked my computer up by running an unchecked WHILE loop. No attempt to break the loop manually worked. I heard my fans spin up and my clock never moved a minute. I had to hard-restart my computer. No fun.
What we wrote won’t do this. However, even as a checked WHILE loop, if you’re dealing with a lot of data in your Google Search Console, you could eat through your API usage limits.
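If that worries you, one option is to add a simple guard so the loop can never run more than a set number of times. The cap below is an arbitrary number I picked for the sketch, not anything from the API documentation:

max_requests = 40  # arbitrary safety cap; 40 requests x 25,000 rows = 1,000,000 rows
requests_made = 0
while requests_made < max_requests:
    sa_request['startRow'] = start_row
    gsc_search_analytics = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                                            body=sa_request).execute()
    requests_made += 1
    try:
        all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])
    except KeyError:
        break
    if len(gsc_search_analytics['rows']) < 25000:
        break
    start_row += 25000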
Full Code
# Import Modules
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
import pandas as pd
# Define function to get authorization
def gsc_auth(scopes):
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', scopes)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', scopes)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    service = build('searchconsole', 'v1', credentials=creds)
    return service
# Authenticate API
scopes = ['https://www.googleapis.com/auth/webmasters']
service = gsc_auth(scopes)
# Assign Start Row
start_row = 0
# Create empty DataFrame to combine data into
all_search_analytics_df = pd.DataFrame()
# Build Request Body
sa_request = {
    "startDate": "2022-01-01",
    "endDate": "2022-03-15",
    "dimensions": [
        "QUERY",
        "PAGE"
    ],
    "rowLimit": 25000,
}
# Loop over requests until all rows are pulled into DataFrame
while True:
    sa_request['startRow'] = start_row
    gsc_search_analytics = service.searchanalytics().query(siteUrl='sc-domain:shortautomaton.com',
                                                            body=sa_request).execute()
    try:
        all_search_analytics_df = pd.concat([all_search_analytics_df, pd.DataFrame(gsc_search_analytics['rows'])])
    except KeyError:
        break
    if len(gsc_search_analytics['rows']) < 25000:
        break
    start_row += 25000
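From here, you can do whatever you like with the combined DataFrame. As one example (the filename is just a placeholder), you could save it to a CSV:

# Save the combined report; index=False keeps the DataFrame index out of the file
all_search_analytics_df.to_csv('gsc_search_analytics.csv', index=False)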