In a previous article, we walked through authenticating the Google Analytics (Universal Analytics) API using Python. We will pick up where we left off and set up our code to run our first report from Google Analytics!
Requirements
A Google Analytics API Connection. I’ll be calling the function we wrote to build the service and run the report.
pip install pandas
Basic Google Analytics API batchGet Report
Since you’re here already, you probably have an idea of the benefits of pulling your data via the Google Analytics API as opposed to downloading from the interface. One of the big ones for me is the ability to pull a lot of data quickly at scale. In the Google Analytics interface, you can view and download up to 5,000 rows of a report at a time. The API can pull up to 100,000 rows at a time, which, to me, makes this a necessary skill when working on enterprise-level websites.
Let’s get started!
Additional Imports
Before we get started with the report, there are two modules I recommend importing at the top of your code to help us with parsing the data.
import pandas as pd
from collections import defaultdict
Request Body Setup
The bulk of the work when setting up the report comes in building the request body. This is where you’ll set the date ranges, dimensions, and metrics you’ll be pulling. You can also set the number of rows you want to pull and the page you want to start on (which is handy if you have more than 100,000 rows of data and have to run the same report multiple times). I recommend building this request body separately and assigning it to a variable body so we don’t have to cram the entire dictionary construction into the service function we’ll be calling.
It’s also very helpful to build the request in the Google Analytics batchGet method documentation page. This will help you make sure your syntax is accurate and you’re building the request properly. You’ll primarily be referencing the ReportRequest request body, and they offer a pop-over to build the body and try it out without you having to put it into your Python script. It will give you a warning if your request has errors and try to tell you what the errors might be. Please be aware that any time you run a report from the documentation page, it will count toward your daily limit.
Because we are running a batchGet, the first key we add to our request is reportRequests, which will house all of our report requests. Everything else mentioned here will be nested under it. As you might have figured out from the name, you can run multiple reports with a single request. We won’t cover that in this article, though.
The first item in the reportRequests list/array is a dictionary where a single report request will be given all of its attributes in key:value pairs. We’ll be referring to this as a ReportRequest.
The first key in the ReportRequest is your viewId. You can find this in the admin section of your Google Analytics account. Navigate to your View settings and you’ll find the number. Note that this is not the Property ID that correlates with the GA code you put on your site.
The second item we’ll add is dateRanges, which accepts a list. Each item in the list will consist of a startDate and an endDate. Both are strings, not date or datetime objects, and they use ISO 8601 date formatting; in simple terms, YYYY-MM-DD format.
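Since startDate and endDate are plain strings, one handy pattern is to build them from date objects so you never hand-type the format. A minimal sketch (the dates here are just illustrative):

```python
from datetime import date, timedelta

# Build the ISO 8601 (YYYY-MM-DD) strings from date objects
end_date = date(2022, 3, 31)            # e.g. date.today() - timedelta(days=1)
start_date = end_date - timedelta(days=30)

date_ranges = [
    {"startDate": start_date.isoformat(), "endDate": end_date.isoformat()}
]
print(date_ranges)
# [{'startDate': '2022-03-01', 'endDate': '2022-03-31'}]
```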
Next, we are going to add our dimensions. These also use a list, and each item will consist of a dictionary with a name key. You can find all of the accepted dimension names in the Google Analytics UA Dimensions & Metrics Explorer. We’re going to pull a standard source / medium for our dimension.
Naturally, we need metrics to go with our dimensions. Again, these use a list, and each item will consist of a dictionary with an expression key. We’re going to pull sessions. Again, you can find all of the accepted metric names in the Google Analytics UA Dimensions & Metrics Explorer.
This is the general gist of the report body, but there are two more items I’m going to add. The first is pageSize, which is how many rows we are asking for. I pretty much always set this to the highest possible, 100000. The higher you go, the longer it can take to get a response, but I like to pull an exhaustive set of data. Speaking of taking longer, I also set samplingLevel to LARGE. You’ll get more accurate data this way, but it will take longer – and that’s not to say you’ll get unsampled data either. There are three options for samplingLevel: LARGE, SMALL, and DEFAULT. If you don’t select one, DEFAULT is used.
body = {
    "reportRequests": [
        {
            "viewId": "262445554",
            "dateRanges": [
                {
                    "startDate": "2022-03-01",
                    "endDate": "2022-03-31"
                }
            ],
            "dimensions": [
                {
                    "name": "ga:sourceMedium"
                }
            ],
            "metrics": [
                {
                    "expression": "ga:sessions"
                }
            ],
            "pageSize": 100000,
            "samplingLevel": "LARGE"
        }
    ]
}
Make API Request
Now that we have the request body built, we can make the request and save the response to a variable.
analytics = ga_auth(scopes)
response = analytics.reports().batchGet(body=body).execute()
Extracting Google Analytics Data From Response
We’re going to get our hands dirty and really dig into the response to get an understanding of what we’re seeing. The response is a dictionary with only one key, reports, and the value of reports is a list of the report(s) you’ve requested in your batch request. In our case, we have a list with only one report in it. That report is also a dictionary.
>>> response['reports'][0].keys()
dict_keys(['columnHeader', 'data'])
The columnHeader key has a value equal to the dimensions and metrics we asked for. We don’t really have to do anything with this. We also have a data key, which is where our report data lives. There’s another key that could appear, which we likely won’t see in a report this small, called nextPageToken. The nextPageToken will return a string number indicating which row to start on to pull the next set of rows. I’ll get into this more in another article. For now, let’s dive into the data key.
>>> response['reports'][0]['data'].keys()
dict_keys(['rows', 'totals', 'rowCount', 'minimums', 'maximums', 'isDataGolden'])
Yours might look a little bit different, but you should see something similar to this.
I’m going to skip over rows for a second and talk about the rest. The totals key will return the row totals for each metric we asked for. The rowCount will tell us how many rows are in the report (remember, we asked for up to 100,000). The minimums and maximums will be the minimum and maximum values for each metric we asked for.
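To make those shapes concrete, here is a trimmed, hypothetical data block and how you’d read the summary fields (the values are made up for illustration):

```python
# Trimmed, made-up 'data' block mirroring the keys discussed above
data = {
    "rows": [{"dimensions": ["(direct) / (none)"], "metrics": [{"values": ["17151"]}]}],
    "totals": [{"values": ["17151"]}],    # one total per metric requested
    "rowCount": 1,
    "minimums": [{"values": ["17151"]}],
    "maximums": [{"values": ["17151"]}],
    "isDataGolden": True,
}

print(data["rowCount"])             # number of rows in the report
print(data["totals"][0]["values"])  # ['17151']
```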
The isDataGolden key will either be True or False. If it is True, you can pull this report again another time and the data will be the same. If it is False, the data could change the next time you pull it. This might happen if you are pulling data from today; since the day isn’t over yet, it won’t be golden. I always recommend pulling up to the current date and never into the current date.
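If you store or cache pulled reports, a small guard on this flag can save you from persisting numbers that will change. A minimal sketch (should_cache is a hypothetical helper, not part of the API client):

```python
def should_cache(report):
    """Only treat a report as final when Google marks the data golden."""
    return report.get("data", {}).get("isDataGolden", False)

finished = {"data": {"isDataGolden": True}}
in_progress = {"data": {"isDataGolden": False}}  # e.g. a range including today

print(should_cache(finished))     # True
print(should_cache(in_progress))  # False
```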
There are two other keys I’ll mention that I don’t have in my response. The first is samplesReadCounts and the second is samplingSpaceSizes. These will only appear if your report is sampled. As you begin pulling more dimensions, you are more likely to encounter sampling. I’ll cover this another time.
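If those keys do show up, you can compare them to estimate how heavily sampled the report is. A hedged sketch with made-up numbers (both fields come back as string numbers, one entry per date range in the request):

```python
# Hypothetical sampled 'data' block; values are illustrative only
data = {
    "samplesReadCounts": ["497724"],
    "samplingSpaceSizes": ["15328013"],
}

if "samplesReadCounts" in data:
    read = int(data["samplesReadCounts"][0])    # sessions actually read
    space = int(data["samplingSpaceSizes"][0])  # sessions in the sampling space
    print(f"Report is based on {read / space:.1%} of sessions")
    # Report is based on 3.2% of sessions
```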
Finally, we’re back to the rows key. The rows key has all of our report data in a list. As the name might indicate, the number of items in the list should match the rowCount we saw earlier. Each item in the list is a standalone dictionary with two keys: dimensions and metrics. Each of those keys holds a list of the values associated with each dimension and metric we asked for. In our case, we only asked for one dimension and one metric, source / medium and sessions, so a single row might look something like this:
{'dimensions': ['(direct) / (none)'], 'metrics': [{'values': ['17151']}]}
Hopefully that’s not too confusing, but if it is, don’t sweat it; we’re about to write a loop to parse all of our data into a single dictionary that can be converted to a Pandas DataFrame.
First, to begin the parse, we need a place to put it.
report_data = defaultdict(list)
With a standard dictionary, appending to a key that doesn’t exist yet raises a KeyError, so I like to use defaultdict from the built-in collections library with list as its factory. The great part about defaultdict is that if you try to append data to a key that does not yet exist, the key is created with an empty list automatically, without you having to initialize it before appending.
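Here’s a quick demonstration of that behavior, using the same key we’ll append to in the loop:

```python
from collections import defaultdict

report_data = defaultdict(list)

# Appending to a key that doesn't exist yet silently creates it as an
# empty list first -- no KeyError, no manual initialization
report_data['source / medium'].append('(direct) / (none)')
report_data['source / medium'].append('google / organic')

print(report_data['source / medium'])
# ['(direct) / (none)', 'google / organic']
```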
for report in response.get('reports', []):
    rows = report.get('data', {}).get('rows', [])
    for row in rows:
        report_data['source / medium'].append(row.get('dimensions', [])[0])
        for values in row.get('metrics', []):
            all_values = values.get('values', [])
            report_data['sessions'].append(all_values[0])

report_df = pd.DataFrame(report_data)
I have already kind of explained what we’re doing with each loop by going through the response, but I’ll touch on one thing I had a question about when I first learned this: why are we using .get() on the dictionaries instead of just calling the key like dict['key']? The answer is to reduce the risk of errors. With the .get() method, we can set a fallback – typically an empty dictionary or an empty list. If we called dict['key'] and it didn’t exist, we’d raise a KeyError exception and our code would stop.
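You can see the difference with a tiny, self-contained example (the row dict mimics one item from rows):

```python
row = {'dimensions': ['(direct) / (none)']}   # note: no 'metrics' key

# .get() hands back the fallback instead of raising
print(row.get('dimensions', []))  # ['(direct) / (none)']
print(row.get('metrics', []))     # [] -- a loop over this simply runs zero times

# The bracket form stops the script instead
try:
    row['metrics']
except KeyError:
    print('KeyError raised')
```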
Finally, after we’re done looping through our response, we can neatly put our report_data dictionary into a Pandas DataFrame.
Tips to Scale This Script for Easy Reuse
Everything we went over so far works fine for this report. To be honest, though, what we wrote is not scalable to other reports. What if we want a different dimension? We need to update the dimension in several places. What if we want different metrics? We have to update the metrics in several places. What if we want more dimensions and metrics? We can request up to 9 dimensions and 10 metrics per request!
Set Your Dimensions & Metrics to Lists Prior to Building Request
I find it’s easier to keep track of my dimensions and metrics by putting them in a list, especially because they have to maintain syntactic accuracy in order to not get an error from the API. You can also put your viewIds in a list, though I prefer putting those in a dictionary so I have some sort of naming convention instead of just the numbers.
# Set Request Parameters
views = {'Short Automaton': '262445554'}
dimensions = ['ga:sourceMedium']
metrics = ['ga:sessions']
Use List Comprehension to Build Request Dimensions & Metrics
Using the first tip, we can really clean up our code a lot by using list comprehensions. Your request body can get really long if you are pulling several dimensions and/or several metrics, and since the request body uses an array-like format for these, it is easy to produce these at scale! Please note that the request body below is not complete, which is intentional to show you just this tip.
As a bonus (mostly because I don’t feel it warrants a tip of its own), I’m also showing how you can use the views dictionary for the viewId field.
body = {
    "reportRequests": [
        {
            "viewId": views['Short Automaton'],
            "dimensions": [{'name': dimension} for dimension in dimensions],
            "metrics": [{'expression': metric} for metric in metrics]
        }
    ]
}
Use enumerate() to Extract Data Into Report Data Dictionary
Again, we’re going to leverage the fact that we’ve put our dimensions and metrics into lists when extracting our data. Previously, we explicitly called the keys we wanted to append values to, and we also explicitly called the position of the response. That’s not very scalable. I want to be able to add dimensions and metrics to those lists without changing another thing in my code, and we can achieve that by tweaking our extraction loop slightly using enumerate(dimensions) and enumerate(metrics).
If you’re somewhat new to Python, you might not have seen enumerate() before. The way we use it, we apply it to a list and it returns each item paired with its position as a tuple. For example, if our list was ['apple', 'ball', 'cat'], enumerate(['apple', 'ball', 'cat']) would yield (0, 'apple'), (1, 'ball'), (2, 'cat'). That’s a simple explanation to get the point across (technically it returns a lazy iterator rather than a list), but it’s a good way to think about it.
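You can verify that mental model in a couple of lines, including the unpacking form we’ll use in the loop (ga:landingPagePath is added here just to illustrate a second dimension):

```python
dimensions = ['ga:sourceMedium', 'ga:landingPagePath']

# Unpack each (index, item) pair the way the extraction loop does
for i, key in enumerate(dimensions):
    print(i, key)
# 0 ga:sourceMedium
# 1 ga:landingPagePath

print(list(enumerate(['apple', 'ball', 'cat'])))
# [(0, 'apple'), (1, 'ball'), (2, 'cat')]
```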
Because the data is returned in the same order of dimensions and metrics as we asked for them, we can create a for loop on the enumeration of both our dimensions list and our metrics list. We unpack the enumeration into an index (I use i in this example) and a key. With these unpacked, we can append the correct dimension or metric from the response to the correct key in our report dictionary. You can see the changes in the dimensions and metrics loops below.
for report in response.get('reports', []):
    rows = report.get('data', {}).get('rows', [])
    for row in rows:
        for i, key in enumerate(dimensions):
            report_data[key].append(row.get('dimensions', [])[i])  # Get dimensions
        for values in row.get('metrics', []):
            all_values = values.get('values', [])  # Get metric values
            for i, key in enumerate(metrics):
                report_data[key].append(all_values[i])
Google Analytics (UA) batchGet Full Code Example
# Import modules
import pandas as pd
from collections import defaultdict

# Authenticate & Build Service
analytics = ga_auth(scopes)

# Set Request Parameters
views = {'Short Automaton': '262445554'}
dimensions = ['ga:sourceMedium']
metrics = ['ga:sessions']

# Build request body
body = {
    "reportRequests": [
        {
            "viewId": views['Short Automaton'],
            "dateRanges": [
                {
                    "startDate": "2022-03-01",
                    "endDate": "2022-03-31"
                }
            ],
            "dimensions": [{'name': dimension} for dimension in dimensions],
            "metrics": [{'expression': metric} for metric in metrics],
            "pageSize": 100000,
            "samplingLevel": "LARGE"
        }
    ]
}

# Make Request
response = analytics.reports().batchGet(body=body).execute()

# Parse Request
report_data = defaultdict(list)

for report in response.get('reports', []):
    rows = report.get('data', {}).get('rows', [])
    for row in rows:
        for i, key in enumerate(dimensions):
            report_data[key].append(row.get('dimensions', [])[i])  # Get dimensions
        for values in row.get('metrics', []):
            all_values = values.get('values', [])  # Get metric values
            for i, key in enumerate(metrics):
                report_data[key].append(all_values[i])

report_df = pd.DataFrame(report_data)
Eric is a Python SEO with a passion for data. He uses Python to automate tasks and analyze large data sets. He is often the go-to for spreadsheet challenges, too. He sees every challenge as an opportunity to learn.