Fredrik Erlandsson

PhD in Computer Science, data scientist at Consid AB.

Extracting data from Google Analytics 360 as an Azure Data Factory job

2020-11-01

Sometimes it is not possible to query Google BigQuery to extract Google Analytics 360 data. There is an API for accessing GA data directly, but it is sparsely documented.

What we did here was to create an Azure Function (in Python) that reads a CSV file from our Azure Blob Storage and makes requests to https://www.googleapis.com/analytics/v3/data/ga.

The Analytics API accepts the following variables: "ids", "start-date", "end-date", "metrics", "dimensions", "filters", and "segment". Due to limited experience with GBQ (and the fact that our CSV already had the metrics, dimensions, filters, and segment filled in), we decided to go the API route. There are, however, a few limitations with this API.
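
For reference, each such row ends up as a plain GET request with those variables as query parameters, roughly like this (the account id, dates, metric, and dimension below are made-up illustration values, not from our actual setup):

import requests

# Illustrative values only; in the actual function these come from one row of the file.
payload = {
    "access_token": "<ACCESS TOKEN>",
    "ids": "ga:12345678",
    "start-date": "2020-10-01",
    "end-date": "2020-10-31",
    "metrics": "ga:sessions",
    "dimensions": "ga:date",
}
res = requests.get("https://www.googleapis.com/analytics/v3/data/ga", params=payload)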

Google Analytics API limits

Google limits the rate at which data can be accessed (at least as of November 2020). Therefore we had to build some workarounds to handle the different issues coming back from the API.

502 Bad Gateway on Python Azure Function

An Azure Function is only allowed a limited execution time (time as in wall-clock seconds, not CPU cycles). If the Azure Function cannot complete the request within this time, it returns a 502 Bad Gateway page. To handle this we decided not to retry inside the function when the Google Analytics API fails; instead we return that part as empty and let ADF do multiple retries against the Azure Function.

Code

The following code was used in the Azure Function.

import logging
import json

import azure.functions as func
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd
import requests

KV_URL = "https://<KV>.vault.azure.net/"
KV_ANALYTICS_SECRET = "<SECRET>"
KV_BLOB_SECRET = "<BLOB SECRET>"
SCOPE = "https://www.googleapis.com/auth/analytics.readonly"

def read_blob_xls_file(connection_string, container, blob):
    # Download the Excel sheet with one GA query definition per row
    # and load it into a DataFrame.
    con = ContainerClient.from_connection_string(
        conn_str=connection_string, container_name=container
    )
    df = pd.read_excel(con.get_blob_client(blob).download_blob().readall())
    return df

def write_df_to_blob(df, connection_string, container, blob_name):
    # Serialize the result DataFrame as CSV and upload it to blob storage.
    ContainerClient.from_connection_string(
        conn_str=connection_string, container_name=container
    ).get_blob_client(blob_name).upload_blob(df.to_csv(index=False))

def ga_lookup(df, token):
    # Run one Core Reporting API (v3) request per row in the sheet and
    # collect the returned rows into one flat result list.
    url = "https://www.googleapis.com/analytics/v3/data/ga"

    result = []
    variables = (
        "ids",
        "start-date",
        "end-date",
        "metrics",
        "dimensions",
        "filters",
        "segment",
    )

    for _, row in df.iterrows():
        payload = {"access_token": token}
        for var in variables:
            # Skip empty cells (pandas reads them as NaN) so they are not
            # sent as literal "nan" query parameters.
            if pd.notna(row[var]) and row[var]:
                payload[var] = row[var]
        res = requests.get(url, params=payload)
        if not res:
            # requests marks 4xx/5xx responses as falsy: log and move on,
            # leaving this part empty. ADF retries the whole function call.
            logging.error(payload)
            logging.error(res.text)
            continue
        for res_row in res.json().get("rows", []):
            # Each row holds the dimension value(s) followed by the metric
            # value(s); here one date dimension and one metric are expected.
            result.append(
                {
                    "MetricId": row["componentName"],
                    "dimensions": row["dimensions"],
                    "metrics": row["metrics"],
                    "gAdate": res_row[0],
                    "amount": res_row[1],
                }
            )

    return pd.DataFrame(result)

def get_ga_token(key_dict: str) -> str:
    # Exchange the service-account key (a JSON string kept in Key Vault)
    # for an Analytics API access token.
    return ServiceAccountCredentials.from_json_keyfile_dict(
        json.loads(key_dict),
        scopes=SCOPE,
    ).get_access_token().access_token

def get_kv_secret(kv_url: str, secret: str) -> str:
    # Fetch a secret from Key Vault using the function's managed identity.
    return SecretClient(
        vault_url=kv_url,
        credential=ManagedIdentityCredential(),
    ).get_secret(secret).value
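
The snippet above leaves out the function's entry point. A minimal sketch of how the helpers could be wired together in an HTTP-triggered function might look like the following; the request parameters, default blob names, and response format here are assumptions for illustration, not necessarily the original setup.

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Hypothetical parameters passed by the caller (e.g. the ADF activity):
    # which container to use and which blobs to read from and write to.
    container = req.params.get("container", "ga-export")
    in_blob = req.params.get("input", "ga-queries.xlsx")
    out_blob = req.params.get("output", "ga-result.csv")

    # Fetch the storage connection string and the GA service-account key
    # from Key Vault via the function's managed identity.
    blob_conn = get_kv_secret(KV_URL, KV_BLOB_SECRET)
    ga_key = get_kv_secret(KV_URL, KV_ANALYTICS_SECRET)

    # Read the query definitions, run them against the Analytics API, and
    # write whatever came back (possibly with empty parts) to blob storage.
    df = read_blob_xls_file(blob_conn, container, in_blob)
    result = ga_lookup(df, get_ga_token(ga_key))
    write_df_to_blob(result, blob_conn, container, out_blob)

    return func.HttpResponse(
        json.dumps({"rows": len(result)}),
        mimetype="application/json",
    )

ADF then calls this endpoint from its Azure Function activity and simply retries the call if a 502 comes back, as described above.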
  
tags: adf, ga, google analytics, azure, python, pandas