This is part of a series of blog posts that leads up to creating a page recommendation engine (i.e., "next steps" or "people who viewed this page also viewed...") using R and Google Analytics data. Here are the four posts in the series:
[Step 1:] Getting User-level Data.
The first thing you will need in order to create a page recommender is to collect user-level data. The Google Analytics API does not allow you to pull its natural keys for users or pageviews, so you have to collect one yourself in a custom dimension. NOTE: If you get more than 10,000 pageviews per day, you will have issues with sampling and/or API query limits, so you'll want to use a different strategy than the one I've used here. You might be able to use filters to limit the data, though, depending on how far above 10k you regularly get.
In any case, there are three good options for getting this data:
Use A Long, Specific Session Query.
This is a somewhat complicated method, but you can do it immediately without having to collect new data first. Basically, you use dimensionality to get around the limitations of Google's API and identify an individual session. You're able to request 7 dimensions in a single query. So, your goal is to create a query that isolates a unique user's session by combining various dimensions. For example, you might query: hour, minute, landing page, country, region, city, and page. The first 6 dimensions are session-scoped (i.e., session-level data), and the last one is hit-scoped (i.e., the pageview). So, each row will be a hit that also carries the session info. If two users happen to share the same values for all 6 session-scoped dimensions, their sessions will look like one session, so expect some conflicts.
Read more about this method here: GA Query User-Level Data Without Client ID Or User ID
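To make the idea concrete, here is a minimal sketch in base R of how the six session-scoped dimensions could be pasted into a pseudo session key once the query results are in a data frame. The `hits` data frame, its column names, and its values are made up for illustration; your own query results will have whatever column names your GA client library returns.

```r
# Hypothetical rows, shaped like the result of a GA query on the seven
# dimensions named above (six session-scoped, one hit-scoped).
hits <- data.frame(
  hour        = c("09", "09", "09", "14"),
  minute      = c("15", "15", "15", "02"),
  landingPage = c("/home", "/home", "/pricing", "/home"),
  country     = c("United States", "United States", "Canada", "United States"),
  region      = c("Ohio", "Ohio", "Ontario", "Ohio"),
  city        = c("Columbus", "Columbus", "Toronto", "Columbus"),
  page        = c("/home", "/features", "/pricing", "/home"),
  stringsAsFactors = FALSE
)

# The six dimensions that together stand in for a session ID.
session_dims <- c("hour", "minute", "landingPage", "country", "region", "city")

# Paste the six values of each row into one pseudo session key.
hits$session_key <- do.call(paste, c(hits[session_dims], sep = "|"))

# Group the hit-scoped pages under each pseudo session.
pages_by_session <- split(hits$page, hits$session_key)
print(pages_by_session)
```

Rows that share all six values end up under the same key, which is exactly where the conflicts mentioned above would show up: two different visitors with identical values would be indistinguishable here.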
Use Google Analytics Client ID.
This is an easy method because everyone will have the ability to do it. This ID is Google's identifier for the user's browser. So, with this method, you will be able to track users across sessions (in the same, cookied browser), but you won't be able to track cross-browser/device behavior without additional effort.
Read more about this method here: Get Google Analytics Client ID - Universal Analytics
Use A Unique User Key.
If you use marketing automation software (like Marketo or ClickDimensions), you'll likely be collecting user-level data (i.e., pageviews, email activity, etc.) in that system. The main benefit of this approach is that it can consolidate anonymous visitors when they identify themselves. So, if someone fills out a form or logs into your website, you can track that. If they change browsers and identify themselves again, you can combine those records into a single contact record and let Google know who they are, so Google Analytics can also consolidate users.
Read more about this method here: Marketo/Google Analytics Integration
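As a rough illustration of what that consolidation buys you, the base R sketch below folds several browser-level IDs into one contact. Everything in it is hypothetical: the `pageviews` and `contacts` data frames, the column names, and the ID values are invented for the example and do not reflect how Marketo or ClickDimensions actually export data.

```r
# Hypothetical pageviews keyed by a browser-level ID.
pageviews <- data.frame(
  client_id = c("111.aaa", "111.aaa", "222.bbb", "333.ccc"),
  page      = c("/home", "/pricing", "/docs", "/home"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping created when a visitor identifies themselves
# (form fill or login). Two browsers belong to the same contact.
contacts <- data.frame(
  client_id  = c("111.aaa", "222.bbb"),
  contact_id = c("C-001", "C-001"),
  stringsAsFactors = FALSE
)

# Left join: keep every pageview, attach a contact where one is known.
merged <- merge(pageviews, contacts, by = "client_id", all.x = TRUE)

# Visitors who never identified themselves keep their browser-level ID.
merged$user_key <- ifelse(is.na(merged$contact_id),
                          merged$client_id,
                          merged$contact_id)

# Pages viewed per consolidated user.
pages_by_user <- split(merged$page, merged$user_key)
print(pages_by_user)
```

In this made-up data, the pageviews from two browsers collapse into a single user, while the visitor who never identified themselves remains a separate, anonymous user.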
The goal of collecting this data is so you can pull individual pageviews for individual people. You'll put this data into