ECON250 - Big Data Analytics | Week 6
2026-03-11
thelook schema
orders — order headerorder_items — one row per itemproducts — product detailsusers — customer infoorder_row
├── order_id, created_at, status
├── customer (STRUCT)
├── items (ARRAY<STRUCT>)
└── totals (STRUCT)
No JOINs. Everything about the order is self-contained. If you know Python or JavaScript — a STRUCT is an object/dict, an ARRAY is a list.
Named fields inside a column
Table with a device STRUCT column
| user_id | device.category | device.browser |
|---|---|---|
| u1 | desktop | Chrome |
| u2 | mobile | Safari |
| u3 | desktop | Firefox |
Flat: 4 separate columns
Same data, different shape. Same number of rows — STRUCTs don’t change the grain. The flat row has 4 keys; the packed row has 2 keys, one of which is a nested object.
Multiple values in one cell
Flat: one row per item (5 rows)
| order_id | product_name |
|---|---|
| 1001 | Google Tee |
| 1001 | Google Mug |
| 1002 | Nest Camera |
| 1003 | Android Sticker |
| 1003 | Google Pen |
5 rows → 3 rows. No information lost — the items are packed into arrays.
The left shape is what thelook gives you. The right shape is what GA4 gives you.
ARRAY_AGGInput — 5 rows
| order_id | product_name |
|---|---|
| 1001 | Google Tee |
| 1001 | Google Mug |
| 1002 | Nest Camera |
| 1003 | Android Sticker |
| 1003 | Google Pen |
ARRAY_AGG works like COUNT or SUM with GROUP BY — but instead of computing one number, it collects all values into a list.
UNNESTInput — 3 rows (nested)
Output — 5 rows (flat)
| order_id | product_name |
|---|---|
| 1001 | Google Tee |
| 1001 | Google Mug |
| 1002 | Nest Camera |
| 1003 | Android Sticker |
| 1003 | Google Pen |
3 rows → 5 rows. Each array element becomes its own row.
Each parent row is paired with every element of its array:
Before — 3 rows
After UNNEST — 5 rows
| order_id | revenue | product |
|---|---|---|
| 1001 | 30 | Tee |
| 1001 | 30 | Mug |
| 1002 | 15 | Camera |
| 1003 | 25 | Sticker |
| 1003 | 25 | Pen |
SUM(revenue) = 125, but real total is 70. The revenue gets duplicated for each array element.
The grain changes. Before: one row = one order. After: one row = one item. If you aggregate order-level columns after UNNEST, you double-count.
Flat rows ──ARRAY_AGG──→ Nested array ──UNNEST──→ Flat rows
(5 rows) (3 rows) (5 rows)
ARRAY_AGG packs rows into arrays. UNNEST unpacks arrays into rows.
For the group project, you mostly go nested → flat (UNNEST). GA4 data arrives nested; analysis needs it flat.
| Function | What it does | Example use |
|---|---|---|
ARRAY_LENGTH(arr) |
Count elements | Items per order — without UNNEST |
arr[OFFSET(0)] |
First element (0-based) | First item in an order |
arr[ORDINAL(1)] |
First element (1-based) | Same, different indexing |
ARRAY_TO_STRING(arr, ', ') |
Join into a string | “Tee, Mug, Pen” |
value IN UNNEST(arr) |
Check membership | Does this order contain “Google Tee”? |
ARRAY_LENGTH and IN UNNEST let you query arrays without flattening them.
Arrays where each element has multiple fields
ARRAY<STRING> — a list of values
One value per element.
GA4’s items array stores name, price, category, quantity per item. That’s ARRAY<STRUCT> — the more common format in production data.
What’s stored (1 row)
UNNEST(items) AS item gives each struct element a name. Then item.field accesses fields with dot notation.
ARRAY_AGG(STRUCT(...))STRUCT() inside ARRAY_AGG() packs multiple columns per row into a single array element. The result is structurally identical to GA4’s items[] array.
GA4’s event_params
items vs. event_params — different arrays, different accessitems — a list of products
Each element is a different product. UNNEST gives you one row per product.
event_params — key-value pairs
Each element is a different property of the same event. You usually want to extract specific keys as columns, not UNNEST the whole thing.
Each element has a key and a value struct with four typed slots (only one populated per key).
SELECT
event_date,
event_name,
(SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'page_location') AS page_location,
(SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'ga_session_id') AS ga_session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`
LIMIT 10Correlated subquery — extract specific keys
Result: same rows as input, specific keys as columns.
| Approach | Grain | Use when |
|---|---|---|
| Correlated subquery | 1 row per event | Extracting known keys as columns (90% of project work) |
| UNNEST in FROM | 1 row per key per event | Exploring what keys exist, aggregating across keys |
Run this once to discover the types:
SELECT
key,
COUNT(*) AS occurrences,
COUNTIF(value.string_value IS NOT NULL) AS has_string,
COUNTIF(value.int_value IS NOT NULL) AS has_int,
COUNTIF(value.float_value IS NOT NULL) AS has_float,
COUNTIF(value.double_value IS NOT NULL) AS has_double
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`,
UNNEST(event_params)
GROUP BY key
ORDER BY occurrences DESCThis tells you: page_location → string_value, ga_session_id → int_value, engagement_time_msec → int_value, and so on.
{
"event_name": "page_view",
"event_date": "20201101",
"user_pseudo_id": "abc123",
"device": {"category": "mobile", "browser": "Safari"},
"geo": {"country": "United States", "city": "New York"},
"traffic_source": {"medium": "organic", "source": "google"},
"items": [{"item_name": "Google Tee", "price_in_usd": 18.0}, ...],
"event_params": [
{"key": "page_location", "value": {"string_value": "https://shop.google.com/"}},
{"key": "ga_session_id", "value": {"int_value": 1234567}},
...
]
}Three access patterns: device.category (dot notation), UNNEST(items), correlated subquery for event_params.
SELECT
event_date,
event_name,
device.category AS device_type, -- STRUCT: dot notation
geo.country, -- STRUCT: dot notation
(SELECT value.string_value -- key-value bag:
FROM UNNEST(event_params) -- correlated subquery
WHERE key = 'page_location')
AS page_url,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = 'ga_session_id') AS session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`
LIMIT 20The group project requires a sessions_base view: one row per session.
Event-level (raw GA4)
| user | session_id | event_name |
|---|---|---|
| u1 | s1 | session_start |
| u1 | s1 | page_view |
| u1 | s1 | page_view |
| u1 | s1 | add_to_cart |
| u1 | s1 | purchase |
| u1 | s2 | session_start |
| u1 | s2 | page_view |
Session-level (sessions_base)
| user | session_id | events | pageviews | purchased |
|---|---|---|---|---|
| u1 | s1 | 5 | 2 | yes |
| u1 | s2 | 2 | 1 | no |
The transformation: extract session_id from event_params → GROUP BY user + session → aggregate metrics.
You start building this in tomorrow’s practice.
Not all nested data arrives as STRUCTs and ARRAYs. Sometimes a column is just a string that happens to contain valid JSON:
Native STRUCT — BigQuery knows the schema
The column has a defined type. BigQuery validates it at write time.
GA4 uses native types. But API logs, webhooks, and many exports store data as JSON strings.
JSON_VALUE extracts a scalar. JSON_QUERY extracts an object or array. JSON_QUERY_ARRAY extracts an array you can UNNEST.
The $ is the root. Dot notation navigates the path. Bracket notation ($.items[0]) for array positions.
| Data shape | Read | Build |
|---|---|---|
| STRUCT | column.field |
STRUCT(col1, col2) |
| ARRAY | UNNEST(column) in FROM |
ARRAY_AGG(value) |
| ARRAY<STRUCT> | UNNEST(column) AS x, then x.field |
ARRAY_AGG(STRUCT(col1, col2)) |
| Key-value array | (SELECT value FROM UNNEST(col) WHERE key = '...') |
— |
| JSON | JSON_VALUE(col, '$.path') |
— |
The Build column is what Assignment 5 covers. The Read column is what the group project requires.
Tomorrow’s practices:
sessions_baseComing up:
sessions_base + one analysis view
Kyiv School of Economics