What’s Inside the Row

ECON250 - Big Data Analytics | Week 6

Oleh Omelchenko

2026-03-11

How many JOINs for one order?


thelook schema

  • orders — order header
  • order_items — one row per item
  • products — product details
  • users — customer info
SELECT o.order_id, u.first_name,
  oi.sale_price, p.name AS product
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.id
JOIN users u ON o.user_id = u.id

Three JOINs. Every query that needs the full picture repeats them.

What if it were one row?

order_row
  ├── order_id, created_at, status
  ├── customer (STRUCT)
  ├── items (ARRAY<STRUCT>)
  └── totals (STRUCT)
{
  "order_id": 1001,
  "created_at": "2024-11-15",
  "status": "shipped",
  "customer": {"name": "Alice", "country": "UA"},
  "items": [
    {"product": "Tee", "price": 18},
    {"product": "Mug", "price": 12}
  ],
  "totals": {"num_items": 2, "revenue": 30}
}


No JOINs. Everything about the order is self-contained. If you know Python or JavaScript — a STRUCT is an object/dict, an ARRAY is a list.

STRUCTs

Named fields inside a column

Reading STRUCTs: dot notation

Table with a device STRUCT column

user_id device.category device.browser
u1 desktop Chrome
u2 mobile Safari
u3 desktop Firefox

Each row’s device column holds an object:

{"category": "desktop", "browser": "Chrome"}


Dot notation reads the fields — just like device["category"] in Python:

SELECT user_id,
  device.category, device.browser
FROM events

Building STRUCTs from flat columns


Flat: 4 separate columns

SELECT user_id, city, country, region
FROM users
{"user_id": "u1", "city": "Kyiv",
 "country": "UA", "region": "Kyivska"}

Packed: 1 STRUCT column

SELECT user_id,
  STRUCT(city, country, region) AS location
FROM users
{"user_id": "u1",
 "location": {"city": "Kyiv",
               "country": "UA",
               "region": "Kyivska"}}


Same data, different shape. Same number of rows — STRUCTs don’t change the grain. The flat row has 4 keys; the packed row has 2 keys, one of which is a nested object.

ARRAYs

Multiple values in one cell

Flat vs. nested: the same data, two shapes

Flat: one row per item (5 rows)

order_id product_name
1001 Google Tee
1001 Google Mug
1002 Nest Camera
1003 Android Sticker
1003 Google Pen

Nested: one row per order (3 rows)

order_id product_names
1001 [Google Tee, Google Mug]
1002 [Nest Camera]
1003 [Android Sticker, Google Pen]
{"order_id": 1001,
 "product_names": ["Google Tee", "Google Mug"]}


5 rows → 3 rows. No information lost — the items are packed into arrays.

The left shape is what thelook gives you. The right shape is what GA4 gives you.

Building arrays: ARRAY_AGG

SELECT
  order_id,
  ARRAY_AGG(product_name) AS product_names
FROM order_items
GROUP BY order_id


Input — 5 rows

order_id product_name
1001 Google Tee
1001 Google Mug
1002 Nest Camera
1003 Android Sticker
1003 Google Pen

Output — 3 rows

{"order_id": 1001,
 "product_names": ["Google Tee", "Google Mug"]}
{"order_id": 1002,
 "product_names": ["Nest Camera"]}
{"order_id": 1003,
 "product_names": ["Android Sticker", "Google Pen"]}


ARRAY_AGG works like COUNT or SUM with GROUP BY — but instead of computing one number, it collects all values into a list.

Flattening arrays: UNNEST

SELECT order_id, product_name
FROM orders_nested, UNNEST(product_names) AS product_name


Input — 3 rows (nested)

{"order_id": 1001,
 "product_names": ["Google Tee", "Google Mug"]}
{"order_id": 1002,
 "product_names": ["Nest Camera"]}
{"order_id": 1003,
 "product_names": ["Android Sticker", "Google Pen"]}

Output — 5 rows (flat)

order_id product_name
1001 Google Tee
1001 Google Mug
1002 Nest Camera
1003 Android Sticker
1003 Google Pen


3 rows → 5 rows. Each array element becomes its own row.

UNNEST creates a cross join

Each parent row is paired with every element of its array:

Before — 3 rows

{"order_id": 1001, "revenue": 30,
 "products": ["Tee", "Mug"]}
{"order_id": 1002, "revenue": 15,
 "products": ["Camera"]}
{"order_id": 1003, "revenue": 25,
 "products": ["Sticker", "Pen"]}

After UNNEST — 5 rows

order_id revenue product
1001 30 Tee
1001 30 Mug
1002 15 Camera
1003 25 Sticker
1003 25 Pen

SUM(revenue) = 125, but real total is 70. The revenue gets duplicated for each array element.


The grain changes. Before: one row = one order. After: one row = one item. If you aggregate order-level columns after UNNEST, you double-count.

ARRAY_AGG and UNNEST are inverses


Flat rows ──ARRAY_AGG──→ Nested array ──UNNEST──→ Flat rows
(5 rows)                 (3 rows)                 (5 rows)


ARRAY_AGG packs rows into arrays. UNNEST unpacks arrays into rows.

For the group project, you mostly go nested → flat (UNNEST). GA4 data arrives nested; analysis needs it flat.

Useful ARRAY functions

Function What it does Example use
ARRAY_LENGTH(arr) Count elements Items per order — without UNNEST
arr[OFFSET(0)] First element (0-based) First item in an order
arr[ORDINAL(1)] First element (1-based) Same, different indexing
ARRAY_TO_STRING(arr, ', ') Join into a string “Tee, Mug, Pen”
value IN UNNEST(arr) Check membership Does this order contain “Google Tee”?


ARRAY_LENGTH and IN UNNEST let you query arrays without flattening them.

ARRAY<STRUCT>

Arrays where each element has multiple fields

ARRAY<STRING> vs. ARRAY<STRUCT>

ARRAY<STRING> — a list of values

{"product_names":
  ["Google Tee", "Google Mug"]}

One value per element.

ARRAY<STRUCT> — a list of objects

{"items": [
  {"name": "Google Tee",
   "price": 18.0, "category": "Apparel"},
  {"name": "Google Mug",
   "price": 12.0, "category": "Drinkware"}
]}

Multiple fields per element.


GA4’s items array stores name, price, category, quantity per item. That’s ARRAY<STRUCT> — the more common format in production data.

Reading ARRAY<STRUCT>: UNNEST + dot notation

What’s stored (1 row)

{"order_id": 1001,
 "items": [
  {"item_name": "Google Tee",
   "price_in_usd": 18.00,
   "item_category": "Apparel"},
  {"item_name": "Google Mug",
   "price_in_usd": 12.00,
   "item_category": "Drinkware"}
]}

After UNNEST (2 rows)

SELECT order_id,
  item.item_name,
  item.price_in_usd,
  item.item_category
FROM orders_nested, UNNEST(items) AS item
order_id item_name price_in_usd item_category
1001 Google Tee 18.00 Apparel
1001 Google Mug 12.00 Drinkware


UNNEST(items) AS item gives each struct element a name. Then item.field accesses fields with dot notation.

Building ARRAY<STRUCT>: ARRAY_AGG(STRUCT(...))

SELECT
  order_id,
  ARRAY_AGG(STRUCT(
    product_name,
    sale_price,
    category
  )) AS items
FROM order_items
JOIN products USING (product_id)
GROUP BY order_id

Result — one row per order

{"order_id": 1001,
 "items": [
  {"product_name": "Google Tee",
   "sale_price": 18.0,
   "category": "Apparel"},
  {"product_name": "Google Mug",
   "sale_price": 12.0,
   "category": "Drinkware"}
]}


STRUCT() inside ARRAY_AGG() packs multiple columns per row into a single array element. The result is structurally identical to GA4’s items[] array.

Key-Value Arrays

GA4’s event_params

items vs. event_params — different arrays, different access


items — a list of products

Each element is a different product. UNNEST gives you one row per product.

event_params — key-value pairs

Each element is a different property of the same event. You usually want to extract specific keys as columns, not UNNEST the whole thing.


"event_params": [
  {"key": "page_location",
   "value": {"string_value": "https://shop.google.com/"}},
  {"key": "ga_session_id",
   "value": {"int_value": 1234567}},
  {"key": "engagement_time_msec",
   "value": {"int_value": 5200}}
]


Each element has a key and a value struct with four typed slots (only one populated per key).

Extracting a parameter: correlated subquery

SELECT
  event_date,
  event_name,
  (SELECT value.string_value
   FROM UNNEST(event_params)
   WHERE key = 'page_location')     AS page_location,
  (SELECT value.int_value
   FROM UNNEST(event_params)
   WHERE key = 'ga_session_id')     AS ga_session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`
LIMIT 10

Two ways to read event_params


Correlated subquery — extract specific keys

SELECT event_name,
  (SELECT value.string_value
   FROM UNNEST(event_params)
   WHERE key = 'page_location')
     AS page_location,
  (SELECT value.int_value
   FROM UNNEST(event_params)
   WHERE key = 'ga_session_id')
     AS session_id
FROM events

Result: same rows as input, specific keys as columns.

UNNEST in FROM — expand all keys

SELECT event_name,
  ep.key,
  ep.value.string_value,
  ep.value.int_value
FROM events, UNNEST(event_params) AS ep

Result: rows multiplied — one row per key per event.


Approach Grain Use when
Correlated subquery 1 row per event Extracting known keys as columns (90% of project work)
UNNEST in FROM 1 row per key per event Exploring what keys exist, aggregating across keys

How to know which value type to use


Run this once to discover the types:

SELECT
  key,
  COUNT(*) AS occurrences,
  COUNTIF(value.string_value IS NOT NULL) AS has_string,
  COUNTIF(value.int_value IS NOT NULL) AS has_int,
  COUNTIF(value.float_value IS NOT NULL) AS has_float,
  COUNTIF(value.double_value IS NOT NULL) AS has_double
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`,
     UNNEST(event_params)
GROUP BY key
ORDER BY occurrences DESC


This tells you: page_location → string_value, ga_session_id → int_value, engagement_time_msec → int_value, and so on.

The GA4 Dataset

GA4 event row structure

{
  "event_name": "page_view",
  "event_date": "20201101",
  "user_pseudo_id": "abc123",

  "device": {"category": "mobile", "browser": "Safari"},
  "geo": {"country": "United States", "city": "New York"},
  "traffic_source": {"medium": "organic", "source": "google"},

  "items": [{"item_name": "Google Tee", "price_in_usd": 18.0}, ...],

  "event_params": [
    {"key": "page_location", "value": {"string_value": "https://shop.google.com/"}},
    {"key": "ga_session_id", "value": {"int_value": 1234567}},
    ...
  ]
}

Three access patterns: device.category (dot notation), UNNEST(items), correlated subquery for event_params.

First look at the data

SELECT
  event_date,
  event_name,
  device.category AS device_type,           -- STRUCT: dot notation
  geo.country,                              -- STRUCT: dot notation
  (SELECT value.string_value                -- key-value bag:
   FROM UNNEST(event_params)                --   correlated subquery
   WHERE key = 'page_location')
     AS page_url,
  (SELECT value.int_value FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`
LIMIT 20

From events to sessions

The group project requires a sessions_base view: one row per session.


Event-level (raw GA4)

user session_id event_name
u1 s1 session_start
u1 s1 page_view
u1 s1 page_view
u1 s1 add_to_cart
u1 s1 purchase
u1 s2 session_start
u1 s2 page_view

Session-level (sessions_base)

user session_id events pageviews purchased
u1 s1 5 2 yes
u1 s2 2 1 no


The transformation: extract session_id from event_params → GROUP BY user + session → aggregate metrics.

You start building this in tomorrow’s practice.

JSON

JSON vs. native nested types


Not all nested data arrives as STRUCTs and ARRAYs. Sometimes a column is just a string that happens to contain valid JSON:

Native STRUCT — BigQuery knows the schema

-- dot notation works
SELECT device.category FROM events

The column has a defined type. BigQuery validates it at write time.

JSON string — BigQuery sees a plain STRING

-- the column is just text
SELECT payload FROM api_logs
-- '{"user": {"name": "Alice"}, "total": 42}'

No schema, no dot notation. You parse it at query time with special functions.


GA4 uses native types. But API logs, webhooks, and many exports store data as JSON strings.

JSON extraction — syntax reference

-- If payload is a JSON or STRING column:
SELECT
  JSON_VALUE(payload, '$.user.name') AS user_name,
  JSON_VALUE(payload, '$.order.total') AS order_total,
  JSON_QUERY_ARRAY(payload, '$.items') AS items_array
FROM api_logs


JSON_VALUE extracts a scalar. JSON_QUERY extracts an object or array. JSON_QUERY_ARRAY extracts an array you can UNNEST.

The $ is the root. Dot notation navigates the path. Bracket notation ($.items[0]) for array positions.

Wrap-up

Reference card


Data shape Read Build
STRUCT column.field STRUCT(col1, col2)
ARRAY UNNEST(column) in FROM ARRAY_AGG(value)
ARRAY<STRUCT> UNNEST(column) AS x, then x.field ARRAY_AGG(STRUCT(col1, col2))
Key-value array (SELECT value FROM UNNEST(col) WHERE key = '...')
JSON JSON_VALUE(col, '$.path')


The Build column is what Assignment 5 covers. The Read column is what the group project requires.

Next week and announcements


Tomorrow’s practices:

  • Practice 1: Build nested structures from thelook (STRUCT, ARRAY_AGG, UNNEST round-trip)
  • Practice 2: Explore the GA4 dataset, start building sessions_base


Coming up:

  • Group project spec is published — read it before tomorrow
  • Assignment 5 (nested structures on thelook) due this week
  • Week 7: checkpoint — working sessions_base + one analysis view