What’s Inside the Row

ECON250 - Big Data Analytics | Week 6

Oleh Omelchenko

2026-03-11

How many JOINs for one order?

thelook schema

orders — order header
order_items — one row per item
products — product details
users — customer info

SELECT o.order_id, u.first_name,
  oi.sale_price, p.name AS product
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.id
JOIN users u ON o.user_id = u.id

Three JOINs. Every query that needs the full picture repeats them.

What if it were one row?

order_row
  ├── order_id, created_at, status
  ├── customer (STRUCT)
  ├── items (ARRAY<STRUCT>)
  └── totals (STRUCT)

{
  "order_id": 1001,
  "created_at": "2024-11-15",
  "status": "shipped",
  "customer": {"name": "Alice", "country": "UA"},
  "items": [
    {"product": "Tee", "price": 18},
    {"product": "Mug", "price": 12}
  ],
  "totals": {"num_items": 2, "revenue": 30}
}

No JOINs. Everything about the order is self-contained. If you know Python or JavaScript — a STRUCT is an object/dict, an ARRAY is a list.

5 minutes

Walk through the structure. The JSON on the right is how you can think about what’s actually stored in the row. Customer info is a STRUCT — an object with named fields. Items are an ARRAY of STRUCTs — a list of objects. If students know Python: STRUCT = dict, ARRAY = list. This is not BigQuery’s JSON type — it’s native columnar storage, but the mental model is the same.

Ask: “Why would a system store data this way?”

Trade-offs:

Pro: No JOINs — faster reads, simpler queries
Pro: Scan efficiency — BigQuery only reads the columns you reference, even inside nested structures
Con: If a customer changes their name, every order row needs updating
Con: Harder to answer questions the schema wasn’t designed for

Analytical systems (BigQuery, GA4, Firebase) favor nested structures because they optimize for reads. Transactional systems (PostgreSQL, MySQL) favor normalized tables because they optimize for writes. GA4 data is nested for this reason.

STRUCTs

Named fields inside a column

Reading STRUCTs: dot notation

Table with a device STRUCT column

user_id	device.category	device.browser
u1	desktop	Chrome
u2	mobile	Safari
u3	desktop	Firefox

Each row’s device column holds an object:

{"category": "desktop", "browser": "Chrome"}

Dot notation reads the fields — just like device["category"] in Python:

SELECT user_id,
  device.category, device.browser
FROM events

Building STRUCTs from flat columns

Flat: 4 separate columns

SELECT user_id, city, country, region
FROM users

{"user_id": "u1", "city": "Kyiv",
 "country": "UA", "region": "Kyivska"}

Packed: 1 STRUCT column

SELECT user_id,
  STRUCT(city, country, region) AS location
FROM users

{"user_id": "u1",
 "location": {"city": "Kyiv",
               "country": "UA",
               "region": "Kyivska"}}

Same data, different shape. Same number of rows — STRUCTs don’t change the grain. The flat row has 4 keys; the packed row has 2 keys, one of which is a nested object.

ARRAYs

Multiple values in one cell

Flat vs. nested: the same data, two shapes

Flat: one row per item (5 rows)

order_id	product_name
1001	Google Tee
1001	Google Mug
1002	Nest Camera
1003	Android Sticker
1003	Google Pen

Nested: one row per order (3 rows)

order_id	product_names
1001	[Google Tee, Google Mug]
1002	[Nest Camera]
1003	[Android Sticker, Google Pen]

{"order_id": 1001,
 "product_names": ["Google Tee", "Google Mug"]}

5 rows → 3 rows. No information lost — the items are packed into arrays.

The left shape is what thelook gives you. The right shape is what GA4 gives you.

Building arrays: `ARRAY_AGG`

SELECT
  order_id,
  ARRAY_AGG(product_name) AS product_names
FROM order_items
GROUP BY order_id

Input — 5 rows

order_id	product_name
1001	Google Tee
1001	Google Mug
1002	Nest Camera
1003	Android Sticker
1003	Google Pen

Output — 3 rows

{"order_id": 1001,
 "product_names": ["Google Tee", "Google Mug"]}
{"order_id": 1002,
 "product_names": ["Nest Camera"]}
{"order_id": 1003,
 "product_names": ["Android Sticker", "Google Pen"]}

ARRAY_AGG works like COUNT or SUM with GROUP BY — but instead of computing one number, it collects all values into a list.

Flattening arrays: `UNNEST`

SELECT order_id, product_name
FROM orders_nested, UNNEST(product_names) AS product_name

Input — 3 rows (nested)

{"order_id": 1001,
 "product_names": ["Google Tee", "Google Mug"]}
{"order_id": 1002,
 "product_names": ["Nest Camera"]}
{"order_id": 1003,
 "product_names": ["Android Sticker", "Google Pen"]}

Output — 5 rows (flat)

order_id	product_name
1001	Google Tee
1001	Google Mug
1002	Nest Camera
1003	Android Sticker
1003	Google Pen

3 rows → 5 rows. Each array element becomes its own row.

UNNEST creates a cross join

Each parent row is paired with every element of its array:

Before — 3 rows

{"order_id": 1001, "revenue": 30,
 "products": ["Tee", "Mug"]}
{"order_id": 1002, "revenue": 15,
 "products": ["Camera"]}
{"order_id": 1003, "revenue": 25,
 "products": ["Sticker", "Pen"]}

After UNNEST — 5 rows

order_id	revenue	product
1001	30	Tee
1001	30	Mug
1002	15	Camera
1003	25	Sticker
1003	25	Pen

SUM(revenue) = 125, but real total is 70. The revenue gets duplicated for each array element.

The grain changes. Before: one row = one order. After: one row = one item. If you aggregate order-level columns after UNNEST, you double-count.

ARRAY_AGG and UNNEST are inverses

Flat rows ──ARRAY_AGG──→ Nested array ──UNNEST──→ Flat rows
(5 rows)                 (3 rows)                 (5 rows)

ARRAY_AGG packs rows into arrays. UNNEST unpacks arrays into rows.

For the group project, you mostly go nested → flat (UNNEST). GA4 data arrives nested; analysis needs it flat.

Useful ARRAY functions

Function	What it does	Example use
`ARRAY_LENGTH(arr)`	Count elements	Items per order — without UNNEST
`arr[OFFSET(0)]`	First element (0-based)	First item in an order
`arr[ORDINAL(1)]`	First element (1-based)	Same, different indexing
`ARRAY_TO_STRING(arr, ', ')`	Join into a string	“Tee, Mug, Pen”
`value IN UNNEST(arr)`	Check membership	Does this order contain “Google Tee”?

ARRAY_LENGTH and IN UNNEST let you query arrays without flattening them.

2 minutes

Don’t drill into each. The two most used: ARRAY_LENGTH for counting, IN UNNEST for filtering.

-- Orders with 3+ items (no UNNEST needed)
WHERE ARRAY_LENGTH(items) >= 3

-- Orders containing a specific product (no UNNEST needed)
WHERE 'Google Tee' IN UNNEST(product_names)

ARRAY<STRUCT>

Arrays where each element has multiple fields

ARRAY<STRING> vs. ARRAY<STRUCT>

ARRAY<STRING> — a list of values

{"product_names":
  ["Google Tee", "Google Mug"]}

One value per element.

ARRAY<STRUCT> — a list of objects

{"items": [
  {"name": "Google Tee",
   "price": 18.0, "category": "Apparel"},
  {"name": "Google Mug",
   "price": 12.0, "category": "Drinkware"}
]}

Multiple fields per element.

GA4’s items array stores name, price, category, quantity per item. That’s ARRAY<STRUCT> — the more common format in production data.

Reading ARRAY<STRUCT>: UNNEST + dot notation

What’s stored (1 row)

{"order_id": 1001,
 "items": [
  {"item_name": "Google Tee",
   "price_in_usd": 18.00,
   "item_category": "Apparel"},
  {"item_name": "Google Mug",
   "price_in_usd": 12.00,
   "item_category": "Drinkware"}
]}

After UNNEST (2 rows)

SELECT order_id,
  item.item_name,
  item.price_in_usd,
  item.item_category
FROM orders_nested, UNNEST(items) AS item

order_id	item_name	price_in_usd	item_category
1001	Google Tee	18.00	Apparel
1001	Google Mug	12.00	Drinkware

UNNEST(items) AS item gives each struct element a name. Then item.field accesses fields with dot notation.

Building ARRAY<STRUCT>: `ARRAY_AGG(STRUCT(...))`

SELECT
  order_id,
  ARRAY_AGG(STRUCT(
    product_name,
    sale_price,
    category
  )) AS items
FROM order_items
JOIN products USING (product_id)
GROUP BY order_id

Result — one row per order

{"order_id": 1001,
 "items": [
  {"product_name": "Google Tee",
   "sale_price": 18.0,
   "category": "Apparel"},
  {"product_name": "Google Mug",
   "sale_price": 12.0,
   "category": "Drinkware"}
]}

STRUCT() inside ARRAY_AGG() packs multiple columns per row into a single array element. The result is structurally identical to GA4’s items[] array.

Key-Value Arrays

GA4’s event_params

`items` vs. `event_params` — different arrays, different access

items — a list of products

Each element is a different product. UNNEST gives you one row per product.

event_params — key-value pairs

Each element is a different property of the same event. You usually want to extract specific keys as columns, not UNNEST the whole thing.

"event_params": [
  {"key": "page_location",
   "value": {"string_value": "https://shop.google.com/"}},
  {"key": "ga_session_id",
   "value": {"int_value": 1234567}},
  {"key": "engagement_time_msec",
   "value": {"int_value": 5200}}
]

Each element has a key and a value struct with four typed slots (only one populated per key).

Extracting a parameter: correlated subquery

SELECT
  event_date,
  event_name,
  (SELECT value.string_value
   FROM UNNEST(event_params)
   WHERE key = 'page_location')     AS page_location,
  (SELECT value.int_value
   FROM UNNEST(event_params)
   WHERE key = 'ga_session_id')     AS ga_session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`
LIMIT 10

Two ways to read event_params

Correlated subquery — extract specific keys

SELECT event_name,
  (SELECT value.string_value
   FROM UNNEST(event_params)
   WHERE key = 'page_location')
     AS page_location,
  (SELECT value.int_value
   FROM UNNEST(event_params)
   WHERE key = 'ga_session_id')
     AS session_id
FROM events

Result: same rows as input, specific keys as columns.

UNNEST in FROM — expand all keys

SELECT event_name,
  ep.key,
  ep.value.string_value,
  ep.value.int_value
FROM events, UNNEST(event_params) AS ep

Result: rows multiplied — one row per key per event.

Approach	Grain	Use when
Correlated subquery	1 row per event	Extracting known keys as columns (90% of project work)
UNNEST in FROM	1 row per key per event	Exploring what keys exist, aggregating across keys

How to know which value type to use

Run this once to discover the types:

SELECT
  key,
  COUNT(*) AS occurrences,
  COUNTIF(value.string_value IS NOT NULL) AS has_string,
  COUNTIF(value.int_value IS NOT NULL) AS has_int,
  COUNTIF(value.float_value IS NOT NULL) AS has_float,
  COUNTIF(value.double_value IS NOT NULL) AS has_double
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`,
     UNNEST(event_params)
GROUP BY key
ORDER BY occurrences DESC

This tells you: page_location → string_value, ga_session_id → int_value, engagement_time_msec → int_value, and so on.

The GA4 Dataset

GA4 event row structure

{
  "event_name": "page_view",
  "event_date": "20201101",
  "user_pseudo_id": "abc123",

  "device": {"category": "mobile", "browser": "Safari"},
  "geo": {"country": "United States", "city": "New York"},
  "traffic_source": {"medium": "organic", "source": "google"},

  "items": [{"item_name": "Google Tee", "price_in_usd": 18.0}, ...],

  "event_params": [
    {"key": "page_location", "value": {"string_value": "https://shop.google.com/"}},
    {"key": "ga_session_id", "value": {"int_value": 1234567}},
    ...
  ]
}

Three access patterns: device.category (dot notation), UNNEST(items), correlated subquery for event_params.

First look at the data

SELECT
  event_date,
  event_name,
  device.category AS device_type,           -- STRUCT: dot notation
  geo.country,                              -- STRUCT: dot notation
  (SELECT value.string_value                -- key-value bag:
   FROM UNNEST(event_params)                --   correlated subquery
   WHERE key = 'page_location')
     AS page_url,
  (SELECT value.int_value FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`
LIMIT 20

From events to sessions

The group project requires a sessions_base view: one row per session.

Event-level (raw GA4)

user	session_id	event_name
u1	s1	session_start
u1	s1	page_view
u1	s1	page_view
u1	s1	add_to_cart
u1	s1	purchase
u1	s2	session_start
u1	s2	page_view

Session-level (sessions_base)

user	session_id	events	pageviews	purchased
u1	s1	5	2	yes
u1	s2	2	1	no

The transformation: extract session_id from event_params → GROUP BY user + session → aggregate metrics.

You start building this in tomorrow’s practice.

JSON

JSON vs. native nested types

Not all nested data arrives as STRUCTs and ARRAYs. Sometimes a column is just a string that happens to contain valid JSON:

Native STRUCT — BigQuery knows the schema

-- dot notation works
SELECT device.category FROM events

The column has a defined type. BigQuery validates it at write time.

JSON string — BigQuery sees a plain STRING

-- the column is just text
SELECT payload FROM api_logs
-- '{"user": {"name": "Alice"}, "total": 42}'

No schema, no dot notation. You parse it at query time with special functions.

GA4 uses native types. But API logs, webhooks, and many exports store data as JSON strings.

JSON extraction — syntax reference

-- If payload is a JSON or STRING column:
SELECT
  JSON_VALUE(payload, '$.user.name') AS user_name,
  JSON_VALUE(payload, '$.order.total') AS order_total,
  JSON_QUERY_ARRAY(payload, '$.items') AS items_array
FROM api_logs

JSON_VALUE extracts a scalar. JSON_QUERY extracts an object or array. JSON_QUERY_ARRAY extracts an array you can UNNEST.

The $ is the root. Dot notation navigates the path. Bracket notation ($.items[0]) for array positions.

Wrap-up

Reference card

Data shape	Read	Build
STRUCT	`column.field`	`STRUCT(col1, col2)`
ARRAY	`UNNEST(column)` in FROM	`ARRAY_AGG(value)`
ARRAY<STRUCT>	`UNNEST(column) AS x`, then `x.field`	`ARRAY_AGG(STRUCT(col1, col2))`
Key-value array	`(SELECT value FROM UNNEST(col) WHERE key = '...')`	—
JSON	`JSON_VALUE(col, '$.path')`	—

The Build column is what Assignment 5 covers. The Read column is what the group project requires.

Next week and announcements

Tomorrow’s practices:

Practice 1: Build nested structures from thelook (STRUCT, ARRAY_AGG, UNNEST round-trip)
Practice 2: Explore the GA4 dataset, start building sessions_base

Coming up:

Group project spec is published — read it before tomorrow
Assignment 5 (nested structures on thelook) due this week
Week 7: checkpoint — working sessions_base + one analysis view