Week 6 Practice: Nested Data Structures

ECON250: Big Data Analytics

Published: March 12, 2026
Modified: March 21, 2026

Your Submission

Deadline: 3 days after the practice sessions
Points: 2 points for satisfactory completion

Save your work as views in your personal dataset (student__yourname). Each section tells you what to save and what to name it.

How to Submit

  1. In BigQuery Explorer, find your w6_ views in your student__yourname dataset
  2. For each view: click the three dots menu (⋮) → “Copy link”
  3. Paste all view links in Moodle

Practice 1: Build Nested Structures from Flat Data

Why Pack Data Into Nested Structures?

thelook_ecommerce stores orders across four normalized tables: orders, order_items, products, users. To get the full picture of one order (who bought it, what products, at what price) you need three JOINs. Every query that needs this information repeats the same three JOINs.

Now imagine a system that logs millions of events per day, like Google Analytics. Every analyst needs the same context: which user, which device, which products, which session. If that context lives in separate tables, every analyst writes the same JOINs, every query scans the same tables, and the bill grows.

The alternative: pack related data into a single row. Customer info becomes a STRUCT (one column with named fields). Line items become an ARRAY of STRUCTs (a list of objects inside one cell). One row = one complete order, no JOINs needed.

GA4 stores event data this way. In Practice 2, you’ll work with that real nested dataset. First, you’ll build the nested structure yourself from flat tables, so you understand what’s inside the row before you have to take it apart.

There’s also a BigQuery-specific reason. BigQuery is columnar and reads only the columns you reference, even inside nested structures. SELECT customer.country from a STRUCT doesn’t scan customer.email. Nesting related fields together doesn’t cost extra when you only need some of them.

Quick Reference

| Operation | Syntax | Effect on rows |
|---|---|---|
| Bundle columns into one | STRUCT(col1, col2) AS name | Same |
| Collect rows into a list | ARRAY_AGG(expr) ... GROUP BY | Fewer |
| Collect rows into a list of objects | ARRAY_AGG(STRUCT(col1, col2)) ... GROUP BY | Fewer |
| Expand a list back to rows | FROM table, UNNEST(array_col) AS alias | More |
| Count without expanding | ARRAY_LENGTH(array_col) | Same |
| Check membership without expanding | value IN UNNEST(array_col) | Same |

Part 1: STRUCTs

Query bigquery-public-data.thelook_ecommerce.users. Select a few columns normally, then pack first_name, last_name, email, country into a single customer STRUCT column:

SELECT
  id AS user_id,
  STRUCT(first_name, last_name, email, country) AS customer
FROM `bigquery-public-data.thelook_ecommerce.users`
LIMIT 10

Look at the result — the customer column contains an object with named fields.

Now access the fields back out with dot notation: SELECT user_id, customer.country .... Try using a STRUCT field in a WHERE clause. Try GROUP BY country using the struct field.
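Put together, those access patterns might look like this (the country value is just an example; any value in the data works):

SELECT
  customer.country,
  COUNT(*) AS n_users
FROM (
  SELECT
    id AS user_id,
    STRUCT(first_name, last_name, email, country) AS customer
  FROM `bigquery-public-data.thelook_ecommerce.users`
)
WHERE customer.country = 'United States'  -- STRUCT field in WHERE: dot notation
GROUP BY 1                                -- grouping by a STRUCT field works like any column

Note that the STRUCT field behaves like an ordinary column everywhere: SELECT, WHERE, and GROUP BY all take dot notation.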

Part 2: ARRAY_AGG

Query order_items joined with products for a small set of orders (pick 3-5 order IDs with WHERE order_id IN (...)).

Step 1: Build an array of product names per order:

SELECT
  oi.order_id,
  ARRAY_AGG(p.name) AS product_names
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id IN (1, 2, 3, 4, 5)
GROUP BY oi.order_id

How many rows did you start with? How many do you have now? Each array cell holds all the product names for that order.

Step 2: Replace ARRAY_AGG(p.name) with ARRAY_AGG(STRUCT(p.name AS product_name, oi.sale_price, p.category)) AS items — now each array element has multiple fields, and the items alias is what Part 3 will UNNEST. This is ARRAY<STRUCT>, the same shape as GA4’s items[] array.
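A sketch of the Step 2 query, keeping everything else from Step 1 (alias names are suggestions):

SELECT
  oi.order_id,
  ARRAY_AGG(STRUCT(p.name AS product_name, oi.sale_price, p.category)) AS items
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id IN (1, 2, 3, 4, 5)
GROUP BY oi.order_id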

Part 3: Round-Trip

Use your Part 2 result as a CTE. UNNEST it back to flat rows:

WITH nested AS (
  -- your Part 2 query here
)
SELECT order_id, item.product_name, item.sale_price, item.category
FROM nested, UNNEST(items) AS item

Verify you recover the original row count.

Then compute top 3 categories by total revenue (SUM(item.sale_price)) from the UNNESTed data. Run the same calculation directly on the flat order_items + products tables for the same order IDs. Do the numbers match?
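The flat-side version of that check might look like this (same five example order IDs as earlier, no nesting involved):

SELECT
  p.category,
  SUM(oi.sale_price) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id IN (1, 2, 3, 4, 5)
GROUP BY p.category
ORDER BY revenue DESC
LIMIT 3

If the nested round-trip is correct, both queries return identical numbers.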

Part 4: ARRAY Functions (optional)

Using your nested CTE from Part 2, try these without UNNEST:

  • ARRAY_LENGTH(items) — count items per order
  • 'some product' IN UNNEST(product_names) — check if an order contains a specific product (this works on the Step 1 array of plain names; IN UNNEST compares scalar values, so it won’t search inside STRUCT elements)
  • ARRAY_AGG(p.name ORDER BY oi.sale_price DESC) — what does ordering inside the array give you?
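A sketch of the first two, assuming a nested CTE that kept both the Step 1 product_names array and the Step 2 items array ('Google Tee' is a placeholder product name):

SELECT
  order_id,
  ARRAY_LENGTH(items) AS n_items,                   -- count without expanding
  'Google Tee' IN UNNEST(product_names) AS has_tee  -- membership without expanding
FROM nested

Both expressions leave the row count unchanged — one row per order in, one row per order out.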

Submission: w6_orders_nested

Build a view that produces one row per order with:

  • order_id, created_at, status — from orders
  • customer — a STRUCT with first_name, last_name, email, country from users
  • items — an ARRAY<STRUCT> with product_name, category, brand, sale_price from order_items + products

This requires joining three tables. One approach: build the items array in a CTE first (GROUP BY order_id), then join that to orders + users for the scalar and STRUCT columns. Limit to orders from 2024 onward to keep it fast.

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_orders_nested` AS
-- your query here

Verify: pick one order_id from your view, UNNEST the items, and confirm the products match what you’d get from a direct query on order_items.
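That verification can be two queries of this shape (order ID 1 is a placeholder — pick one that exists in your view):

SELECT item.product_name, item.sale_price
FROM `econ250-2026.student__yourname.w6_orders_nested`, UNNEST(items) AS item
WHERE order_id = 1

-- compare against the flat tables directly:
SELECT p.name, oi.sale_price
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id = 1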


Practice 2: Flatten GA4 Data

From Nested Back to Flat

In Practice 1 you built a nested order row — customer as a STRUCT, line items as an ARRAY<STRUCT>. GA4 stores its event data the same way: one row per event with device info, product lists, and session parameters all packed in. No JOINs needed to read a single event.

But nested data is awkward for analytical queries. You can’t GROUP BY session or compute revenue per product category when everything is packed inside arrays. So you flatten the nested data into tables at the grain you need (sessions, purchases, items) where GROUP BY, window functions, and JOINs work normally.

The three views you build today are this flattening step. They’re also the starting point for the group project.

The Data

bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117

One day of Google Analytics 4 event data from the Google Merchandise Store. ~60K events, ~49 MB per scan.

Each row is a single event (a page view, a click, a purchase):

event row
  ├── event_name, event_date, event_timestamp     ← scalars (read directly)
  ├── user_pseudo_id                               ← scalar
  ├── device (STRUCT)                              ← dot notation: device.category
  ├── geo (STRUCT)                                 ← dot notation: geo.country
  ├── ecommerce (STRUCT)                           ← dot notation: ecommerce.purchase_revenue_in_usd
  ├── traffic_source (STRUCT)                      ← dot notation (but: first-touch only!)
  ├── event_params (ARRAY<STRUCT>)                 ← key-value bag → correlated subquery
  │     {key: "page_location", value: {string_value: "https://..."}}
  │     {key: "ga_session_id", value: {int_value: 1234567}}
  │     {key: "engagement_time_msec", value: {int_value: 5200}}
  └── items (ARRAY<STRUCT>)                        ← product list → UNNEST in FROM
        {item_name: "Google Tee", price_in_usd: 18.0, item_category: "Apparel"}

Three access patterns:

| Data shape | How to access | Effect |
|---|---|---|
| STRUCT | Dot notation — no UNNEST | device.category, geo.country |
| ARRAY<STRUCT> (product list) | UNNEST(items) AS item in FROM, then item.item_name | Expands items to rows |
| ARRAY<STRUCT> (key-value bag) | Correlated subquery — extract one key as a column | Gets page_location without changing row count |

Most of the useful data is inside nested structures:

| Column | Type | How to access |
|---|---|---|
| event_name | STRING | Directly — page_view, session_start, purchase, etc. |
| event_date | STRING | Needs PARSE_DATE('%Y%m%d', event_date) — it’s stored as '20201117' |
| event_timestamp | INT64 | Microseconds since epoch — use TIMESTAMP_MICROS(event_timestamp) |
| user_pseudo_id | STRING | Anonymous user ID |
| device | STRUCT | Dot notation: device.category, device.browser, device.operating_system |
| geo | STRUCT | Dot notation: geo.country, geo.city |
| traffic_source | STRUCT | Dot notation: traffic_source.source, traffic_source.medium — but see note below |
| ecommerce | STRUCT | Dot notation: ecommerce.purchase_revenue_in_usd, ecommerce.total_item_quantity |
| event_params | ARRAY<STRUCT> | Key-value bag — use correlated subquery (see below) |
| items | ARRAY<STRUCT> | Product list — use UNNEST(items) AS item, then item.item_name, etc. |

Syntax you’ll need

| Pattern | Syntax |
|---|---|
| Extract a key from event_params | (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') |
| Extract an int key from event_params | (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') |
| Flatten items array | FROM events, UNNEST(items) AS item then item.item_name, item.price_in_usd |
| Read a STRUCT field | device.category, geo.country — no UNNEST needed |
| Convert event_timestamp | TIMESTAMP_MICROS(event_timestamp) |
| Parse event_date | PARSE_DATE('%Y%m%d', event_date) |
| First value in a group by timestamp | ARRAY_AGG(col IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] |
| Constant within a session | ANY_VALUE(device.category) — picks any value (all the same) |

Important: traffic_source.source is the user’s first-touch attribution — the same for every session by that user. The session-level source/medium lives in event_params (keys source and medium). For the group project, you’ll usually want the session-level version.

Orientation

Before building anything, explore:

  1. Event types: SELECT event_name, COUNT(*) AS cnt ... GROUP BY 1 ORDER BY 2 DESC. What types exist? Which are most frequent? How rare is purchase?

  2. event_params keys: Run the discovery query from lecture to find all keys and which value type each uses:

SELECT
  key,
  COUNT(*) AS occurrences,
  COUNTIF(value.string_value IS NOT NULL) AS has_string,
  COUNTIF(value.int_value IS NOT NULL) AS has_int,
  COUNTIF(value.float_value IS NOT NULL) AS has_float
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`,
     UNNEST(event_params)
GROUP BY key
ORDER BY occurrences DESC

Keep this output open — you’ll need it for the rest of practice. It tells you whether to use value.string_value or value.int_value for each key.

  3. Items array: How many events have a non-empty items array? (WHERE ARRAY_LENGTH(items) > 0). Which event types have items?

Warm-Up: Extract Nested Fields (no GROUP BY yet)

Before building any views, practice the two extraction patterns on a small sample. Run this query and look at the output:

SELECT
  event_name,
  user_pseudo_id,
  TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
  -- STRUCT fields: just dot notation
  device.category AS device_category,
  geo.country,
  -- Key-value bag: correlated subquery extracts one key as a column
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`
LIMIT 20

Notice: same number of rows in, same number of rows out. The correlated subquery runs once per row, finds the matching key, and returns one value. It does not multiply rows like UNNEST in the FROM clause would.

Try adding another extraction — pull source from event_params (it’s a string_value). Compare it to traffic_source.source for the same rows. Are they the same?
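One way to set that comparison up (same correlated-subquery pattern, new key):

SELECT
  traffic_source.source AS first_touch_source,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS session_source
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`
LIMIT 20

Expect the two columns to disagree for users whose later sessions arrived through a different channel than their first visit.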

Submission View 1: w6_purchases

Grain: one row per purchase event. No GROUP BY needed — just filter and extract.

This is the simplest view and uses the same correlated subquery pattern you just practiced.

| Column | Source |
|---|---|
| user_pseudo_id | top-level field |
| session_id | ga_session_id from event_params (int_value) |
| event_timestamp | TIMESTAMP_MICROS(event_timestamp) |
| purchase_revenue | ecommerce.purchase_revenue_in_usd |
| total_item_quantity | ecommerce.total_item_quantity |
| page_location | from event_params (string_value) |

Filter to WHERE event_name = 'purchase'. Each row is one purchase event with its nested fields extracted as flat columns.

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_purchases` AS
-- your query here

Submission View 2: w6_items

Grain: one row per item per purchase. This is where you use UNNEST on the items array.

| Column | Source |
|---|---|
| user_pseudo_id | top-level field |
| session_id | ga_session_id from event_params (int_value) |
| event_timestamp | TIMESTAMP_MICROS(event_timestamp) |
| item_name | item.item_name (after UNNEST) |
| item_category | item.item_category (after UNNEST) |
| price_in_usd | item.price_in_usd (after UNNEST) |
| quantity | item.quantity (after UNNEST) |
| purchase_revenue | ecommerce.purchase_revenue_in_usd |

Start from your w6_purchases logic, then add UNNEST(items) AS item in the FROM clause to get individual products. Notice that purchase_revenue is an event-level field that gets duplicated onto every item row. If you SUM(purchase_revenue) after UNNEST, you double-count — a $50 order with 3 items would contribute $150 to your sum.

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_items` AS
-- your query here
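Once the view exists, you can see the double-count directly. A sketch, assuming the column names specified above (price_in_usd * quantity is one reasonable item-level total; some GA4 sample rows have NULL quantities, which SUM skips):

SELECT
  SUM(purchase_revenue) AS revenue_double_counted,  -- event-level value repeated per item row
  SUM(price_in_usd * quantity) AS item_level_total  -- computed from item-level fields
FROM `econ250-2026.student__yourname.w6_items`

The first number overstates revenue; the second is the one to trust at item grain.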

Submission View 3: w6_sessions

Grain: one row per session. This is the hardest of the three views — it combines correlated subqueries (which you’ve now used twice) with GROUP BY aggregation.

| Column | Source |
|---|---|
| user_pseudo_id | top-level field |
| session_id | ga_session_id from event_params (int_value) |
| session_date | event_date parsed to DATE |
| device_category | device.category |
| country | geo.country |
| landing_page | first page_location in the session by event_timestamp |
| total_events | count of events in the session |
| total_pageviews | count of events where event_name = 'page_view' |
| has_purchase | TRUE if any event in the session is 'purchase' |

Approach: session_id lives inside event_params, and you need to both GROUP BY it and aggregate other fields. Use a two-step CTE — extract nested fields first, then group:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_sessions` AS
WITH events_flat AS (
  -- Step 1: extract nested fields so we can GROUP BY them
  SELECT
    *,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`
)
-- Step 2: one row per session — fill in the aggregations
SELECT
  user_pseudo_id,
  session_id,
  -- session_date: parse event_date to DATE
  -- device_category: constant within session, use ANY_VALUE()
  -- country: constant within session, use ANY_VALUE()
  -- landing_page: first page_location by event_timestamp (not alphabetical!)
  -- total_events: COUNT(*)
  -- total_pageviews: count only 'page_view' events
  -- has_purchase: TRUE if any event is 'purchase'
FROM events_flat
WHERE session_id IS NOT NULL
GROUP BY user_pseudo_id, session_id

Gotchas:

  • landing_page should be the chronologically first page in the session, not the alphabetically first. MIN(page_location) gives the wrong answer — you need an aggregation that respects timestamp order. Check the syntax reference table above for the right pattern.
  • Device and geo fields are constant within a session — ANY_VALUE() works for these.
  • Some events have NULL session_id — the CTE’s WHERE clause filters these out.
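For reference, the timestamp-ordered pattern from the syntax table, applied to landing_page, would look roughly like:

ARRAY_AGG(page_location IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS landing_page

The LIMIT 1 inside ARRAY_AGG keeps the aggregation cheap, and SAFE_OFFSET(0) returns NULL instead of erroring if the array comes back empty.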

Verify Your Work

Once all three views exist, cross-check them:

  • How many sessions in w6_sessions? How many had a purchase?
  • Does the count of purchase sessions in w6_sessions match the row count in w6_purchases? (They may differ slightly — one session can have multiple purchase events.)
  • Pick one purchase from w6_purchases. Can you find its items in w6_items by matching on session_id and event_timestamp?
  • What’s the most purchased product by quantity?
  • Try SUM(purchase_revenue) on w6_items vs. w6_purchases — do they match? Why or why not?
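The first two cross-checks as a query sketch (view names as above):

SELECT
  COUNT(*) AS total_sessions,
  COUNTIF(has_purchase) AS purchase_sessions
FROM `econ250-2026.student__yourname.w6_sessions`

-- compare purchase_sessions against:
SELECT COUNT(DISTINCT session_id) AS purchase_sessions
FROM `econ250-2026.student__yourname.w6_purchases`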

Going Further

  • Add source and medium to w6_sessions (from event_params, not traffic_source). Why does this distinction matter?
  • Add engagement_time_sec to w6_sessions (sum of engagement_time_msec from event_params, converted to seconds).
  • What percentage of sessions are single-event? What does the distribution of events per session look like?
  • Build w6_sessions across the full date range (events_* with WHERE event_date BETWEEN '20201101' AND '20210131'). How does performance change?

Troubleshooting

| Problem | Solution |
|---|---|
| event_date is not a DATE | It’s a STRING like '20201117' — use PARSE_DATE('%Y%m%d', event_date) |
| Correlated subquery returns error | Make sure you’re UNNESTing the right column — FROM UNNEST(event_params), not FROM UNNEST(items) |
| Can’t GROUP BY a correlated subquery | Extract it in a CTE first, then GROUP BY the CTE column |
| Row count explodes after UNNEST | Expected — you changed the grain. Each array element becomes its own row |
| purchase_revenue is NULL | Only purchase events have ecommerce data — other event types have NULLs |
| landing_page looks wrong | MIN(page_location) gives alphabetical first, not chronological. Use ARRAY_AGG(... ORDER BY event_timestamp LIMIT 1) |
| ARRAY_AGG returns NULL | Some values are NULL — try IGNORE NULLS |
| Session counts don’t match between views | One session can have multiple purchase events — w6_purchases counts events, w6_sessions counts sessions |

Summary

  • STRUCTs are accessed with dot notation, no UNNEST needed — device.category just works
  • ARRAY_AGG packs rows into arrays; UNNEST unpacks them back. These are inverse operations.
  • UNNEST changes the grain — always know what one row represents after each transformation
  • Correlated subqueries extract specific keys from key-value arrays without changing the grain
  • The three views (sessions, purchases, items) each represent a different grain of GA4 data — this is the starting point for the group project