Week 6 Practice: Nested Data Structures

ECON250: Big Data Analytics

Published: March 12, 2026
Modified: March 21, 2026

Your Submission

Deadline: 3 days after the practice sessions
Points: 2 points for satisfactory completion

Save your work as views in your personal dataset (student__yourname). Each section tells you what to save and what to name it.

How to Submit

  1. In BigQuery Explorer, find your w6_ views in your student__yourname dataset
  2. For each view: click the three dots menu (⋮) → “Copy link”
  3. Paste all view links in Moodle

Practice 1: Build Nested Structures from Flat Data

Why Pack Data Into Nested Structures?

thelook_ecommerce stores orders across four normalized tables: orders, order_items, products, users. To get the full picture of one order (who bought it, what products, at what price) you need three JOINs. Every query that needs this information repeats the same three JOINs.

Now imagine a system that logs millions of events per day, like Google Analytics. Every analyst needs the same context: which user, which device, which products, which session. If that context lives in separate tables, every analyst writes the same JOINs, every query scans the same tables, and the bill grows.

The alternative: pack related data into a single row. Customer info becomes a STRUCT (one column with named fields). Line items become an ARRAY of STRUCTs (a list of objects inside one cell). One row = one complete order, no JOINs needed.

GA4 stores event data this way. In Practice 2, you’ll work with that real nested dataset. First, you’ll build the nested structure yourself from flat tables, so you understand what’s inside the row before you have to take it apart.

There’s also a BigQuery-specific reason. BigQuery is columnar and reads only the columns you reference, even inside nested structures. SELECT customer.country from a STRUCT doesn’t scan customer.email. Nesting related fields together doesn’t cost extra when you only need some of them.

Quick Reference

| Operation | Syntax | Effect on rows |
|---|---|---|
| Bundle columns into one | STRUCT(col1, col2) AS name | Same |
| Collect rows into a list | ARRAY_AGG(expr) ... GROUP BY | Fewer |
| Collect rows into a list of objects | ARRAY_AGG(STRUCT(col1, col2)) ... GROUP BY | Fewer |
| Expand a list back to rows | FROM table, UNNEST(array_col) AS alias | More |
| Count without expanding | ARRAY_LENGTH(array_col) | Same |
| Check membership without expanding | value IN UNNEST(array_col) | Same |

Part 1: STRUCTs

Query bigquery-public-data.thelook_ecommerce.users. Select a few columns normally, then pack first_name, last_name, email, country into a single customer STRUCT column:

SELECT
  id AS user_id,
  STRUCT(first_name, last_name, email, country) AS customer
FROM `bigquery-public-data.thelook_ecommerce.users`
LIMIT 10

Look at the result — the customer column contains an object with named fields.

Now access the fields back out with dot notation: SELECT user_id, customer.country .... Try using a STRUCT field in a WHERE clause. Try GROUP BY country using the struct field.
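Put together, those access patterns might look like this (the country value is just an example; any value in the data works):

SELECT
  customer.country,
  COUNT(*) AS n_users
FROM (
  SELECT
    id AS user_id,
    STRUCT(first_name, last_name, email, country) AS customer
  FROM `bigquery-public-data.thelook_ecommerce.users`
)
WHERE customer.country = 'United States'  -- STRUCT field in WHERE: dot notation
GROUP BY 1                                -- grouping by a STRUCT field works like any column

Note that the STRUCT field behaves like an ordinary column everywhere: SELECT, WHERE, and GROUP BY all take dot notation.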

Part 2: ARRAY_AGG

Query order_items joined with products for a small set of orders (pick 3-5 order IDs with WHERE order_id IN (...)).

Step 1: Build an array of product names per order:

SELECT
  oi.order_id,
  ARRAY_AGG(p.name) AS product_names
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id IN (1, 2, 3, 4, 5)
GROUP BY oi.order_id

How many rows did you start with? How many do you have now? Each array cell holds all the product names for that order.

Step 2: Replace ARRAY_AGG(p.name) with ARRAY_AGG(STRUCT(p.name AS product_name, oi.sale_price, p.category)) AS items — now each array element has multiple fields, and the items alias is what Part 3 will UNNEST. This is ARRAY<STRUCT>, the same shape as GA4’s items[] array.
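A sketch of the Step 2 query, keeping everything else from Step 1 (alias names are suggestions):

SELECT
  oi.order_id,
  ARRAY_AGG(STRUCT(p.name AS product_name, oi.sale_price, p.category)) AS items
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id IN (1, 2, 3, 4, 5)
GROUP BY oi.order_id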

Part 3: Round-Trip

Use your Part 2 result as a CTE. UNNEST it back to flat rows:

WITH nested AS (
  -- your Part 2 query here
)
SELECT order_id, item.product_name, item.sale_price, item.category
FROM nested, UNNEST(items) AS item

Verify you recover the original row count.

Then compute top 3 categories by total revenue (SUM(item.sale_price)) from the UNNESTed data. Run the same calculation directly on the flat order_items + products tables for the same order IDs. Do the numbers match?
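The flat-side version of that check might look like this (same five example order IDs as earlier, no nesting involved):

SELECT
  p.category,
  SUM(oi.sale_price) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id IN (1, 2, 3, 4, 5)
GROUP BY p.category
ORDER BY revenue DESC
LIMIT 3

If the nested round-trip is correct, both queries return identical numbers.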

Part 4: ARRAY Functions (optional)

Using your nested CTE from Part 2, try these without UNNEST:

  • ARRAY_LENGTH(items) — count items per order
  • 'some product' IN UNNEST(product_names) — check if an order contains a specific product (this works on the Step 1 array of plain names; IN UNNEST compares scalar values, so it won’t search inside STRUCT elements)
  • ARRAY_AGG(p.name ORDER BY oi.sale_price DESC) — what does ordering inside the array give you?
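A sketch of the first two, assuming a nested CTE that kept both the Step 1 product_names array and the Step 2 items array ('Google Tee' is a placeholder product name):

SELECT
  order_id,
  ARRAY_LENGTH(items) AS n_items,                   -- count without expanding
  'Google Tee' IN UNNEST(product_names) AS has_tee  -- membership without expanding
FROM nested

Both expressions leave the row count unchanged — one row per order in, one row per order out.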

Submission: w6_orders_nested

Build a view that produces one row per order with:

  • order_id, created_at, status — from orders
  • customer — a STRUCT with first_name, last_name, email, country from users
  • items — an ARRAY<STRUCT> with product_name, category, brand, sale_price from order_items + products

This requires joining three tables. One approach: build the items array in a CTE first (GROUP BY order_id), then join that to orders + users for the scalar and STRUCT columns. Limit to orders from 2024 onward to keep it fast.

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_orders_nested` AS
-- your query here

Verify: pick one order_id from your view, UNNEST the items, and confirm the products match what you’d get from a direct query on order_items.
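That verification can be two queries of this shape (order ID 1 is a placeholder — pick one that exists in your view):

SELECT item.product_name, item.sale_price
FROM `econ250-2026.student__yourname.w6_orders_nested`, UNNEST(items) AS item
WHERE order_id = 1

-- compare against the flat tables directly:
SELECT p.name, oi.sale_price
FROM `bigquery-public-data.thelook_ecommerce.order_items` oi
JOIN `bigquery-public-data.thelook_ecommerce.products` p ON oi.product_id = p.id
WHERE oi.order_id = 1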


Practice 2: Flatten GA4 Data

From Nested Back to Flat

In Practice 1 you built a nested order row — customer as a STRUCT, line items as an ARRAY<STRUCT>. GA4 stores its event data the same way: one row per event with device info, product lists, and session parameters all packed in. No JOINs needed to read a single event.

But nested data is awkward for analytical queries. You can’t GROUP BY session or compute revenue per product category when everything is packed inside arrays. So you flatten the nested data into tables at the grain you need (sessions, purchases, items) where GROUP BY, window functions, and JOINs work normally.

The three views you build today are this flattening step. They’re also the starting point for the group project.

The Data

bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117

One day of Google Analytics 4 event data from the Google Merchandise Store. ~60K events, ~49 MB per scan.

Each row is a single event (a page view, a click, a purchase):

event row
  ├── event_name, event_date, event_timestamp     ← scalars (read directly)
  ├── user_pseudo_id                               ← scalar
  ├── device (STRUCT)                              ← dot notation: device.category
  ├── geo (STRUCT)                                 ← dot notation: geo.country
  ├── ecommerce (STRUCT)                           ← dot notation: ecommerce.purchase_revenue_in_usd
  ├── traffic_source (STRUCT)                      ← dot notation (but: first-touch only!)
  ├── event_params (ARRAY<STRUCT>)                 ← key-value bag → correlated subquery
  │     {key: "page_location", value: {string_value: "https://..."}}
  │     {key: "ga_session_id", value: {int_value: 1234567}}
  │     {key: "engagement_time_msec", value: {int_value: 5200}}
  └── items (ARRAY<STRUCT>)                        ← product list → UNNEST in FROM
        {item_name: "Google Tee", price_in_usd: 18.0, item_category: "Apparel"}

Three access patterns:

| Data shape | How to access | Effect |
|---|---|---|
| STRUCT | Dot notation — no UNNEST | device.category, geo.country |
| ARRAY<STRUCT> (product list) | UNNEST(items) AS item in FROM, then item.item_name | Expands items to rows |
| ARRAY<STRUCT> (key-value bag) | Correlated subquery — extract one key as a column | Gets page_location without changing row count |

Most of the useful data is inside nested structures:

| Column | Type | How to access |
|---|---|---|
| event_name | STRING | Directly — page_view, session_start, purchase, etc. |
| event_date | STRING | Needs PARSE_DATE('%Y%m%d', event_date) — it’s stored as '20201117' |
| event_timestamp | INT64 | Microseconds since epoch — use TIMESTAMP_MICROS(event_timestamp) |
| user_pseudo_id | STRING | Anonymous user ID |
| device | STRUCT | Dot notation: device.category, device.browser, device.operating_system |
| geo | STRUCT | Dot notation: geo.country, geo.city |
| traffic_source | STRUCT | Dot notation: traffic_source.source, traffic_source.medium — but see note below |
| ecommerce | STRUCT | Dot notation: ecommerce.purchase_revenue_in_usd, ecommerce.total_item_quantity |
| event_params | ARRAY<STRUCT> | Key-value bag — use correlated subquery (see below) |
| items | ARRAY<STRUCT> | Product list — use UNNEST(items) AS item, then item.item_name, etc. |

Syntax you’ll need

| Pattern | Syntax |
|---|---|
| Extract a key from event_params | (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') |
| Extract an int key from event_params | (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') |
| Flatten items array | FROM events, UNNEST(items) AS item then item.item_name, item.price_in_usd |
| Read a STRUCT field | device.category, geo.country — no UNNEST needed |
| Convert event_timestamp | TIMESTAMP_MICROS(event_timestamp) |
| Parse event_date | PARSE_DATE('%Y%m%d', event_date) |
| First value in a group by timestamp | ARRAY_AGG(col IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] |
| Constant within a session | ANY_VALUE(device.category) — picks any value (all the same) |

Important: traffic_source.source is the user’s first-touch attribution — the same for every session by that user. The session-level source/medium lives in event_params (keys source and medium). For the group project, you’ll usually want the session-level version.

Orientation

Before building anything, explore:

  1. Event types: SELECT event_name, COUNT(*) AS cnt ... GROUP BY 1 ORDER BY 2 DESC. What types exist? Which are most frequent? How rare is purchase?

  2. event_params keys: Run the discovery query from lecture to find all keys and which value type each uses:

SELECT
  key,
  COUNT(*) AS occurrences,
  COUNTIF(value.string_value IS NOT NULL) AS has_string,
  COUNTIF(value.int_value IS NOT NULL) AS has_int,
  COUNTIF(value.float_value IS NOT NULL) AS has_float
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`,
     UNNEST(event_params)
GROUP BY key
ORDER BY occurrences DESC

Keep this output open — you’ll need it for the rest of practice. It tells you whether to use value.string_value or value.int_value for each key.

  3. Items array: How many events have a non-empty items array? (WHERE ARRAY_LENGTH(items) > 0). Which event types have items?

Warm-Up: Extract Nested Fields (no GROUP BY yet)

Before building any views, practice the two extraction patterns on a small sample. Run this query and look at the output:

SELECT
  event_name,
  user_pseudo_id,
  TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
  -- STRUCT fields: just dot notation
  device.category AS device_category,
  geo.country,
  -- Key-value bag: correlated subquery extracts one key as a column
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`
LIMIT 20

Notice: same number of rows in, same number of rows out. The correlated subquery runs once per row, finds the matching key, and returns one value. It does not multiply rows like UNNEST in the FROM clause would.

Try adding another extraction — pull source from event_params (it’s a string_value). Compare it to traffic_source.source for the same rows. Are they the same?
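One way to set that comparison up (same correlated-subquery pattern, new key):

SELECT
  traffic_source.source AS first_touch_source,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS session_source
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`
LIMIT 20

Expect the two columns to disagree for users whose later sessions arrived through a different channel than their first visit.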

Submission View 1: w6_purchases

Grain: one row per purchase event. No GROUP BY needed — just filter and extract.

This is the simplest view and uses the same correlated subquery pattern you just practiced.

| Column | Source |
|---|---|
| user_pseudo_id | top-level field |
| session_id | ga_session_id from event_params (int_value) |
| event_timestamp | TIMESTAMP_MICROS(event_timestamp) |
| purchase_revenue | ecommerce.purchase_revenue_in_usd |
| total_item_quantity | ecommerce.total_item_quantity |
| page_location | from event_params (string_value) |

Filter to WHERE event_name = 'purchase'. Each row is one purchase event with its nested fields extracted as flat columns.

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_purchases` AS
-- your query here

Submission View 2: w6_items

Grain: one row per item per purchase. This is where you use UNNEST on the items array.

| Column | Source |
|---|---|
| user_pseudo_id | top-level field |
| session_id | ga_session_id from event_params (int_value) |
| event_timestamp | TIMESTAMP_MICROS(event_timestamp) |
| item_name | item.item_name (after UNNEST) |
| item_category | item.item_category (after UNNEST) |
| price_in_usd | item.price_in_usd (after UNNEST) |
| quantity | item.quantity (after UNNEST) |
| purchase_revenue | ecommerce.purchase_revenue_in_usd |

Start from your w6_purchases logic, then add UNNEST(items) AS item in the FROM clause to get individual products. Notice that purchase_revenue is an event-level field that gets duplicated onto every item row. If you SUM(purchase_revenue) after UNNEST, you double-count — a $50 order with 3 items would contribute $150 to your sum.

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_items` AS
-- your query here
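Once the view exists, you can see the double-count directly. A sketch, assuming the column names specified above (price_in_usd * quantity is one reasonable item-level total; some GA4 sample rows have NULL quantities, which SUM skips):

SELECT
  SUM(purchase_revenue) AS revenue_double_counted,  -- event-level value repeated per item row
  SUM(price_in_usd * quantity) AS item_level_total  -- computed from item-level fields
FROM `econ250-2026.student__yourname.w6_items`

The first number overstates revenue; the second is the one to trust at item grain.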

Submission View 3: w6_sessions

Grain: one row per session. This is the hardest of the three views — it combines correlated subqueries (which you’ve now used twice) with GROUP BY aggregation.

| Column | Source |
|---|---|
| user_pseudo_id | top-level field |
| session_id | ga_session_id from event_params (int_value) |
| session_date | event_date parsed to DATE |
| device_category | device.category |
| country | geo.country |
| landing_page | first page_location in the session by event_timestamp |
| total_events | count of events in the session |
| total_pageviews | count of events where event_name = 'page_view' |
| has_purchase | TRUE if any event in the session is 'purchase' |

Approach: session_id lives inside event_params, and you need to both GROUP BY it and aggregate other fields. Use a two-step CTE — extract nested fields first, then group:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w6_sessions` AS
WITH events_flat AS (
  -- Step 1: extract nested fields so we can GROUP BY them
  SELECT
    *,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201117`
)
-- Step 2: one row per session — fill in the aggregations
SELECT
  user_pseudo_id,
  session_id,
  -- session_date: parse event_date to DATE
  -- device_category: constant within session, use ANY_VALUE()
  -- country: constant within session, use ANY_VALUE()
  -- landing_page: first page_location by event_timestamp (not alphabetical!)
  -- total_events: COUNT(*)
  -- total_pageviews: count only 'page_view' events
  -- has_purchase: TRUE if any event is 'purchase'
FROM events_flat
WHERE session_id IS NOT NULL
GROUP BY user_pseudo_id, session_id

Gotchas:

  • landing_page should be the chronologically first page in the session, not the alphabetically first. MIN(page_location) gives the wrong answer — you need an aggregation that respects timestamp order. Check the syntax reference table above for the right pattern.
  • Device and geo fields are constant within a session — ANY_VALUE() works for these.
  • Some events have NULL session_id — the CTE’s WHERE clause filters these out.
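For reference, the timestamp-ordered pattern from the syntax table, applied to landing_page, would look roughly like:

ARRAY_AGG(page_location IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS landing_page

The LIMIT 1 inside ARRAY_AGG keeps the aggregation cheap, and SAFE_OFFSET(0) returns NULL instead of erroring if the array comes back empty.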

Verify Your Work

Once all three views exist, cross-check them:

  • How many sessions in w6_sessions? How many had a purchase?
  • Does the count of purchase sessions in w6_sessions match the row count in w6_purchases? (They may differ slightly — one session can have multiple purchase events.)
  • Pick one purchase from w6_purchases. Can you find its items in w6_items by matching on session_id and event_timestamp?
  • What’s the most purchased product by quantity?
  • Try SUM(purchase_revenue) on w6_items vs. w6_purchases — do they match? Why or why not?
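The first two cross-checks as a query sketch (view names as above):

SELECT
  COUNT(*) AS total_sessions,
  COUNTIF(has_purchase) AS purchase_sessions
FROM `econ250-2026.student__yourname.w6_sessions`

-- compare purchase_sessions against:
SELECT COUNT(DISTINCT session_id) AS purchase_sessions
FROM `econ250-2026.student__yourname.w6_purchases`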

Going Further

  • Add source and medium to w6_sessions (from event_params, not traffic_source). Why does this distinction matter?
  • Add engagement_time_sec to w6_sessions (sum of engagement_time_msec from event_params, converted to seconds).
  • What percentage of sessions are single-event? What does the distribution of events per session look like?
  • Build w6_sessions across the full date range (events_* with WHERE event_date BETWEEN '20201101' AND '20210131'). How does performance change?

Troubleshooting

| Problem | Solution |
|---|---|
| event_date is not a DATE | It’s a STRING like '20201117' — use PARSE_DATE('%Y%m%d', event_date) |
| Correlated subquery returns error | Make sure you’re UNNESTing the right column — FROM UNNEST(event_params), not FROM UNNEST(items) |
| Can’t GROUP BY a correlated subquery | Extract it in a CTE first, then GROUP BY the CTE column |
| Row count explodes after UNNEST | Expected — you changed the grain. Each array element becomes its own row |
| purchase_revenue is NULL | Only purchase events have ecommerce data — other event types have NULLs |
| landing_page looks wrong | MIN(page_location) gives alphabetical first, not chronological. Use ARRAY_AGG(... ORDER BY event_timestamp LIMIT 1) |
| ARRAY_AGG returns NULL | Some values are NULL — try IGNORE NULLS |
| Session counts don’t match between views | One session can have multiple purchase events — w6_purchases counts events, w6_sessions counts sessions |

Summary

  • STRUCTs are accessed with dot notation, no UNNEST needed — device.category just works
  • ARRAY_AGG packs rows into arrays; UNNEST unpacks them back. These are inverse operations.
  • UNNEST changes the grain — always know what one row represents after each transformation
  • Correlated subqueries extract specific keys from key-value arrays without changing the grain
  • The three views (sessions, purchases, items) each represent a different grain of GA4 data — this is the starting point for the group project