Group Project: Google Merchandise Store Analytics

ECON250: Big Data Analytics

Published

March 21, 2026

Points: 35
Groups: 3–4 students
Assigned: Week 6
Checkpoint: Week 7
Final Submission + Presentations: Week 8
Exact deadlines: Posted on Moodle

Form your group and register via the Google Spreadsheet (pinned in #econ250-2026 on Slack).

On using AI for this final project

AI tools are allowed and can genuinely help you on this project — but how you use them matters. The goal is to use AI to learn faster and get unstuck, not to outsource the analytical thinking that the project is designed to develop.

Use AI to learn techniques you haven’t seen before. If your area needs a query pattern you don’t know — cohort retention tables, funnel analysis, percentile buckets — ask an AI to explain the approach, show you a simplified example, then adapt it to the GA4 data yourself. The learning happens in the adaptation: figuring out which fields to use, handling NULLs, choosing the right grain.

Use AI for debugging. Paste an error message or unexpected result and ask what’s going wrong. This is one of the highest-value uses — AI is good at spotting syntax issues, unintended cross-joins, and off-by-one problems.

Use AI to review and improve your SQL. After you’ve written a query, ask AI to review it for correctness, efficiency, or readability. You’ll learn more from having your own work critiqued than from reading AI-generated code.

Don’t ask AI to “analyze the Google Merchandise Store dataset.” If you prompt an AI with your area description and ask it to write the queries and interpret the results, you’ll get generic output that sounds plausible but doesn’t reflect what’s actually in the data. You’ll also struggle to explain it during Q&A — and you will be asked.

Don’t paste AI output into your report without rewriting it. AI-generated interpretation tends to be vague and hedging (“this could suggest…” / “further analysis would be needed…”). Your report should make specific claims backed by specific numbers from your queries.

Note: Disclosure requirements

Per course policy, document all AI usage in your report: which tools, what you asked, and how you used the output. Be specific — “used ChatGPT to debug a CROSS JOIN issue in the funnel query” is useful; “used AI for some queries” is not. During Week 8 presentations, you may be asked to explain any part of your analysis or SQL in detail.

Objective

You have three months of Google Analytics 4 (GA4) event data from the Google Merchandise Store (November 2020 – January 2021), covering the holiday shopping season. The data arrives in its raw nested format — the same format that real GA4 installations produce. Each group member picks an analytical area, analyzes it in depth, and together you produce a report covering the store’s performance during this period.

Analytical Areas

Each group member chooses one area from the list below. Groups of 3 pick 3 areas; groups of 4 pick 4. Not every area needs to be covered — choose the ones that interest your group.

Traffic & Channels

How do visitors find the store, and which sources bring the most valuable traffic?

Analyze where sessions come from (organic search, paid, direct, referral, social), how the channel mix changes over time, and which channels bring engaged, converting visitors vs. high-volume low-quality traffic. Consider how new vs. returning visitors differ by channel and whether the holiday season shifts the acquisition mix.

Revenue & Products

How is revenue distributed across products, prices, and time — and what patterns stand out?

Analyze revenue trends over the 3-month period, how revenue concentrates across products and price points, which products drive views vs. actual purchases, and how the holiday season affects order values and purchasing patterns. Consider whether product co-purchases reveal natural bundles.

User Engagement

What distinguishes engaged visitors from bouncers, and how do users interact with the site?

Analyze session depth, time on site, pages visited, and site search behavior. Compare how engagement differs across devices, channels, and visitor types. Consider what separates sessions that lead to a purchase from those that don’t — not in terms of conversion funnels, but in terms of browsing behavior and engagement signals.

Funnel & Conversion

How do users progress through the purchase process, and where do they drop off?

Build a conversion funnel from product views through checkout to purchase and analyze where the biggest drop-offs occur. Consider how the funnel shape varies by channel, device, or visitor type. Analyze time-to-purchase within sessions and whether there are points beyond which conversion becomes unlikely.

Retention & Cohorts

Do visitors come back, and what drives repeat engagement?

Define user cohorts by their first visit week and track how many return in subsequent weeks. Analyze whether acquisition channel affects retention, whether holiday-acquired users behave differently from users acquired in normal weeks, and how quickly repeat purchases happen for those who buy more than once.

Tip: These are starting points, not checklists

The descriptions above suggest directions to explore — you do not need to cover everything listed, and you are welcome to go in directions not mentioned. The best analyses follow what the data reveals rather than mechanically answering every suggested angle.

Tip: Explore before you analyze

Before diving into your area, spend time exploring the raw data: what event types exist, what keys appear in event_params, what the items array looks like. The more you understand the data upfront, the fewer surprises you will encounter later — and the better your sessions_base design will be.

Tip: “No difference” is a real finding

If you segment by some dimension and find no meaningful variation, that is a legitimate result. Report what the data shows — do not twist numbers to manufacture a story that is not there.

Roles and Ownership

Each group member owns one analytical area and one grading component. Every member analyzes their area, writes their report section, reviews the group’s work in their component area, and presents their findings during Week 8.

| Component | Points | Scoring | Owner's responsibility |
|---|---|---|---|
| SQL Quality | 13 | Group | Builds sessions_base. Reviews all queries for correctness, consistent style, NULL handling, and efficiency. |
| Analytical Depth | 11 | Individual per area | Ensures each area goes beyond surface-level aggregation. Pushes for useful segmentation and cross-area connections. |
| Analytical Reasoning | 6 | Individual per area | Edits the report. Ensures every section has interpretation, not just numbers. Checks that assumptions are stated. |
| Documentation & Presentation | 5 | Group | Owns report formatting, presentation structure, timing, and AI disclosure completeness. |

In groups of 3, the Analytical Reasoning and Documentation & Presentation owners are the same person (combined: 11 pts). List your component assignments in the checkpoint.

Deliverables

| When | What | Where |
|---|---|---|
| Week 7 | Checkpoint (online text) | Moodle |
| Week 8 | sessions_base view + table_sessions_base (materialized copy) | Group BigQuery dataset (group_NN) |
| Week 8 | Analysis views (e.g., traffic_channel_scorecard) | Group BigQuery dataset (group_NN) |
| Week 8 | PDF report (one per group) | Moodle |
| Week 8 | Presentation (~15 min + ~10 min Q&A) | In class |

Group BigQuery datasets (group_NN) will be created by the instructor after groups register. Until then, develop in your personal student__* dataset.

Details on each below.

sessions_base view

Every group must create a sessions_base view in their group dataset — one row per session, aggregated from the raw event-level data. At a minimum, each row should capture:

  • Session identity: which user, which session, which visit number, what date
  • Entry point: the landing page (the first page viewed in the session)
  • Traffic attribution: source, medium, and campaign for the session
  • Device and location: device type, operating system, browser, country, city
  • Engagement metrics: event count, pageview count, engagement time
  • Conversion: whether a purchase occurred and the session’s revenue
  • Visitor type: whether this is the user’s first visit

How you extract these from the raw event data — which fields to use, how to handle NULLs and duplicates, how to pick the “first” value when multiple exist — is a design decision your group must make and document. You may add additional columns if your chosen areas require them.

Save both a view (sessions_base) and a table (table_sessions_base) in your group dataset. The view preserves the SQL for review; the table stores the results so downstream queries don’t re-scan ~3.6 GB every time.
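To make the expected shape concrete, here is a minimal sketch of a sessions_base view. This is an illustration, not the required design: the project and dataset names (your_project, group_NN) are placeholders, and the column choices, NULL handling, attribution fields, and "first value" logic are deliberately left for your group to decide and document.

```sql
-- Sketch only: a minimal sessions_base skeleton. Extend it with landing
-- page, traffic attribution, engagement time, and visitor-type columns.
CREATE OR REPLACE VIEW `your_project.group_NN.sessions_base` AS
WITH events AS (
  SELECT
    user_pseudo_id,
    -- Session key lives in the event_params key-value bag.
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS ga_session_id,
    event_date,
    event_name,
    device.category AS device_category,
    geo.country AS country,
    ecommerce.purchase_revenue_in_usd AS purchase_revenue
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
)
SELECT
  user_pseudo_id,
  ga_session_id,
  MIN(event_date) AS session_date,
  ANY_VALUE(device_category) AS device_category,
  ANY_VALUE(country) AS country,
  COUNT(*) AS event_count,
  COUNTIF(event_name = 'page_view') AS pageview_count,
  COUNTIF(event_name = 'purchase') > 0 AS has_purchase,
  SUM(IF(event_name = 'purchase', IFNULL(purchase_revenue, 0), 0)) AS session_revenue
FROM events
GROUP BY user_pseudo_id, ga_session_id;
```

Note that ANY_VALUE is one possible way to collapse per-event fields to session grain; picking the first value by timestamp is a stronger design and is part of the decisions you should document.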

Analysis views

Save your work in your group’s BigQuery dataset (group_NN). Until your group dataset is ready, develop in your personal student__* dataset.

For each analysis, create views so the instructor can see the underlying SQL. Name views with an {area}_{description} pattern (e.g., traffic_channel_scorecard, revenue_weekly_trend, funnel_by_device, engagement_session_depth, retention_weekly_cohorts). For analyses that need the items array, event sequences, or specific event_params not in sessions_base, query the raw events table directly.

Each view’s SQL should include:

  • A header comment with the analysis description and the owner’s name (e.g., -- Traffic: Channel volume vs. quality scorecard [Owner: Name])
  • Brief inline comments explaining non-obvious logic
  • CTEs for readability
  • SQL style: UPPERCASE keywords, lowercase identifiers, meaningful aliases
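For instance, a small view following these conventions might look like the sketch below. The view name and the sessions_base columns it assumes (device_category, has_purchase) are illustrative, not prescribed:

```sql
-- Engagement: Sessions and purchase rate by device [Owner: Name]
-- Sketch only: illustrates the header-comment, CTE, and style conventions.
CREATE OR REPLACE VIEW `your_project.group_NN.engagement_by_device` AS
WITH by_device AS (
  -- Aggregate the materialized sessions table, not the raw events.
  SELECT
    device_category,
    COUNT(*) AS sessions,
    COUNTIF(has_purchase) AS purchase_sessions
  FROM `your_project.group_NN.table_sessions_base`
  GROUP BY device_category
)
SELECT
  device_category,
  sessions,
  purchase_sessions,
  ROUND(purchase_sessions / sessions, 4) AS purchase_rate
FROM by_device
ORDER BY sessions DESC;
```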

Analytical report

Submit your report as a PDF on Moodle. Include:

  1. Team & ownership — who built sessions_base, who owns which area and component
  2. Executive summary — 3–5 most important findings across all areas
  3. Data notes — design choices in sessions_base, data quality issues and how you handled them
  4. Analysis sections (one per area) — what you explored, results, and what stands out. Include tables or screenshots.
  5. Cross-area connections — where findings from different areas reinforce, contradict, or extend each other
  6. AI usage disclosure — per course policy

Focus on narrative and business context, not full query listings — that is what the BigQuery views are for. If you have experience with visualization tools from other courses (Tableau, Looker Studio, Python, R, etc.), you are welcome to include polished charts — but this is not required. Screenshots of query output are sufficient.

Presentation

Each group gets 25–30 minutes during Week 8: ~15 min presenting + ~10 min Q&A + a few minutes for transitions. Each member presents their area (~4–5 min per person). You may present from your PDF report, slides, or any other format.

Cover: what you explored and why, your most interesting findings (with data), connections across areas, and one SQL technique you’re proud of. Focus on insights, not syntax — only show SQL when highlighting a specific technique.

During Q&A, the instructor will direct questions to individual members. Expect to explain your segmentation choices, how your results connect to another member’s area, and to walk through specific parts of your SQL.

Attendance is required. Your individual grade depends on presenting and answering Q&A in person. If you cannot attend the Week 8 session, contact the instructor beforehand to arrange a separate defense.

Checkpoint (Week 7)

Submit via the Moodle assignment (online text) by the Week 7 deadline. One submission per group. Include these five sections:

  1. Group members and roles — names, BigQuery dataset prefixes (student__*), area ownership, component ownership
  2. sessions_base SQL — paste your query. It must run without errors and include at least the columns listed in the sessions_base view section above. It does not need to be saved in your group dataset yet — a working query in a personal dataset is fine.
  3. One working analysis — paste the SQL for one view from any area, plus a 2–3 sentence summary of what you found
  4. Planned directions — for each area, 2–3 bullet points on what you plan to explore next
  5. Blockers — anything you’re stuck on or unsure about

Skipping the checkpoint costs 5 points; a superficial checkpoint (no working sessions_base) costs 2. Use it to get instructor feedback before the final push.


Grading Rubric

Note: Technique variety matters

By Week 6, the course has covered NULL handling, conditional aggregation, date functions, window functions, cohort analysis, retention tables, anomaly detection patterns. Projects that draw on this variety score well across all components. Repetitive approaches (e.g., only basic GROUP BY queries throughout) will limit your score.

Analytical Depth (11 points)

| Score | Criteria |
|---|---|
| 9–11 | Each area explored from multiple angles (time trends, segmentation, comparisons). Self-chosen dimensions reveal clear patterns. Cross-area connections are specific and data-backed (e.g., “the channels with the best funnel completion also show the highest week-1 retention”). |
| 6–8 | Well-explored with appropriate techniques. Some cross-references, though some may be surface-level. |
| 3–5 | Analysis stays at one level of aggregation. Little segmentation or cross-area connection. |
| 0–2 | Areas barely explored, or analyses are largely incorrect. |

SQL Quality (13 points)

| Score | Criteria |
|---|---|
| 11–13 | sessions_base is well-designed and complete. All queries execute correctly. CTEs used for readability. UNNEST patterns are clean. NULL and (not set) handling is explicit. Consistent style. Efficient queries (no unnecessary full-table scans). |
| 8–10 | sessions_base works but has minor gaps. Queries correct and readable. Some inconsistent NULL handling or formatting. |
| 4–7 | sessions_base incomplete or has issues. Most queries correct but have structural problems: no CTEs, unclear UNNEST logic, missing edge cases. |
| 0–3 | No sessions_base view, or most queries have errors or produce clearly wrong results. |

SQL Quality and Documentation & Presentation are group scores — every member receives the same points. Analytical Depth and Analytical Reasoning are scored individually per area — each member’s section is evaluated on its own.

Analytical Reasoning (6 points)

| Score | Criteria |
|---|---|
| 5–6 | Results compared across time, segments, or against averages. Assumptions stated. Unexpected patterns identified with plausible explanations. Report reads as one connected document, not disconnected sections. |
| 3–4 | Comparisons for most analyses, but some surface-level (“mobile is lower than desktop” without quantifying). Assumptions mostly implicit. |
| 1–2 | Minimal interpretation. Report primarily presents numbers without context. |
| 0 | No interpretation. Report is a dump of query results. |

Documentation & Presentation (5 points)

Report (2 pts): professional formatting, clear structure, data notes present, readable by a non-technical audience. Presentation (2 pts): clear delivery within time limit, all members participate, Q&A answered competently. AI disclosure (1 pt): complete per course policy.

The Data

bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*

This is a date-sharded table — each day is a separate table (e.g., events_20201101). Query all days with events_*. To restrict the date range, filter on _TABLE_SUFFIX (e.g., WHERE _TABLE_SUFFIX BETWEEN '20201101' AND '20210131'): this prunes shards and reduces bytes scanned, whereas filtering only on the event_date column still scans every shard.
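A minimal sketch of the shard-filtering pattern, counting events per day for the first week of the data:

```sql
-- Prune shards with _TABLE_SUFFIX so only the requested days are scanned.
SELECT
  event_date,
  COUNT(*) AS events
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20201101' AND '20201107'
GROUP BY event_date
ORDER BY event_date;
```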

Data period: November 1, 2020 – January 31, 2021 (92 days).

How GA4 data is structured

GA4 stores data in an event-based nested format. Each row is a single event (a page view, a click, a purchase). Details are packed into nested structures:

event row
  ├── event_name, event_date, event_timestamp     ← what happened and when (microseconds since epoch)
  ├── user_pseudo_id                               ← who did it
  ├── event_params (ARRAY<STRUCT>)                 ← key-value details about the event
  │     ├── {key: "page_location", value: {string_value: "https://..."}}
  │     ├── {key: "ga_session_id", value: {int_value: 1234567}}
  │     └── {key: "engagement_time_msec", value: {int_value: 5200}}
  ├── items (ARRAY<STRUCT>)                        ← products involved (if any)
  │     ├── {item_name: "Google Tee", price_in_usd: 18.0, ...}
  │     └── {item_name: "Google Bottle", price_in_usd: 23.0, ...}
  ├── device (STRUCT)                              ← device info
  ├── geo (STRUCT)                                 ← location
  ├── traffic_source (STRUCT)                      ← user's FIRST-TOUCH attribution
  └── ecommerce (STRUCT)                           ← transaction totals (purchase events only)

  • STRUCTs (device, geo, traffic_source, ecommerce): access with dot notation, e.g., device.category, geo.country. No UNNEST needed.
  • event_params (ARRAY<STRUCT>): a key-value bag. Each parameter has a key and a value STRUCT with four type slots (string_value, int_value, float_value, double_value) — only one is populated per key. Extract using a correlated subquery with UNNEST.
  • items (ARRAY<STRUCT>): a list of products. Flatten with UNNEST(items) AS item.
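For example, the correlated-subquery pattern for event_params looks like this sketch (limited to one day to keep scans small; note the string key uses the string_value slot and the session key uses int_value):

```sql
-- Pull page_location (string slot) and ga_session_id (int slot)
-- out of the event_params key-value bag with correlated subqueries.
SELECT
  event_name,
  (SELECT value.string_value FROM UNNEST(event_params)
   WHERE key = 'page_location') AS page_location,
  (SELECT value.int_value FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS ga_session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name = 'page_view'
  AND _TABLE_SUFFIX = '20201101'
LIMIT 10;
```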

Warning: UNNEST creates a cross join

When you UNNEST(items), each event row is multiplied by the number of items in its array. A purchase with 3 items becomes 3 rows. Always be clear about what grain you are working at — event level, item level, or session level — and use CTEs to pre-aggregate when crossing grains.
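One way to cross grains safely is to collapse the array inside a correlated subquery so each event stays a single row, as in this sketch:

```sql
-- Sketch: summarize the items array per purchase event without
-- multiplying rows. The correlated subqueries aggregate the array
-- in place, so the outer query remains at event grain.
WITH purchase_items AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    (SELECT COUNT(*) FROM UNNEST(items)) AS item_count,
    (SELECT SUM(item.price_in_usd * item.quantity)
     FROM UNNEST(items) AS item) AS items_value
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  WHERE event_name = 'purchase'
    AND _TABLE_SUFFIX BETWEEN '20201101' AND '20201107'
)
SELECT
  COUNT(*) AS purchases,
  AVG(item_count) AS avg_items_per_purchase,
  AVG(items_value) AS avg_items_value
FROM purchase_items;
```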

Data quality notes

This is real (obfuscated) data. Expect quirks — part of your job is to discover, document, and handle them. Two things worth knowing upfront:

  • traffic_source vs. event_params source/medium: traffic_source records a user’s first-touch attribution, while source/medium in event_params reflects the current session. Choose the right one for your analysis and document your reasoning.
  • event_timestamp: stored in microseconds since Unix epoch. Convert with TIMESTAMP_MICROS(event_timestamp).
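A small sketch of the timestamp conversion, deriving an hour-of-day column (EXTRACT on a TIMESTAMP returns UTC unless you pass a time zone):

```sql
-- Convert microsecond epoch values to readable timestamps.
SELECT
  event_name,
  TIMESTAMP_MICROS(event_timestamp) AS event_ts,
  EXTRACT(HOUR FROM TIMESTAMP_MICROS(event_timestamp)) AS hour_utc
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE _TABLE_SUFFIX = '20201101'
LIMIT 5;
```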