SQL Joins | Website Analytics

6.2 Master SQL Joins for Deeper Website Analytics Insights

Welcome to Phase 6.2 Linking Your Data for Deeper Insights

You have successfully learned to write SQL queries to extract direct business insights from single tables. That is a great start. Now it is time to unleash the full power of relational databases. In this phase we will master SQL JOINs. JOINs allow you to combine data from two or more tables based on related columns. This is essential because your web analytics data is spread across different tables like 'events' and 'users'. Think of JOINs as the glue that connects different pieces of information about your users and their actions. It lets you see the complete picture.
This step is crucial for gaining richer insights by linking user characteristics with their behavior.


Why SQL Joins are Essential for Web Analytics

Your ‘events’ table holds every interaction. Your 'users' table holds unique user details. To get a complete view you need to combine them. Here is why JOINs are so valuable.

• Enriched Data Analysis Combine event data with user demographics or first visit details. This helps you understand how different types of users behave.
• Holistic User View See a user's entire journey including their initial source and device alongside every page they visited.
• Complex Querying Answer questions that span multiple data points. For example what is the average session duration for users who first came from a specific campaign?
• Data Normalization Good database design keeps data in separate tables to avoid repetition. JOINs bring this normalized data back together for analysis.
• Foundation for Advanced Metrics Many advanced metrics require combining data from different tables. JOINs are the building blocks for these calculations.

JOINs transform isolated data points into a connected web of information. This enables more sophisticated analysis.


Visual showing SQL JOINs linking website events and users data tables

Understanding Types of SQL Joins

There are several types of JOINs. For web analytics the most common and useful are INNER JOIN and LEFT JOIN.

• INNER JOIN Returns only the rows that have matching values in both tables. If a user has no events or an event has no matching user it is excluded. This is good when you only want complete matches.

• LEFT JOIN (or LEFT OUTER JOIN) Returns all rows from the left table and the matching rows from the right table. If there is no match in the right table the columns from the right table will have NULL values. This is useful when you want to keep all events and add user details if available.


Practical SQL Code Examples Using Joins

Here are some SQL queries demonstrating how to use JOINs to combine data from your 'events' and 'users' tables. These queries will help you answer more complex business questions.

1. Events with User's First Visit Details
This query combines every event with information about when the user first visited and their initial traffic source.

-- Query 1: Events with User's First Visit Details SELECT e.event_id, e.event_name, CAST(DATEADD(s, e.event_timestamp / 1000000, '1970-01-01') AS DATETIME) AS event_datetime, e.page_location, e.session_id, e.user_pseudo_id, CAST(DATEADD(s, u.first_visit_timestamp / 1000000, '1970-01-01') AS DATETIME) AS user_first_visit_datetime, u.first_traffic_source, u.first_traffic_medium FROM events e INNER JOIN users u ON e.user_pseudo_id = u.user_pseudo_id ORDER BY e.event_timestamp DESC;

Explanation

• 'FROM events e INNER JOIN users u ON e.user_pseudo_id = u.user_pseudo_id': This performs an INNER JOIN between the 'events' table (aliased as 'e') and the 'users' table (aliased as 'u'). The join condition 'ON e.user_pseudo_id = u.user_pseudo_id' links events to their corresponding user based on the common user ID.
• 'SELECT e.event_id, ..., u.first_traffic_medium': We select columns from both tables, using their aliases ('e.' or 'u.') to specify which table the column comes from.
• 'CAST(DATEADD(s, ...))': Converts Unix microseconds timestamps to readable datetimes.


2. Bounce Rate by User's First Traffic Source
This query helps you understand if users acquired from certain channels tend to bounce more or less frequently on their initial visit.

-- Query 2: Bounce Rate by User's First Traffic Source (for their first session) WITH FirstSessionEvents AS ( SELECT e.user_pseudo_id, e.session_id, e.event_timestamp, e.event_name, u.first_traffic_source, u.first_traffic_medium, ROW_NUMBER() OVER (PARTITION BY e.user_pseudo_id, e.session_id ORDER BY e.event_timestamp) as rn FROM events e JOIN users u ON e.user_pseudo_id = u.user_pseudo_id WHERE CAST(DATEADD(s, e.event_timestamp / 1000000, '1970-01-01') AS DATE) = CAST(DATEADD(s, u.first_visit_timestamp / 1000000, '1970-01-01') AS DATE) ), SessionDetails AS ( SELECT user_pseudo_id, session_id, first_traffic_source, first_traffic_medium, COUNT(*) AS event_count, MAX(CASE WHEN event_name = 'page_view' THEN 1 ELSE 0 END) AS has_page_view FROM FirstSessionEvents GROUP BY user_pseudo_id, session_id, first_traffic_source, first_traffic_medium ) SELECT sd.first_traffic_source, sd.first_traffic_medium, COUNT(sd.session_id) AS total_first_sessions, SUM(CASE WHEN sd.event_count = 1 AND sd.has_page_view = 1 THEN 1 ELSE 0 END) AS bounced_first_sessions, CAST(SUM(CASE WHEN sd.event_count = 1 AND sd.has_page_view = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(sd.session_id) AS DECIMAL(5,2)) AS bounce_rate_percent FROM SessionDetails sd GROUP BY sd.first_traffic_source, sd.first_traffic_medium ORDER BY total_first_sessions DESC;

Explanation:
'WITH FirstSessionEvents AS (...)': This CTE identifies events that belong to a user's first session. It joins 'events' and 'users' and filters for events whose date matches the 'first_visit_timestamp'.

'ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)': This is used to order events within each session, though not strictly used in the final bounce calculation here, it's a common pattern for sequential analysis.

'SessionDetails AS (...)': This CTE then aggregates the 'FirstSessionEvents' to count the total 'event_count' for each of these first sessions and determines if a page view occurred.

The final 'SELECT' statement then calculates 'total_first_sessions', 'bounced_first_sessions' (sessions with only one page view event), and the 'bounce_rate_percent' for each 'first_traffic_source' and 'first_traffic_medium'.


3. Average Session Duration by User's First Device Category
See if users who first visited on mobile stay longer than those who first visited on desktop.

-- Query 3: Average Session Duration by User's First Device Category WITH SessionDurations AS ( SELECT e.user_pseudo_id, e.session_id, u.device_category AS first_device_category, MIN(e.event_timestamp) AS session_start_timestamp, MAX(e.event_timestamp) AS session_end_timestamp FROM events e JOIN users u ON e.user_pseudo_id = u.user_pseudo_id GROUP BY e.user_pseudo_id, e.session_id, u.device_category ) SELECT sd.first_device_category, COUNT(sd.session_id) AS total_sessions, AVG(CAST((sd.session_end_timestamp - sd.session_start_timestamp) AS DECIMAL(18,2)) / 1000000.0) AS avg_session_duration_seconds FROM SessionDurations sd WHERE sd.first_device_category IS NOT NULL AND sd.first_device_category != '' GROUP BY sd.first_device_category ORDER BY total_sessions DESC;

Explanation:
'WITH SessionDurations AS (...)': This CTE first finds the 'session_start_timestamp' (minimum 'event_timestamp') and 'session_end_timestamp' (maximum 'event_timestamp') for each session, and links it to the user's 'first_device_category'.

'AVG(CAST((sd.session_end_timestamp - sd.session_start_timestamp) AS DECIMAL(18,2)) / 1000000.0)': Calculates the duration of each session in microseconds, converts it to seconds, and then averages it by 'first_device_category'. We use 'CAST' to ensure floating-point division.

'WHERE sd.first_device_category IS NOT NULL AND sd.first_device_category != ''' : Filters for valid device categories.

'GROUP BY sd.first_device_category': Aggregates the results for each initial device type.


Important Notes on SQL Joins:
When using JOINs, always specify the join condition correctly using the 'ON' clause. This links rows between tables. Use table aliases like 'e' for 'events' and 'u' for 'users' to make your queries shorter and easier to read. The choice between 'INNER JOIN' and 'LEFT JOIN' depends on whether you want to include all records from one table even if there is no match in the other. For performance, ensure that the columns used in your 'ON' clauses (e.g. 'user_pseudo_id') are indexed in both tables. This will make your queries run much faster, especially with large datasets.


Overall Value of Combining Data with SQL Joins
Mastering SQL JOINs is a significant leap in your data analytics capabilities. It allows you to integrate disparate pieces of information into a cohesive view. This enables you to answer complex business questions that require data from multiple sources. It is a fundamental skill for building robust data pipelines and sophisticated reporting solutions. This skill is highly sought after in the data industry. It demonstrates your ability to work with relational databases effectively.


Important Notes on SQL Queries:

These queries are written for SQL Server. Syntax might vary slightly for other database systems like PostgreSQL or MySQL. Always ensure your database has appropriate indexes on columns like 'user_pseudo_id' 'session_id' 'event_timestamp' and 'event_name' for optimal query performance. The 'event_timestamp' is assumed to be in Unix microseconds as exported from GA4. Adjust the division by 1000000 if your timestamps are in milliseconds or seconds. Understanding your data schema is crucial for writing effective SQL queries. Refer back to Phase 2.3 if needed.


Next Step:
You have successfully learned to combine data from multiple tables using SQL JOINs. This means you’re now good at bringing related information together. The next exciting step will be to learn how to find problematic pages using SQL. This will build on your knowledge of joins and aggregations to help you spot content or design issues right inside your database.
For now, make sure you save any SQL queries you write in your E drive, inside the SankalanAnalytics database folder. You can save them as '.sql' files for easy future access.


Previous Topic==> Queries for Website Insights  Next Topics==> Bounce Pages Website Data


SQL All topics Wise Interview Questions    Employee Salary Management SQL FAQ!.   C FAQ   Top 25 PL/SQL Interview Questions

Joins With Group by Having  Equi Join  Joins with Subqueries  Self Join  Outer Join 


Website Analytics Project: Phases and Action Steps

  • Home
  • 🟢 Live App: Web Analytics Simulator
  • Phase 0: Project Setup & Problem Definition
  • 0.1 Define Project Goals & Challenges
  • 0.2 Select Tools, Tech Stack & Data Sources
  • 0.3 Software Requirements & Installation
  • 0.4 Folder Structure & GitHub Repo
  • 0.5 Testing Project Locally
  • Phase 1: Planning for Analytics
  • 1.1 Website Analytics Project Overview
  • 1.2 Define KPIs, Bounce Rate, Engagement
  • 1.3 Identify Target Users & Pain Points
  • Phase 2: Data Collection
  • 2.1 Setup Google Analytics 4 (GA4)
  • 2.2 Export GA4 Data to BigQuery/CSV
  • 2.3 Design SQL Schema for Web Analytics
  • Phase 3: Data Cleaning & Feature Engineering
  • 3.1 Clean Website Data with Python & Pandas
  • 3.2 Create Custom Metrics (Session, Bounce, etc.)
  • Phase 4: Exploratory Data Analysis (EDA)
  • 4.1 Analyze Website Traffic Trends
  • 4.2 Behavior by Device, Source, Location
  • 4.3 Top Pages & High Bounce Pages
  • 4.4 Diagnose Low Traffic & User Drop
  • Phase 5: Business Insights
  • 5.1 Funnel Analysis & Drop-Off Points
  • 5.2 New vs Returning Users
  • 5.3 Time Spent & Scroll Depth
  • Phase 6: SQL for Business
  • 6.1 SQL for Business Insights
  • 6.2 Combine Web Data Using SQL
  • 6.3 Find Problematic Pages Using SQL
  • Phase 7: Machine Learning
  • 7.1 Segment Users with Clustering
  • 7.2 Predict Bounce Rate with ML
  • 7.3 Recommend Pages or Content
  • Phase 8: Dashboards & Visualization
  • 8.1 Dashboard with Streamlit
  • 8.2 Visualize KPIs with Python
  • 8.3 Page-Level Metrics & Drop Heatmaps
  • Phase 9: Final Analytics Story
  • 9.1 Summary Report & Findings
  • Phase 10: Hosting & Portfolio Presentation
  • 10.1 Host Website Project Online
  • 10.2 Add to GitHub with ReadMe
  • 10.3 Pitch Project in Interview
  • Other Topics
  • SQL Interview Questions
  • SQL Case Study: Account Management
  • Python Interview Questions
  • Why C Language

Get in touch

  • tech2dsm@gmail.com

© Sankalan Data Tech. All rights reserved.