5.2 Analyze New vs Returning Website Users – Understand Your Core Audience Segments

Welcome to Phase 5 2 Understanding Your Core Audience Segments

You have successfully analyzed user journeys and identified drop off points in your funnels. That gave you valuable insights into conversion paths. Now it is time to differentiate between two fundamental groups of your audience. In this phase we will analyze the behavior of new versus returning users. Understanding the differences between these two groups is like knowing whether you are talking to a first time visitor or a loyal customer. It helps you tailor your website content marketing and overall strategy to better serve each segment.
This step is crucial for developing targeted engagement strategies and optimizing your website for different user needs.

Why Analyze New vs Returning Users

Distinguishing between new and returning users gives you unique and useful insights Here is why it matters so much

Tailor First Impressions New users need clear navigation and interesting content to keep them engaged Returning users may look for updates or deeper info

Optimize Conversion Paths Returning users usually convert more often Knowing their journey helps you make it easier for them to complete actions

Assess Marketing Effectiveness New user data shows how well your marketing brings fresh visitors Returning user data shows how good you are at keeping them coming back

Content Strategy Insights This analysis helps you find out what content attracts new visitors and what keeps returning users engaged

Personalize Experiences You can offer different content or calls to action depending on whether a visitor is new or returning

Analyzing these two segments helps you build a more effective and user centric website.

Graph showing comparison of new and returning website users' engagement metrics

Key Metrics for New vs Returning User Analysis

We will compare these core metrics for both new and returning user segments.

Total Users The number of unique users in each segment

Sessions The total number of sessions started by each segment

Page Views The total number of pages viewed by each segment

Average Session Duration (Estimate) The average time spent per session for each segment

Bounce Rate The percentage of sessions where only one page was viewed for each segment

Python Code to Analyze New vs Returning Users

We will write a Python script to fetch your event and user data from SQL Server. It will then categorize sessions as new or returning and calculate key metrics for each group.

Practical Python Code Example
Here is a basic example of the Python code you will write. This code will connect to your SQL Server database. It will fetch event and user data. It will then classify sessions and compute metrics for new and returning users.


# SCRIPT SETUP 1: Import necessary libraries
import pandas as pd
import pyodbc
from datetime import datetime, timedelta

# SCRIPT SETUP 2: Database connection details for SQL Server
# Replace with your actual server name database name and credentials
DB_CONFIG = {
    'driver': '{ODBC Driver 17 for SQL Server}',
    'server': 'YOUR_SQL_SERVER_NAME',  # e.g., 'localhost' or 'SERVER_NAME\SQLEXPRESS'
    'database': 'YOUR_DATABASE_NAME',
    'uid': 'YOUR_USERNAME',
    'pwd': 'YOUR_PASSWORD'
}

# FUNCTION 1: Connect to the database
def connect_to_db():
    """Establishes a connection to the SQL Server database."""
    conn_str = (
        f"DRIVER={DB_CONFIG['driver']};"
        f"SERVER={DB_CONFIG['server']};"
        f"DATABASE={DB_CONFIG['database']};"
        f"UID={DB_CONFIG['uid']};"
        f"PWD={DB_CONFIG['pwd']};"
    )
    try:
        conn = pyodbc.connect(conn_str)
        print("Successfully connected to SQL Server.")
        return conn
    except pyodbc.Error as ex:
        sqlstate = ex.args[0]
        print(f"Database connection error: {sqlstate}")
        print(ex)
        return None

# FUNCTION 2: Fetch event and user data
def fetch_analytics_data(conn):
    """Fetches event and user data from SQL Server."""
    query_events = """
    SELECT
        user_pseudo_id,
        session_id,
        event_timestamp,
        event_name,
        page_location
    FROM
        events
    ORDER BY
        user_pseudo_id, session_id, event_timestamp;
    """
    query_users = """
    SELECT
        user_pseudo_id,
        first_visit_timestamp
    FROM
        users;
    """
    try:
        df_events = pd.read_sql(query_events, conn)
        df_users = pd.read_sql(query_users, conn)
        print(f"Fetched {len(df_events)} event records and {len(df_users)} user records.")
        return df_events, df_users
    except Exception as e:
        print(f"Error fetching analytics data: {e}")
        return None, None

# FUNCTION 3: Analyze New vs Returning Users
def analyze_new_vs_returning(df_events, df_users):
    """Categorizes sessions as new or returning and calculates metrics for each group."""
    if df_events is None or df_events.empty or df_users is None or df_users.empty:
        print("Missing event or user data for analysis.")
        return None

    # STEP 3.1: Convert timestamps to datetime objects
    df_events['event_datetime'] = pd.to_datetime(df_events['event_timestamp'], unit='us')
    df_users['first_visit_datetime'] = pd.to_datetime(df_users['first_visit_timestamp'], unit='us')

    # STEP 3.2: Merge events with user first visit data
    df_merged = pd.merge(df_events, df_users[['user_pseudo_id', 'first_visit_datetime']], on='user_pseudo_id', how='left')

    # STEP 3.3: Classify each session as 'New' or 'Returning'
    # A session is 'New' if its start date matches the user's first visit date
    # This requires finding the first event timestamp for each session
    session_first_event_time = df_merged.groupby(['user_pseudo_id', 'session_id'])['event_datetime'].min().reset_index(name='session_start_datetime')
    df_merged = pd.merge(df_merged, session_first_event_time, on=['user_pseudo_id', 'session_id'], how='left')

    df_merged['user_type'] = df_merged.apply(
        lambda row: 'New User' if row['session_start_datetime'].date() == row['first_visit_datetime'].date() else 'Returning User',
        axis=1
    )

    # STEP 3.4: Calculate core metrics for each user type
    metrics_by_user_type = df_merged.groupby('user_type').agg(
        total_users=('user_pseudo_id', 'nunique'),
        total_sessions=('session_id', 'nunique'),
        total_page_views=('event_name', lambda x: (x == 'page_view').sum())
    ).reset_index()

    # STEP 3.5: Calculate Bounce Rate for each user type
    # Get event count per session for bounce rate calculation
    events_per_session = df_merged.groupby(['user_pseudo_id', 'session_id']).size().reset_index(name='event_count')
    df_sessions_with_counts = pd.merge(df_merged.drop_duplicates(subset=['user_pseudo_id', 'session_id', 'user_type']),
                                       events_per_session, on=['user_pseudo_id', 'session_id'], how='left')
    
    bounced_sessions_by_type = df_sessions_with_counts[
        (df_sessions_with_counts['event_count'] == 1) & (df_sessions_with_counts['event_name'] == 'page_view')
    ].groupby('user_type')['session_id'].nunique().reset_index(name='bounced_sessions')

    metrics_by_user_type = pd.merge(metrics_by_user_type, bounced_sessions_by_type, on='user_type', how='left').fillna(0)
    metrics_by_user_type['bounce_rate'] = (metrics_by_user_type['bounced_sessions'] / metrics_by_user_type['total_sessions']) * 100
    metrics_by_user_type['bounce_rate'] = metrics_by_user_type['bounce_rate'].fillna(0).round(2)

    # STEP 3.6: Calculate Average Session Duration (estimation)
    # For each session, find the min and max timestamp
    session_duration_df = df_merged.groupby(['user_pseudo_id', 'session_id', 'user_type'])['event_datetime'].agg(
        session_start='min',
        session_end='max'
    ).reset_index()
    session_duration_df['duration_seconds'] = (session_duration_df['session_end'] - session_duration_df['session_start']).dt.total_seconds()
    
    # Filter out sessions with 0 duration (single event sessions) or very short ones if desired
    # For simplicity, we'll include all for average but note that single-event sessions have 0 duration
    avg_session_duration = session_duration_df.groupby('user_type')['duration_seconds'].mean().reset_index(name='avg_session_duration_seconds')
    avg_session_duration['avg_session_duration_minutes'] = (avg_session_duration['avg_session_duration_seconds'] / 60).round(2)

    metrics_by_user_type = pd.merge(metrics_by_user_type, avg_session_duration[['user_type', 'avg_session_duration_minutes']], on='user_type', how='left').fillna(0)


    print("\n--- New vs Returning User Metrics ---")
    print(metrics_by_user_type[[
        'user_type', 'total_users', 'total_sessions', 'total_page_views', 
        'avg_session_duration_minutes', 'bounce_rate'
    ]])

    return metrics_by_user_type

# MAIN EXECUTION 1: This block runs when the script starts
if __name__ == "__main__":
    # MAIN EXECUTION 2: Connect to the database
    conn = connect_to_db()
    if conn:
        # MAIN EXECUTION 3: Fetch event and user data
        events_df, users_df = fetch_analytics_data(conn)
        
        # MAIN EXECUTION 4: Analyze new vs returning users
        if events_df is not None and users_df is not None:
            user_type_metrics = analyze_new_vs_returning(events_df, users_df)
            
            if user_type_metrics is not None:
                print("\n--- New vs Returning User Analysis Completed ---")
                # You can now use 'user_type_metrics' DataFrame for plotting or further analysis
                # For example, to save to a new CSV:
                # user_type_metrics.to_csv('E:/SankalanAnalytics/data/processed/new_vs_returning_metrics.csv', index=False)
        
        # MAIN EXECUTION 5: Close the database connection
        conn.close()
        print("Database connection closed.")
    else:
        print("Could not establish database connection. Exiting.")

Important Notes on This Code

This script connects to your SQL Server database to pull event and user data. It then uses Pandas to classify each session as either a new user session or a returning user session. A session is marked as new if its start date matches the user's first visit date. All other sessions for that user are considered returning.

The average session duration is estimated by calculating the time between the first and last event in a session. For single event sessions this duration will be zero. The bounce rate calculation is consistent with previous phases. Remember to fill in your actual SQL Server connection details in the DB CONFIG section. This includes your server name database name username and password.

Understanding Your Python New vs Returning User Analysis Script

This Python script helps you gain valuable insights by comparing how new visitors behave versus returning visitors on your website. It pulls data from your SQL Server database and performs a detailed analysis for each group. Let’s break down each part of the code.

1. Setting Up Your Tools and Connections

At the top of the script, you’ll see some import lines bringing in the tools needed for this work. First is import pandas as pd which lets us work with data tables and do calculations. Next is import pyodbc which allows Python to connect to your SQL Server database. Then from datetime import datetime, timedelta helps work with dates and time differences. After that, you’ll see DB_CONFIG where you enter your actual server name, database name, username, and password to connect to your database.

2. Connecting to Your Database

The connect_to_db function handles connecting to your database. It builds a connection string using the info in DB_CONFIG and tries to open the connection. It will print messages to confirm if the connection was successful or if there was an error.

3. Fetching Event and User Data

The fetch_analytics_data function fetches the raw event and user data from your database. It runs two SQL queries: one to get event data like user IDs, sessions, timestamps, and page views; the other to get user info including their first visit timestamp. These results are loaded into two Pandas DataFrames for analysis. The function also reports how many records it fetched and handles any errors.

4. Analyze New vs Returning Users

This core function classifies each session as belonging to a new or returning user and calculates key metrics for both groups. Here’s how it works:

Convert timestamps: Changes numeric timestamps into readable dates and times.
Merge data: Combines event data with each user’s first visit date.
Classify sessions: Marks sessions as 'New User' if they happen on the user’s first visit date; otherwise 'Returning User'.
Calculate metrics: Counts total users, sessions, and page views per user type.
Calculate bounce rate: Measures the percentage of single-page sessions per user type.
Estimate session duration: Finds average session time by calculating the difference between the first and last event timestamps in each session.

The function then prints a clear summary table of these metrics for new and returning users.

5. Running the Script The Main Block

This part runs when you start the Python file. It connects to your database, fetches the event and user data, analyzes new vs returning users, prints the results, and finally closes the database connection to free resources.

Overall Value of This Script

This Python script is a valuable tool for tailoring your website strategy. By understanding the different behaviors of new and returning visitors, you can create more effective content and personalized experiences. For example, you might provide clear guidance for first-time visitors and special offers or updates for returning ones. This shows your ability to do advanced audience segmentation — a key skill in web analytics and user personalization.

Next Steps

After running this Python script, you’ll have a clear comparison of how new and returning users interact with your website. This means you’ve successfully analyzed two important audience groups. The next exciting phase will be to analyze time spent and scroll depth, giving you even deeper insights into user engagement page by page. For now, save this script in your E drive under SankalanAnalytics backend folder and name it something like analyze_user_types.py.