7.1 Segment Users with Clustering for Smarter Targeting

Welcome to Phase 7.1 Grouping Your Audience with Machine Learning

You have successfully completed the SQL for Business phase. You are now proficient in extracting direct insights and combining data using SQL. That is a solid foundation. Now it is time to step into the exciting world of Machine Learning. In this phase we will learn how to segment your website users into distinct groups using clustering algorithms. Think of this as automatically finding natural categories within your audience. This is based on their behavior. It helps you move beyond simple new versus returning user analysis. It allows for much more targeted strategies.
This step is crucial for personalizing user experiences and optimizing marketing efforts.

Why User Segmentation with Clustering is Essential

Understanding different user groups is key to effective website management. Here is why clustering is so valuable.

• Personalized Experiences Deliver content and features tailored to specific user needs. This increases engagement and satisfaction.
• Targeted Marketing Create highly effective marketing campaigns. These campaigns speak directly to the interests and behaviors of each segment.
• Improved Conversion Rates By understanding what each segment values you can optimize conversion paths for each group.
• Resource Allocation Focus your development and marketing budget on the segments that offer the most potential.
• Anomaly Detection Identify unusual user groups. This could signal new trends or potential issues.

Clustering transforms your broad audience into manageable, actionable segments. This allows for more precise and impactful strategies.

Website user segmentation with K‑Means clustering using Python

Key Concepts for User Segmentation

Clustering: This is a machine learning method that groups similar users together without using any predefined labels or categories. It helps you find natural patterns in your data.

K-Means Algorithm: A popular clustering technique that divides users into K groups. Each user is placed in the group where their behavior is closest to the group average, also known as the centroid.

Features for Clustering: These are the behavior-based numbers that describe each user and are used by the algorithm to compare users. Examples include:
– Total Sessions per User
– Total Page Views per User
– Average Session Duration per User
– Bounce Rate per User
– Number of Unique Pages Visited per User
– Number of Specific Events (e.g., clicks, form submissions) per User

Feature Scaling: This step makes sure all user behavior numbers are on the same scale. It’s important because larger numbers can unfairly influence clustering results.

Elbow Method: A simple way to decide how many clusters to create. It shows a chart where you look for a “bend” or “elbow,” which usually points to the best number of groups.

Python Code for User Segmentation with K-Means Clustering

We will write a Python script to fetch aggregated user behavior data from SQL Server. It will then apply K-Means clustering to segment your users. Finally it will analyze the characteristics of each cluster.

1.Practical Python Code Example
Here is a basic example of the Python code you will write. This code will connect to your SQL Server database. It will fetch user data. It will then prepare this data for clustering. It will apply the K-Means algorithm and print the characteristics of the discovered user segments.



# SCRIPT SETUP 1: Import necessary libraries
import pandas as pd
import pyodbc
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# SCRIPT SETUP 2: Database connection details for SQL Server
# Replace with your actual server name database name and credentials
DB_CONFIG = {
    'driver': '{ODBC Driver 17 for SQL Server}',
    'server': 'YOUR_SQL_SERVER_NAME',  # e.g., 'localhost' or 'SERVER_NAME\SQLEXPRESS'
    'database': 'YOUR_DATABASE_NAME',
    'uid': 'YOUR_USERNAME',
    'pwd': 'YOUR_PASSWORD'
}

# FUNCTION 1: Connect to the database
def connect_to_db():
    """Establishes a connection to the SQL Server database."""
    conn_str = (
        f"DRIVER={DB_CONFIG['driver']};"
        f"SERVER={DB_CONFIG['server']};"
        f"DATABASE={DB_CONFIG['database']};"
        f"UID={DB_CONFIG['uid']};"
        f"PWD={DB_CONFIG['pwd']};"
    )
    try:
        conn = pyodbc.connect(conn_str)
        print("Successfully connected to SQL Server.")
        return conn
    except pyodbc.Error as ex:
        sqlstate = ex.args[0]
        print(f"Database connection error: {sqlstate}")
        print(ex)
        return None

# FUNCTION 2: Fetch aggregated user behavior data
def fetch_user_behavior_data(conn):
    """
    Fetches aggregated user behavior data for clustering.
    This query calculates total sessions, page views, and estimated time on site per user.
    """
    query = """
    WITH UserSessionMetrics AS (
        SELECT
            e.user_pseudo_id,
            e.session_id,
            MIN(e.event_timestamp) AS session_start_timestamp,
            MAX(e.event_timestamp) AS session_end_timestamp,
            COUNT(DISTINCT e.page_location) AS unique_pages_visited_in_session,
            COUNT(e.event_name) AS total_events_in_session,
            MAX(CASE WHEN e.event_name = 'page_view' THEN 1 ELSE 0 END) AS has_page_view
        FROM
            events e
        GROUP BY
            e.user_pseudo_id, e.session_id
    )
    SELECT
        usm.user_pseudo_id,
        COUNT(usm.session_id) AS total_sessions,
        SUM(CASE WHEN usm.has_page_view = 1 THEN 1 ELSE 0 END) AS total_page_views, -- Simplified, ideally sum of actual page_view events
        AVG(CAST((usm.session_end_timestamp - usm.session_start_timestamp) AS DECIMAL(18,2)) / 1000000.0) AS avg_session_duration_seconds,
        SUM(CASE WHEN usm.total_events_in_session = 1 AND usm.has_page_view = 1 THEN 1 ELSE 0 END) AS bounced_sessions,
        SUM(usm.unique_pages_visited_in_session) AS total_unique_pages_visited -- Sum of unique pages across all sessions for a user
    FROM
        UserSessionMetrics usm
    GROUP BY
        usm.user_pseudo_id
    HAVING
        COUNT(usm.session_id) > 0; -- Ensure users have at least one session
    """
    try:
        df = pd.read_sql(query, conn)
        print(f"Fetched {len(df)} user behavior records for clustering.")
        return df
    except Exception as e:
        print(f"Error fetching user behavior data: {e}")
        return None

# FUNCTION 3: Perform User Clustering
def perform_user_clustering(df_user_data, n_clusters=3):
    """
    Performs K-Means clustering on user behavior data.
    """
    if df_user_data is None or df_user_data.empty:
        print("No user data to perform clustering.")
        return None

    # STEP 3.1: Select features for clustering
    # Ensure all selected features are numeric. Handle potential NaNs by filling with 0 or mean.
    features = [
        'total_sessions',
        'total_page_views',
        'avg_session_duration_seconds',
        'bounced_sessions',
        'total_unique_pages_visited'
    ]
    
    # Fill any NaN values in features, e.g., with 0 or mean
    df_user_data[features] = df_user_data[features].fillna(0)

    X = df_user_data[features]

    # STEP 3.2: Scale the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    print("Features scaled successfully.")

    # STEP 3.3: Determine optimal number of clusters (Elbow Method - for visualization)
    # This part is for demonstrating how to find optimal K, not for automatic selection
    # It's good practice to run this separately and then decide on n_clusters
    wcss = [] # Within-cluster sum of squares
    max_k = min(10, len(X)) # Don't try more clusters than data points
    if max_k > 1:
        for i in range(1, max_k):
            kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
            kmeans.fit(X_scaled)
            wcss.append(kmeans.inertia_)
        
        plt.figure(figsize=(8, 5))
        plt.plot(range(1, max_k), wcss, marker='o', linestyle='--')
        plt.title('Elbow Method for Optimal K')
        plt.xlabel('Number of Clusters (K)')
        plt.ylabel('WCSS')
        plt.grid(True)
        plt.show()
        print("Elbow method plot displayed. Choose optimal K based on the plot.")
    else:
        print("Not enough data points to perform Elbow Method for optimal K.")

    # STEP 3.4: Apply K-Means clustering with chosen n_clusters
    # Here, n_clusters is passed as an argument, you would typically choose this after Elbow Method
    if len(X) < n_clusters:
        print(f"Cannot create {n_clusters} clusters with only {len(X)} data points. Adjust n_clusters.")
        return None

    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42, n_init=10)
    df_user_data['cluster'] = kmeans.fit_predict(X_scaled)
    print(f"Clustering completed with {n_clusters} clusters.")

    # STEP 3.5: Analyze cluster characteristics
    cluster_summary = df_user_data.groupby('cluster')[features].mean()
    print("\n--- Cluster Characteristics (Average Feature Values) ---")
    print(cluster_summary)

    # Optional: Count users in each cluster
    cluster_counts = df_user_data['cluster'].value_counts().sort_index()
    print("\n--- Number of Users per Cluster ---")
    print(cluster_counts)

    return df_user_data, cluster_summary, scaler # Return scaler for potential inverse transform if needed

# MAIN EXECUTION 1: This block runs when the script starts
if __name__ == "__main__":
    # MAIN EXECUTION 2: Connect to the database
    conn = connect_to_db()
    if conn:
        # MAIN EXECUTION 3: Fetch user behavior data
        user_data_df = fetch_user_behavior_data(conn)
        
        # MAIN EXECUTION 4: Perform user clustering
        if user_data_df is not None and not user_data_df.empty:
            # You might adjust n_clusters based on the Elbow Method output
            final_df, summary, data_scaler = perform_user_clustering(user_data_df, n_clusters=3) 
            
            if final_df is not None:
                print("\n--- User Segmentation Analysis Completed ---")
                # You can now use 'final_df' (with 'cluster' column) and 'summary' for further analysis,
                # e.g., saving to CSV or visualizing with more complex plots.
                # final_df.to_csv('E:/SankalanAnalytics/data/processed/user_segments.csv', index=False)
        
        # MAIN EXECUTION 5: Close the database connection
        conn.close()
        print("Database connection closed.")
    else:
        print("Could not establish database connection. Exiting.")

Important Notes on This Code:
This script connects to your SQL Server database to pull aggregated user behavior data. It then uses scikit learn to perform K-Means clustering. The 'fetch_user_behavior_data' query aggregates data to get per user metrics like total sessions, total page views, average session duration and bounced sessions. The 'perform_user_clustering' function first scales these features. This is crucial for K-Means as it relies on distances. It then applies K-Means. The Elbow Method is included as a visualization to help you decide the optimal number of clusters K.

You will need to manually adjust the 'n_clusters' parameter in the 'perform_user_clustering' call based on the Elbow Method plot. The script prints the average values of features for each cluster. This helps you understand the characteristics of each user segment. Remember to fill in your actual SQL Server connection details in the DB CONFIG section. This includes your server name, database name, username and password.

Understanding Your Python User Segmentation Script
This Python script is your first step into applying machine learning to your web analytics data. It helps you automatically discover natural groups of users based on their behavior. Let us break down each part of the code.

1. Setting Up Your Tools and Connections (Refer to SCRIPT SETUP 1, 2 in code)
At the very top of the script you see some import lines. These lines bring in the necessary tools for our work.
• import pandas as pd (SCRIPT SETUP 1) This brings in Pandas. We use it to work with data in a table like format and perform calculations.
• import pyodbc (SCRIPT SETUP 1) This lets Python talk to your SQL Server database.
• from sklearn.preprocessing import StandardScaler (SCRIPT SETUP 1) This is from scikit learn. It helps us scale our data.
• from sklearn.cluster import KMeans (SCRIPT SETUP 1) This is also from scikit learn. It provides the K-Means clustering algorithm.
• import matplotlib.pyplot as plt and import seaborn as sns (SCRIPT SETUP 1) These are for plotting. They help visualize the Elbow Method.

Next you see DB_CONFIG.
• DB_CONFIG (SCRIPT SETUP 2) This holds all the details for connecting to your SQL Server. You need to update YOUR SQL SERVER NAME YOUR DATABASE NAME YOUR USERNAME and YOUR PASSWORD with your actual information.

2. Connecting to Your Database connect_to_db (Refer to FUNCTION 1 in code)
The connect_to_db function is responsible for making the connection to your database.
• What it does It tries to open a link to your SQL Server using the details from DB_CONFIG.
• How it works It builds a connection string. This string helps pyodbc find and log in to your database. It then attempts the connection.
• Safety check It prints messages to let you know if the connection was successful or if there was an error.

3. Fetch Aggregated User Behavior Data fetch_user_behavior_data (Refer to FUNCTION 2 in code)
This function gets the user level data that we will use for clustering.
• What it does It runs a complex SQL query. This query aggregates event data to calculate metrics for each unique user. These metrics include total sessions total page views average session duration bounced sessions and total unique pages visited.
• How it works It uses a Common Table Expression CTE to first get session level details. Then it aggregates these session details to get user level metrics. It uses 'pd.read_sql' to pull this aggregated data into a Pandas DataFrame.
• Safety check It prints how many user records were fetched. If there is an error it prints a message.

4. Perform User Clustering perform_user_clustering (Refer to FUNCTION 3 and its internal steps 3.1 to 3.5 in code)
This is the core function where the machine learning magic happens. It takes the aggregated user data and applies clustering.
• What it does It prepares the data for clustering. It determines a good number of clusters. It then applies the K-Means algorithm to group users. Finally it summarizes what each group looks like.
• How it works
o Step 3.1 Select features for clustering It chooses the specific columns from your user data that will be used by the clustering algorithm. These are numerical features that describe user behavior. It also handles any missing values by filling them with zero.
o Step 3.2 Scale the features It uses `StandardScaler` to transform the feature data. This is important because K-Means is sensitive to the scale of your features. Scaling ensures all features contribute equally to the distance calculations.
o Step 3.3 Determine optimal number of clusters Elbow Method This part calculates the Within-Cluster Sum of Squares WCSS for different numbers of clusters. It then plots these values. You look for the "elbow" point on the graph. This point suggests the optimal K where adding more clusters does not give much more benefit. This step is for visual guidance not automatic selection.
o Step 3.4 Apply K-Means clustering with chosen n_clusters After you decide on the number of clusters K it runs the K-Means algorithm. It assigns a cluster label to each user.
o Step 3.5 Analyze cluster characteristics It groups the original user data by the newly assigned cluster labels. Then it calculates the average value for each feature within each cluster. This summary helps you understand what defines each user segment. For example Cluster 1 might be "High Engaged Users" and Cluster 2 might be "Low Engagement Visitors."
• Output It prints the average feature values for each cluster and the number of users in each cluster.

5. Running the Script The Main Block (Refer to MAIN EXECUTION 1 to 5 in code)

This part of the script puts everything into action when you run the Python file.
• MAIN EXECUTION 1 This line ensures the code inside this block only runs when you directly start this Python file.
• MAIN EXECUTION 2 Connect to the database It calls the 'connect_to_db' function to establish your database connection. If it fails the script stops.
• MAIN EXECUTION 3 Fetch user behavior data If the connection is good it calls 'fetch_user_behavior_data' to get your aggregated user data.
• MAIN EXECUTION 4 Perform user clustering If user data is fetched successfully it then calls 'perform_user_clustering' to apply the clustering algorithm. Remember to adjust 'n_clusters' based on your Elbow Method analysis.
• MAIN EXECUTION 5 Close the database connection Finally it closes the database connection. This is a very important good practice to free up resources.

Overall Value of User Segmentation with Clustering
User segmentation using clustering is a powerful machine learning technique for web analytics. It allows you to automatically identify distinct groups within your audience based on their actual behavior. This leads to much more effective and personalized marketing content and website optimization strategies. Instead of a one size fits all approach you can tailor your efforts to the unique needs and preferences of each user segment. This demonstrates your ability to apply advanced analytics for strategic business impact. This is a vital skill in modern data science and marketing analytics.

Next Steps:

You have successfully segmented your users into distinct groups using K-Means clustering. This means you are now proficient in applying unsupervised machine learning for audience understanding. The next exciting phase will be to predict bounce rate with machine learning. This will involve building a predictive model to identify sessions or users likely to bounce. This will allow for proactive interventions.
For now make sure you save this Python script in your E drive SankalanAnalytics backend folder. Name it something like 'user_segmentation.py'.