7.2 Predict Bounce Rate Using Machine Learning

Welcome to Phase 7.2: Proactive Engagement with Predictive Analytics

You have successfully segmented your users into distinct groups using clustering, which is a powerful way to understand your audience. Now it is time to move from understanding to predicting. In this phase we will learn how to predict website bounces using machine learning. Think of this as building a crystal ball for your website: it tells you which sessions are most likely to end in a bounce, so you can try to re-engage users before they leave.
This step is crucial for implementing proactive strategies to improve user engagement and reduce abandonment.

Why Predicting Bounce Rate with Machine Learning Is Essential

Predicting bounce rate allows for a shift from reactive analysis to proactive optimization. Here is why it is so valuable.

• Proactive Intervention: Identify sessions at high risk of bouncing in real time or near real time. This allows you to trigger personalized messages, pop-ups, or content recommendations.
• Optimize User Experience: By understanding the factors that lead to a bounce, you can refine your website design, content, and navigation.
• Resource Allocation: Focus your efforts on improving the experience for users who are most likely to leave.
• Personalization: Deliver dynamic content or offers to users who are predicted to bounce. This can help re-engage them.
• Early Warning System: Detect potential issues before they significantly impact your overall bounce rate.

Predictive analytics for bounce rate empowers you to actively shape user behavior and improve website performance.

Predicting which sessions will bounce using a machine learning model

Key Concepts for Bounce Rate Prediction

Supervised Learning: This is a type of machine learning where the model learns from labeled data. In our case, the labels indicate whether a session bounced or not.

Classification: Bounce rate prediction is a classification problem. The model classifies each session into one of two categories: ‘bounced’ or ‘not bounced.’

Features (Independent Variables): These are the input variables the model uses to make predictions. Examples include Session Duration (seconds), Number of Events in Session, Device Category (e.g., mobile, desktop), Traffic Source (e.g., organic, direct), Page Location (landing page), Time of Day, and Day of Week.

Target Variable (Dependent Variable): This is what the model is trying to predict. In bounce rate prediction, the target is a binary variable: 1 for a bounced session, 0 for a non-bounced session.

Model Training: This involves feeding historical session data into a machine learning algorithm so it can learn the relationship between the features and the bounce behavior.

Model Evaluation: After training, the model’s performance is evaluated using key metrics. Accuracy: The proportion of correctly predicted sessions. Precision: Of all sessions predicted to bounce, how many actually bounced. Recall: Of all sessions that actually bounced, how many did the model correctly identify. F1 Score: The harmonic mean of precision and recall, offering a balanced view of performance.
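To make these four metrics concrete, here is a small sketch that computes them by hand from a list of true labels and predicted labels. The function name classification_metrics and the sample labels are made up for illustration; the main script relies on Scikit-learn's accuracy_score and classification_report instead.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = bounced)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bounced, predicted bounced
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayed, predicted stayed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # stayed, predicted bounced
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bounced, predicted stayed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for eight sessions (1 = bounced, 0 = not bounced)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy sample there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics come out at 0.75.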

Logistic Regression: A widely used algorithm for binary classification problems like bounce prediction. It estimates the probability of a session bouncing.
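As a rough sketch of what "estimates the probability" means, Logistic Regression computes a weighted sum of the features and passes it through the sigmoid function to get a value between 0 and 1. The weights and bias below are invented for illustration only; a trained model learns its own values from the data.

```python
import math


def bounce_probability(features, weights, bias):
    """Weighted sum of the features squashed into (0, 1) by the sigmoid function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two already-scaled features
weights = [1.0, -2.0]
bias = 0.0

neutral = bounce_probability([0.0, 0.0], weights, bias)      # z = 0
higher_risk = bounce_probability([1.0, 0.0], weights, bias)  # z = 1
lower_risk = bounce_probability([0.0, 1.0], weights, bias)   # z = -2
print(neutral, higher_risk, lower_risk)
```

A weighted sum of zero maps to a probability of exactly 0.5. Positive sums push the probability toward 1 and negative sums push it toward 0.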

Feature Engineering: The process of creating new features from existing ones to improve model accuracy. For example, extracting the day of the week from a timestamp.
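Here is a minimal sketch of that example using Pandas. It assumes the session start timestamp is stored as microseconds since the Unix epoch in UTC; check how your own events table stores it. The column names day_of_week and hour_of_day are illustrative and are not part of the main script.

```python
import pandas as pd

# Assumption: timestamps are microseconds since the Unix epoch (UTC)
base_seconds = 1_704_067_200  # 2024-01-01 00:00:00 UTC, a Monday

sessions = pd.DataFrame({
    "session_id": ["s1", "s2"],
    "session_start_timestamp": [
        base_seconds * 1_000_000,                             # Monday 00:00
        (base_seconds + 5 * 86_400 + 14 * 3_600) * 1_000_000,  # Saturday 14:00
    ],
})

# Convert the raw integers into timezone-aware datetimes
start = pd.to_datetime(sessions["session_start_timestamp"], unit="us", utc=True)

# Derive new features: Monday is 0 and Sunday is 6
sessions["day_of_week"] = start.dt.dayofweek
sessions["hour_of_day"] = start.dt.hour
print(sessions[["session_id", "day_of_week", "hour_of_day"]])
```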

One-Hot Encoding: A technique for converting categorical variables (like device type or traffic source) into numerical format so that machine learning models can process them effectively.
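The idea can be seen quickly with Pandas. This sketch uses pd.get_dummies on a made-up device_category column; the main script uses Scikit-learn's OneHotEncoder instead, which produces the same kind of 0/1 columns but can be reused inside a pipeline.

```python
import pandas as pd

# Hypothetical sessions with one categorical column
df = pd.DataFrame({"device_category": ["mobile", "desktop", "mobile", "tablet"]})

# Each distinct category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["device_category"], dtype=int)
print(encoded)
```

Each row has a 1 in exactly one of the new columns, the one matching its original category.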

Python Code for Bounce Rate Prediction with Machine Learning

We will write a Python script that fetches session-level data from SQL Server, prepares it for machine learning, trains a classification model, and evaluates the model's ability to predict bounces.

1. Practical Python Code Example
Here is a basic example of the Python code you will write. It connects to your SQL Server database, fetches session data, prepares it, trains a Logistic Regression model, and then predicts bounces and evaluates the model's performance.



# SCRIPT SETUP 1: Import necessary libraries
import pandas as pd
import pyodbc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# SCRIPT SETUP 2: Database connection details for SQL Server
# Replace with your actual server name database name and credentials
DB_CONFIG = {
    'driver': '{ODBC Driver 17 for SQL Server}',
    'server': 'YOUR_SQL_SERVER_NAME',  # e.g., 'localhost' or 'SERVER_NAME\SQLEXPRESS'
    'database': 'YOUR_DATABASE_NAME',
    'uid': 'YOUR_USERNAME',
    'pwd': 'YOUR_PASSWORD'
}

# FUNCTION 1: Connect to the database
def connect_to_db():
    """Establishes a connection to the SQL Server database."""
    conn_str = (
        f"DRIVER={DB_CONFIG['driver']};"
        f"SERVER={DB_CONFIG['server']};"
        f"DATABASE={DB_CONFIG['database']};"
        f"UID={DB_CONFIG['uid']};"
        f"PWD={DB_CONFIG['pwd']};"
    )
    try:
        conn = pyodbc.connect(conn_str)
        print("Successfully connected to SQL Server.")
        return conn
    except pyodbc.Error as ex:
        sqlstate = ex.args[0]
        print(f"Database connection error: {sqlstate}")
        print(ex)
        return None

# FUNCTION 2: Fetch session-level data for bounce prediction
def fetch_session_data(conn):
    """
    Fetches aggregated session-level data for bounce prediction.
    Calculates features and the 'is_bounced' target variable.
    """
    query = """
    WITH SessionAgg AS (
        SELECT
            session_id,
            user_pseudo_id,
            MIN(event_timestamp) AS session_start_timestamp,
            MAX(event_timestamp) AS session_end_timestamp,
            COUNT(*) AS total_events_in_session,
            -- SUM counts only page_view rows; COUNT would count every row because ELSE 0 is non-NULL
            SUM(CASE WHEN event_name = 'page_view' THEN 1 ELSE 0 END) AS total_page_views_in_session,
            COUNT(DISTINCT page_location) AS unique_pages_visited_in_session,
            MAX(device_category) AS device_category, -- Assuming device_category is consistent per session
            MAX(traffic_source) AS traffic_source,   -- Assuming traffic_source is consistent per session
            MAX(traffic_medium) AS traffic_medium    -- Assuming traffic_medium is consistent per session
        FROM
            events
        GROUP BY
            session_id, user_pseudo_id
    )
    SELECT
        sa.session_id,
        sa.user_pseudo_id,
        CAST((sa.session_end_timestamp - sa.session_start_timestamp) AS DECIMAL(18,2)) / 1000000.0 AS session_duration_seconds,
        sa.total_events_in_session,
        sa.total_page_views_in_session,
        sa.unique_pages_visited_in_session,
        sa.device_category,
        sa.traffic_source,
        sa.traffic_medium,
        -- Define bounce: session with 1 event AND that event was a page_view
        CASE
            WHEN sa.total_events_in_session = 1 AND sa.total_page_views_in_session = 1 THEN 1
            ELSE 0
        END AS is_bounced
    FROM
        SessionAgg sa
    WHERE
        sa.total_page_views_in_session > 0; -- Only consider sessions with at least one page view
    """
    try:
        df = pd.read_sql(query, conn)
        print(f"Fetched {len(df)} session records for bounce prediction.")
        return df
    except Exception as e:
        print(f"Error fetching session data: {e}")
        return None

# FUNCTION 3: Prepare data for Machine Learning
def prepare_data_for_ml(df_sessions):
    """
    Prepares the DataFrame for machine learning: handles missing values,
    defines features and target, performs one-hot encoding and scaling.
    """
    if df_sessions is None or df_sessions.empty:
        print("No session data to prepare for ML.")
        return None, None, None, None, None

    # STEP 3.1: Define features and target variable
    # Numerical features
    numerical_features = [
        'session_duration_seconds',
        'total_events_in_session',
        'total_page_views_in_session',
        'unique_pages_visited_in_session'
    ]
    # Categorical features
    categorical_features = [
        'device_category',
        'traffic_source',
        'traffic_medium'
    ]
    
    # Target variable
    target = 'is_bounced'

    # Fill missing values for numerical features (e.g., with 0 or mean)
    df_sessions[numerical_features] = df_sessions[numerical_features].fillna(0)
    
    # Fill missing values for categorical features (e.g., with 'unknown' or mode)
    for col in categorical_features:
        df_sessions[col] = df_sessions[col].fillna('unknown')

    X = df_sessions[numerical_features + categorical_features]
    y = df_sessions[target]

    # STEP 3.2: Create a preprocessing pipeline for numerical and categorical features
    # Numerical features will be scaled
    # Categorical features will be one-hot encoded
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_features),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
        ])

    # STEP 3.3: Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    print(f"Data split into training ({len(X_train)} samples) and testing ({len(X_test)} samples).")

    return X_train, X_test, y_train, y_test, preprocessor

# FUNCTION 4: Train and Evaluate Machine Learning Model
def train_and_evaluate_model(X_train, X_test, y_train, y_test, preprocessor):
    """
    Trains a Logistic Regression model and evaluates its performance.
    """
    if X_train is None:
        print("No data to train or evaluate model.")
        return None, None, None  # caller unpacks three values

    # STEP 4.1: Create a pipeline that first preprocesses, then trains the model
    model_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(random_state=42, solver='liblinear')) # liblinear is good for small datasets
    ])

    # STEP 4.2: Train the model
    print("Training Logistic Regression model...")
    model_pipeline.fit(X_train, y_train)
    print("Model training completed.")

    # STEP 4.3: Make predictions on the test set
    y_pred = model_pipeline.predict(X_test)
    
    # STEP 4.4: Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print(f"\n--- Model Evaluation Results ---")
    print(f"Accuracy: {accuracy:.4f}")
    print("\nClassification Report:")
    print(report)

    return model_pipeline, accuracy, report

# MAIN EXECUTION 1: This block runs when the script starts
if __name__ == "__main__":
    # MAIN EXECUTION 2: Connect to the database
    conn = connect_to_db()
    if conn:
        # MAIN EXECUTION 3: Fetch session data
        sessions_df = fetch_session_data(conn)
        
        # MAIN EXECUTION 4: Prepare data for ML
        if sessions_df is not None and not sessions_df.empty:
            X_train, X_test, y_train, y_test, preprocessor = prepare_data_for_ml(sessions_df)
            
            # MAIN EXECUTION 5: Train and evaluate the model
            if X_train is not None:
                trained_model, acc, class_report = train_and_evaluate_model(X_train, X_test, y_train, y_test, preprocessor)
                
                if trained_model is not None:
                    print("\n--- Bounce Rate Prediction Analysis Completed ---")
                    # You can now use 'trained_model' for new predictions or save it
                    # import joblib
                    # joblib.dump(trained_model, 'E:/SankalanAnalytics/models/bounce_prediction_model.pkl')
        
        # MAIN EXECUTION 6: Close the database connection
        conn.close()
        print("Database connection closed.")
    else:
        print("Could not establish database connection. Exiting.")

Important Notes on This Code:
SQL Connection and Data Aggregation: This script connects to your SQL Server database to retrieve aggregated session-level data and uses Scikit-learn to build and evaluate a Logistic Regression model. The fetch_session_data query aggregates event data into features such as session duration, total events, and the number of unique pages visited per session. It also defines the is_bounced target variable, which indicates whether a session resulted in a bounce. The prepare_data_for_ml function handles missing values and sets up the preprocessing: one-hot encoding for categorical features and scaling for numerical features, so that all features are on a similar range, which matters for many machine learning algorithms. Note that is_bounced is derived directly from total_events_in_session and total_page_views_in_session, so these features (and session duration) largely give the answer away to the model. Scores from this basic example will therefore likely look unrealistically high. For real-time prediction you would train only on features known at the start of a session, such as device category, traffic source, and landing page.

Model Training, Evaluation and Deployment: The train_and_evaluate_model function trains the Logistic Regression model and outputs its accuracy along with a detailed classification report. The solver is set to 'liblinear' because it is efficient on smaller datasets. In the DB_CONFIG section, remember to fill in your actual SQL Server connection details, including server name, database name, username, and password. For real-world deployment, you would typically save the trained model so it can be reused later without retraining it on the same data.
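Saving and reloading a model can be sketched with joblib, the library hinted at in the commented lines at the end of the script. The example below trains a tiny stand-in model on made-up data and writes it to a temporary folder; in your project you would pass your own trained pipeline and a path of your choice.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset that stands in for the real training data
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1, 1, 0, 0]
model = LogisticRegression().fit(X, y)

# Save the fitted model to disk (a temporary folder is used for this demo)
model_path = os.path.join(tempfile.mkdtemp(), "bounce_prediction_model.pkl")
joblib.dump(model, model_path)

# Later, possibly in another script: load it back and predict without retraining
restored = joblib.load(model_path)
same_predictions = bool((model.predict(X) == restored.predict(X)).all())
print(same_predictions)
```

Only load model files that you created yourself or that come from a source you trust, because loading a pickle file can execute code.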

Understanding Your Python Bounce Rate Prediction Script

Introduction to Supervised Machine Learning: This Python script introduces you to supervised machine learning by building a model to predict whether a website session will bounce. Let us break down each part of the code to understand how it works.

1. Setting Up Your Tools and Connections: At the beginning of the script you will find several import statements that bring in the tools needed for working with data, machine learning, and database access.

import pandas as pd (SCRIPT SETUP 1): This imports Pandas, which helps you work with data in a table-like format and perform calculations.
import pyodbc (SCRIPT SETUP 1): This allows Python to connect to your SQL Server database.
from sklearn.model_selection import train_test_split (SCRIPT SETUP 1): This splits your dataset into training and testing sets.
from sklearn.preprocessing import StandardScaler, OneHotEncoder (SCRIPT SETUP 1): StandardScaler scales numerical data, and OneHotEncoder converts categorical values into numerical format.
from sklearn.linear_model import LogisticRegression (SCRIPT SETUP 1): This brings in the Logistic Regression model, which is used to classify sessions as bounced or not bounced.
from sklearn.metrics import accuracy_score, classification_report (SCRIPT SETUP 1): These functions help you measure how well your model performs.
from sklearn.compose import ColumnTransformer (SCRIPT SETUP 1): This allows you to apply different preprocessing steps to different columns in your data.
from sklearn.pipeline import Pipeline (SCRIPT SETUP 1): This lets you combine all preprocessing and modeling steps into a single flow.

DB_CONFIG (SCRIPT SETUP 2): This section holds the connection details for your SQL Server. You need to replace YOUR_SQL_SERVER_NAME, YOUR_DATABASE_NAME, YOUR_USERNAME, and YOUR_PASSWORD with your actual database credentials.

2. Connecting to Your Database using connect_to_db: This section refers to FUNCTION 1 in the code. The connect_to_db function is responsible for establishing the connection to your database.

What it does: It tries to open a connection to your SQL Server using the information provided in DB_CONFIG.
How it works: It builds a connection string, which helps pyodbc locate and access your database. After that it attempts to establish the connection.
Safety check: It prints a message to inform you whether the connection was successful or if there was an error.
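If you want to see what the connection string looks like without touching a database, the string-building step can be pulled out into its own small function, as in this sketch. The function name build_connection_string, the database name, and the environment variable names BOUNCE_DB_USER and BOUNCE_DB_PASSWORD are all hypothetical. Reading credentials from environment variables is just one common way to keep passwords out of the script file.

```python
import os


def build_connection_string(config):
    """Assemble an ODBC connection string from a DB_CONFIG-style dictionary."""
    return (
        f"DRIVER={config['driver']};"
        f"SERVER={config['server']};"
        f"DATABASE={config['database']};"
        f"UID={config['uid']};"
        f"PWD={config['pwd']};"
    )


# Hypothetical values; the fallbacks are used when the variables are not set
example_config = {
    "driver": "{ODBC Driver 17 for SQL Server}",
    "server": "localhost",
    "database": "analytics_db",
    "uid": os.environ.get("BOUNCE_DB_USER", "demo_user"),
    "pwd": os.environ.get("BOUNCE_DB_PASSWORD", "demo_password"),
}

conn_str = build_connection_string(example_config)
# Print only the non-secret part of the string
print(conn_str.split("UID=")[0])
```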

3. Fetch Session-Level Data using fetch_session_data: This refers to FUNCTION 2 in the code. The function retrieves session-level data that will be used to train the prediction model.

What it does: It runs a SQL query that aggregates event data to calculate important metrics for each unique session. These metrics serve as features for the model. It also defines the 'is_bounced' column, which is the target variable. The value is 1 if the session is a bounce and 0 otherwise. A bounce is defined as a session with only one event, and that event must be a page view.

How it works: The function uses a Common Table Expression (CTE) to group events by session. It then calculates aggregated metrics such as total events, page views, and session duration. Finally, it selects these features along with the 'is_bounced' flag and loads them into a Pandas DataFrame.

Safety check: It prints the number of session records fetched. If there is an error, it prints an error message and returns None.
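The aggregation and the bounce rule described above can be tried out in Pandas on a few made-up events, which is a handy way to check the logic before running it against the database. This is only a sketch of the same idea: the sample events are invented, and timestamps are assumed to be in microseconds, matching the division by 1,000,000 in the query.

```python
import pandas as pd

# Made-up events for three sessions (timestamps in microseconds)
events = pd.DataFrame({
    "session_id": ["a", "b", "b", "b", "c", "c"],
    "event_name": ["page_view", "page_view", "scroll", "page_view", "page_view", "click"],
    "event_timestamp": [0, 1_000_000, 3_000_000, 9_000_000, 5_000_000, 5_500_000],
})

# Flag page views so that summing the flag counts only page_view events
events["is_page_view"] = (events["event_name"] == "page_view").astype(int)

# One row per session, mirroring the SessionAgg CTE
sessions = events.groupby("session_id").agg(
    total_events_in_session=("event_name", "count"),
    total_page_views_in_session=("is_page_view", "sum"),
    session_start=("event_timestamp", "min"),
    session_end=("event_timestamp", "max"),
).reset_index()

sessions["session_duration_seconds"] = (
    sessions["session_end"] - sessions["session_start"]
) / 1_000_000

# Bounce rule: exactly one event, and that event is a page view
sessions["is_bounced"] = (
    (sessions["total_events_in_session"] == 1)
    & (sessions["total_page_views_in_session"] == 1)
).astype(int)
print(sessions)
```

Only session "a" counts as a bounce here. Session "c" has a single page view but also a click, so it is not a bounce.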

4. Prepare Data for Machine Learning using prepare_data_for_ml: This refers to FUNCTION 3 and its internal steps 3.1 to 3.3 in the code. The function transforms raw session data into a format suitable for machine learning.

What it does: It identifies numerical and categorical features and handles missing data. It defines the preprocessing steps (one-hot encoding for categorical features and scaling for numerical features), which are applied later when the pipeline is fitted. Finally, it splits the data into training and testing sets.

How it works:

Step 3.1 Define features and target variable: The function explicitly lists which columns are numerical and which are categorical. It also identifies the 'is_bounced' column as the target variable. Missing values in both numerical and categorical columns are filled to prevent errors.

Step 3.2 Create a preprocessing pipeline: It uses ColumnTransformer to apply StandardScaler to numerical features and OneHotEncoder to categorical features. This ensures different types of data are preprocessed correctly.
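To see what the ColumnTransformer from Step 3.2 produces, here is a small sketch on three made-up sessions with one numerical and one categorical column. The helper to_dense is not part of the main script; it is only there so the result prints as a plain array whether Scikit-learn returns a sparse or a dense matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def to_dense(matrix):
    """Return a plain NumPy array whether the input is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else matrix


# Three made-up sessions
sessions = pd.DataFrame({
    "total_events_in_session": [1, 4, 7],
    "device_category": ["mobile", "desktop", "mobile"],
})

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["total_events_in_session"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["device_category"]),
])

# Output columns: scaled events, then one column each for desktop and mobile
train_matrix = to_dense(preprocessor.fit_transform(sessions))
print(train_matrix)

# A device category never seen during fitting is encoded as all zeros
new_session = pd.DataFrame({
    "total_events_in_session": [4],
    "device_category": ["tablet"],
})
new_matrix = to_dense(preprocessor.transform(new_session))
print(new_matrix)
```

The handle_unknown='ignore' setting is what keeps the transform from failing when a new category such as "tablet" appears after training.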

Step 3.3 Split data into training and testing sets: The dataset is divided into training and testing parts. The training set is used to teach the model, and the testing set is used to evaluate its performance on unseen data. The stratify=y argument preserves the original proportion of bounced versus non-bounced sessions in both sets.

Output: The function returns the training and testing data along with the (not yet fitted) preprocessing transformer.
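The effect of the stratify argument from Step 3.3 is easy to check on made-up labels. In this sketch, 30 out of 100 sessions are bounces, and both the training and the testing set end up with the same 30 percent bounce rate.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 100 sessions, 30 of which bounced
y = [1] * 30 + [0] * 70
X = [[i] for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of bounced sessions in each split
train_bounce_rate = sum(y_train) / len(y_train)
test_bounce_rate = sum(y_test) / len(y_test)
print(len(y_train), train_bounce_rate)
print(len(y_test), test_bounce_rate)
```

Without stratify, the two rates could drift apart by chance, which matters most when bounces are rare or the dataset is small.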

5. Train and Evaluate Machine Learning Model using train_and_evaluate_model: This refers to FUNCTION 4 and its internal steps 4.1 to 4.4 in the code. This function builds and assesses the machine learning model.

What it does: It creates a machine learning pipeline, trains a Logistic Regression model, makes predictions, and evaluates the model's performance.

How it works:

Step 4.1 Create a pipeline: The function sets up a Pipeline that first applies the preprocessing steps defined earlier and then passes the transformed data to the Logistic Regression classifier.

Step 4.2 Train the model: It uses the fit method on the training data. This is where the model learns the relationships between features and the likelihood of a bounce.

Step 4.3 Make predictions on the test set: After training, the model predicts whether sessions in the unseen test set will bounce or not.

Step 4.4 Evaluate the model: The function calculates the accuracy score to measure the overall correctness of predictions. It also generates a classification report that provides detailed metrics such as precision, recall, and F1 score for both bounced and non-bounced sessions.

Output: It prints the accuracy and the full classification report of the model, and returns the trained pipeline together with both results.
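Besides the yes/no output of predict, a fitted Logistic Regression model can also return a bounce probability through predict_proba, which is what you would use to flag high-risk sessions for an intervention. The sketch below uses a single made-up numeric feature and an arbitrary 0.5 threshold; both are assumptions for illustration, and the right threshold depends on how costly false alarms are for you.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: low feature values tend to bounce (label 1)
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1, 1, 1, 0, 0, 0])
model = LogisticRegression().fit(X_train, y_train)

# Two new sessions to score
X_new = np.array([[0.5], [4.5]])

# Column 1 of predict_proba is the probability of class 1 (bounced)
bounce_probability = model.predict_proba(X_new)[:, 1]

# Flag sessions whose bounce probability reaches the chosen threshold
threshold = 0.5
flagged = bounce_probability >= threshold
print(bounce_probability, flagged)
```

The same predict_proba call works on the full pipeline returned by train_and_evaluate_model, as long as the new data has the same columns as the training data.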

6. Running the Script: The Main Block. This corresponds to MAIN EXECUTION 1 to 6 in the code. This section puts everything into action when you run the Python file.

MAIN EXECUTION 1: This line ensures the code inside this block only runs when you directly start this Python file.

MAIN EXECUTION 2 Connect to the database: It calls the connect_to_db function to establish your database connection. If it fails, the script prints a message and stops.

MAIN EXECUTION 3 Fetch session data: If the connection is successful, it calls fetch_session_data to retrieve your session-level data.

MAIN EXECUTION 4 Prepare data for ML: If session data is fetched successfully, it calls prepare_data_for_ml to preprocess the data.

MAIN EXECUTION 5 Train and evaluate the model: If data preparation succeeds, it calls train_and_evaluate_model to build and assess the predictive model.

MAIN EXECUTION 6 Close the database connection: Finally, it closes the database connection, which is good practice to free up resources.

Overall Value of Bounce Rate Prediction with Machine Learning

Predicting bounce rate with machine learning is a significant advancement in your web analytics project. It allows you to move beyond simply observing bounces to proactively identifying and addressing them. By understanding which sessions are at risk you can implement targeted strategies to improve engagement and reduce abandonment.

This demonstrates your ability to apply supervised machine learning to real world business problems. This is a vital skill in data science and digital marketing.

Next Steps

You have successfully built and evaluated a machine learning model to predict bounce rate. This means you are now proficient in applying supervised machine learning for predictive analytics. The next exciting phase will be to develop a recommendation engine. This will involve using machine learning to suggest relevant pages or content to users. This will further enhance user engagement.

For now, make sure you save this Python script in the backend folder of your SankalanAnalytics project on the E drive. Name it something like 'predict_bounce_rate.py'.