
Clean Website Data with Python and Pandas

Welcome to Phase 3.1: Preparing Your Data for Analysis

You have successfully exported your website data from Google Analytics 4 (GA4) into a CSV file. You have also designed the perfect home for it in your SQL Server database. Now it is time to bring those two pieces together. In this phase, we will use Python and a powerful tool called Pandas. We will read your raw CSV data, clean it up, and then load it into your SQL Server tables. This step is like getting your ingredients ready for cooking: we make sure everything is fresh and in the right form before we start making a meal.


Diagram showing Python Pandas cleaning messy web analytics data and loading it into a structured database.

Why Data Cleaning is Crucial

Raw data is rarely perfect. It often has issues that can mess up your analysis. Cleaning your data makes sure your insights are accurate and trustworthy. Here are common problems and how cleaning helps.

Missing Information: Sometimes data is just not there. We will decide how to handle these gaps. We might fill them in or remove rows that are not complete.

Wrong Data Types: Numbers might be stored as text. Dates might be in a strange format. We will change these to the correct types so SQL Server understands them.

Duplicate Entries: You might have the same event or user listed more than once. We will find and remove these extra copies to keep your data unique.

Inconsistent Values: A country might be spelled differently in various places. We will make sure all names and categories are spelled the same way.

By cleaning your data now, you save a lot of headaches later, and your analysis will be much more reliable.
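As a preview of what these fixes look like in Pandas, here is a minimal illustrative sketch. The column names used here (geo_country, user_pseudo_id, event_id, event_timestamp) are assumptions, not your guaranteed GA4 export headers; adjust them to match your actual file.

    import pandas as pd

    # Illustrative only: column names are assumptions about the export.
    df = pd.read_csv('ga4_export_data.csv')

    # Missing information: fill gaps or drop incomplete rows
    df['geo_country'] = df['geo_country'].fillna('Unknown')
    df = df.dropna(subset=['user_pseudo_id'])

    # Wrong data types: turn text into numbers the database understands
    df['event_timestamp'] = pd.to_numeric(df['event_timestamp'], errors='coerce')

    # Duplicate entries: keep only one copy of each event
    df = df.drop_duplicates(subset=['event_id'])

    # Inconsistent values: standardize spelling and casing
    df['geo_country'] = df['geo_country'].str.strip().str.title()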


Steps to Clean and Load Data Using Python and Pandas

We will create a Python script. This script will do all the heavy lifting for you. It will make sure your data is ready for SQL Server.

Practical Python Code Example

Here is a basic example of the Python code you will write. This code will connect to your SQL Server database. It will read your CSV file. It will perform some basic cleaning. Then it will load the data into your events and users tables.

# SCRIPT SETUP 1: Import necessary libraries
import pandas as pd
import pyodbc
import json

# SCRIPT SETUP 2: Database connection details for SQL Server
# Replace with your actual server name, database name and credentials
DB_CONFIG = {
    'driver': '{ODBC Driver 17 for SQL Server}',
    'server': 'YOUR_SQL_SERVER_NAME',   # e.g., 'localhost' or 'SERVER_NAME\SQLEXPRESS'
    'database': 'YOUR_DATABASE_NAME',
    'uid': 'YOUR_USERNAME',
    'pwd': 'YOUR_PASSWORD'
}

# SCRIPT SETUP 3: Path to your exported GA4 CSV file
CSV_FILE_PATH = 'E:/SankalanAnalytics/data/raw/ga4_export_data.csv'


# FUNCTION 1: Connect to the database
def connect_to_db():
    """Establishes a connection to the SQL Server database."""
    conn_str = (
        f"DRIVER={DB_CONFIG['driver']};"
        f"SERVER={DB_CONFIG['server']};"
        f"DATABASE={DB_CONFIG['database']};"
        f"UID={DB_CONFIG['uid']};"
        f"PWD={DB_CONFIG['pwd']};"
    )
    try:
        conn = pyodbc.connect(conn_str)
        print("Successfully connected to SQL Server.")
        return conn
    except pyodbc.Error as ex:
        sqlstate = ex.args[0]
        print(f"Database connection error: {sqlstate}")
        print(ex)
        return None


# FUNCTION 2: Load and Clean Data
def load_and_clean_data(file_path):
    """Loads CSV data into a Pandas DataFrame and performs initial cleaning."""
    try:
        # STEP 2.1: Load CSV into DataFrame
        df = pd.read_csv(file_path)
        print(f"CSV file loaded. Original rows: {len(df)}")

        # STEP 2.2: Basic cleaning steps
        # Rename columns to match SQL table names if needed
        # Example: df = df.rename(columns={'ga_session_id': 'session_id'})

        # STEP 2.3: Convert timestamp to integer (microseconds)
        # Assuming event_timestamp is already in microseconds as a string or number
        # If it's in seconds or milliseconds and needs conversion, adjust here
        df['event_timestamp'] = pd.to_numeric(df['event_timestamp'], errors='coerce').fillna(0).astype(int)

        # STEP 2.4: Handle missing values for text columns (fill with empty string)
        text_cols = ['page_location', 'page_title', 'session_id', 'traffic_source',
                     'traffic_medium', 'device_category', 'geo_country', 'geo_city',
                     'user_age_bracket', 'user_gender', 'event_params_json']
        for col in text_cols:
            if col in df.columns:
                df[col] = df[col].fillna('')

        # STEP 2.5: Ensure event_id and user_pseudo_id are strings
        df['event_id'] = df['event_id'].astype(str)
        df['user_pseudo_id'] = df['user_pseudo_id'].astype(str)

        # STEP 2.6: Remove duplicate events if event_id is truly unique per event
        # If event_id is not always unique in GA4 export, you might need a composite key
        df.drop_duplicates(subset=['event_id'], inplace=True)
        print(f"After removing duplicate events: {len(df)} rows")

        return df
    except Exception as e:
        print(f"Error loading or cleaning data: {e}")
        return None


# FUNCTION 3: Insert Users Data
def insert_users_data(conn, df):
    """Extracts unique users and inserts them into the users table."""
    cursor = conn.cursor()

    # STEP 3.1: Extract unique users and their first visit details
    # This assumes the first row for a user in the CSV contains their first visit info
    # In a real scenario, you might need more sophisticated logic
    # or a separate 'user_first_visit' report
    users_data = df.sort_values('event_timestamp').drop_duplicates(subset=['user_pseudo_id'])

    # STEP 3.2: Select only columns relevant to the users table
    users_to_insert = users_data[[
        'user_pseudo_id',
        'event_timestamp',   # Using event_timestamp as first_visit_timestamp for simplicity
        'traffic_source',    # Using traffic_source as first_traffic_source for simplicity
        'traffic_medium',    # Using traffic_medium as first_traffic_medium for simplicity
        'device_category',
        'geo_country',
        'geo_city'
        # user_age_bracket and user_gender might need custom logic
        # or be added later if not in GA4 export
    ]].copy()

    # STEP 3.3: Rename columns to match the users table schema
    users_to_insert = users_to_insert.rename(columns={
        'event_timestamp': 'first_visit_timestamp',
        'traffic_source': 'first_traffic_source',
        'traffic_medium': 'first_traffic_medium'
    })

    # STEP 3.4: Add placeholder for user_age_bracket and user_gender if not directly available
    if 'user_age_bracket' not in users_to_insert.columns:
        users_to_insert['user_age_bracket'] = ''
    if 'user_gender' not in users_to_insert.columns:
        users_to_insert['user_gender'] = ''

    # STEP 3.5: Prepare data for insertion
    # Ensure all columns match the users table
    users_final = users_to_insert[[
        'user_pseudo_id', 'first_visit_timestamp', 'first_traffic_source',
        'first_traffic_medium', 'device_category', 'geo_country', 'geo_city',
        'user_age_bracket', 'user_gender'
    ]]

    print(f"Inserting {len(users_final)} unique users...")

    # STEP 3.6: Insert data row by row (handle existing users)
    for index, row in users_final.iterrows():
        try:
            # Use MERGE or INSERT OR IGNORE for idempotency (avoiding duplicates on re-run)
            # For SQL Server, MERGE is ideal but more complex for a simple example
            # We will use INSERT and handle potential primary key violations
            cursor.execute("""
                INSERT INTO users (
                    user_pseudo_id, first_visit_timestamp, first_traffic_source,
                    first_traffic_medium, device_category, geo_country, geo_city,
                    user_age_bracket, user_gender
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
                """,
                row['user_pseudo_id'], row['first_visit_timestamp'],
                row['first_traffic_source'], row['first_traffic_medium'],
                row['device_category'], row['geo_country'], row['geo_city'],
                row['user_age_bracket'], row['user_gender'])
        except pyodbc.IntegrityError:
            # User already exists, skip or update if needed
            pass
        except Exception as e:
            print(f"Error inserting user {row['user_pseudo_id']}: {e}")
            conn.rollback()  # Rollback in case of error
            return False

    # STEP 3.7: Commit changes to the database
    conn.commit()
    print("User data inserted successfully.")
    return True


# FUNCTION 4: Insert Events Data
def insert_events_data(conn, df):
    """Inserts cleaned event data into the events table."""
    cursor = conn.cursor()
    print(f"Inserting {len(df)} events...")

    # STEP 4.1: Prepare data for insertion
    # Ensure all columns match the events table
    events_to_insert = df[[
        'event_id', 'user_pseudo_id', 'event_timestamp', 'event_name',
        'page_location', 'page_title', 'session_id', 'session_number',
        'traffic_source', 'traffic_medium', 'device_category',
        'geo_country', 'geo_city', 'event_params_json'
    ]]

    # Example of batch insertion for performance:
    # SQL Server supports INSERT ... VALUES (...), (...), ...
    # or using executemany for multiple rows

    # STEP 4.2: Insert data row by row (handle existing events)
    # For simplicity, inserting row by row; for large datasets, use executemany or bulk insert
    for index, row in events_to_insert.iterrows():
        try:
            cursor.execute("""
                INSERT INTO events (
                    event_id, user_pseudo_id, event_timestamp, event_name,
                    page_location, page_title, session_id, session_number,
                    traffic_source, traffic_medium, device_category,
                    geo_country, geo_city, event_params_json
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """,
                row['event_id'], row['user_pseudo_id'], row['event_timestamp'],
                row['event_name'], row['page_location'], row['page_title'],
                row['session_id'], row['session_number'], row['traffic_source'],
                row['traffic_medium'], row['device_category'], row['geo_country'],
                row['geo_city'], row['event_params_json'])
        except pyodbc.IntegrityError:
            # Event already exists (due to PRIMARY KEY on event_id), skip
            pass
        except Exception as e:
            print(f"Error inserting event {row['event_id']}: {e}")
            conn.rollback()  # Rollback in case of error
            return False

    # STEP 4.3: Commit changes to the database
    conn.commit()
    print("Event data inserted successfully.")
    return True


# MAIN EXECUTION 1: This block runs when the script starts
if __name__ == "__main__":
    # MAIN EXECUTION 2: Connect to the database
    conn = connect_to_db()
    if conn is None:
        print("Failed to connect to the database. Exiting.")
    else:
        # MAIN EXECUTION 3: Load and clean data from CSV
        df_raw = load_and_clean_data(CSV_FILE_PATH)
        if df_raw is None:
            print("Failed to load or clean data. Exiting.")
        else:
            # MAIN EXECUTION 4: Insert into users table
            if insert_users_data(conn, df_raw):
                # MAIN EXECUTION 5: Insert into events table
                insert_events_data(conn, df_raw)
            else:
                print("Skipping event insertion due to user data insertion failure.")

        # MAIN EXECUTION 6: Close the database connection
        conn.close()
        print("Database connection closed.")

Important Notes on This Code

This script is a starting point. Your actual GA4 CSV file may have slightly different column names. You will need to adjust the Python code so the column names match exactly. For example, if your CSV uses page location instead of page_location, you should update the code accordingly.
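For instance, a rename step placed near STEP 2.2 might look like the sketch below. The headers on the left are hypothetical examples of what a GA4 export could contain, not your guaranteed column names.

    # Hypothetical CSV headers on the left; SQL column names used by the script on the right.
    df = df.rename(columns={
        'Page location': 'page_location',
        'Page title': 'page_title',
        'ga_session_id': 'session_id'
    })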

The code assumes that event_id is unique for every event. If your GA4 export does not provide a unique event_id, you may need to generate one manually or use a combination of fields (like timestamp, user ID, and event name) to ensure uniqueness.
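For example, a surrogate key could be built like this. The choice of fields is an assumption about your export; verify that the combination is unique enough before relying on it.

    # Hypothetical fallback: build a surrogate event_id from fields that, together,
    # should identify a single event in your export.
    df['event_id'] = (
        df['user_pseudo_id'].astype(str) + '_'
        + df['event_timestamp'].astype(str) + '_'
        + df['event_name'].astype(str)
    )
    df = df.drop_duplicates(subset=['event_id'])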

This version of the code inserts records one by one. That works fine for small files, but it can be slow for large datasets. For better performance, consider pyodbc's executemany method (ideally with fast_executemany enabled) or a SQL Server bulk loading option such as BULK INSERT.
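As a rough illustration, a batched version of the event insert (Step 4.2) might look like the sketch below. It assumes the same events table and the events_to_insert DataFrame from the script above. Note that, unlike the row-by-row loop, a single batched insert fails as a whole if any row violates the primary key, so deduplicate before calling it.

    def insert_events_fast(conn, events_to_insert):
        """Sketch: batched insert of the events DataFrame using executemany."""
        cursor = conn.cursor()
        cursor.fast_executemany = True  # lets pyodbc send parameter batches to SQL Server

        insert_sql = """
            INSERT INTO events (
                event_id, user_pseudo_id, event_timestamp, event_name,
                page_location, page_title, session_id, session_number,
                traffic_source, traffic_medium, device_category,
                geo_country, geo_city, event_params_json
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """

        # Convert DataFrame rows to plain tuples in the same column order as the SQL
        rows = list(events_to_insert.itertuples(index=False, name=None))
        cursor.executemany(insert_sql, rows)
        conn.commit()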

Make sure to replace the placeholders in the DB_CONFIG section with your actual SQL Server connection details. This includes your server name, database name, username, and password.
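If you prefer not to hardcode credentials in the script, one option is to read them from environment variables. This is a sketch using hypothetical variable names (SQL_SERVER, SQL_DATABASE, SQL_USERNAME, SQL_PASSWORD); it would replace the DB_CONFIG block shown above.

    import os

    # Hypothetical environment variable names; set them to whatever you prefer.
    DB_CONFIG = {
        'driver': '{ODBC Driver 17 for SQL Server}',
        'server': os.environ.get('SQL_SERVER', 'localhost'),
        'database': os.environ.get('SQL_DATABASE', 'YOUR_DATABASE_NAME'),
        'uid': os.environ.get('SQL_USERNAME', ''),
        'pwd': os.environ.get('SQL_PASSWORD', '')
    }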


Understanding Your Python Data Cleaning and Loading Script

This Python script acts as your personal data preparation assistant. It processes raw website data from an exported CSV file, cleans it, and loads it neatly into your SQL Server database tables. Let’s break down each section of the code to understand how it works.

1. Setting Up Your Tools and Connections

(Refer to SCRIPT SETUP 1, 2, 3 in code)

At the top of the script, you'll find a few import statements. These bring in the essential tools needed for the task:

import pandas as pd (SCRIPT SETUP 1): Loads the Pandas library, which works like a powerful spreadsheet tool in Python. It’s great for reading and organizing tabular data. We'll use it to load and clean the CSV file.

import pyodbc (SCRIPT SETUP 1): Imports the pyodbc library, which allows Python to interact with SQL Server directly. This is how the script sends data to the database.

import json (SCRIPT SETUP 1): Includes Python's built-in json module. The basic script stores event_params_json as plain text, but this module comes in handy if you later need to parse that JSON.

Next, you'll see two important configuration items:

DB_CONFIG (SCRIPT SETUP 2): These are the settings the script uses to connect to your SQL Server. It includes:

driver: Specifies the software driver for connecting to SQL Server.

server: The name of your SQL Server instance, like localhost or a network path.

database: The name of the database created for your analytics.

uid: The SQL Server username.

pwd: The password for that user.

You must replace the placeholder values with your actual SQL Server credentials.

CSV_FILE_PATH (SCRIPT SETUP 3): Indicates the exact location of your exported GA4 CSV file. Make sure this path is correct before running the script.


2. Connecting to Your Database

(Function: connect_to_db)
(Refer to FUNCTION 1 in code)

This is the first key function in the script.

What it does: Establishes a connection to your SQL Server using the values set in DB_CONFIG.

How it works: Builds a connection string and uses pyodbc to attempt a connection.

Safety check: If the connection is successful, it prints a confirmation message. If not (e.g. due to incorrect credentials), it prints an error to help you troubleshoot.
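If the connection fails, a common culprit is the ODBC driver name. pyodbc can list the drivers installed on your machine, so a quick check like this (an optional addition, not part of the script) can save troubleshooting time:

    import pyodbc

    # Print the ODBC drivers installed on this machine.
    # The driver named in DB_CONFIG must appear in this list.
    print(pyodbc.drivers())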


3. Loading and Cleaning Your Data

(Function: load_and_clean_data)
(Refer to FUNCTION 2 and internal steps 2.1 to 2.6 in code)

This function uses Pandas to process your CSV and prepare the data for the database.

What it does: Loads your CSV into a Pandas DataFrame and applies a series of basic cleaning steps.

How it works:

Step 2.1 – Load CSV into DataFrame: Reads your CSV into a variable named df, which acts like a table.

Step 2.2 – Basic cleaning steps: Rename or adjust column names as needed to match the SQL Server schema.

Step 2.3 – Timestamp Conversion: Ensures that the event_timestamp column is stored as an integer. This is crucial for compatibility with SQL Server.

Step 2.4 – Handling Missing Text: Replaces any missing values in text fields with empty strings to avoid insertion errors.

Step 2.5 – Ensuring String Types: Converts IDs like event_id and user_pseudo_id into strings. This ensures data consistency.

Step 2.6 – Removing Duplicates: Drops rows with duplicate event_id values to maintain uniqueness in the database.

Safety check: It prints how many rows were loaded, and how many remain after duplicate removal. Any loading or cleaning errors are also reported.
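Before handing the cleaned data to the insert functions, it can also help to eyeball the DataFrame. This optional snippet is not part of the script above; it simply reuses load_and_clean_data to print column types, a few sample rows, and any remaining missing values.

    # Optional sanity check: run the cleaning function and inspect the result
    df_clean = load_and_clean_data(CSV_FILE_PATH)

    print(df_clean.dtypes)         # column names and data types
    print(df_clean.head())         # first few rows
    print(df_clean.isna().sum())   # remaining missing values per column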


4. Inserting User Data

(Function: insert_users_data)
(Refer to FUNCTION 3 and internal steps 3.1 to 3.7 in code)

This function inserts user-specific information into the users table.

What it does: Identifies unique users and loads their information into your SQL Server.

How it works:

Step 3.1 – Extract unique users and first visit details: Sorts users by time and keeps the first visit for each user_pseudo_id.

Step 3.2 – Select relevant columns: Filters the DataFrame to only include columns needed by the users table.

Step 3.3 – Rename columns: Adjusts column names from the CSV to match your table schema.

Step 3.4 – Add placeholders: Adds user_age_bracket and user_gender if these fields aren't present in the CSV.

Step 3.5 – Prepare for insertion: Finalizes the data for database insertion.

Step 3.6 – Insert row by row: Adds each user to the database, skipping any that already exist to prevent duplication. (A MERGE-based alternative for this step is sketched just after this list.)

Step 3.7 – Commit changes: Saves all insert operations to the database.
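The code comments in Step 3.6 point out that SQL Server's MERGE statement is the more robust way to make re-runs safe. Here is a hedged sketch of what that could look like, assuming the same users table and the users_final DataFrame built inside the function above; treat it as an optional refinement rather than part of the tested script.

    def upsert_users_with_merge(conn, users_final):
        """Sketch: idempotent user insert using SQL Server's MERGE statement."""
        cursor = conn.cursor()
        merge_sql = """
            MERGE users AS target
            USING (SELECT ? AS user_pseudo_id) AS source
                ON target.user_pseudo_id = source.user_pseudo_id
            WHEN NOT MATCHED THEN
                INSERT (user_pseudo_id, first_visit_timestamp, first_traffic_source,
                        first_traffic_medium, device_category, geo_country, geo_city,
                        user_age_bracket, user_gender)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);
        """
        for _, row in users_final.iterrows():
            cursor.execute(merge_sql,
                           row['user_pseudo_id'],            # feeds the USING clause
                           row['user_pseudo_id'], row['first_visit_timestamp'],
                           row['first_traffic_source'], row['first_traffic_medium'],
                           row['device_category'], row['geo_country'], row['geo_city'],
                           row['user_age_bracket'], row['user_gender'])
        conn.commit()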


5. Inserting Event Data

(Function: insert_events_data)
(Refer to FUNCTION 4 and internal steps 4.1 to 4.3 in code)

This function handles the event records from the CSV.

What it does: Loads every cleaned event into the events table in SQL Server.

How it works:

Step 4.1 – Prepare data: Selects and structures only the required columns for the events table.

Step 4.2 – Insert row by row: Adds each event to the database. If a record with the same event_id already exists, it skips it.

Step 4.3 – Commit changes: Finalizes and saves the inserts to the database.


6. Running the Script

(Refer to MAIN EXECUTION 1 to 6 in code)

This section brings everything together. It’s located in the if __name__ == "__main__": block.

Step 1 – Connect: Calls connect_to_db() to initiate a database connection.

Step 2 – Load and Clean: If the connection works, it calls load_and_clean_data() to process the CSV.

Step 3 – Insert Users: If the data is cleaned successfully, user data is inserted using insert_users_data().

Step 4 – Insert Events: Once users are added, it calls insert_events_data() to load event records.

Step 5 – Close Connection: Regardless of success or failure, the script ensures the database connection is closed properly. This is a best practice for stability.
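If you want the close to happen even when one of the insert functions raises an unexpected error, a try/finally block is a common pattern. This sketch reshapes the main block along those lines; it is an optional variation, not the script's current structure.

    # Sketch: same flow as the script's main block, with try/finally for cleanup.
    if __name__ == "__main__":
        conn = connect_to_db()
        if conn is not None:
            try:
                df_raw = load_and_clean_data(CSV_FILE_PATH)
                if df_raw is not None and insert_users_data(conn, df_raw):
                    insert_events_data(conn, df_raw)
            finally:
                conn.close()
                print("Database connection closed.")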


Overall Value of This Script

This Python script is a powerful part of your web analytics project. It transforms raw, messy CSV files into structured, organized data inside your SQL Server database. This prepares your data for detailed analysis and dashboard creation later on. The script automates a critical step of your data pipeline, making your project efficient and professional.





