
Clean Website Data with Python and Pandas

Welcome to Phase 3.1: Preparing Your Data for Analysis

You have successfully exported your website data from Google Analytics 4 (GA4) into a CSV file. You have also designed the perfect home for it in your SQL Server database. Now it is time to bring those two pieces together. In this phase, we will use Python and a powerful tool called Pandas. We will read your raw CSV data, clean it up, and then load it into your SQL Server tables. This step is like getting your ingredients ready for cooking: we make sure everything is fresh and in the right form before we start making a meal.


Diagram showing Python Pandas cleaning messy web analytics data and loading it into a structured database.

Why Data Cleaning is Crucial

Raw data is rarely perfect. It often has issues that can mess up your analysis. Cleaning your data makes sure your insights are accurate and trustworthy. Here are common problems and how cleaning helps.

Missing Information: Sometimes data is just not there. We will decide how to handle these gaps. We might fill them in or remove rows that are not complete.

Wrong Data Types: Numbers might be stored as text. Dates might be in a strange format. We will change these to the correct types so SQL Server understands them.

Duplicate Entries: You might have the same event or user listed more than once. We will find and remove these extra copies to keep your data unique.

Inconsistent Values: A country might be spelled differently in various places. We will make sure all names and categories are spelled the same way.

By cleaning your data now, you save a lot of headaches later, and your analysis will be much more reliable.
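As a preview of what these fixes look like in Pandas, here is a minimal illustrative sketch. The column names used here (geo_country, user_pseudo_id, event_id, event_timestamp) are assumptions, not your guaranteed GA4 export headers; adjust them to match your actual file.

    import pandas as pd

    # Illustrative only: column names are assumptions about the export.
    df = pd.read_csv('ga4_export_data.csv')

    # Missing information: fill gaps or drop incomplete rows
    df['geo_country'] = df['geo_country'].fillna('Unknown')
    df = df.dropna(subset=['user_pseudo_id'])

    # Wrong data types: turn text into numbers the database understands
    df['event_timestamp'] = pd.to_numeric(df['event_timestamp'], errors='coerce')

    # Duplicate entries: keep only one copy of each event
    df = df.drop_duplicates(subset=['event_id'])

    # Inconsistent values: standardize spelling and casing
    df['geo_country'] = df['geo_country'].str.strip().str.title()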


Steps to Clean and Load Data Using Python and Pandas

We will create a Python script. This script will do all the heavy lifting for you. It will make sure your data is ready for SQL Server.

Practical Python Code Example

Here is a basic example of the Python code you will write. This code will connect to your SQL Server database. It will read your CSV file. It will perform some basic cleaning. Then it will load the data into your events and users tables.

# SCRIPT SETUP 1: Import necessary libraries
import pandas as pd
import pyodbc
import json

# SCRIPT SETUP 2: Database connection details for SQL Server
# Replace with your actual server name, database name and credentials
DB_CONFIG = {
    'driver': '{ODBC Driver 17 for SQL Server}',
    'server': 'YOUR_SQL_SERVER_NAME',   # e.g., 'localhost' or 'SERVER_NAME\SQLEXPRESS'
    'database': 'YOUR_DATABASE_NAME',
    'uid': 'YOUR_USERNAME',
    'pwd': 'YOUR_PASSWORD'
}

# SCRIPT SETUP 3: Path to your exported GA4 CSV file
CSV_FILE_PATH = 'E:/SankalanAnalytics/data/raw/ga4_export_data.csv'


# FUNCTION 1: Connect to the database
def connect_to_db():
    """Establishes a connection to the SQL Server database."""
    conn_str = (
        f"DRIVER={DB_CONFIG['driver']};"
        f"SERVER={DB_CONFIG['server']};"
        f"DATABASE={DB_CONFIG['database']};"
        f"UID={DB_CONFIG['uid']};"
        f"PWD={DB_CONFIG['pwd']};"
    )
    try:
        conn = pyodbc.connect(conn_str)
        print("Successfully connected to SQL Server.")
        return conn
    except pyodbc.Error as ex:
        sqlstate = ex.args[0]
        print(f"Database connection error: {sqlstate}")
        print(ex)
        return None


# FUNCTION 2: Load and Clean Data
def load_and_clean_data(file_path):
    """Loads CSV data into a Pandas DataFrame and performs initial cleaning."""
    try:
        # STEP 2.1: Load CSV into DataFrame
        df = pd.read_csv(file_path)
        print(f"CSV file loaded. Original rows: {len(df)}")

        # STEP 2.2: Basic cleaning steps
        # Rename columns to match SQL table names if needed
        # Example: df = df.rename(columns={'ga_session_id': 'session_id'})

        # STEP 2.3: Convert timestamp to integer (microseconds)
        # Assuming event_timestamp is already in microseconds as a string or number
        # If it's in seconds or milliseconds and needs conversion, adjust here
        df['event_timestamp'] = pd.to_numeric(df['event_timestamp'], errors='coerce').fillna(0).astype(int)

        # STEP 2.4: Handle missing values for text columns (fill with empty string)
        text_cols = ['page_location', 'page_title', 'session_id', 'traffic_source',
                     'traffic_medium', 'device_category', 'geo_country', 'geo_city',
                     'user_age_bracket', 'user_gender', 'event_params_json']
        for col in text_cols:
            if col in df.columns:
                df[col] = df[col].fillna('')

        # STEP 2.5: Ensure event_id and user_pseudo_id are strings
        df['event_id'] = df['event_id'].astype(str)
        df['user_pseudo_id'] = df['user_pseudo_id'].astype(str)

        # STEP 2.6: Remove duplicate events if event_id is truly unique per event
        # If event_id is not always unique in GA4 export, you might need a composite key
        df.drop_duplicates(subset=['event_id'], inplace=True)
        print(f"After removing duplicate events: {len(df)} rows")

        return df
    except Exception as e:
        print(f"Error loading or cleaning data: {e}")
        return None


# FUNCTION 3: Insert Users Data
def insert_users_data(conn, df):
    """Extracts unique users and inserts them into the users table."""
    cursor = conn.cursor()

    # STEP 3.1: Extract unique users and their first visit details
    # This assumes the first row for a user in the CSV contains their first visit info
    # In a real scenario, you might need more sophisticated logic
    # or a separate 'user_first_visit' report
    users_data = df.sort_values('event_timestamp').drop_duplicates(subset=['user_pseudo_id'])

    # STEP 3.2: Select only columns relevant to the users table
    users_to_insert = users_data[[
        'user_pseudo_id',
        'event_timestamp',   # Using event_timestamp as first_visit_timestamp for simplicity
        'traffic_source',    # Using traffic_source as first_traffic_source for simplicity
        'traffic_medium',    # Using traffic_medium as first_traffic_medium for simplicity
        'device_category',
        'geo_country',
        'geo_city'
        # user_age_bracket and user_gender might need custom logic
        # or be added later if not in GA4 export
    ]].copy()

    # STEP 3.3: Rename columns to match the users table schema
    users_to_insert = users_to_insert.rename(columns={
        'event_timestamp': 'first_visit_timestamp',
        'traffic_source': 'first_traffic_source',
        'traffic_medium': 'first_traffic_medium'
    })

    # STEP 3.4: Add placeholder for user_age_bracket and user_gender if not directly available
    if 'user_age_bracket' not in users_to_insert.columns:
        users_to_insert['user_age_bracket'] = ''
    if 'user_gender' not in users_to_insert.columns:
        users_to_insert['user_gender'] = ''

    # STEP 3.5: Prepare data for insertion
    # Ensure all columns match the users table
    users_final = users_to_insert[[
        'user_pseudo_id', 'first_visit_timestamp', 'first_traffic_source',
        'first_traffic_medium', 'device_category', 'geo_country', 'geo_city',
        'user_age_bracket', 'user_gender'
    ]]

    print(f"Inserting {len(users_final)} unique users...")

    # STEP 3.6: Insert data row by row (handle existing users)
    for index, row in users_final.iterrows():
        try:
            # Use MERGE or INSERT OR IGNORE for idempotency (avoiding duplicates on re-run)
            # For SQL Server, MERGE is ideal but more complex for a simple example
            # We will use INSERT and handle potential primary key violations
            cursor.execute("""
                INSERT INTO users (
                    user_pseudo_id, first_visit_timestamp, first_traffic_source,
                    first_traffic_medium, device_category, geo_country, geo_city,
                    user_age_bracket, user_gender
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
                """,
                row['user_pseudo_id'], row['first_visit_timestamp'],
                row['first_traffic_source'], row['first_traffic_medium'],
                row['device_category'], row['geo_country'], row['geo_city'],
                row['user_age_bracket'], row['user_gender'])
        except pyodbc.IntegrityError:
            # User already exists, skip or update if needed
            pass
        except Exception as e:
            print(f"Error inserting user {row['user_pseudo_id']}: {e}")
            conn.rollback()  # Rollback in case of error
            return False

    # STEP 3.7: Commit changes to the database
    conn.commit()
    print("User data inserted successfully.")
    return True


# FUNCTION 4: Insert Events Data
def insert_events_data(conn, df):
    """Inserts cleaned event data into the events table."""
    cursor = conn.cursor()
    print(f"Inserting {len(df)} events...")

    # STEP 4.1: Prepare data for insertion
    # Ensure all columns match the events table
    events_to_insert = df[[
        'event_id', 'user_pseudo_id', 'event_timestamp', 'event_name',
        'page_location', 'page_title', 'session_id', 'session_number',
        'traffic_source', 'traffic_medium', 'device_category',
        'geo_country', 'geo_city', 'event_params_json'
    ]]

    # Example of batch insertion for performance:
    # SQL Server supports INSERT ... VALUES (...), (...), ...
    # or using executemany for multiple rows

    # STEP 4.2: Insert data row by row (handle existing events)
    # For simplicity, inserting row by row; for large datasets, use executemany or bulk insert
    for index, row in events_to_insert.iterrows():
        try:
            cursor.execute("""
                INSERT INTO events (
                    event_id, user_pseudo_id, event_timestamp, event_name,
                    page_location, page_title, session_id, session_number,
                    traffic_source, traffic_medium, device_category,
                    geo_country, geo_city, event_params_json
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """,
                row['event_id'], row['user_pseudo_id'], row['event_timestamp'],
                row['event_name'], row['page_location'], row['page_title'],
                row['session_id'], row['session_number'], row['traffic_source'],
                row['traffic_medium'], row['device_category'], row['geo_country'],
                row['geo_city'], row['event_params_json'])
        except pyodbc.IntegrityError:
            # Event already exists (due to PRIMARY KEY on event_id), skip
            pass
        except Exception as e:
            print(f"Error inserting event {row['event_id']}: {e}")
            conn.rollback()  # Rollback in case of error
            return False

    # STEP 4.3: Commit changes to the database
    conn.commit()
    print("Event data inserted successfully.")
    return True


# MAIN EXECUTION 1: This block runs when the script starts
if __name__ == "__main__":
    # MAIN EXECUTION 2: Connect to the database
    conn = connect_to_db()
    if conn is None:
        print("Failed to connect to the database. Exiting.")
    else:
        # MAIN EXECUTION 3: Load and clean data from CSV
        df_raw = load_and_clean_data(CSV_FILE_PATH)
        if df_raw is None:
            print("Failed to load or clean data. Exiting.")
        else:
            # MAIN EXECUTION 4: Insert into users table
            if insert_users_data(conn, df_raw):
                # MAIN EXECUTION 5: Insert into events table
                insert_events_data(conn, df_raw)
            else:
                print("Skipping event insertion due to user data insertion failure.")

        # MAIN EXECUTION 6: Close the database connection
        conn.close()
        print("Database connection closed.")

Important Notes on This Code

This script is a starting point. Your actual GA4 CSV file may have slightly different column names. You will need to adjust the Python code so the column names match exactly. For example, if your CSV uses page location instead of page_location, you should update the code accordingly.
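For instance, a rename step placed near STEP 2.2 might look like the sketch below. The headers on the left are hypothetical examples of what a GA4 export could contain, not your guaranteed column names.

    # Hypothetical CSV headers on the left; SQL column names used by the script on the right.
    df = df.rename(columns={
        'Page location': 'page_location',
        'Page title': 'page_title',
        'ga_session_id': 'session_id'
    })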

The code assumes that event_id is unique for every event. If your GA4 export does not provide a unique event_id, you may need to generate one manually or use a combination of fields (like timestamp, user ID, and event name) to ensure uniqueness.
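For example, a surrogate key could be built like this. The choice of fields is an assumption about your export; verify that the combination is unique enough before relying on it.

    # Hypothetical fallback: build a surrogate event_id from fields that, together,
    # should identify a single event in your export.
    df['event_id'] = (
        df['user_pseudo_id'].astype(str) + '_'
        + df['event_timestamp'].astype(str) + '_'
        + df['event_name'].astype(str)
    )
    df = df.drop_duplicates(subset=['event_id'])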

This version of the code inserts records one by one. That works fine for small files, but it can be slow for large datasets. For better performance, consider pyodbc's executemany method (ideally with fast_executemany enabled) or a SQL Server bulk loading option such as BULK INSERT.
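As a rough illustration, a batched version of the event insert (Step 4.2) might look like the sketch below. It assumes the same events table and the events_to_insert DataFrame from the script above. Note that, unlike the row-by-row loop, a single batched insert fails as a whole if any row violates the primary key, so deduplicate before calling it.

    def insert_events_fast(conn, events_to_insert):
        """Sketch: batched insert of the events DataFrame using executemany."""
        cursor = conn.cursor()
        cursor.fast_executemany = True  # lets pyodbc send parameter batches to SQL Server

        insert_sql = """
            INSERT INTO events (
                event_id, user_pseudo_id, event_timestamp, event_name,
                page_location, page_title, session_id, session_number,
                traffic_source, traffic_medium, device_category,
                geo_country, geo_city, event_params_json
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """

        # Convert DataFrame rows to plain tuples in the same column order as the SQL
        rows = list(events_to_insert.itertuples(index=False, name=None))
        cursor.executemany(insert_sql, rows)
        conn.commit()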

Make sure to replace the placeholders in the DB_CONFIG section with your actual SQL Server connection details. This includes your server name, database name, username, and password.
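If you prefer not to hardcode credentials in the script, one option is to read them from environment variables. This is a sketch using hypothetical variable names (SQL_SERVER, SQL_DATABASE, SQL_USERNAME, SQL_PASSWORD); it would replace the DB_CONFIG block shown above.

    import os

    # Hypothetical environment variable names; set them to whatever you prefer.
    DB_CONFIG = {
        'driver': '{ODBC Driver 17 for SQL Server}',
        'server': os.environ.get('SQL_SERVER', 'localhost'),
        'database': os.environ.get('SQL_DATABASE', 'YOUR_DATABASE_NAME'),
        'uid': os.environ.get('SQL_USERNAME', ''),
        'pwd': os.environ.get('SQL_PASSWORD', '')
    }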


Understanding Your Python Data Cleaning and Loading Script

This Python script acts as your personal data preparation assistant. It processes raw website data from an exported CSV file, cleans it, and loads it neatly into your SQL Server database tables. Let’s break down each section of the code to understand how it works.

1. Setting Up Your Tools and Connections

(Refer to SCRIPT SETUP 1, 2, 3 in code)

At the top of the script, you'll find a few import statements. These bring in the essential tools needed for the task:

import pandas as pd (SCRIPT SETUP 1): Loads the Pandas library, which works like a powerful spreadsheet tool in Python. It’s great for reading and organizing tabular data. We'll use it to load and clean the CSV file.

import pyodbc (SCRIPT SETUP 1): Imports the pyodbc library, which allows Python to interact with SQL Server directly. This is how the script sends data to the database.

import json (SCRIPT SETUP 1): Includes Python's built-in json module. The basic script stores event_params_json as plain text, but this module comes in handy if you later need to parse that JSON.

Next, you'll see two important configuration items:

DB_CONFIG (SCRIPT SETUP 2): These are the settings the script uses to connect to your SQL Server. It includes:

driver: Specifies the software driver for connecting to SQL Server.

server: The name of your SQL Server instance, like localhost or a network path.

database: The name of the database created for your analytics.

uid: The SQL Server username.

pwd: The password for that user.

You must replace the placeholder values with your actual SQL Server credentials.

CSV_FILE_PATH (SCRIPT SETUP 3): Indicates the exact location of your exported GA4 CSV file. Make sure this path is correct before running the script.


2. Connecting to Your Database

(Function: connect_to_db)
(Refer to FUNCTION 1 in code)

This is the first key function in the script.

What it does: Establishes a connection to your SQL Server using the values set in DB_CONFIG.

How it works: Builds a connection string and uses pyodbc to attempt a connection.

Safety check: If the connection is successful, it prints a confirmation message. If not (e.g. due to incorrect credentials), it prints an error to help you troubleshoot.
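If the connection fails, a common culprit is the ODBC driver name. pyodbc can list the drivers installed on your machine, so a quick check like this (an optional addition, not part of the script) can save troubleshooting time:

    import pyodbc

    # Print the ODBC drivers installed on this machine.
    # The driver named in DB_CONFIG must appear in this list.
    print(pyodbc.drivers())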


3. Loading and Cleaning Your Data

(Function: load_and_clean_data)
(Refer to FUNCTION 2 and internal steps 2.1 to 2.6 in code)

This function uses Pandas to process your CSV and prepare the data for the database.

What it does: Loads your CSV into a Pandas DataFrame and applies a series of basic cleaning steps.

How it works:

Step 2.1 – Load CSV into DataFrame: Reads your CSV into a variable named df, which acts like a table.

Step 2.2 – Basic cleaning steps: Rename or adjust column names as needed to match the SQL Server schema.

Step 2.3 – Timestamp Conversion: Ensures that the event_timestamp column is stored as an integer. This is crucial for compatibility with SQL Server.

Step 2.4 – Handling Missing Text: Replaces any missing values in text fields with empty strings to avoid insertion errors.

Step 2.5 – Ensuring String Types: Converts IDs like event_id and user_pseudo_id into strings. This ensures data consistency.

Step 2.6 – Removing Duplicates: Drops rows with duplicate event_id values to maintain uniqueness in the database.

Safety check: It prints how many rows were loaded, and how many remain after duplicate removal. Any loading or cleaning errors are also reported.
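Before handing the cleaned data to the insert functions, it can also help to eyeball the DataFrame. This optional snippet is not part of the script above; it simply reuses load_and_clean_data to print column types, a few sample rows, and any remaining missing values.

    # Optional sanity check: run the cleaning function and inspect the result
    df_clean = load_and_clean_data(CSV_FILE_PATH)

    print(df_clean.dtypes)         # column names and data types
    print(df_clean.head())         # first few rows
    print(df_clean.isna().sum())   # remaining missing values per column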


4. Inserting User Data

(Function: insert_users_data)
(Refer to FUNCTION 3 and internal steps 3.1 to 3.7 in code)

This function inserts user-specific information into the users table.

What it does: Identifies unique users and loads their information into your SQL Server.

How it works:

Step 3.1 – Extract unique users and first visit details: Sorts users by time and keeps the first visit for each user_pseudo_id.

Step 3.2 – Select relevant columns: Filters the DataFrame to only include columns needed by the users table.

Step 3.3 – Rename columns: Adjusts column names from the CSV to match your table schema.

Step 3.4 – Add placeholders: Adds user_age_bracket and user_gender if these fields aren't present in the CSV.

Step 3.5 – Prepare for insertion: Finalizes the data for database insertion.

Step 3.6 – Insert row by row: Adds each user to the database, skipping any that already exist to prevent duplication. (A MERGE-based alternative for this step is sketched just after this list.)

Step 3.7 – Commit changes: Saves all insert operations to the database.
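The code comments in Step 3.6 point out that SQL Server's MERGE statement is the more robust way to make re-runs safe. Here is a hedged sketch of what that could look like, assuming the same users table and the users_final DataFrame built inside the function above; treat it as an optional refinement rather than part of the tested script.

    def upsert_users_with_merge(conn, users_final):
        """Sketch: idempotent user insert using SQL Server's MERGE statement."""
        cursor = conn.cursor()
        merge_sql = """
            MERGE users AS target
            USING (SELECT ? AS user_pseudo_id) AS source
                ON target.user_pseudo_id = source.user_pseudo_id
            WHEN NOT MATCHED THEN
                INSERT (user_pseudo_id, first_visit_timestamp, first_traffic_source,
                        first_traffic_medium, device_category, geo_country, geo_city,
                        user_age_bracket, user_gender)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);
        """
        for _, row in users_final.iterrows():
            cursor.execute(merge_sql,
                           row['user_pseudo_id'],            # feeds the USING clause
                           row['user_pseudo_id'], row['first_visit_timestamp'],
                           row['first_traffic_source'], row['first_traffic_medium'],
                           row['device_category'], row['geo_country'], row['geo_city'],
                           row['user_age_bracket'], row['user_gender'])
        conn.commit()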


5. Inserting Event Data

(Function: insert_events_data)
(Refer to FUNCTION 4 and internal steps 4.1 to 4.3 in code)

This function handles the event records from the CSV.

What it does: Loads every cleaned event into the events table in SQL Server.

How it works:

Step 4.1 – Prepare data: Selects and structures only the required columns for the events table.

Step 4.2 – Insert row by row: Adds each event to the database. If a record with the same event_id already exists, it skips it.

Step 4.3 – Commit changes: Finalizes and saves the inserts to the database.


6. Running the Script

(Refer to MAIN EXECUTION 1 to 6 in code)

This section brings everything together. It’s located in the if __name__ == "__main__": block.

Step 1 – Connect: Calls connect_to_db() to initiate a database connection.

Step 2 – Load and Clean: If the connection works, it calls load_and_clean_data() to process the CSV.

Step 3 – Insert Users: If the data is cleaned successfully, user data is inserted using insert_users_data().

Step 4 – Insert Events: Once users are added, it calls insert_events_data() to load event records.

Step 5 – Close Connection: Regardless of success or failure, the script ensures the database connection is closed properly. This is a best practice for stability.
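If you want the close to happen even when one of the insert functions raises an unexpected error, a try/finally block is a common pattern. This sketch reshapes the main block along those lines; it is an optional variation, not the script's current structure.

    # Sketch: same flow as the script's main block, with try/finally for cleanup.
    if __name__ == "__main__":
        conn = connect_to_db()
        if conn is not None:
            try:
                df_raw = load_and_clean_data(CSV_FILE_PATH)
                if df_raw is not None and insert_users_data(conn, df_raw):
                    insert_events_data(conn, df_raw)
            finally:
                conn.close()
                print("Database connection closed.")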


Overall Value of This Script

This Python script is a powerful part of your web analytics project. It transforms raw, messy CSV files into structured, organized data inside your SQL Server database. This prepares your data for detailed analysis and dashboard creation later on. The script automates a critical step of your data pipeline, making your project efficient and professional.





