Understanding Self Join in SQL for Banking Data Analysis
Note: All SQL queries used in the Case study is tested on Oracle,SQLServer,MySQL and PostgreSQL.
1. Find customers who share the same phone number.
SELECT c1.CustomerID, c1.Name, c1.Phone
FROM Customers c1
JOIN Customers c2 ON c1.Phone = c2.Phone
WHERE c1.CustomerID <> c2.CustomerID;
Explanation
This SQL query does a self join and returns customers who have the same number in the Customers table. Here, the table is being joined to itself; c1 and c2 are just aliases for the same Customers table. The JOIN condition is that the Phone number should match between both occurrences of the table (c1.Phone = c2.Phone). This is further ensured by making sure that the customers being compared are different individuals by checking if their CustomerID values are not equal-c1.CustomerID<>c2.CustomerID. Conclusion The result set returns CustomerID, Name, and Phone customers with the same phone number. This query works for Oracle, MySQL, PostgreSQL and SQL Server: the syntax of the self join is similar across all these databases.
2.List customers who live at the same address.
SELECT c1.CustomerID, c1.Name, c1.Address
FROM Customers c1
JOIN Customers c2 ON c1.Address = c2.Address
WHERE c1.CustomerID <> c2.CustomerID;
Explanation
This SQL statement uses a self join to find customers who live in the same address in the Customers table. Here, the same table is referenced twice with the aliases c1 and c2. The JOIN condition states that addresses should match for the two copies of the table (c1.Address = c2.Address). In addition to extracting unique customers only, excluding duplicate pairs in which a customer is compared to himself/herself, the WHERE clause checks for the condition that the CustomerID of the two instances differs as c1.CustomerID <> c2.CustomerID. The result set will contain the CustomerID, Name and Address of customers with the same address thereby making an effective indication of their duplicates based on their resident information. This can be very handy when working with customer data to be analyzed for potential duplicates or as part of a targeted marketing effort that is specific to geographic regions. For Oracle, MySQL, PostgreSQL and SQL Server the query is database system independent since the syntax of self joins is similar among these systems.
3.Show pairs of customers who share the same email domain (e.g., gmail.com).
Oracle
SELECT c1.CustomerID AS Customer1ID, c1.Name AS Customer1Name,
c2.CustomerID AS Customer2ID, c2.Name AS Customer2Name,
SUBSTR(c1.Email, INSTR(c1.Email, '@') + 1) AS EmailDomain
FROM Customers c1
JOIN Customers c2
ON
SUBSTR(c1.Email, INSTR(c1.Email, '@') + 1) = SUBSTR(c2.Email, INSTR(c2.Email, '@') + 1)
WHERE c1.CustomerID <> c2.CustomerID;
Explanation
The SQL query is written to deliver a group of pairs of customers from the table Customers based on the same email domain. This has been through a self join on the Customers table named c1 and c2. In the SELECT statement, it comes out with the CustomerID and Name for each customer but naming it Customer1ID, Customer1Name for the first case at c1 and Customer2ID, Customer2Name for the case at c2. Further, it retrieves the domain portion of the e-mail addresses of c1 by using the SUBSTR and INSTR functions together to isolate the part of the e-mail after the '@' character and names that result as EmailDomain. The join condition is that this customer instances' email domains must be equal to each other. Therefore, only those customers with the same email domain will be matched together. This WHERE clause eliminates the result set of cases where two customer records are the same to avoid comparison of a record with itself. The resultant list turns out to be distinct customer pairs who happen to belong to the same email domain and which sometimes is extremely crucial while identifying potential duplicates or analyzing the relationship between customers about shared email characteristics.
______________
MySQL
SELECT c1.CustomerID AS Customer1ID, c1.Name AS Customer1Name,
c2.CustomerID AS Customer2ID, c2.Name AS Customer2Name,
SUBSTRING_INDEX(c1.Email, '@', -1) AS EmailDomain
FROM Customers c1
JOIN Customers c2
ON
SUBSTRING_INDEX(c1.Email, '@', -1) = SUBSTRING_INDEX(c2.Email, '@', -1)
WHERE c1.CustomerID <> c2.CustomerID;
Explanation
The query finds the pairs of customers within the Customers table that have an email address with the same domain. A self join refers to referencing the Customers table twice, using different aliases. The SELECT statement returns CustomerID and Name both instances being listed with labels Customer1ID, Customer1Name, Customer2ID and Customer2Name. It also pulls the domain component of the email address of c1 with the SUBSTRING_INDEX function which will isolate the segment of an email after the '@' character and assigns a name for this result as EmailDomain. Join considers only those pairs of customers by looking through the comparison of the pulled email domains on both instances that match. The WHERE clause excludes cases where two records by the same customer just coincidentally happen to be the same. So there can't be any comparison made with themselves. A list of distinct unique pairs of customers who happen to share the same email domain can be returned which will then be used for the elimination of potential duplicates or used within a marketing strategy targeting their common email addresses.
___________________
PostgreSQL
SELECT c1.CustomerID AS Customer1ID, c1.Name AS Customer1Name,
c2.CustomerID AS Customer2ID, c2.Name AS Customer2Name,
SPLIT_PART(c1.Email, '@', 2) AS EmailDomain
FROM Customers c1
JOIN Customers c2
ON
SPLIT_PART(c1.Email, '@', 2) = SPLIT_PART(c2.Email, '@', 2)
WHERE c1.CustomerID <>c2.CustomerID;
_____________
SQL Server
SELECT c1.CustomerID AS Customer1ID, c1.Name AS Customer1Name,
c2.CustomerID AS Customer2ID, c2.Name AS Customer2Name,
RIGHT(c1.Email, LEN(c1.Email) - CHARINDEX('@', c1.Email)) AS EmailDomain
FROM Customers c1
JOIN Customers c2
ON
RIGHT(c1.Email, LEN(c1.Email) - CHARINDEX('@', c1.Email)) = RIGHT(c2.Email, LEN(c2.Email) - CHARINDEX('@', c2.Email))
WHERE c1.CustomerID <> c2.CustomerID;