If you've ever inherited a dataset with duplicate records, missing values, and inconsistent formatting, you know that data cleaning isn't optional. It's the foundation of every analysis you'll ever do. Today, I'm showing you five SQL queries that will handle 80% of your data cleaning challenges. These aren't theoretical examples; they're the exact patterns I use every single week in production environments.

SECTION 1: FINDING AND REMOVING DUPLICATES

The first challenge we're tackling is duplicates. And here's the thing: duplicates aren't always exact copies. Sometimes you've got the same customer entered twice with slightly different email addresses, or the same transaction logged multiple times. Let me show you the most reliable way to handle this.

```sql
WITH ranked_records AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, email
               ORDER BY created_date DESC
           ) AS row_num
    FROM customers
)
SELECT *
FROM ranked_records
WHERE row_num = 1;
```

Here's what's happening. We're using a Common Table Expression (the WITH clause) to create a temporary result set. Inside it, we're using ROW_NUMBER with PARTITION BY. PARTITION BY is doing the heavy lifting here: it groups records by customer_id and email, then numbers them within each group. The ORDER BY created_date DESC means the most recent record gets number 1. When we filter for row_num = 1, we get exactly one record per unique combination.

Before: 1,247 records. After: 1,089 records. We just eliminated 158 duplicates with complete confidence that we kept the right ones. If you want to delete these duplicates permanently, you'd wrap this in a DELETE statement, but always, always test with SELECT first.

SECTION 2: HANDLING MISSING VALUES INTELLIGENTLY

Missing data is inevitable. The question is: what do you do about it? First, you need to understand WHERE your nulls are and HOW MANY you have.
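Since the `customers` table here is hypothetical, below is a minimal runnable sketch of the same dedup pattern using Python's `sqlite3` (SQLite 3.25+ supports window functions). The table and rows are invented for illustration.

```python
# Sketch of the ROW_NUMBER() dedup pattern against an in-memory SQLite
# database. Table layout and data are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, email TEXT, created_date TEXT);
INSERT INTO customers VALUES
  (1, 'a@x.com', '2024-01-01'),
  (1, 'a@x.com', '2024-03-01'),  -- duplicate; newer, so this copy survives
  (2, 'b@x.com', '2024-02-15');
""")

rows = conn.execute("""
WITH ranked_records AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, email
               ORDER BY created_date DESC
           ) AS row_num
    FROM customers
)
SELECT customer_id, email, created_date
FROM ranked_records
WHERE row_num = 1;
""").fetchall()
# rows now holds one record per (customer_id, email) combination
```

Running this leaves two rows: the newer copy of customer 1 and the single record for customer 2.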
```sql
SELECT COUNT(*) AS total_records,
       COUNT(customer_name) AS records_with_name,
       COUNT(*) - COUNT(customer_name) AS missing_names,
       ROUND(100.0 * (COUNT(*) - COUNT(customer_name)) / COUNT(*), 2) AS pct_missing_names,
       COUNT(email) AS records_with_email,
       COUNT(*) - COUNT(email) AS missing_emails,
       ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 2) AS pct_missing_emails
FROM customers;
```

This gives you a completeness report. COUNT(column) skips nulls while COUNT(*) doesn't, and that difference is the whole trick. You can see, for example, that 15% of records are missing email addresses. That's important context before you decide how to handle them. Now, here's a pattern for filling in missing values using COALESCE.

```sql
SELECT customer_id,
       COALESCE(phone_primary, phone_secondary, 'No phone available') AS contact_phone,
       COALESCE(city, 'Unknown') AS city,
       COALESCE(purchase_amount, 0) AS purchase_amount
FROM customers;
```

COALESCE returns the first non-null value in the list. So if phone_primary is null, it checks phone_secondary. If that's also null, it uses our fallback text. Be careful with this on numeric columns: replacing nulls with zero can skew your averages. Sometimes the right move is to exclude nulls entirely, not replace them.

```sql
SELECT AVG(purchase_amount)
FROM customers
WHERE purchase_amount IS NOT NULL;
```

(Strictly speaking, AVG already ignores nulls; the explicit filter just makes the intent visible.) The key is understanding what the null represents. Is it genuinely zero, or is it missing information? That determines your strategy.

SECTION 3: STANDARDIZING INCONSISTENT TEXT DATA

Inconsistent text formatting will destroy your groupings and aggregations. Look at this country column. Same country, four different values. Here's how you fix it.

```sql
SELECT customer_id,
       UPPER(TRIM(country)) AS country_clean,
       CASE
           WHEN UPPER(TRIM(country)) IN ('USA', 'UNITED STATES', 'U.S.A.', 'US') THEN 'USA'
           WHEN UPPER(TRIM(country)) IN ('UK', 'UNITED KINGDOM', 'GREAT BRITAIN') THEN 'UK'
           WHEN UPPER(TRIM(country)) IN ('UAE', 'UNITED ARAB EMIRATES') THEN 'UAE'
           ELSE UPPER(TRIM(country))
       END AS country_standardized
FROM customers;
```

We're combining three functions here.
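Here's a small runnable sketch of the null-profiling and COALESCE patterns, again against an in-memory SQLite database with invented columns and rows.

```python
# Sketch of COUNT-based null profiling and COALESCE fallbacks in SQLite.
# The customers table and its values are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
  customer_id INTEGER, phone_primary TEXT, phone_secondary TEXT, city TEXT
);
INSERT INTO customers VALUES
  (1, '555-0100', NULL,       'Austin'),
  (2, NULL,       '555-0200', NULL),
  (3, NULL,       NULL,       'Denver');
""")

# COUNT(column) skips NULLs, COUNT(*) does not: the gap is the missing count.
total, with_city = conn.execute(
    "SELECT COUNT(*), COUNT(city) FROM customers"
).fetchone()

# COALESCE walks the argument list and returns the first non-null value.
filled = conn.execute("""
SELECT customer_id,
       COALESCE(phone_primary, phone_secondary, 'No phone available') AS phone
FROM customers
ORDER BY customer_id;
""").fetchall()
```

Customer 1 keeps its primary phone, customer 2 falls back to the secondary, and customer 3 gets the literal fallback text.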
TRIM removes leading and trailing spaces. UPPER converts everything to uppercase for consistency. Then the CASE statement maps variations to a single standard value. Now your country grouping actually works. This same pattern applies to product names, categories, status fields: anywhere you've got human data entry, you'll need this.

SECTION 4: IDENTIFYING AND REMOVING OUTLIERS

Outliers can completely distort your analysis. A single typo, say someone enters 100,000 instead of 100, and your average is useless. Here's how to identify them using statistical boundaries.

```sql
WITH stats AS (
    SELECT AVG(purchase_amount) AS mean_amount,
           STDDEV(purchase_amount) AS stddev_amount
    FROM transactions
),
flagged_records AS (
    SELECT t.*,
           s.mean_amount,
           s.stddev_amount,
           CASE
               WHEN ABS(t.purchase_amount - s.mean_amount) > 3 * s.stddev_amount THEN 'Outlier'
               ELSE 'Normal'
           END AS outlier_flag
    FROM transactions t
    CROSS JOIN stats s
)
SELECT *
FROM flagged_records
WHERE outlier_flag = 'Outlier';
```

This uses the three-standard-deviations rule. In a normal distribution, 99.7% of data falls within three standard deviations of the mean. Anything beyond that is statistically unusual. We've identified all outliers. Now, here's the important part: don't automatically delete them. Investigate first.

```sql
-- Reusing the stats and flagged_records CTEs from the query above
SELECT transaction_id, customer_id, purchase_amount, purchase_date, payment_method
FROM flagged_records
WHERE outlier_flag = 'Outlier'
ORDER BY purchase_amount DESC;
```

Sometimes outliers are legitimate: a bulk order, a corporate purchase. Sometimes they're errors. You need domain knowledge to decide. For ongoing data pipelines, you can use this pattern to automatically flag suspicious records for manual review before they enter your reporting layer.

SECTION 5: VALIDATING DATA AGAINST BUSINESS RULES

The last essential query is validation. You need to check that your data actually makes sense according to your business rules. Here's a comprehensive validation query that checks multiple conditions at once.
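Before the validation query, here's a runnable sketch of the Section 4 outlier flag. SQLite has no built-in STDDEV, so this sketch computes the mean and standard deviation in Python and passes them into the query as parameters; the transactions and amounts are invented.

```python
# Sketch of the 3-standard-deviation outlier flag. The stats come from
# Python's statistics module since SQLite lacks a STDDEV aggregate.
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_id INTEGER, purchase_amount REAL)"
)
amounts = [100.0] * 10 + [100000.0]  # the last value is a likely typo
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)", enumerate(amounts, start=1)
)

mean = statistics.mean(amounts)
stddev = statistics.pstdev(amounts)  # population stddev, like SQL's STDDEV_POP

outliers = conn.execute("""
SELECT transaction_id, purchase_amount
FROM transactions
WHERE ABS(purchase_amount - ?) > 3 * ?;
""", (mean, stddev)).fetchall()
```

One caveat worth knowing: on very small samples, a single extreme value inflates the standard deviation so much that it can hide itself from the 3-sigma cut, which is one more reason to inspect flagged rows rather than delete them blindly.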
```sql
SELECT order_id, customer_id, order_date, ship_date, quantity, unit_price, total_amount,
       CASE WHEN order_date > CURRENT_DATE THEN 'Future order date' END AS date_error,
       CASE WHEN ship_date < order_date THEN 'Ship before order' END AS logic_error,
       CASE WHEN quantity <= 0 THEN 'Invalid quantity' END AS quantity_error,
       CASE WHEN unit_price <= 0 THEN 'Invalid price' END AS price_error,
       CASE WHEN ABS(quantity * unit_price - total_amount) > 0.01 THEN 'Calculation mismatch' END AS calc_error,
       CASE WHEN email NOT LIKE '%_@__%.__%' THEN 'Invalid email format' END AS email_error
FROM orders
WHERE order_date > CURRENT_DATE
   OR ship_date < order_date
   OR quantity <= 0
   OR unit_price <= 0
   OR ABS(quantity * unit_price - total_amount) > 0.01
   OR email NOT LIKE '%_@__%.__%';
```

This query checks six different business rules:

- No future order dates
- Ship date can't be before order date
- Quantity must be positive
- Price must be positive
- Total amount should equal quantity times unit price
- Email must have a basic valid format

Every row returned is a violation. You get the specific error in its own column, so you can filter and fix by error type.

```sql
SELECT COUNT(CASE WHEN order_date > CURRENT_DATE THEN 1 END) AS future_dates,
       COUNT(CASE WHEN ship_date < order_date THEN 1 END) AS logic_errors,
       COUNT(CASE WHEN quantity <= 0 THEN 1 END) AS quantity_errors,
       COUNT(CASE WHEN ABS(quantity * unit_price - total_amount) > 0.01 THEN 1 END) AS calc_errors
FROM orders;
```

This gives you a dashboard view of data quality. You can track these metrics over time to see if your data quality is improving or degrading.

Conclusion

Let's recap. These five query patterns will handle the majority of your data cleaning work:

- ROW_NUMBER with PARTITION BY for intelligent duplicate removal
- COALESCE and NULL counting for handling missing values
- CASE statements with TRIM and UPPER for standardizing text
- Statistical outlier detection using standard deviations
- Multi-condition validation to enforce business rules

The key to effective data cleaning isn't knowing exotic functions. It's having reliable patterns you can adapt to any dataset.
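To close, here's a trimmed, runnable sketch of the multi-rule validation pattern. It checks three of the rules (ship-before-order, non-positive quantity, total mismatch) plus the email format against an invented SQLite orders table; the full query in the text adds date and price checks the same way.

```python
# Sketch of multi-rule validation: one CASE column per rule, NULL when the
# rule passes. Orders table and rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
  order_id INTEGER, order_date TEXT, ship_date TEXT,
  quantity INTEGER, unit_price REAL, total_amount REAL, email TEXT
);
INSERT INTO orders VALUES
  (1, '2024-01-10', '2024-01-12', 2, 9.99, 19.98, 'a@example.com'),  -- clean
  (2, '2024-01-10', '2024-01-05', 1, 9.99,  9.99, 'b@example.com'),  -- ships before order
  (3, '2024-01-10', '2024-01-12', 0, 9.99,  0.00, 'c@example.com'),  -- zero quantity
  (4, '2024-01-10', '2024-01-12', 3, 9.99, 99.99, 'not-an-email');   -- bad total and email
""")

violations = conn.execute("""
SELECT order_id,
       CASE WHEN ship_date < order_date THEN 'Ship before order' END AS logic_error,
       CASE WHEN quantity <= 0 THEN 'Invalid quantity' END AS quantity_error,
       CASE WHEN ABS(quantity * unit_price - total_amount) > 0.01
            THEN 'Calculation mismatch' END AS calc_error,
       CASE WHEN email NOT LIKE '%_@__%.__%'
            THEN 'Invalid email format' END AS email_error
FROM orders
WHERE ship_date < order_date
   OR quantity <= 0
   OR ABS(quantity * unit_price - total_amount) > 0.01
   OR email NOT LIKE '%_@__%.__%'
ORDER BY order_id;
""").fetchall()
```

The clean order never appears in the result; every returned row carries its specific failure in the matching error column and NULL everywhere else, which is what makes filtering by error type possible.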