Five Essential SQL Queries Every Data Analyst Needs for Effective Data Cleaning Operations

If you’ve ever inherited a dataset which includes duplicate records, missing values, inconsistent formatting, you know that data cleaning isn’t optional. It’s the foundation of every analysis you’ll ever do. Today, I’m showing you five SQL queries that will handle 80% of your data cleaning challenges. These aren’t theoretical examples. These are the exact patterns I use every single week in production environments.

CLICK TO DOWNLOAD THE DATA

SECTION 1: FINDING AND REMOVING DUPLICATES

“The first challenge we’re tackling is duplicates. And here’s the thing — duplicates aren’t always exact copies. Sometimes you’ve got the same customer entered twice with slightly different email addresses, or the same transaction logged multiple times.

Let me show you the most reliable way to handle this.

WITH ranked_records AS (
    SELECT 
        *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, email 
            ORDER BY created_date DESC
        ) AS row_num
    FROM customers
)
SELECT *
FROM ranked_records
WHERE row_num = 1;

Here’s what’s happening. We’re using a Common Table Expression that’s the WITH clause to create a temporary result set. Inside it, we’re using ROW_NUMBER with PARTITION BY.

PARTITION BY is doing the heavy lifting here. It’s grouping records by customer_id and email, then numbering them. The ORDER BY created_date DESC means we keep the most recent record.

When we filter for row_num = 1, we get exactly one record per unique combination. Before: 1,247 records. After: 1,089 records. We just eliminated 158 duplicates with complete confidence that we kept the right ones.

If you want to delete these duplicates permanently, you’d wrap this in a DELETE statement but always, always test with SELECT first.

SECTION 2: HANDLING MISSING VALUES INTELLIGENTLY

Missing data is inevitable. The question is: what do you do about it?

First, you need to understand WHERE your nulls are and HOW MANY you have.

SELECT 
    COUNT(*) AS total_records,
    COUNT(customer_name) AS records_with_name,
    COUNT(*) - COUNT(customer_name) AS missing_names,
    ROUND(100.0 * (COUNT(*) - COUNT(customer_name)) / COUNT(*), 2) AS pct_missing_names,
    COUNT(email) AS records_with_email,
    COUNT(*) - COUNT(email) AS missing_emails,
    ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 2) AS pct_missing_emails
FROM customers;

This gives you a completeness report. You can see that 15% of records are missing email addresses. That’s important context before you decide how to handle them. Now, here’s a pattern for filling in missing values using COALESCE.

SELECT 
    customer_id,
    COALESCE(phone_primary, phone_secondary, 'No phone available') AS contact_phone,
    COALESCE(city, 'Unknown') AS city,
    COALESCE(purchase_amount, 0) AS purchase_amount
FROM customers;

COALESCE takes the first non-null value in the list. So if phone_primary is null, it checks phone_secondary. If that’s also null, it uses our fallback text. Be careful with this on numeric columns. Replacing nulls with zero can skew your averages. Sometimes the right move is to exclude nulls entirely, not replace them.

SELECT AVG(purchase_amount)
FROM customers
WHERE purchase_amount IS NOT NULL;

The key is understanding what the null represents. Is it genuinely zero, or is it missing information? That determines your strategy.

SECTION 3: STANDARDIZING INCONSISTENT TEXT DATA

Inconsistent text formatting will destroy your groupings and aggregations. Look at this country column. Same country, four different values. Here’s how you fix it.

SELECT 
    customer_id,
    UPPER(TRIM(country)) AS country_clean,
    CASE 
        WHEN UPPER(TRIM(country)) IN ('USA', 'UNITED STATES', 'U.S.A.', 'US') 
            THEN 'USA'
        WHEN UPPER(TRIM(country)) IN ('UK', 'UNITED KINGDOM', 'GREAT BRITAIN') 
            THEN 'UK'
        WHEN UPPER(TRIM(country)) IN ('UAE', 'UNITED ARAB EMIRATES') 
            THEN 'UAE'
        ELSE UPPER(TRIM(country))
    END AS country_standardized
FROM customers;

We’re combining three functions here. TRIM removes leading and trailing spaces. UPPER converts everything to uppercase for consistency. Then the CASE statement maps variations to a single standard value. Now your country grouping actually works. This same pattern applies to product names, categories, status fields anywhere you’ve got human data entry, you’ll need this.

SECTION 4: IDENTIFYING AND REMOVING OUTLIERS

Outliers can completely distort your analysis. A single typo someone enters 100,000 instead of 100 and your average is useless. Here’s how to identify them using statistical boundaries.

WITH stats AS (
    SELECT 
        AVG(purchase_amount) AS mean_amount,
        STDEV(purchase_amount) AS stddev_amount
    FROM transactions
),
flagged_records AS (
    SELECT 
        t.*,
        s.mean_amount,
        s.stddev_amount,
        CASE 
            WHEN ABS(t.purchase_amount - s.mean_amount) > 3 * s.stddev_amount 
                THEN 'Outlier'
            ELSE 'Normal'
        END AS outlier_flag
    FROM transactions t
    CROSS JOIN stats s
)
SELECT *
FROM flagged_records
WHERE outlier_flag = 'Outlier';

This uses the three standard deviations rule. In a normal distribution, 99.7% of data falls within three standard deviations of the mean. Anything beyond that is statistically unusual. We’ve identified all outliers. Now, here’s the important part don’t automatically delete them. Investigate first.

SELECT 
    transaction_id,
    customer_id,
    purchase_amount,
    purchase_date,
    payment_method
FROM flagged_records
WHERE outlier_flag = 'Outlier'
ORDER BY purchase_amount DESC;

Sometimes outliers are legitimate a bulk order, a corporate purchase. Sometimes they’re errors. You need domain knowledge to decide. For ongoing data pipelines, you can use this pattern to automatically flag suspicious records for manual review before they enter your reporting layer.

SECTION 5: VALIDATING DATA AGAINST BUSINESS RULES

The last essential query is validation. You need to check that your data actually makes sense according to your business rules. Here’s a comprehensive validation query that checks multiple conditions at once.

SELECT 
    order_id,
    customer_id,
    order_date,
    ship_date,
    quantity,
    unit_price,
    total_amount,
    CASE WHEN order_date > GETDATE() THEN 'Future order date' END AS date_error,
    CASE WHEN ship_date < order_date THEN 'Ship before order' END AS logic_error,
    CASE WHEN quantity <= 0 THEN 'Invalid quantity' END AS quantity_error,
    CASE WHEN unit_price <= 0 THEN 'Invalid price' END AS price_error,
    CASE WHEN ABS(quantity * unit_price - total_amount) > 0.01 THEN 'Calculation mismatch' END AS calc_error,
    CASE WHEN email NOT LIKE '%_@__%.__%' THEN 'Invalid email format' END AS email_error
FROM orders
WHERE 
    order_date > GETDATE()
    OR ship_date < order_date
    OR quantity <= 0
    OR unit_price <= 0
    OR ABS(quantity * unit_price - total_amount) > 0.01
    OR email NOT LIKE '%_@__%.__%';

This query checks six different business rules:

No future order dates
Ship date can’t be before order date
Quantity must be positive
Price must be positive
Total amount should equal quantity times unit price
Email must have basic valid format

Every row returned is a violation. You get the specific error in its own column, so you can filter and fix by error type.

SELECT 
    COUNT(CASE WHEN order_date > GETDATE() THEN 1 END) AS future_dates,
    COUNT(CASE WHEN ship_date < order_date THEN 1 END) AS logic_errors,
    COUNT(CASE WHEN quantity <= 0 THEN 1 END) AS quantity_errors,
    COUNT(CASE WHEN unit_price <= 0 THEN 1 END) AS price_errors,
    COUNT(CASE WHEN ABS(quantity * unit_price - total_amount) > 0.01 THEN 1 END) AS calc_errors
FROM orders;

This gives you a dashboard view of data quality. You can track these metrics over time to see if your data quality is improving or degrading.

Conclusion

Let’s recap. These five query patterns will handle the majority of your data cleaning work:

ROW_NUMBER with PARTITION BY for intelligent duplicate removal
COALESCE and NULL counting for handling missing values
CASE statements with TRIM and UPPER for standardizing text
Statistical outlier detection using standard deviations
Multi-condition validation to enforce business rules

The key to effective data cleaning isn’t knowing exotic functions. It’s having reliable patterns you can adapt to any dataset.