Five Essential SQL Queries Every Data Analyst Needs for Effective Data Cleaning Operations

If you’ve ever inherited a dataset riddled with duplicate records, missing values, and inconsistent formatting, you know that data cleaning isn’t optional. It’s the foundation of every analysis you’ll ever do. Today, I’m showing you five SQL queries that will handle 80% of your data cleaning challenges. These aren’t theoretical examples. These are the exact patterns I use every single week in production environments.

SECTION 1: FINDING AND REMOVING DUPLICATES

The first challenge we’re tackling is duplicates. And here’s the thing: duplicates aren’t always exact copies. Sometimes you’ve got the same customer entered twice with slightly different email addresses, or the same transaction logged multiple times.

Let me show you the most reliable way to handle this.
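
A sketch of the pattern being described, assuming a customers table with customer_id, email, and created_date columns (the table name and aliases are placeholders):

    WITH ranked AS (
        SELECT
            c.*,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id, email
                ORDER BY created_date DESC
            ) AS row_num
        FROM customers AS c
    )
    SELECT *
    FROM ranked
    WHERE row_num = 1;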

Here’s what’s happening. We’re using a Common Table Expression, the WITH clause, to create a temporary result set. Inside it, we’re using ROW_NUMBER with PARTITION BY.

PARTITION BY is doing the heavy lifting here. It’s grouping records by customer_id and email, then numbering them. The ORDER BY created_date DESC means we keep the most recent record.

When we filter for row_num = 1, we get exactly one record per unique combination. Before: 1,247 records. After: 1,089 records. We just eliminated 158 duplicates with complete confidence that we kept the right ones.

If you want to delete these duplicates permanently, you’d wrap this in a DELETE statement, but always, always test with a SELECT first.
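
One way that can look, sketched in SQL Server syntax, where a CTE can be the direct target of a DELETE (other databases typically need a subquery keyed on a primary key instead):

    WITH ranked AS (
        SELECT
            c.*,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id, email
                ORDER BY created_date DESC
            ) AS row_num
        FROM customers AS c
    )
    DELETE FROM ranked
    WHERE row_num > 1;  -- removes everything except the newest record in each group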

SECTION 2: HANDLING MISSING VALUES INTELLIGENTLY

Missing data is inevitable. The question is: what do you do about it?

First, you need to understand WHERE your nulls are and HOW MANY you have.
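
A minimal version of that check, again assuming a customers table; COUNT(column) skips nulls, so comparing it to COUNT(*) tells you how much is missing:

    SELECT
        COUNT(*)                        AS total_rows,
        COUNT(*) - COUNT(email)         AS missing_email,
        ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 1) AS pct_missing_email,
        COUNT(*) - COUNT(phone_primary) AS missing_phone
    FROM customers;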

This gives you a completeness report. You can see that 15% of records are missing email addresses. That’s important context before you decide how to handle them. Now, here’s a pattern for filling in missing values using COALESCE.
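
Here is a sketch of that pattern; phone_primary and phone_secondary come from the explanation below, and the fallback text is a placeholder:

    SELECT
        customer_id,
        COALESCE(phone_primary, phone_secondary, 'no phone on file') AS best_phone
    FROM customers;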

COALESCE takes the first non-null value in the list. So if phone_primary is null, it checks phone_secondary. If that’s also null, it uses our fallback text. Be careful with this on numeric columns. Replacing nulls with zero can skew your averages. Sometimes the right move is to exclude nulls entirely, not replace them.
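
For numeric columns, the difference is easy to see. AVG already skips nulls, so zero-filling first changes the answer (orders and order_total are assumed names):

    -- Averages only the rows that actually have a value
    SELECT AVG(order_total) FROM orders;

    -- Treats every missing value as zero, which drags the average down
    SELECT AVG(COALESCE(order_total, 0)) FROM orders;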

The key is understanding what the null represents. Is it genuinely zero, or is it missing information? That determines your strategy.

SECTION 3: STANDARDIZING INCONSISTENT TEXT DATA

Inconsistent text formatting will destroy your groupings and aggregations. Look at this country column. Same country, four different values. Here’s how you fix it.
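
A sketch of the cleanup, assuming the raw values look something like 'USA', ' usa', 'United States', and 'U.S.' (the variations and the table name are placeholders):

    SELECT
        country_clean,
        COUNT(*) AS record_count
    FROM (
        SELECT
            CASE UPPER(TRIM(country))
                WHEN 'USA'           THEN 'United States'
                WHEN 'US'            THEN 'United States'
                WHEN 'U.S.'          THEN 'United States'
                WHEN 'UNITED STATES' THEN 'United States'
                ELSE TRIM(country)
            END AS country_clean
        FROM customers
    ) AS cleaned
    GROUP BY country_clean;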

We’re combining three functions here. TRIM removes leading and trailing spaces. UPPER converts everything to uppercase for consistency. Then the CASE statement maps variations to a single standard value. Now your country grouping actually works. This same pattern applies to product names, categories, and status fields. Anywhere you’ve got human data entry, you’ll need this.

SECTION 4: IDENTIFYING AND REMOVING OUTLIERS

Outliers can completely distort your analysis. A single typo, say someone entering 100,000 instead of 100, and your average is useless. Here’s how to identify them using statistical boundaries.
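
A sketch of that approach, assuming an orders table with an order_amount column (STDDEV is spelled STDEV on SQL Server):

    WITH stats AS (
        SELECT
            AVG(order_amount)    AS avg_amount,
            STDDEV(order_amount) AS std_amount
        FROM orders
    )
    SELECT o.*
    FROM orders AS o
    CROSS JOIN stats AS s
    WHERE o.order_amount > s.avg_amount + 3 * s.std_amount
       OR o.order_amount < s.avg_amount - 3 * s.std_amount;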

This uses the three standard deviations rule. In a normal distribution, 99.7% of data falls within three standard deviations of the mean. Anything beyond that is statistically unusual. We’ve identified all outliers. Now, here’s the important part: don’t automatically delete them. Investigate first.

Sometimes outliers are legitimate: a bulk order, a corporate purchase. Sometimes they’re errors. You need domain knowledge to decide. For ongoing data pipelines, you can use this pattern to automatically flag suspicious records for manual review before they enter your reporting layer.
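
As a sketch, the flagging version keeps every row and just adds a review column, so nothing gets dropped before a human looks at it (table and column names are assumptions):

    SELECT
        o.*,
        CASE
            WHEN o.order_amount > s.avg_amount + 3 * s.std_amount
              OR o.order_amount < s.avg_amount - 3 * s.std_amount
            THEN 'REVIEW'
            ELSE 'OK'
        END AS quality_flag
    FROM orders AS o
    CROSS JOIN (
        SELECT AVG(order_amount) AS avg_amount, STDDEV(order_amount) AS std_amount
        FROM orders
    ) AS s;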

SECTION 5: VALIDATING DATA AGAINST BUSINESS RULES

The last essential query is validation. You need to check that your data actually makes sense according to your business rules. Here’s a comprehensive validation query that checks multiple conditions at once.
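
A sketch of what that can look like, with order_id, order_date, ship_date, quantity, unit_price, total_amount, and email as assumed column names, and CURRENT_DATE standing in for today's date (SQL Server would use CAST(GETDATE() AS DATE)):

    SELECT
        order_id,
        CASE
            WHEN order_date > CURRENT_DATE             THEN 'Order date in the future'
            WHEN ship_date < order_date                THEN 'Shipped before ordered'
            WHEN quantity <= 0                         THEN 'Non-positive quantity'
            WHEN unit_price <= 0                       THEN 'Non-positive price'
            WHEN total_amount <> quantity * unit_price THEN 'Total does not match quantity x price'
            WHEN email NOT LIKE '%_@_%._%'             THEN 'Malformed email'
        END AS validation_error
    FROM orders
    WHERE order_date > CURRENT_DATE
       OR ship_date < order_date
       OR quantity <= 0
       OR unit_price <= 0
       OR total_amount <> quantity * unit_price
       OR email NOT LIKE '%_@_%._%';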

This query checks six different business rules:

  1. No future order dates
  2. Ship date can’t be before order date
  3. Quantity must be positive
  4. Price must be positive
  5. Total amount should equal quantity times unit price
  6. Email must have basic valid format

Every row returned is a violation. You get the specific error in its own column, so you can filter and fix by error type.
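
To get the dashboard view mentioned next, the same checks can be wrapped in an aggregation (a sketch built on the same assumed columns):

    WITH violations AS (
        SELECT
            CASE
                WHEN order_date > CURRENT_DATE             THEN 'Order date in the future'
                WHEN ship_date < order_date                THEN 'Shipped before ordered'
                WHEN quantity <= 0                         THEN 'Non-positive quantity'
                WHEN unit_price <= 0                       THEN 'Non-positive price'
                WHEN total_amount <> quantity * unit_price THEN 'Total does not match quantity x price'
                WHEN email NOT LIKE '%_@_%._%'             THEN 'Malformed email'
            END AS validation_error
        FROM orders
    )
    SELECT
        validation_error,
        COUNT(*) AS violation_count
    FROM violations
    WHERE validation_error IS NOT NULL
    GROUP BY validation_error
    ORDER BY violation_count DESC;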

This gives you a dashboard view of data quality. You can track these metrics over time to see if your data quality is improving or degrading.

Conclusion

Let’s recap. These five query patterns will handle the majority of your data cleaning work:

  1. ROW_NUMBER with PARTITION BY for intelligent duplicate removal
  2. COALESCE and NULL counting for handling missing values
  3. CASE statements with TRIM and UPPER for standardizing text
  4. Statistical outlier detection using standard deviations
  5. Multi-condition validation to enforce business rules

The key to effective data cleaning isn’t knowing exotic functions. It’s having reliable patterns you can adapt to any dataset.