📌 So you’ve got a dataset, here’s how you clean it!
Came across this article again while skimming through my old bookmarks. I remembered liking it the first time around for how grounded and practical it was — and it still holds up.
📌 So You’ve Got a Dataset. Here’s How You Clean It.
A step-by-step mindset for approaching data cleaning, by Ang Li-Lian.
Reflections
What I appreciate about this piece is that it’s not tied to a specific tool or language. You could apply the exact same process in R, Python, SQL, or Excel; the core ideas are all about intentional, structured thinking.
Some solid reminders:
- Start by actually looking at your dataset. Dimensions, types, nulls, duplicates: don’t skip the orientation.
- Don’t treat missing data like an afterthought. Handle it with intention based on context.
- Encode categorical variables early and deliberately. It’ll save you pain later.
- Check your assumptions with basic summaries. Sometimes the weirdness is obvious if you pause to look.
- Modularize your cleaning steps. If you build them clearly, you can reuse them across projects.
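To make those reminders concrete, here is a minimal sketch of how I might lay them out in Python with pandas. The article itself is tool-agnostic, so the library choice is mine. The toy data, the column names (`age`, `city`), and the fill rules are all made up for illustration and would need to change with the context of a real dataset.

```python
import pandas as pd


def orient(df: pd.DataFrame) -> dict:
    """Orientation: dimensions, types, nulls, duplicates."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "nulls": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative policy only; the right choice depends on context."""
    out = df.drop_duplicates().copy()
    # Assumption: a missing age is acceptably filled with the median.
    out["age"] = out["age"].fillna(out["age"].median())
    # Assumption: a missing city is kept as an explicit "unknown" label.
    out["city"] = out["city"].fillna("unknown")
    return out


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Encode categorical columns early and deliberately."""
    out = df.copy()
    out["city"] = out["city"].astype("category")
    return out


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Chain the small steps so the trail stays readable and reusable."""
    return df.pipe(handle_missing).pipe(encode_categoricals)


# Hypothetical toy data: one duplicated row, a few gaps, one odd age.
raw = pd.DataFrame({
    "age": [34, None, 29, 29, 120],
    "city": ["Oslo", "Lima", None, None, "Oslo"],
})

report = orient(raw)                      # look before touching anything
cleaned = clean(raw)
summary = cleaned.describe(include="all") # basic summary to sanity-check
```

Reading `summary`, the maximum age of 120 is the kind of weirdness that jumps out once you pause to look. Keeping each step as its own small function is what makes it easy to lift into the next project.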
It’s not fancy, but that’s the point. There’s something satisfying about a clean, thoughtful approach that isn’t trying to be clever, just effective. I’ve found this kind of steady rhythm in cleaning always pays off, especially when I come back to a dataset weeks or months later and can still follow my own trail.
This post builds on a recent LinkedIn #BookmarkDive reflection. Feel free to join the conversation there.