Data cleaning toolkit

Interactive data demo
An upstream prerequisite rather than a side utility: downstream models and dashboards ingest the same reviewed tables. Multi-format inputs become auditable CSV/Parquet/JSON exports plus an HTML step log and before/after views, capped near 100K rows. Cleaning rules cover bad formats, duplicates, skewed categories, and optional outliers, and bundled sample files support dry runs.
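
A minimal pandas sketch of the rule families described above; column handling, thresholds, and file names are illustrative assumptions, not the repository's actual API or defaults:

```python
# Illustrative only: names, thresholds, and paths are assumptions, not the repo's code.
import pandas as pd

ROW_CAP = 100_000  # inputs are capped near 100K rows


def clean(df: pd.DataFrame, drop_outliers: bool = False):
    """Apply the demo's rule families and return the table plus a step log."""
    steps = []
    df = df.head(ROW_CAP).copy()
    steps.append(f"capped input to {len(df)} rows")

    # Bad formats: coerce mostly-numeric text columns, leaving failures as NaN.
    for col in df.select_dtypes(include="object").columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        if converted.notna().mean() > 0.9:
            df[col] = converted
            steps.append(f"coerced {col} to numeric")

    # Duplicates: drop exact repeats across all columns.
    before = len(df)
    df = df.drop_duplicates()
    steps.append(f"dropped {before - len(df)} duplicate rows")

    # Skewed categories: collapse labels below 1% frequency into 'other'.
    for col in df.select_dtypes(include="object").columns:
        freq = df[col].value_counts(normalize=True)
        rare = freq[freq < 0.01].index
        if len(rare):
            df[col] = df[col].replace(dict.fromkeys(rare, "other"))
            steps.append(f"collapsed {len(rare)} rare labels in {col}")

    # Optional outliers: simple IQR filter on numeric columns.
    if drop_outliers:
        for col in df.select_dtypes(include="number").columns:
            q1, q3 = df[col].quantile([0.25, 0.75])
            df = df[df[col].between(q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))]
        steps.append("removed IQR outliers")

    return df, steps


cleaned, steps = clean(pd.read_csv("sample_reviews.csv"), drop_outliers=True)
cleaned.to_parquet("cleaned.parquet")  # the same table can also be written as CSV or JSON
pd.DataFrame({"step": steps}).to_html("step_log.html", index=False)
```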

Commercial fit

Paid work follows the same constraints as the storefront Data Prep Sprint: a frozen export list you sign off, auditable artefacts your next KPI or automation step can ingest, and explicit exclusions rather than open-ended exploratory cleaning.

Handoff notes

JSON flattening stays one level deep by design. This toolkit pairs with the EDA project on this page (profile vs. fix). The deployment mirrors the repository's limits and validation logic.
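
As an illustration of the one-level rule, a sketch using pandas' json_normalize with max_level=1 (an assumption about the mechanism, not a quote of the repository code): first-level keys become columns, while deeper objects stay in a single raw column.

```python
import pandas as pd

# Hypothetical nested records; only the first level of nesting is expanded.
records = [
    {"id": 1, "user": {"name": "a", "address": {"city": "x", "zip": "10115"}}},
    {"id": 2, "user": {"name": "b", "address": {"city": "y", "zip": "75001"}}},
]

flat = pd.json_normalize(records, max_level=1)
print(flat.columns.tolist())
# ['id', 'user.name', 'user.address']  <- 'user.address' keeps the raw dict
```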

Repositories & demos

Public proof only; client deliverables stay under separate agreements.

Evidence id: data-cleaning
Closest storefront package: Data Prep Sprint

Cleaning, profiling, and clear documentation so the next dashboard, forecast, or model rests on inputs your team can explain and reproduce.

Stack & keywords
  • Streamlit
  • pandas
  • pytest
  • Parquet / JSON
  • Docker