How to Build a Repeatable Data-Cleaning Report in GeoPandas
Cleaning spatial data is easy to do and hard to explain. You drop some nulls, repair a few geometries, deduplicate, and end up with a clean file — but two weeks later you cannot say why the row count went from 12,403 to 11,987. This article shows you how to make every cleaning step auditable by recording what it changed, accumulating those facts into a summary table, and rendering a report you save right next to the cleaned data.
The approach is deliberately lightweight: a small CleaningReport helper, plain Python dictionaries, and a pandas DataFrame for the summary. No new dependencies, no framework. By the end you will have a pipeline that produces a clean GeoDataFrame and a Markdown report explaining itself, every single time.
Problem statement
If any of these sound familiar, this article is for you:
- Cleaning is a black box. A script runs, a clean file appears, and nobody knows what happened in between.
- No record of what was dropped. You removed 416 rows, but you cannot list which ones or why — null geometries? duplicates? both?
- Results are not reproducible. Rerun the same script on slightly different input and you cannot tell whether the difference is the data or the code.
- You cannot explain row-count changes. A stakeholder asks "where did those features go?" and your only answer is a shrug.
- No paper trail for review. There is nothing to attach to a pull request, a data handoff, or an audit.
The fix is not more cleaning code. It is measurement: capture before/after metrics at each step and persist them.
Quick answer
Wrap each cleaning operation in a helper that records row counts before and after, plus a short note on what the step did. Accumulate those records and print (or save) a summary table at the end. Here is a minimal version you can drop into any script:
import geopandas as gpd
import pandas as pd
class CleaningReport:
def __init__(self):
self.steps = []
def log(self, gdf_before, gdf_after, name, note=""):
before = len(gdf_before)
after = len(gdf_after)
self.steps.append({
"step": name,
"rows_in": before,
"rows_out": after,
"removed": before - after,
"note": note,
})
return gdf_after
def summary(self):
return pd.DataFrame(self.steps)
# usage
report = CleaningReport()
gdf = gpd.read_file("input.gpkg")
clean = gdf[gdf.geometry.notna()]
gdf = report.log(gdf, clean, "drop null geometries", "removed missing geometry rows")
clean = gdf.drop_duplicates(subset="id")
gdf = report.log(gdf, clean, "drop duplicate ids", "deduplicated on id")
print(report.summary().to_string(index=False))
That is the whole idea. The rest of this article hardens it: more metrics, a summary DataFrame, Markdown/HTML rendering, and saving the report alongside the cleaned data.
Step-by-step solution
Decide what to measure
Before writing any code, decide which facts each step should record. For data cleaning, the high-value metrics are almost always counts of things added, changed, or removed:
- Rows in / rows out — the headline number that explains feature loss.
- Null or empty geometries dropped — missing or
GEOMETRYCOLLECTION EMPTYfeatures. - Invalid geometries repaired — self-intersections, bowties, etc.
- Duplicates removed — by attribute key, by geometry, or both.
- Attribute fixes — values normalized, coerced, or imputed.
You do not need all of these on every step. The goal is that each step records enough to reconstruct what it did.
import geopandas as gpd
import pandas as pd
# A single record describes one step. Keep keys consistent so they
# stack cleanly into a DataFrame later. Unused metrics default to 0.
record_template = {
"step": "",
"rows_in": 0,
"rows_out": 0,
"removed": 0,
"repaired": 0,
"note": "",
}
Capture before/after counts per step
The core trick is to measure the GeoDataFrame before an operation and after, then store the difference. Computing this by hand for one step looks like this:
import geopandas as gpd
gdf = gpd.read_file("input.gpkg")
rows_in = len(gdf)
clean = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty]
rows_out = len(clean)
print(f"dropped {rows_in - rows_out} null/empty geometries "
f"({rows_in} -> {rows_out})")
Doing this manually for every step is repetitive and error-prone — you will forget to measure one, or measure the wrong variable. That is exactly what the helper automates.
Accumulate stats into a summary DataFrame
Collect each record in a list, then build a DataFrame at the end. The list-of-dicts pattern is the simplest reliable way to do this in pandas:
import pandas as pd
steps = []
steps.append({"step": "drop null geometries", "rows_in": 1000,
"rows_out": 980, "removed": 20, "note": "missing geom"})
steps.append({"step": "repair invalid", "rows_in": 980,
"rows_out": 980, "removed": 0, "repaired": 7,
"note": "make_valid"})
summary = pd.DataFrame(steps).fillna(0)
# integer columns may become float after fillna; coerce back
int_cols = ["rows_in", "rows_out", "removed", "repaired"]
summary[int_cols] = summary[int_cols].astype(int)
print(summary.to_string(index=False))
The fillna(0) matters: steps that do not set repaired would otherwise produce NaN, which forces the column to float and looks messy in the report.
Wrap each cleaning step with logging
Now fold the before/after measurement into the helper so your pipeline reads as a sequence of logged steps. Each cleaning function takes a GeoDataFrame and returns a cleaned one; the report records the delta:
import geopandas as gpd
import pandas as pd
class CleaningReport:
def __init__(self):
self.steps = []
def log(self, before, after, name, note="", repaired=0):
self.steps.append({
"step": name,
"rows_in": len(before),
"rows_out": len(after),
"removed": len(before) - len(after),
"repaired": int(repaired),
"note": note,
})
return after
report = CleaningReport()
gdf = gpd.read_file("input.gpkg")
step = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty]
gdf = report.log(gdf, step, "drop null/empty", "removed missing geometry")
invalid_mask = ~gdf.geometry.is_valid
step = gdf.copy()
step.loc[invalid_mask, "geometry"] = step.loc[invalid_mask, "geometry"].make_valid()
gdf = report.log(gdf, step, "repair invalid", "make_valid()",
repaired=int(invalid_mask.sum()))
step = gdf.drop_duplicates(subset="id")
gdf = report.log(gdf, step, "drop duplicate ids", "deduplicated on id")
Each line of the pipeline now does two things: transforms the data and records what it did. The two never drift apart because they happen in the same call.
Render a Markdown/HTML report
A DataFrame knows how to render itself. pandas ships to_markdown() (which needs the small tabulate package) and to_html() (built in). Wrap the table with a title and a few totals:
import pandas as pd
def render_markdown(report, title="Data Cleaning Report"):
df = report.summary()
total_in = int(df["rows_in"].iloc[0]) if len(df) else 0
total_out = int(df["rows_out"].iloc[-1]) if len(df) else 0
lines = [
f"# {title}",
"",
f"- Rows in: **{total_in}**",
f"- Rows out: **{total_out}**",
f"- Net removed: **{total_in - total_out}**",
"",
"## Steps",
"",
df.to_markdown(index=False),
"",
]
return "\n".join(lines)
# add summary() to the class
def summary(self):
df = pd.DataFrame(self.steps).fillna(0)
int_cols = ["rows_in", "rows_out", "removed", "repaired"]
for col in int_cols:
if col in df:
df[col] = df[col].astype(int)
return df
CleaningReport.summary = summary
If you prefer HTML, swap df.to_markdown(index=False) for df.to_html(index=False) and adjust the surrounding markup. Markdown is usually the better default because it diffs cleanly in version control.
Save the report with the cleaned data
Write the report next to the output so the two never get separated. Use a shared stem and a timestamp so reruns do not overwrite each other:
from pathlib import Path
from datetime import datetime, timezone
def save_outputs(gdf, report, out_dir="output", stem="parcels"):
out = Path(out_dir)
out.mkdir(parents=True, exist_ok=True)
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
data_path = out / f"{stem}_clean_{stamp}.gpkg"
report_path = out / f"{stem}_report_{stamp}.md"
gdf.to_file(data_path, layer=stem, driver="GPKG")
report_path.write_text(render_markdown(report), encoding="utf-8")
return data_path, report_path
data_path, report_path = save_outputs(gdf, report)
print(f"data: {data_path}")
print(f"report: {report_path}")
Now every clean dataset arrives with its own explanation attached.
Code examples
Example 1: A fuller CleaningReport class
This version records timestamps per step, supports arbitrary extra metrics, and exposes both a DataFrame summary and rendered output.
import pandas as pd
from datetime import datetime, timezone
class CleaningReport:
def __init__(self, title="Data Cleaning Report"):
self.title = title
self.steps = []
self.started = datetime.now(timezone.utc)
def log(self, before, after, name, note="", **metrics):
record = {
"step": name,
"rows_in": len(before),
"rows_out": len(after),
"removed": len(before) - len(after),
"note": note,
"at": datetime.now(timezone.utc).isoformat(),
}
record.update(metrics)
self.steps.append(record)
return after
def summary(self):
df = pd.DataFrame(self.steps).fillna(0)
for col in ("rows_in", "rows_out", "removed", "repaired"):
if col in df:
df[col] = df[col].astype(int)
return df
The **metrics parameter lets any step record extra facts (repaired=7, coerced=12) without changing the class.
Example 2: Logging a single step cleanly
A thin wrapper makes the call site read like prose and guarantees the measurement happens.
import geopandas as gpd
report = CleaningReport()
gdf = gpd.read_file("input.gpkg")
# drop rows with no geometry or empty geometry
keep = gdf.geometry.notna() & ~gdf.geometry.is_empty
gdf = report.log(gdf, gdf[keep], "drop null/empty geometries",
note="notna() & ~is_empty")
Because log returns the after GeoDataFrame, you reassign gdf in the same line. There is no way to clean without logging.
Example 3: Repairing invalid geometries and counting repairs
make_valid() is the GeoPandas 1.x way to fix self-intersections and similar problems. Count how many were invalid before you repair so the number is accurate.
import geopandas as gpd
invalid = ~gdf.geometry.is_valid
n_invalid = int(invalid.sum())
fixed = gdf.copy()
fixed.loc[invalid, "geometry"] = fixed.loc[invalid, "geometry"].make_valid()
gdf = report.log(gdf, fixed, "repair invalid geometries",
note="GeoSeries.make_valid()", repaired=n_invalid)
Row count does not change here, so removed is 0 — the repaired metric carries the story instead.
Example 4: Recording an attribute fix
Not every step touches geometry. Normalizing a column is a cleaning step too, and it deserves a record.
before = gdf.copy()
gdf["category"] = gdf["category"].str.strip().str.lower()
n_changed = int((before["category"] != gdf["category"]).sum())
gdf = report.log(before, gdf, "normalize category",
note="strip + lowercase", coerced=n_changed)
Comparing the column before and after gives you an exact count of how many values actually changed.
Example 5: Exporting to Markdown and JSON
Markdown is for humans; JSON is for machines that consume your report downstream (dashboards, tests, monitoring).
import json
from pathlib import Path
def export_report(report, stem, out_dir="output"):
out = Path(out_dir)
out.mkdir(parents=True, exist_ok=True)
df = report.summary()
(out / f"{stem}.md").write_text(
f"# {report.title}\n\n{df.to_markdown(index=False)}\n",
encoding="utf-8",
)
(out / f"{stem}.json").write_text(
json.dumps(report.steps, indent=2), encoding="utf-8",
)
return out / f"{stem}.md", out / f"{stem}.json"
The raw report.steps list serializes directly to JSON because it is just dicts of strings and ints.
Explanation
Why auditability matters. Cleaning destroys information on purpose — that is its job. But destruction without a record is indistinguishable from a bug. When you can point at a table that says "drop duplicate ids removed 312 rows," a 312-row discrepancy stops being alarming and becomes expected. The report turns silent data loss into an explained, signed-off decision.
Why reproducibility matters. A clean file is a snapshot; the report is the recipe. If someone reruns your pipeline next quarter and gets a different row count, the two reports side by side show exactly which step diverged. Without the report you are reduced to guessing whether the code changed or the input did. Recording library versions and the target CRS (covered below) closes the last gaps.
How this complements validation. Cleaning and validation are different jobs. Cleaning changes data and the report says what changed. Validation checks data and either passes or fails against a schema. They work best together: clean, generate the report, then validate the result against a schema before it flows downstream. The report explains the transformation; validation guarantees the output meets your contract. Neither replaces the other.
Edge cases or notes
Log decisions, not just counts
A count tells you how many rows left; a note tells you why. Always record the rule you applied — "drop_duplicates(subset='id', keep='first')" is far more useful six months later than removed: 312. The note field is cheap to fill and is often the first thing a reviewer reads.
Keep the raw input immutable
Never overwrite your source file. Read it, clean a copy, and write to a new path. Use gdf.copy() before in-place modifications so the "before" you measure is genuinely the before. If the raw file is ever mutated, the report becomes fiction because you can no longer reproduce the starting point.
Record library versions and the target CRS
Geometry operations depend on GEOS and Shapely versions, and results can subtly differ across them. Capture the environment in the report header so a future rerun is comparable.
import geopandas as gpd, shapely, pandas as pd
env = {
"geopandas": gpd.__version__,
"shapely": shapely.__version__,
"pandas": pd.__version__,
"target_crs": str(gdf.crs),
}
Record gdf.crs explicitly — if a reprojection happened during cleaning, the report should say which CRS the output lands in.
The report is documentation, not a substitute for validation
A report can faithfully describe a pipeline that produced garbage. It records what did happen; it does not assert the result is correct. Run a schema or geometry validation pass after cleaning, and treat the report as evidence rather than proof.
Large pipelines: log to a file as you go
For long-running or scheduled jobs, do not wait until the end to learn something failed. Emit each step to a log file or logging handler the moment it completes, so a crash at step 40 still leaves a trail through step 39.
import logging
logging.basicConfig(filename="cleaning.log", level=logging.INFO)
def logged(report, before, after, name, **kw):
after = report.log(before, after, name, **kw)
rec = report.steps[-1]
logging.info("%s: %s -> %s", name, rec["rows_in"], rec["rows_out"])
return after
Record timestamps explicitly
Do not rely on file modification times — they change when files are copied or moved. Stamp each step and the run itself with an explicit UTC timestamp (datetime.now(timezone.utc)), and put it in the report. This makes it possible to correlate a report with a specific run when several exist.
Internal links
- The Python GIS Data Cleaning Checklist
- How to Validate a GeoDataFrame Against a Schema Before Analysis
- How to Find and Remove Duplicate Geometries in GeoPandas
- How to Fix Invalid Geometries in Python (GeoPandas)
- How to Read and Write GeoPackage Files in Python
FAQ
Do I need extra libraries for this?
No. The core helper uses only geopandas and pandas, which you already have. The single optional dependency is tabulate, required by DataFrame.to_markdown(). If you would rather avoid it, use to_html() (built in) or build the Markdown table yourself with a small loop.
How do I count rows removed versus rows repaired?
They are different metrics, so record both. "Removed" is len(before) - len(after) and applies to filtering steps. "Repaired" is a count you compute explicitly, such as int((~gdf.geometry.is_valid).sum()) measured before you call make_valid(), since the row count itself does not change.
Can I use this with method chaining or pipe()?
Yes. Write each cleaning step as a function that takes and returns a GeoDataFrame, then call report.log(before, after, ...) around it. You can also pass the report into functions used with DataFrame.pipe(), as long as each function logs its own before/after counts.
Where should the report live?
Next to the cleaned data, with a matching filename stem and timestamp — for example parcels_clean_20260617T101500Z.gpkg beside parcels_report_20260617T101500Z.md. Keeping them together prevents a clean file from ever circulating without its explanation.
What format should the report be in?
Markdown for review (it diffs well in git and renders on most platforms), JSON for any program that consumes the stats, and optionally HTML for sharing. The CleaningReport.steps list and its summary() DataFrame can produce all three from the same source.
Does logging slow the pipeline down?
Negligibly. Each log call does two len() operations and appends a dict — microseconds against the seconds your actual geometry operations take. The metrics you compute (like counting invalid geometries) cost the same whether or not you record them.
How does this differ from just printing row counts?
Printing is ephemeral and unstructured; the moment your terminal scrolls, the evidence is gone. Accumulating records into a DataFrame gives you a persistable, sortable, exportable artifact you can save, diff, and attach to a review. The structure is the point.
Can I add custom metrics for my own steps?
Yes. The fuller CleaningReport.log accepts **metrics, so any step can record extra facts like coerced=12 or imputed=4. Steps that omit a metric get filled with 0 in the summary DataFrame, so the columns still line up.