The Python GIS Data Cleaning Checklist: From Raw Download to Analysis-Ready

You downloaded a dataset, opened it in Python, and ran your first spatial join — and got an empty result, a TopologyException, or a silent reprojection that placed your features in the ocean. Welcome to spatial data cleaning, the unglamorous work that decides whether the rest of your analysis is trustworthy.

This guide gives you a complete, ordered mental model for turning a raw download into an analysis-ready GeoDataFrame using current GeoPandas (1.x) and Shapely (2.x). Work through it top to bottom the first few times. Once the order is in your head, you will spot which step a given dataset actually needs.

Problem statement

Real-world spatial data is messy in ways that tabular data is not. A shapefile from a city open-data portal, a GeoJSON export from someone's desktop GIS, or a scraped collection of features all carry their own defects. Worse, many of these defects are invisible until a downstream operation fails — or, more dangerously, succeeds with wrong numbers.

Geometry problems you will hit constantly:

Invalid geometries — self-intersecting (bowtie) polygons, rings that touch or cross themselves, holes that escape their shell. Shapely will raise or return garbage from overlay and area calculations.
Null geometries — rows where the geometry cell is literally None. They survive most operations until one suddenly errors.
Empty geometries — a non-null geometry containing no coordinates (POLYGON EMPTY). These pass null checks but break area, length, and distance math.
Mixed geometry types — Polygon and MultiPolygon (or worse, points and lines) in one column, which some formats and operations refuse to handle.

Coordinate and attribute problems that quietly corrupt results:

Missing or wrong CRS — no .crs at all, or a defined CRS that does not match the actual coordinate values.
CRS mismatch between layers — two datasets that look fine alone but never intersect because they are in different projections.
Duplicate features — the same parcel digitised twice, or a join that fanned rows out.
Inconsistent attributes — "N/A", "", and None all meaning "missing"; whitespace, mixed case, and numbers stored as text in key columns.

Skip the cleaning and you do not get an error — you get a plausible-looking answer that is wrong. That is the real cost.

Quick answer

Clean spatial data in a fixed order: load and inspect, fix the CRS, remove null and empty geometries, repair invalid ones with make_valid(), drop duplicates, normalise attributes, then export to GeoPackage. Doing it in this order means each step works on data the previous step already made safe.

Figure: the spatial data cleaning pipeline, from load and inspect through export to GeoPackage. The steps run in a fixed order, and each one works on data the previous one made safe.

Here is the whole pipeline in one block — adapt the column names to your dataset.

import geopandas as gpd
from shapely import make_valid

# 1. Load and inspect
gdf = gpd.read_file("data/parcels.shp")
print(gdf.shape, gdf.crs, gdf.geom_type.value_counts(), sep="\n")

# 2. Confirm / repair the CRS, then project to a metric CRS for analysis
if gdf.crs is None:
    gdf = gdf.set_crs(epsg=4326)          # ONLY if you know the true source CRS
gdf = gdf.to_crs(epsg=3857)               # use a CRS appropriate for your area

# 3. Drop null and empty geometries
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()

# 4. Repair invalid geometries
invalid = ~gdf.geometry.is_valid
gdf.loc[invalid, "geometry"] = gdf.loc[invalid, "geometry"].apply(make_valid)
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()  # make_valid can drop slivers

# 5. Remove duplicates (attribute + geometry)
gdf = gdf.drop_duplicates(subset=[c for c in gdf.columns if c != "geometry"]).copy()
gdf = gdf[~gdf.geometry.normalize().duplicated()].copy()

# 6. Normalise attribute columns
gdf.columns = gdf.columns.str.strip().str.lower()
for col in gdf.select_dtypes(include="object").columns:
    gdf[col] = gdf[col].str.strip().replace({"": None, "N/A": None, "n/a": None})

# 7. Export cleanly
gdf.to_file("data/parcels_clean.gpkg", layer="parcels", driver="GPKG")

The cleaning checklist

Load the file and inspect shape, CRS, geometry types, and dtypes.
Confirm the CRS is present and matches the actual coordinates.
Set the CRS if missing (only when you know the true source CRS).
Reproject to a CRS appropriate for your analysis (metric for area/length).
Drop rows with null geometries.
Drop rows with empty geometries.
Flag invalid geometries with is_valid.
Repair invalid geometries with make_valid().
Re-check for nulls/empties introduced by the repair step.
Remove duplicate features by attributes and by geometry.
Normalise attribute columns (names, whitespace, missing-value tokens, dtypes).
Validate against expectations, then export to GeoPackage.

Step-by-step solution

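Work through the steps in the order given, because each one assumes the previous step has already run. Deduplicating geometries before normalising them, for instance, leaves duplicates behind.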

Load & inspect

Read the file and look before you touch anything. The shape, CRS, geometry-type counts, and dtypes tell you which of the later steps you actually need.

import geopandas as gpd

gdf = gpd.read_file("data/parcels.shp")

print("rows, cols:", gdf.shape)
print("crs:", gdf.crs)
print(gdf.geom_type.value_counts())
print(gdf.dtypes)
print("nulls:", gdf.geometry.isna().sum())
print("empties:", gdf.geometry.is_empty.sum())
print("invalid:", (~gdf.geometry.is_valid).sum())

Confirm/repair the CRS

A missing CRS and a wrong CRS are different problems. set_crs() only labels the data — it does not move coordinates — so use it solely when the file lost a CRS it should have had. Use to_crs() to actually reproject.

# Label a missing CRS (only if you KNOW the source projection)
if gdf.crs is None:
    gdf = gdf.set_crs(epsg=4326, allow_override=False)

# Reproject for analysis — pick a metric CRS suited to your area of interest
gdf = gdf.to_crs(epsg=3857)
print(gdf.crs)
print(gdf.total_bounds)   # sanity-check the coordinate ranges

Drop null and empty geometries

Null and empty are distinct: notna() catches None, and is_empty catches geometries with no coordinates. You need both checks.

before = len(gdf)
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()
print(f"dropped {before - len(gdf)} null/empty rows")

Check and repair validity

Find invalid geometries with is_valid, then repair them with Shapely 2.x's make_valid(). It is the standard tool and, unlike buffer(0), it will not silently distort your shapes.

from shapely import make_valid

invalid_mask = ~gdf.geometry.is_valid
print("invalid:", invalid_mask.sum())

gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(make_valid)

# make_valid can turn a degenerate polygon into an empty or collection — re-clean
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()
assert gdf.geometry.is_valid.all()

Remove duplicates

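Duplicates come in two flavours: identical attribute rows and identical geometries. Handle both. The attribute pass assumes some column, such as an ID, tells genuinely different features apart; without one, it can drop distinct features that happen to share attribute values. normalize() puts geometries into a canonical form so that logically identical shapes compare equal.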

# Exact attribute-row duplicates
attr_cols = [c for c in gdf.columns if c != gdf.geometry.name]
gdf = gdf.drop_duplicates(subset=attr_cols).copy()

# Geometric duplicates (same shape, possibly different vertex order)
gdf = gdf[~gdf.geometry.normalize().duplicated()].copy()

Normalise attribute columns

Standardise column names, strip whitespace, collapse the various "missing" tokens to a real None, and fix dtypes that loaded as text. This is what makes joins and filters behave.

# Tidy column names
gdf.columns = gdf.columns.str.strip().str.lower()

# Clean string columns
for col in gdf.select_dtypes(include="object").columns:
    gdf[col] = (gdf[col].str.strip()
                        .replace({"": None, "N/A": None, "n/a": None, "NULL": None}))

# Coerce a numeric column stored as text (skipped if the column is absent)
import pandas as pd

if "area_m2" in gdf.columns:
    gdf["area_m2"] = pd.to_numeric(gdf["area_m2"], errors="coerce")

Validate against expectations

Before exporting, assert the things that must be true. A handful of cheap checks here catch problems that would otherwise surface three notebooks later.

assert gdf.crs is not None,              "CRS is undefined"
assert gdf.geometry.notna().all(),       "null geometries remain"
assert (~gdf.geometry.is_empty).all(),   "empty geometries remain"
assert gdf.geometry.is_valid.all(),      "invalid geometries remain"
assert gdf["parcel_id"].is_unique,       "duplicate parcel ids"
print(f"{len(gdf)} clean features ready")

Export cleanly

Write to GeoPackage, not Shapefile. GeoPackage is a single file, has no 10-character field-name limit or 2 GB cap, stores the CRS reliably, and supports full datatypes and Unicode.

gdf.to_file("data/parcels_clean.gpkg", layer="parcels", driver="GPKG")

# Round-trip check
check = gpd.read_file("data/parcels_clean.gpkg", layer="parcels")
print(check.shape, check.crs)

Explanation

The order is not arbitrary: each step depends on the one before it being done. Inspecting first tells you which steps matter; there is no point repairing validity on a dataset with no invalid geometries. CRS comes early because reprojection touches every coordinate, and you want it settled before you compute anything geometric. Do area or distance math in a geographic CRS like EPSG:4326 and your numbers are in degrees, which is meaningless for measurement. A metric CRS suited to your region must be in place first.

Table: feature counts after each cleaning step, with what an unusual drop means. These five numbers are the whole quality report for a cleaning run.

Null and empty removal precedes validity repair because make_valid() and is_valid expect real geometries to chew on; feeding them None or empty shapes wastes effort and clutters your masks. And because make_valid() can itself produce empties (when it collapses a degenerate sliver) or GeometryCollections, you re-run the null/empty filter after repair. That second pass is the step beginners forget.
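When the repair returns a GeometryCollection, one option for a polygon layer is to keep only its polygonal parts. The sketch below shows one way to do that. `polygonal_part` is a hypothetical helper name, not a GeoPandas or Shapely function, and the sample coordinates are made up for illustration.

```python
from shapely import make_valid
from shapely.geometry import Polygon
from shapely.ops import unary_union

POLYGONAL = ("Polygon", "MultiPolygon")

def polygonal_part(geom):
    """Return only the polygonal pieces of a geometry, or None if there are none."""
    if geom is None or geom.is_empty:
        return None
    if geom.geom_type in POLYGONAL:
        return geom
    if geom.geom_type == "GeometryCollection":
        polys = [part for part in geom.geoms if part.geom_type in POLYGONAL]
        return unary_union(polys) if polys else None
    return None  # lines or points only: nothing polygonal to keep

# An invalid ring whose repair may include a leftover line as well as a polygon
broken = Polygon([(0, 2), (0, 1), (2, 0), (0, 0), (0, 2)])
repaired = make_valid(broken)
cleaned = polygonal_part(repaired)
print(repaired.geom_type, "->", cleaned.geom_type)
```

On a GeoDataFrame this could be applied with gdf.geometry.apply(polygonal_part), followed by the same null/empty filter, since the helper returns None when nothing polygonal survives. Whether discarding the leftover lines is acceptable depends on your data.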

Deduplication comes after repair so that two copies of the same broken polygon normalise to the same valid shape and collapse together — dedupe first and a later repair might make two surviving rows identical anyway. Attribute normalisation is last among the transforms because it is independent of geometry; doing it at the end keeps it from interfering with the geometric steps. Validation and export close the loop: you assert your invariants, then write to a format that preserves them.

Edge cases or notes

GeoPackage vs Shapefile for output

Always prefer driver="GPKG" for cleaned output. Shapefile truncates field names to 10 characters, caps at 2 GB, splits one logical dataset across several sidecar files, and has shaky encoding and CRS storage. GeoPackage is a single SQLite file with none of those limits. Read shapefiles you are given; do not create new ones.

A defined CRS does not guarantee correctness

gdf.crs only tells you what label the file carries — not that the label is right. Data is routinely exported with a wrong or default CRS. Always sanity-check total_bounds against where the data should be on Earth before trusting the CRS, and never use set_crs() to "fix" coordinates that are simply in a different projection.
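As a rough illustration of that sanity check, the sketch below compares the declared CRS with the coordinate ranges. `bounds_look_geographic` is a hypothetical helper and both sample layers are invented. Passing the check is necessary but not sufficient: it can catch metre-scale data labelled as degrees, not every wrong label.

```python
import geopandas as gpd
from shapely.geometry import Point

def bounds_look_geographic(gdf):
    """True if every coordinate fits inside the longitude/latitude ranges."""
    minx, miny, maxx, maxy = gdf.total_bounds
    return bool(-180 <= minx and maxx <= 180 and -90 <= miny and maxy <= 90)

# Illustrative data: metre-scale coordinates wrongly labelled as EPSG:4326
mislabelled = gpd.GeoDataFrame(geometry=[Point(500_000, 5_800_000)], crs="EPSG:4326")
honest = gpd.GeoDataFrame(geometry=[Point(13.4, 52.5)], crs="EPSG:4326")

for name, layer in [("mislabelled", mislabelled), ("honest", honest)]:
    declared_geographic = layer.crs.is_geographic
    plausible = bounds_look_geographic(layer)
    print(name, "declared geographic:", declared_geographic, "bounds plausible:", plausible)
```

A layer that declares a geographic CRS but fails the range check almost certainly carries the wrong label.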

Valid is not the same as correct

make_valid() guarantees a geometry obeys the OGC rules — not that it represents reality. A repaired bowtie polygon is now topologically valid but may have an area you never intended. Validity is a floor, not a guarantee; spot-check repaired features, especially if only a few rows were affected but they carry a lot of weight.
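One way to spot-check repairs is to record each feature's area before and after and review the rows that changed. The sketch below assumes a toy layer with a hypothetical parcel_id column, and uses GeoSeries.make_valid() as the vectorised form of the repair.

```python
import geopandas as gpd
from shapely.geometry import Polygon

gdf = gpd.GeoDataFrame(
    {"parcel_id": ["a", "b"]},
    geometry=[
        Polygon([(0, 0), (2, 2), (2, 0), (0, 2)]),   # bowtie: invalid
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),   # plain square: valid
    ],
)

was_invalid = ~gdf.geometry.is_valid
area_before = gdf.geometry.area
gdf["geometry"] = gdf.geometry.make_valid()
area_after = gdf.geometry.area

# One row per repaired feature, with the area it had and the area it has now
report = gdf.loc[was_invalid, ["parcel_id"]].assign(
    area_before=area_before[was_invalid],
    area_after=area_after[was_invalid],
)
report["area_change"] = report["area_after"] - report["area_before"]
print(report)
```

A large area_change is not proof of a bad repair, but it tells you which features deserve a look on a map.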

Mixed geometry types

A column mixing Polygon and MultiPolygon is common and usually fine in GeoPandas and GeoPackage. But some operations and consumers want one type — promote everything to multi-part before export. Mixing fundamentally different types (points with polygons) almost always signals a data problem to investigate, not normalise away.
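Promoting to multi-part can be done with a small wrapper function. This is a minimal sketch for a polygon layer; `to_multipolygon` is a hypothetical helper name and the geometries are illustrative.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def to_multipolygon(geom):
    """Wrap a single Polygon in a MultiPolygon; leave everything else untouched."""
    return MultiPolygon([geom]) if isinstance(geom, Polygon) else geom

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
far_square = Polygon([(5, 5), (6, 5), (6, 6), (5, 6)])
gdf = gpd.GeoDataFrame(geometry=[square, MultiPolygon([square, far_square])])

gdf["geometry"] = gdf.geometry.apply(to_multipolygon)
print(gdf.geom_type.value_counts())
```

The wrapper changes only the container type; coordinates and areas are unchanged.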

Large files and memory

read_file() loads everything into memory. For large datasets, use the bbox or mask argument to read only your area of interest, the rows argument to sample, or read in chunks via pyogrio. Reproject and clean the subset, not the whole continent.

Re-checking after every transform

Repairs can introduce new defects: make_valid() makes empties, reprojection can fail on geometries near the poles or antimeridian, and a join inflates duplicates. Treat the inspection block from step one as something you can re-run at any point. Cheap to run, expensive to skip.
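To make the inspection easy to re-run, it can be wrapped in a function that returns counts rather than printing them. The sketch below is one possible shape; `quality_report` is a hypothetical name and the demo layer is deliberately full of defects.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

def quality_report(gdf):
    """Count the defects the cleaning steps are meant to remove."""
    geom = gdf.geometry
    present = geom.notna() & ~geom.is_empty
    return {
        "rows": len(gdf),
        "null": int(geom.isna().sum()),
        "empty": int((geom.notna() & geom.is_empty).sum()),
        "invalid": int((present & ~geom.is_valid).sum()),
        "has_crs": gdf.crs is not None,
    }

demo = gpd.GeoDataFrame(
    geometry=[
        None,                                        # null
        Polygon(),                                   # empty
        Polygon([(0, 0), (2, 2), (2, 0), (0, 2)]),   # bowtie: invalid
        Point(1, 1),                                 # fine
    ],
    crs="EPSG:4326",
)
print(quality_report(demo))
```

Calling it before and after each transform gives you a comparable set of numbers, so an unexpected jump in any count points at the step that caused it.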

CRS for area and distance

EPSG:3857 (Web Mercator) is convenient and used above as a placeholder, but its distances and areas are distorted away from the equator. For real measurement use an equal-area or local projected CRS for your region (a UTM zone, a national grid, or an Albers equal-area). Pick the CRS for the question you are answering.
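If you do not already know a suitable local CRS, GeoPandas can suggest a UTM zone from the data's extent with estimate_utm_crs(). The sketch below uses an invented 0.01-degree square near Berlin. UTM is not equal-area, so treat this as a reasonable default for small regions rather than the right answer for every question.

```python
import geopandas as gpd
from shapely.geometry import box

# Illustrative 0.01-degree square near Berlin, stored in longitude/latitude
gdf = gpd.GeoDataFrame(geometry=[box(13.40, 52.50, 13.41, 52.51)], crs="EPSG:4326")

utm = gdf.estimate_utm_crs()          # picks a UTM zone from the data's bounds
projected = gdf.to_crs(utm)

area_m2 = projected.geometry.area.iloc[0]
print(utm.name, round(area_m2))
```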

FAQ

What is the difference between a null and an empty geometry?

A null geometry is None in the geometry cell — there is no geometry object at all, and notna() catches it. An empty geometry is a real object that contains no coordinates, such as POLYGON EMPTY; it passes a null check but fails area, length, and distance calculations. You must test for both with gdf.geometry.notna() & ~gdf.geometry.is_empty.
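A three-row GeoSeries makes the difference concrete. This is a minimal illustration with made-up geometries.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

gs = gpd.GeoSeries([None, Polygon(), Point(1, 1)])

is_null = gs.isna()          # True only for the None entry
is_empty = gs.is_empty       # True only for the coordinate-free polygon
usable = gs.notna() & ~gs.is_empty

print(is_null.tolist(), is_empty.tolist(), usable.tolist())
```

Only the last row passes both checks, which is why the filter combines them.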

Should I use make_valid() or buffer(0) to fix geometries?

Use make_valid() from Shapely 2.x as your default — it is purpose-built and preserves your geometry's intent without distorting it. buffer(0) is an old trick that sometimes repairs polygons but can silently delete parts of a shape or change its area, and it does not help with lines or points. Reach for buffer(0) only as a fallback when make_valid() produces something you cannot use.

Does set_crs() reproject my data?

No. set_crs() only attaches or relabels the CRS metadata — the coordinate numbers do not move. Use it only when a file is missing a CRS it should have had and you know the correct one. To actually transform coordinates from one CRS to another, use to_crs().
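The difference is easy to see on a single point. In this sketch the coordinates and the choice of EPSG:4326 as the source CRS are assumptions made for the example.

```python
import geopandas as gpd
from shapely.geometry import Point

raw = gpd.GeoSeries([Point(13.4, 52.5)])      # no CRS attached

labelled = raw.set_crs(epsg=4326)             # metadata only
reprojected = labelled.to_crs(epsg=3857)      # coordinates transformed

print(raw.iloc[0].coords[0])
print(labelled.iloc[0].coords[0])
print(reprojected.iloc[0].coords[0])
```

The first two lines print the same numbers; only the third shows metre-scale values.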

Why GeoPackage instead of Shapefile?

GeoPackage is a single SQLite-based file with no 2 GB size limit, no 10-character field-name truncation, reliable CRS and Unicode handling, and support for multiple layers. Shapefile is an older multi-file format with all of those limitations. Read shapefiles when you are handed them, but write your cleaned output to .gpkg with driver="GPKG".

In what order should I clean spatial data?

Inspect, confirm or set the CRS, reproject, drop null and empty geometries, repair invalid geometries, re-drop any new empties, remove duplicates, normalise attributes, validate, then export. The order matters because each step assumes the previous one already ran. For example, validity repair expects non-null geometries, and you re-check for empties because make_valid() can create them.

How do I know if my CRS is actually correct?

Check gdf.crs to see the declared CRS, then check gdf.total_bounds and confirm the coordinate ranges match where the data should sit on Earth. Longitude/latitude data should have x values within -180 to 180 and y values within -90 to 90; projected data will usually have much larger values, often in metres. If the bounds put your features in the wrong place, the declared CRS is wrong. Do not "fix" it with set_crs() unless you know the true source projection.

Why do I still have duplicate geometries after drop_duplicates()?

drop_duplicates() on attribute columns only catches rows whose attributes match — two records of the same shape with different IDs survive. Geometrically identical shapes can also differ in vertex order or starting point, so a raw comparison misses them. Use gdf.geometry.normalize().duplicated() to canonicalise geometries before comparing, which collapses logically identical shapes.
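The effect of normalize() can be seen with one square digitised from two different starting vertices. The polygons below are illustrative.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# The same unit square, digitised from two different starting vertices
a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
b = Polygon([(1, 0), (1, 1), (0, 1), (0, 0)])
gs = gpd.GeoSeries([a, b])

raw_dupes = gs.duplicated()                     # compares vertex sequences as stored
canonical_dupes = gs.normalize().duplicated()   # compares canonical forms

print(raw_dupes.tolist(), canonical_dupes.tolist())
```

The raw comparison treats the two rows as different, while the normalised comparison flags the second as a duplicate of the first.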

Do I need to reproject before calculating area or distance?

Yes, if your data is in a geographic CRS like EPSG:4326. Area and length computed on degrees are meaningless. Reproject to a metric CRS appropriate for your region — a UTM zone or an equal-area projection — before any measurement, and do it early so every downstream calculation uses correct units.
