The Python GIS Data Cleaning Checklist: From Raw Download to Analysis-Ready
You downloaded a dataset, opened it in Python, and ran your first spatial join — and got an empty result, a TopologyException, or a silent reprojection that placed your features in the ocean. Welcome to spatial data cleaning, the unglamorous work that decides whether the rest of your analysis is trustworthy.
This guide gives you a complete, ordered mental model for turning a raw download into an analysis-ready GeoDataFrame using current GeoPandas (1.x) and Shapely (2.x). Work through it top to bottom the first few times. Once the order is in your head, you will spot which step a given dataset actually needs.
Problem statement
Real-world spatial data is messy in ways that tabular data is not. A shapefile from a city open-data portal, a GeoJSON export from someone's desktop GIS, or a scraped collection of features all carry their own defects. Worse, many of these defects are invisible until a downstream operation fails — or, more dangerously, succeeds with wrong numbers.
Geometry problems you will hit constantly:
- Invalid geometries: self-intersecting (bowtie) polygons, rings that touch at a point, holes that escape their shell. Shapely may raise an error or return wrong results from overlay and area calculations.
- Null geometries: rows where the geometry cell is literally None. They survive most operations until one suddenly errors.
- Empty geometries: a non-null geometry containing no coordinates (POLYGON EMPTY). These pass null checks but break area, length, and distance math.
- Mixed geometry types: Polygon and MultiPolygon (or worse, points and lines) in one column, which some formats and operations refuse to handle.
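To make the first bullet concrete, here is a minimal sketch using Shapely alone. The bowtie coordinates are invented for illustration. The point is that the area call does not fail; it quietly returns a number that does not describe the shape.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity

# Hypothetical bowtie: the ring crosses itself at (1, 1)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

print(bowtie.is_valid)           # False
print(explain_validity(bowtie))  # reason string reported by GEOS
print(bowtie.area)               # no error, but the two lobes cancel each other out
```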
Coordinate and attribute problems that quietly corrupt results:
- Missing or wrong CRS: no .crs at all, or a defined CRS that does not match the actual coordinate values.
- CRS mismatch between layers: two datasets that look fine alone but never intersect because they are in different projections.
- Duplicate features: the same parcel digitised twice, or a join that fanned rows out.
- Inconsistent attributes: "N/A", "", and None all meaning "missing"; whitespace, mixed case, and numbers stored as text in key columns.
Skip the cleaning and you do not get an error — you get a plausible-looking answer that is wrong. That is the real cost.
Quick answer
Clean spatial data in a fixed order: load and inspect, fix the CRS, remove null and empty geometries, repair invalid ones with make_valid(), drop duplicates, normalise attributes, then export to GeoPackage. Doing it in this order means each step works on data the previous step already made safe.
Here is the whole pipeline in one block — adapt the column names to your dataset.
import geopandas as gpd
from shapely import make_valid
# 1. Load and inspect
gdf = gpd.read_file("data/parcels.shp")
print(gdf.shape, gdf.crs, gdf.geom_type.value_counts(), sep="\n")
# 2. Confirm / repair the CRS, then project to a metric CRS for analysis
if gdf.crs is None:
    gdf = gdf.set_crs(epsg=4326)  # ONLY if you know the true source CRS
gdf = gdf.to_crs(epsg=3857) # use a CRS appropriate for your area
# 3. Drop null and empty geometries
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()
# 4. Repair invalid geometries
invalid = ~gdf.geometry.is_valid
gdf.loc[invalid, "geometry"] = gdf.loc[invalid, "geometry"].apply(make_valid)
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy() # make_valid can drop slivers
# 5. Remove duplicates (attribute + geometry)
gdf = gdf.drop_duplicates(subset=[c for c in gdf.columns if c != "geometry"]).copy()
gdf = gdf[~gdf.geometry.normalize().duplicated()].copy()
# 6. Normalise attribute columns
gdf.columns = gdf.columns.str.strip().str.lower()
for col in gdf.select_dtypes(include="object").columns:
    gdf[col] = gdf[col].str.strip().replace({"": None, "N/A": None, "n/a": None})
# 7. Export cleanly
gdf.to_file("data/parcels_clean.gpkg", layer="parcels", driver="GPKG")
The cleaning checklist
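- Load and inspect the data.
- Confirm or repair the CRS, then reproject.
- Drop null and empty geometries.
- Check validity with is_valid and repair with make_valid().
- Remove duplicates.
- Normalise attribute columns.
- Validate against expectations.
- Export to GeoPackage.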
Step-by-step solution
Load & inspect
Read the file and look before you touch anything. The shape, CRS, geometry-type counts, and dtypes tell you which of the later steps you actually need.
import geopandas as gpd
gdf = gpd.read_file("data/parcels.shp")
print("rows, cols:", gdf.shape)
print("crs:", gdf.crs)
print(gdf.geom_type.value_counts())
print(gdf.dtypes)
print("nulls:", gdf.geometry.isna().sum())
print("empties:", gdf.geometry.is_empty.sum())
print("invalid:", (~gdf.geometry.is_valid).sum())
Confirm/repair the CRS
A missing CRS and a wrong CRS are different problems. set_crs() only labels the data — it does not move coordinates — so use it solely when the file lost a CRS it should have had. Use to_crs() to actually reproject.
# Label a missing CRS (only if you KNOW the source projection)
if gdf.crs is None:
    gdf = gdf.set_crs(epsg=4326, allow_override=False)
# Reproject for analysis — pick a metric CRS suited to your area of interest
gdf = gdf.to_crs(epsg=3857)
print(gdf.crs)
print(gdf.total_bounds) # sanity-check the coordinate ranges
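The difference between labelling and reprojecting is easy to check on a single point. The following is a small sketch with an invented coordinate, assuming it really is longitude/latitude:

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical point given as longitude/latitude with no CRS attached
raw = gpd.GeoSeries([Point(10.0, 50.0)])

labelled = raw.set_crs(epsg=4326)       # attaches metadata only
projected = labelled.to_crs(epsg=3857)  # transforms the coordinates

print(labelled.iloc[0].x, labelled.iloc[0].y)    # still 10.0 50.0
print(projected.iloc[0].x, projected.iloc[0].y)  # now Web Mercator metres
```

If the numbers do not change after a call you expected to reproject, you probably called set_crs() where you meant to_crs().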
Drop null and empty geometries
Null and empty are distinct: notna() catches None, and is_empty catches geometries with no coordinates. You need both checks.
before = len(gdf)
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()
print(f"dropped {before - len(gdf)} null/empty rows")
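A three-row series is enough to show why one check alone is not sufficient. This is an illustrative sketch with made-up geometries:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# One real point, one null, one empty polygon
s = gpd.GeoSeries([Point(0, 0), None, Polygon()])

print(s.isna().tolist())    # [False, True, False]
print(s.is_empty.tolist())  # [False, False, True]

keep = s.notna() & ~s.is_empty
print(keep.tolist())        # [True, False, False]
```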
Check and repair validity
Find invalid geometries with is_valid, then repair them with Shapely 2.x's make_valid(). It is the standard tool and, unlike buffer(0), it will not silently distort your shapes.
from shapely import make_valid
invalid_mask = ~gdf.geometry.is_valid
print("invalid:", invalid_mask.sum())
gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(make_valid)
# make_valid can turn a degenerate polygon into an empty or collection — re-clean
gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()
assert gdf.geometry.is_valid.all()
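The re-clean above removes empties but leaves any GeometryCollection in place. If the layer is meant to hold only polygons, one option is to keep just the polygonal parts. The helper below, keep_polygons, is a hypothetical name and not part of Shapely or GeoPandas; treat it as a sketch to adapt.

```python
from shapely.geometry import GeometryCollection, LineString, MultiPolygon, Polygon

def keep_polygons(geom):
    """Return only the polygonal parts of a geometry, or None if there are none."""
    if geom is None or geom.is_empty:
        return None
    if geom.geom_type in ("Polygon", "MultiPolygon"):
        return geom
    if geom.geom_type == "GeometryCollection":
        polys = []
        for part in geom.geoms:
            if part.geom_type == "Polygon":
                polys.append(part)
            elif part.geom_type == "MultiPolygon":
                polys.extend(part.geoms)
        if not polys:
            return None
        return polys[0] if len(polys) == 1 else MultiPolygon(polys)
    return None  # lines and points are discarded

# Stand-in for a repair result that mixes a polygon with a leftover line
mixed = GeometryCollection([
    Polygon([(0, 0), (2, 0), (0, 1)]),
    LineString([(0, 2), (0, 1)]),
])
cleaned = keep_polygons(mixed)
print(cleaned.geom_type, cleaned.area)
```

Applied to a frame, this would be something like gdf.geometry.apply(keep_polygons) followed by another null filter. Discarding the line parts is a choice, so check that it suits your data before adopting it.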
Remove duplicates
Duplicates come in two flavours: identical attribute rows and identical geometries. Handle both. normalize() puts geometries into a canonical form so that logically identical shapes compare equal.
# Exact attribute-row duplicates
attr_cols = [c for c in gdf.columns if c != gdf.geometry.name]
gdf = gdf.drop_duplicates(subset=attr_cols).copy()
# Geometric duplicates (same shape, possibly different vertex order)
gdf = gdf[~gdf.geometry.normalize().duplicated()].copy()
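To see what normalize() buys you, compare the same square written from two different starting vertices. This is a Shapely-only sketch with invented coordinates:

```python
from shapely.geometry import Polygon

a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
b = Polygon([(1, 0), (1, 1), (0, 1), (0, 0)])  # same square, different start vertex

print(a == b)                          # False: coordinate sequences differ
print(a.equals(b))                     # True: same point set
print(a.normalize() == b.normalize())  # True: canonical form matches
```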
Normalise attribute columns
Standardise column names, strip whitespace, collapse the various "missing" tokens to a real None, and fix dtypes that loaded as text. This is what makes joins and filters behave.
import pandas as pd

# Tidy column names
gdf.columns = gdf.columns.str.strip().str.lower()
# Clean string columns
for col in gdf.select_dtypes(include="object").columns:
    gdf[col] = (gdf[col].str.strip()
                .replace({"": None, "N/A": None, "n/a": None, "NULL": None}))
# Coerce a numeric column stored as text (skip if the column is absent)
if "area_m2" in gdf.columns:
    gdf["area_m2"] = pd.to_numeric(gdf["area_m2"], errors="coerce")
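The same idea can be tried on a plain pandas frame before touching real data. The column names and values below are invented, and this sketch uses Series.mask with isin rather than replace, which is a swapped-in but equivalent way to blank out the "missing" tokens.

```python
import pandas as pd

df = pd.DataFrame({
    "zone": [" R1 ", "N/A", "", "r2"],
    "area_m2": ["120.5", "n/a", "88", " 40 "],
})

MISSING = ["", "N/A", "n/a", "NULL"]  # tokens treated as missing in this sketch

# Strip whitespace, then blank out anything that is only a "missing" token
zone = df["zone"].str.strip()
df["zone"] = zone.mask(zone.isin(MISSING))

# Text that cannot be parsed as a number becomes NaN instead of raising
df["area_m2"] = pd.to_numeric(df["area_m2"].str.strip(), errors="coerce")

print(df)
```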
Validate against expectations
Before exporting, assert the things that must be true. A handful of cheap checks here catch problems that would otherwise surface three notebooks later.
assert gdf.crs is not None, "CRS is undefined"
assert gdf.geometry.notna().all(), "null geometries remain"
assert (~gdf.geometry.is_empty).all(), "empty geometries remain"
assert gdf.geometry.is_valid.all(), "invalid geometries remain"
assert gdf["parcel_id"].is_unique, "duplicate parcel ids"
print(f"{len(gdf)} clean features ready")
Export cleanly
Write to GeoPackage, not Shapefile. GeoPackage is a single file, has no 10-character field-name limit or 2 GB cap, stores the CRS reliably, and supports full datatypes and Unicode.
gdf.to_file("data/parcels_clean.gpkg", layer="parcels", driver="GPKG")
# Round-trip check
check = gpd.read_file("data/parcels_clean.gpkg", layer="parcels")
print(check.shape, check.crs)
Explanation
The order is not arbitrary: each step depends on the one before it being done. Inspecting first tells you which steps matter; there is no point repairing validity on a dataset with no invalid geometries. CRS comes early because reprojection touches every coordinate, and you want it settled before you compute anything geometric. Do area or distance math in a geographic CRS like EPSG:4326 and your numbers are in degrees, which are meaningless as measurements. A metric CRS suited to your region must be in place first.
Null and empty removal precedes validity repair because make_valid() and is_valid expect real geometries to chew on; feeding them None or empty shapes wastes effort and clutters your masks. And because make_valid() can itself produce empties (when it collapses a degenerate sliver) or GeometryCollections, you re-run the null/empty filter after repair and re-check the geometry types. That second pass is the step beginners forget.
Deduplication comes after repair so that two copies of the same broken polygon normalise to the same valid shape and collapse together — dedupe first and a later repair might make two surviving rows identical anyway. Attribute normalisation is last among the transforms because it is independent of geometry; doing it at the end keeps it from interfering with the geometric steps. Validation and export close the loop: you assert your invariants, then write to a format that preserves them.
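The ordering argued for above can be exercised end to end on a tiny synthetic frame. This is a sketch, not a drop-in cleaner: the function name clean, the sample coordinates, and the choice of EPSG:32632 (picked only because the demo shapes sit near 10°E) are all assumptions.

```python
import geopandas as gpd
from shapely.geometry import Polygon

def clean(gdf, metric_epsg):
    """Apply the geometric cleaning steps in the order discussed above."""
    gdf = gdf.to_crs(epsg=metric_epsg)                               # CRS first
    gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()  # null/empty
    gdf[gdf.geometry.name] = gdf.geometry.make_valid()               # repair
    gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()  # second pass
    wkb = gdf.geometry.normalize().to_wkb()                          # canonical bytes
    return gdf[~wkb.duplicated()].copy()                             # dedupe last

square = Polygon([(10.00, 50.00), (10.01, 50.00), (10.01, 50.01), (10.00, 50.01)])
same_square = Polygon([(10.01, 50.00), (10.01, 50.01), (10.00, 50.01), (10.00, 50.00)])
bowtie = Polygon([(10.02, 50.00), (10.03, 50.01), (10.03, 50.00), (10.02, 50.01)])

raw = gpd.GeoDataFrame(
    {"name": ["a", "a_copy", "bowtie", "null", "empty"]},
    geometry=[square, same_square, bowtie, None, Polygon()],
    crs="EPSG:4326",
)

cleaned = clean(raw, metric_epsg=32632)
print(len(raw), "->", len(cleaned))
print(cleaned["name"].tolist())
```

Comparing WKB bytes of the normalised geometries is one way to detect geometric duplicates; it is strict, so shapes that differ by tiny floating-point amounts will not match.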
Edge cases or notes
GeoPackage vs Shapefile for output
Always prefer driver="GPKG" for cleaned output. Shapefile truncates field names to 10 characters, caps at 2 GB, splits one logical dataset across several sidecar files, and has shaky encoding and CRS storage. GeoPackage is a single SQLite file with none of those limits. Read shapefiles you are given; do not create new ones.
A defined CRS does not guarantee correctness
gdf.crs only tells you what label the file carries — not that the label is right. Data is routinely exported with a wrong or default CRS. Always sanity-check total_bounds against where the data should be on Earth before trusting the CRS, and never use set_crs() to "fix" coordinates that are simply in a different projection.
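The bounds check can be turned into a rough heuristic. The two function names below are hypothetical, and the test is deliberately crude: projected data that happens to sit near its CRS origin can also fall inside the degree ranges, so a pass is not proof.

```python
def bounds_look_geographic(bounds):
    """True if (minx, miny, maxx, maxy) fits inside longitude/latitude ranges."""
    minx, miny, maxx, maxy = bounds
    return -180 <= minx <= maxx <= 180 and -90 <= miny <= maxy <= 90

def crs_label_is_plausible(is_geographic, bounds):
    """Compare what the CRS claims (degrees or not) with what the numbers look like."""
    return is_geographic == bounds_look_geographic(bounds)

# Degree-like bounds with a geographic label: plausible
print(crs_label_is_plausible(True, (-0.5, 51.3, 0.3, 51.7)))
# Metre-like bounds carrying a geographic label: suspicious
print(crs_label_is_plausible(True, (500000, 5700000, 510000, 5710000)))
```

With a real frame, the inputs would be gdf.crs.is_geographic and gdf.total_bounds.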
Valid is not the same as correct
make_valid() guarantees a geometry obeys the OGC rules — not that it represents reality. A repaired bowtie polygon is now topologically valid but may have an area you never intended. Validity is a floor, not a guarantee; spot-check repaired features, especially if only a few rows were affected but they carry a lot of weight.
Mixed geometry types
A column mixing Polygon and MultiPolygon is common and usually fine in GeoPandas and GeoPackage. But some operations and consumers want one type — promote everything to multi-part before export. Mixing fundamentally different types (points with polygons) almost always signals a data problem to investigate, not normalise away.
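Promoting to multi-part can be done with a small helper. to_multi is a hypothetical name, and the sketch assumes null geometries have already been removed, since it reads geom_type on every row.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def to_multi(geom):
    """Wrap a Polygon in a one-part MultiPolygon; leave everything else untouched."""
    return MultiPolygon([geom]) if geom.geom_type == "Polygon" else geom

single = Polygon([(0, 0), (1, 0), (1, 1)])
multi = MultiPolygon([
    Polygon([(2, 2), (3, 2), (3, 3)]),
    Polygon([(5, 5), (6, 5), (6, 6)]),
])
gdf = gpd.GeoDataFrame({"id": [1, 2]}, geometry=[single, multi])

gdf["geometry"] = gdf.geometry.apply(to_multi)
print(gdf.geom_type.tolist())
```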
Large files and memory
read_file() loads everything into memory. For large datasets, use the bbox or mask argument to read only your area of interest, the rows argument to sample, or read in chunks via pyogrio. Reproject and clean the subset, not the whole continent.
Re-checking after every transform
Repairs can introduce new defects: make_valid() makes empties, reprojection can fail on geometries near the poles or antimeridian, and a join inflates duplicates. Treat the inspection block from step one as something you can re-run at any point. Cheap to run, expensive to skip.
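One way to make the inspection re-runnable is to wrap it in a function that returns counts. geometry_report is a hypothetical helper, and the demo frame is synthetic.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

def geometry_report(gdf):
    """Summarise the defects the checklist cares about as a plain dict."""
    geom = gdf.geometry
    present = geom[geom.notna()]  # validity and emptiness only make sense for real objects
    return {
        "rows": len(gdf),
        "crs": None if gdf.crs is None else gdf.crs.to_string(),
        "null": int(geom.isna().sum()),
        "empty": int(present.is_empty.sum()),
        "invalid": int((~present.is_valid).sum()),
        "types": present.geom_type.value_counts().to_dict(),
    }

bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
demo = gpd.GeoDataFrame(
    {"id": [1, 2, 3, 4]},
    geometry=[Point(0, 0), None, Polygon(), bowtie],
    crs="EPSG:4326",
)
report = geometry_report(demo)
print(report)
```

Calling it before and after each transform gives a quick diff of what the step changed.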
CRS for area and distance
EPSG:3857 (Web Mercator) is convenient and used above as a placeholder, but its distances and areas are distorted away from the equator. For real measurement use an equal-area or local projected CRS for your region (a UTM zone, a national grid, or an Albers equal-area). Pick the CRS for the question you are answering.
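The size of the distortion is easy to see on one shape. The sketch below uses an invented 0.01 by 0.01 degree square near 10°E, 50°N and GeoPandas' estimate_utm_crs() to pick a local UTM zone; a UTM zone is only one reasonable choice, and an equal-area projection may suit area questions better.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical 0.01 x 0.01 degree square near 10°E, 50°N
square = Polygon([(10.00, 50.00), (10.01, 50.00), (10.01, 50.01), (10.00, 50.01)])
gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[square], crs="EPSG:4326")

utm = gdf.estimate_utm_crs()  # chooses a UTM zone from the data's extent
area_utm = gdf.to_crs(utm).area.iloc[0]
area_mercator = gdf.to_crs(epsg=3857).area.iloc[0]

print(utm.to_epsg())
print(round(area_utm), round(area_mercator))
print(round(area_mercator / area_utm, 2))  # inflation factor of Web Mercator at this latitude
```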
Internal links
- How to Fix Invalid Geometries in Python (GeoPandas)
- How to Remove Null and Empty Geometries in GeoPandas
- How to Fix CRS Mismatch in GeoPandas
- How to Reproject Spatial Data in Python (GeoPandas)
- How to Read a Shapefile in Python with GeoPandas
- How to Read and Write GeoPackage Files in Python
FAQ
What is the difference between a null and an empty geometry?
A null geometry is None in the geometry cell — there is no geometry object at all, and notna() catches it. An empty geometry is a real object that contains no coordinates, such as POLYGON EMPTY; it passes a null check but fails area, length, and distance calculations. You must test for both with gdf.geometry.notna() & ~gdf.geometry.is_empty.
Should I use make_valid() or buffer(0) to fix geometries?
Use make_valid() from Shapely 2.x as your default — it is purpose-built and preserves your geometry's intent without distorting it. buffer(0) is an old trick that sometimes repairs polygons but can silently delete parts of a shape or change its area, and it does not help with lines or points. Reach for buffer(0) only as a fallback when make_valid() produces something you cannot use.
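You can compare the two repairs side by side on a self-crossing bowtie. The coordinates are invented, and the buffer(0) result is printed rather than asserted because it can vary with the GEOS version.

```python
from shapely import make_valid
from shapely.geometry import Polygon

# Two triangular lobes of area 1 each, joined by a self-crossing at (1, 1)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

repaired = make_valid(bowtie)
buffered = bowtie.buffer(0)

print(repaired.geom_type, repaired.area)  # both lobes are kept
print(buffered.geom_type, buffered.area)  # compare: may cover less of the shape
```

If the two areas differ, buffer(0) has dropped part of the geometry, which is the failure mode the answer above warns about.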
Does set_crs() reproject my data?
No. set_crs() only attaches or relabels the CRS metadata — the coordinate numbers do not move. Use it only when a file is missing a CRS it should have had and you know the correct one. To actually transform coordinates from one CRS to another, use to_crs().
Why GeoPackage instead of Shapefile?
GeoPackage is a single SQLite-based file with no 2 GB size limit, no 10-character field-name truncation, reliable CRS and Unicode handling, and support for multiple layers. Shapefile is an older multi-file format with all of those limitations. Read shapefiles when you are handed them, but write your cleaned output to .gpkg with driver="GPKG".
In what order should I clean spatial data?
Inspect, fix and set the CRS, reproject, drop null and empty geometries, repair invalid geometries, re-drop any new empties, remove duplicates, normalise attributes, validate, then export. The order matters because each step assumes the previous one already ran — for example, validity repair expects non-null geometries, and you re-check for empties because make_valid() can create them.
How do I know if my CRS is actually correct?
Check gdf.crs to see the declared CRS, then check gdf.total_bounds and confirm the coordinate ranges match where the data should sit on Earth. Latitude/longitude data should fall within roughly -180 to 180 and -90 to 90; projected data will have large metre values. If the bounds put your features in the wrong place, the declared CRS is wrong — do not "fix" it with set_crs() unless you know the true source projection.
Why do I still have duplicate geometries after drop_duplicates()?
drop_duplicates() on attribute columns only catches rows whose attributes match — two records of the same shape with different IDs survive. Geometrically identical shapes can also differ in vertex order or starting point, so a raw comparison misses them. Use gdf.geometry.normalize().duplicated() to canonicalise geometries before comparing, which collapses logically identical shapes.
Do I need to reproject before calculating area or distance?
Yes, if your data is in a geographic CRS like EPSG:4326. Area and length computed on degrees are meaningless. Reproject to a metric CRS appropriate for your region — a UTM zone or an equal-area projection — before any measurement, and do it early so every downstream calculation uses correct units.