How to Standardise and Repair CRS Across a Folder of Files in Python

You inherited a folder of vector data from three different people, two software packages, and one government portal that hasn't been updated since 2014. Some files are in WGS84, some are in a national grid, and at least one has no CRS at all. When you load them on a map, nothing lines up. This guide shows you how to march through that folder, repair or relabel each file's coordinate reference system, reproject everything to one common CRS, and write clean output — without a single bad file killing the whole run.

Problem statement

You have a directory full of vector files and you are hitting some mix of these symptoms:

  • Mixed CRSs in one folder — some files are EPSG:4326, others are EPSG:27700 or EPSG:3857, and you have to handle each differently.
  • Files with no .crs: gdf.crs returns None, usually because a shapefile is missing its .prj sidecar, so GeoPandas has no idea what the coordinates mean.
  • Layers that never line up — you overlay two datasets and they sit thousands of kilometres apart, or one collapses to a dot, because they are stored in different units and projections.
  • Spatial joins and clips warn or return nonsense: GeoPandas flags mismatched CRSs with a warning rather than fixing them, and if you ignore it the results are empty or wrong.
  • No record of what you fixed — you patch files by hand and a week later you cannot remember which ones were relabelled versus reprojected.

The goal: one repeatable script that discovers every file, decides whether each needs a label fix or a reprojection, writes standardised output, and hands you a summary of exactly what changed.

Quick answer

Loop over the folder with pathlib, read each file, fill in a missing CRS with set_crs() only when you know the true source CRS, reproject everything to one common target with to_crs(), and write each result to GeoPackage. Wrap every file in its own try/except so one corrupt file does not abort the batch, and collect a row of metadata per file for a summary table at the end.

[Figure: a folder of files in mixed coordinate systems, reprojected to one common CRS.]

import geopandas as gpd
from pathlib import Path

src_dir = Path("data/raw")
out_dir = Path("data/clean")
out_dir.mkdir(parents=True, exist_ok=True)

TARGET_CRS = "EPSG:4326"        # common output CRS for the whole folder
ASSUME_CRS = "EPSG:27700"       # only used when a file has NO crs at all
PATTERNS = ("*.shp", "*.geojson", "*.gpkg")

summary = []
for path in sorted(p for pat in PATTERNS for p in src_dir.glob(pat)):
    try:
        gdf = gpd.read_file(path)
        original = str(gdf.crs)

        if gdf.crs is None:
            gdf = gdf.set_crs(ASSUME_CRS)   # label only, no coordinates move
            action = f"set_crs -> {ASSUME_CRS}"
        else:
            action = "kept"

        if gdf.crs != TARGET_CRS:            # compare with the target, not a hard-coded code
            gdf = gdf.to_crs(TARGET_CRS)     # reproject, coordinates move
            action += f", to_crs -> {TARGET_CRS}"

        out_path = out_dir / f"{path.stem}.gpkg"
        gdf.to_file(out_path, driver="GPKG")
        summary.append((path.name, original, str(gdf.crs), len(gdf), action))
    except Exception as exc:
        summary.append((path.name, "ERROR", "-", 0, str(exc)))

for name, before, after, n, action in summary:
    print(f"{name:30} {before:14} -> {after:14} {n:7} rows  | {action}")

That is the whole job. The rest of this article explains each decision so you can adapt it safely instead of running it blind.

Step-by-step solution

Discover the files

Use pathlib.Path plus glob to find every vector file. Globbing one pattern at a time and chaining the results keeps things readable and lets you control exactly which extensions you accept. Use rglob instead of glob if your data is nested in subfolders.

from pathlib import Path

src_dir = Path("data/raw")
PATTERNS = ("*.shp", "*.geojson", "*.json", "*.gpkg")

files = sorted(p for pat in PATTERNS for p in src_dir.glob(pat))
for f in files:
    print(f.name)

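A couple of notes: globbing *.shp deliberately ignores the .dbf, .shx, and .prj sidecars, because GeoPandas reads them automatically alongside the .shp. Also, a single .gpkg can hold multiple layers; gpd.read_file() reads only one layer per call (the first by default, or the one named with layer=), so a multi-layer GeoPackage needs one read per layer.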
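One thing the discovery step can catch early: because the output is named after each file's stem, roads.shp and roads.geojson would both be written to roads.gpkg. The sketch below wraps discovery in a function and reports such clashes. The helper names discover and find_stem_collisions are illustrative, not part of any library.

```python
from collections import defaultdict
from pathlib import Path

PATTERNS = ("*.shp", "*.geojson", "*.json", "*.gpkg")

def discover(src_dir, patterns=PATTERNS):
    """Return every matching file directly under src_dir, sorted, no duplicates."""
    src_dir = Path(src_dir)
    return sorted({p for pat in patterns for p in src_dir.glob(pat)})

def find_stem_collisions(paths):
    """Map each stem shared by two or more files to the names of those files."""
    by_stem = defaultdict(list)
    for p in paths:
        by_stem[p.stem].append(p.name)
    return {stem: names for stem, names in by_stem.items() if len(names) > 1}

# Demo on bare paths; nothing is read from disk here.
demo = [Path("roads.shp"), Path("roads.geojson"), Path("parcels.gpkg")]
collisions = find_stem_collisions(demo)
print(collisions)   # {'roads': ['roads.shp', 'roads.geojson']}
```

If collisions is not empty, include the original extension in the output name (for example roads_shp.gpkg) or resolve the duplicates before running the batch.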

Inspect each file's CRS

Before changing anything, look at what you actually have. The two questions that matter are "does this file have a CRS?" and "is it the CRS the data is really in?" The first is answered by gdf.crs is None; the second needs human judgement plus a sanity check on coordinates.

import geopandas as gpd
from pathlib import Path

for path in sorted(Path("data/raw").glob("*.shp")):
    gdf = gpd.read_file(path)
    if gdf.crs is None:
        print(f"{path.name}: NO CRS  bounds={gdf.total_bounds}")
    else:
        print(f"{path.name}: {gdf.crs.to_epsg()}  bounds={gdf.total_bounds}")

Read the total_bounds carefully. If a file claims EPSG:4326 but its bounds are [400000, 100000, 600000, 200000], those are metres, not degrees: the label is wrong and you have a mismatch, not a missing CRS. Longitude always falls within -180..180 and latitude within -90..90.

Set a missing CRS (only if you know the source)

When gdf.crs is None, you must supply the CRS the data was originally created in. set_crs() attaches a label and does not touch a single coordinate. The hard part is knowing the right value — check the data provider's documentation, a sibling .prj file, or the magnitude of the coordinates.

import geopandas as gpd

gdf = gpd.read_file("data/raw/parcels_no_prj.shp")

if gdf.crs is None:
    # We confirmed from the supplier that this is British National Grid.
    gdf = gdf.set_crs("EPSG:27700")

print(gdf.crs)               # EPSG:27700
print(gdf.total_bounds)      # eastings/northings in metres, unchanged

If you guess wrong here, every later step is wrong too. When you genuinely cannot determine the source CRS, do not invent one — log the file as skipped and move on.
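One way to avoid guessing is to keep an explicit table of files whose source CRS you have confirmed, and skip everything else that arrives without a CRS. The following is a minimal sketch. The KNOWN_SOURCE_CRS table, its single entry, and the resolve_source_crs helper are hypothetical names chosen for illustration.

```python
# Hypothetical lookup: only files whose source CRS has been confirmed go here.
KNOWN_SOURCE_CRS = {
    "parcels_no_prj.shp": "EPSG:27700",   # e.g. confirmed with the supplier
}

def resolve_source_crs(filename, declared_crs, known=KNOWN_SOURCE_CRS):
    """Return (crs, action); crs is None when the file should be skipped."""
    if declared_crs is not None:
        return declared_crs, "kept"                      # label already present
    if filename in known:
        return known[filename], f"set_crs -> {known[filename]}"
    return None, "SKIPPED: unknown source CRS"           # never invent a CRS

print(resolve_source_crs("parcels_no_prj.shp", None))
print(resolve_source_crs("mystery.shp", None))
print(resolve_source_crs("roads.gpkg", "EPSG:4326"))
```

In the batch loop you would pass path.name and gdf.crs, call set_crs() only when the action starts with "set_crs", and record the SKIPPED rows in the summary.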

Reproject to a common target CRS

Once every file has a correct CRS, reproject them all to one shared CRS with to_crs(). This is the step that actually moves coordinates so everything lines up. You can skip it when the file is already on target to avoid needless work.

import geopandas as gpd

TARGET_CRS = "EPSG:4326"
gdf = gpd.read_file("data/raw/roads_bng.gpkg")   # EPSG:27700

if gdf.crs is None:
    raise ValueError("Set a source CRS before reprojecting")

if gdf.crs != TARGET_CRS:        # compare with the target, not a hard-coded code
    gdf = gdf.to_crs(TARGET_CRS)

print(gdf.crs)               # EPSG:4326
print(gdf.total_bounds)      # now in degrees

Note the guard: never call to_crs() on a layer whose CRS is None — GeoPandas raises an error because it has no source to project from. Repair the label first, reproject second.
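The guard and the "already on target" shortcut can be folded into one small helper. This is a sketch under the assumption that comparing gdf.crs with a CRS string is good enough for your data; reproject_if_needed is an illustrative name, and the sample point is made up.

```python
import geopandas as gpd
from shapely.geometry import Point

def reproject_if_needed(gdf, target_crs):
    """Return (layer, moved). Refuses layers without a CRS."""
    if gdf.crs is None:
        raise ValueError("Set a source CRS before reprojecting")
    if gdf.crs == target_crs:          # already on target: nothing to do
        return gdf, False
    return gdf.to_crs(target_crs), True

# In-memory sample: one point in British National Grid (central London).
bng = gpd.GeoDataFrame(geometry=[Point(530000, 180000)], crs="EPSG:27700")

wgs, moved = reproject_if_needed(bng, "EPSG:4326")
same, moved_again = reproject_if_needed(wgs, "EPSG:4326")
print(moved, moved_again)   # True False
```

Comparing against the target value, rather than a fixed EPSG number, keeps the check correct if you later change the target CRS.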

Write out and handle errors per file

Write each cleaned layer to GeoPackage with driver="GPKG". Wrap the read-fix-write cycle for each file in try/except so that a corrupt geometry, a missing sidecar, or an unreadable file produces a logged error rather than a stack trace that stops the whole batch.

import geopandas as gpd
from pathlib import Path

out_dir = Path("data/clean")
out_dir.mkdir(parents=True, exist_ok=True)

for path in sorted(Path("data/raw").glob("*.shp")):
    try:
        gdf = gpd.read_file(path)
        if gdf.crs is None:
            gdf = gdf.set_crs("EPSG:27700")
        gdf = gdf.to_crs("EPSG:4326")
        gdf.to_file(out_dir / f"{path.stem}.gpkg", driver="GPKG")
        print(f"OK   {path.name}")
    except Exception as exc:
        print(f"FAIL {path.name}: {exc}")

Produce a summary report

Collect one record per file as you go, then print a table — or write it to CSV — showing the original CRS, the final CRS, the row count, and the action taken. This is your audit trail; keep it.

import geopandas as gpd
import pandas as pd
from pathlib import Path

rows = []
for path in sorted(Path("data/raw").glob("*.shp")):
    try:
        gdf = gpd.read_file(path)
        before = str(gdf.crs)
        if gdf.crs is None:
            gdf = gdf.set_crs("EPSG:27700")
        gdf = gdf.to_crs("EPSG:4326")
        rows.append({"file": path.name, "crs_before": before,
                     "crs_after": str(gdf.crs), "rows": len(gdf), "status": "ok"})
    except Exception as exc:
        rows.append({"file": path.name, "crs_before": "-",
                     "crs_after": "-", "rows": 0, "status": f"error: {exc}"})

report = pd.DataFrame(rows)
print(report.to_string(index=False))
Path("data/clean").mkdir(parents=True, exist_ok=True)   # make sure the report folder exists
report.to_csv("data/clean/_crs_report.csv", index=False)

Code examples

Example 1: Count files by CRS before you touch anything

import geopandas as gpd
from collections import Counter
from pathlib import Path

counts = Counter()
for path in sorted(Path("data/raw").glob("*.shp")):
    gdf = gpd.read_file(path, rows=1)        # read 1 row, cheap CRS probe
    counts[str(gdf.crs)] += 1

for crs, n in counts.most_common():
    print(f"{n:3} files  {crs}")

Reading a single row with rows=1 is enough to inspect the CRS and is far faster than loading full geometries when you only want a census of the folder.

Example 2: Detect a mislabelled CRS via bounds

import geopandas as gpd

gdf = gpd.read_file("data/raw/suspect.geojson")
xmin, ymin, xmax, ymax = gdf.total_bounds

# Check all four bounds, not just the minimums.
looks_like_degrees = (-180 <= xmin <= xmax <= 180) and (-90 <= ymin <= ymax <= 90)

if gdf.crs is not None and gdf.crs.to_epsg() == 4326 and not looks_like_degrees:
    print("Labelled EPSG:4326 but coordinates are not degrees -> wrong label")

Example 3: Relabel a wrong CRS without reprojecting

When a file carries the wrong label (e.g. it is really in metres but tagged EPSG:4326), use set_crs(..., allow_override=True) to correct the label, then reproject.

import geopandas as gpd

gdf = gpd.read_file("data/raw/grid_mislabelled.shp")

# It claims 4326 but the values are British National Grid metres.
gdf = gdf.set_crs("EPSG:27700", allow_override=True)   # fix the label
gdf = gdf.to_crs("EPSG:4326")                           # now reproject correctly

Without allow_override=True, GeoPandas refuses to replace an existing CRS and raises an error — a deliberate guard against accidental mislabelling.
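The refusal is easy to see on an in-memory layer. The sketch below builds a deliberately mislabelled GeoDataFrame (the point and its wrong label are made up for illustration), shows that a plain set_crs() is rejected, and then corrects the label with allow_override=True.

```python
import geopandas as gpd
from shapely.geometry import Point

# British National Grid metres, wrongly tagged as EPSG:4326.
gdf = gpd.GeoDataFrame(geometry=[Point(530000, 180000)], crs="EPSG:4326")

try:
    gdf.set_crs("EPSG:27700")                  # no override: rejected
    refused = False
except ValueError as exc:
    refused = True
    print(f"refused: {exc}")

fixed = gdf.set_crs("EPSG:27700", allow_override=True)   # label corrected
print(fixed.crs)             # EPSG:27700
print(fixed.total_bounds)    # same numbers as before, only the label changed
```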

Example 4: Recurse into subfolders and mirror the tree

import geopandas as gpd
from pathlib import Path

src_dir = Path("data/raw")
out_dir = Path("data/clean")

for path in sorted(src_dir.rglob("*.shp")):
    rel = path.relative_to(src_dir).with_suffix(".gpkg")
    dest = out_dir / rel
    dest.parent.mkdir(parents=True, exist_ok=True)
    gdf = gpd.read_file(path).to_crs("EPSG:4326")
    gdf.to_file(dest, driver="GPKG")

Example 5: Combine everything into one GeoPackage with many layers

import geopandas as gpd
from pathlib import Path

out_path = Path("data/clean/standardised.gpkg")
out_path.parent.mkdir(parents=True, exist_ok=True)

for path in sorted(Path("data/raw").glob("*.shp")):
    gdf = gpd.read_file(path)
    if gdf.crs is None:
        gdf = gdf.set_crs("EPSG:27700")
    gdf = gdf.to_crs("EPSG:4326")
    gdf.to_file(out_path, layer=path.stem, driver="GPKG")   # one layer per file

A single GeoPackage with one layer per file keeps a project tidy and avoids the multi-file sprawl of shapefiles.

Explanation

set_crs() versus to_crs()

These two methods are constantly confused, and the confusion is where data gets silently corrupted.

  • set_crs() labels. It tells GeoPandas "these coordinates are in this CRS". The numbers in the geometry column are not touched. Use it when the CRS is missing (None) or wrong, and you know the true source. It is metadata only.
  • to_crs() transforms. It takes coordinates that are correctly labelled and recalculates them into a different CRS. The numbers change. Use it to make different files agree.

The mental model: set_crs() writes the unit on the ruler; to_crs() actually measures with a different ruler. If you set_crs() to the wrong value, you have mislabelled the ruler, and every subsequent to_crs() produces garbage that looks plausible.
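A three-line experiment makes the difference concrete. The sketch uses a single made-up point in British National Grid and compares the bounds after each call.

```python
import geopandas as gpd
from shapely.geometry import Point

naive = gpd.GeoDataFrame(geometry=[Point(530000, 180000)])   # no CRS yet

labelled = naive.set_crs("EPSG:27700")    # metadata only
moved = labelled.to_crs("EPSG:4326")      # coordinates recalculated

print(naive.crs, naive.total_bounds)          # None, metre-sized numbers
print(labelled.crs, labelled.total_bounds)    # EPSG:27700, identical numbers
print(moved.crs, moved.total_bounds)          # EPSG:4326, small values in degrees
```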

Why a common CRS matters

GeoPandas will not reconcile mismatched CRSs for you. Spatial operations such as joins, clips, and overlays across two layers in different CRSs typically emit a CRS-mismatch warning and then compare the raw numbers, which is meaningless. Coordinates only have meaning relative to their CRS; [400000, 100000] in British National Grid and [51.5, -0.1] in WGS84 cannot be compared until they share a frame. Standardising the whole folder to one CRS means every downstream operation just works, and you never have to reason about per-file projections again.
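To make a mismatch impossible to overlook in your own pipeline, a small guard can run before any join or overlay. This is a sketch, not a GeoPandas feature: require_same_crs is an illustrative helper, and the two sample layers are made up.

```python
import geopandas as gpd
from shapely.geometry import Point

def require_same_crs(*layers):
    """Raise ValueError unless every layer has a CRS and they all match."""
    crss = [layer.crs for layer in layers]
    if any(c is None for c in crss):
        raise ValueError("At least one layer has no CRS")
    if any(c != crss[0] for c in crss[1:]):
        raise ValueError(f"CRS mismatch: {[str(c) for c in crss]}")

a = gpd.GeoDataFrame(geometry=[Point(-0.1, 51.5)], crs="EPSG:4326")
b = gpd.GeoDataFrame(geometry=[Point(530000, 180000)], crs="EPSG:27700")

try:
    require_same_crs(a, b)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True

require_same_crs(a, b.to_crs(a.crs))   # passes once both share a CRS
print(mismatch_caught)   # True
```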

Choosing the target CRS

Pick based on what you will do with the data, not habit:

  • Web maps, sharing, lat/long storage: EPSG:4326 (WGS84) is the universal interchange CRS, but it is geographic — distances and areas in degrees are meaningless.
  • Measuring distance or area: use a projected CRS in metres. Choose a national grid (e.g. EPSG:27700 for Great Britain) or a UTM zone that covers your data.
  • Web tiles / slippy maps: EPSG:3857 (Web Mercator), but never use it for area calculations — it distorts badly away from the equator.

When in doubt for analysis, a local UTM zone is the safe default because its units are metres and distortion is small over the area it covers.
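For reference, the standard UTM zone for a location follows from its longitude, and the WGS84 UTM EPSG codes are 32601 to 32660 in the north and 32701 to 32760 in the south. The sketch below applies that arithmetic to a representative point such as the centre of your data. It ignores the special zone boundaries around Norway and Svalbard, and utm_epsg is an illustrative name. GeoPandas also offers estimate_utm_crs() on a GeoDataFrame, which is usually the more convenient route.

```python
import math

def utm_epsg(lon, lat):
    """Return the WGS84 UTM EPSG code (as a string) for a lon/lat in degrees."""
    if not (-180 <= lon <= 180 and -80 <= lat <= 84):
        raise ValueError("UTM is defined for longitudes -180..180 and latitudes -80..84")
    zone = min(int(math.floor((lon + 180) / 6)) + 1, 60)   # lon == 180 falls in zone 60
    base = 32600 if lat >= 0 else 32700                    # north or south hemisphere
    return f"EPSG:{base + zone}"

print(utm_epsg(-0.1276, 51.5))   # EPSG:32630 (London)
print(utm_epsg(151.2, -33.9))    # EPSG:32756 (Sydney)
```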

Edge cases or notes

set_crs() never moves coordinates

This is worth repeating because it is the single most common mistake. If your data is in metres and you call set_crs("EPSG:4326"), the coordinates stay as huge metre values but are now labelled as degrees. Nothing errors, the file saves fine, and it is completely broken. Relabelling fixes metadata; only to_crs() changes positions.

Sanity-check total_bounds after every change

After repairing and reprojecting, print gdf.total_bounds and confirm the numbers are in the range you expect for the target CRS — degrees roughly within -180..180 / -90..90, metric grids in the hundreds of thousands to millions. A bounds value that is wildly off is the fastest way to catch a wrong source-CRS guess before it propagates.
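The check can be automated by comparing the bounds with a box you expect the data to fall inside. The sketch below is plain Python, so it accepts gdf.total_bounds directly. The EXPECTED_4326 box is a rough, illustrative envelope around Great Britain and should be replaced with one that suits your data; bounds_ok is a hypothetical helper name.

```python
import math

# Rough lon/lat box around Great Britain: an illustrative assumption.
EXPECTED_4326 = (-9.0, 49.0, 3.0, 61.5)   # xmin, ymin, xmax, ymax

def bounds_ok(bounds, expected):
    """True when bounds (xmin, ymin, xmax, ymax) lie inside the expected box."""
    xmin, ymin, xmax, ymax = (float(v) for v in bounds)
    if any(math.isnan(v) for v in (xmin, ymin, xmax, ymax)):
        return False                       # an empty layer has NaN bounds
    ex_min, ey_min, ex_max, ey_max = expected
    return ex_min <= xmin <= xmax <= ex_max and ey_min <= ymin <= ymax <= ey_max

print(bounds_ok([-0.51, 51.28, 0.33, 51.69], EXPECTED_4326))        # True
print(bounds_ok([400000, 100000, 600000, 200000], EXPECTED_4326))   # False
```

A False result after to_crs() is a strong hint that the source CRS was guessed wrongly.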

Shapefile versus GeoPackage output

Prefer GeoPackage. Shapefiles split one dataset across four-plus files, truncate field names to 10 characters, cap at 2 GB, and store the CRS in a fragile .prj sidecar that goes missing — which is how you got crs is None in the first place. GeoPackage is a single SQLite file, stores the CRS internally, supports long field names and multiple layers, and has no practical size limit. Write with driver="GPKG".

Files genuinely in different correct CRSs

A folder can legitimately contain files in different CRSs that are all correctly labelled — for example data from several countries. That is fine: do not set_crs() anything (the labels are right), and let to_crs(TARGET_CRS) reconcile them. Your script should only relabel files where gdf.crs is None or where you have positively confirmed a wrong label.

Large folders and memory

gpd.read_file() loads the entire file into memory. For very large files or huge folders, process one file at a time inside the loop (as shown) so only one GeoDataFrame is resident at once, and let it fall out of scope before the next iteration. If a single file is too big to fit, read it in pieces with the rows= or bbox= options of gpd.read_file() (supported by the pyogrio engine), and repair and write each piece in turn rather than loading the whole thing at once.

Logging skipped files

Anything you cannot confidently fix (unknown source CRS, unreadable file, empty geometry) should be logged and skipped, not guessed at. Use Python's logging module or simply append a row with a status to your summary. A file silently left out is far worse than one explicitly marked "SKIPPED: unknown source CRS", because future-you needs to know it was never cleaned.
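Both approaches can be combined so that every outcome is logged and also lands in the summary. A minimal sketch with the standard logging module follows. The logger name crs_batch, the record helper, and the sample file names are illustrative.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("crs_batch")   # hypothetical logger name

summary = []

def record(name, status, detail=""):
    """Append one summary row and log it; non-ok rows are logged as warnings."""
    row = {"file": name, "status": status, "detail": detail}
    summary.append(row)
    level = logging.INFO if status == "ok" else logging.WARNING
    log.log(level, "%s: %s %s", name, status, detail)
    return row

record("roads.gpkg", "ok")
record("mystery.shp", "SKIPPED", "unknown source CRS")

skipped = [r["file"] for r in summary if r["status"] != "ok"]
print(skipped)   # ['mystery.shp']
```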

FAQ

How do I know what the source CRS of a file with no .crs should be?

Check the data supplier's documentation or metadata first — it is usually stated. Failing that, look at the magnitude of total_bounds: values within -180..180 / -90..90 suggest a geographic CRS like EPSG:4326, while large six- or seven-digit numbers suggest a projected national grid or UTM zone. If you still cannot tell, do not guess; mark the file as skipped.

What happens if I call to_crs() on a GeoDataFrame whose CRS is None?

GeoPandas raises a ValueError because it has no source CRS to project from. You must establish a correct CRS with set_crs() first, then reproject. This is exactly why the repair step always comes before the reprojection step.

Why does set_crs() raise an error on a file that already has a CRS?

By default set_crs() refuses to overwrite an existing CRS to prevent accidental mislabelling. If you genuinely need to correct a wrong label, pass allow_override=True. Only do this when you are certain the existing label is wrong and you know the correct one.

Should I use EPSG:4326 or a projected CRS as my common target?

Use EPSG:4326 for storage, sharing, and web mapping. Use a projected CRS in metres (a UTM zone or national grid) whenever you need to measure distance or area, because calculations in degrees are not meaningful. The right choice depends entirely on what you will do with the data afterwards.
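To see why degrees make a poor unit of distance, consider how much ground one degree of longitude covers at different latitudes. The sketch below uses a spherical Earth with a mean radius of 6,371,008.8 m, which is an approximation adequate for illustration only; metres_per_degree_lon is a made-up helper name.

```python
import math

EARTH_RADIUS_M = 6_371_008.8   # mean Earth radius; spherical approximation

def metres_per_degree_lon(lat_deg):
    """Approximate ground length of one degree of longitude at a latitude."""
    return math.radians(1) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))

for lat in (0, 51.5, 60):
    km = metres_per_degree_lon(lat) / 1000
    print(f"lat {lat:>4}: {km:.1f} km per degree of longitude")
```

One degree of longitude is roughly 111 km at the equator but only about half that at 60 degrees north, so a "distance" of 0.01 degrees means different things in different places.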

Can one GeoPackage hold all my cleaned files?

Yes. Write each file to the same .gpkg with a distinct layer= name, as in Example 5. This keeps a project to a single file instead of scattering dozens of shapefiles, and every layer stores its own CRS internally.

How do I make sure one corrupt file does not stop the whole batch?

Wrap the read-repair-write cycle for each file in its own try/except block, log the exception, and continue the loop. The bad file ends up in your summary marked as an error while every other file is still processed. Never let a single failure abort an overnight batch.

My reprojected data still does not line up with a basemap — what went wrong?

The most likely cause is a wrong source CRS: you set_crs() to the wrong value, so to_crs() faithfully transformed from the wrong starting frame. Check total_bounds before and after, and confirm the original CRS was actually correct. A mislabel produces output that is internally consistent but geographically wrong.

Is reading with rows=1 safe for checking the CRS?

Yes. The CRS is file-level metadata, so reading a single feature is enough to inspect it and is much faster than loading every geometry. Use it for cheap census passes over a large folder, then do the full read only when you are ready to repair and write.