How to Validate a GeoDataFrame Against a Schema Before Analysis

You loaded a shapefile, ran your analysis, and shipped a map. Weeks later someone notices the numbers are wrong. The cause? A column got renamed upstream, the file arrived in the wrong CRS, or a handful of rows had null geometries that quietly dropped out of a spatial join. None of it raised an error. The pipeline ran green the whole way.

The fix is to validate your data against an explicit schema before you analyze it. You decide up front what "good data" means — which columns must exist, what types they hold, which CRS they use, what the geometry should look like — and you check all of it in one pass. When something is wrong, you fail fast with a clear report instead of producing a confidently incorrect answer.

This article shows you how to build a small, reusable validate(gdf, schema) function in plain GeoPandas, collect every problem into a list, and fail loudly when the data does not match what your analysis assumes.

Problem statement

You are running analysis on a GeoDataFrame, but you have no guarantee the data matches what your code expects. The symptoms are subtle because nothing crashes:

  • A required column was missing or renamed. Your groupby("region") silently produces an empty or wrong result because the column is now region_name.
  • Wrong CRS. You compute areas or buffers in degrees instead of meters, so every distance and area is off by orders of magnitude. The code still runs; at most GeoPandas emits a UserWarning that is easy to miss in a log.
  • Unexpected geometry type. You expect polygons but get a mix of polygons and points, so .area returns zeros for the points, which inflates feature counts and drags down averages.
  • Duplicate ids. A join fans out and inflates counts because the "unique" key column has repeats.
  • Out-of-range values. A population column contains -1 placeholders or a percentage field has values above 100, dragging your statistics in a direction nobody notices.
  • Null geometries or null attributes. Rows with missing geometry get dropped by spatial operations, so your output silently loses records.

All of these are cheap to catch at the door and expensive to discover after the fact.
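
To make the duplicate-id symptom concrete, here is a minimal sketch in plain pandas. The two tables and their column names are invented for illustration; the point is only that the merge completes without complaint while the row count grows.

```python
import pandas as pd

# Hypothetical lookup table whose "unique" key has a repeat (id 2 appears twice).
regions = pd.DataFrame({"id": [1, 2, 2], "region": ["north", "south", "south"]})
stations = pd.DataFrame({"station": ["a", "b"], "id": [1, 2]})

# The merge raises nothing, but station "b" now appears twice in the result.
joined = stations.merge(regions, on="id", how="left")

print(len(stations), "stations before the join")  # 2
print(len(joined), "rows after the join")         # 3
```

Any count or sum computed on the joined table now double-counts station "b", which is the kind of error a uniqueness check catches before the join runs.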

Quick answer

Define your expectations as a plain dict, then run one function that checks required columns, dtypes, CRS, and geometry validity, collecting every failure into a list. If the list is non-empty, raise before any analysis runs.

Figure: Validation is a gate before analysis: pass and proceed, or fail with a clear report.

import geopandas as gpd
import pandas as pd

def validate(gdf, schema):
    issues = []

    # Required columns
    missing = [c for c in schema.get("required_columns", []) if c not in gdf.columns]
    for c in missing:
        issues.append(f"missing required column: {c}")

    # Dtypes
    for col, expected in schema.get("dtypes", {}).items():
        if col in gdf.columns and not pd.api.types.is_dtype_equal(gdf[col].dtype, expected):
            issues.append(f"column '{col}' has dtype {gdf[col].dtype}, expected {expected}")

    # CRS
    expected_crs = schema.get("crs")
    if expected_crs is not None:
        if gdf.crs is None:
            issues.append("CRS is missing (None)")
        elif gdf.crs.to_epsg() != expected_crs:
            issues.append(f"CRS is EPSG:{gdf.crs.to_epsg()}, expected EPSG:{expected_crs}")

    # Geometry validity and null geometries
    if gdf.geometry.isna().any():
        issues.append(f"{int(gdf.geometry.isna().sum())} null geometries")
    invalid = ~gdf.geometry.is_valid & gdf.geometry.notna()
    if invalid.any():
        issues.append(f"{int(invalid.sum())} invalid geometries")

    return issues

# Usage
schema = {"required_columns": ["id", "pop"], "dtypes": {"pop": "int64"}, "crs": 3857}
problems = validate(gdf, schema)
if problems:
    raise ValueError("GeoDataFrame failed validation:\n" + "\n".join(problems))

Step-by-step solution

The plan: describe your expectations in a schema dict, write one check per concern, and have each check append human-readable strings to a shared issues list. At the end you decide whether to raise.

Define the schema

Keep the schema as a plain Python dict. It is easy to read, easy to diff in version control, and needs no extra dependencies. Each key describes one category of expectation.

schema = {
    "required_columns": ["id", "name", "population", "geometry"],
    "dtypes": {
        "id": "int64",
        "name": "object",
        "population": "int64",
    },
    "crs": 3857,                      # expected EPSG code
    "geometry_type": "Polygon",       # or "MultiPolygon", "Point", etc.
    "ranges": {
        "population": (0, 50_000_000),  # (min, max), inclusive
    },
    "unique": ["id"],                  # columns that must be unique
    "non_null": ["id", "name", "population"],
}

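Every check below reads from one of these keys. If a key is absent, the corresponding check is simply skipped, so you can start small and add rules over time. The one exception is the geometry check: it always counts null and invalid geometries, whether or not the schema mentions geometry.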

Check required columns and dtypes

First confirm the columns exist, then check the dtype only for columns that are present (looking up the dtype of a missing column would raise a KeyError).

def check_columns(gdf, schema, issues):
    for col in schema.get("required_columns", []):
        if col not in gdf.columns:
            issues.append(f"missing required column: {col}")

    for col, expected in schema.get("dtypes", {}).items():
        if col not in gdf.columns:
            continue  # already reported as missing
        actual = gdf[col].dtype
        if not pd.api.types.is_dtype_equal(actual, expected):
            issues.append(
                f"column '{col}' has dtype {actual}, expected {expected}"
            )
    return issues

Using pd.api.types.is_dtype_equal is more robust than gdf[col].dtype == expected because it handles string aliases like "int64" and extension dtypes consistently.
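
An exact dtype check also catches a common upstream change: a missing value turns an integer column into float64. The sketch below uses three small made-up Series to show what is_dtype_equal reports for each when the schema expects "int64".

```python
import pandas as pd

clean = pd.Series([10, 20, 30])                       # inferred as int64
with_gap = pd.Series([10, None, 30])                  # missing value forces float64
nullable = pd.Series([10, None, 30], dtype="Int64")   # pandas nullable integer

results = {
    name: pd.api.types.is_dtype_equal(series.dtype, "int64")
    for name, series in {
        "clean": clean,
        "with_gap": with_gap,
        "nullable": nullable,
    }.items()
}
print(results)
# {'clean': True, 'with_gap': False, 'nullable': False}
```

Note that the nullable "Int64" extension dtype is not equal to "int64". If your cleaning step produces nullable integers, declare "Int64" in the schema instead.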

Check the CRS

A wrong or missing CRS is one of the most damaging silent errors, because every distance, area, and overlay depends on it. Compare by EPSG code rather than by string, so equivalent definitions still match.

def check_crs(gdf, schema, issues):
    expected = schema.get("crs")
    if expected is None:
        return issues  # no CRS requirement declared

    if gdf.crs is None:
        issues.append("CRS is missing (None)")
        return issues

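    actual = gdf.crs.to_epsg()  # None when the CRS has no matching EPSG code
    if actual is None:
        issues.append(f"CRS has no EPSG code, expected EPSG:{expected}")
    elif actual != expected:
        issues.append(f"CRS is EPSG:{actual}, expected EPSG:{expected}")
    return issues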

gdf.crs.to_epsg() returns the integer EPSG code (or None if the CRS cannot be reduced to one). Comparing integers avoids the trap where "EPSG:4326" and a full WKT string describe the same system but compare unequal as text.
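
The difference between text comparison and code comparison can be seen with pyproj alone, which GeoPandas uses for its .crs objects. This is a small sketch: it builds the same system twice, once from the short authority code and once from its full WKT text.

```python
from pyproj import CRS

short_form = "EPSG:4326"
crs_a = CRS.from_user_input(short_form)

long_form = crs_a.to_wkt()        # full WKT text describing the same system
crs_b = CRS.from_wkt(long_form)

same_text = short_form == long_form              # the two strings differ
same_code = crs_a.to_epsg() == crs_b.to_epsg()   # both reduce to 4326

print("text equal:", same_text)   # text equal: False
print("code equal:", same_code)   # code equal: True
```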

Check geometry type and validity

You usually want a specific geometry type, no null geometries, and all geometries valid (no self-intersections, no malformed rings). Check each separately so the report tells you exactly which problem occurred.

def check_geometry(gdf, schema, issues):
    geom = gdf.geometry

    null_count = int(geom.isna().sum())
    if null_count:
        issues.append(f"{null_count} null geometries")

    present = geom[geom.notna()]

    expected_type = schema.get("geometry_type")
    if expected_type is not None:
        bad_type = present.geom_type != expected_type
        if bad_type.any():
            found = sorted(present.geom_type.unique().tolist())
            issues.append(
                f"{int(bad_type.sum())} geometries are not {expected_type} "
                f"(found types: {found})"
            )

    invalid = ~present.is_valid
    if invalid.any():
        issues.append(f"{int(invalid.sum())} invalid geometries")
    return issues
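
To see what these three checks react to, the sketch below builds a tiny GeoSeries by hand: one valid square, one self-intersecting "bowtie" polygon, one point, and one null. The shapes are made up for illustration, and explain_validity from shapely is used only to show why a geometry fails; the validator above does not call it.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon
from shapely.validation import explain_validity

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])  # edges cross, so it is invalid
geom = gpd.GeoSeries([square, bowtie, Point(0, 0), None])

null_count = int(geom.isna().sum())
present = geom[geom.notna()]           # drop nulls before the other checks
type_counts = present.geom_type.value_counts().to_dict()
invalid = present[~present.is_valid]
reasons = [explain_validity(g) for g in invalid]

print(null_count)    # 1
print(type_counts)   # {'Polygon': 2, 'Point': 1}
print(len(invalid))  # 1
print(reasons)
```

Filtering to present first matters because is_valid reports False for a null geometry, which would otherwise be counted twice: once as null and once as invalid.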

Check value ranges and uniqueness

Numeric columns often have implicit bounds — a population is never negative, a percentage sits between 0 and 100. Identifier columns must be unique, and some columns must never be null.

def check_values(gdf, schema, issues):
    for col, (low, high) in schema.get("ranges", {}).items():
        if col not in gdf.columns:
            continue
        s = gdf[col]
        out = s[(s < low) | (s > high)]
        if len(out):
            issues.append(
                f"column '{col}' has {len(out)} values outside [{low}, {high}]"
            )

    for col in schema.get("unique", []):
        if col not in gdf.columns:
            continue
        dups = int(gdf[col].duplicated().sum())
        if dups:
            issues.append(f"column '{col}' has {dups} duplicate values")

    for col in schema.get("non_null", []):
        if col not in gdf.columns:
            continue
        nulls = int(gdf[col].isna().sum())
        if nulls:
            issues.append(f"column '{col}' has {nulls} null values")
    return issues
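
One interaction between these rules is worth knowing: comparisons against NaN are always False, so a missing value never counts as out of range. The sketch below applies the same range expression to a small made-up Series.

```python
import pandas as pd

s = pd.Series([100.0, -1.0, float("nan"), 250.0])
low, high = 0, 200

# Same expression as the range check above: NaN compares False on both sides.
out_of_range = s[(s < low) | (s > high)]
null_count = int(s.isna().sum())

print(len(out_of_range))  # 2  (-1.0 and 250.0; the NaN is not flagged)
print(null_count)         # 1
```

If a column must be both in range and complete, list it under "ranges" and under "non_null".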

Collect issues into a report and fail fast

Now wire the checks together. Each appends to the same list, so you get every problem in one run instead of fixing them one at a time.

def validate(gdf, schema):
    issues = []
    check_columns(gdf, schema, issues)
    check_crs(gdf, schema, issues)
    check_geometry(gdf, schema, issues)
    check_values(gdf, schema, issues)
    return issues


def validate_or_raise(gdf, schema):
    issues = validate(gdf, schema)
    if issues:
        report = "\n".join(f"  - {i}" for i in issues)
        raise ValueError(f"GeoDataFrame failed validation:\n{report}")
    return gdf


# At the top of your analysis script
gdf = validate_or_raise(gdf, schema)
# ... safe to analyze from here ...

When something is wrong you get output like this, and you know exactly what to fix:

ValueError: GeoDataFrame failed validation:
  - missing required column: name
  - CRS is EPSG:4326, expected EPSG:3857
  - 3 null geometries
  - column 'id' has 2 duplicate values

Code examples

Example 1: Minimal validation at load time

import geopandas as gpd

schema = {
    "required_columns": ["id", "geometry"],
    "crs": 4326,
}

gdf = gpd.read_file("regions.gpkg")
problems = validate(gdf, schema)
if problems:
    raise ValueError("\n".join(problems))

Example 2: Warn instead of raise

During exploration you may want to see issues without stopping. Emit them as warnings and continue.

import warnings

def validate_or_warn(gdf, schema):
    issues = validate(gdf, schema)
    for i in issues:
        warnings.warn(f"validation: {i}")
    return issues

validate_or_warn(gdf, schema)  # prints warnings, returns the list

Example 3: Validating a non-geometry concern only

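The schema keys are independent, so you can run a partial schema, which is useful in unit tests for a single rule. The null and invalid geometry checks run regardless of the schema, so the output below assumes the geometries themselves are clean.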

range_only = {"ranges": {"share": (0.0, 1.0)}}
print(validate(gdf, range_only))
# ["column 'share' has 5 values outside [0.0, 1.0]"]

Example 4: Checking against an expected bounding box

A cheap sanity check: do the geometries fall inside the rough bounding box you expect for your study area? This catches coordinates that are actually in a different CRS than the one .crs declares, which a metadata-only CRS check cannot see.

def check_bounds(gdf, expected_bbox, issues):
    minx, miny, maxx, maxy = gdf.total_bounds
    e_minx, e_miny, e_maxx, e_maxy = expected_bbox
    if minx < e_minx or miny < e_miny or maxx > e_maxx or maxy > e_maxy:
        issues.append(
            # float() keeps NumPy scalar reprs out of the message
            f"data bounds {tuple(round(float(b), 2) for b in gdf.total_bounds)} "
            f"fall outside expected {expected_bbox}"
        )
    return issues

issues = []
# Continental US in EPSG:4326, roughly
check_bounds(gdf, (-125.0, 24.0, -66.0, 50.0), issues)
print(issues)

Example 5: Reusing the validator across multiple datasets

Keep one schema per dataset type and loop over inputs in a batch job.

schemas = {
    "regions": {"required_columns": ["id", "name"], "crs": 3857,
                "geometry_type": "MultiPolygon", "unique": ["id"]},
    "stations": {"required_columns": ["id", "kind"], "crs": 3857,
                 "geometry_type": "Point", "unique": ["id"]},
}

for name, path in [("regions", "regions.gpkg"), ("stations", "stations.gpkg")]:
    gdf = gpd.read_file(path)
    problems = validate(gdf, schemas[name])
    if problems:
        raise ValueError(f"{name} failed:\n" + "\n".join(problems))
    print(f"{name}: OK ({len(gdf)} rows)")

Explanation

Why validate before analysis at all? Spatial operations are forgiving in the worst way: they rarely crash on bad input, they just produce wrong output. A spatial join drops rows with null geometry. An area calculation in a geographic CRS returns numbers in square degrees, which look plausible until you compare them. A duplicated id inflates a count. Validation turns these silent failures into loud, immediate errors at a point where the cause is obvious and cheap to fix.

Why collect all failures instead of failing on the first? A bare assert stops at the first problem, so you fix it, re-run, hit the next one, fix that, re-run again — a slow loop. By appending every issue to a shared list and raising once at the end, you see the full picture in a single run. If three columns are missing and the CRS is wrong, you learn all four facts immediately and can fix them in one edit. This is the single biggest practical advantage over scattering assert statements through your code.
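
The difference is easy to demonstrate without any geospatial library. In this sketch the metadata dict and both check functions are hypothetical stand-ins; the input has two problems, and only the list-based version reports both.

```python
# Hypothetical metadata with two problems: a renamed column and the wrong CRS.
record = {"columns": ["id", "region_name"], "crs": 4326}

def check_with_asserts(meta):
    assert "region" in meta["columns"], "missing column: region"
    assert meta["crs"] == 3857, "wrong CRS"  # never reached on this input

def check_with_list(meta):
    issues = []
    if "region" not in meta["columns"]:
        issues.append("missing column: region")
    if meta["crs"] != 3857:
        issues.append(f"CRS is EPSG:{meta['crs']}, expected EPSG:3857")
    return issues

try:
    check_with_asserts(record)
    assert_report = []
except AssertionError as exc:
    assert_report = [str(exc)]

list_report = check_with_list(record)

print(assert_report)  # ['missing column: region']
print(list_report)    # ['missing column: region', 'CRS is EPSG:4326, expected EPSG:3857']
```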

Where does validation fit in a pipeline? Think of three stages: load, clean, analyze. Validation belongs at the boundary between cleaning and analysis. You ingest raw data, apply your cleaning steps (fixing types, reprojecting, dropping bad rows), and then validate that the result actually meets your contract before any analysis touches it. If validation fails after cleaning, your cleaning step has a bug — which is exactly what you want to know. You can also validate at ingestion to characterize how dirty the raw data is, but the load-bearing check is the one guarding your analysis.

Edge cases or notes

Validation is not cleaning

Validation reports problems; it does not fix them. Resist the urge to make your validator silently drop null rows or reproject the data. If it both checks and mutates, you lose the guarantee that what you analyzed is what you validated. Run your cleaning steps first, then validate the cleaned result, then analyze. Keep the two responsibilities in separate functions.

CRS equality comparison

Comparing CRS objects by their string representation is fragile because the same system can be described by a short authority code or a long WKT string. Prefer gdf.crs.to_epsg() == 3857, which reduces both sides to an integer. If you must compare against another CRS object, gdf.crs == other_crs uses pyproj's semantic equality, which is more reliable than string matching. Be aware that to_epsg() can return None for custom CRSes that do not map to an EPSG code — handle that case explicitly.

Geometry type can legitimately be mixed

Real datasets often mix Polygon and MultiPolygon, or LineString and MultiLineString, and that is perfectly valid. If your analysis handles both, allow a set of types rather than a single string:

allowed = {"Polygon", "MultiPolygon"}
bad = ~present.geom_type.isin(allowed)

Decide deliberately whether your downstream code cares about the distinction before you reject mixed types.
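
One way to support both forms is to let the schema's "geometry_type" key hold either a single string or a list of strings. The helper below is a hypothetical addition, not part of the validator above, and a plain Series of type names stands in for present.geom_type.

```python
import pandas as pd

def allowed_types(schema):
    """Return the accepted geometry types as a set, or None if unconstrained."""
    value = schema.get("geometry_type")
    if value is None:
        return None
    return {value} if isinstance(value, str) else set(value)

# Stand-in for present.geom_type
geom_types = pd.Series(["Polygon", "MultiPolygon", "Point", "Polygon"])

allowed = allowed_types({"geometry_type": ["Polygon", "MultiPolygon"]})
bad = ~geom_types.isin(allowed)

print(int(bad.sum()))  # 1  (only the Point is rejected)
```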

Numeric tolerance for ranges

Floating-point columns rarely land on exact bounds. A "percentage" stored as a float might read 100.00000000001 after a computation. If strict bounds cause spurious failures, widen the range by a small epsilon (chosen larger than the rounding error you expect) or round before comparing:

eps = 1e-9
out = s[(s < low - eps) | (s > high + eps)]

For integer columns this is unnecessary; reserve tolerance for floats.
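
A short sketch of the effect, using made-up values: one entry overshoots the upper bound by a tiny amount of the kind floating-point arithmetic leaves behind, and one is out of range by a real margin.

```python
import pandas as pd

low, high = 0.0, 100.0
eps = 1e-9

s = pd.Series([99.5, 100.0 + 1e-12, 100.5])

strict = s[(s < low) | (s > high)]
tolerant = s[(s < low - eps) | (s > high + eps)]

print(len(strict))    # 2  (the tiny overshoot is flagged too)
print(len(tolerant))  # 1  (only 100.5 remains)
```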

Consider pandera or Great Expectations for big projects

The plain-dict approach is ideal for scripts and small pipelines. For larger projects, dedicated tools pay off. pandera lets you declare a typed schema and offers GeoPandas support (installed with the pandera[geopandas] extra, with GeoSeries and GeoDataFrame types under pandera.typing.geopandas) for geometry columns, plus rich, structured error reports. Great Expectations targets data-quality monitoring across whole pipelines with persisted "expectation suites" and HTML reports. Both add dependencies and a learning curve; reach for them when validation becomes a shared, long-lived asset rather than a few checks at the top of one script.

Keep the schema in version control

Because the schema is just a dict (or a small YAML/JSON file you load into one), commit it alongside your code. When the expected data shape changes, the diff shows exactly what changed and why. A schema that lives only in someone's head is no contract at all. Treat it as documentation that the machine enforces.
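
If you keep the schema in a JSON file, loading it takes a few lines of standard-library code. The sketch below reads from an inline string so it runs as-is; in practice you would read a committed file instead. The load_schema helper is a hypothetical name. JSON has no tuple type, so the (min, max) pairs arrive as lists and are converted back.

```python
import json

schema_text = """
{
  "required_columns": ["id", "population"],
  "crs": 3857,
  "ranges": {"population": [0, 50000000]},
  "unique": ["id"]
}
"""

def load_schema(text):
    """Parse a JSON schema and restore (min, max) tuples for range rules."""
    schema = json.loads(text)
    if "ranges" in schema:
        schema["ranges"] = {
            col: tuple(bounds) for col, bounds in schema["ranges"].items()
        }
    return schema

schema = load_schema(schema_text)
print(schema["ranges"])  # {'population': (0, 50000000)}
```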

FAQ

Should validation raise an exception or return a list?

Do both, in layers. Have validate() return a list of issue strings so it is easy to test, log, or inspect. Then wrap it in a thin validate_or_raise() that raises when the list is non-empty. This separation lets you choose strict failure in production and a softer warn-and-continue mode during exploration.

How do I check the CRS without reprojecting?

Read gdf.crs directly — it is metadata and accessing it does not transform anything. Compare with gdf.crs.to_epsg() == expected_epsg. Only gdf.to_crs(...) actually reprojects coordinates, and validation should never call it. If the CRS is wrong, report it and let the cleaning stage fix it.

What if my GeoDataFrame has no CRS at all?

In that case gdf.crs is None. If your schema declares an expected CRS, treat None as a failure and report it clearly, as the example checks do. A missing CRS is a real problem: GeoPandas cannot reproject or measure distances meaningfully without one, so you want to catch it before analysis.

How do I validate that geometries are valid?

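Use the gdf.geometry.is_valid boolean Series, which flags self-intersections and malformed geometries. Count the False entries and report them (null geometries also report False, so exclude them first). To see why a geometry is invalid, shapely.validation.explain_validity(geom) returns a short reason string, and recent GeoPandas releases also offer gdf.geometry.is_valid_reason(). To repair invalid geometries, call gdf.geometry.make_valid() in your cleaning stage, and keep repair out of the validator itself.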

Can I check column dtypes loosely instead of exactly?

Yes. For numeric columns you often care that a value is numeric, not that it is exactly int64. Use pd.api.types.is_numeric_dtype(gdf[col]) instead of an exact match. Define helper predicates in your schema when "any integer" or "any float" is good enough, and reserve exact dtype checks for cases where the storage type genuinely matters.
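
One possible shape for such predicates: let each entry under "dtypes" be either an exact dtype string or a function from pandas.api.types. The dtype_matches helper below is a hypothetical name, and the sketch assumes schema values are only ever strings or predicate functions.

```python
import pandas as pd
from pandas.api import types as pdt

def dtype_matches(series, expected):
    """Accept an exact dtype string or a predicate such as is_integer_dtype."""
    if callable(expected):
        return bool(expected(series.dtype))
    return pdt.is_dtype_equal(series.dtype, expected)

df = pd.DataFrame({
    "pop": pd.Series([1, 2], dtype="int32"),
    "share": [0.25, 0.75],
})

loose = {"pop": pdt.is_integer_dtype, "share": pdt.is_float_dtype}
exact = {"pop": "int64", "share": "float64"}

loose_result = {col: dtype_matches(df[col], rule) for col, rule in loose.items()}
exact_result = {col: dtype_matches(df[col], rule) for col, rule in exact.items()}

print(loose_result)  # {'pop': True, 'share': True}
print(exact_result)  # {'pop': False, 'share': True}
```

The int32 column passes the "any integer" rule but fails the exact "int64" rule, which is the distinction the loose check is meant to ignore.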

How is this different from just using assert statements?

A chain of assert statements stops at the first failure, so you discover problems one re-run at a time. The collect-into-a-list pattern reports every issue in a single pass and gives you readable messages rather than a bare AssertionError. It is the same idea as a linter: surface all the problems at once.

When should I switch from this to pandera?

Switch when validation stops being a few lines at the top of a script and becomes infrastructure — shared across modules, run in CI, or maintained by a team. pandera gives you declarative typed schemas, a geopandas integration for geometry and CRS rules, and structured error objects. For a single analysis script, the plain-dict approach is lighter and just as effective.

Does validation slow down my pipeline?

The checks here are vectorized pandas and GeoPandas operations, so they are fast relative to the analysis that follows. The most expensive check is usually is_valid on large geometry columns, which is still cheap compared to a spatial join or overlay. The time you spend validating is far less than the time you would spend debugging a silently wrong result.