How to Handle Missing and Null Values in Spatial Datasets in Python

Spatial datasets are messy. Field surveys skip questions, sensors drop readings, joins fail to match, and legacy formats encode "no data" as magic numbers like -9999. Before you can map, aggregate, or model anything, you need to find those gaps and decide what to do about them.

This guide is about missing attribute values — the columns of data attached to your features (population, elevation, land-use code, sensor reading). It is not about null or empty geometries, which are a separate problem with separate tools; see the linked geometry guide for that. Here you will learn to detect missing attributes, recognise sentinel values that disguise themselves as real numbers, and apply the three core strategies: drop, fill, or flag.

Problem statement

You load a GeoDataFrame and the attribute columns are not clean. Typical symptoms:

  • Blank cells that import as None or empty strings ("").
  • NaN floats scattered through numeric columns where data was never recorded.
  • "NA", "N/A", "null", "-" strings that pandas read literally as text instead of treating as missing.
  • Sentinel numbers like -9999, 9999, -32768, or 0 standing in for "no measurement" — these poison means, sums, and colour scales because pandas treats them as real values.
  • Missing categories in classification columns (an empty land-use code, a blank country name).
  • Whole columns that are mostly or entirely empty after a join that didn't match.

All of these are attribute problems. The geometry of each row can be perfectly valid while its attributes are full of holes.

Quick answer

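Load the data, convert text tokens such as "NA" and numeric sentinels such as -9999 to NaN, count the gaps, then choose to drop, fill, or flag. With pd.read_csv you can declare the tokens at read time through na_values; with gpd.read_file you convert them after loading. Never let -9999 survive into a calculation.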

Figure: three strategies for a missing attribute value (drop, fill, or flag), chosen by how much is missing and how critical the value is.

import geopandas as gpd
import pandas as pd
import numpy as np

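# 1. Load the file, then turn text tokens that mean "missing" into NaN
#    (gpd.read_file has no na_values option; for CSV see Example 1)
gdf = gpd.read_file("sites.geojson")
attr_cols = gdf.columns.drop(gdf.geometry.name)
gdf[attr_cols] = gdf[attr_cols].replace(["NA", "N/A", "null", "-", ""], np.nan)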

# 2. Replace numeric sentinels with NaN
gdf[["elevation", "rainfall"]] = gdf[["elevation", "rainfall"]].replace(
    [-9999, 9999, -32768], np.nan
)

# 3. Count missing values per column
print(gdf.isna().sum())

# 4a. Drop rows missing a required attribute...
clean = gdf.dropna(subset=["elevation"])

# 4b. ...OR fill with a statistic and keep a flag
gdf["rainfall_was_missing"] = gdf["rainfall"].isna()
gdf["rainfall"] = gdf["rainfall"].fillna(gdf["rainfall"].median())

Step-by-step solution

Detect missing values

Start by measuring the problem. isna() (isnull() is an alias) returns a boolean mask; chain .sum() to count the gaps per column, or .mean() to get the fraction missing, which you multiply by 100 for a percentage.

import geopandas as gpd

gdf = gpd.read_file("sites.geojson")

# Count missing per column
print(gdf.isna().sum())

# As a percentage, sorted worst-first
missing_pct = (gdf.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)

# Inspect the actual rows with a gap in 'elevation'
print(gdf[gdf["elevation"].isna()].head())

Note that isna() does not flag sentinel values or "NA" strings — they look like valid data until you convert them. That is the next step.
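
A small sketch of that blind spot, using a made-up three-row table (the column names and values are illustrative, not taken from any real dataset):

```python
import pandas as pd

# Hypothetical attribute table: each column has one genuine gap
# and one disguised gap ("NA" as text, -9999 as a number)
df = pd.DataFrame({
    "land_use": ["forest", "NA", None],
    "elevation": [120.0, -9999.0, float("nan")],
})

# isna() sees only the genuine None/NaN entries
missing_counts = df.isna().sum()
print(missing_counts)

# The disguised values are still there, counted as ordinary data
disguised_text = (df["land_use"] == "NA").sum()
disguised_number = (df["elevation"] == -9999).sum()
print(disguised_text, disguised_number)
```

Each column reports a single missing value even though two of its three rows carry no real information.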

Convert sentinel values to NaN

Sentinels are real-looking values that mean "no data". Find them by inspecting suspicious extremes (min, max, value_counts), then replace them with np.nan.

import numpy as np

# Suspicious? Look at the extremes and frequent values
print(gdf["elevation"].describe())
print(gdf["elevation"].value_counts().head())

# Replace the known sentinels with NaN
sentinels = [-9999, 9999, -32768, 32767]
gdf["elevation"] = gdf["elevation"].replace(sentinels, np.nan)

# Replace across several numeric columns at once
num_cols = ["elevation", "rainfall", "temperature"]
gdf[num_cols] = gdf[num_cols].replace(sentinels, np.nan)

Only replace a value in columns where it cannot be a real measurement. A rainfall of 0 is a legitimate reading, so treating 0 as a sentinel there would erase real data; in another column 0 may well mean "not recorded". Know your dataset before you convert.
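
One way to keep that discipline is to declare the sentinels per column instead of applying a single list everywhere. The sketch below assumes a hypothetical sentinels_by_col mapping that you would fill in from the dataset's data dictionary; the column names and values are placeholders.

```python
import pandas as pd

# Hypothetical data: 0 is a real rainfall reading but a sentinel for depth
df = pd.DataFrame({
    "rainfall": [0.0, 12.5, -9999.0],
    "depth": [0.0, 3.2, 4.1],
})

# Assumed mapping; in practice, take it from the dataset's documentation
sentinels_by_col = {
    "rainfall": [-9999],
    "depth": [0, -9999],
}

for col, values in sentinels_by_col.items():
    # mask() sets an entry to NaN wherever the condition is True
    df[col] = df[col].mask(df[col].isin(values))

print(df)
```

The zero rainfall survives as a measurement, while the zero depth becomes NaN.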

Decide drop vs fill vs flag

There is no universal rule. Pick based on how much is missing and whether the column is required:

  • Drop when a few rows lack a value you cannot reasonably estimate and that you need for the analysis.
  • Fill when the column is required downstream and a defensible estimate (constant, statistic, or spatial neighbour) is acceptable.
  • Flag when you want to keep every row and let downstream code or readers know which values were imputed.

In practice you often combine them: flag first, then fill.

n = len(gdf)
for col in ["elevation", "rainfall", "land_use"]:
    missing = gdf[col].isna().sum()
    print(f"{col}: {missing} missing ({missing / n:.1%})")
# Low % and required -> drop; required downstream -> fill + flag

Drop rows or columns

Use dropna with subset to remove only rows missing a specific required column. Dropping blindly across all columns can wipe out most of your data.

# Drop rows missing the required 'elevation' attribute
clean = gdf.dropna(subset=["elevation"])

# Drop rows missing ANY of several required columns
clean = gdf.dropna(subset=["elevation", "rainfall"])

# Drop a column that is almost entirely empty (e.g. >90% missing)
threshold = 0.9
mostly_empty = gdf.columns[gdf.isna().mean() > threshold]
gdf = gdf.drop(columns=mostly_empty)

print(f"Dropped {len(gdf) - len(clean)} rows / columns: {list(mostly_empty)}")

Avoid a bare gdf.dropna() — with no subset it drops every row that has a gap in any column, which on real spatial data often deletes almost everything.
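
To see the difference on a toy frame (values invented for illustration), compare a bare dropna() with the subset form and with thresh, which keeps only rows that have at least the given number of non-missing values:

```python
import numpy as np
import pandas as pd

# Toy attribute table: every row has a gap somewhere
df = pd.DataFrame({
    "elevation": [100.0, np.nan, 250.0, 300.0],
    "rainfall": [np.nan, 40.0, 55.0, np.nan],
    "notes": [None, None, None, "resurveyed"],
})

bare = df.dropna()                            # drops on a gap in any column
by_subset = df.dropna(subset=["elevation"])   # checks only the required column
by_thresh = df.dropna(thresh=2)               # needs >= 2 non-missing values

print(len(bare), len(by_subset), len(by_thresh))
```

Here the bare call leaves no rows at all, the subset form keeps three, and thresh=2 keeps two.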

Fill with a constant or statistic

fillna replaces gaps. Use a constant for categoricals, a statistic for numerics. Prefer the median over the mean for skewed environmental data because it resists outliers.

# Constant for a category
gdf["land_use"] = gdf["land_use"].fillna("unknown")

# Median for a skewed numeric column
gdf["rainfall"] = gdf["rainfall"].fillna(gdf["rainfall"].median())

# Mean for a roughly symmetric column
gdf["temperature"] = gdf["temperature"].fillna(gdf["temperature"].mean())

# Group-wise median (fill within each region)
gdf["rainfall"] = gdf.groupby("region")["rainfall"].transform(
    lambda s: s.fillna(s.median())
)

# Forward-fill for ordered (e.g. time-sorted) data
gdf = gdf.sort_values("date")
gdf["reading"] = gdf["reading"].ffill()

Group-wise fills are usually better than a single global statistic — a regional median is closer to the truth than a national one.
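
A toy comparison, with made-up regions and values, shows how far apart the two fills can land:

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall (mm) in a wet and a dry region, one gap in each
df = pd.DataFrame({
    "region": ["coast", "coast", "coast", "desert", "desert", "desert"],
    "rainfall": [900.0, 1100.0, np.nan, 40.0, 60.0, np.nan],
})

# Single global median: the same value goes into both gaps
global_fill = df["rainfall"].fillna(df["rainfall"].median())

# Median computed within each region
group_fill = df.groupby("region")["rainfall"].transform(
    lambda s: s.fillna(s.median())
)

print(pd.DataFrame({
    "region": df["region"],
    "global": global_fill,
    "grouped": group_fill,
}))
```

The global median (480 mm) is plausible for neither region, whereas the grouped fill gives 1000 mm on the coast and 50 mm in the desert.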

Spatial-aware fill from nearby features

For spatial data, nearby features are often the best estimate (Tobler's first law: near things are more alike). Use sjoin_nearest to borrow a value from the closest feature that has one.

import geopandas as gpd

# Split into rows that have the value and rows that don't
has_val = gdf[gdf["elevation"].notna()]
no_val = gdf[gdf["elevation"].isna()].copy()

# For each missing row, find the nearest row that has elevation
filled = gpd.sjoin_nearest(
    no_val.drop(columns=["elevation"]),
    has_val[["elevation", "geometry"]],
    how="left",
    distance_col="fill_dist",
)

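# sjoin_nearest returns one row per tie, so keep a single match per missing row
filled = filled[~filled.index.duplicated(keep="first")]

# Write the borrowed value back into the original frame
gdf.loc[filled.index, "elevation"] = filled["elevation"].values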

Make sure both frames share a projected CRS (metres) before measuring distance, so "nearest" means nearest on the ground, not in degrees. Reproject with gdf.to_crs(...) first if needed.
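
Why degrees mislead can be shown without any GIS library. The sketch below uses a standard haversine formula on a spherical Earth (radius assumed to be 6371 km) and three hypothetical points at 60°N; it illustrates the problem and is not a substitute for reprojecting.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance between two lon/lat points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Hypothetical site with two candidate neighbours, as (lon, lat)
site = (10.0, 60.0)
east = (11.0, 60.0)    # 1.0 degree east
north = (10.0, 60.6)   # 0.6 degrees north

# Distance measured naively in degrees
deg_east = math.dist(site, east)
deg_north = math.dist(site, north)

# Distance on the ground
km_east = haversine_km(*site, *east)
km_north = haversine_km(*site, *north)

print(f"degrees: east={deg_east:.2f}, north={deg_north:.2f}")
print(f"km:      east={km_east:.1f}, north={km_north:.1f}")
```

Measured in degrees the northern point looks closer, but on the ground the eastern point is nearer (roughly 56 km against 67 km), because a degree of longitude shrinks with latitude.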

Keep an audit flag

Before any fill, record which values were originally missing. A simple boolean column lets you exclude or weight imputed values later and keeps your cleaning honest.

# Flag BEFORE filling
gdf["elevation_was_missing"] = gdf["elevation"].isna()

# Now fill
gdf["elevation"] = gdf["elevation"].fillna(gdf["elevation"].median())

# Later: count imputed rows, or analyse only real measurements
print(gdf["elevation_was_missing"].sum(), "elevation values were imputed")
real_only = gdf[~gdf["elevation_was_missing"]]

Code examples

Example 1: Read a CSV treating tokens as missing

import pandas as pd
import geopandas as gpd

df = pd.read_csv(
    "sites.csv",
    na_values=["NA", "N/A", "null", "-", "", "-9999", "9999"],
)
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["lon"], df["lat"]),
    crs="EPSG:4326",
)
print(gdf.isna().sum())

Example 2: A reusable missing-value report

import pandas as pd

def missing_report(gdf):
    """Summarise missing values per column, worst first."""
    rep = pd.DataFrame({
        "missing": gdf.isna().sum(),
        "pct": (gdf.isna().mean() * 100).round(1),
        "dtype": gdf.dtypes.astype(str),
    })
    return rep.sort_values("pct", ascending=False)

print(missing_report(gdf))

Example 3: Detect likely sentinels automatically

import numpy as np

def find_sentinels(series, candidates=(-9999, 9999, -32768, 32767)):
    found = {}
    for c in candidates:
        count = (series == c).sum()
        if count:
            found[c] = int(count)
    return found

for col in gdf.select_dtypes("number").columns:
    hits = find_sentinels(gdf[col])
    if hits:
        print(col, hits)
        # Review the hits first: a candidate can be a real value in some columns
        gdf[col] = gdf[col].replace(list(hits), np.nan)

Example 4: Flag, then group-wise spatial fill

gdf["rainfall_was_missing"] = gdf["rainfall"].isna()

# Fill within each region first; fall back to global median
gdf["rainfall"] = gdf.groupby("region")["rainfall"].transform(
    lambda s: s.fillna(s.median())
)
gdf["rainfall"] = gdf["rainfall"].fillna(gdf["rainfall"].median())

Example 5: Nearest-neighbour fill with a distance cap

proj = gdf.to_crs(gdf.estimate_utm_crs())
has_val = proj[proj["temperature"].notna()]
no_val = proj[proj["temperature"].isna()].copy()

filled = gpd.sjoin_nearest(
    no_val.drop(columns=["temperature"]),
    has_val[["temperature", "geometry"]],
    how="left",
    distance_col="dist_m",
)

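# Only accept fills within 5 km; leave the rest as NaN
accept = filled[filled["dist_m"] <= 5000]
# Ties produce duplicate rows; keep one match per missing row
accept = accept[~accept.index.duplicated(keep="first")]
proj.loc[accept.index, "temperature"] = accept["temperature"].values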
gdf = proj.to_crs(gdf.crs)

Explanation

NaN vs None vs sentinel. NaN is a special floating-point value (numpy.nan) that pandas uses to mark missing numeric data. Arithmetic on it propagates (NaN + 5 is NaN), and it never compares equal to anything, itself included, which is why you detect it with isna() rather than ==. None is Python's null object, common in object/string columns; pandas also treats it as missing. A sentinel is an ordinary value (-9999) that a dataset conventionally uses to mean "missing", but pandas has no way to know that, so it treats -9999 as a real number. The whole point of the sentinel-conversion step is to turn those into genuine NaN so the missing-data machinery can see them.
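
The three behaviours can be checked in a few lines. The series below is made up for illustration:

```python
import numpy as np
import pandas as pd

# NaN is not equal to anything, including itself, so test with isna()
nan_equals_itself = (np.nan == np.nan)
nan_detected = pd.isna(np.nan)
none_detected = pd.isna(None)
print(nan_equals_itself, nan_detected, none_detected)

# Hypothetical elevations with one sentinel
raw = pd.Series([100.0, 200.0, -9999.0])

# The sentinel is counted as data and drags the mean far below zero
mean_with_sentinel = raw.mean()

# Once converted, the NaN is skipped by default
mean_without = raw.replace(-9999, np.nan).mean()
print(mean_with_sentinel, mean_without)
```

With the sentinel left in place the mean is -3233.0; after conversion it is 150.0.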

Why dropping can bias analysis. Missing data is rarely random in the real world. Sensors fail more often in harsh, remote, or extreme locations; surveys are skipped in hard-to-reach areas. If you remove those rows with dropna, you systematically remove a particular kind of place, and the remaining sample no longer represents the study area. A national average computed after dropping all the remote stations will be skewed toward the easy-to-measure ones.
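
A deliberately exaggerated sketch of that effect, with invented stations whose three highest sites failed to report:

```python
import pandas as pd

# Hypothetical stations: temperature falls with elevation,
# and every station above 1000 m is missing its reading
stations = pd.DataFrame({
    "elevation_m": [50, 200, 400, 1500, 2200, 3000],
    "true_temp_c": [18.0, 17.0, 16.0, 9.0, 5.0, 0.0],
})
stations["reported_temp_c"] = stations["true_temp_c"].where(
    stations["elevation_m"] < 1000
)

true_mean = stations["true_temp_c"].mean()
kept = stations.dropna(subset=["reported_temp_c"])
mean_after_drop = kept["reported_temp_c"].mean()

print(f"true mean: {true_mean:.1f}, after dropna: {mean_after_drop:.1f}")
```

The surviving lowland stations average 17.0 °C, about six degrees warmer than the true mean across all six sites.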

Why a "was_missing" flag matters. Once you fill a value, it becomes indistinguishable from a real measurement — that information loss is permanent unless you record it. A boolean *_was_missing column lets you audit how much of a result rests on imputed data, exclude imputed rows from sensitive statistics, or even use "missingness" itself as a predictor. It costs one column and buys you transparency and reproducibility.

Edge cases or notes

Missing attributes vs null geometries are different problems

A row can have a perfectly valid point and a missing population, or a valid population and a None geometry. The techniques here (isna, fillna, dropna on attribute columns) handle the attribute case. Null or empty geometries need geometry-specific checks like gdf.geometry.isna() and gdf.geometry.is_empty — see the linked geometry guide. Don't conflate the two.

Sentinel values vary by dataset

-9999 is common but far from universal. You'll meet 9999, -32768/32767 (signed 16-bit limits), -1, 255, 1e20, and 0. Always read the dataset's metadata or data dictionary for its declared "NoData" value rather than guessing, and confirm with describe() and value_counts().

Mean-fill distorts distributions

Filling every gap with the column mean (or median) collapses variance: a spike appears at the fill value and the spread shrinks. This can mislead later statistics, regressions, and choropleth class breaks. For a handful of values it's fine; for a large fraction, prefer a model-based or spatial estimate and always keep the audit flag.
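
The shrinkage is easy to measure on a toy series (values invented, with half of them missing to make the effect obvious):

```python
import numpy as np
import pandas as pd

# Hypothetical readings with half the values missing
s = pd.Series([2.0, 4.0, 6.0, 8.0, np.nan, np.nan, np.nan, np.nan])

filled = s.fillna(s.mean())

std_before = s.std()       # computed on the 4 observed values
std_after = filled.std()   # 4 observed values plus 4 copies of the mean
share_at_fill = (filled == s.mean()).mean()

print(f"std before: {std_before:.2f}, after: {std_after:.2f}")
print(f"share of rows sitting exactly at the fill value: {share_at_fill:.0%}")
```

The mean is unchanged, but the standard deviation drops and half the rows now sit on a single value.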

Spatial autocorrelation makes nearest-neighbour fill reasonable

Because nearby locations tend to have similar values, borrowing from the closest feature is often more accurate than a global statistic for spatial attributes. This works best where the variable is genuinely smooth in space (elevation, temperature) and worse where it jumps sharply across boundaries (administrative codes, land ownership). Cap the fill distance so you don't borrow from features that are too far away to be relevant.

Document every fill

Record what you did: which columns were filled, with what method, how many values, and the date. Store it in a sidecar note, a commit message, or the *_was_missing flags themselves. Future-you (and reviewers) need to know which numbers were measured and which were invented.
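
One lightweight way to do this is to build the record while you fill. The sketch below is an assumption about what such a log might contain; the field names and the idea of serialising it as JSON are illustrative choices, not an established convention.

```python
import json
from datetime import date

import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"rainfall": [10.0, np.nan, 30.0, np.nan]})

# Capture the facts before the fill erases them
n_missing = int(df["rainfall"].isna().sum())
fill_value = float(df["rainfall"].median())
df["rainfall"] = df["rainfall"].fillna(fill_value)

fill_log = [{
    "column": "rainfall",
    "method": "global median",
    "fill_value": fill_value,
    "n_filled": n_missing,
    "date": date.today().isoformat(),
}]

# Serialise the log; in practice, write this text to a file beside the data
log_text = json.dumps(fill_log, indent=2)
print(log_text)
```

Appending one entry per filled column gives reviewers a complete list of which numbers were estimated and how.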

dtype changes after fillna

Watch your dtypes. An integer column that acquires a NaN is promoted to float64, and filling a numeric column with a string ("unknown") turns the whole column into object. After filling, check gdf.dtypes and cast back if needed, for example gdf["count"] = gdf["count"].astype("Int64"); pandas' nullable integer type tolerates NA.
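
A minimal illustration of the promotion and the cast back, using a made-up count column:

```python
import numpy as np
import pandas as pd

# A count column with one gap: the NaN forces float64
with_gap = pd.Series([3, np.nan, 12])
print(with_gap.dtype)

# Cast to the nullable integer type: whole numbers again, gap kept as <NA>
nullable = with_gap.astype("Int64")
print(nullable.dtype)
print(nullable)
```

The cast assumes every non-missing value is a whole number; a column holding values such as 3.5 cannot be converted to Int64 this way.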

FAQ

What's the difference between a missing attribute and a null geometry?

A missing attribute is a gap in a data column — no recorded population, elevation, or land-use code for a feature whose location is known. A null geometry is a row whose shape is None or empty, regardless of its attributes. They're detected and fixed differently: attribute gaps with isna/fillna/dropna on the column, geometry nulls with geometry.isna()/is_empty.

How do I find missing values that are hidden as -9999?

Plain isna() won't catch them because -9999 is a valid number to pandas. Inspect each numeric column with describe() and value_counts() to spot implausible extremes or oddly frequent values, then replace([-9999, 9999], np.nan) to convert the real sentinels into genuine NaN.

Should I drop or fill missing data?

Drop when only a few rows lack a value you truly need and can't estimate; fill when the column is required downstream and a defensible estimate exists. If in doubt, flag-and-fill rather than drop, because dropping non-random gaps can bias your results. The right choice depends on how much is missing and how the data will be used.

What does na_values do when reading a file?

na_values tells the reader which strings (or numbers) to treat as missing at load time, so "NA", "null", "-", or "-9999" become NaN immediately instead of being read as literal text or numbers. It's available on pd.read_csv; for vector formats read with GeoPandas, convert sentinels with replace after loading.
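
A quick before-and-after with an in-memory CSV (the content is invented; "-" and -9999 are assumed to be this file's "no data" markers):

```python
import io

import pandas as pd

# Hypothetical CSV content
csv_text = """site,land_use,elevation
A,forest,120
B,-,-9999
C,urban,35
"""

plain = pd.read_csv(io.StringIO(csv_text))
declared = pd.read_csv(io.StringIO(csv_text), na_values=["-", "-9999"])

gaps_plain = int(plain.isna().sum().sum())
gaps_declared = int(declared.isna().sum().sum())

print(gaps_plain, "gaps without na_values")
print(gaps_declared, "gaps with na_values")
```

Without na_values the file appears complete; with it, both disguised cells in row B load as NaN.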

Is mean or median better for filling?

Median for skewed data (most environmental variables) and when outliers are present, because it isn't dragged toward extremes; mean for roughly symmetric distributions. Better still, use a group-wise or spatial estimate so the fill reflects local conditions. Whichever you pick, set a *_was_missing flag so the imputation is auditable.

Why did my integer column turn into floats after handling missing values?

NaN is a float, so any integer column containing it is automatically promoted to float64. To keep integer semantics with missing values, use pandas' nullable integer type: gdf["count"] = gdf["count"].astype("Int64") (capital I), which tolerates pd.NA.

How do I fill a value from the nearest feature that has one?

Split the frame into rows that have the value and rows that don't, reproject to a metric CRS, then gpd.sjoin_nearest the missing rows against the populated ones to borrow the closest value. Use distance_col to capture how far each match was, and cap that distance so you don't import values from features that are too far away.

Can I just use fillna(0) for everything?

Rarely a good idea. 0 is a real, meaningful value for most measurements (zero rainfall, zero elevation at sea level), so filling with it silently injects false data and skews sums, means, and maps. Use 0 only when zero is genuinely the correct default, and otherwise prefer a statistic, a spatial estimate, or an explicit "unknown" category — always with an audit flag.