How to Clean and Normalise Attribute Columns in a GeoDataFrame

A GeoDataFrame is just a pandas DataFrame with one (or more) geometry columns bolted on. That means every messy-data problem you have ever hit in pandas — stray whitespace, numbers stored as text, five spellings of the same category — shows up in spatial data too. The good news: you can fix all of it with ordinary pandas operations, as long as you are careful to leave the geometry column alone.

This guide walks through cleaning the attribute (non-geometry) columns of a GeoDataFrame: tidying headers, normalising strings, fixing dtypes, mapping categories to a canonical set, and unifying missing-value tokens. The geometry stays untouched the whole way.

Problem statement

You loaded a GeoDataFrame and the attribute table is a mess. Typical symptoms:

  • Messy headers — column names like " Site Name ", Pop.2020, COUNTY, with leading/trailing spaces, mixed case, punctuation, or outright duplicates.
  • Numbers stored as text — a population column with dtype object because one cell contains "1,234" or "N/A", so you cannot do arithmetic or comparisons.
  • Mixed casing and whitespace in values"Forest", "forest ", " FOREST" are treated as three different categories.
  • Inconsistent categories"NY", "N.Y.", "New York", "new york" all meaning the same thing.
  • Mixed null tokens — missing values represented as "NA", "N/A", "", "null", "-" instead of a real NaN, so isna() misses them.

None of these are geometry problems. The geometry column is fine — it is the attribute columns that need work.

Quick answer

Identify the string columns with select_dtypes(include="object"), explicitly exclude the active geometry column via gdf.geometry.name, then strip whitespace, unify null tokens, and coerce dtypes column by column. Work on a copy and assign back with bracket assignment so you never trip chained-assignment warnings. Geometry is never touched because it is never in the list of columns you transform.

Messy attribute columns before cleaning and tidy, correctly typed columns after.
Cleaning standardises names, types, and missing-value tokens — the geometry is never touched.
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("sites.gpkg").copy()

# 1. Tidy column names
gdf.columns = (
    gdf.columns.str.strip().str.lower().str.replace(r"[^\w]+", "_", regex=True)
)

# 2. Identify string columns, excluding the geometry column
geom_col = gdf.geometry.name
str_cols = [c for c in gdf.select_dtypes(include="object").columns if c != geom_col]

# 3. Strip whitespace and unify missing-value tokens
null_tokens = ["", "na", "n/a", "null", "none", "-"]
for col in str_cols:
    s = gdf[col].astype("string").str.strip()
    gdf[col] = s.mask(s.str.lower().isin(null_tokens))

# 4. Coerce obvious numeric / date columns (errors -> NaN, never crash)
gdf["population"] = pd.to_numeric(
    gdf["population"].str.replace(",", "", regex=False), errors="coerce"
)
gdf["survey_date"] = pd.to_datetime(gdf["survey_date"], errors="coerce")

# 5. Confirm geometry survived
assert gdf.geometry.name == geom_col
print(gdf.crs, gdf.geometry.geom_type.value_counts().to_dict())

Step-by-step solution

Inspect columns and dtypes

Before changing anything, look at what you have. dtypes tells you which columns pandas thinks are text (object/string) versus numbers or dates, and head() shows you the actual values. Note which column is the active geometry — you will protect it throughout.

import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("sites.gpkg")

print(gdf.shape)
print(gdf.dtypes)
print("geometry column:", gdf.geometry.name)
print("CRS:", gdf.crs)
gdf.head()

Anything that should be a number or a date but shows up as object is a candidate for coercion. Anything object that should stay text is a candidate for whitespace/casing cleanup.

Tidy column names

Clean headers make every later step easier. Strip whitespace, lowercase, replace runs of non-word characters with a single underscore, and then deal with duplicates. Renaming columns never touches the geometry data — it only relabels the geometry column, and GeoPandas tracks it by name, so set the geometry again if you rename it.

# Strip, lowercase, and slugify punctuation/spaces to underscores
gdf.columns = (
    gdf.columns
    .str.strip()
    .str.lower()
    .str.replace(r"[^\w]+", "_", regex=True)
    .str.strip("_")
)

# De-duplicate names: site, site -> site, site_1
def dedupe(names):
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}_{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out

gdf.columns = dedupe(gdf.columns)
print(list(gdf.columns))

If the geometry column had a non-standard name that got rewritten, re-establish it with gdf = gdf.set_geometry("geometry") (using whatever the new name is).

Strip whitespace and normalise casing

Target only the string columns, and explicitly drop the geometry column from that list. Use the nullable string dtype so empty results become <NA> rather than the literal float nan. Here we strip whitespace, collapse internal runs of spaces, and apply a consistent casing.

geom_col = gdf.geometry.name
str_cols = [c for c in gdf.select_dtypes(include="object").columns if c != geom_col]

for col in str_cols:
    s = gdf[col].astype("string").str.strip()
    s = s.str.replace(r"\s+", " ", regex=True)   # collapse internal whitespace
    gdf[col] = s

# Apply title case to a name-like column, leave codes alone
gdf["site_name"] = gdf["site_name"].str.title()

Pick casing per column. Title case suits place names; upper case suits short codes like "ny" -> "NY"; lower case is safest as an intermediate before category mapping.

Fix numeric and date dtypes

The key tool is pd.to_numeric(..., errors="coerce"), which turns anything it cannot parse into NaN instead of raising. For dates, pd.to_datetime(..., errors="coerce") does the same. Strip thousands separators and stray symbols first.

# Numbers stored as text, possibly with thousands separators
gdf["population"] = pd.to_numeric(
    gdf["population"].astype("string").str.replace(",", "", regex=False),
    errors="coerce",
)

# Percent strings like "12.5%" -> 12.5
gdf["coverage_pct"] = pd.to_numeric(
    gdf["coverage_pct"].astype("string").str.rstrip("%"),
    errors="coerce",
)

# Dates: let pandas infer, coerce failures to NaT
gdf["survey_date"] = pd.to_datetime(gdf["survey_date"], errors="coerce")

print(gdf[["population", "coverage_pct", "survey_date"]].dtypes)

After coercion, check how many values failed: gdf["population"].isna().sum(). A surprise spike tells you a parsing assumption was wrong.

Normalise categorical values

Map every known variant to one canonical value. Build a dictionary (your source of truth), normalise the raw values to a comparable form first (strip + lowercase), then map. Use .map() with care — unmatched values become NaN, which can be useful for spotting unknowns.

landcover_map = {
    "forest": "Forest",
    "woodland": "Forest",
    "grass": "Grassland",
    "grassland": "Grassland",
    "water": "Water",
    "lake": "Water",
}

key = gdf["landcover"].str.strip().str.lower()
gdf["landcover"] = key.map(landcover_map)

# Find values that did not match the source of truth
unmatched = gdf.loc[gdf["landcover"].isna() & key.notna(), :]
print("unmapped originals:", key[unmatched.index].unique())

If you would rather keep unknown values verbatim instead of turning them into NaN, use key.map(landcover_map).fillna(gdf["landcover"]).

Unify missing-value tokens

Different exports use different strings for "missing". Replace them all with a real NaN so isna(), dropna(), and fillna() work. Do this across all string columns at once, comparing case-insensitively.

null_tokens = {"", "na", "n/a", "null", "none", "nan", "-", "--", "unknown"}

for col in str_cols:
    s = gdf[col].astype("string").str.strip()
    gdf[col] = s.mask(s.str.lower().isin(null_tokens))

print(gdf[str_cols].isna().sum())

mask(cond) replaces values where cond is True with NA. Because we use the nullable string dtype, the result is a proper <NA> that the rest of pandas understands.

Verify geometry is intact

You never put the geometry column in any transform list, so it should be untouched — but verify. Confirm the active geometry name, the CRS, the geometry types, and that nothing became invalid or empty.

assert gdf.geometry.name == geom_col, "geometry column changed!"
print("CRS:", gdf.crs)
print("types:", gdf.geometry.geom_type.value_counts().to_dict())
print("empty:", gdf.geometry.is_empty.sum())
print("invalid:", (~gdf.geometry.is_valid).sum())
print("missing:", gdf.geometry.isna().sum())

If CRS, counts, and validity match what you saw at the start, your cleaning touched only the attributes.

Code examples

Example 1 — One-liner header slugify

gdf.columns = (
    gdf.columns.str.strip().str.lower().str.replace(r"[^\w]+", "_", regex=True).str.strip("_")
)

Example 2 — Strip and casing on every string column except geometry

geom_col = gdf.geometry.name
str_cols = [c for c in gdf.select_dtypes(include="object").columns if c != geom_col]

gdf[str_cols] = gdf[str_cols].apply(
    lambda s: s.astype("string").str.strip().str.replace(r"\s+", " ", regex=True)
)

Example 3 — Robust numeric coercion with cleanup

def to_number(s):
    s = s.astype("string").str.strip()
    s = s.str.replace(",", "", regex=False).str.replace("$", "", regex=False)
    return pd.to_numeric(s, errors="coerce")

for col in ["population", "area_ha", "elevation_m"]:
    gdf[col] = to_number(gdf[col])

Example 4 — Category normalisation with an unmatched report

status_map = {"active": "Active", "open": "Active", "closed": "Closed", "inactive": "Closed"}

key = gdf["status"].str.strip().str.lower()
mapped = key.map(status_map)
print("unmapped:", sorted(key[mapped.isna() & key.notna()].unique()))
gdf["status"] = mapped

Example 5 — Unify null tokens across all attribute columns

null_tokens = {"", "na", "n/a", "null", "none", "nan", "-", "--", "unknown"}

for col in str_cols:
    s = gdf[col].astype("string").str.strip()
    gdf[col] = s.mask(s.str.lower().isin(null_tokens))

Explanation

Why columns end up as object dtype. pandas picks one dtype per column. When a column contains any value it cannot store as a number — an empty cell read as "", a stray "N/A", a thousands separator like "1,234" — it falls back to the catch-all object dtype, which holds arbitrary Python objects (usually strings). So a single bad cell among a million good ones is enough to make the whole column text. This is why arithmetic silently fails or string-concatenates instead of adding.

Why one bad cell downgrades a column. Dtypes are column-wide, not per-cell. There is no "mostly integer" dtype; the column is either fully numeric or it is object. pd.to_numeric(errors="coerce") is the cure: it parses what it can and turns the rest into NaN, after which the column becomes a proper float64 (or nullable Int64/Float64) and the bad cells are explicit missing values you can inspect and decide about.

Keeping the geometry column out of transforms. The active geometry column is a geometry dtype, not object, so select_dtypes(include="object") would not normally pick it up — but older files, mixed sources, or a column that lost its geometry dtype can surprise you. The safe habit is to always read gdf.geometry.name and exclude it explicitly from any list of columns you loop over. You never call .str or to_numeric on the geometry, so it cannot be corrupted. When you rename columns, GeoPandas keeps tracking geometry by name, so re-set it with set_geometry if the name changed.

Edge cases or notes

Never transform the geometry column

Do not apply string or numeric operations to the geometry column, and do not include it when you bulk-process columns. Always derive your working list as [c for c in ... if c != gdf.geometry.name]. If you accidentally cast geometry to text, you will need to re-parse it with gpd.GeoSeries.from_wkt or reload the file.

select_dtypes("object") targets strings — but verify

select_dtypes(include="object") is the right way to grab text columns, and it excludes the geometry dtype by design. Still, exclude gdf.geometry.name explicitly: a column read from some formats can arrive as generic object, and you do not want to assume. If you have moved to nullable strings, use include=["object", "string"].

Thousands separators and locale

"1,234.5" is one number in US/UK convention but "1.234,5" in much of Europe. pd.to_numeric does not localise. Strip the grouping separator yourself (str.replace(",", "")) for US-style data, or for European-style swap separators first: s.str.replace(".", "", regex=False).str.replace(",", ".", regex=False). Decide based on the source, not a guess.

Category mapping needs a source of truth

A mapping is only as good as the dictionary behind it. Keep the canonical set in one place (a dict, a small lookup table, or a config file), normalise raw values before mapping (strip + lowercase), and always report values that did not match so you can extend the map rather than silently dropping them to NaN.

Encoding on read

Garbled characters like é instead of é are an encoding problem, not a cleaning problem — fix them at read time. Pass the right encoding: gpd.read_file("data.shp", encoding="utf-8") (or latin-1 for legacy shapefiles). A shapefile's .cpg sidecar declares its encoding; if it is wrong or missing, override it. Cleaning after the fact is far harder than reading correctly.

Chained assignment — use .loc and .copy()

Start from gdf = gpd.read_file(...).copy() if the GeoDataFrame is a slice of another, and assign with gdf[col] = ... or gdf.loc[mask, col] = ... rather than gdf[col][mask] = .... Chained indexing can write to a temporary copy and silently do nothing (and raises SettingWithCopyWarning). On pandas 3.x with Copy-on-Write, chained assignment never works, so the .loc habit is mandatory.

Preserve nullable dtypes deliberately

After coercion you may want Int64 (nullable integer) instead of float64 so counts stay integers despite NaNs: gdf["population"] = gdf["population"].astype("Int64"). The nullable string dtype likewise keeps <NA> distinct from the literal string "nan", which matters when you unify null tokens.

FAQ

Does cleaning attribute columns ever change the geometry?

No, as long as you never include the geometry column in your transforms. String and numeric operations apply only to the columns you name or select, and select_dtypes(include="object") does not return the geometry dtype. Always exclude gdf.geometry.name explicitly as a safety net, then verify CRS and geometry types afterward.

Why is my number column an object dtype?

Because at least one value cannot be stored as a number — commonly an empty string, a token like "N/A", or a thousands separator such as "1,234". pandas falls back to object for the whole column. Run pd.to_numeric(col, errors="coerce") after stripping separators to convert it and turn the unparseable cells into NaN.

How do I convert messy text into a real NaN?

Build a set of the missing-value tokens your data uses ("NA", "N/A", "", "null", "-", etc.), then use s.mask(s.str.lower().isin(tokens)) on the nullable string dtype. That replaces each matching value with <NA>, after which isna(), dropna(), and fillna() all work as expected.

What is the difference between object and string dtype?

object is pandas' generic container for any Python object and is what you usually get from a file read. The newer string dtype is dedicated to text, gives you a proper <NA> for missing values (instead of float nan mixed into a text column), and behaves more predictably with .str methods. Convert with .astype("string") when you want clean missing-value handling.

How do I rename columns without breaking the geometry?

Renaming is safe: GeoPandas tracks the active geometry by its column name, so reassigning gdf.columns only relabels it. If your cleanup renames the geometry column itself, re-establish it with gdf = gdf.set_geometry("new_name") so gdf.geometry keeps pointing at the right column.

Should I use Int64 or float64 for integer columns with missing values?

Use Int64 (capital I, the nullable integer dtype) when you need values to stay integers despite missing entries — for example ID codes or counts. Plain float64 will display 1.0 and admits floating-point fuzz. Coerce first with pd.to_numeric(..., errors="coerce"), then .astype("Int64").

How do I fix garbled characters like é?

That is an encoding mismatch, so fix it when reading rather than cleaning afterward. Pass the correct encoding to gpd.read_file(path, encoding="utf-8"), or latin-1 for older shapefiles. Check the shapefile's .cpg sidecar — if it declares the wrong encoding, override it explicitly on read.

Why does my assignment seem to do nothing?

You are probably hitting chained assignment, like gdf[col][mask] = value, which writes to a temporary copy. Use gdf.loc[mask, col] = value instead, and start from a .copy() if your GeoDataFrame is a slice of another. On pandas 3.x with Copy-on-Write, the .loc form is the only one that works.