How to Find and Remove Duplicate Geometries in GeoPandas

Duplicate geometries are one of the most common data-quality problems in spatial work, and they are easy to miss because they often look fine on a map. This guide shows you how to find and remove exact duplicates, attribute duplicates, and near-duplicates in a GeoDataFrame using current GeoPandas (1.x) and Shapely (2.x).

Problem statement

Duplicate features creep into spatial datasets through ordinary, everyday operations. You usually do not introduce them on purpose, but they accumulate fast.

Common ways duplicates appear:

  • Merged datasets. Concatenating tiles, regional extracts, or vendor deliveries that overlap at the edges produces the same feature twice.
  • Repeated imports. Re-running an ETL job or loading the same file into a database table more than once writes the same rows again.
  • Joins and spatial joins. A one-to-many sjoin or attribute join can multiply rows so a single geometry appears many times.
  • Digitising mistakes. An editor clicks "save" twice or copies a feature in place.
  • Format round-trips. Exporting and re-importing can reorder vertices or shift precision, creating geometries that are equal in space but not byte-identical.

The problems they cause:

  • Inflated counts. "How many parcels?" returns the wrong number.
  • Wrong aggregations. Sums of area, length, or population double-count.
  • sjoin blowups. Duplicates on both sides of a spatial join produce a combinatorial explosion of output rows.
  • Misleading statistics and maps. Stacked identical polygons distort choropleths and density analysis.
  • Wasted storage and slow processing. Every downstream operation pays for rows that should not exist.

Quick answer

To remove exact duplicate geometries, compare the WKB (well-known binary) representation rather than the geometry objects themselves, then call drop_duplicates. WKB gives you a hashable, byte-stable key so identical geometries collapse reliably. Verify the new count and export.

Figure: two identical geometries stored as separate rows collapse into a single feature once they are compared by WKB key and deduplicated.

import geopandas as gpd

# Load
gdf = gpd.read_file("data/roads.gpkg")

# Drop exact duplicate geometries using a stable WKB key
gdf["_wkb"] = gdf.geometry.to_wkb()
deduped = gdf.drop_duplicates(subset="_wkb").drop(columns="_wkb")

# Verify
print(f"Before: {len(gdf)}  After: {len(deduped)}  Removed: {len(gdf) - len(deduped)}")

# Export
deduped.to_file("data/roads_deduped.gpkg", driver="GPKG")

This catches byte-identical geometries. For geometries that differ only by vertex order or precision, keep reading — you will need normalize() or set_precision() first.

Step-by-step solution

Load the data

Read your layer with gpd.read_file. Check the row count and CRS up front, because CRS matters for any comparison you do later.

import geopandas as gpd

gdf = gpd.read_file("data/parcels.gpkg")

print(len(gdf), "rows")
print(gdf.crs)
print(gdf.geom_type.value_counts())

Find exact duplicate geometries

You might expect gdf.geometry.duplicated() to just work, and for rows stored identically it does. But comparing Shapely geometry objects directly is a structural test, not a spatial one: two geometries can represent the same shape yet compare as different because their vertices are stored in a different order, start at a different point, or differ in floating-point precision. That is not the behaviour you want for deduplication.

The fix is to convert each geometry to a canonical, comparable form first. Two tools do this, and they combine:

  • gdf.geometry.to_wkb() produces a bytes value per row. Identical geometries (same type, same vertices, same order) yield identical bytes, which duplicated and drop_duplicates compare reliably.
  • gdf.geometry.normalize() reorders rings, vertices, and components into a canonical layout. Run it before to_wkb() so that geometries equal in space but stored differently also collapse.

# Naive object comparison: matches only geometries that are stored identically
dupe_mask_naive = gdf.geometry.duplicated(keep=False)

# Reliable: compare the WKB bytes
wkb = gdf.geometry.to_wkb()
dupe_mask = wkb.duplicated(keep=False)

print("Naive duplicates flagged:", dupe_mask_naive.sum())
print("WKB duplicates flagged:  ", dupe_mask.sum())

# Inspect the duplicated rows, sorted by key so matching rows sit next to each other
duplicates = gdf[dupe_mask].assign(_wkb=wkb[dupe_mask]).sort_values("_wkb")
print(duplicates.drop(columns="_wkb").head())

Remove exact duplicates

Once you have a stable key, dropping is a one-liner. By default drop_duplicates keeps the first occurrence; pass keep="last" to keep the last, or keep=False to drop every duplicated row entirely.

gdf["_wkb"] = gdf.geometry.to_wkb()

deduped = gdf.drop_duplicates(subset="_wkb", keep="first").drop(columns="_wkb")

print(f"Removed {len(gdf) - len(deduped)} exact duplicate geometries")

If you also want geometries that differ only by vertex order to count as duplicates, normalize first:

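# GeoSeries.normalize() returns a GeoSeries, so to_wkb() can be chained directly.
# (shapely.normalize(...) returns a plain NumPy array, which has no to_wkb method.)
gdf["_wkb"] = gdf.geometry.normalize().to_wkb()
deduped = gdf.drop_duplicates(subset="_wkb", keep="first").drop(columns="_wkb")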
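
To see why normalizing matters, here is a minimal sketch on two hand-built squares (illustrative data, not taken from any real layer) that trace the same ring from different starting vertices. It assumes GeoPandas 1.x, where GeoSeries.normalize() is available.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# The same unit square, but the ring starts at a different vertex
a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
b = Polygon([(1, 0), (1, 1), (0, 1), (0, 0)])
gs = gpd.GeoSeries([a, b])

# Raw WKB keeps the stored vertex order, so the two rows look different
raw_dupes = gs.to_wkb().duplicated().sum()

# Normalizing first rewrites both rings into the same canonical layout
norm_dupes = gs.normalize().to_wkb().duplicated().sum()

print("Redundant rows by raw WKB:       ", raw_dupes)
print("Redundant rows by normalized WKB:", norm_dupes)
```

The raw key finds nothing to remove, while the normalized key flags the second square as redundant.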

Handle duplicates that share geometry AND attributes vs geometry only

There are two different questions, and you must decide which you mean:

  • Geometry only. Same shape, possibly different attribute values. This is what the steps above handle.
  • Geometry AND attributes. The entire row is a duplicate, every column included. These are almost always safe to drop with no information loss.

For full-row duplicates, include the attribute columns in subset alongside the WKB key.

gdf["_wkb"] = gdf.geometry.to_wkb()

# Full-row dedupe: geometry plus the attribute columns that define a duplicate
key_cols = ["road_id", "name", "_wkb"]
full_row_deduped = gdf.drop_duplicates(subset=key_cols).drop(columns="_wkb")

# Geometry-only dedupe (ignores attributes)
geom_only_deduped = gdf.drop_duplicates(subset="_wkb").drop(columns="_wkb")

print("Full-row dedupe rows:", len(full_row_deduped))
print("Geometry-only dedupe rows:", len(geom_only_deduped))

If the geometry repeats but attributes disagree, dropping silently picks one row and discards the rest. That is a decision you should make consciously — see the aggregate step below.
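
Before dropping anything, it can help to list the rows where a repeated geometry carries conflicting attribute values. The sketch below is one possible way to do that; the tiny frame and the "name" column are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point

# Illustrative data: two rows share a point but disagree on the name
gdf = gpd.GeoDataFrame(
    {"name": ["Main St", "Main Street", "Oak Ave"]},
    geometry=[Point(0, 0), Point(0, 0), Point(1, 1)],
)

key = gdf.geometry.to_wkb()

# Count distinct names per geometry key and broadcast the count back to each row
distinct_names = gdf.groupby(key)["name"].transform("nunique")

# Rows whose geometry appears with more than one name need a decision
conflicts = gdf[distinct_names > 1]
print(conflicts)
```

Rows that show up in conflicts are the ones where a plain drop_duplicates would discard a differing value.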

Detect near-duplicates

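Near-duplicates are geometries that are almost identical, off by tiny coordinate differences from rounding, reprojection, or separate digitising. WKB comparison will not catch these because the bytes differ. Snap the coordinates to a grid with shapely.set_precision(geom, grid_size): every coordinate is rounded to the nearest grid node, so coordinates that round to the same node become identical. Then compare the snapped WKB. Two coordinates closer than grid_size can still straddle a rounding boundary and land on different nodes, so treat grid_size as a resolution rather than a guaranteed matching distance.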

Choose grid_size in your CRS units. For a projected CRS in metres, 1.0 snaps to the nearest metre; 0.001 to the millimetre.

import shapely

# Snap coordinates to a 1-metre grid (CRS units), then normalize and compare
snapped = shapely.set_precision(gdf.geometry.values, grid_size=1.0)
snapped = shapely.normalize(snapped)

# snapped is a NumPy array of geometries, so serialise it with shapely.to_wkb
gdf["_snap_wkb"] = shapely.to_wkb(snapped)
near_dupe_mask = gdf["_snap_wkb"].duplicated(keep=False)

print(f"{near_dupe_mask.sum()} rows are near-duplicates at 1 m precision")
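
A count alone does not tell you which rows matched each other. One way to review the matches, sketched here on made-up points with a hypothetical dupe_group column, is to give every snapped key a group number and sort by it.

```python
import geopandas as gpd
import pandas as pd
import shapely
from shapely.geometry import Point

# Illustrative points: the first two differ by a few hundredths of a unit
gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(10.02, 5.01), Point(9.98, 4.99), Point(50.0, 50.0)],
)

# Snap to a 1-unit grid, normalize, and serialise to a WKB key
snapped = shapely.normalize(
    shapely.set_precision(gdf.geometry.to_numpy(), grid_size=1.0)
)
key = pd.Series(shapely.to_wkb(snapped), index=gdf.index)

# Rows that share a snapped key receive the same group number
gdf["dupe_group"] = key.groupby(key).ngroup()
print(gdf.sort_values("dupe_group"))
```

Sorting by the group number places candidate near-duplicates side by side, which makes it easier to judge whether the chosen grid_size is too coarse.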

Decide which row to keep / aggregate attributes

When duplicated geometries carry different attributes, you have three choices: keep the first, keep based on a rule (e.g. the row with the most complete data), or aggregate the attributes into one row.

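To aggregate, group by the geometry key and apply functions per column. dissolve is the GeoPandas-native way to merge by a key while keeping a valid geometry, with aggfunc controlling each attribute. When aggfunc is a dict, only the columns it lists appear in the result, so name every attribute you want to keep.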

gdf["_wkb"] = gdf.geometry.to_wkb()

# Keep the row with the largest population, drop the rest
gdf_sorted = gdf.sort_values("population", ascending=False)
keep_best = gdf_sorted.drop_duplicates(subset="_wkb", keep="first").drop(columns="_wkb")

# Or aggregate attributes across identical geometries
aggregated = gdf.dissolve(
    by="_wkb",
    aggfunc={"population": "sum", "name": "first"},
    as_index=False,
).drop(columns="_wkb")

print(aggregated.head())

Verify the result

Never trust a dedupe blindly. Confirm that no duplicate keys remain, that the row count dropped by the expected amount, and that totals still make sense.

check = deduped.geometry.to_wkb()
assert not check.duplicated().any(), "Duplicates still present!"

print("Rows before:", len(gdf))
print("Rows after: ", len(deduped))
print("Unique geometries:", check.nunique())

Code examples

Example 1: Count duplicate geometries

import geopandas as gpd

gdf = gpd.read_file("data/roads.gpkg")

wkb = gdf.geometry.to_wkb()
n_dupe_rows = wkb.duplicated(keep=False).sum()   # all rows involved
n_extra = wkb.duplicated(keep="first").sum()     # rows that would be removed

print(f"{n_dupe_rows} rows participate in duplicates")
print(f"{n_extra} rows are redundant and can be removed")

Example 2: Drop exact duplicates by WKB

import geopandas as gpd

gdf = gpd.read_file("data/roads.gpkg")

gdf["_wkb"] = gdf.geometry.to_wkb()
clean = gdf.drop_duplicates(subset="_wkb", keep="first").drop(columns="_wkb")

clean.to_file("data/roads_clean.gpkg", driver="GPKG")

Example 3: Dedupe on a subset of columns plus geometry

import geopandas as gpd

gdf = gpd.read_file("data/sites.gpkg")
gdf["_wkb"] = gdf.geometry.to_wkb()

# A row is a duplicate only if site_id, category AND geometry all match
clean = gdf.drop_duplicates(
    subset=["site_id", "category", "_wkb"],
    keep="first",
).drop(columns="_wkb")

print(len(gdf), "->", len(clean))

Example 4: Near-duplicates with shapely.set_precision

import geopandas as gpd
import shapely

gdf = gpd.read_file("data/buildings.gpkg")

# Snap to a 0.5 m grid and normalize so tiny differences and vertex
# order both collapse, then dedupe on the snapped key
snapped = shapely.normalize(shapely.set_precision(gdf.geometry.values, 0.5))
# snapped is a NumPy array of geometries, so serialise it with shapely.to_wkb
gdf["_key"] = shapely.to_wkb(snapped)

clean = gdf.drop_duplicates(subset="_key", keep="first").drop(columns="_key")
print(f"Removed {len(gdf) - len(clean)} near-duplicates")

Example 5: Merge attributes with dissolve / groupby

import geopandas as gpd

gdf = gpd.read_file("data/parcels.gpkg")
gdf["_wkb"] = gdf.geometry.to_wkb()

merged = gdf.dissolve(
    by="_wkb",
    aggfunc={
        "value": "sum",
        "owner": "first",
        "tags": lambda s: ", ".join(sorted(set(s.dropna()))),
    },
    as_index=False,
).drop(columns="_wkb")

print(merged.head())

Explanation

Why object equality is unreliable. When you compare Shapely geometries with ==, you are not asking "do these occupy the same space?" In Shapely 2.x, == is a structural comparison: it checks the geometry type and the coordinates in their stored order. A polygon whose ring starts at a different vertex, winds the other way, or has its coordinates rounded a fraction differently compares as unequal even though it is the same shape. drop_duplicates relies on hashing and equality, so it inherits this fragility. That is why you convert to a canonical byte form first.
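
The difference between structural and spatial equality is easy to check on a single pair of geometries. This is a small sketch assuming Shapely 2.x; the two squares are illustrative.

```python
from shapely.geometry import Polygon

a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
b = Polygon([(1, 0), (1, 1), (0, 1), (0, 0)])  # same square, different start vertex

structural = a == b                              # coordinates in stored order
topological = a.equals(b)                        # do the shapes cover the same space?
after_normalize = a.normalize() == b.normalize() # canonical layout, then structural

print("a == b:                        ", bool(structural))
print("a.equals(b):                   ", bool(topological))
print("a.normalize() == b.normalize():", bool(after_normalize))
```

Only the first comparison fails, which is the behaviour that makes raw keys miss reordered duplicates.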

What to_wkb() does. Well-known binary is a standard, deterministic serialisation of a geometry's type and coordinates. Two geometries with the same type, same vertices, in the same order produce byte-identical WKB. Bytes are hashable and compare cheaply, which is exactly what pandas needs to detect duplicates fast and correctly.

What normalize() does. shapely.normalize (or the .normalize() method on a geometry) rewrites a geometry into a canonical form: rings get a consistent orientation, vertices a consistent starting point, and multi-part geometries a consistent component order. Run it before to_wkb() and geometries that were equal in space but stored differently now produce identical bytes. This makes deduplication order-independent.

What set_precision() does. shapely.set_precision(geom, grid_size) snaps coordinates to a regular grid of spacing grid_size by rounding each one to the nearest grid node. Coordinates that round to the same node collapse to the same value, which removes the tiny floating-point and digitising differences that separate near-duplicates. After snapping, most previously "almost equal" geometries become genuinely equal and dedupe cleanly; the exception is a pair that straddles a rounding boundary, which can still end up on neighbouring nodes.
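
The rounding behaviour is worth seeing once on bare coordinates. In this sketch, snap_x is a hypothetical helper written for the illustration, not part of Shapely.

```python
import shapely
from shapely.geometry import Point


def snap_x(x, grid_size=1.0):
    """Snap the point (x, 0) to the grid and return its new x coordinate."""
    return shapely.set_precision(Point(x, 0.0), grid_size=grid_size).x


close_pair = (snap_x(0.49), snap_x(0.51))  # inputs only 0.02 apart
far_pair = (snap_x(0.51), snap_x(1.49))    # inputs 0.98 apart

print("0.49 and 0.51 snap to:", close_pair)
print("0.51 and 1.49 snap to:", far_pair)
```

The close pair lands on different grid nodes while the far pair lands on the same one, so grid_size sets a resolution, not a strict distance tolerance.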

Exact vs near duplicates. An exact duplicate is the same shape down to the vertex (modulo order, after normalize). A near duplicate is the same shape only within some tolerance. Exact duplicates are detected with normalize plus to_wkb; near duplicates require set_precision first to define how close counts as "the same." Pick the tolerance deliberately — too coarse a grid will merge features that are genuinely distinct.

Edge cases or notes

Floating-point precision

Coordinates stored as 64-bit floats rarely round-trip exactly through reprojection, format conversion, or arithmetic. Two geometries that "should" be identical can differ in the 12th decimal place, defeating WKB comparison. If exact dedupe finds fewer duplicates than you expect, switch to set_precision with a small grid (for example 1e-6 in degrees or 0.001 in metres) before comparing.
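
A two-point sketch shows the effect. The coordinates are illustrative and the 1e-6 grid is just the example value mentioned above, not a recommendation for your data.

```python
import pandas as pd
import shapely
from shapely.geometry import Point

# 0.1 + 0.2 evaluates to 0.30000000000000004, not 0.3
points = [Point(0.1 + 0.2, 0.0), Point(0.3, 0.0)]

# Exact WKB keys differ in the last bits of the x coordinate
exact_key = pd.Series(shapely.to_wkb(points))

# Snapping to a 1e-6 grid rounds both x values to the same grid node
snapped_key = pd.Series(shapely.to_wkb(shapely.set_precision(points, grid_size=1e-6)))

exact_dupes = exact_key.duplicated().sum()
snapped_dupes = snapped_key.duplicated().sum()

print("Redundant rows by exact WKB:  ", exact_dupes)
print("Redundant rows by snapped WKB:", snapped_dupes)
```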

MultiPolygon vs Polygon equivalence

A single Polygon and a MultiPolygon containing only that one polygon describe the same area but have different geometry types and different WKB. They will not be treated as duplicates by a WKB comparison. If you need them to match, normalise the type first, for example by converting single-part multi-geometries to their single part, before building the key. Exploding multiparts achieves this too, but it also splits genuine multi-part features into several rows.
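
One possible way to normalise the type without changing the row count is sketched below. unwrap_single_part is a hypothetical helper, and the data is illustrative.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
gs = gpd.GeoSeries([square, MultiPolygon([square])])


def unwrap_single_part(geom):
    """Return the only member of a one-part multi-geometry, else the input unchanged."""
    if geom is not None and geom.geom_type.startswith("Multi") and len(geom.geoms) == 1:
        return geom.geoms[0]
    return geom


# Different types give different WKB, so nothing matches yet
raw_dupes = gs.to_wkb().duplicated().sum()

# After unwrapping, both rows are plain Polygons with the same coordinates
unwrapped = gpd.GeoSeries(
    [unwrap_single_part(g) for g in gs], index=gs.index, crs=gs.crs
)
unwrapped_dupes = unwrapped.to_wkb().duplicated().sum()

print("Redundant rows before unwrapping:", raw_dupes)
print("Redundant rows after unwrapping: ", unwrapped_dupes)
```

The list comprehension runs in Python rather than in GEOS, so on very large layers it will be slower than the vectorised steps around it.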

CRS must match before comparing

Coordinate values only mean the same thing within the same CRS. If you concatenate layers in different CRSs, identical real-world features will have different coordinates and never compare equal, and any metre-based grid_size is meaningless across mixed CRSs. Reproject everything to one CRS with gdf.to_crs(...) before deduplicating, and confirm with gdf.crs.
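
A small sketch of that alignment step follows. The two single-row layers and the choice of the first layer's CRS as the target are assumptions made for the example.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Illustrative layers: the same real-world point stored in two different CRSs
a = gpd.GeoDataFrame({"src": ["a"]}, geometry=[Point(13.4, 52.5)], crs="EPSG:4326")
b = gpd.GeoDataFrame(
    {"src": ["b"]}, geometry=[Point(13.4, 52.5)], crs="EPSG:4326"
).to_crs("EPSG:3857")

# Reproject every layer that does not already use the target CRS
target_crs = a.crs
layers = [
    layer if layer.crs == target_crs else layer.to_crs(target_crs)
    for layer in (a, b)
]

combined = pd.concat(layers, ignore_index=True)
print(combined.crs)
print(combined)
```

Reprojected coordinates seldom match bit for bit, so after aligning the CRS you will usually want the set_precision step as well before comparing keys.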

Duplicates with different attributes need a keep/aggregate decision

When the same geometry appears with conflicting attribute values, dropping duplicates silently discards data. Decide explicitly whether to keep the first, keep by a rule (sort then keep="first"), or aggregate with dissolve/groupby. There is no universally correct answer — it depends on what the attributes mean.

Empty and invalid geometries

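set_precision can change invalid geometries in surprising ways: in its default mode it always returns valid output and drops parts that collapse at the chosen grid size. Empty geometries of the same type all serialise to the same WKB and collapse into one row, and rows with a missing geometry have no key to compare at all. Filter out or repair these rows first so they do not distort your duplicate detection.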
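
A minimal sketch of that pre-filter, on made-up rows, could look like this. Whether you drop or repair the flagged rows is a separate decision.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Illustrative rows covering each problem case
gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3, 4, 5]},
    geometry=[
        Point(0, 0),
        Polygon(),                                  # empty
        None,                                       # missing
        Polygon([(0, 0), (1, 1), (1, 0), (0, 1)]),  # self-intersecting bow-tie
        Point(1, 1),
    ],
)

missing = gdf.geometry.isna()
empty = gdf.geometry.is_empty
# Report a row as invalid only if it actually has a non-empty geometry
invalid = ~gdf.geometry.is_valid & ~missing & ~empty

usable = gdf[~(missing | empty | invalid)]
print(f"missing={missing.sum()} empty={empty.sum()} invalid={invalid.sum()}")
print(usable)
```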

Performance on large layers

to_wkb() and set_precision are vectorised over the underlying GEOS arrays and scale to millions of rows, but the temporary WKB column consumes memory. For very large layers, drop the helper column as soon as you finish and process in chunks if needed. Apply set_precision only where you actually need near-duplicate matching, since snapping is more expensive than serialising to WKB alone.
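
One way to avoid carrying a helper column at all is to build the key as a standalone Series and filter with a boolean mask. The frame below is illustrative.

```python
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0, 0), Point(0, 0), Point(1, 1)],
)

# The key lives only in this temporary Series; nothing is added to the frame
is_redundant = gdf.geometry.to_wkb().duplicated(keep="first")
deduped = gdf[~is_redundant]

print(deduped)
```

The result matches the helper-column version, and the WKB bytes can be freed as soon as the mask has been computed.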

FAQ

Why does drop_duplicates not catch my duplicate geometries?

Because comparing geometry objects directly relies on their exact stored representation, not their shape. Two geometries can be the same shape but differ in vertex order, starting point, or precision, so pandas sees them as distinct. Convert with gdf.geometry.to_wkb() (and shapely.normalize for order independence) and run drop_duplicates on that key instead.

How do I remove near-duplicate polygons?

Snap coordinates to a grid with shapely.set_precision(gdf.geometry.values, grid_size), then normalize, convert to WKB, and dedupe on that key. The grid_size is in your CRS units and sets the resolution at which coordinates are treated as the same. Start with a value matching your data's accuracy and inspect the results before committing.

Does CRS affect duplicate detection?

Yes. Coordinate comparisons are only meaningful within a single CRS, and any distance-based grid_size depends on the CRS units. Reproject all layers to one CRS with to_crs before comparing, otherwise identical real-world features in different CRSs will never match.

What is the difference between exact and near duplicates?

Exact duplicates have identical geometry down to the vertex (after normalising for order); near duplicates are the same shape only within a tolerance. Use normalize + to_wkb for exact, and add set_precision for near. The near-duplicate case requires you to choose how much coordinate difference is acceptable.

Should I keep the first row or aggregate attributes?

If the duplicated rows are truly identical, keep the first — it loses nothing. If they share a geometry but carry different attribute values, decide whether one row is authoritative (sort then keep="first") or whether the values should be combined with dissolve/groupby using an aggfunc. Aggregating prevents silently throwing away real data.

Why are MultiPolygons and Polygons not matching as duplicates?

They have different geometry types and therefore different WKB, even when they cover the same area. A WKB comparison treats type as part of the identity. Explode multiparts or otherwise normalise geometry types before building your dedupe key if you need them to match.

How do I count duplicates without removing them?

Build a WKB (or snapped WKB) key and call .duplicated(keep=False).sum() to count every row involved in a duplicate, or .duplicated(keep="first").sum() to count only the redundant rows that would be removed. This lets you audit a dataset before deciding how to clean it.

Is to_wkb faster than comparing geometries in a loop?

Yes, substantially. to_wkb() is vectorised over GEOS and produces a hashable key that pandas deduplicates in one pass, whereas a Python loop that compares every pair of geometries with .equals() is O(n²) and orders of magnitude slower. Always prefer the vectorised WKB approach on anything beyond a handful of rows.