GeoPandas Spatial Join Returns Duplicate Rows: How to Fix It

Problem statement

A common surprise with geopandas.sjoin() is that the result has more rows than the left layer, or that the same feature appears several times. You start with 1,000 points, run a spatial join, and end up with 1,180 rows. The same point_id shows up two or three times, each time matched to a different polygon.

This usually happens for one of these reasons:

  • a left geometry intersects more than one right polygon (overlapping zones)
  • a point sits exactly on a shared boundary between two polygons
  • the right layer contains duplicate or overlapping polygons
  • the predicate is permissive enough to match several features at once
  • the right layer has multipart or stacked geometries covering the same area

The important thing to understand first: this is expected behavior, not a bug. A spatial join is inherently one-to-many. When a feature on the left genuinely relates to several features on the right, GeoPandas returns one row per match.

This guide shows how to confirm the one-to-many relationship and then choose the right fix: deduplicate, aggregate, or repair the source polygons.

Quick answer

A point sitting inside two overlapping zones, matching both and producing two joined rows.
A point inside two overlapping zones matches both, producing a duplicate row per zone.

If geopandas.sjoin() returns more rows than the left layer:

  1. confirm the extra rows come from a one-to-many spatial relationship
  2. find which left features matched multiple polygons
  3. decide whether to keep one match, aggregate all matches, or fix the data
  4. to keep one match, sort then drop_duplicates on the original ID
  5. to keep all matches, groupby the original ID and aggregate the polygon values

To keep exactly one row per left feature:

import geopandas as gpd

points = gpd.read_file("data/points.gpkg")
zones = gpd.read_file("data/zones.shp")

if points.crs != zones.crs:
    points = points.to_crs(zones.crs)

joined = gpd.sjoin(points, zones, how="left", predicate="within")

# keep one row per point, preferring the highest-priority zone
deduped = (
    joined.sort_values("zone_priority", ascending=False)
    .drop_duplicates(subset="point_id", keep="first")
)

The duplicates are real matches. Whether you drop or combine them depends on what you want each output row to mean.

Choosing what each row should mean

Flowchart — Deciding what one output row should represent after a one-to-many join.
Deciding what one output row should represent after a one-to-many join.

Step-by-step solution

Confirm the result really is one-to-many

Before changing anything, compare row counts. If the join produced more rows than the left layer, at least one left feature matched several right polygons.

import geopandas as gpd

points = gpd.read_file("data/points.gpkg")
zones = gpd.read_file("data/zones.shp")

if points.crs != zones.crs:
    points = points.to_crs(zones.crs)

joined = gpd.sjoin(points, zones, how="left", predicate="within")

print("Left rows:", len(points))
print("Joined rows:", len(joined))

If Joined rows is larger than Left rows, the extra rows are additional matches, not corrupted data.

Make sure the left layer has a stable unique ID

To find and resolve duplicates reliably, you need a column that uniquely identifies each left feature. Do not rely on the name field or on row position after the join.

if "point_id" not in points.columns:
    points = points.reset_index(drop=True)
    points["point_id"] = points.index + 1

sjoin() also adds an index_right column holding the index of the matched right feature. That column is how you tell which polygon each row came from.

Inspect which features matched multiple polygons

Group by the left ID and count how many rows each feature produced. Anything above one is a one-to-many match.

match_counts = joined.groupby("point_id").size()
multi = match_counts[match_counts > 1]

print("Points with multiple matches:", len(multi))
print(multi.head())

To see the actual duplicate rows, including which polygon each one came from:

duplicate_rows = joined[joined["point_id"].isin(multi.index)]
print(duplicate_rows[["point_id", "index_right", "zone_name"]].head(20))

This tells you whether the duplicates are caused by overlapping zones, boundary points, or repeated polygons.

Deduplicate by keeping one match

If you only want one zone per point, drop the extra matches. Sort first so the row you keep is the one you actually want, then drop duplicates on the unique ID.

deduped = (
    joined.sort_values("zone_priority", ascending=False)
    .drop_duplicates(subset="point_id", keep="first")
)

print("Rows after dedupe:", len(deduped))

Now each point_id appears once. The sort key controls which match survives. Without a meaningful sort, drop_duplicates keeps whichever row happens to come first, which is arbitrary.

Aggregate the many matches instead of dropping them

If every match matters, keep them all but collapse them into one row per left feature. Use groupby on the left ID and aggregate the right-layer values.

aggregated = (
    joined.groupby("point_id")
    .agg(
        zone_count=("zone_name", "size"),
        zone_names=("zone_name", lambda s: ", ".join(sorted(s.dropna().unique()))),
        max_priority=("zone_priority", "max"),
    )
    .reset_index()
)

print(aggregated.head())

This produces one row per point that records how many zones matched, the joined zone names, and the highest priority among them. You lose no information and you keep a single row per feature.

Fix overlapping or duplicate right polygons

Sometimes the duplicates come from the data, not the geography. If the right layer has polygons that overlap or repeat the same area, dissolve them so each location is covered once.

print("Zone rows before dissolve:", len(zones))

zones_clean = zones.dissolve(by="zone_name").reset_index()

print("Zone rows after dissolve:", len(zones_clean))

joined = gpd.sjoin(points, zones_clean, how="left", predicate="within")
print("Joined rows:", len(joined))

dissolve(by="zone_name") merges all polygons that share a name into a single geometry. After dissolving, a point inside one named zone matches once instead of once per fragment.

Code examples

Example 1: Confirm and quantify the duplicates

import geopandas as gpd

points = gpd.read_file("data/points.gpkg")
zones = gpd.read_file("data/zones.shp")

if points.crs != zones.crs:
    points = points.to_crs(zones.crs)

points = points.reset_index(drop=True)
points["point_id"] = points.index + 1

joined = gpd.sjoin(points, zones, how="left", predicate="within")

print("Left rows:", len(points))
print("Joined rows:", len(joined))

dupes = joined.groupby("point_id").size()
print("Points matched more than once:", (dupes > 1).sum())

Example 2: Keep one match per point with a priority rule

joined = gpd.sjoin(points, zones, how="left", predicate="within")

deduped = (
    joined.sort_values(["point_id", "zone_priority"], ascending=[True, False])
    .drop_duplicates(subset="point_id", keep="first")
)

print("Rows after dedupe:", len(deduped))
print(deduped[["point_id", "zone_name", "zone_priority"]].head())

Each point keeps the highest-priority zone it falls in. Use any column as the sort key, such as area, rank, or date.

Example 3: Aggregate all matches into one row per point

joined = gpd.sjoin(points, zones, how="left", predicate="within")

summary = (
    joined.groupby("point_id")
    .agg(
        zone_count=("zone_name", "size"),
        zones=("zone_name", lambda s: "; ".join(sorted(s.dropna().unique()))),
    )
    .reset_index()
)

# join the summary back to the original point geometries
result = points.merge(summary, on="point_id", how="left")
print(result[["point_id", "zone_count", "zones"]].head())

This keeps every point geometry and records every zone it touched.

Example 4: Remove overlapping right polygons with dissolve

import geopandas as gpd

points = gpd.read_file("data/points.gpkg")
zones = gpd.read_file("data/zones.shp")

if points.crs != zones.crs:
    points = points.to_crs(zones.crs)

# collapse polygons that share a name into one geometry each
zones_clean = zones.dissolve(by="zone_name").reset_index()

joined = gpd.sjoin(points, zones_clean, how="left", predicate="within")
print("Joined rows:", len(joined))

If the duplication was caused by fragmented or repeated polygons, this removes it at the source.

Explanation

A spatial join is not the same as a merge on a key column. An attribute merge on a unique key produces at most one match per row. A spatial join asks a geometric question instead: which right features satisfy the predicate with this left feature? That question can have several true answers.

If a point falls inside two overlapping zones, both relationships are valid. GeoPandas does not silently pick one. It returns one row per match, with a different index_right on each, just as a SQL join returns one row per matching pair. The result growing past the left row count is the join faithfully reporting every relationship it found.

That is why the row count alone does not tell you something is broken. The right response is to decide what one output row should represent:

  • one row per left feature, keeping a chosen match: sort and drop_duplicates
  • one row per left feature, summarizing all matches: groupby().agg()
  • one match because the data should not overlap: fix the right layer with dissolve()

The predicate also affects how many matches you get. A point exactly on a shared edge can be within neither, touches both, or intersects both, depending on geometry precision. Choosing a stricter predicate sometimes reduces a point to a single match, but it can also drop legitimate matches, so change the predicate for correctness, not just to shrink the row count.

Edge cases or notes

Points exactly on shared boundaries

A point that lands on the line between two adjacent polygons can match both with intersects or touches. With within it often matches neither, because it is not strictly inside either polygon. If boundary points must resolve to exactly one zone, deduplicate after the join rather than relying on the predicate to break the tie.

Overlapping source polygons

If the right layer is supposed to be a clean partition with no overlaps, duplicate matches are a sign the data overlaps where it should not. Check for overlap before blaming the join, and dissolve or clean the polygons so each location is covered once.

Predicate choice changes match counts

intersects is the most permissive common predicate and produces the most one-to-many matches. within is stricter and often yields single matches for interior points. Pick the predicate that matches your real rule, then handle any remaining duplicates explicitly.

Keeping versus dropping index_right

sjoin() adds index_right to record which right feature each row matched. It is useful for inspecting and deduplicating, but you usually do not want it in the final output. Drop it once you are done:

result = deduped.drop(columns="index_right")

If you run a second sjoin() on the result while index_right is still present, GeoPandas can raise a column-name conflict, so remove it between joins.

FAQ

Why does my spatial join have more rows than the input layer?

Because at least one left feature matched more than one right polygon. A spatial join returns one row per matching pair, so overlapping zones or boundary points naturally produce extra rows.

Is duplicate rows after sjoin() a bug?

No. It is expected behavior when a geometry genuinely relates to several features. The fix is to decide whether to keep one match, aggregate all matches, or clean overlapping source polygons.

How do I keep only one match per feature?

Sort the joined GeoDataFrame by a meaningful column, then call drop_duplicates(subset="point_id", keep="first"). The sort decides which match survives, so without it the kept row is arbitrary.

How is a spatial join different from a merge?

A merge matches on key equality and gives at most one match per unique key. A spatial join matches on a geometric predicate, which can be true for several features at once, so it can produce a one-to-many result.

Why does my left join keep rows that matched nothing, and how do I find them?

With how="left", unmatched left features are retained with NaN in the right-hand columns, including index_right. Find them with joined[joined["index_right"].isna()]; switch to how="inner" if you want to drop non-matching features instead.

Does sjoin_nearest() also produce duplicate rows?

Yes. If two right features are tied at the same minimum distance, gpd.sjoin_nearest() returns one row per tie, so a single left feature can appear multiple times. Deduplicate the same way: sort by a tie-breaker column, then drop_duplicates(subset="point_id").

Can I count how many polygons each point matched without aggregating away the geometry?

Yes. Compute the counts separately and merge them back: counts = joined.groupby("point_id").size().rename("match_count"), then points.merge(counts, on="point_id"). This keeps the original point geometries while adding the match tally.

Why do I get a "column index_right already exists" error on my second spatial join?

The first sjoin() left its index_right column in the result, and the second join tries to add it again. Drop it between joins with result = result.drop(columns="index_right") before calling sjoin() a second time.