How to Clean Messy CSV Coordinates into a Reliable GeoDataFrame
Raw coordinate CSVs almost never arrive analysis-ready. They come out of spreadsheets, web forms, GPS loggers, and legacy databases, and they carry every defect those systems can produce: numbers stored as text, European comma decimals, stray degree symbols, blank cells, and swapped latitude/longitude columns. If you hand that straight to GeoPandas, you either get an exception or — worse — a GeoDataFrame full of points that silently land in the ocean.
This guide is the cleaning-focused companion to reading a CSV with coordinates as a GeoDataFrame. Here you will take a genuinely messy CSV and turn it into a clean, valid, correctly-projected GeoDataFrame with a defensible, repeatable pipeline.
Problem statement
You have a CSV with point coordinates and you want a GeoDataFrame you can trust. The trouble is the data. In the real world, a coordinate CSV typically contains some mix of:
- Coordinates stored as strings — pandas reads the column as
object(text), notfloat, so arithmetic and geometry construction fail or behave strangely. - Comma decimal separators —
"48,8566"instead of48.8566, common in European locales and Excel exports. - Stray characters and whitespace — leading/trailing spaces, degree symbols (
°),N/S/E/Wsuffixes, thousands separators. - Blank or missing values — empty cells,
NaN,"null","NA","-", or"None"as text. - Non-numeric junk — typos like
"12.3.4","unknown", or a header row that got duplicated into the data. - Swapped latitude and longitude — the columns are mislabeled or the source put them in the wrong order.
- Out-of-range values — latitude beyond ±90 or longitude beyond ±180, from corrupt data or a units mistake.
- Duplicate points — the same location repeated, sometimes exactly, sometimes with trivial precision differences.
- No CRS information — the file just has numbers; nothing tells you the coordinate reference system.
The goal is a pipeline that handles all of these without crashing, drops what is unsalvageable, flags what is suspicious, and produces a GeoDataFrame with an explicitly set CRS.
Quick answer
Read the CSV with pandas, coerce the coordinate columns to numeric with errors="coerce" (turning junk into NaN), drop or filter the bad rows, then build geometry with gpd.points_from_xy — passing longitude first — and set crs="EPSG:4326". Export to a GeoPackage.
import pandas as pd
import geopandas as gpd
# 1. Read everything as-is
df = pd.read_csv("data/stations.csv")
# 2. Clean common string mess, then coerce to numeric (junk -> NaN)
for col in ["lat", "lon"]:
df[col] = (
df[col]
.astype(str)
.str.strip()
.str.replace("°", "", regex=False)
.str.replace(",", ".", regex=False) # comma decimals -> dots
)
df[col] = pd.to_numeric(df[col], errors="coerce")
# 3. Drop rows with missing coordinates
df = df.dropna(subset=["lat", "lon"])
# 4. Range-check: lat in [-90, 90], lon in [-180, 180]
valid = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)
df = df[valid].copy()
# 5. Build geometry — LONGITUDE (x) FIRST, latitude (y) second
gdf = gpd.GeoDataFrame(
df,
geometry=gpd.points_from_xy(df["lon"], df["lat"]),
crs="EPSG:4326",
)
# 6. Drop duplicate locations
gdf = gdf.drop_duplicates(subset=["lon", "lat"]).reset_index(drop=True)
# 7. Export
gdf.to_file("data/stations_clean.gpkg", driver="GPKG")
The rest of this guide breaks each step apart so you can inspect, log, and tune it for your data.
Step-by-step solution
Load and inspect the CSV
Always read first, look second. Do not assume the dtypes. Load the file and check what pandas actually inferred — if a coordinate column shows up as object, you have string mess to clean.
import pandas as pd
df = pd.read_csv("data/stations.csv")
print(df.shape)
print(df.dtypes) # are lat/lon float64 or object?
print(df.head(10))
print(df[["lat", "lon"]].describe(include="all"))
If lat or lon reports object, the values are text. If they are float64 already, you may only need the range and duplicate checks later.
Clean coordinate strings
Strip whitespace, remove degree symbols and direction letters, and convert comma decimals to dots. Work on the string representation so the operations are safe even for already-numeric columns.
def clean_coord_strings(s: pd.Series) -> pd.Series:
return (
s.astype(str)
.str.strip()
.str.replace("°", "", regex=False)
.str.replace(r"[NSEW]", "", regex=True) # drop direction letters
.str.replace(r"\s+", "", regex=True) # remove inner whitespace
.str.replace(",", ".", regex=False) # comma decimal -> dot
)
df["lat"] = clean_coord_strings(df["lat"])
df["lon"] = clean_coord_strings(df["lon"])
Be careful with the comma replacement if your numbers use commas as thousands separators (e.g. "1,234.5") — see the locale note in Edge cases. For lat/lon in degrees, thousands separators almost never appear, so a straight comma-to-dot swap is safe.
Convert to numeric with pd.to_numeric(errors="coerce")
This is the single most important step. errors="coerce" turns anything that cannot be parsed — "unknown", "12.3.4", empty strings, "NA" — into NaN instead of raising. That converts a fragile crash into a clean, countable signal.
df["lat"] = pd.to_numeric(df["lat"], errors="coerce")
df["lon"] = pd.to_numeric(df["lon"], errors="coerce")
# How much did we lose to unparseable junk?
bad = df["lat"].isna() | df["lon"].isna()
print(f"{bad.sum()} of {len(df)} rows have unparseable coordinates")
Drop rows with missing coordinates
A point needs both an x and a y. Drop any row where either coordinate is NaN after coercion. Keep a record of how many you removed.
before = len(df)
df = df.dropna(subset=["lat", "lon"]).copy()
print(f"Dropped {before - len(df)} rows with missing coordinates")
Range-check latitude/longitude and flag bad rows
Latitude must fall within ±90 and longitude within ±180. Values outside those bounds are corrupt, swapped, or in different units. Flag them so you can inspect before discarding.
df["lat_ok"] = df["lat"].between(-90, 90)
df["lon_ok"] = df["lon"].between(-180, 180)
df["coords_ok"] = df["lat_ok"] & df["lon_ok"]
# Inspect the suspicious rows before deciding what to do
print(df.loc[~df["coords_ok"], ["lat", "lon"]])
clean = df[df["coords_ok"]].copy()
Detect and fix swapped lat/lon
A classic symptom: a lat value above 90 (impossible for latitude) that would be perfectly valid as a longitude. If a row's lat is out of latitude range but in longitude range — and vice versa for lon — the two are probably swapped. Apply a targeted fix rather than dropping the row.
# Rows where lat looks like a longitude and lon looks like a latitude
swapped = (
~df["lat"].between(-90, 90)
& df["lat"].between(-180, 180)
& df["lon"].between(-90, 90)
)
df.loc[swapped, ["lat", "lon"]] = df.loc[swapped, ["lon", "lat"]].values
print(f"Repaired {swapped.sum()} swapped lat/lon rows")
# Re-run the range check after the fix
clean = df[df["lat"].between(-90, 90) & df["lon"].between(-180, 180)].copy()
This heuristic only catches the unambiguous cases. If both values are within ±90 (for example most of the populated world), you cannot tell from numbers alone whether they are swapped — you need a known location or a sanity plot. See Edge cases.
Build the GeoDataFrame and set the CRS
Now build the geometry. gpd.points_from_xy expects x first, then y — for geographic coordinates that means longitude, then latitude. Set the CRS explicitly to EPSG:4326, which is plain lon/lat degrees on the WGS 84 datum.
import geopandas as gpd
gdf = gpd.GeoDataFrame(
clean,
geometry=gpd.points_from_xy(clean["lon"], clean["lat"]),
crs="EPSG:4326",
)
print(gdf.crs) # EPSG:4326
print(gdf.geometry.head())
Never skip the crs argument. A GeoDataFrame with crs=None cannot be reprojected, plotted on a basemap, or joined with other layers reliably. If you later discover the data was in a different system, see how to fix a CRS mismatch in GeoPandas.
Remove duplicate points
Drop rows that share the same coordinates. Deduplicating on the numeric lon/lat columns is more robust than comparing geometry objects, and you can round first if near-identical points should be treated as the same place.
# Exact-coordinate duplicates
gdf = gdf.drop_duplicates(subset=["lon", "lat"]).reset_index(drop=True)
# Or treat points within ~1 m as duplicates by rounding to 5 decimals
gdf["lon_r"] = gdf["lon"].round(5)
gdf["lat_r"] = gdf["lat"].round(5)
gdf = gdf.drop_duplicates(subset=["lon_r", "lat_r"]).drop(columns=["lon_r", "lat_r"])
gdf = gdf.reset_index(drop=True)
Export
Write the result to a GeoPackage. It is a single-file, open OGC format that preserves the CRS, column types, and geometry far better than a shapefile (no 10-character field limit, no .dbf/.shx sidecar sprawl).
gdf.to_file("data/stations_clean.gpkg", driver="GPKG", layer="stations")
Code examples
Example 1 — Coerce and count the bad rows
import pandas as pd
df = pd.read_csv("data/stations.csv")
df["lat"] = pd.to_numeric(df["lat"], errors="coerce")
df["lon"] = pd.to_numeric(df["lon"], errors="coerce")
n_bad = (df["lat"].isna() | df["lon"].isna()).sum()
print(f"{n_bad} rows failed numeric parsing out of {len(df)}")
Example 2 — Clean comma decimals and degree symbols
import pandas as pd
df = pd.read_csv("data/stations.csv", dtype=str) # read as text to preserve mess
for col in ["lat", "lon"]:
df[col] = (
df[col]
.str.strip()
.str.replace("°", "", regex=False)
.str.replace(",", ".", regex=False)
)
df[col] = pd.to_numeric(df[col], errors="coerce")
print(df[["lat", "lon"]].head())
Example 3 — Range filter
mask = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)
print(f"Keeping {mask.sum()} of {len(df)} rows after range check")
clean = df[mask].copy()
Example 4 — Swapped-axis heuristic
# A latitude can never exceed 90. If it does but fits as a longitude, swap.
swapped = ~df["lat"].between(-90, 90) & df["lat"].between(-180, 180) \
& df["lon"].between(-90, 90)
df.loc[swapped, ["lat", "lon"]] = df.loc[swapped, ["lon", "lat"]].values
print(f"Swapped {swapped.sum()} rows")
Example 5 — Build and save the GeoDataFrame
import geopandas as gpd
gdf = gpd.GeoDataFrame(
clean,
geometry=gpd.points_from_xy(clean["lon"], clean["lat"]), # x=lon, y=lat
crs="EPSG:4326",
)
gdf = gdf.drop_duplicates(subset=["lon", "lat"]).reset_index(drop=True)
gdf.to_file("data/stations_clean.gpkg", driver="GPKG")
Explanation
Why coordinates arrive as strings. CSV is a text format with no type system. When a producing tool writes a coordinate, it can attach formatting — a degree symbol, a comma decimal, a N/S suffix — or simply leave a blank cell. The moment any non-numeric character lands in a column, pandas infers the whole column as object (string), because it cannot represent mixed types as float64. So even one stray "N/A" in 10,000 clean numbers downgrades the entire column to text.
Why errors="coerce" matters. pd.to_numeric has different error modes. The default "raise" throws on the first unparseable value, killing your whole pipeline over a single bad cell. "coerce" instead converts each unparseable value to NaN individually, leaving the good values intact. That gives you a clean, vectorized way to find, count, and drop bad rows with standard pandas tools (isna, dropna) instead of writing fragile try/except loops.
Why the CRS must be set explicitly. Coordinates are just numbers until you say what they mean. EPSG:4326 (WGS 84) declares that x is longitude in degrees and y is latitude in degrees on a specific datum. Without a CRS, GeoPandas cannot reproject, cannot align your points with other layers, and cannot place them on a web basemap (which usually wants EPSG:3857). Setting crs="EPSG:4326" at construction time records the assumption you are already making when you treat the numbers as lon/lat. For converting to another system afterward, see how to reproject spatial data in GeoPandas.
The lon=x / lat=y ordering trap. This is the most common mistake in the entire workflow. People say and write "lat/lon," but geometry libraries are built around the mathematical (x, y) convention, and x is longitude. gpd.points_from_xy(x, y) therefore wants points_from_xy(lon, lat). Pass them in spoken order — points_from_xy(lat, lon) — and every point silently lands at the wrong place: your data ends up rotated and mirrored across the globe, often in the ocean, with no error raised. Always pass longitude first.
Edge cases or notes
points_from_xy takes x = longitude first
Worth repeating because it is the bug you will hit most. The signature is points_from_xy(x, y, z=None). For geographic data, x is longitude and y is latitude. If your points appear flipped or scattered, this is the first thing to check.
Locale comma decimals and thousands separators
A blanket .str.replace(",", ".") works when commas are decimal separators but corrupts numbers that use commas as thousands separators ("1,234.56" would become "1.234.56", then coerce to NaN). If your file mixes conventions, prefer reading with the right locale up front: pd.read_csv("data/stations.csv", decimal=",") tells pandas the comma is the decimal point. Inspect a few raw values before choosing an approach.
Coordinates in projected units, not lat/lon
Not every coordinate CSV is in degrees. Values like easting = 500000, northing = 4649776 are projected (for example UTM, in metres), not lon/lat. The giveaway is magnitude: numbers in the hundreds of thousands or millions are not degrees. In that case, do not set EPSG:4326 — set the actual projected CRS (e.g. crs="EPSG:32633" for UTM zone 33N) and reproject to EPSG:4326 afterward if you need lon/lat. Setting the wrong CRS is a classic CRS mismatch source.
(0, 0) null-island artifacts
A point at exactly longitude 0, latitude 0 sits in the Gulf of Guinea off West Africa — "Null Island." It is almost always an artifact of missing data being filled with zeros rather than a real location. Unless your study area genuinely includes that spot, drop or flag (0, 0) rows: df = df[~((df["lon"] == 0) & (df["lat"] == 0))].
Precision and rounding
Coordinate precision is finite and meaningful: 5 decimal places of a degree is roughly 1 metre, 6 is roughly 0.1 metre. Storing 12 decimals implies a precision your GPS never had and bloats files. Rounding to 5-6 decimals before deduplication is a good way to collapse points that are "the same place" recorded at slightly different times. Choose the precision to match your data's real accuracy.
Mixed or text null markers
errors="coerce" handles "NA", "null", "none", and blanks, but only once they reach to_numeric as strings. If you read with dtype=str, they survive the string-cleaning step and get coerced cleanly. You can also pass them to read_csv up front: pd.read_csv(path, na_values=["NA", "null", "none", "-", "unknown"]).
Internal links
- How to Read a CSV with Coordinates as a GeoDataFrame
- How to Reproject Spatial Data in Python (GeoPandas)
- How to Fix CRS Mismatch in GeoPandas
- What Is a GeoDataFrame? Structure and Key Concepts
FAQ
Why are my points in the wrong place?
The two usual causes are swapped axes and the wrong CRS. If you passed points_from_xy(lat, lon) instead of points_from_xy(lon, lat), every point is flipped. If the numbers are large (hundreds of thousands), they are projected coordinates and need their real CRS, not EPSG:4326. Plot a few known locations to diagnose which it is.
Why do I get NaN coordinates?
pd.to_numeric(..., errors="coerce") converts any value it cannot parse into NaN. That includes blanks, text like "unknown", malformed numbers like "12.3.4", and comma decimals you did not convert to dots. The NaN is a feature: it lets you find and drop bad rows with df.dropna(subset=["lat", "lon"]).
Which comes first, latitude or longitude?
For human-readable address-style notation, latitude is usually spoken first. For geometry construction in GeoPandas and Shapely, longitude comes first because it is the x coordinate: gpd.points_from_xy(lon, lat). Mixing these up is the single most common error in coordinate workflows.
How do I detect swapped lat/lon when both values are valid?
You often cannot from numbers alone — if both fall within ±90, either ordering is numerically plausible. Use a known reference point, plot the data against a country boundary, or check that the points land in your expected study area. The numeric heuristic only catches cases where one value exceeds the valid latitude range.
Should I drop out-of-range coordinates or fix them?
Flag first, then decide. A latitude of 120 cannot be a real latitude, but it may be a swapped longitude you can repair. Values that make no sense in either column (like 9999) are corrupt and should be dropped. Keep a log of what you removed so the cleaning is auditable.
What CRS should I use for plain lon/lat degrees?
EPSG:4326 (WGS 84). That is the standard for unprojected longitude/latitude in degrees, which is what GPS devices and most "lat/lon" CSVs report. If you need to plot on a web map tile background or measure distances in metres, reproject to EPSG:3857 or an appropriate local projected CRS afterward.
Why use GeoPackage instead of shapefile for output?
GeoPackage (driver="GPKG") is a single-file SQLite-based format that preserves full column names, data types, and the CRS without the shapefile's limitations: shapefiles truncate field names to 10 characters, scatter data across multiple sidecar files, and have weaker type support. For modern GeoPandas workflows, prefer GeoPackage.
How do I handle duplicate points that are slightly different?
Round the coordinates to a precision that matches your real accuracy (5-6 decimal places is metre-level), then deduplicate on the rounded values: gdf.drop_duplicates(subset=["lon_r", "lat_r"]). This collapses points that represent the same location recorded with trivial floating-point or GPS jitter, while keeping genuinely distinct nearby points.