Geospatial vector data is a way of representing geographic features in a digital format using points, lines, and polygons. Unlike raster data, which represents geographic data as a grid of cells or pixels, vector data represents features more precisely with distinct shapes and boundaries. Each vector feature can have associated attributes, such as names, types, or other descriptive information.
Types of Geospatial Vector Data
Points: Represent discrete locations such as cities, landmarks, or individual trees. Each point has a specific location defined by coordinates (e.g., latitude and longitude).
Lines (or polylines): Represent linear features such as roads, rivers, or boundaries. Lines are composed of a series of connected points.
Polygons (or multipolygons): Represent areas or shapes such as lakes, parks, or country borders. Polygons are defined by a series of points that create a closed shape.
Shapefiles are one of the most common formats for vector data. They store points, lines, and polygons along with attribute information. The US Census Bureau makes a number of shapefiles available here. In this notebook, we’ll walkthrough how to load shapefiles into GeoPandas, plotting the boundaries and create a choropleth map based on a second dataset (choropleth maps are those where the color of each shape is based on the value of an associated variable).
The 500k files are the most detailed, but also the largest. The 20m files are the smallest, but at the cost of some dramatic simplification. The 5m files fall somewhere between the other two. We will work with the 500k files.
Once downloaded, the shapefile can be loaded into a GeoPandas DataFrame as follows:
import numpy as npimport pandas as pdimport geopandas as gpdshp_path ="cb_2018_us_state_500k.zip"dfshp = gpd.read_file(shp_path)dfshp.head(5)
STATEFP
STATENS
AFFGEOID
GEOID
STUSPS
NAME
LSAD
ALAND
AWATER
geometry
0
28
01779790
0400000US28
28
MS
Mississippi
00
121533519481
3926919758
MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ...
1
37
01027616
0400000US37
37
NC
North Carolina
00
125923656064
13466071395
MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...
2
40
01102857
0400000US40
40
OK
Oklahoma
00
177662925723
3374587997
POLYGON ((-103.00257 36.52659, -103.00219 36.6...
3
51
01779803
0400000US51
51
VA
Virginia
00
102257717110
8528531774
MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...
4
54
01779805
0400000US54
54
WV
West Virginia
00
62266474513
489028543
POLYGON ((-82.64320 38.16909, -82.64300 38.169...
The geometry column is a special column in a GeoDataFrame that stores the geometric shapes associated with each row (in this case, the shapes in latitude-longitude pairs that define the boundary of each state). This column contains the vector data that defines the spatial features in the dataset. Some states have boundaries defined by a MULTIPOLYGON, such as Hawaii, whose boundary consists of multiple closed POLYGONS. If it isn’t already present, the geometry column needs to be defined.
We can plot the data present in the present in the shapefile by calling the GeoDataFrame’s plot method:
dfshp.plot()
Let’s zoom in and focus on a map of the lower 48 states only:
exclude = ["American Samoa", "Alaska", "Hawaii", "Guam", "United States Virgin Islands","Commonwealth of the Northern Mariana Islands", "Puerto Rico"]dfshp48 = dfshp[~dfshp.NAME.isin(exclude)].reset_index(drop=True)dfshp48.plot()
We can get a better view of the boundaries of each state by calling boundary.plot:
dfshp48.boundary.plot()
By default, the plots rendered via GeoPandas are smaller than we might like. We can increase the size of the rendered map, suppress ticklabels, change the boundary color and add a title as follows:
In the shapefile, ALAND and AWATER represent the land and water area of each state in square meters. To create a choropleth map based on the natural log of AWATER, include the column argument to the plot method:
# Compute natural log of AWATER to get better separation by state.dfshp48["log_AWATER"] = np.log(dfshp48["AWATER"])dfshp48.plot(column="log_AWATER", cmap="plasma")
We can reformat the map as before, while also adding a legend to give context the difference in colors by state. Options for colormaps are available here:
For variety, let’s download the Congressional District shapefile and plot the boundaries. It is available at the same link as above, and is identified as cb_2018_us_cd116_500k.zip. Reading the file into GeoPandas and displaying the first 5 rows yields:
We’d like to focus on the lower 48 states again, but this time the shapefile doesn’t have a NAME column. How should we proceed?
One approach is to define a bounding box that encloses the lower 48 states, then filter the shapefile to retain only those congressional districts whose geometry intersects the bounding box. GeoPandas provides coordinate based indexing with the cx indexer, which slices using a bounding box. Geometries in the GeoSeries or GeoDataFrame that intersect the bounding box will be returned.
For the lower 48 states bounding box, we’ll use (-125, 24.6), (-65, 50), southwest to northeast. We also include a circle marker at the center of each congressional district:
C:\Users\jtriv\AppData\Local\Temp\ipykernel_8996\3296541533.py:9: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
dfc48.geometry.centroid.plot(ax=ax, markersize=6, color="red")
GeoJSON
Working with GeoJSON is much the same as working with shapefiles, one difference being that with GeoJSON, vector data is contained within a single file as opposed to an archive of multiple file types. See here for an example.
But once read into GeoPandas, we work with it the same way. We can load US state boundary files as GeoJSON from GitHub via: