Getting Started With GeoPandas
This notebook serves as a starter guide or template for doing geographical analysis using the GeoPandas library.
The dataset we will be using in this tutorial is from Analyze Boston. Analyze Boston is the City of Boston’s data hub and is a great resource for data sets regarding the city.
We will be working with two datasets. First is the 2020 CENSUS TRACTS IN BOSTON dataset, and second is the WICKED FREE WIFI LOCATIONS dataset. The first dataset contains the geographical boundaries of each census tract in Boston, and the second dataset contains the geographical location of each free wifi location in Boston.
# Start by importing the necessary packages
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon
# for visualizations : 
import seaborn as sns
import matplotlib.pyplot as plt
When working with geographical data, .shp files are the most common format, however these files are often accompanied by other files with the same name but different extensions. For example, a .shp file might come with a .dbf, .prj, .shx, and .cpg file. These are necessary files and are used in the background. GeoPandas (and other GIS software) typically expect these files to be in the same directory as the .shp file and to share the same base filename.
# read in the census data
census = gpd.read_file('census/Census2020_BlockGroups.shp')
census.head()
| OBJECTID | STATEFP20 | COUNTYFP20 | TRACTCE20 | BLKGRPCE20 | GEOID20 | NAMELSAD20 | MTFCC20 | FUNCSTAT20 | ALAND20 | AWATER20 | INTPTLAT20 | INTPTLON20 | Shape_STAr | Shape_STLe | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 025 | 040600 | 1 | 250250406001 | Block Group 1 | G5030 | S | 1265377.0 | 413598.0 | +42.3833695 | -071.0707743 | 1.807118e+07 | 29256.866068 | POLYGON ((769378.692 2964626.314, 769385.078 2... | 
| 1 | 2 | 25 | 025 | 051101 | 1 | 250250511011 | Block Group 1 | G5030 | S | 220626.0 | 0.0 | +42.3882285 | -071.0046816 | 2.374654e+06 | 9142.174252 | POLYGON ((788317.786 2966115.262, 788838.563 2... | 
| 2 | 3 | 25 | 025 | 051101 | 4 | 250250511014 | Block Group 4 | G5030 | S | 227071.0 | 270.0 | +42.3913407 | -071.0020343 | 2.446949e+06 | 11579.546171 | POLYGON ((789538.125 2967889.427, 789539.214 2... | 
| 3 | 4 | 25 | 025 | 981600 | 1 | 250259816001 | Block Group 1 | G5030 | S | 586981.0 | 158777.0 | +42.3886205 | -070.9934424 | 8.026752e+06 | 16626.718823 | POLYGON ((790938.417 2966482.118, 790974.619 2... | 
| 4 | 5 | 25 | 025 | 010204 | 3 | 250250102043 | Block Group 3 | G5030 | S | 145888.0 | 0.0 | +42.3459611 | -071.1020344 | 1.570220e+06 | 5510.560013 | POLYGON ((762928.668 2951612.031, 763063.607 2... | 
What makes a .shp file unique is the “geometry” column. The geometry column typically contains the coordinates or geometric properties of the features represented in the shapefile, such as points, lines, or polygons. For us, each row in our shapefile represents one census block group, and the geometry column stores a vectorized polygon boundaries of each block group that lets us see a visual representation of the block group.
census['geometry'][0]
We can plot all of these block groups together to get a visual representation of the city of Boston.
census.plot()
plt.title('Boston Census Block Groups')
plt.show()

Now let’s read in our Wicked Free Wifi Locations shapefile
# read in the wifi locations shapefile
wfw = gpd.read_file('wifi/Wicked_Free_WiFi_Locations.shp')
wfw
| neighborho | neighbor_1 | device_ser | device_con | device_add | device_lat | device_lon | device_tag | etl_update | is_current | org1 | org2 | inside_out | landmark | ObjectId | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | N_568579452955534204 | Dorchester | Q2CK-H48M-P2E9 | DOT-BFD21-AP3 | 641 Columbia Rd., Dorchester, MA | 42.318336 | -71.063236 | BFD Engine-21 Outside | 2023-06-12 | 1 | BFD | NaN | Outside | Engine 21 | 1 | POINT (-7910723.209 5208784.596) | 
| 1 | N_568579452955537848 | South Boston | Q2CK-LEG8-FAFN | SB-HIGHSCHOOL-AP2 | 95 G St., South Boston, MA | 42.332869 | -71.044891 | recently-added | 2023-06-12 | 1 | NaN | NaN | NaN | NaN | 2 | POINT (-7908681.044 5210972.710) | 
| 2 | N_568579452955538316 | Hyde Park | Q2CK-NERF-4JX7 | HP-BPDE18-AP2 | 1249 Hyde Park Ave., Hyde Park, MA | 42.256473 | -71.124219 | recently-added | 2023-06-12 | 1 | NaN | NaN | NaN | NaN | 3 | POINT (-7917511.803 5199475.676) | 
| 3 | N_579275502070532581 | Roxbury | Q2CK-DSFT-7HTL | ROX-TROTTER-AP3 | Humbolt & Waumbeck St., Roxbury, MA | 42.313249 | -71.089180 | BPS Outside Trotter-School | 2023-06-12 | 1 | BPS | NaN | Outside | Trotter School | 4 | POINT (-7913611.287 5208018.690) | 
| 4 | N_579275502070532581 | Roxbury | Q2CK-2D9P-VKGR | ROX-BFDE42-AP1 | 1870 Columbus Ave., Roxbury, MA | 42.318412 | -71.097758 | BFD Engine-42 Outside | 2023-06-12 | 1 | BFD | NaN | Outside | Engine 42 | 5 | POINT (-7914566.172 5208795.947) | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 255 | N_568579452955534204 | Dorchester | Q2CK-G9M2-VTK4 | DOT-BFD21-AP2 | 641 Columbia Rd., Dorchester, MA | 42.318336 | -71.063236 | BFD Engine-21 Outside | 2023-06-12 | 1 | BFD | NaN | Outside | Engine 21 | 256 | POINT (-7910723.209 5208784.596) | 
| 256 | N_568579452955527921 | Parks | Q2CK-MP93-YDN3 | PARKS-COMMONEAST-AP1 | 139 Tremont St, Boston, MA 02111 | 42.355436 | -71.063827 | Boston-Common | 2023-06-12 | 1 | NaN | NaN | NaN | Boston Common | 257 | POINT (-7910789.066 5214371.556) | 
| 257 | L_601230550253962688 | Navy Yard | Q2EK-HAXQ-8XW4 | NavyYard-AP2 | Navy yard Cambridge | 42.373720 | -71.053272 | NaN | 2023-06-12 | 1 | NaN | NaN | NaN | NaN | 258 | POINT (-7909614.112 5217126.360) | 
| 258 | N_568579452955534204 | Dorchester | Q2CK-NJQU-RB55 | DOT-MATHER-AP3 | 24 Parrish St, Dorchester | 42.308397 | -71.061017 | recently-added | 2023-06-12 | 1 | NaN | NaN | NaN | NaN | 259 | POINT (-7910476.256 5207288.280) | 
| 259 | N_579275502070532581 | Roxbury | Q2CK-NEZE-VRYJ | ROX-VINE-AP3 | 339 Dudley St, Roxbury | 42.326811 | -71.076707 | recently-added | 2023-06-12 | 1 | NaN | NaN | NaN | NaN | 260 | POINT (-7912222.881 5210060.516) | 
260 rows × 16 columns
This time, our geometry column are single POINTs rather than whole POLYGONs, which makes sense because each row represents a single wifi location. After plotting them all together we see just a bunch of dots, which is not very helpful. We can make this plot more useful by adding the census block groups to the plot as well.
wfw.plot()
plt.title('Wifi Locations')
plt.show()

Data Cleaning
print(census.shape)
print(wfw.shape)
(581, 16)
(260, 16)
census.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 581 entries, 0 to 580
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   OBJECTID    581 non-null    int64   
 1   STATEFP20   581 non-null    object  
 2   COUNTYFP20  581 non-null    object  
 3   TRACTCE20   581 non-null    object  
 4   BLKGRPCE20  581 non-null    object  
 5   GEOID20     581 non-null    object  
 6   NAMELSAD20  581 non-null    object  
 7   MTFCC20     581 non-null    object  
 8   FUNCSTAT20  581 non-null    object  
 9   ALAND20     581 non-null    float64 
 10  AWATER20    581 non-null    float64 
 11  INTPTLAT20  581 non-null    object  
 12  INTPTLON20  581 non-null    object  
 13  Shape_STAr  581 non-null    float64 
 14  Shape_STLe  581 non-null    float64 
 15  geometry    581 non-null    geometry
dtypes: float64(4), geometry(1), int64(1), object(10)
memory usage: 72.8+ KB
wfw.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   neighborho  260 non-null    object  
 1   neighbor_1  260 non-null    object  
 2   device_ser  260 non-null    object  
 3   device_con  250 non-null    object  
 4   device_add  245 non-null    object  
 5   device_lat  248 non-null    float64 
 6   device_lon  248 non-null    float64 
 7   device_tag  215 non-null    object  
 8   etl_update  260 non-null    object  
 9   is_current  260 non-null    int64   
 10  org1        119 non-null    object  
 11  org2        5 non-null      object  
 12  inside_out  162 non-null    object  
 13  landmark    168 non-null    object  
 14  ObjectId    260 non-null    int64   
 15  geometry    248 non-null    geometry
dtypes: float64(2), geometry(1), int64(2), object(11)
memory usage: 32.6+ KB
Doesn’t look like there’s much to clean in the census data, but in the wifi dataset let’s just drop the organization columns because they contain the most nulls, and we don’t need them for our analysis.
wfw.drop(['org1', 'org2'], axis=1, inplace=True)
Showing the wifi locations on our census block group map
So we can see a map of Boston and a map of dots representing wifi locations separately, but how do I overlay these dots on top of the map of Boston to actuall see where in Boston these wifi locations are? First, check that the Coordinate Reference Systems (CRS) of both geodataframes are the same. The CRS specifies the coordinate system and reference frame used to define the spatial positions of the features represented in the file, and in order to overlay two geodataframes, they must have the same CRS.
# Check the CRS of both GeoDataFrames
print("Census CRS: ", census.crs)
print("Wifi CRS: ", wfw.crs)
Census CRS:  PROJCS["NAD83 / Massachusetts Mainland (ftUS)",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic_2SP"],PARAMETER["latitude_of_origin",41],PARAMETER["central_meridian",-71.5],PARAMETER["standard_parallel_1",41.7166666666667],PARAMETER["standard_parallel_2",42.6833333333333],PARAMETER["false_easting",656166.666666667],PARAMETER["false_northing",2460625],UNIT["US survey foot",0.304800609601219,AUTHORITY["EPSG","9003"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]
Wifi CRS:  EPSG:3857
Since they’re different, we need to use the to_crs method to make sure they’re all in the same coordinate system. We’ll use the EPSG:4326 coordinate system, which is one of the most common coordinate systems for geographical data. Then, we can overlay the two plots using matplotlib.
# If they're different, reproject the wifi data to match the census data
if census.crs != wfw.crs:
    census = census.to_crs(wfw.crs)
# Create a new figure and axis
fig, ax = plt.subplots(1, 1)
# Plot the census data
census.plot(ax=ax, color='gray')
# Plot the wifi locations
wfw.plot(ax=ax, color='red', markersize=10, alpha=0.2)
plt.title("Boston Census Block Groups and Wifi Locations")
plt.show()

If you want to zoom in on a specific area of a map in GeoPandas, you need to add the lines: plt.xlim() and plt.ylim(). These functions take in the xmin and xmax, and ymin and ymax values of the area you want to zoom in on. You can find these values by looking at the x and y axes of the original plot.
# Same as before
if census.crs != wfw.crs:
    census = census.to_crs(wfw.crs)
fig, ax = plt.subplots(1, 1)
census.plot(ax=ax, color='gray')
wfw.plot(ax=ax, color='red', markersize=10, alpha=0.2)
plt.title("Zoomed in Boston Census Block Groups and Wifi Locations")
# Set the x and y limits to zoom in on Boston
plt.xlim([-7.915e6, -7.91e6])  # xmin and xmax
plt.ylim([5.2075e6, 5.2125e6])  # ymin and ymax
plt.show()

Make a Choropleth (color each census block according to the number of WiFi locations in it):
Now that both of our GeoDataFrames are on the same CRS, we can do a spatial join to find out how many WiFi locations are in each census tract. A spatial join is a type of join that merges two GeoDataFrames based on the spatial relationship between the geometries of the two GeoDataFrames. In this case, we want to know how many WiFi locations are in each census tract, so we will use the contains operation to find all of the WiFi locations that are within each census tract.
# Perform a spatial join between the census tracts and WiFi locations
wifi_per_tract = gpd.sjoin(census, wfw, predicate='contains').groupby('TRACTCE20').size()
wifi_per_tract.head()
TRACTCE20
000301     1
000603     3
000604     1
000805     1
030302    35
dtype: int64
The “contains” operation (as well as other spatial operations like “intersects”, “within”, etc.) uses geometric calculations based on the shapes’ coordinates (which have the same CRS) to determine spatial relationships.
For the “contains” operation specifically, it checks if all points of the other geometric object are within the current geometric object, and that their boundaries do not coincide.
For example:
from shapely.geometry import Point, Polygon
# Create a 5x5 square (tract)
polygon = Polygon([(0, 0), (5, 0), (5, 5), (0, 5)])
# Create two points (WiFi location)
point1 = Point(3, 3)  # Inside the polygon
point2 = Point(6, 6)  # Outside the polygon
# Check containment
print(polygon.contains(point1))
print(polygon.contains(point2))  
True
False
We can now groupby the unique ‘TRACTCE20’ values and take .size() to get the number of wifi locations in each tract. Then, we can merge this with our census dataframe to get a new column with the number of wifi locations in each census tract.
# Add a new column to the census GeoDataFrame with the count of WiFi locations per tract
census = census.merge(wifi_per_tract.rename('wifi_count'), how='left', left_on='TRACTCE20', right_index=True)
# Replace NaN values (tracts without any WiFi location) with 0
census['wifi_count'].fillna(0, inplace=True)
# Now, create a new figure and axis
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
# Plot the census data, color-coded by the number of WiFi locations
census.plot(column='wifi_count', ax=ax, legend=True, cmap='cividis', 
            legend_kwds={'label': "Number of WiFi Locations per Tract",
                         'orientation': "horizontal"})
plt.title("Number of WiFi Locations per Tract")
plt.show()

census[census['wifi_count'] == census['wifi_count'].max()]['wifi_count']
478    35.0
Name: wifi_count, dtype: float64
Looks like there’s 35 free wifi locations in this bright yellow tract, maybe the ‘landmarks’ column will tell us why…
wfw['landmark'].value_counts().head(5).sort_values().plot(kind='barh', figsize=(10, 10))
plt.title('Number of WiFi Locations by Landmark')
plt.xlabel('Number of WiFi Locations')
plt.ylabel('Landmark')
plt.show()

So it could be the City Hall, let’s graph it!
city_hall_wifi = wfw[wfw['landmark'] == 'City Hall']
if census.crs != city_hall_wifi.crs:
    city_hall_wifi = city_hall_wifi.to_crs(census.crs)
fig, ax = plt.subplots()
census.plot(ax=ax, color='grey', edgecolor='black')
city_hall_wifi.plot(ax=ax, color='red', markersize=10)
plt.title('WiFi Locations at City Hall')
plt.show()

Looks like we’re right! After plotting the subset of our dataframe where Landmark == “City Hall”, we can see that all of these points are clustered in that bright yellow census tract. Looking at the listed address of these WiFi locations, we can see that they are all located at 1 City Hall Square, which is the address of Boston City Hall.
Here are the unique device addresses of these city hall wifi locations:
city_hall_wifi['device_add'].unique()
array(['1 City Hall Square, Boston, MA 02201',
       '1 City Hall Plaza, Pavilion\nBoston, MA  02201',
       'One City Hall Sq, Pavilion Bldg\nBoston, MA 02201', nan,
       '1 City Hall Plaza, \nBoston, MA  02201',
       'One City Hall Sq, Pavilion Bld, IWF Room,\nBoston, MA 02201'],
      dtype=object)
Inside vs Outside Wifi locations
fig, ax = plt.subplots(1, 1)
census.plot(ax=ax, color='gray')
# Plot the wifi locations with color based on "inside_out" column
wfw.plot(ax=ax, column='inside_out', cmap='cool', markersize=10, legend=True, alpha=0.2)
plt.title("Boston Census Block Groups and Wifi Locations")
plt.show()

wfw['inside_out'].value_counts().plot(kind='barh')
plt.show()

Looks like most of our WiFi locations are outside. There’s probably a lot of “regular” EDA that can be done here, but since this is a GeoPandas/geographical analysis notebook, we’ll just leave it at that.
Closest Wifi location to BU
from geopy.distance import geodesic
# Coordinates of Boston University (accordign to Google)
BU_latitude = 42.3505
BU_longitude = -71.1054
# Create a copy of the WiFi locations and drop rows with missing coordinates
wfw = wfw.copy()
wfw.dropna(subset=['device_lat', 'device_lon'], inplace=True)
# Function to calculate distances to BU
def calculate_distance(row):
    point = (row['device_lat'], row['device_lon'])
    return geodesic(point, (BU_latitude, BU_longitude)).miles
wfw['distance_to_BU'] = wfw.apply(calculate_distance, axis=1)
# Find the WiFi location with the minimum distance
wfw[wfw['distance_to_BU'] == min(wfw['distance_to_BU'])]
| neighborho | neighbor_1 | device_ser | device_con | device_add | device_lat | device_lon | device_tag | etl_update | is_current | inside_out | landmark | ObjectId | geometry | distance_to_BU | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | N_601230550253961587 | BCYF Tremont | Q2FD-4SE4-JW2S | BCYF-3rd FL GED | 1483 Tremont St, Boston, MA | 42.332317 | -71.09865 | BCYF Inside employee | 2023-06-12 | 1 | Inside | employee | 37 | POINT (-7914665.525 5210889.576) | 1.301774 | 
In order to find the closest wifi location to BU, we need to find the distance between each wifi location and BU. We can do this using the geodesic function from geopy, which returns the distance between two longitudes/latitudes. Then, we can find the row where the distance is the smallest, which will be the closest wifi location to BU.
The closest wifi location to BU is the one at 1483 Tremont St, Boston, MA. It’s about 1.3 miles away from BU. Let’s plot it!
fig, ax = plt.subplots(1, 1)
# Plot the census data
census.plot(ax=ax, color='gray')
# Plot the wifi locations
wfw.plot(ax=ax, color='blue', markersize=10, alpha=0.2)
# Plot the WiFi location closest to BU in red
wfw.query("device_ser == 'Q2FD-4SE4-JW2S'").plot(ax=ax, color='red', markersize=20)
plt.title("1483 Tremont Street and other Wifi Locations")
plt.show()

Wrap Up, Next Steps
There’s a lot more you can do with these datasets, for example, you could try finding the average distance between wifi locations, or incorporate another dataset with demographic information and try to find a relationship between income, education level, population density, etc and wifi locations. but this notebook should serve as a good starting point for your geographical analysis.
Click here to download this Jupyter Notebook and data (make sure you are signed in with your BU email)!