Relation b/w median income, average household size & cases per capita in NYC
Geospatial plot of cases per capita in NYC and an interactive scatter plot finding the relationship between median income, average household size & cases per capita.
Today we will make the plots that are present in this webpage that tracks the situation in New York City - New York City Coronavirus Map and Case Count
There are two main plots -
- Cases per capita
- Plot analyzing the relationship b/w average household size, income and cases per capita
They have since corrected the graph -


#hide_output
import pandas as pd
import altair as alt
import geopandas as gpd
alt.renderers.set_embed_options(actions=False)
Fortunately NYC Department of Health and Mental Hygiene publishes their data in their GitHub Repo. It has all the data from cases to the shapefiles too.
But we will use the data from NYC Open Data which is equally good. We will use the NYC Department of Health and Mental Hygiene GitHub repo only for their COVID data per MODZCTA.
The location of the dataset on NYC Open Data portal is https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk which we will export as geojson.
import requests
resp = requests.get('https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=GeoJSON')
data = resp.json()
nyc_zip = gpd.GeoDataFrame.from_features(data['features'])
nyc_zip.head()
# If you have the data locally
#modzcta = 'MODZCTA/Modified Zip Code Tabulation Areas.geojson'
#nyc_zip_shp = gpd.read_file(modzcta)
This is how it looks when colored based on population -
alt.Chart(nyc_zip).mark_geoshape().encode(
color='pop_est:Q'
)
Now we will get the Median Household Income data from Census Reporter's wonderful website. In particular the url query for the 5 boroughs in NYC is https://censusreporter.org/data/table/?table=B19013&geo_ids=860%7C05000US36061,860%7C05000US36047,860%7C05000US36081,860%7C05000US36005,860%7C05000US36085 which I came to know thanks to this comment in NYC Health covid data repo.
Download it and export it to work with it further (I chose the Shapefile, but I suppose others would be fine too).
median_income_path = 'MODZCTA/median_income_nyc/acs2018_5yr_B19013_86000US11417/acs2018_5yr_B19013_86000US11417.shp' #push it to datasets repo and link here
median_income = gpd.read_file(median_income_path)
median_income.head()
The regions in median income data that are not part of NTC's modzcta can be seen as follows(basically we are finding out if both datasets show the same regions) -
median_income[median_income['name'].isin(nyc_zip['modzcta']) == False].plot()
Fun Fact - the areas marked above are also NOT SHOWN in NYT's Charts
If you want you can plot the Median Income Data too using -
alt.Chart(median_income).mark_geoshape().encode(
color='B19013001:Q'
)
Now we will get the COVID data per MODZCTA
covid_zip_uri = 'https://raw.githubusercontent.com/nychealth/coronavirus-data/master/data-by-modzcta.csv'
covid_zip = pd.read_csv(covid_zip_uri)
covid_zip.head()
Merge the covid data with NYC shapefile data to be able to see the number of cases per modzcta
def modify_merge(nyc_zip, covid_zip):
covid_zip = covid_zip.loc[:, ['MODIFIED_ZCTA', 'BOROUGH_GROUP', 'COVID_CASE_COUNT', 'COVID_DEATH_COUNT', 'NEIGHBORHOOD_NAME']]
nyc_zip['pop_est'] = nyc_zip['pop_est'].astype(int)
nyc_zip['modzcta'] = nyc_zip['modzcta'].astype(int)
covid_zip.rename(columns = {'MODIFIED_ZCTA': 'modzcta'}, inplace=True)
covid_nyc_zip = nyc_zip.merge(covid_zip, how='left', on='modzcta')
return covid_nyc_zip
covid_nyc_zip = modify_merge(nyc_zip, covid_zip)
covid_nyc_zip.head()
Customary covid cases visualization per modfied zip code tabulation area -
alt.Chart(covid_nyc_zip).mark_geoshape().encode(
color='COVID_CASE_COUNT'
)