Today we will make the plots that are present in this webpage that tracks the situation in New York City - New York City Coronavirus Map and Case Count

There are two main plots -

Cases per capita
Plot analyzing the relationship b/w average household size, income and cases per capita

They have since corrected the graph -

Old

old chart

New

new chart

#hide_output
import pandas as pd
import altair as alt
import geopandas as gpd
alt.renderers.set_embed_options(actions=False)

Fortunately NYC Department of Health and Mental Hygiene publishes their data in their GitHub Repo. It has all the data from cases to the shapefiles too.

But we will use the data from NYC Open Data which is equally good. We will use the NYC Department of Health and Mental Hygiene GitHub repo only for their COVID data per MODZCTA.

The location of the dataset on NYC Open Data portal is https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk which we will export as geojson.

import requests
resp = requests.get('https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=GeoJSON')
data = resp.json()
nyc_zip = gpd.GeoDataFrame.from_features(data['features'])
nyc_zip.head()

# If you have the data locally
#modzcta = 'MODZCTA/Modified Zip Code Tabulation Areas.geojson'
#nyc_zip_shp = gpd.read_file(modzcta)

This is how it looks when colored based on population -

alt.Chart(nyc_zip).mark_geoshape().encode(
    color='pop_est:Q'
)

Now we will get the Median Household Income data from Census Reporter's wonderful website. In particular the url query for the 5 boroughs in NYC is https://censusreporter.org/data/table/?table=B19013&geo_ids=860%7C05000US36061,860%7C05000US36047,860%7C05000US36081,860%7C05000US36005,860%7C05000US36085 which I came to know thanks to this comment in NYC Health covid data repo.

Download it and export it to work with it further (I chose the Shapefile, but I suppose others would be fine too).

median_income_path = 'MODZCTA/median_income_nyc/acs2018_5yr_B19013_86000US11417/acs2018_5yr_B19013_86000US11417.shp' #push it to datasets repo and link here
median_income = gpd.read_file(median_income_path)
median_income.head()

The regions in median income data that are not part of NTC's modzcta can be seen as follows(basically we are finding out if both datasets show the same regions) -

median_income[median_income['name'].isin(nyc_zip['modzcta']) == False].plot()

<AxesSubplot:>

Fun Fact - the areas marked above are also NOT SHOWN in NYT's Charts

If you want you can plot the Median Income Data too using -

alt.Chart(median_income).mark_geoshape().encode(
    color='B19013001:Q'
)

Now we will get the COVID data per MODZCTA

covid_zip_uri = 'https://raw.githubusercontent.com/nychealth/coronavirus-data/master/data-by-modzcta.csv'
covid_zip = pd.read_csv(covid_zip_uri)
covid_zip.head()

Merge the covid data with NYC shapefile data to be able to see the number of cases per modzcta

def modify_merge(nyc_zip, covid_zip):
    covid_zip = covid_zip.loc[:, ['MODIFIED_ZCTA', 'BOROUGH_GROUP', 'COVID_CASE_COUNT', 'COVID_DEATH_COUNT', 'NEIGHBORHOOD_NAME']]
    nyc_zip['pop_est'] = nyc_zip['pop_est'].astype(int)
    nyc_zip['modzcta'] = nyc_zip['modzcta'].astype(int)
    covid_zip.rename(columns = {'MODIFIED_ZCTA': 'modzcta'}, inplace=True)
    covid_nyc_zip = nyc_zip.merge(covid_zip, how='left', on='modzcta')
    return covid_nyc_zip

covid_nyc_zip = modify_merge(nyc_zip, covid_zip)
covid_nyc_zip.head()

Customary covid cases visualization per modfied zip code tabulation area -

alt.Chart(covid_nyc_zip).mark_geoshape().encode(
    color='COVID_CASE_COUNT'
)

Now we will merge median income data with the covid cases per zip code tabulation area data we just derieved above -

def modify_merge(covid_nyc_zip, median_income):
    median_income.rename(columns={'name': 'modzcta', 'B19013001': 'median_income'}, inplace=True)
    median_income['modzcta'] = median_income['modzcta'].astype(int)
    median_income = median_income.drop(columns=['geometry', 'B19013001e'])
    covid_income_zip = covid_nyc_zip.merge(median_income, how='inner', on='modzcta')
    covid_income_zip =  covid_income_zip.assign(
        #case_per_people = covid_income_zip['COVID_CASE_COUNT']/covid_income_zip['POP_DENOMINATOR'], #their earlier calculation
        case_per_people = covid_income_zip['COVID_CASE_COUNT']/covid_income_zip['pop_est'], # current calculation
    )
    return covid_income_zip

covid_income_zip = modify_merge(covid_nyc_zip, median_income)
covid_income_zip.head()

Now let's plot this data and we will find that it exactly MATCHES NYT's chart for cases per people v/s median income -

alt.Chart(covid_income_zip).mark_circle().encode(
    color=alt.Color('median_income:Q', sort = "descending"),
    x=alt.X('median_income:Q'),
    y='case_per_people:Q'
)

Getting the number of people per household data

This data can be calculated from Household Type by Household Size data, again from Census Reporter - the url query would be https://censusreporter.org/data/table/?table=B11016&geo_ids=860|05000US36061,860|05000US36047,860|05000US36081,860|05000US36005,860|05000US36085

Getting data like these honestly deserves its own blog post. It took me a very long time to find the proper data. Perhaps I will write about it some other time.

household_path = 'MODZCTA/household_type_by_household_size/csvofsame/acs2018_5yr_B11016_86000US11417.csv'
household = pd.read_csv(household_path)
household.head()

def consolidated_household_per_zip(df):
    household_data = pd.DataFrame()
    household_data = household_data.assign(
        one_person = df['1-person nonfamily household'],
        two_person = df['2-person household'] + df['2-person nonfamily household'],
        three_person = df['3-person household'] + df['3-person nonfamily household'],
        four_person = df['4-person household'] + df['4-person nonfamily household'],
        five_person = df['5-person household'] + df['5-person nonfamily household'],
        six_person = df['6-person household'] + df['6-person nonfamily household'],
        seven_or_more_person = df['7 or more-person household'] + df['7 or more-person nonfamily household'],
        modzcta = df['name'],
        total = df['Total']
        #avg_people_per_household = df.apply(lambda x: (x[1]+2*x[2]+3*x[3]+4*x[4]+5*x[5]+6*x[6]+7*x[7])/(x['total']), axis=1)
    )
    return household_data

household_data = consolidated_household_per_zip(household)
household_data.head()

Now we will merge the household data too to arrive at the final data for plotting -

plot_data = covid_income_zip.merge(household_data, how='inner', on='modzcta')

Calculating Average people per household is simple enough, jsut divide the population with the number of households -

plot_data['avg_person_per_household'] = plot_data['pop_est']/plot_data['total']

Finally let's plot the data side by side - horizontal concatenatioon using | operator -

income = alt.Chart(plot_data).mark_circle().encode(
    color=alt.Color('median_income:Q', sort = "descending"),
    x=alt.X('median_income:Q'),
    y='case_per_people:Q'
)

household_size = alt.Chart(plot_data).mark_circle().encode(
    #color=alt.Color('median_income:Q', sort = "descending"),
    x=alt.X('avg_person_per_household:Q'),
    y='case_per_people:Q'
)

income|household_size

Let's beautify it a little and add interactivity so that whenever you select an area in one of the graphs, the corresponding points gets highlighted in the second graph and you can drag the selection around to see that actually that the outbreak is worse in poorer areas where more people live together.

brush = alt.selection_interval(encodings=['x'])

chart = alt.Chart(plot_data).mark_circle(size=100).encode(
    color=alt.condition(brush, 'BOROUGH_GROUP:N', alt.value('lightgray'), legend=alt.Legend(title="Borough")),
    y=alt.Y('case_per_people:Q')
).add_selection(brush).properties(width=525, height=480)

alt.hconcat(
    chart.encode(
        x=alt.X('median_income:Q', axis=alt.Axis(tickCount=3, format="$d",titleOpacity=0.6, title='Higher Median Income ->')), 
        y=alt.Y('case_per_people:Q', axis=alt.Axis(format="%",titleOpacity=0.6, tickCount=4, title="Cases in % of population"))
    ),
    chart.encode(
        x=alt.X('avg_person_per_household:Q', scale=alt.Scale(zero=False), axis=alt.Axis(tickCount=3,titleOpacity=0.6, title='More People per Household ->')), 
        y=alt.Y('case_per_people:Q', axis=alt.Axis(format="%", tickCount=4, title=None))
    )  
).configure_view(strokeWidth=0).configure_axis(grid=True,gridOpacity=0.6,labelOpacity=0.5)

Way to reproduce the old incorrect graph -

Get the total population data as 1*(1 person households) + .... + 7*(7 or more person households)

Hope you noticed the error over there, but it perfectly captures the old graph as you can see in my old tweet which I had implemented that way -

Working on my next set of #dataviz. The regions in NYC with high cases per capita are also regions with the lowest median incomes and largest average household size. Inspired by @nytgraphics and made using #Python and #altair
Follow the blog 👉 https://t.co/JfXtHy5hPh pic.twitter.com/QawWEEgb4B
— Shantam Raj (@RajShantam) July 4, 2020

Gif of the above implementation -
old map gif

Now let's make the geospatial chloropleth plot cases per capita like the second chart at the top of the post

Since we have the data in order, - we will use the covid_income_zip data beause it already has the "case_per_people" field that we have to plot - potting is as simple as -

alt.Chart(covid_income_zip).mark_geoshape().encode(
    color='case_per_people'
)

Using their color scheme -

def color_it(x):
    if x<(1/50):
        return "less than 1/50"
    elif x<(1/40):
        return "1/50 to 1/40"
    elif x<(1/30):
        return "1/40 to 1/30"
    else:
        return "more than 1/30"


covid_income_zip = covid_income_zip.assign(share_of_population=covid_income_zip['case_per_people'].apply(color_it))

alt.Chart(covid_income_zip).mark_geoshape().encode(
    color=alt.Color('share_of_population',  scale=alt.Scale(domain=['less than 1/50', '1/50 to 1/40', '1/40 to 1/30', 'more than 1/30'], range=['#f2df91', '#ffae43', '#ff6e0b', '#ce0a05']))
).properties(width=600, height=600)

How do we add borough names on top?

I love this part - time for some actual spatial analysis. To show the Borough names we will actually dissolve the areas by borough and get the centroid and finally place our text mark there

borough = covid_income_zip.dissolve(by='BOROUGH_GROUP')
borough['lon'] = borough['geometry'].centroid.x
borough['lat'] = borough['geometry'].centroid.y
borough = borough.reset_index()
borough

Let's plot it and put the legend on top -

a = alt.Chart(covid_income_zip).mark_geoshape().encode(
    color=alt.Color(
        'share_of_population',  
        scale=alt.Scale(domain=['less than 1/50', '1/50 to 1/40', '1/40 to 1/30', 'more than 1/30'], range=['#f2df91', '#ffae43', '#ff6e0b', '#ce0a05']),
        legend=alt.Legend(title="Share of Population", orient='top', symbolType='square', columnPadding=20, symbolSize=200)
    ),
    tooltip = ['BOROUGH_GROUP', 'NEIGHBORHOOD_NAME', 'modzcta', 'COVID_CASE_COUNT', 'COVID_DEATH_COUNT']
).properties(width=600, height=600)

b = alt.Chart(borough).mark_text().encode(
    longitude='lon',
    latitude='lat',
    text='BOROUGH_GROUP:N'
)
a+b

	geometry	label	modzcta	pop_est	zcta
0	MULTIPOLYGON (((-73.98774 40.74407, -73.98819 ...	10001, 10118	10001	23072	10001, 10119, 10199
1	MULTIPOLYGON (((-73.99750 40.71407, -73.99709 ...	10002	10002	74993	10002
2	MULTIPOLYGON (((-73.98864 40.72293, -73.98876 ...	10003	10003	54682	10003
3	MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ...	10004	10004	3028	10004
4	MULTIPOLYGON (((-74.00783 40.70309, -74.00786 ...	10005	10005	8831	10005, 10271

	geoid	name	B19013001	B19013001e	geometry
0	86000US10001	10001	88526.0	8060.0	POLYGON ((-74.00828 40.75027, -74.00783 40.751...
1	86000US10002	10002	35859.0	2149.0	POLYGON ((-73.99750 40.71407, -73.99709 40.714...
2	86000US10003	10003	112131.0	13190.0	POLYGON ((-73.99937 40.73132, -73.99911 40.731...
3	86000US10004	10004	157645.0	17195.0	MULTIPOLYGON (((-73.99814 40.70152, -73.99617 ...
4	86000US10005	10005	173333.0	33390.0	POLYGON ((-74.01251 40.70677, -74.01195 40.707...

	MODIFIED_ZCTA	NEIGHBORHOOD_NAME	BOROUGH_GROUP	COVID_CASE_COUNT	COVID_CASE_RATE	POP_DENOMINATOR	COVID_DEATH_COUNT	COVID_DEATH_RATE	PERCENT_POSITIVE	TOTAL_COVID_TESTS
0	10001	Chelsea/NoMad/West Chelsea	Manhattan	440	1867.33	23563.03	26	110.34	7.18	6124
1	10002	Chinatown/Lower East Side	Manhattan	1318	1717.14	76755.41	161	209.76	8.33	15831
2	10003	East Village/Gramercy/Greenwich Village	Manhattan	534	992.54	53801.62	35	65.05	3.91	13671
3	10004	Financial District	Manhattan	40	1095.71	3650.61	1	27.39	5.33	751
4	10005	Financial District	Manhattan	94	1119.57	8396.11	2	23.82	4.91	1914

	geometry	label	modzcta	pop_est	zcta	BOROUGH_GROUP	COVID_CASE_COUNT	COVID_DEATH_COUNT	NEIGHBORHOOD_NAME
0	MULTIPOLYGON (((-73.98774 40.74407, -73.98819 ...	10001, 10118	10001	23072	10001, 10119, 10199	Manhattan	440.0	26.0	Chelsea/NoMad/West Chelsea
1	MULTIPOLYGON (((-73.99750 40.71407, -73.99709 ...	10002	10002	74993	10002	Manhattan	1318.0	161.0	Chinatown/Lower East Side
2	MULTIPOLYGON (((-73.98864 40.72293, -73.98876 ...	10003	10003	54682	10003	Manhattan	534.0	35.0	East Village/Gramercy/Greenwich Village
3	MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ...	10004	10004	3028	10004	Manhattan	40.0	1.0	Financial District
4	MULTIPOLYGON (((-74.00783 40.70309, -74.00786 ...	10005	10005	8831	10005, 10271	Manhattan	94.0	2.0	Financial District

	geometry	label	modzcta	pop_est	zcta	BOROUGH_GROUP	COVID_CASE_COUNT	COVID_DEATH_COUNT	NEIGHBORHOOD_NAME	geoid	median_income	case_per_people
0	MULTIPOLYGON (((-73.98774 40.74407, -73.98819 ...	10001, 10118	10001	23072	10001, 10119, 10199	Manhattan	440.0	26.0	Chelsea/NoMad/West Chelsea	86000US10001	88526.0	0.019071
1	MULTIPOLYGON (((-73.99750 40.71407, -73.99709 ...	10002	10002	74993	10002	Manhattan	1318.0	161.0	Chinatown/Lower East Side	86000US10002	35859.0	0.017575
2	MULTIPOLYGON (((-73.98864 40.72293, -73.98876 ...	10003	10003	54682	10003	Manhattan	534.0	35.0	East Village/Gramercy/Greenwich Village	86000US10003	112131.0	0.009766
3	MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ...	10004	10004	3028	10004	Manhattan	40.0	1.0	Financial District	86000US10004	157645.0	0.013210
4	MULTIPOLYGON (((-74.00783 40.70309, -74.00786 ...	10005	10005	8831	10005, 10271	Manhattan	94.0	2.0	Financial District	86000US10005	173333.0	0.010644

	geoid	name	Total	Total, Error	Family Households	Family Households, Error	2-person household	B11016003, Error	3-person household	B11016004, Error	...	3-person nonfamily household	B11016012, Error	4-person nonfamily household	B11016013, Error	5-person nonfamily household	B11016014, Error	B11016015, Error	B11016016, Error
0	86000US10001	10001	12431	521	3838	413	2330	338	819	295	...	253	97	63	55	0	22	22	22
1	86000US10002	10002	33540	614	16565	738	8090	620	4287	432	...	621	157	84	87	0	28	28	28
2	86000US10003	10003	26124	703	7946	551	5182	472	1413	316	...	586	212	94	70	38	38	28	28
3	86000US10004	10004	1659	238	709	183	410	160	196	106	...	42	31	0	12	0	12	12	12
4	86000US10005	10005	4374	337	1614	316	944	260	358	182	...	199	96	15	25	0	17	17	17

	one_person	two_person	three_person	four_person	five_person	six_person	seven_or_more_person	modzcta	total
0	6710	3897	1072	429	307	16	0	10001	12431
1	14319	10041	4908	2796	1162	95	219	10002	33540
2	14377	8265	1999	1367	109	7	0	10003	26124
3	794	524	238	95	8	0	0	10004	1659
4	1674	1816	557	314	13	0	0	10005	4374

	BOROUGH_GROUP	geometry	label	modzcta	pop_est	zcta	COVID_CASE_COUNT	COVID_DEATH_COUNT	NEIGHBORHOOD_NAME	geoid	median_income	case_per_people	share_of_population	lon	lat
0	Bronx	MULTIPOLYGON (((-73.84290 40.82857, -73.84287 ...	10451	10451	47798	10451	1737.0	152.0	Concourse/Melrose	86000US10451	28921.0	0.036340	more than 1/30	-73.866550	40.853688
1	Brooklyn	POLYGON ((-73.93190 40.59469, -73.93193 40.594...	11201	11201	62823	11201	818.0	95.0	Brooklyn Heights/DUMBO/Downtown Brooklyn	86000US11201	124996.0	0.013021	less than 1/50	-73.949561	40.645380
2	Manhattan	MULTIPOLYGON (((-74.00760 40.70299, -74.00765 ...	10001, 10118	10001	23072	10001, 10119, 10199	440.0	26.0	Chelsea/NoMad/West Chelsea	86000US10001	88526.0	0.019071	less than 1/50	-73.966228	40.778413
3	Queens	MULTIPOLYGON (((-73.86496 40.56663, -73.86509 ...	11004, 11005	11004	19216	11004, 11005, 11040	633.0	68.0	Bellerose/Douglaston-Little Neck	86000US11004	92657.0	0.032941	1/40 to 1/30	-73.819049	40.710385
4	Staten Island	POLYGON ((-74.20999 40.51055, -74.21013 40.510...	10301	10301	38733	10301	1314.0	107.0	Silver Lake/St. George	86000US10301	55802.0	0.033925	more than 1/30	-74.153793	40.581313