ā¹ļø Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.2 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://snakebear.science/09-DataVisualization/index.html |
| Last Crawled | 2026-04-01 18:33:41 (5 days ago) |
| First Indexed | 2021-03-31 23:58:03 (5 years ago) |
| HTTP Status Code | 200 |
| Meta Title | 9. Visualization with Seaborn - The Python and Pandas Field Guide |
| Meta Description | null |
| Meta Canonical | null |
| Boilerpipe Text | Note
This is a static copy of a Jupyter notebook. You can access a live version allowing you to modify and execute the code using Binder.
9.1. Introduction
There are a variety of data visualization libraries available in Python. These libraries have a lot in common, but they take different approaches and offer somewhat different visualization tools.
A library called Matplotlib was developed in 2002. Matplotlib was designed to work with NumPy and SciPy. Matplotlib underlies many Python visualization packages, including the one we will learn, called Seaborn. Seaborn is an advanced library that is both powerful and straightforward, which makes it a good place to start. However, if you continue to work with these kinds of tools, consider learning Matplotlib as well.
By convention, Seaborn is imported and given the abbreviation sns. When you see a call that is prefaced with sns, such as sns.lineplot(), you are using the Seaborn library.
Here are some image galleries you can take a quick look at to get a sense of what these packages can do:
- Seaborn gallery
- Matplotlib gallery
- Plotly gallery
We will only be scratching the surface of Seaborn, but once you get used
to the basics you should be able to start learning the rest for
yourself.
We will be using calls to Matplotlib to tweak and display the plotting object we have built with Seaborn. In particular, we will be using a module within the library called matplotlib.pyplot. By convention, matplotlib.pyplot is imported as plt. When you see a function call prefaced with plt, such as plt.show(), you are using the Matplotlib library.
9.1.1. Preliminaries: Imports and Dataframe Creation
We are going to be using the CORGIS state crime data set. Each row in
the dataset represents one year and one U.S. state.
# imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# ignore this code entirely
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
# read the data and create the dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme1 = 'data_corgis_state_crime.csv'
df_original = pd.read_csv(urlg + repo + fnme1)
df = df_original[df_original['State'] != 'United States']
# read the data and create a supplemental dataframe
fnme2 = 'data_state_info.csv'
df_stateinfo = pd.read_csv(urlg + repo + fnme2)
# remove totals and just keep crime rates
column_mask = ~df.columns.str.contains('Totals')
df = df[df.columns[column_mask]]
# drop rows with empty values
df = df.dropna(axis=0, how='any')
# create a decade column
df.loc[:, 'Decade'] = (df.loc[:, 'Year']//10 * 10).astype(int)
# merge state crime data with supplemental state classification data
df_stateinfo = df_stateinfo.reset_index()
df = pd.merge(df, df_stateinfo)
# normalize population for better plotting
df.loc[:, 'Population_Mil'] = df.loc[:, 'Population']/1000000
# check dataframe
df.head(3)
| | Population | Rates.Property.All | Rates.Property.Burglary | Rates.Property.Larceny | Rates.Property.Motor | Rates.Violent.All | Rates.Violent.Assault | Rates.Violent.Murder | Rates.Violent.Rape | Rates.Violent.Robbery | State | Year | Decade | index | State Code | Region | Division | Population_Mil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3266740 | 1035.4 | 355.9 | 592.1 | 87.3 | 186.6 | 138.1 | 12.4 | 8.6 | 27.5 | Alabama | 1960 | 1960 | 1 | AL | South | East South Central | 3.26674 |
| 1 | 3302000 | 985.5 | 339.3 | 569.4 | 76.8 | 168.5 | 128.9 | 12.9 | 7.6 | 19.1 | Alabama | 1961 | 1960 | 1 | AL | South | East South Central | 3.30200 |
| 2 | 3358000 | 1067.0 | 349.1 | 634.5 | 83.4 | 157.3 | 119.0 | 9.4 | 6.5 | 22.5 | Alabama | 1962 | 1960 | 1 | AL | South | East South Central | 3.35800 |
# abbreviated descriptives
df.describe().T[['mean', 'min', 'max']].round(1)
| | mean | min | max |
|---|---|---|---|
| Population | 4751877.0 | 226167.0 | 38041430.0 |
| Rates.Property.All | 3683.2 | 573.1 | 9512.1 |
| Rates.Property.Burglary | 929.5 | 182.6 | 2906.7 |
| Rates.Property.Larceny | 2395.2 | 293.3 | 5833.8 |
| Rates.Property.Motor | 358.5 | 48.3 | 1839.9 |
| Rates.Violent.All | 398.9 | 9.5 | 2921.8 |
| Rates.Violent.Assault | 235.6 | 3.6 | 1557.6 |
| Rates.Violent.Murder | 6.7 | 0.2 | 80.6 |
| Rates.Violent.Rape | 28.3 | 0.8 | 102.2 |
| Rates.Violent.Robbery | 128.2 | 1.9 | 1635.1 |
| Year | 1986.0 | 1960.0 | 2012.0 |
| Decade | 1981.7 | 1960.0 | 2010.0 |
| index | 25.0 | 0.0 | 50.0 |
| Population_Mil | 4.8 | 0.2 | 38.0 |
9.2. Basic Figure Creation with Seaborn
To create a figure or graph we are typically going to:
- call some specific function within Seaborn, such as sns.lineplot(), to create a plotting object
- include arguments in the call that specify the data to plot and any options that we need
- write additional code to make tweaks to the object, such as adding a title, legend, or axis labels
- call plt.show() to display the object
Displaying a plot will typically involve a few lines of code. The initial call will create an object that we can't see yet, but that knows what type of plot it is and what data it holds. We then add to the object and tweak things. Finally, when we want to see what we've built, we can tell the object to show itself.
This is a slightly different way of interacting with objects than you might be familiar with. It takes some getting used to. The thing to remember is that when you call the initial function you are creating an object. You can then do things with that object, such as put it in a list, assign it to a variable, or display it.
Note: Usually plot objects don't display themselves until you ask, but Colab (and other forms of Jupyter notebooks) try to be helpful and may display your objects even when you haven't asked. You might be tempted to rely on that convenience, but I'd like you not to get into that habit, as it can cause quite a bit of confusion later down the road.
9.2.1. Seaborn Patterns
Let's look at some code to get a sense of the overall pattern for making plots with Seaborn. Note: Remember that sns is the alias for the Seaborn library.
# set figure size
sns.set(rc={'figure.figsize': (10, 8)})
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50]  # slowest zombie spread of all time
# plot starts here
sns.lineplot(x=yearlist, y=zombiecount)  # called a line plot, lists as x and y
plt.show()  # displays the plot
Let's add a few more components to our figure.
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50]  # slowest zombie spread of all time
# plot starts here
sns.lineplot(x=yearlist, y=zombiecount)  # called a line plot, lists as x and y
plt.title('Slow Zombie Apocalypse Chart')  # added a title
plt.xlabel('Year')  # added an x-axis label
plt.ylabel('Zombies')  # added a y-axis label
plt.show()  # displays the plot
Here we will plot two sets of data on the same figure. Notice we are
calling
sns.lineplot()
twice, but we are not getting back two plots.
Instead, the second call adds to the existing object.
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50]  # slowest zombie spread of all time
bunnycount = [45, 35, 25, 15, 10, 5, 1]  # bunnies have a rough time
# plot starts here
sns.lineplot(x=yearlist, y=zombiecount, label='zombies')  # called a line plot, lists as x and y
sns.lineplot(x=yearlist, y=bunnycount, label='bunnies')  # add a line to the existing plot
plt.show()  # displays the plot
Again, here is the general pattern:
- Have the data in some form
- Call a Seaborn function to plot the data, supplying the necessary arguments
- Tweak or add to the plot with calls to plt (which is an alias for matplotlib.pyplot, a module of the separate Matplotlib library)
- Display the result
Make sure you can recognize how the code above accomplishes each step.
9.2.2. Seaborn Patterns with Pandas
If you have your data in a dataframe, you can hand specific series from
your dataframe to Seaborn for plotting.
For example:
# create a zombie bunny dataframe
# don't worry about how this works
dfzb = pd.DataFrame(zip(yearlist, zombiecount, bunnycount))
dfzb.columns = ['year', 'zombies', 'bunnies']
# plot the data
sns.lineplot(x='year', y='zombies', data=dfzb)  # specifies series and dataframe
plt.show()
Notice that Seaborn has taken the series names and used them as axis labels. We can override this with a subsequent call to plt.
sns.lineplot(x='year', y='zombies', data=dfzb)  # specify series and dataframe
plt.ylabel('Zombies!!!!')
plt.show()
We have two options for supplying our data to Seaborn. We can feed it the data directly, for example by passing it a series, or we can use data= to specify a dataframe, in which case Seaborn looks for x and y among the column names. Students commonly blend the two approaches and have difficulty figuring out why it fails. If you are trying to plot a series from a dataframe by its column name, always check that you've specified a data source.
9.3. Line Plots
Up until now we have been making line plots where there was only one y value for each x value in the line. For example, in 2030 there was 1 bunny left. However, often when we make a line plot the line actually represents a summary of lots of data points. For example, let's say we were plotting weekly Broadway revenue. A line plot would then produce a single point for each week representing the average revenue of lots of Broadway productions. In a sense, Seaborn is acting a lot like the pandas groupby() method: it automatically aggregates over groups of data to produce the plots. If we hand Seaborn data where there are many values for each point we want to plot, and ask it to make a line plot, it will assume we want the mean plotted, and automatically handle aggregating, calculating the mean, and then plotting the line.
In addition, Seaborn will add a band of color around the line. Let's see what that looks like:
sns.lineplot(x="Decade", y="Rates.Violent.All", data=df)
plt.show()  # still need this to show the plot
In this dataset we have data from different states for each decade. Seaborn represents the mean value with the darker blue line. The lighter blue band around it shows a statistic called the 95% confidence interval. The confidence interval depends on both the sample size and the variance of the data itself. We will talk more about this later. If you are interested, read more about confidence intervals here.
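To make the band less mysterious, here is a rough sketch of how a mean and a 95% interval can be computed for one group of values. By default Seaborn's line plot estimates its interval by bootstrapping (resampling the data with replacement); the function `bootstrap_ci` and the sample values below are illustrative assumptions written with the standard library only, not Seaborn's own code.

```python
import random
import statistics

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Return (low, high) percentile bounds for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # resample with replacement n_boot times, recording the mean each time
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (100 - level) / 2 / 100  # 0.025 on each side for a 95% interval
    low = boot_means[int(tail * n_boot)]
    high = boot_means[int((1 - tail) * n_boot) - 1]
    return low, high

# hypothetical violent-crime rates for a single decade, one value per state-year
rates = [186.6, 168.5, 157.3, 434.2, 492.7, 376.1, 642.1, 299.0]

mean_rate = statistics.fmean(rates)    # what the dark line would show
ci_low, ci_high = bootstrap_ci(rates)  # what the lighter band would span
print(round(mean_rate, 1), round(ci_low, 1), round(ci_high, 1))
```

With more values per group, or values that vary less, the resampled means cluster more tightly and the band narrows.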
We can add a second line to the plot by adding a second function call to our code.
Notice that the data in the plot above is the same as the data in the
lower line in the plot below. If I showed you the one above and asked
you to come to a conclusion about violent crime rates you would likely
come to a different conclusion than if I showed you the one below. The
only difference between the two is the scaling of the y-axis. Property
crime is much more common, so Seaborn changed the y-axis to accommodate
the second set of data.
Also notice that the confidence intervals are still there but may be
difficult to see due to the scaling and the size of the figure display.
sns.lineplot(x="Decade", y="Rates.Violent.All", data=df)
sns.lineplot(x="Decade", y="Rates.Property.All", data=df)
plt.show()
9.4. Categorical Plots
9.4.1. Bar Charts
We can use sns.barplot() to create bar charts.
By default, Seaborn represents the mean of the data as the height of the
bar and represents the dispersion of the data with a small grey line
that crosses through the top of the bar. The top and bottom of that line
represent the 95% confidence interval.
# create a view into the dataframe restricted to States that start with 'I'
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x="State", y="Rates.Violent.All", data=dfi)
plt.show()
To get a horizontal bar chart we just swap the variables assigned to the two axes.
sns.barplot(y="State", x="Rates.Violent.All", data=dfi)
plt.show()
9.4.2. Grouped Bar Charts
By adding the hue argument we can create grouped bar charts.
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x="State", y="Rates.Violent.All", hue='Decade', data=dfi)
plt.show()

dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x="Decade", y="Rates.Violent.All", hue='State', data=dfi)
plt.show()
9.4.3. Count Plots
We can make plots of the frequency of categorical data using sns.countplot().
Note that we are only supplying an argument for the category we want to be counted. Seaborn handles actually doing the count.
# countplot
sns.countplot(x="Region", data=df_stateinfo)
plt.show()
We can make a horizontal version of the frequency count by specifying the categorical data using the argument y rather than x.
# horizontal countplot
sns.countplot(y="Division", data=df_stateinfo)
plt.show()
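The counting that a count plot does for us can be sketched by hand. The example below uses only the standard library and a small made-up list of region labels (an assumption for illustration, not the real state data); each tally corresponds to the length of one bar.

```python
from collections import Counter

# hypothetical region labels, one per state (a small made-up sample)
regions = ["South", "West", "West", "South", "Northeast", "Midwest", "South"]

# tally how many times each category appears
region_counts = Counter(regions)

# print the categories from most to least frequent
for region, count in region_counts.most_common():
    print(f"{region:<10} {count}")
```

This is why only one of x or y is supplied to the plotting call: the other axis is always the tally.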
9.5. Histograms
You can use Seaborn's sns.histplot() function to create a histogram, which provides frequency counts for continuous data. The bins= argument specifies the number of bins in the histogram.
Note: In the plot below, the y-axis represents the count of values falling within each bin.
# view of states with lower population
df_small = df.loc[df.loc[:, 'Population_Mil'] < 20]
# create histogram
sns.histplot(x='Population_Mil', data=df_small, bins=15)
plt.show()
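To see what "falling within each bin" means, here is a small hand-rolled sketch of equal-width binning. The helper `equal_width_counts` and the population values are assumptions for illustration; Seaborn relies on NumPy for its binning, and this standard-library version only mirrors the idea. It assumes the values are not all identical.

```python
def equal_width_counts(values, bins):
    """Split the range of `values` into `bins` equal-width bins and count each."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # the maximum value belongs to the last bin rather than a new one
        idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

# hypothetical state populations in millions
populations_mil = [0.4, 0.5, 0.7, 1.0, 1.2, 2.3, 3.3, 4.0, 5.6, 9.3, 11.6, 18.4]

edges, counts = equal_width_counts(populations_mil, bins=3)
print(counts)
```

Fewer bins give a coarser picture and more bins a noisier one, which is why the bins= argument is worth experimenting with.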
We can overlay two distributions on the same figure to compare their frequencies.
# create overlapping histogram with two calls
sns.histplot(x='Rates.Property.Motor', data=df_small, bins=15)
sns.histplot(x='Rates.Property.Burglary', data=df_small, bins=15, color='purple')
plt.show()
9.6. Box Plots
Another way of showing both central tendency and dispersion is the boxplot. In Seaborn, we can create boxplots by calling sns.boxplot().
# create boxplot
sns.boxplot(x="State", y="Rates.Violent.All", data=dfi)
plt.show()

# create horizontal boxplot
sns.boxplot(y="State", x="Rates.Violent.All", data=dfi)
plt.show()

# create masks
smalli_mask = dfi.loc[:, 'State'].isin(['Idaho', 'Iowa'])
recent_mask = dfi.loc[:, 'Decade'] > 1990
# create boxplots
sns.boxplot(x="State", y="Rates.Violent.All", hue='Decade', data=dfi.loc[smalli_mask & recent_mask])
plt.show()
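The pieces of a boxplot can be computed by hand to see what each mark stands for. The sketch below is an illustration under stated assumptions: the helper `box_stats` and the sample rates are made up, and it follows the common convention (also the Matplotlib default) of whiskers reaching to the most extreme points within 1.5 times the interquartile range of the box.

```python
import statistics

def box_stats(values):
    """Compute the quartiles, whisker ends, and outliers a boxplot would draw."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# hypothetical violent-crime rates for one state across several years
rates = [150, 160, 170, 180, 190, 200, 210, 400]

stats = box_stats(rates)
print(stats)
```

Here the value 400 lies beyond the upper fence, so it would be drawn as a separate point rather than stretching the whisker.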
9.7. Scatter Plots
Seaborn's sns.scatterplot() function can be used to make scatter plots of data.
df.loc[:, ["Population", "Population_Mil", "Rates.Violent.All", "State"]].groupby(by='State').mean()
| State | Population | Population_Mil | Rates.Violent.All |
|---|---|---|---|
| Alabama | 4.011637e+06 | 4.011637 | 434.183019 |
| Alaska | 4.858700e+05 | 0.485870 | 492.749057 |
| Arizona | 3.553305e+06 | 3.553305 | 487.341509 |
| Arkansas | 2.347026e+06 | 2.347026 | 376.058491 |
| California | 2.715494e+07 | 27.154940 | 642.071698 |
| Colorado | 3.298696e+06 | 3.298696 | 383.209434 |
| Connecticut | 3.183763e+06 | 3.183763 | 299.032075 |
| Delaware | 6.626316e+05 | 0.662632 | 478.762264 |
| District of Columbia | 6.524541e+05 | 0.652454 | 1652.773585 |
| Florida | 1.171569e+07 | 11.715686 | 737.747170 |
| Georgia | 6.455632e+06 | 6.455632 | 456.471698 |
| Hawaii | 1.027689e+06 | 1.027689 | 213.001887 |
| Idaho | 1.041935e+06 | 1.041935 | 206.798113 |
| Illinois | 1.162432e+07 | 11.624323 | 633.822642 |
| Indiana | 5.594049e+06 | 5.594049 | 318.428302 |
| Iowa | 2.873040e+06 | 2.873040 | 197.867925 |
| Kansas | 2.479876e+06 | 2.479876 | 319.211321 |
| Kentucky | 3.686424e+06 | 3.686424 | 264.737736 |
| Louisiana | 4.122317e+06 | 4.122317 | 594.069811 |
| Maine | 1.159225e+06 | 1.159225 | 122.618868 |
| Maryland | 4.572824e+06 | 4.572824 | 677.616981 |
| Massachusetts | 5.924470e+06 | 5.924470 | 444.211321 |
| Michigan | 9.263410e+06 | 9.263410 | 567.367925 |
| Minnesota | 4.332084e+06 | 4.332084 | 225.577358 |
| Mississippi | 2.579543e+06 | 2.579543 | 291.556604 |
| Missouri | 5.122876e+06 | 5.122876 | 478.056604 |
| Montana | 8.173031e+05 | 0.817303 | 194.183019 |
| Nebraska | 1.607538e+06 | 1.607538 | 243.671698 |
| Nevada | 1.244390e+06 | 1.244390 | 601.279245 |
| New Hampshire | 1.001637e+06 | 1.001637 | 112.858491 |
| New Jersey | 7.686547e+06 | 7.686547 | 404.007547 |
| New Mexico | 1.462050e+06 | 1.462050 | 580.150943 |
| New York | 1.837103e+07 | 18.371025 | 731.075000 |
| North Carolina | 6.635885e+06 | 6.635885 | 442.022642 |
| North Dakota | 6.470186e+05 | 0.647019 | 81.926415 |
| Ohio | 1.089756e+07 | 10.897562 | 345.690566 |
| Oklahoma | 3.086300e+06 | 3.086300 | 395.273585 |
| Oregon | 2.787653e+06 | 2.787653 | 353.279245 |
| Pennsylvania | 1.198127e+07 | 11.981271 | 326.607547 |
| Rhode Island | 9.775871e+05 | 0.977587 | 275.481132 |
| South Carolina | 3.390524e+06 | 3.390524 | 617.522642 |
| South Dakota | 7.213471e+05 | 0.721347 | 152.688679 |
| Tennessee | 4.864989e+06 | 4.864989 | 497.443396 |
| Texas | 1.653946e+07 | 16.539457 | 479.654717 |
| Utah | 1.709371e+06 | 1.709371 | 220.392453 |
| Vermont | 5.278923e+05 | 0.527892 | 104.839623 |
| Virginia | 5.939604e+06 | 5.939604 | 289.549057 |
| Washington | 4.660820e+06 | 4.660820 | 338.537736 |
| West Virginia | 1.834482e+06 | 1.834482 | 184.901887 |
| Wisconsin | 4.867310e+06 | 4.867310 | 183.911321 |
| Wyoming | 4.452619e+05 | 0.445262 | 221.607547 |
# group by state, with mean aggregation
pop_violence_list = ["State", "Population", "Population_Mil", "Rates.Violent.All", "Rates.Property.All"]
df_grouped_state_mean = df.loc[:, pop_violence_list].groupby(by='State').mean()
# create list of the largest population states
big_states = df_grouped_state_mean\
    .loc[:, 'Population']\
    .sort_values()[-21:]\
    .index.to_list()
# use big state list to index df_grouped
dfbig = df_grouped_state_mean\
    .loc[df_grouped_state_mean.index.isin(big_states)]
# create scatterplot
sns.scatterplot(x="Population_Mil", y="Rates.Violent.All", data=dfbig)
plt.show()
Among the twenty-one most populous states (the slice above keeps the last 21 entries), there does not appear to be a simple linear association between population and rates of violent crime. It would be nice to know what state each point represents. Let's try adding the hue argument to see how that goes.
# create scatterplot with labelled data points
sns.scatterplot(x='Population_Mil', y='Rates.Violent.All', hue=dfbig.index, data=dfbig)
plt.show()
That's an all new kind of terrible chart that we've made. The states appear in the legend in alphabetical order, but there are so many that it's impossible to tell one state's shade from another. Let's try it again with a smaller selection of states.
# create a shorter list of large population states
# note: the slice [-8:-1] keeps ranks 2 through 8 by population and skips the largest state
big_six = df_grouped_state_mean\
    .loc[:, 'Population']\
    .sort_values()[-8:-1]\
    .index\
    .to_list()
# use big state list to index df_grouped
dfbig6 = df_grouped_state_mean\
    .loc[df_grouped_state_mean.index.isin(big_six)]
# create plot
sns.scatterplot(x='Population_Mil', y='Rates.Violent.All', hue=dfbig6.index, data=dfbig6)
plt.show()
An improvement, but still ugly. We can use a matplotlib method to
control where the legend is displayed.
# create plot
sns.scatterplot(x='Population_Mil', y='Rates.Violent.All', hue=dfbig6.index, data=dfbig6)
# use a matplotlib method to control the legend display
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
That's a bit better. Now that the plot is cleaned up, let's look at what it says about our states.
Unfortunately, Florida shows that smaller populations (relative to NY and TX) can have violent crime rates that match or exceed bigger states.
We can examine the same association by region.
# add the state supplemental information into the grouped state mean df
df_grouped_state_mean = pd.merge(df_grouped_state_mean.reset_index(), df_stateinfo)
# plot the association with Region specified as the hue
sns.scatterplot(x='Population_Mil', y='Rates.Violent.All', hue='Region', data=df_grouped_state_mean)
plt.show()
That's not a particularly good visualization. Too many brightly colored dots all next to one another.
Let's try breaking up the plots by region.
for region in df_grouped_state_mean.loc[:, 'Region'].unique():
    sns.scatterplot(x='Population_Mil',
                    y='Rates.Violent.All',
                    data=df_grouped_state_mean\
                    .loc[df_grouped_state_mean['Region'] == region])
    plt.title(region)
    plt.show()
So there does appear to be a positive association between population size and rates of violence in the Northeast and Midwest.
The plots for the South and the West both have outliers (data points that appear to be quite distant from the others). In the South, we have a relatively small state with a very high rate of violent crime. (It's not actually a state, despite having a population larger than several states, but we will come back to that.) In the West, we have a very large state (California), with a rate of crime very similar to much less populous states.
9.7.1. Scatter Plots with Regression Lines
Regression is a technique for fitting a model to your data that can help you better understand how two variables in your data are related. It can also be used to make predictions. Think of it as moving one step beyond just looking at correlations.
"Simple linear regression" is used to fit a straight line through the data. You use this technique when you think two variables have a simple linear (not curved) relation with one another.
Seaborn includes visualizations of regression lines. Typically, these show a basic scatter plot with the regression line added. Seaborn's regression visualization also includes a band around the line indicating the confidence interval.
Note: If you are looking for formal regression statistics for your model, you should be aware that Seaborn's documentation states that its regression functionality is only intended to help with visualization and that formal regression statistics should be created with a different package, such as SciPy.
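To make "fitting a straight line" concrete, here is a minimal sketch of simple linear regression by ordinary least squares, written with plain Python. The function `fit_line` and the four data points are assumptions for illustration (the points were chosen to lie exactly on a line); this is not how Seaborn computes its fit internally, only the textbook formula for the same idea.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # how x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# hypothetical rates: property crime is exactly 10 * violent crime + 500
violent = [100.0, 200.0, 300.0, 400.0]
prop = [1500.0, 2500.0, 3500.0, 4500.0]

slope, intercept = fit_line(violent, prop)

# use the fitted line to predict property crime at a violent-crime rate of 250
predicted = slope * 250.0 + intercept
print(slope, intercept, predicted)
```

Real data never falls exactly on the line, which is where the scatter around the line and the confidence band come from.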
Let's use regression visualizations to look at the following question: In large population states, is higher property crime associated with higher violent crime rates?
# scatter plot with regression line
sns.lmplot(x='Rates.Violent.All', y='Rates.Property.All', data=dfbig)
plt.show()
How about overall?
# scatterplot with regression line
sns.lmplot(x='Rates.Violent.All', y='Rates.Property.All', data=df_grouped_state_mean)
plt.show()
What about the ten states with the highest levels of violence?
# create list of the highest violence rows by sorting, grabbing the index labels,
# and then keeping the last ten (after the merge above, the index holds row labels, not state names)
highest_violence = df_grouped_state_mean\
    .loc[:, 'Rates.Violent.All']\
    .sort_values()\
    .index\
    .to_list()[-10:]
# use high violence list to index df_grouped using .isin()
dfhighv = df_grouped_state_mean\
    .loc[df_grouped_state_mean\
    .index\
    .isin(highest_violence)]
# plot
sns.lmplot(x="Population_Mil", y="Rates.Violent.All", data=dfhighv)
plt.show()
These visualizations show a single outlier (notice the data point in the top left of the graph) that needs to be explored. We will come back to that soon.
Let's look at the association between property crime and population by region. Some Seaborn plotting functions, but not all, take an argument, col=, that allows us to create groups of plots. Earlier we did this with a loop. Using col= together with col_wrap=, which specifies how many plots you want per row, makes things a bit easier for us than writing a loop.
# scatter plots with regression lines in columns and rows
sns.lmplot(y='Population_Mil',
           x='Rates.Property.All',
           col='Region',  # creates a separate plot for each region
           col_wrap=2,  # wraps columns so not all side by side
           data=df_grouped_state_mean)
plt.show()
The columns and rows of plots look nice, but the plots themselves are a bit ugly. Here, the scaling is consistent across the subplots, but most regions don't have data points across the full range of values, so the result is short, stunted-looking regression lines. Although we will lose the advantage of having common scaling between graphs, we can go back to our loop to make individual plots so Seaborn will scale each plot individually:
# scatter plots with loop
for region in df['Region'].unique():  # loop through the regions
    # create a subset of the data with just one region at a time
    regiondata = df_grouped_state_mean\
        .loc[df_grouped_state_mean['Region'] == region]
    # make the plot
    sns.lmplot(y='Population_Mil',
               x='Rates.Property.All',
               data=regiondata)
    plt.title(region)
    plt.show()
That's a bit better. Now let's clean it up and provide labels and such.
# same as above, but with title and labels
for region in df['Region'].unique():
    regiondata = df_grouped_state_mean\
        .loc[df_grouped_state_mean['Region'] == region]
    sns.lmplot(y='Population_Mil',
               x='Rates.Property.All',
               data=regiondata)
    plt.ylabel('Population in Millions')
    plt.xlabel('Rate of Property Crime')
    plt.title(f'Region = {region}')
    plt.show()
This second approach makes for better individual charts to look at the regression lines, but the difference in scaling between them might create problems for folks who don't pay close attention to these things (which is most of us, most of the time). So there is a trade-off for plotting things this way. Be careful with these sorts of scaling issues and consider your audience when making your visualizations.
We will explore the state crime data a bit more in an upcoming exercise. If you are interested in more recent data about overall trends in the U.S., take a look at this analysis by the Pew Research Center. |
| Markdown |
Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using [Binder](https://mybinder.org/v2/gh/CS-DS-125/125book/master?urlpath=lab/tree/source/09-DataVisualization/notebooks/09_1_Visualization_Introduction_with_Seaborn.ipynb).
# 9\. Visualization with Seaborn
## 9\.1. Introduction
There are a variety of data visualization libraries available in Python. There is a lot of commonality in these libraries, but they do take different approaches and offer somewhat different visualization tools.
A library called [Matplotlib](https://en.wikipedia.org/wiki/Matplotlib) was developed in 2002. Matplotlib was designed to work with NumPy and [SciPy](https://en.wikipedia.org/wiki/SciPy). Matplotlib underlies many Python visualization packages, including the one we will learn, called [Seaborn](https://seaborn.pydata.org/). Seaborn is an advanced library that is both powerful and straightforward, which makes it a good place to start. However, if you continue to work with these kinds of tools, consider learning Matplotlib as well.
By convention, Seaborn is imported and given the abbreviation `sns`. When you see a call that is prefaced with `sns`, such as `sns.lineplot()`, you are using the Seaborn library.
Here are some image galleries you can take a quick look at to get a sense of what these packages can do:
- [Seaborn gallery](https://seaborn.pydata.org/examples/index.html)
- [Matplotlib gallery](https://matplotlib.org/gallery.html)
- [Plotly gallery](https://plotly.com/python/)
We will only be scratching the surface of Seaborn, but once you get used to the basics you should be able to start learning the rest for yourself.
We will be using calls to Matplotlib to tweak and display the plotting object we have built with Seaborn. In particular, we will be using a module within the library called `matplotlib.pyplot`. By convention, `matplotlib.pyplot` is imported as `plt`. When you see a function call prefaced with `plt`, such as `plt.show()`, you are using the Matplotlib library.
### 9\.1.1. Preliminaries: Imports and Dataframe Creation
We are going to be using the CORGIS state crime data set. Each row in the dataset represents one year and one U.S. state.
```
# imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# ignore this code entirely
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
# read the data and create the dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme1 = 'data_corgis_state_crime.csv'
df_original = pd.read_csv(urlg + repo + fnme1)
df = df_original[df_original['State'] != 'United States']
# read the data and create a supplemental dataframe
fnme2 = 'data_state_info.csv'
df_stateinfo = pd.read_csv(urlg + repo + fnme2)
# remove totals and just keep crime rates
column_mask = ~df.columns.str.contains('Totals')
df = df[df.columns[column_mask]]
#drop rows with empty values
df = df.dropna(axis=0, how='any')
# create a decade column
df.loc[:, 'Decade'] = (df.loc[:, 'Year']//10 * 10).astype(int)
# merge state crime data with supplemental state classification data
df_stateinfo = df_stateinfo.reset_index()
df = pd.merge(df, df_stateinfo)
# normalize population for better plotting
df.loc[:, 'Population_Mil'] = df.loc[:, 'Population']/1000000
#check dataframe
df.head(3)
```
| | Population | Rates.Property.All | Rates.Property.Burglary | Rates.Property.Larceny | Rates.Property.Motor | Rates.Violent.All | Rates.Violent.Assault | Rates.Violent.Murder | Rates.Violent.Rape | Rates.Violent.Robbery | State | Year | Decade | index | State Code | Region | Division | Population\_Mil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3266740 | 1035\.4 | 355\.9 | 592\.1 | 87\.3 | 186\.6 | 138\.1 | 12\.4 | 8\.6 | 27\.5 | Alabama | 1960 | 1960 | 1 | AL | South | East South Central | 3\.26674 |
| 1 | 3302000 | 985\.5 | 339\.3 | 569\.4 | 76\.8 | 168\.5 | 128\.9 | 12\.9 | 7\.6 | 19\.1 | Alabama | 1961 | 1960 | 1 | AL | South | East South Central | 3\.30200 |
| 2 | 3358000 | 1067\.0 | 349\.1 | 634\.5 | 83\.4 | 157\.3 | 119\.0 | 9\.4 | 6\.5 | 22\.5 | Alabama | 1962 | 1960 | 1 | AL | South | East South Central | 3\.35800 |
```
# abbreviated descriptives
df.describe().T[['mean', 'min', 'max']].round(1)
```
| | mean | min | max |
|---|---|---|---|
| Population | 4751877\.0 | 226167\.0 | 38041430\.0 |
| Rates.Property.All | 3683\.2 | 573\.1 | 9512\.1 |
| Rates.Property.Burglary | 929\.5 | 182\.6 | 2906\.7 |
| Rates.Property.Larceny | 2395\.2 | 293\.3 | 5833\.8 |
| Rates.Property.Motor | 358\.5 | 48\.3 | 1839\.9 |
| Rates.Violent.All | 398\.9 | 9\.5 | 2921\.8 |
| Rates.Violent.Assault | 235\.6 | 3\.6 | 1557\.6 |
| Rates.Violent.Murder | 6\.7 | 0\.2 | 80\.6 |
| Rates.Violent.Rape | 28\.3 | 0\.8 | 102\.2 |
| Rates.Violent.Robbery | 128\.2 | 1\.9 | 1635\.1 |
| Year | 1986\.0 | 1960\.0 | 2012\.0 |
| Decade | 1981\.7 | 1960\.0 | 2010\.0 |
| index | 25\.0 | 0\.0 | 50\.0 |
| Population\_Mil | 4\.8 | 0\.2 | 38\.0 |
## 9\.2. Basic Figure Creation with Seaborn[¶](https://snakebear.science/09-DataVisualization/index.html#basic-figure-creation-with-seaborn "Permalink to this heading")
To create a figure or graph we are typically going to:
- call some specific function within Seaborn, such as `sns.lineplot()`, to create a plotting object
- include arguments in the call that specify the data to plot and any options that we need
- write additional code to make tweaks to the object, such as adding a title, legend, or axis labels.
- call `plt.show()` to display the object
Displaying a plot will typically involve a few lines of code. The initial call will create an object that we can't see yet, but that knows what type of plot it is and what data it holds. We then add to the object and tweak things. Finally, when we want to see what we've built, we can tell the object to show itself.
This is a slightly different way of interacting with objects than you might be familiar with. It takes some getting used to. The thing to remember is that when you call the initial function you are creating an object. You can then do things with that object, such as put it in a list, assign it to a variable, or display it.
Note: Usually plot objects don't display themselves until you ask, but Colab (and other forms of Jupyter notebooks) try to be helpful and may display your objects even when you haven't asked. You might be tempted to rely on that convenience, but I'd like you not to get into that habit, as it can cause quite a bit of confusion later down the road.
### 9\.2.1. Seaborn Patterns[¶](https://snakebear.science/09-DataVisualization/index.html#seaborn-patterns "Permalink to this heading")
Let's look at some code to get a sense of the overall pattern for making plots with Seaborn. Note: Remember that `sns` is the alias for the Seaborn library.
```
# set figure size
sns.set(rc={'figure.figsize':(10,8)})
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50] # slowest zombie spread of all time
# plot starts here
sns.lineplot(x = yearlist, y = zombiecount) # called a line plot, lists as x and y
plt.show() # displays the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_7_0.svg)
Let's add a few more components to our figure.
```
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50] # slowest zombie spread of all time
# plot starts here
sns.lineplot(x = yearlist, y = zombiecount) # called a line plot, lists as x and y
plt.title('Slow Zombie Apocalypse Chart') # added a title
plt.xlabel('Year') # added an x-axis label
plt.ylabel('Zombies') # added a y-axis label
plt.show() # displays the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_9_0.svg)
Here we will plot two sets of data on the same figure. Notice we are calling `sns.lineplot()` twice, but we are not getting back two plots. Instead, the second call adds to the existing object.
```
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50] # slowest zombie spread of all time
bunnycount = [45, 35, 25, 15, 10, 5, 1] # bunnies have a rough time
# plot starts here
sns.lineplot(x = yearlist,
y = zombiecount,
label = 'zombies') # called a line plot, lists as x and y
sns.lineplot(x = yearlist,
y = bunnycount,
label = 'bunnies') # add a line to the existing plot
plt.show() # displays the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_11_0.svg)
Again, here is the general pattern:
1. Have the data in some form
2. Call a Seaborn function to plot the data, supply necessary arguments
3. Tweak or add to the plot with calls to `plt` (which is an alias for another library called matplotlib)
4. Display the result
Make sure you can recognize how the code above accomplishes each step.
### 9\.2.2. Seaborn Patterns with Pandas[¶](https://snakebear.science/09-DataVisualization/index.html#seaborn-patterns-with-pandas "Permalink to this heading")
If you have your data in a dataframe, you can hand specific series from your dataframe to Seaborn for plotting.
For example:
```
# create a zombie bunny dataframe
# don't worry about how this works
dfzb = pd.DataFrame(zip(yearlist, zombiecount, bunnycount))
dfzb.columns = ['year', 'zombies', 'bunnies']
# plot the data
sns.lineplot(x = 'year', y = 'zombies', data = dfzb) # specifies series and dataframe
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_14_0.svg)
Notice that Seaborn has taken the series names and used them as axis-labels. We can override this with a subsequent call to `plt`.
```
sns.lineplot(x = 'year', y = 'zombies', data = dfzb) # specify series and dataframe
plt.ylabel('Zombies!!!!')
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_16_0.svg)
We have two options for supplying our data to Seaborn. We can feed it the data directly, for example by passing it a series, or we can use `data =` to specify a dataframe, in which case Seaborn will look for x and y among the column names. Students commonly blend the two approaches and have difficulty figuring out why the code fails. If you are plotting a series from a dataframe by name, always check that you've specified a data source.
## 9\.3. Line Plots[¶](https://snakebear.science/09-DataVisualization/index.html#line-plots "Permalink to this heading")
Up until now we have been making lineplots where there was only one y value for each x value in the line. For example, in 2030 there was 1 bunny left. However, often when we make a lineplot the line is actually representing a summary of lots of data points. For example, let's say we were plotting weekly Broadway revenue. A line plot would then produce a single point for each week representing the average revenue of lots of Broadway productions. In a sense, Seaborn is acting a lot like the Pandas `.groupby()` method: it automatically aggregates over groups of data to produce the plots. If we hand Seaborn data where there are many values for each point we want to plot, and ask it to make a line plot, it will assume we want the mean plotted, and automatically handle aggregating, calculating the mean, and then plotting the line.
In addition, Seaborn will add a band of color around the line. Let's see what that looks like:
```
sns.lineplot(x = "Decade",
y = "Rates.Violent.All",
data = df)
plt.show() # still need this to show the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_19_0.svg)
In this dataset we have data from different states for each decade. Seaborn represents the mean value with the darker blue line. The lighter blue band around the line shows you a statistic called the 95% confidence interval for that mean. The confidence interval depends on both the sample size and the variance of the data itself. We will talk more about this later. If you are interested, [read more about confidence intervals here](https://en.wikipedia.org/wiki/Confidence_interval).
We can add a second line to plot by adding a second function call to our code.
Notice that the data in the plot above is the same as the data in the lower line in the plot below. If I showed you the one above and asked you to come to a conclusion about violent crime rates you would likely come to a different conclusion than if I showed you the one below. The only difference between the two is the scaling of the y-axis. Property crime is much more common, so Seaborn changed the y-axis to accommodate the second set of data.
Also notice that the confidence intervals are still there but may be difficult to see due to the scaling and the size of the figure display.
```
sns.lineplot(x = "Decade", y = "Rates.Violent.All", data = df)
sns.lineplot(x = "Decade", y = "Rates.Property.All", data = df)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_22_0.svg)
## 9\.4. Categorical Plots[¶](https://snakebear.science/09-DataVisualization/index.html#categorical-plots "Permalink to this heading")
### 9\.4.1. Bar Charts[¶](https://snakebear.science/09-DataVisualization/index.html#bar-charts "Permalink to this heading")
We can use `sns.barplot()` to create bar charts.
By default, Seaborn represents the mean of the data as the height of the bar and represents the uncertainty in that mean with a small line (an error bar) that crosses through the top of the bar. The top and bottom of that line mark the 95% confidence interval.
```
# create a view into the dataframe restricted to States that start with 'I'
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x = "State",
y = "Rates.Violent.All",
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_25_0.svg)
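As with line plots, the bar heights can be checked against group means computed by hand. The sketch below uses a made-up dataframe (`dfdemo`, with invented rates) and reads the heights back from the rectangles Seaborn drew.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up data, only for illustration
dfdemo = pd.DataFrame({
    'State': ['Idaho', 'Idaho', 'Idaho', 'Iowa', 'Iowa', 'Iowa'],
    'Rate': [100, 200, 300, 150, 250, 350],
})

# group means computed by hand
means = dfdemo.groupby('State')['Rate'].mean()
print(means)

ax = sns.barplot(x='State', y='Rate', data=dfdemo)

# each bar is a rectangle whose height is the group mean
heights = sorted(round(bar.get_height(), 6) for bar in ax.patches)
print(heights)

plt.show()
```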
To get a horizontal bar chart we just flip the variable assigned to the axes.
```
sns.barplot(y = "State",
x = "Rates.Violent.All",
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_27_0.svg)
### 9\.4.2. Grouped Bar Charts[¶](https://snakebear.science/09-DataVisualization/index.html#grouped-bar-charts "Permalink to this heading")
By adding the `hue` argument we can create grouped bar charts.
```
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x = "State",
y = "Rates.Violent.All",
hue= 'Decade',
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_29_0.svg)
```
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x = "Decade",
y = "Rates.Violent.All",
hue= 'State',
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_30_0.svg)
### 9\.4.3. Count Plots[¶](https://snakebear.science/09-DataVisualization/index.html#count-plots "Permalink to this heading")
We can make plots of the frequency of categorical data using `sns.countplot()`.
Note that we are only supplying an argument for the category we want to be counted. Seaborn handles actually doing the count.
```
# countplot
sns.countplot(x = "Region",
data = df_stateinfo)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_32_0.svg)
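The counting that `sns.countplot()` does for us is the same counting Pandas does with `.value_counts()`. Here is a sketch on a tiny made-up dataframe (`dfdemo` is not the state information file).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up data, only for illustration
dfdemo = pd.DataFrame(
    {'Region': ['South', 'West', 'South', 'Midwest', 'South', 'West']})

# counts computed by hand
counts = dfdemo['Region'].value_counts()
print(counts)

# Seaborn does the counting itself when drawing the bars
ax = sns.countplot(x='Region', data=dfdemo)
heights = sorted(int(bar.get_height()) for bar in ax.patches)
print(heights)

plt.show()
```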
We can make a horizontal version of the frequency count by specifying the categorical data using the argument `y` rather than `x`.
```
# horizontal countplot
sns.countplot(y = "Division",
data = df_stateinfo)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_34_0.svg)
## 9\.5. Histograms[¶](https://snakebear.science/09-DataVisualization/index.html#histograms "Permalink to this heading")
You can use Seaborn's `sns.histplot()` function to create a histogram, which provides frequency counts for continuous data. This function uses the argument `bins=` to specify the number of bins in the histogram.
Note: In the plot below, the y-axis represents the count of values falling within each bin.
```
# view of states with lower population
df_small = df.loc[df.loc[:, 'Population_Mil'] < 20]
# create histogram
sns.histplot(x = 'Population_Mil',
data = df_small,
bins = 15)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_36_0.svg)
We can also draw two overlapping distributions on the same figure to compare their frequencies, by calling `sns.histplot()` twice.
```
# create overlapping histogram with two calls
sns.histplot(x = 'Rates.Property.Motor',
data = df_small,
bins = 15)
sns.histplot(x = 'Rates.Property.Burglary',
data = df_small,
bins = 15,
color = 'purple')
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_38_0.svg)
## 9\.6. Box Plots[¶](https://snakebear.science/09-DataVisualization/index.html#box-plots "Permalink to this heading")
Another way of showing both central tendency and dispersion is the boxplot. In Seaborn, we can create boxplots with `sns.boxplot()`. (The same kind of plot is also available through `sns.catplot()` by passing the `kind='box'` argument.)
```
# create boxplot
sns.boxplot(x = "State",
y = "Rates.Violent.All",
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_40_0.svg)
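It can help to know which numbers a boxplot is built from. The sketch below computes them with Pandas for a made-up series. It assumes the default whisker rule of 1.5 times the interquartile range, which is the default in both Seaborn and Matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up values; 100 is deliberately extreme
rates = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = rates.quantile(0.25)   # bottom edge of the box
median = rates.median()     # line inside the box
q3 = rates.quantile(0.75)   # top edge of the box
iqr = q3 - q1               # interquartile range (height of the box)

# points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = rates[(rates < lower_fence) | (rates > upper_fence)].to_list()
print(q1, median, q3, outliers)

sns.boxplot(x=rates)
plt.show()
```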
```
# create horizontal boxplot
sns.boxplot(y = "State",
x = "Rates.Violent.All",
data=dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_41_0.svg)
```
# create masks
smalli_mask = dfi.loc[:, 'State'].isin(['Idaho', 'Iowa'])
recent_mask = dfi.loc[:, 'Decade'] > 1990
# create boxplots
sns.boxplot(x = "State",
y = "Rates.Violent.All",
hue = 'Decade',
data = dfi.loc[smalli_mask & recent_mask])
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_42_0.svg)
## 9\.7. Scatter Plots[¶](https://snakebear.science/09-DataVisualization/index.html#scatter-plots "Permalink to this heading")
Seaborn's `sns.scatterplot()` function can be used to make scatter plots of data.
```
df.loc[:, ["Population", "Population_Mil", "Rates.Violent.All", "State"]].groupby(by='State').mean()
```
| | Population | Population\_Mil | Rates.Violent.All |
|---|---|---|---|
| State | | | |
| Alabama | 4\.011637e+06 | 4\.011637 | 434\.183019 |
| Alaska | 4\.858700e+05 | 0\.485870 | 492\.749057 |
| Arizona | 3\.553305e+06 | 3\.553305 | 487\.341509 |
| Arkansas | 2\.347026e+06 | 2\.347026 | 376\.058491 |
| California | 2\.715494e+07 | 27\.154940 | 642\.071698 |
| Colorado | 3\.298696e+06 | 3\.298696 | 383\.209434 |
| Connecticut | 3\.183763e+06 | 3\.183763 | 299\.032075 |
| Delaware | 6\.626316e+05 | 0\.662632 | 478\.762264 |
| District of Columbia | 6\.524541e+05 | 0\.652454 | 1652\.773585 |
| Florida | 1\.171569e+07 | 11\.715686 | 737\.747170 |
| Georgia | 6\.455632e+06 | 6\.455632 | 456\.471698 |
| Hawaii | 1\.027689e+06 | 1\.027689 | 213\.001887 |
| Idaho | 1\.041935e+06 | 1\.041935 | 206\.798113 |
| Illinois | 1\.162432e+07 | 11\.624323 | 633\.822642 |
| Indiana | 5\.594049e+06 | 5\.594049 | 318\.428302 |
| Iowa | 2\.873040e+06 | 2\.873040 | 197\.867925 |
| Kansas | 2\.479876e+06 | 2\.479876 | 319\.211321 |
| Kentucky | 3\.686424e+06 | 3\.686424 | 264\.737736 |
| Louisiana | 4\.122317e+06 | 4\.122317 | 594\.069811 |
| Maine | 1\.159225e+06 | 1\.159225 | 122\.618868 |
| Maryland | 4\.572824e+06 | 4\.572824 | 677\.616981 |
| Massachusetts | 5\.924470e+06 | 5\.924470 | 444\.211321 |
| Michigan | 9\.263410e+06 | 9\.263410 | 567\.367925 |
| Minnesota | 4\.332084e+06 | 4\.332084 | 225\.577358 |
| Mississippi | 2\.579543e+06 | 2\.579543 | 291\.556604 |
| Missouri | 5\.122876e+06 | 5\.122876 | 478\.056604 |
| Montana | 8\.173031e+05 | 0\.817303 | 194\.183019 |
| Nebraska | 1\.607538e+06 | 1\.607538 | 243\.671698 |
| Nevada | 1\.244390e+06 | 1\.244390 | 601\.279245 |
| New Hampshire | 1\.001637e+06 | 1\.001637 | 112\.858491 |
| New Jersey | 7\.686547e+06 | 7\.686547 | 404\.007547 |
| New Mexico | 1\.462050e+06 | 1\.462050 | 580\.150943 |
| New York | 1\.837103e+07 | 18\.371025 | 731\.075000 |
| North Carolina | 6\.635885e+06 | 6\.635885 | 442\.022642 |
| North Dakota | 6\.470186e+05 | 0\.647019 | 81\.926415 |
| Ohio | 1\.089756e+07 | 10\.897562 | 345\.690566 |
| Oklahoma | 3\.086300e+06 | 3\.086300 | 395\.273585 |
| Oregon | 2\.787653e+06 | 2\.787653 | 353\.279245 |
| Pennsylvania | 1\.198127e+07 | 11\.981271 | 326\.607547 |
| Rhode Island | 9\.775871e+05 | 0\.977587 | 275\.481132 |
| South Carolina | 3\.390524e+06 | 3\.390524 | 617\.522642 |
| South Dakota | 7\.213471e+05 | 0\.721347 | 152\.688679 |
| Tennessee | 4\.864989e+06 | 4\.864989 | 497\.443396 |
| Texas | 1\.653946e+07 | 16\.539457 | 479\.654717 |
| Utah | 1\.709371e+06 | 1\.709371 | 220\.392453 |
| Vermont | 5\.278923e+05 | 0\.527892 | 104\.839623 |
| Virginia | 5\.939604e+06 | 5\.939604 | 289\.549057 |
| Washington | 4\.660820e+06 | 4\.660820 | 338\.537736 |
| West Virginia | 1\.834482e+06 | 1\.834482 | 184\.901887 |
| Wisconsin | 4\.867310e+06 | 4\.867310 | 183\.911321 |
| Wyoming | 4\.452619e+05 | 0\.445262 | 221\.607547 |
```
# group by state, with mean aggregation
pop_violence_list = ["State", "Population", "Population_Mil", "Rates.Violent.All", "Rates.Property.All"]
df_grouped_state_mean = df.loc[:, pop_violence_list].groupby(by='State').mean()
# create list of the largest population states
big_states = df_grouped_state_mean \
.loc[:, 'Population'] \
.sort_values()[-20:] \
.index.to_list()
# use big state list to index df_grouped
dfbig = df_grouped_state_mean \
.loc[df_grouped_state_mean.index.isin(big_states)]
# create scatterplot
sns.scatterplot(x = "Population_Mil",
y = "Rates.Violent.All",
data = dfbig)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_45_0.svg)
Among the twenty largest states, there does not appear to be a simple linear association between population and rates of violent crime. It would be nice to know what state each point represents. Let's try adding the hue argument to see how that goes.
```
# create scatterplot with labelled data points
sns.scatterplot(x = 'Population_Mil',
y = 'Rates.Violent.All',
hue = dfbig.index,
data = dfbig)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_47_0.svg)
That's an all new kind of terrible chart that we've made. The States appear in the legend in alphabetical order, but there are so many that it's impossible to tell one shade of State from another. Let's try it again with a smaller selection of states.
```
# create list of seven of the largest states (the slice skips the single largest)
big_six = df_grouped_state_mean \
.loc[:, 'Population'] \
.sort_values()[-8:-1] \
.index \
.to_list()
# use big state list to index df_grouped
dfbig6 = df_grouped_state_mean \
.loc[df_grouped_state_mean.index.isin(big_six)]
# create plot
sns.scatterplot(x = 'Population_Mil',
y = 'Rates.Violent.All',
hue = dfbig6.index,
data = dfbig6)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_49_0.svg)
An improvement, but still ugly. We can use a matplotlib method to control where the legend is displayed.
```
# create plot
sns.scatterplot(x = 'Population_Mil',
y = 'Rates.Violent.All',
hue = dfbig6.index,
data = dfbig6)
# use a matplotlib method to control the legend display
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_51_0.svg)
That's a bit better. Now that the plot is cleaned up, let's look at what it says about our States.
Unfortunately, Florida shows that a state with a smaller population (relative to NY and TX) can have a violent crime rate that matches or exceeds those of bigger states.
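The legend call used above is worth a closer look. In Matplotlib, `loc=2` is the numeric code for `'upper left'`, and `bbox_to_anchor=(1.05, 1)` places that corner of the legend just outside the right edge of the plot. Here is a sketch with made-up data and placeholder state names, using the spelled-out location.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up data with placeholder state names
dfdemo = pd.DataFrame({
    'Population_Mil': [1.0, 5.0, 10.0],
    'Rate': [200, 400, 300],
    'State': ['A', 'B', 'C'],
})

ax = sns.scatterplot(x='Population_Mil', y='Rate', hue='State', data=dfdemo)

# pin the legend's upper-left corner just right of the plot area
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

# read the legend entries back from the plot object
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)

plt.show()
```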
We can examine the same association by region.
```
# add the state supplemental information into the grouped state mean df
df_grouped_state_mean = pd.merge(df_grouped_state_mean.reset_index(),
df_stateinfo)
# plot the association with Region specified as the hue
sns.scatterplot(x = 'Population_Mil',
y = 'Rates.Violent.All',
hue = 'Region',
data = df_grouped_state_mean)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_53_0.svg)
That's not a particularly good visualization. Too many brightly colored dots all next to one another.
Let's try breaking up the plots by region.
```
for region in df_grouped_state_mean.loc[:, 'Region'].unique():
    sns.scatterplot(x = 'Population_Mil',
                    y = 'Rates.Violent.All',
                    data = df_grouped_state_mean \
                    .loc[df_grouped_state_mean['Region'] == region])
    plt.title(region)
    plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_0.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_1.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_2.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_3.svg)
So there does appear to be a positive association between population size and rates of violence in the Northeast and Midwest.
The plots for the South and the West both have outliers (data points that appear to be quite distant from the others). In the South, we have a relatively small state with a very high rate of violent crime. (It's not actually a state, despite having a population larger than several states, but we will come back to that.) In the West, we have a very large state (California), with a rate of crime very similar to much less populous states.
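When a point stands apart like this, Pandas can tell us which row it is. The sketch below copies a handful of values from the grouped table shown earlier in this section into a small dataframe. Treating all five as belonging to the South is an assumption made for the example.

```python
import pandas as pd

# a few state means copied from the grouped table earlier in this section
south = pd.DataFrame({
    'State': ['Alabama', 'District of Columbia', 'Florida',
              'Georgia', 'Texas'],
    'Population_Mil': [4.011637, 0.652454, 11.715686,
                       6.455632, 16.539457],
    'Rates.Violent.All': [434.183019, 1652.773585, 737.747170,
                          456.471698, 479.654717],
}).set_index('State')

# sort from highest to lowest rate and look at the top rows
print(south.sort_values('Rates.Violent.All', ascending=False).head(2))

# or ask directly for the label of the row with the highest rate
outlier = south['Rates.Violent.All'].idxmax()
print(outlier)
```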
### 9\.7.1. Scatter Plots with Regression Lines[¶](https://snakebear.science/09-DataVisualization/index.html#scatter-plots-with-regression-lines "Permalink to this heading")
Regression is a technique for fitting a model to your data that can help you better understand how two variables in your data are related. It can also be used to make predictions. Think of it as moving one step beyond just looking at correlations.
"Simple linear regression" is used to fit a straight line through the data. You use this technique when you think two variables have a simple linear (not curved) relation with one another.
Seaborn includes visualizations of regression lines. Typically, these show a basic scatter plot with the regression line added. Seaborn's regression visualization also includes a band around the line indicating the confidence interval.
Note: If you are looking for formal regression statistics for your model, you should be aware that Seaborn's documentation states that its regression functionality is only intended to help with visualization and that formal regression statistics should be created with a different package, such as [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html).
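For reference, here is a minimal sketch of getting those formal statistics from SciPy's `linregress()` function, the one linked in the note. The x and y values are made up, and the comparison with NumPy's `polyfit()` is only a cross-check that both fit the same least-squares line.

```python
import numpy as np
from scipy import stats

# made-up data that roughly follows y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [2.9, 5.2, 6.8, 9.1, 11.0]

# fit a simple linear regression and read off the statistics
result = stats.linregress(x, y)
print('slope:', result.slope)
print('intercept:', result.intercept)
print('r-squared:', result.rvalue ** 2)
print('p-value:', result.pvalue)

# cross-check the fitted line with NumPy's least-squares fit
slope_np, intercept_np = np.polyfit(x, y, 1)
same_line = bool(np.isclose(result.slope, slope_np)
                 and np.isclose(result.intercept, intercept_np))
print(same_line)
```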
Let's use regression visualizations to look at the following question: In large population states, is higher property crime associated with higher violent crime rates?
```
# scatter plot with regression line
sns.lmplot(x = 'Rates.Violent.All',
y = 'Rates.Property.All',
data = dfbig)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_58_0.svg)
How about overall?
```
# scatterplot with regression line
sns.lmplot(x = 'Rates.Violent.All',
y = 'Rates.Property.All',
data = df_grouped_state_mean)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_60_0.svg)
What about the ten states with the highest levels of violence?
```
# create list of the highest violence rows by sorting, then grabbing the
# row labels from the index, and then keeping the last ten
highest_violence = df_grouped_state_mean \
.loc[:, 'Rates.Violent.All'] \
.sort_values() \
.index \
.to_list()[-10:]
# use high violence list to index df_grouped using .isin()
dfhighv = df_grouped_state_mean \
.loc[df_grouped_state_mean \
.index \
.isin(highest_violence)]
# plot
sns.lmplot(x = "Population_Mil",
y = "Rates.Violent.All",
data = dfhighv)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_62_0.svg)
These visualizations show a single outlier (notice that datapoint in the top left of the graph) that needs to be explored. We will come back to that soon.
Let's look at the association between property crime and population by region. Some Seaborn plotting functions, but not all, take an argument, `col =`, that allows us to create groups of plots. Earlier we did this with a loop. Using `col=` and `col_wrap=`, which specifies how many plots you want per row, makes things a bit easier for us than writing a loop.
```
# scatter plots with regression lines in columns and rows
sns.lmplot(y = 'Population_Mil',
x = 'Rates.Property.All',
col = 'Region', # creates a separate plot for each region
col_wrap = 2, # wraps columns so not all side by side
data=df_grouped_state_mean)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_64_0.svg)
The columns and rows of plots look nice, but the plots themselves are a bit ugly. Here, the scaling is consistent across the subplots, but most regions don't have data points across the full range of values, so the result is short, stunted-looking regression lines. Although we will lose the advantage of having common scaling between graphs, we can go back to our loop to make individual plots so Seaborn will scale each plot individually:
```
# scatter plots with loop
for region in df['Region'].unique(): # loop through the regions
    # create a subset of the data with just one region at a time
    regiondata = df_grouped_state_mean \
        .loc[df_grouped_state_mean['Region'] == region]
    # make the plot
    sns.lmplot(y = 'Population_Mil',
               x = 'Rates.Property.All',
               data = regiondata)
    plt.title(region)
    plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_0.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_1.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_2.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_3.svg)
That's a bit better. Now let's clean it up and provide labels and such.
```
# same as above, but with title and labels
for region in df['Region'].unique():
    regiondata = df_grouped_state_mean \
        .loc[df_grouped_state_mean['Region'] == region]
    sns.lmplot(y = 'Population_Mil',
               x = 'Rates.Property.All',
               data = regiondata)
    plt.ylabel('Population in Millions')
    plt.xlabel('Rate of Property Crime')
    plt.title(f'Region = {region}')
    plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_0.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_1.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_2.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_3.svg)
This second approach makes for better individual charts to look at the regression lines, but the difference in scaling between them might create problems for folks who don't pay close attention to these things (which is most of us, most of the time). So there is a trade-off for plotting things this way. Be careful with these sorts of scaling issues and consider your audience when making your visualizations.
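If you decide that common scaling matters more, the `col=` version can be labelled too. `sns.lmplot()` returns a grid object (a Seaborn `FacetGrid`) that has its own methods for axis labels and titles. The sketch below uses a made-up dataframe (`dfdemo`, with invented numbers) and passes `ci=None` to skip the confidence band.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up data: three invented points per region
dfdemo = pd.DataFrame({
    'Region': ['South'] * 3 + ['West'] * 3
              + ['Midwest'] * 3 + ['Northeast'] * 3,
    'Rates.Property.All': [3000, 4000, 5000, 2500, 3500, 4500,
                           2000, 3000, 4000, 1500, 2500, 3500],
    'Population_Mil': [2, 5, 9, 1, 4, 30, 2, 6, 11, 1, 7, 18],
})

# one subplot per region, two per row, all sharing the same scaling
g = sns.lmplot(x='Rates.Property.All', y='Population_Mil',
               col='Region', col_wrap=2, ci=None, data=dfdemo)

# label the whole grid at once instead of plot by plot
g.set_axis_labels('Rate of Property Crime', 'Population in Millions')
g.set_titles('Region = {col_name}')

titles = {ax.get_title() for ax in g.axes.flat}
print(len(g.axes.flat))
print(titles)

plt.show()
```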
***
We will explore the state crime data a bit more in an upcoming exercise. If you are interested in more recent data about overall trends in the U.S. take a look at [this analysis by the Pew Research Center](https://www.pewresearch.org/fact-tank/2020/11/20/facts-about-crime-in-the-u-s/).
© Copyright 2018, Mark Liffiton and Brad Sheese.
[](http://creativecommons.org/licenses/by-nc-sa/3.0/) This work is licensed under [CC BY-NC-SA 3.0](http://creativecommons.org/licenses/by-nc-sa/3.0/). See [License](https://snakebear.science/AB-License.html) for details. See [Github](https://github.com/CS-DS-125/125book/) for source and to report [issues](https://github.com/CS-DS-125/125book/issues) with the contents or website.
Last updated on Feb 05, 2026. Created using [Sphinx](https://www.sphinx-doc.org/) 5.3.0, [Runestone](http://runestoneinteractive.org/) 4.1.17. |
| Readable Markdown | Note
This is a static copy of a Jupyter notebook.
You can access a live version allowing you to modify and execute the code using [Binder](https://mybinder.org/v2/gh/CS-DS-125/125book/master?urlpath=lab/tree/source/09-DataVisualization/notebooks/09_1_Visualization_Introduction_with_Seaborn.ipynb).
## 9\.1. Introduction[¶](https://snakebear.science/09-DataVisualization/index.html#introduction "Permalink to this heading")
There are a variety of data visualization libraries available in Python. There is a lot of commonality in these libraries, but they do take different approaches and offer somewhat different visualization tools.
A library called [Matplotlib](https://en.wikipedia.org/wiki/Matplotlib) was developed in 2002. Matplotlib has been designed to work with NumPy and [SciPy](https://en.wikipedia.org/wiki/SciPy). Matplotlib underlies many Python visualization packages, including the one we will learn, called [Seaborn](https://seaborn.pydata.org/). Seaborn is an advanced library; it is powerful and straightforward, which makes it a good place to start. However, if you continue to work with these kinds of tools, consider learning Matplotlib as well.
By convention, Seaborn is imported and given the abbreviation `sns`. When you see a call that is prefaced with `sns`, such as `sns.lineplot()`, you are using the Seaborn library.
Here are some image galleries you can take a quick look at, so you can get a sense of what these packages can do:
- [Seaborn gallery](https://seaborn.pydata.org/examples/index.html)
- [Matplotlib gallery](https://matplotlib.org/gallery.html)
- [Plotly gallery](https://plotly.com/python/)
We will only be scratching the surface of Seaborn, but once you get used to the basics you should be able to start learning the rest for yourself.
We will be using calls to Matplotlib to tweak and display the plotting object we have built with Seaborn. In particular, we will be using a module within the library called `matplotlib.pyplot`. By convention, `matplotlib.pyplot` is imported as `plt`. When you see a function call prefaced with `plt`, such as `plt.show()`, you are using the Matplotlib library.
### 9\.1.1. Preliminaries: Imports and Dataframe Creation[¶](https://snakebear.science/09-DataVisualization/index.html#preliminaries-imports-and-dataframe-creation "Permalink to this heading")
We are going to be using the CORGIS state crime data set. Each row in the dataset represents one year and one U.S. state.
```
# imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# ignore this code entirely
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
# read the data and create the dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme1 = 'data_corgis_state_crime.csv'
df_original = pd.read_csv(urlg + repo + fnme1)
df = df_original[df_original['State'] != 'United States']
# read the data and create a supplemental dataframe
fnme2 = 'data_state_info.csv'
df_stateinfo = pd.read_csv(urlg + repo + fnme2)
# remove totals and just keep crime rates
column_mask = ~df.columns.str.contains('Totals')
df = df[df.columns[column_mask]]
#drop rows with empty values
df = df.dropna(axis=0, how='any')
# create a decade column
df.loc[:, 'Decade'] = (df.loc[:, 'Year']//10 * 10).astype(int)
# merge state crime data with supplemental state classification data
df_stateinfo = df_stateinfo.reset_index()
df = pd.merge(df, df_stateinfo)
# normalize population for better plotting
df.loc[:, 'Population_Mil'] = df.loc[:, 'Population']/1000000
#check dataframe
df.head(3)
```
| | Population | Rates.Property.All | Rates.Property.Burglary | Rates.Property.Larceny | Rates.Property.Motor | Rates.Violent.All | Rates.Violent.Assault | Rates.Violent.Murder | Rates.Violent.Rape | Rates.Violent.Robbery | State | Year | Decade | index | State Code | Region | Division | Population\_Mil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3266740 | 1035\.4 | 355\.9 | 592\.1 | 87\.3 | 186\.6 | 138\.1 | 12\.4 | 8\.6 | 27\.5 | Alabama | 1960 | 1960 | 1 | AL | South | East South Central | 3\.26674 |
| 1 | 3302000 | 985\.5 | 339\.3 | 569\.4 | 76\.8 | 168\.5 | 128\.9 | 12\.9 | 7\.6 | 19\.1 | Alabama | 1961 | 1960 | 1 | AL | South | East South Central | 3\.30200 |
| 2 | 3358000 | 1067\.0 | 349\.1 | 634\.5 | 83\.4 | 157\.3 | 119\.0 | 9\.4 | 6\.5 | 22\.5 | Alabama | 1962 | 1960 | 1 | AL | South | East South Central | 3\.35800 |
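The `Decade` column in the setup code comes from floor division: `Year // 10` drops the last digit of the year, and `* 10` puts a zero back in its place. A minimal sketch of that arithmetic on a few years (the `to_decade` helper is a hypothetical name used only for this illustration):

```python
def to_decade(year):
    """Round a year down to the first year of its decade."""
    return year // 10 * 10   # floor-divide by 10, then scale back up

for year in [1960, 1967, 1999, 2012]:
    print(year, '->', to_decade(year))
```

The same expression works on a whole pandas column at once, which is what the setup code does.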
```
# abbreviated descriptives
df.describe().T[['mean', 'min', 'max']].round(1)
```
| | mean | min | max |
|---|---|---|---|
| Population | 4751877\.0 | 226167\.0 | 38041430\.0 |
| Rates.Property.All | 3683\.2 | 573\.1 | 9512\.1 |
| Rates.Property.Burglary | 929\.5 | 182\.6 | 2906\.7 |
| Rates.Property.Larceny | 2395\.2 | 293\.3 | 5833\.8 |
| Rates.Property.Motor | 358\.5 | 48\.3 | 1839\.9 |
| Rates.Violent.All | 398\.9 | 9\.5 | 2921\.8 |
| Rates.Violent.Assault | 235\.6 | 3\.6 | 1557\.6 |
| Rates.Violent.Murder | 6\.7 | 0\.2 | 80\.6 |
| Rates.Violent.Rape | 28\.3 | 0\.8 | 102\.2 |
| Rates.Violent.Robbery | 128\.2 | 1\.9 | 1635\.1 |
| Year | 1986\.0 | 1960\.0 | 2012\.0 |
| Decade | 1981\.7 | 1960\.0 | 2010\.0 |
| index | 25\.0 | 0\.0 | 50\.0 |
| Population\_Mil | 4\.8 | 0\.2 | 38\.0 |
## 9\.2. Basic Figure Creation with Seaborn
To create a figure or graph we are typically going to:
- call some specific function within Seaborn, such as `sns.lineplot()`, to create a plotting object
- include arguments in the call that specify the data to plot and any options that we need
- write additional code to make tweaks to the object, such as adding a title, legend, or axis labels.
- call `plt.show()` to display the object
Displaying a plot will typically involve a few lines of code. The initial call will create an object that we can't see yet, but that knows what type of plot it is and what data it holds. We then add to the object and tweak things. Finally, when we want to see what we've built, we can tell the object to show itself.
This is a slightly different way of interacting with objects than you might be familiar with. It takes some getting used to. The thing to remember is that when you call the initial function you are creating an object. You can then do things with that object, such as put it in a list, assign it to a variable, or display it.
Note: Usually plot objects don't display themselves until you ask, but Colab (like other forms of Jupyter notebooks) tries to be helpful and may display your objects even when you haven't asked. You might be tempted to rely on that convenience, but I'd like you not to get into that habit, as it can cause quite a bit of confusion later down the road.
### 9\.2.1. Seaborn Patterns
Let's look at some code to get a sense of the overall pattern for making plots with Seaborn. Note: Remember that `sns` is the alias for the Seaborn library.
```
# set figure size
sns.set(rc={'figure.figsize':(10,8)})
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50] # slowest zombie spread of all time
# plot starts here
sns.lineplot(x = yearlist, y = zombiecount) # called a line plot, lists as x and y
plt.show() # displays the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_7_0.svg)
Let's add a few more components to our figure.
```
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50] # slowest zombie spread of all time
# plot starts here
sns.lineplot(x = yearlist, y = zombiecount) # called a line plot, lists as x and y
plt.title('Slow Zombie Apocalypse Chart') # added a title
plt.xlabel('Year') # added a x-axis label
plt.ylabel('Zombies') # added a y-axis label
plt.show() # displays the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_9_0.svg)
Here we will plot two sets of data on the same figure. Notice we are calling `sns.lineplot()` twice, but we are not getting back two plots. Instead, the second call adds to the existing object.
```
# data for our plot
yearlist = [1970, 1980, 1990, 2000, 2010, 2020, 2030]
zombiecount = [0, 0, 0, 0, 10, 20, 50] # slowest zombie spread of all time
bunnycount = [45, 35, 25, 15, 10, 5, 1] # bunnies have a rough time
# plot starts here
sns.lineplot(x = yearlist,
y = zombiecount,
label = 'zombies') # called a line plot, lists as x and y
sns.lineplot(x = yearlist,
y = bunnycount,
label = 'bunnies') # add a line to the existing plot
plt.show() # displays the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_11_0.svg)
Again, here is the general pattern:
1. Have the data in some form
2. Call a Seaborn function to plot the data, supply necessary arguments
3. Tweak or add to the plot with calls to `plt` (the alias for `matplotlib.pyplot`, a module in the Matplotlib library)
4. Display the result
Make sure you can recognize how the code above accomplishes each step.
### 9\.2.2. Seaborn Patterns with Pandas
If you have your data in a dataframe, you can hand specific series from your dataframe to Seaborn for plotting.
For example:
```
# create a zombie bunny dataframe
# don't worry about how this works
dfzb = pd.DataFrame(zip(yearlist, zombiecount, bunnycount))
dfzb.columns = ['year', 'zombies', 'bunnies']
# plot the data
sns.lineplot(x = 'year', y = 'zombies', data = dfzb) # specifies series and dataframe
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_14_0.svg)
Notice that Seaborn has taken the series names and used them as axis labels. We can override this with a subsequent call to `plt`.
```
sns.lineplot(x = 'year', y = 'zombies', data = dfzb) # specify series and dataframe
plt.ylabel('Zombies!!!!')
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_16_0.svg)
We have options for supplying our data to Seaborn. We can feed it the data directly, for example, by passing it a series, or we can use `data =` to specify a dataframe, and then it will look for x and y in the column names. Students commonly blend the two approaches and have difficulty figuring out why it fails. If you are trying to plot a series from a dataframe by its column name, always check that you've specified a data source.
## 9\.3. Line Plots
Up until now we have been making line plots where there was only one y value for each x value in the line. For example, in 2030 there was 1 bunny left. However, often when we make a line plot the line is actually representing a summary of lots of data points. For example, let's say we were plotting weekly Broadway revenue. A line plot would then produce a single point for each week representing the average revenue of lots of Broadway productions. In a sense, Seaborn is acting a lot like the pandas `.groupby()` method: it automatically aggregates over groups of data to produce the plot. If we hand Seaborn data where there are many values for each point we want to plot, and ask it to make a line plot, it will assume we want the mean plotted, and automatically handle aggregating, calculating the mean, and then plotting the line.
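The aggregation described above can be reproduced by hand with pandas. This sketch uses made-up revenue figures and shows the per-week means that a line plot of the same data would connect:

```python
import pandas as pd

# made-up weekly revenue (in thousands) for three shows per week
revenue = pd.DataFrame({
    'week':    [1, 1, 1, 2, 2, 2],
    'revenue': [100, 200, 300, 150, 250, 350],
})

# one mean per week: the values the line would pass through
weekly_mean = revenue.groupby('week')['revenue'].mean()
print(weekly_mean)
```

Six rows of data collapse into two points, one for each week.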
In addition, Seaborn will add a band of color around the line. Let's see what that looks like:
```
sns.lineplot(x = "Decade",
y = "Rates.Violent.All",
data = df)
plt.show() # still need this to show the plot
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_19_0.svg)
In this dataset we have data from different states for each decade. Seaborn represents the mean value with the darker blue line. The lighter blue band around the line shows a statistic called the 95% confidence interval. The width of the confidence interval depends on both the sample size and the variance of the data itself. We will talk more about this later. If you are interested, [read more about confidence intervals here](https://en.wikipedia.org/wiki/Confidence_interval).
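Seaborn's documentation describes its default confidence interval as a bootstrap estimate. The sketch below illustrates that general idea on made-up numbers using only the standard library; it is not Seaborn's actual implementation, and the `bootstrap_ci` helper is a hypothetical name:

```python
import random
import statistics

def bootstrap_ci(values, n_boot=1000, seed=0):
    """Approximate 95% confidence interval for the mean by resampling."""
    rng = random.Random(seed)    # fixed seed so the result is repeatable
    boot_means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_boot)
    )
    lower = boot_means[int(0.025 * n_boot)]       # roughly the 2.5th percentile
    upper = boot_means[int(0.975 * n_boot) - 1]   # roughly the 97.5th percentile
    return lower, upper

rates = [186.6, 320.1, 250.4, 410.9, 150.2, 298.7, 505.3, 275.0]  # made-up rates
low, high = bootstrap_ci(rates)
many_low, many_high = bootstrap_ci(rates * 10)    # same spread, ten times the data

print('small sample:', round(low, 1), 'to', round(high, 1))
print('large sample:', round(many_low, 1), 'to', round(many_high, 1))
```

The larger sample produces a narrower interval, which is the dependence on sample size mentioned above.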
We can add a second line to plot by adding a second function call to our code.
Notice that the data in the plot above is the same as the data in the lower line in the plot below. If I showed you the one above and asked you to come to a conclusion about violent crime rates you would likely come to a different conclusion than if I showed you the one below. The only difference between the two is the scaling of the y-axis. Property crime is much more common, so Seaborn changed the y-axis to accommodate the second set of data.
Also notice that the confidence intervals are still there but may be difficult to see due to the scaling and the size of the figure display.
```
sns.lineplot(x = "Decade", y = "Rates.Violent.All", data = df)
sns.lineplot(x = "Decade", y = "Rates.Property.All", data = df)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_22_0.svg)
## 9\.4. Categorical Plots
### 9\.4.1. Bar Charts
We can use `sns.barplot()` to create bar charts.
By default, Seaborn represents the mean of the data as the height of the bar and draws a small line (an error bar) through the top of the bar to show how uncertain that mean is. The top and bottom of that line mark the 95% confidence interval.
```
# create a view into the dataframe restricted to States that start with 'I'
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x = "State",
y = "Rates.Violent.All",
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_25_0.svg)
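To check that the bar heights really are group means, we can compare them with a `groupby` calculation. This is a sketch on made-up data for two hypothetical states; reading the heights from `ax.patches` assumes that the bars are the only rectangles drawn on the axes:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up rates for two hypothetical states
toy = pd.DataFrame({
    'State': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Rate':  [100.0, 200.0, 300.0, 10.0, 20.0, 30.0],
})

fig, ax = plt.subplots()
sns.barplot(x='State', y='Rate', data=toy, ax=ax)
bar_heights = [bar.get_height() for bar in ax.patches]

# the same means, calculated directly with pandas
group_means = toy.groupby('State', sort=False)['Rate'].mean().to_list()
print(bar_heights, group_means)
plt.show()
```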
To get a horizontal bar chart we just flip the variable assigned to the axes.
```
sns.barplot(y = "State",
x = "Rates.Violent.All",
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_27_0.svg)
### 9\.4.2. Grouped Bar Charts
By adding the `hue` argument we can create grouped bar charts.
```
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x = "State",
y = "Rates.Violent.All",
hue= 'Decade',
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_29_0.svg)
```
dfi = df.loc[df.loc[:, 'State'].str.startswith('I')]
sns.barplot(x = "Decade",
y = "Rates.Violent.All",
hue= 'State',
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_30_0.svg)
### 9\.4.3. Count Plots
We can make plots of the frequency of categorical data using `sns.countplot()`.
Note that we are only supplying an argument for the category we want to be counted. Seaborn handles actually doing the count.
```
# countplot
sns.countplot(x = "Region",
data = df_stateinfo)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_32_0.svg)
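The counting that `sns.countplot()` does for us can be reproduced with the pandas `.value_counts()` method. A small sketch on made-up rows:

```python
import pandas as pd

# made-up rows; each row stands for one state
regions = pd.DataFrame(
    {'Region': ['South', 'West', 'South', 'Midwest', 'South', 'West']}
)

counts = regions['Region'].value_counts()   # number of rows per category
print(counts)
```

Each count corresponds to the height of one bar in a count plot of the same column.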
We can make a horizontal version of the frequency count by specifying the categorical data using the argument `y` rather than `x`.
```
# horizontal countplot
sns.countplot(y = "Division",
data = df_stateinfo)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_34_0.svg)
## 9\.5. Histograms
You can use Seaborn's `sns.histplot()` function to create a histogram, which provides frequency counts for continuous data. The `bins=` argument specifies the number of bins in the histogram.
Note: In the plot below, the y-axis represents the count of values falling within each bin.
```
# view of states with lower population
df_small = df.loc[df.loc[:, 'Population_Mil'] < 20]
# create histogram
sns.histplot(x = 'Population_Mil',
data = df_small,
bins = 15)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_36_0.svg)
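To see what binning means, here is a sketch that splits a few made-up values into three equal-width bins and counts the values in each. It uses NumPy's `np.histogram()` as a stand-in; Seaborn's own binning may differ in its details:

```python
import numpy as np

values = [1, 2, 2, 3, 5, 8, 9, 10]     # made-up data

counts, edges = np.histogram(values, bins=3)
print(edges)    # boundaries of the three equal-width bins
print(counts)   # how many values fall in each bin
```

The counts are what a histogram draws as bar heights, and changing `bins` changes how finely the range is divided.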
We can also draw two overlapping distributions on the same figure to compare their frequencies.
```
# create overlapping histogram with two calls
sns.histplot(x = 'Rates.Property.Motor',
data = df_small,
bins = 15)
sns.histplot(x = 'Rates.Property.Burglary',
data = df_small,
bins = 15,
color = 'purple')
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_38_0.svg)
## 9\.6. Box Plots
Another way of showing both central tendency and dispersion is the boxplot. In Seaborn, we can create boxplots with the `sns.boxplot()` function.
```
# create boxplot
sns.boxplot(x = "State",
y = "Rates.Violent.All",
data = dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_40_0.svg)
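The parts of a boxplot are summary statistics that we can calculate ourselves. The box runs from the first quartile to the third quartile, the line inside it marks the median, and by default the whiskers reach to the most extreme points within 1.5 times the interquartile range (IQR) of the box. Points beyond that are drawn individually. A sketch on made-up values:

```python
import pandas as pd

rates = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])   # made-up values; 50 is unusually large

q1 = rates.quantile(0.25)       # bottom edge of the box
median = rates.median()         # line inside the box
q3 = rates.quantile(0.75)       # top edge of the box
iqr = q3 - q1                   # height of the box

lower_fence = q1 - 1.5 * iqr    # whiskers do not extend past these fences
upper_fence = q3 + 1.5 * iqr
outliers = rates[(rates < lower_fence) | (rates > upper_fence)]

print(q1, median, q3)
print(outliers.to_list())
```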
```
# create horizontal boxplot
sns.boxplot(y = "State",
x = "Rates.Violent.All",
data=dfi)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_41_0.svg)
```
# create masks
smalli_mask = dfi.loc[:, 'State'].isin(['Idaho', 'Iowa'])
recent_mask = dfi.loc[:, 'Decade'] > 1990
# create boxplots
sns.boxplot(x = "State",
y = "Rates.Violent.All",
hue = 'Decade',
data = dfi.loc[smalli_mask & recent_mask])
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_42_0.svg)
## 9\.7. Scatter Plots
Seaborn's `sns.scatterplot()` function can be used to make scatter plots of data.
```
df.loc[:, ["Population", "Population_Mil", "Rates.Violent.All", "State"]].groupby(by='State').mean()
```
| | Population | Population\_Mil | Rates.Violent.All |
|---|---|---|---|
| State | | | |
| Alabama | 4\.011637e+06 | 4\.011637 | 434\.183019 |
| Alaska | 4\.858700e+05 | 0\.485870 | 492\.749057 |
| Arizona | 3\.553305e+06 | 3\.553305 | 487\.341509 |
| Arkansas | 2\.347026e+06 | 2\.347026 | 376\.058491 |
| California | 2\.715494e+07 | 27\.154940 | 642\.071698 |
| Colorado | 3\.298696e+06 | 3\.298696 | 383\.209434 |
| Connecticut | 3\.183763e+06 | 3\.183763 | 299\.032075 |
| Delaware | 6\.626316e+05 | 0\.662632 | 478\.762264 |
| District of Columbia | 6\.524541e+05 | 0\.652454 | 1652\.773585 |
| Florida | 1\.171569e+07 | 11\.715686 | 737\.747170 |
| Georgia | 6\.455632e+06 | 6\.455632 | 456\.471698 |
| Hawaii | 1\.027689e+06 | 1\.027689 | 213\.001887 |
| Idaho | 1\.041935e+06 | 1\.041935 | 206\.798113 |
| Illinois | 1\.162432e+07 | 11\.624323 | 633\.822642 |
| Indiana | 5\.594049e+06 | 5\.594049 | 318\.428302 |
| Iowa | 2\.873040e+06 | 2\.873040 | 197\.867925 |
| Kansas | 2\.479876e+06 | 2\.479876 | 319\.211321 |
| Kentucky | 3\.686424e+06 | 3\.686424 | 264\.737736 |
| Louisiana | 4\.122317e+06 | 4\.122317 | 594\.069811 |
| Maine | 1\.159225e+06 | 1\.159225 | 122\.618868 |
| Maryland | 4\.572824e+06 | 4\.572824 | 677\.616981 |
| Massachusetts | 5\.924470e+06 | 5\.924470 | 444\.211321 |
| Michigan | 9\.263410e+06 | 9\.263410 | 567\.367925 |
| Minnesota | 4\.332084e+06 | 4\.332084 | 225\.577358 |
| Mississippi | 2\.579543e+06 | 2\.579543 | 291\.556604 |
| Missouri | 5\.122876e+06 | 5\.122876 | 478\.056604 |
| Montana | 8\.173031e+05 | 0\.817303 | 194\.183019 |
| Nebraska | 1\.607538e+06 | 1\.607538 | 243\.671698 |
| Nevada | 1\.244390e+06 | 1\.244390 | 601\.279245 |
| New Hampshire | 1\.001637e+06 | 1\.001637 | 112\.858491 |
| New Jersey | 7\.686547e+06 | 7\.686547 | 404\.007547 |
| New Mexico | 1\.462050e+06 | 1\.462050 | 580\.150943 |
| New York | 1\.837103e+07 | 18\.371025 | 731\.075000 |
| North Carolina | 6\.635885e+06 | 6\.635885 | 442\.022642 |
| North Dakota | 6\.470186e+05 | 0\.647019 | 81\.926415 |
| Ohio | 1\.089756e+07 | 10\.897562 | 345\.690566 |
| Oklahoma | 3\.086300e+06 | 3\.086300 | 395\.273585 |
| Oregon | 2\.787653e+06 | 2\.787653 | 353\.279245 |
| Pennsylvania | 1\.198127e+07 | 11\.981271 | 326\.607547 |
| Rhode Island | 9\.775871e+05 | 0\.977587 | 275\.481132 |
| South Carolina | 3\.390524e+06 | 3\.390524 | 617\.522642 |
| South Dakota | 7\.213471e+05 | 0\.721347 | 152\.688679 |
| Tennessee | 4\.864989e+06 | 4\.864989 | 497\.443396 |
| Texas | 1\.653946e+07 | 16\.539457 | 479\.654717 |
| Utah | 1\.709371e+06 | 1\.709371 | 220\.392453 |
| Vermont | 5\.278923e+05 | 0\.527892 | 104\.839623 |
| Virginia | 5\.939604e+06 | 5\.939604 | 289\.549057 |
| Washington | 4\.660820e+06 | 4\.660820 | 338\.537736 |
| West Virginia | 1\.834482e+06 | 1\.834482 | 184\.901887 |
| Wisconsin | 4\.867310e+06 | 4\.867310 | 183\.911321 |
| Wyoming | 4\.452619e+05 | 0\.445262 | 221\.607547 |
```
# group by state, with mean aggregation
pop_violence_list = ["State", "Population", "Population_Mil", "Rates.Violent.All", "Rates.Property.All"]
df_grouped_state_mean = df.loc[:, pop_violence_list].groupby(by='State').mean()
# create list of the largest population states
# (the slice keeps the 21 states with the highest mean population)
big_states = df_grouped_state_mean \
    .loc[:, 'Population'] \
    .sort_values()[-21:] \
    .index.to_list()
# use big state list to index df_grouped
dfbig = df_grouped_state_mean \
    .loc[df_grouped_state_mean.index.isin(big_states)]
# create scatterplot
sns.scatterplot(x = "Population_Mil",
                y = "Rates.Violent.All",
                data = dfbig)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_45_0.svg)
Among the twenty-one largest states, there does not appear to be a simple linear association between population and rates of violent crime. It would be nice to know what state each point represents. Let's try adding the `hue` argument to see how that goes.
```
# create scatterplot with labelled data points
sns.scatterplot(x = 'Population_Mil',
y = 'Rates.Violent.All',
hue = dfbig.index,
data = dfbig)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_47_0.svg)
That's an all new kind of terrible chart that we've made. The states appear in the legend in alphabetical order, but there are so many that it's impossible to tell one state's shade from another. Let's try it again with a smaller selection of states.
```
# create list of large population states
# note: this slice keeps seven states (ranks 2 through 8 by mean population)
# and leaves out the single largest state (California)
big_six = df_grouped_state_mean \
    .loc[:, 'Population'] \
    .sort_values()[-8:-1] \
    .index \
    .to_list()
# use big state list to index df_grouped
dfbig6 = df_grouped_state_mean \
    .loc[df_grouped_state_mean.index.isin(big_six)]
# create plot
sns.scatterplot(x = 'Population_Mil',
                y = 'Rates.Violent.All',
                hue = dfbig6.index,
                data = dfbig6)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_49_0.svg)
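Sorting and then slicing from the end is one way to pick out the largest values. As an alternative sketch (not the approach used in this chapter), the pandas `.nlargest()` method states the intent more directly. The values below are made up:

```python
import pandas as pd

# made-up mean populations, in millions, for five hypothetical states
mean_population = pd.Series(
    {'A': 27.2, 'B': 18.4, 'C': 16.5, 'D': 12.0, 'E': 0.5},
    name='Population_Mil',
)

top_three = mean_population.nlargest(3).index.to_list()                 # largest first
same_by_sorting = mean_population.sort_values()[-3:].index.to_list()    # largest last

print(top_three)
print(same_by_sorting)
```

Both approaches select the same states; only the order of the resulting list differs.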
An improvement, but still ugly. We can use a matplotlib method to control where the legend is displayed.
```
# create plot
sns.scatterplot(x = 'Population_Mil',
y = 'Rates.Violent.All',
hue = dfbig6.index,
data = dfbig6)
# use a matplotlib method to control the legend display
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_51_0.svg)
That's a bit better. Now that the plot is cleaned up, let's look at what it says about our states.
Florida shows that a state with a smaller population (relative to New York and Texas) can have a violent crime rate that matches or exceeds the rates of bigger states.
We can examine the same association by region.
```
# add the state supplemental information into the grouped state mean df
df_grouped_state_mean = pd.merge(df_grouped_state_mean.reset_index(),
df_stateinfo)
# plot the association with Region specified as the hue
sns.scatterplot(x = 'Population_Mil',
y = 'Rates.Violent.All',
hue = 'Region',
data = df_grouped_state_mean)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_53_0.svg)
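The `pd.merge()` call above is given no `on=` argument, so pandas joins the two dataframes on the column names they share (here, `State`) and, by default, keeps only rows that have a match in both. A small sketch with hypothetical frames, using rates rounded from the table earlier in this section:

```python
import pandas as pd

rates = pd.DataFrame({'State': ['Iowa', 'Idaho', 'Ohio'],
                      'Rate': [197.9, 206.8, 345.7]})
info = pd.DataFrame({'State': ['Iowa', 'Idaho'],        # Ohio is deliberately missing
                     'Region': ['Midwest', 'West']})

merged = pd.merge(rates, info)   # joins on the shared column, 'State'
print(merged)
```

Ohio drops out of the result because it has no row in `info`. It is worth checking the number of rows after a merge for this reason.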
That's not a particularly good visualization. Too many brightly colored dots all next to one another.
Let's try breaking up the plots by region.
```
for region in df_grouped_state_mean.loc[:, 'Region'].unique():
    sns.scatterplot(x = 'Population_Mil',
                    y = 'Rates.Violent.All',
                    data = df_grouped_state_mean \
                           .loc[df_grouped_state_mean['Region'] == region])
    plt.title(region)
    plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_0.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_1.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_2.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_55_3.svg)
So there does appear to be a positive association between population size and rates of violence in the Northeast and Midwest.
The plots for the South and the West both have outliers (data points that appear to be quite distant from the others). In the South, we have a relatively small state with a very high rate of violent crime. (It's not actually a state, despite having a population larger than several states, but we will come back to that.) In the West, we have a very large state (California), with a rate of crime very similar to much less populous states.
### 9\.7.1. Scatter Plots with Regression Lines
Regression is a technique for fitting a model to your data that can help you better understand how two variables in your data are related. It can also be used to make predictions. Think of it as moving one step beyond just looking at correlations.
"Simple linear regression" is used to fit a straight line through the data. You use this technique when you think two variables have a simple linear (not curved) relation with one another.
Seaborn includes visualizations of regression lines. Typically, these show a basic scatter plot with the regression line added. Seaborn's regression visualization also includes a band around the line indicating the confidence interval.
Note: If you are looking for formal regression statistics for your model, you should be aware that Seaborn's documentation states that their regression functionality is only intended to help with visualization and that formal regression statistics should be created with a different package, such as [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html).
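To give a sense of what fitting a straight line involves, here is a sketch of the standard least-squares calculation in plain Python on made-up data. It only illustrates the idea; it is not how Seaborn or SciPy implement it, and the `fit_line` helper is a hypothetical name:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # how x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]     # made-up data
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept    # use the line to predict y when x is 6
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The slope says how much y changes, on average, for each one-unit increase in x, and the fitted line can be used to make a prediction for an x value that is not in the data.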
Let's use regression visualizations to look at the following question: In large population states, is higher property crime associated with higher violent crime rates?
```
# scatter plot with regression line
sns.lmplot(x = 'Rates.Violent.All',
y = 'Rates.Property.All',
data = dfbig)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_58_0.svg)
How about overall?
```
# scatterplot with regression line
sns.lmplot(x = 'Rates.Violent.All',
y = 'Rates.Property.All',
data = df_grouped_state_mean)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_60_0.svg)
What about the ten states with the highest levels of violence?
```
# create list of the highest violence states by sorting, then grabbing the
# row labels from the index, and then keeping the last ten
# (after the merge above, the index holds integer row labels, not state names)
highest_violence = df_grouped_state_mean \
    .loc[:, 'Rates.Violent.All'] \
    .sort_values() \
    .index \
    .to_list()[-10:]
# use high violence list to index df_grouped using .isin()
dfhighv = df_grouped_state_mean \
    .loc[df_grouped_state_mean \
         .index \
         .isin(highest_violence)]
# plot
sns.lmplot(x = "Population_Mil",
           y = "Rates.Violent.All",
           data = dfhighv)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_62_0.svg)
These visualizations show a single outlier (notice that datapoint in the top left of the graph) that needs to be explored. We will come back to that soon.
Let's look at the association between property crime and population by region. Some Seaborn plotting functions, but not all, take an argument, `col =`, that allows us to create groups of plots. Earlier we did this with a loop. Using `col =` together with `col_wrap =`, which specifies how many plots you want per row, makes things a bit easier for us than writing a loop.
```
# scatter plots with regression lines in columns and rows
sns.lmplot(y = 'Population_Mil',
x = 'Rates.Property.All',
col = 'Region', # creates a separate plot for each region
col_wrap = 2, # wraps columns so not all side by side
data=df_grouped_state_mean)
plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_64_0.svg)
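Unlike the plotting functions used earlier in this chapter, `sns.lmplot()` returns a grid object that holds one set of axes per panel. The sketch below, on made-up data with four hypothetical groups, keeps that grid in a variable (here called `grid`) so that all panels can be labeled at once:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up data: four hypothetical groups with four points each
toy = pd.DataFrame({
    'group': ['a'] * 4 + ['b'] * 4 + ['c'] * 4 + ['d'] * 4,
    'x': [1, 2, 3, 4] * 4,
    'y': [1, 2, 4, 5, 2, 2, 3, 5, 5, 3, 2, 1, 0, 1, 1, 3],
})

grid = sns.lmplot(x='x', y='y', col='group', col_wrap=2, data=toy)
grid.set_axis_labels('Predictor', 'Outcome')   # labels the outer axes of the grid
n_panels = len(grid.axes.flat)                 # one panel per group
print(n_panels)
plt.show()
```

With `col_wrap = 2` the four panels are arranged in two rows of two.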
The columns and rows of plots look nice, but the plots themselves are a bit ugly. Here, the scaling is consistent across the subplots, but most regions don't have data points across the full range of values, so the result is short, stunted-looking regression lines. Although we will lose the advantage of having common scaling between graphs, we can go back to our loop to make individual plots so Seaborn will scale each plot individually:
```
# scatter plots with loop
for region in df['Region'].unique(): # loop through the regions
    # create a subset of the data with just one region at a time
    regiondata = df_grouped_state_mean \
        .loc[df_grouped_state_mean['Region'] == region]
    # make the plot
    sns.lmplot(y = 'Population_Mil',
               x = 'Rates.Property.All',
               data = regiondata)
    plt.title(region)
    plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_0.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_1.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_2.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_66_3.svg)
That's a bit better. Now let's clean it up and provide labels and such.
```
# same as above, but with title and labels
for region in df['Region'].unique():
    regiondata = df_grouped_state_mean \
        .loc[df_grouped_state_mean['Region'] == region]
    sns.lmplot(y = 'Population_Mil',
               x = 'Rates.Property.All',
               data = regiondata)
    plt.ylabel('Population in Millions')
    plt.xlabel('Rate of Property Crime')
    plt.title(f'Region = {region}')
    plt.show()
```
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_0.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_1.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_2.svg)
[](https://snakebear.science/_images/09_1_Visualization_Introduction_with_Seaborn_68_3.svg)
This second approach makes for better individual charts to look at the regression lines, but the difference in scaling between them might create problems for folks who don't pay close attention to these things (which is most of us, most of the time). So there is a trade-off for plotting things this way. Be careful with these sorts of scaling issues and consider your audience when making your visualizations.
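One possible middle ground, offered as a sketch rather than the approach this chapter takes, is to keep the loop but give every plot the same axis limits, calculated once from the full dataset. The data and column names below are made up, and the sketch uses plain scatter plots to keep it short:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up data for two hypothetical regions with very different ranges
toy = pd.DataFrame({
    'Region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'x': [1, 2, 3, 10, 20, 30],
    'y': [5, 6, 7, 50, 60, 70],
})

# one set of limits, calculated from all of the data
x_limits = (toy['x'].min(), toy['x'].max())
y_limits = (toy['y'].min(), toy['y'].max())

used_limits = []
for region in toy['Region'].unique():
    fig, ax = plt.subplots()                    # a fresh figure for each region
    sns.scatterplot(x='x', y='y',
                    data=toy.loc[toy['Region'] == region],
                    ax=ax)
    ax.set_xlim(x_limits)                       # same scale on every plot
    ax.set_ylim(y_limits)
    ax.set_title(f'Region = {region}')
    used_limits.append((ax.get_xlim(), ax.get_ylim()))
    plt.show()
```

Each plot is still drawn separately, but a reader comparing them no longer has to check the axis ticks to know that the scales match.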
***
We will explore the state crime data a bit more in an upcoming exercise. If you are interested in more recent data about overall trends in the U.S. take a look at [this analysis by the Pew Research Center](https://www.pewresearch.org/fact-tank/2020/11/20/facts-about-crime-in-the-u-s/). |
| Shard | 173 (laksa) |
| Root Hash | 6549672853077037173 |
| Unparsed URL | science,snakebear!/09-DataVisualization/index.html s443 |