Data Visualization with Seaborn (Part #2)

⭐️ Part #2 of a 3-Part Series


Continuing from Part 1 of my seaborn series, we'll proceed to cover 2D plots.


This notebook is a reorganization of the many ideas shared in this GitHub repo and this blog post. It is a modified version that works for me, and I hope it works for you as well. Also, enjoy the cat GIFs.


[Image: seaborn visualizations]

2D: Visualizing Data in Two Dimensions

One of the best ways to check out potential relationships or correlations amongst the different data attributes is to leverage a pair-wise correlation matrix and depict it as a heatmap.

2D: Heatmap on Correlation Matrix

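# Compute pairwise correlation of the DataFrame's numeric attributes
# (numeric_only=True skips the string column wine_type, which pandas >= 2.0 would otherwise reject)
corr = wines.corr(numeric_only=True)
corr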

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.000000 | 0.219008 | 0.324436 | -0.111981 | 0.298195 | -0.282735 | -0.329054 | 0.458910 | -0.252700 | 0.299568 | -0.095452 | -0.076743 |
| volatile acidity | 0.219008 | 1.000000 | -0.377981 | -0.196011 | 0.377124 | -0.352557 | -0.414476 | 0.271296 | 0.261454 | 0.225984 | -0.037640 | -0.265699 |
| citric acid | 0.324436 | -0.377981 | 1.000000 | 0.142451 | 0.038998 | 0.133126 | 0.195242 | 0.096154 | -0.329808 | 0.056197 | -0.010493 | 0.085532 |
| residual sugar | -0.111981 | -0.196011 | 0.142451 | 1.000000 | -0.128940 | 0.402871 | 0.495482 | 0.552517 | -0.267320 | -0.185927 | -0.359415 | -0.036980 |
| chlorides | 0.298195 | 0.377124 | 0.038998 | -0.128940 | 1.000000 | -0.195045 | -0.279630 | 0.362615 | 0.044708 | 0.395593 | -0.256916 | -0.200666 |
| free sulfur dioxide | -0.282735 | -0.352557 | 0.133126 | 0.402871 | -0.195045 | 1.000000 | 0.720934 | 0.025717 | -0.145854 | -0.188457 | -0.179838 | 0.055463 |
| total sulfur dioxide | -0.329054 | -0.414476 | 0.195242 | 0.495482 | -0.279630 | 0.720934 | 1.000000 | 0.032395 | -0.238413 | -0.275727 | -0.265740 | -0.041385 |
| density | 0.458910 | 0.271296 | 0.096154 | 0.552517 | 0.362615 | 0.025717 | 0.032395 | 1.000000 | 0.011686 | 0.259478 | -0.686745 | -0.305858 |
| pH | -0.252700 | 0.261454 | -0.329808 | -0.267320 | 0.044708 | -0.145854 | -0.238413 | 0.011686 | 1.000000 | 0.192123 | 0.121248 | 0.019506 |
| sulphates | 0.299568 | 0.225984 | 0.056197 | -0.185927 | 0.395593 | -0.188457 | -0.275727 | 0.259478 | 0.192123 | 1.000000 | -0.003029 | 0.038485 |
| alcohol | -0.095452 | -0.037640 | -0.010493 | -0.359415 | -0.256916 | -0.179838 | -0.265740 | -0.686745 | 0.121248 | -0.003029 | 1.000000 | 0.444319 |
| quality | -0.076743 | -0.265699 | 0.085532 | -0.036980 | -0.200666 | 0.055463 | -0.041385 | -0.305858 | 0.019506 | 0.038485 | 0.444319 | 1.000000 |

import matplotlib.pyplot as plt   # already imported if you are continuing from Part 1
import seaborn as sns

fig, ax = plt.subplots(1, 1, figsize=(10,6))

hm = sns.heatmap(corr, 
                 ax=ax,           # Axes in which to draw the plot, otherwise use the currently-active Axes.
                 cmap="coolwarm", # Color Map.
                 #square=True,    # If True, set the Axes aspect to "equal" so each cell will be square-shaped.
                 annot=True, 
                 fmt='.2f',       # String formatting code to use when adding annotations.
                 #annot_kws={"size": 14},
                 linewidths=.05)

fig.subplots_adjust(top=0.93)
fig.suptitle('Wine Attributes Correlation Heatmap', 
              fontsize=14, 
              fontweight='bold')

[Figure: Wine Attributes Correlation Heatmap]

  • The color gradient in the heatmap varies with the strength and sign of the correlation.
  • Strongly correlated attribute pairs stand out at a glance, e.g. free and total sulfur dioxide (0.72) or density and alcohol (-0.69).
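If you would rather rank the pairs than eyeball them, the correlation matrix can be flattened and sorted. The following is only a sketch on a synthetic stand-in DataFrame (the `toy` frame and its columns `a`, `b`, `c` are made up for illustration); with the real data you would start from `corr` as computed above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (hypothetical values, illustration only).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 0.9 * a + rng.normal(scale=0.3, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                       # independent noise
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).unstack().dropna()

# Sort by absolute correlation, strongest first.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

The first entry of `pairs` is the most strongly correlated pair, regardless of sign.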

2D: Pair-Wise Scatter Plots

Another way to visualize the same is to use pair-wise scatter plots amongst attributes of interest.

Note: The diagonal Axes are treated differently: each one shows the univariate distribution of the data for the variable in that column.

# Attributes of interest
cols = ['density', 
        'residual sugar', 
        'total sulfur dioxide', 
        'free sulfur dioxide', 
        'fixed acidity']

pp = sns.pairplot(wines[cols], 
                  height=1.8, aspect=1.2,    # "size" was renamed to "height" in seaborn 0.9
                  plot_kws=dict(edgecolor="k", linewidth=0.5),
                  diag_kws=dict(fill=True),  # "diag" adjusts/tunes the diagonal plots ("fill" replaces the deprecated "shade")
                  diag_kind="kde")           # use "kde" for diagonal plots

fig = pp.figure  # use pp.fig on older seaborn versions
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots', 
              fontsize=14, fontweight='bold')

[Figure: Wine Attributes Pairwise Plots]

Bonus: You can also fit linear regression models to the scatter plots. See [😀].

pp = sns.pairplot(wines[cols], 
                  diag_kws=dict(fill=True), # "diag" adjusts/tunes the diagonal plots
                  diag_kind="kde",          # use "kde" for diagonal plots
                  kind="reg")               # <== 😀 linear regression to the scatter plots

fig = pp.figure  # use pp.fig on older seaborn versions
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14, fontweight='bold')

[Figure: Wine Attributes Pairwise Plots, with regression fits]

Based on the above plot, you can see that scatter plots are also a decent way of observing potential relationships or patterns in two-dimensions for data attributes.


2D: Parallel Coordinates

Another way of visualizing multivariate data for multiple attributes together (or concurrently) is to use parallel coordinates.

More about parallel coordinates: https://github.com/matloff/parcoordtutorial

Important: Before we proceed to run parallel_coordinates(), we'll need to scale our data first, as different attributes are measured on different scales.

We'll be using StandardScaler in sklearn.preprocessing to do the job.

Note: I have another blog post on Feature Scaling (should you be interested to know more).
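As a rough illustration of what standardization does, here is a sketch that applies z = (x - mean) / std by hand to a tiny made-up table. The values in `toy` are hypothetical, not taken from the wine data. It uses the population standard deviation (ddof=0), which is the convention StandardScaler follows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values on very different scales (not from the wine data).
toy = pd.DataFrame({
    "density": [0.991, 0.995, 0.997, 1.001],
    "total sulfur dioxide": [30.0, 110.0, 150.0, 210.0],
})

# z = (x - mean) / std, with the population standard deviation (ddof=0).
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
print(scaled.mean().round(6).tolist())       # every column is centred near 0
print(scaled.std(ddof=0).round(6).tolist())  # every column has unit spread
```

After this step both columns live on the same scale, so neither dominates the vertical axis of a parallel coordinates plot.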

# Attributes of interest
cols = ['density', 
        'residual sugar', 
        'total sulfur dioxide', 
        'free sulfur dioxide', 
        'fixed acidity']

subset_df = wines[cols]

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
scaled_df = ss.fit_transform(subset_df)
scaled_df = pd.DataFrame(scaled_df, columns=cols)
final_df = pd.concat([scaled_df, wines['wine_type']], axis=1)
final_df.head()

| | density | residual sugar | total sulfur dioxide | free sulfur dioxide | fixed acidity | wine_type |
|---|---|---|---|---|---|---|
| 0 | -0.165631 | 1.546371 | 0.181456 | -0.367664 | -0.166089 | white |
| 1 | 0.301278 | -0.681719 | 0.305311 | 0.083090 | 0.373895 | red |
| 2 | -0.859324 | 0.411306 | 0.305311 | 0.421155 | -0.320370 | white |
| 3 | 0.408001 | 1.210056 | 1.189993 | 1.717074 | -0.706073 | white |
| 4 | 1.395180 | 1.777588 | 2.003900 | 1.829762 | 0.142473 | white |

from pandas.plotting import parallel_coordinates

fig = plt.figure(figsize=(12, 10))
title = fig.suptitle("Parallel Coordinates", fontsize=18)
fig.subplots_adjust(top=0.93, wspace=0)

pc = parallel_coordinates(final_df, 
                          'wine_type', 
                          color=('skyblue', 'firebrick'))

[Figure: Parallel Coordinates, scaled attributes]

Basically, in this visualization as depicted above, points are represented as connected line segments.

  • Each vertical line represents one data attribute (e.g. residual sugar).
  • One complete set of connected line segments across all the attributes represents one data point.
  • Hence points that tend to cluster will appear closer together.
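To make the "one row, one polyline" idea concrete, here is a small sketch with four hypothetical, already-scaled rows (the numbers are invented for illustration). Each row becomes one line crossing every attribute axis, colored by its class.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Four hypothetical, already-scaled rows: each one becomes a polyline.
toy = pd.DataFrame({
    "density":        [-1.0, -0.8, 0.9, 1.1],
    "residual sugar": [1.2, 0.9, -0.7, -1.0],
    "fixed acidity":  [-0.9, -1.1, 1.0, 0.8],
    "wine_type":      ["white", "white", "red", "red"],
})

fig, ax = plt.subplots(figsize=(6, 4))
# Colors are assigned to classes in order of first appearance: white, then red.
parallel_coordinates(toy, "wine_type", color=("skyblue", "firebrick"), ax=ax)

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
```

With so few rows you can follow each line from axis to axis, which is hard to do on the full dataset.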

Just by looking at it, we can see that density is slightly higher for red wines than for white, since more red lines cluster above the white ones.

Also residual sugar and total sulfur dioxide are higher for white wines as compared to red, while fixed acidity is higher for red wines as compared to white.
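A visual impression like this can be cross-checked with per-group means. The sketch below uses a handful of hypothetical rows (the `toy` values are invented); on the real data the same call would be made on `wines` with the attributes of interest.

```python
import pandas as pd

# Hypothetical stand-in rows; with the real data, use `wines` directly.
toy = pd.DataFrame({
    "wine_type":      ["red", "red", "red", "white", "white", "white"],
    "density":        [0.997, 0.998, 0.996, 0.993, 0.995, 0.994],
    "residual sugar": [2.0, 2.5, 1.9, 6.0, 8.1, 4.5],
    "fixed acidity":  [8.1, 7.9, 8.6, 6.8, 7.0, 6.6],
})

# Mean of every numeric attribute within each wine type.
group_means = toy.groupby("wine_type").mean()
print(group_means.round(3))
```

Comparing the two rows of `group_means` column by column gives the same kind of "higher for red" or "higher for white" reading as the plot, in numbers.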

Note: If you don't perform scaling beforehand, this is what you'll get:

# If you don't perform scaling beforehand, this is what you'll get:
fig = plt.figure(figsize=(12, 10))
title = fig.suptitle("Parallel Coordinates", fontsize=18)
fig.subplots_adjust(top=0.93, wspace=0)

new_cols = ['density', 'residual sugar', 'total sulfur dioxide', 'free sulfur dioxide', 'fixed acidity', 'wine_type']
pc = parallel_coordinates(wines[new_cols], 
                          'wine_type', 
                          color=('skyblue', 'firebrick'))

[Figure: Parallel Coordinates, unscaled attributes]


2D: Two Continuous Numeric Attributes [📈]

[💔] The traditional way, using matplotlib:

plt.scatter(wines['sulphates'], 
            wines['alcohol'],
            alpha=0.4, edgecolors='w')

plt.xlabel('Sulphates')
plt.ylabel('Alcohol')
plt.title('Wine Sulphates - Alcohol Content', y=1.05)

[Figure: Wine Sulphates - Alcohol Content, matplotlib scatter plot]

[💚] The better alternative, using Seaborn's jointplot():

jp = sns.jointplot(data=wines,
                   x='sulphates', 
                   y='alcohol', 
                   kind='reg', # <== 😀 Add regression and kernel density fits
                   space=0, height=6, ratio=4) # "size" was renamed to "height" in seaborn 0.9

[Figure: sulphates vs. alcohol, jointplot with kind='reg']
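The line drawn by kind='reg' is, by default, an ordinary least-squares fit. As a sketch of what that means, the snippet below fits the same kind of line with numpy on made-up data (the `x` and `y` arrays are synthetic, not the wine attributes) and relates the slope to the Pearson correlation.

```python
import numpy as np

# Synthetic data standing in for two numeric attributes (illustration only).
rng = np.random.default_rng(42)
x = rng.uniform(0.3, 1.2, size=300)
y = 10.0 + 1.5 * x + rng.normal(0.0, 0.5, size=300)

# Degree-1 least-squares fit: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation; for a simple linear fit, slope = r * std(y) / std(x).
r = np.corrcoef(x, y)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A steep line with a small r simply means the trend is real on average but individual points scatter widely around it.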


😀 Replace the scatterplot with a joint histogram using hexagonal bins:

jp = sns.jointplot(data=wines,
                   x='sulphates', 
                   y='alcohol', 
                   kind='hex', # <== 😀 Replace the scatterplot with a joint histogram using hexagonal bins
                   space=0, height=6, ratio=4)

[Figure: sulphates vs. alcohol, jointplot with kind='hex']


😀 KDE:

jp = sns.jointplot(data=wines,
                   x='sulphates', 
                   y='alcohol', 
                   kind='kde', # <== 😀 KDE
                   space=0, height=6, ratio=4)

[Figure: sulphates vs. alcohol, jointplot with kind='kde']


2D: Two Discrete Categorical Attributes [📊]

Now that we've covered two continuous numeric attributes, how about visualizing two discrete, categorical attributes?

One way is to leverage separate subplots or facets for one of the categorical dimensions.

[💔] The traditional way, using matplotlib:

fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Wine Type - Quality", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Quality")
ax1.set_ylabel("Frequency") 
rw_q = red_wine['quality'].value_counts()
rw_q = (list(rw_q.index), list(rw_q.values))
ax1.set_ylim([0,2500])
ax1.tick_params(axis='both', which='major', labelsize=8.5)
bar1 = ax1.bar(rw_q[0], rw_q[1], color='red', 
               edgecolor='black', linewidth=1)

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Quality")
ax2.set_ylabel("Frequency") 
ww_q = white_wine['quality'].value_counts()
ww_q = (list(ww_q.index), list(ww_q.values))
ax2.set_ylim([0,2500])
ax2.tick_params(axis='both', which='major', labelsize=8.5)
bar2 = ax2.bar(ww_q[0], ww_q[1], color='white', 
               edgecolor='black', linewidth=1)

[Figure: Wine Type - Quality, separate bar charts for red and white wine]

While the above is a good way to visualize categorical data, as you can see, leveraging matplotlib has resulted in writing a lot of code 😀.

[💚] The better alternative, using Seaborn's countplot():

In addition, another good way is to use stacked bars or multiple bars for the different attributes in a single plot. We can leverage seaborn for the same easily.

fig = plt.figure(figsize=(10, 7))

cp = sns.countplot(data=wines, 
                   x="quality", 
                   hue="wine_type", 
                   palette={"red": "#FF9999", "white": "#FFE888"})

[Figure: count plot of quality, grouped by wine_type]

This definitely looks cleaner and you can also effectively compare the different categories easily from this single plot.
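The bar heights in such a count plot are simply a cross-tabulation of the two categorical attributes. Here is a sketch on a few hypothetical rows (the `toy` frame is invented for illustration) showing the table behind the bars.

```python
import pandas as pd

# Hypothetical rows standing in for the quality and wine_type columns.
toy = pd.DataFrame({
    "quality":   [5, 5, 6, 6, 6, 7, 5, 6],
    "wine_type": ["red", "white", "white", "red", "white", "white", "red", "white"],
})

# Rows: quality levels; columns: wine types; cells: number of wines.
counts = pd.crosstab(toy["quality"], toy["wine_type"])
print(counts)
```

Each cell of `counts` corresponds to one bar: the row picks the position on the x-axis and the column picks the hue.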

[GIF: mindblown-cat]


2D: Mixed Attributes [📈+📊]

Let's look at visualizing mixed attributes in 2-D (essentially numeric and categorical together).

One way is to use faceting/subplots along with generic histograms or density plots.

[💔] Again, let's first look at the traditional way, using matplotlib (histograms):

fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Frequency") 
ax1.set_ylim([0, 1200])
ax1.text(1.2, 800, r'$\mu$='+str(round(red_wine['sulphates'].mean(),2)), 
         fontsize=12)
r_freq, r_bins, r_patches = ax1.hist(red_wine['sulphates'], color='red', bins=15,
                                     edgecolor='black', linewidth=1)

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Frequency")
ax2.set_ylim([0, 1200])
ax2.text(0.8, 800, r'$\mu$='+str(round(white_wine['sulphates'].mean(),2)), 
         fontsize=12)
w_freq, w_bins, w_patches = ax2.hist(white_wine['sulphates'], color='white', bins=15,
                                     edgecolor='black', linewidth=1)

[Figure: Sulphates Content in Wine, histograms for red and white wine]

[💔] Using matplotlib subplots (density plots, each drawn with seaborn's kdeplot()):

fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Density") 
sns.kdeplot(red_wine['sulphates'], ax=ax1, fill=True, color='r')  # "fill" replaces the deprecated "shade"

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Density") 
sns.kdeplot(white_wine['sulphates'], ax=ax2, fill=True, color='y')

[Figure: Sulphates Content in Wine, density plots for red and white wine]

While this is good, once again we have a lot of boilerplate code which we can avoid by leveraging seaborn and even depict the plots in one single chart.

[💚] The better alternative, using Seaborn's FacetGrid():

The FacetGrid is an object that links a Pandas DataFrame to a matplotlib figure with a particular structure.

In particular, FacetGrid is used to draw plots with multiple Axes where each Axes shows the same relationship conditioned on different levels of some variable. It's possible to condition on up to three variables by assigning variables to the rows and columns of the grid and using different colors for the plot elements.

The basic workflow is to initialize the FacetGrid object with the dataset and the variables that are used to structure the grid. Then one or more plotting functions can be applied to each subset by calling FacetGrid.map() or FacetGrid.map_dataframe().

Finally, the plot can be tweaked with other methods to do things like change the axis labels, use different ticks, or add a legend. See the detailed code examples here for more information.

# distplot() is deprecated in current seaborn, so histplot() is mapped instead.
# FacetGrid creates its own figure, so no separate plt.figure() is needed.
g = sns.FacetGrid(data=wines, 
                  hue='wine_type', 
                  palette={"red": "r", "white": "y"},
                  height=6, aspect=1.4)

g.map(sns.histplot, 'sulphates', 
      kde=True, bins=15)

g.set_axis_labels("Sulphates", "Frequency")
g.add_legend(title='Wine Type')

g.figure.subplots_adjust(top=0.93)
g.figure.suptitle("Sulphates Content in Wine", fontsize=14)

[Figure: Sulphates Content in Wine, overlaid distributions by wine type]

The plot generated above is clear and concise, and we can easily compare the distributions.

[GIF: mindblown-cat-2]


2D: Box [📦] and Violin [🎻] Plots

[📦] Box plots are another way of effectively depicting groups of numeric data based on the different values in the categorical attribute.

Additionally, box plots are a good way to know the quartile values in the data and also potential outliers.
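For reference, the quantities a box plot encodes can be computed directly. The sketch below uses a small made-up sample (the `alcohol` array is hypothetical) and the common 1.5 × IQR rule, which is also seaborn's default whisker setting, to flag potential outliers.

```python
import numpy as np

# Hypothetical alcohol values for one quality group (illustration only).
alcohol = np.array([9.0, 9.4, 9.5, 9.8, 10.0, 10.2, 10.5, 11.0, 14.9])

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = np.percentile(alcohol, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = alcohol[(alcohol < lower_fence) | (alcohol > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```

Here only the largest value falls outside the fences, so it would appear as a lone marker above the upper whisker.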

f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Wine Quality - Alcohol Content', fontsize=14)

sns.boxplot(data=wines,  
            x="quality", 
            y="alcohol", 
            ax=ax)

ax.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)

[Figure: Wine Quality - Alcohol Content, box plots]

[🎻] Violin plots are a similar visualization: they show grouped numeric data using kernel density plots, depicting the probability density of the data at different values.

f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Wine Quality - Sulphates Content', fontsize=14)

sns.violinplot(data=wines,
               x="quality", 
               y="sulphates",   
               ax=ax)

ax.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax.set_ylabel("Wine Sulphates",size=12,alpha=0.8)

[Figure: Wine Quality - Sulphates Content, violin plots]

You can clearly see the density of sulphates for each wine quality category in the plot above.
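The outline of each violin is a kernel density estimate. As a sketch of the idea, the function below builds a Gaussian KDE by hand with numpy. The name `gaussian_kde_1d`, the synthetic `samples`, and the fixed bandwidth of 0.04 are all assumptions for illustration; seaborn chooses a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic values standing in for one group's sulphates (illustration only).
rng = np.random.default_rng(7)
samples = rng.normal(0.55, 0.12, size=400)

grid = np.linspace(0.0, 1.2, 241)
density = gaussian_kde_1d(samples, grid, bandwidth=0.04)

# A density should integrate to roughly 1 (simple Riemann sum over the grid).
area = (density * (grid[1] - grid[0])).sum()
print(round(float(area), 3))
```

A violin plot mirrors this curve around a vertical axis, so the widest part of the violin marks the most common values.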


🍷 That's it for Part #2 🎻

[GIF: benedict-cumberbatch-violin]


~ The Complete Seaborn Series ~

Part #1 (1D)

Part #2 (2D)

Part #3 (3D)


If you enjoyed this post and want to buy me a cup of coffee...

The thing is, I'll always accept a cup of coffee. So feel free to buy me one.

Cheers! ☕️