Data Visualization with Seaborn (Part #1)

⭐️ Part #1 of a 3-Part Series


This notebook is a reorganization of the many ideas shared in this GitHub repo and this blog post. What you see here is a modified version that works for me, and I hope it works for you as well. Also, enjoy the cat GIFs.


seaborn visualizations

In the age of big data, visualization tools are vital. With a single glance at a graphic display, a human being can recognize patterns that a computer might fail to find even after hours of analysis.

In this post, I'll show how you can use a popular Python visualization library, Seaborn, to plot attractive data visualizations for pattern discovery.

Using seaborn (and some help from matplotlib), we'll explore some effective strategies for visualizing data in multiple dimensions (ranging from 1-D up to 6-D).

To install the latest release of seaborn, you can use either pip or conda:

pip install seaborn
# or, with conda:
conda install seaborn

👾 0. Import Dependencies

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline

🍷 1. Load Wine Datasets

BTW: We need a "white wine emoji" badly; support this cause at change.org.

For our dataset, we'll be using the Wine Quality Data Set available from the UCI Machine Learning Repository. It actually consists of two datasets describing various attributes of the red and white variants of the Portuguese "Vinho Verde" wine.

red_wine   = pd.read_csv('winequality-red.csv',   sep=';')
white_wine = pd.read_csv('winequality-white.csv', sep=';')
red_wine.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
white_wine.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
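
Before engineering any features, a quick sanity check on a freshly loaded frame can be worthwhile. The previews above happen to show identical rows (rows 0 and 4 of red_wine, rows 3 and 4 of white_wine), so duplicates are worth counting. Below is a minimal sketch: check_wine_frame is a hypothetical helper name, and the in-memory sample (three rows copied from the red-wine preview, in an assumed ';'-separated layout) stands in for the real CSV file.

```python
import io
import pandas as pd

# Three rows copied from the red-wine preview above (rows 0, 1 and 4).
# Assumption: the layout mirrors the ';'-separated files read earlier.
SAMPLE_CSV = (
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.70;0.00;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0.00;2.6;0.098;25.0;67.0;0.9968;3.20;0.68;9.8;5\n"
    "7.4;0.70;0.00;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5\n"
)

def check_wine_frame(df):
    """Hypothetical helper: report shape, missing cells and duplicated rows."""
    return {
        'shape': df.shape,
        'missing': int(df.isna().sum().sum()),
        'duplicates': int(df.duplicated().sum()),
    }

sample = pd.read_csv(io.StringIO(SAMPLE_CSV), sep=';')
report = check_wine_frame(sample)
print(report)
```

On the real data you would call check_wine_frame(red_wine) and check_wine_frame(white_wine) instead of using the sample.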

🛠 2. Feature Engineering

Create new attributes by hand-engineering features.

furious-typing-cat


🛠 (#1): Add New Attribute wine_type

red_wine['wine_type'] = 'red'   
white_wine['wine_type'] = 'white'
red_wine.head(1)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality wine_type
0 7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red
white_wine.head(1)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality wine_type
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.001 3.0 0.45 8.8 6 white

🛠 (#2): Add New Attribute quality_label

First, let's take a peek at all the unique values of the attribute quality.

print('red_wine\'s list of "quality":\t', sorted(red_wine['quality'].unique()))
print('white_wine\'s list of "quality":\t', sorted(white_wine['quality'].unique()))
red_wine's list of "quality":	 [3, 4, 5, 6, 7, 8]
white_wine's list of "quality":	 [3, 4, 5, 6, 7, 8, 9]

Bucket quality (numerical) scores into a new (categorical) attribute called quality_label:

  • low: value ≤ 5
  • medium: 5 < value ≤ 7
  • high: value > 7

In addition, we'll convert quality_label into a Categorical data type by using pd.Categorical().

red_wine['quality_label'] = red_wine['quality'].apply(lambda value: ('low' if value <= 5 else 'medium') if value <= 7 else 'high')
red_wine['quality_label'] = pd.Categorical(red_wine['quality_label'], categories=['low', 'medium', 'high'])

white_wine['quality_label'] = white_wine['quality'].apply(lambda value: ('low' if value <= 5 else 'medium') if value <= 7 else 'high')
white_wine['quality_label'] = pd.Categorical(white_wine['quality_label'], categories=['low', 'medium', 'high'])
red_wine.head(1)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality wine_type quality_label
0 7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 red low
white_wine.head(1)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality wine_type quality_label
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.001 3.0 0.45 8.8 6 white medium
# Preview `value_counts()` of the `quality_label` attribute:
print(red_wine['quality_label'].value_counts())
print()
print(white_wine['quality_label'].value_counts())
medium    837
low       744
high       18
Name: quality_label, dtype: int64

medium    3078
low       1640
high       180
Name: quality_label, dtype: int64
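
As an alternative to the nested lambda above, the same buckets can be expressed with pd.cut(). This is only a sketch on a handful of scores: the upper bin edge of 10 is an assumption (the top of the quality scale), and unlike the pd.Categorical() call above, pd.cut() returns an ordered categorical by default.

```python
import pandas as pd

# A few quality scores covering every bucket boundary.
quality = pd.Series([3, 5, 6, 7, 8, 9], name='quality')

# Bins are right-inclusive by default: (0, 5], (5, 7], (7, 10].
# Assumption: 10 is the highest possible quality score.
quality_label = pd.cut(quality,
                       bins=[0, 5, 7, 10],
                       labels=['low', 'medium', 'high'])

print(quality_label.tolist())
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```

Because the result is ordered, comparisons such as quality_label > 'low' also work, which the unordered version does not support.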

Merge red and white wine datasets with pd.concat():

wines = pd.concat([red_wine, white_wine], axis=0)

# Re-shuffle records just to randomize data points.
# `drop=True`: this resets the index to the default integer index.
wines = wines.sample(frac=1.0, random_state=42).reset_index(drop=True)
wines.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality wine_type quality_label
0 7.0 0.17 0.74 12.8 0.045 24.0 126.0 0.99420 3.26 0.38 12.2 8 white high
1 7.7 0.64 0.21 2.2 0.077 32.0 133.0 0.99560 3.27 0.45 9.9 5 red low
2 6.8 0.39 0.34 7.4 0.020 38.0 133.0 0.99212 3.18 0.44 12.0 7 white medium
3 6.3 0.28 0.47 11.2 0.040 61.0 183.0 0.99592 3.12 0.51 9.5 6 white medium
4 7.4 0.35 0.20 13.9 0.054 63.0 229.0 0.99888 3.11 0.50 8.9 6 white medium
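
To see why the reset_index(drop=True) step matters, here is a small sketch on two toy frames (the values are borrowed from the previews above, the frame names are made up for illustration). Stacking with pd.concat() keeps each frame's own row labels, so labels repeat until the index is reset.

```python
import pandas as pd

# Toy stand-ins for red_wine and white_wine.
red = pd.DataFrame({'alcohol': [9.4, 9.8], 'wine_type': 'red'})
white = pd.DataFrame({'alcohol': [8.8, 9.5, 10.1], 'wine_type': 'white'})

stacked = pd.concat([red, white], axis=0)
print(stacked.index.tolist())    # [0, 1, 0, 1, 2] -> labels repeat

# Shuffle every row, then replace the old labels with 0..n-1.
shuffled = stacked.sample(frac=1.0, random_state=42).reset_index(drop=True)
print(shuffled.index.tolist())   # [0, 1, 2, 3, 4]
```

Shuffling only changes the row order; no record is added or lost.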

🔬 3. Exploratory Data Analysis

Cats Exploring

Apply "Descriptive Statistics" using describe() on a subset of attributes:

subset_attributes = ['residual sugar',        #1
                     'total sulfur dioxide',  #2
                     'sulphates',             #3
                     'alcohol',               #4
                     'volatile acidity',      #5
                     'quality']               #6
rs = round(red_wine[subset_attributes].describe(), 2)
rs
       residual sugar  total sulfur dioxide  sulphates  alcohol  volatile acidity  quality
count         1599.00               1599.00    1599.00  1599.00           1599.00  1599.00
mean             2.54                 46.47       0.66    10.42              0.53     5.64
std              1.41                 32.90       0.17     1.07              0.18     0.81
min              0.90                  6.00       0.33     8.40              0.12     3.00
25%              1.90                 22.00       0.55     9.50              0.39     5.00
50%              2.20                 38.00       0.62    10.20              0.52     6.00
75%              2.60                 62.00       0.73    11.10              0.64     6.00
max             15.50                289.00       2.00    14.90              1.58     8.00
ws = round(white_wine[subset_attributes].describe(), 2)
ws
       residual sugar  total sulfur dioxide  sulphates  alcohol  volatile acidity  quality
count         4898.00               4898.00    4898.00  4898.00           4898.00  4898.00
mean             6.39                138.36       0.49    10.51              0.28     5.88
std              5.07                 42.50       0.11     1.23              0.10     0.89
min              0.60                  9.00       0.22     8.00              0.08     3.00
25%              1.70                108.00       0.41     9.50              0.21     5.00
50%              5.20                134.00       0.47    10.40              0.26     6.00
75%              9.90                167.00       0.55    11.40              0.32     6.00
max             65.80                440.00       1.08    14.20              1.10     9.00

Use the keys parameter in pd.concat() to separate the red and white wine statistics:

pd.concat([rs, ws], axis=1, 
          keys=['🔴 Red Wine Statistics', 
                '⚪️ White Wine Statistics'])
🔴 Red Wine Statistics
       residual sugar  total sulfur dioxide  sulphates  alcohol  volatile acidity  quality
count         1599.00               1599.00    1599.00  1599.00           1599.00  1599.00
mean             2.54                 46.47       0.66    10.42              0.53     5.64
std              1.41                 32.90       0.17     1.07              0.18     0.81
min              0.90                  6.00       0.33     8.40              0.12     3.00
25%              1.90                 22.00       0.55     9.50              0.39     5.00
50%              2.20                 38.00       0.62    10.20              0.52     6.00
75%              2.60                 62.00       0.73    11.10              0.64     6.00
max             15.50                289.00       2.00    14.90              1.58     8.00

⚪️ White Wine Statistics
       residual sugar  total sulfur dioxide  sulphates  alcohol  volatile acidity  quality
count         4898.00               4898.00    4898.00  4898.00           4898.00  4898.00
mean             6.39                138.36       0.49    10.51              0.28     5.88
std              5.07                 42.50       0.11     1.23              0.10     0.89
min              0.60                  9.00       0.22     8.00              0.08     3.00
25%              1.70                108.00       0.41     9.50              0.21     5.00
50%              5.20                134.00       0.47    10.40              0.26     6.00
75%              9.90                167.00       0.55    11.40              0.32     6.00
max             65.80                440.00       1.08    14.20              1.10     9.00
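
What the keys parameter actually builds is a two-level column index: the key on top, the original column name underneath. A minimal sketch, using a few of the mean and std values from the tables above and the shortened keys 'red' and 'white' (chosen here for brevity):

```python
import pandas as pd

# A slice of the red and white summary statistics shown above.
red_stats = pd.DataFrame({'alcohol': [10.42, 1.07], 'quality': [5.64, 0.81]},
                         index=['mean', 'std'])
white_stats = pd.DataFrame({'alcohol': [10.51, 1.23], 'quality': [5.88, 0.89]},
                           index=['mean', 'std'])

combined = pd.concat([red_stats, white_stats], axis=1, keys=['red', 'white'])

print(combined.columns.nlevels)                    # 2 levels: key, column name
print(combined['white'])                           # the white-wine block only
print(combined.loc['mean', ('red', 'alcohol')])    # one cell, addressed by (key, column)

# Compare one attribute across both keys.
alcohol_only = combined.xs('alcohol', axis=1, level=1)
print(alcohol_only)
```

Selecting by the top-level key gives back the original block, and xs() pulls the same attribute out of every block.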

Again, use the keys parameter in pd.concat(), this time to separate the statistics by wine quality:

subset_attributes = ['alcohol', 'volatile acidity', 'pH', 'quality']

ls = round(wines[wines['quality_label'] == 'low'][subset_attributes].describe(), 2)
ms = round(wines[wines['quality_label'] == 'medium'][subset_attributes].describe(), 2)
hs = round(wines[wines['quality_label'] == 'high'][subset_attributes].describe(), 2)

pd.concat([ls, ms, hs], axis=1, 
          keys=['👎 Low Quality Wine', 
                '👌 Medium Quality Wine', 
                '👍 High Quality Wine'])
👎 Low Quality Wine
       alcohol  volatile acidity       pH  quality
count  2384.00           2384.00  2384.00  2384.00
mean      9.87              0.40     3.21     4.88
std       0.84              0.19     0.16     0.36
min       8.00              0.10     2.74     3.00
25%       9.30              0.26     3.11     5.00
50%       9.60              0.34     3.20     5.00
75%      10.40              0.50     3.31     5.00
max      14.90              1.58     3.90     5.00

👌 Medium Quality Wine
       alcohol  volatile acidity       pH  quality
count  3915.00           3915.00  3915.00  3915.00
mean     10.81              0.31     3.22     6.28
std       1.20              0.14     0.16     0.45
min       8.40              0.08     2.72     6.00
25%       9.80              0.21     3.11     6.00
50%      10.80              0.27     3.21     6.00
75%      11.70              0.36     3.33     7.00
max      14.20              1.04     4.01     7.00

👍 High Quality Wine
       alcohol  volatile acidity       pH  quality
count   198.00            198.00   198.00   198.00
mean     11.69              0.29     3.23     8.03
std       1.27              0.12     0.16     0.16
min       8.50              0.12     2.88     8.00
25%      11.00              0.21     3.13     8.00
50%      12.00              0.28     3.23     8.00
75%      12.60              0.35     3.33     8.00
max      14.00              0.85     3.72     9.00
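
If you would rather not filter the frame three times, groupby() followed by describe() gives a similar per-bucket summary in one call, with the buckets as rows instead of column groups. A sketch on a toy frame (the alcohol values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy data: two made-up alcohol readings per quality bucket.
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.8, 11.0, 12.0, 12.6],
    'quality_label': pd.Categorical(
        ['low', 'low', 'medium', 'medium', 'high', 'high'],
        categories=['low', 'medium', 'high']),
})

# observed=False keeps every declared category, even an empty one.
stats = (toy.groupby('quality_label', observed=False)['alcohol']
            .describe()
            .round(2))

print(stats[['count', 'mean', 'min', 'max']])
```

The rows follow the category order (low, medium, high) because quality_label is a Categorical.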

It's quite easy to compare and contrast these statistical measures across the different types of wine samples.

Do you notice the stark difference in some of the attributes?

We'll emphasize those in some of our visualizations later on.


🎉

Let's Get This Party Started

DJ Cat


1D: Univariate Analysis

Univariate analysis is the simplest form of data analysis and visualization: we look at a single attribute (variable) at a time, i.e. one dimension.

One of the quickest and most effective ways to visualize all numeric attributes and their distributions is to plot histograms with pandas.

The plots below give a good idea about the basic data distribution of each attribute (e.g. alcohol, chlorides, citric acid, etc.).

Note: There are 2 attributes, wine_type and quality_label, that are not plotted.

Do you know why? 🤔
wines.dtypes
fixed acidity            float64
volatile acidity         float64
citric acid              float64
residual sugar           float64
chlorides                float64
free sulfur dioxide      float64
total sulfur dioxide     float64
density                  float64
pH                       float64
sulphates                float64
alcohol                  float64
quality                    int64
wine_type                 object  <== 🤔
quality_label           category  <== 🤔
dtype: object
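
The dtypes hold the answer: DataFrame.hist() draws one histogram per numeric column, so the object and category columns are left out. A small sketch on a toy frame shows which columns count as numeric via select_dtypes():

```python
import pandas as pd

# Toy frame with the same mix of dtypes as wines.
toy = pd.DataFrame({
    'alcohol': [9.4, 12.2],                      # float64
    'quality': [5, 8],                           # int64
    'wine_type': ['red', 'white'],               # object
    'quality_label': pd.Categorical(['low', 'high'],
                                    categories=['low', 'medium', 'high']),
})

numeric_columns = toy.select_dtypes(include='number').columns.tolist()
skipped_columns = [c for c in toy.columns if c not in numeric_columns]

print(numeric_columns)   # ['alcohol', 'quality']
print(skipped_columns)   # ['wine_type', 'quality_label']
```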

Note: See the matplotlib documentation for a list of all named color codes.

# Note: DataFrame.hist() returns an array of Axes (one per numeric column), not a Figure.
axes = wines.hist(bins=15,
                  color='steelblue',
                  edgecolor='black', linewidth=1.0,
                  xlabelsize=10, ylabelsize=10,
                  xrot=45, yrot=0,
                  figsize=(10,9),
                  grid=False)

plt.tight_layout(rect=(0, 0, 1.5, 1.5))   

[Figure: histograms of every numeric attribute in wines]


1D: Continuous Numeric Attribute

Let's drill down to visualizing one of the continuous, numeric attributes: sulphates.

A histogram or a density plot is a good way to see how the values of that attribute are distributed.
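
The two plot types differ mainly in what the y-axis means: a histogram counts the values in each bin, while a density is scaled so that the total area equals 1. A minimal NumPy sketch on synthetic values (a stand-in, not the wine data):

```python
import numpy as np

# Synthetic stand-in for a continuous attribute (not the wine data).
rng = np.random.default_rng(0)
values = rng.normal(loc=0.5, scale=0.15, size=1000)

# Frequency: how many values fall into each of the 20 bins.
counts, edges = np.histogram(values, bins=20)

# Density: the same bins, rescaled so that bar areas sum to 1.
density, _ = np.histogram(values, bins=20, density=True)
area = np.sum(density * np.diff(edges))

print(counts.sum())   # 1000, every value lands in exactly one bin
print(area)           # approximately 1.0
```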

Note: To know more about fig.add_subplot(), check out this YouTube video by Harrison (aka sentdex) entitled Matplotlib Tutorial 19 - subplots.

Start from 1min 53sec.
Histogram
# Prepare the figure
fig = plt.figure( figsize=(6,4) )
title = fig.suptitle("Sulphates Content in Wine", fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.9, wspace=0.3)

# Prepare a subplot
ax = fig.add_subplot(1,1,1) # More info: https://youtu.be/afITiFR6vfw?t=1m53s
ax.set_xlabel("Sulphates")
ax.set_ylabel("Frequency")

# Add text into the subplot
ax.text(x=1.2, 
        y=800, 
        s=r'$\mu$='+str(round(wines['sulphates'].mean(), 2)), 
        fontsize=12)

freq, bins, patches = ax.hist(wines['sulphates'], 
                              bins=50,
                              color='darksalmon', 
                              edgecolor='darkred', linewidth=1.0)

[Figure: histogram of sulphates content in wine]


Density Plot
# Prepare the figure
fig = plt.figure( figsize=(6,4) )
title = fig.suptitle("Sulphates Content in Wine", fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.9, wspace=0.3)

# Prepare a subplot
ax1 = fig.add_subplot(111)
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Density") 

# Annotate: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.annotate.html
ax1.annotate('LOOK HERE!', 
             xy=(0.5, 3), 
             xytext=(1.0, 2.0),
             arrowprops=dict(facecolor='mediumaquamarine')) 

# Seaborn time!
# `fill=True` is the current spelling; `shade=True` is deprecated since seaborn 0.11.
sns.kdeplot(wines['sulphates'], 
            ax=ax1, 
            fill=True, 
            color='forestgreen')

[Figure: density plot of sulphates content in wine]
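
Roughly speaking, a density plot is a smoothed histogram: a kernel density estimate (KDE) places a small bell curve on every data point and averages them. The sketch below implements that idea by hand in NumPy. The function name gaussian_kde_1d and the bandwidth of 0.05 are assumptions made for illustration; seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Hypothetical helper: average one Gaussian bump per sample on the grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# The five sulphates values from the wines.head() preview above.
samples = [0.38, 0.45, 0.44, 0.51, 0.50]
grid = np.linspace(0.0, 1.0, 501)

density = gaussian_kde_1d(samples, grid, bandwidth=0.05)  # bandwidth picked by hand
area = np.sum(density) * (grid[1] - grid[0])
peak = grid[density.argmax()]

print(round(area, 3))   # close to 1, as a density should be
print(peak)             # location of the highest point of the curve
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual points more closely.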


Side-by-side: Histogram + Density Plot
fig = plt.figure( figsize=(12,4) )
title = fig.suptitle("Sulphates Content in Wine", fontsize=16, fontweight='bold')
fig.subplots_adjust(top=0.88, wspace=0.3)

#===========#
# Histogram #
#===========#
ax1 = fig.add_subplot(1,2,1)
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Frequency")

ax1.text(x=1.2, y=800, 
         s=r'$\mu$='+str(round(wines['sulphates'].mean(),2)), 
         fontsize=12)

freq, bins, patches = ax1.hist(wines['sulphates'], 
                               bins=40,
                               color='darksalmon',
                               edgecolor='darkred', linewidth=1)

#==============#
# Density Plot #
#==============#
ax2 = fig.add_subplot(1,2,2) 
#ax2 = ax1.twinx() # https://youtu.be/OebyvmZo3w0?t=1m42s
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Density") 
# `fill=True` is the current spelling; `shade=True` is deprecated since seaborn 0.11.
sns.kdeplot(wines['sulphates'], ax=ax2, fill=True, color='forestgreen')

#=============#
# Save Figure #
#=============#
# fig.savefig('sulphates_content_in_wine_side-by-side.jpg')

[Figure: histogram and density plot of sulphates content in wine, side by side]

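Bonus: Uncomment the line ax2 = ax1.twinx() to overlay the density plot on the histogram, so that both share one x-axis. (The empty right-hand subplot created by fig.add_subplot(1,2,2) remains unless you also comment out that line.) For a more detailed explanation, please refer to this YouTube video by Harrison (aka sentdex).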

Start from 1min 50sec.
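
Here is a standalone sketch of the twinx() idea, assuming synthetic values instead of the wine data and a plain line through the normalized bin heights instead of seaborn's KDE curve. The Agg backend line is only there so the sketch runs without a display; drop it inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')   # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for wines['sulphates'].
rng = np.random.default_rng(0)
values = rng.normal(loc=0.5, scale=0.15, size=500)

fig, ax1 = plt.subplots(figsize=(6, 4))
ax1.hist(values, bins=40, color='darksalmon', edgecolor='darkred', linewidth=1)
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Frequency")

# Second Axes on top of the first: same x-axis, separate y-axis on the right.
ax2 = ax1.twinx()
density, edges = np.histogram(values, bins=40, density=True)
centers = (edges[:-1] + edges[1:]) / 2
ax2.plot(centers, density, color='forestgreen')
ax2.set_ylabel("Density")
```

Because the two Axes keep independent y-scales, frequency and density can be read from the left and right axes respectively.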

1D: Discrete Categorical Attribute

wines.dtypes
fixed acidity            float64
volatile acidity         float64
citric acid              float64
residual sugar           float64
chlorides                float64
free sulfur dioxide      float64
total sulfur dioxide     float64
density                  float64
pH                       float64
sulphates                float64
alcohol                  float64
quality                    int64  <== 🤔
wine_type                 object  
quality_label           category
dtype: object
w_q = wines['quality'].value_counts()
w_q = (list(w_q.index), list(w_q.values))

print( w_q[0] )
print( w_q[1] )
[6, 5, 7, 4, 8, 3, 9]
[2836, 2138, 1079, 216, 193, 30, 5]
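
Note that value_counts() orders its result by frequency, not by score, which is why the list above starts with 6. For the bar chart this does not matter, since ax.bar() places each bar at its numeric x value, but sort_index() makes the printed lists easier to read. A sketch on a toy series:

```python
import pandas as pd

# Toy scores: four 6s, three 5s, two 7s and one 9.
quality = pd.Series([6] * 4 + [5] * 3 + [7] * 2 + [9])

by_count = quality.value_counts()   # most frequent score first
by_score = by_count.sort_index()    # ascending score instead

print(by_count.index.tolist())   # [6, 5, 7, 9]
print(by_score.index.tolist())   # [5, 6, 7, 9]
print(by_score.tolist())         # [3, 4, 2, 1]
```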
fig = plt.figure(figsize=(6, 4))
title = fig.suptitle("Wine Quality Frequency", fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.9, wspace=0.3)

ax = fig.add_subplot(1,1,1)
ax.set_xlabel("Quality")
ax.set_ylabel("Frequency") 
ax.tick_params(axis='both', which='major', labelsize=8.5)

bar = ax.bar(w_q[0],   # i.e. [6, 5, 7, 4, 8, 3, 9]
             w_q[1], # i.e. [2836, 2138, 1079, 216, 193, 30, 5]
             width=0.85,
             color='plum', 
             edgecolor='black', linewidth=1)

[Figure: bar chart of wine quality frequency]


😎 That's it for Part #1 😎


~ The Complete Seaborn Series ~

Part #1 (📍)

Part #2 (2D)

Part #3 (3D)


If you enjoyed this post and want to buy me a cup of coffee...

The thing is, I'll always accept a cup of coffee. So feel free to buy me one.

Cheers! ☕️