I was asked by the School of Computing @ NUS (which is also my alma mater) to do an interview with TODAY on computer science graduates and industry developments.
Here's the screenshot of my interview summary:
Here's the link to the TODAY article:
The Big Read: Nerds and geeks no more, computing graduates now rule the roost.
How did Geoffrey Hinton come up with the idea of Dropout?
Hinton says he was inspired by, among other things, a fraud-prevention mechanism used by banks.
In his own words:
I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting.
The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that aren't significant (what Hinton refers to as conspiracies), which the network will start memorizing if no noise is present.
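To make the idea concrete, here's a minimal pure-Python sketch of inverted dropout, the variant used in most frameworks: each activation is zeroed with probability `p`, and the survivors are scaled by `1/(1-p)` so the layer's expected output is unchanged. The function name and values are illustrative, not taken from any particular library.

```python
import random

def dropout(activations, p=0.5, seed=None):
    """Zero each activation with probability p; scale survivors by 1/(1-p)
    so the expected sum of the layer's output stays the same."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

layer_output = [0.2, 1.5, 0.7, 0.9, 1.1, 0.3]
print(dropout(layer_output, p=0.5, seed=0))
```

At test time no units are dropped; the `1/(1-p)` scaling during training is what keeps train-time and test-time activations on the same scale.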
A loss function (or objective function, or optimization score function) is one of the three parameters (the first one, actually) required to compile a model:
model.compile(loss='categorical_crossentropy', # <== LOOK HERE!
optimizer='adam',
metrics=['accuracy'])
We often see `categorical_crossentropy` used in multiclass classification tasks. At the same time, there's also `sparse_categorical_crossentropy`, which begs the question: what's the difference between these two loss functions?

- `categorical_crossentropy` expects one-hot encoded targets, e.g. `[1,0,0]`, `[0,1,0]`, `[0,0,1]`.
- `sparse_categorical_crossentropy` expects integer targets, e.g. `1`, `2`, `3`.
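Under the hood, both losses compute the same quantity; only the label format differs. Here's a minimal pure-Python sketch (not the Keras implementation) for a single sample:

```python
import math

def categorical_crossentropy(y_true_onehot, y_pred):
    # Expects a one-hot target, e.g. [0, 1, 0]
    return -sum(t * math.log(p) for t, p in zip(y_true_onehot, y_pred))

def sparse_categorical_crossentropy(y_true_index, y_pred):
    # Expects an integer class index, e.g. 1
    return -math.log(y_pred[y_true_index])

probs = [0.1, 0.7, 0.2]   # a model's predicted probabilities for one sample
loss_a = categorical_crossentropy([0, 1, 0], probs)
loss_b = sparse_categorical_crossentropy(1, probs)
print(loss_a, loss_b)     # identical: the labels differ, the loss does not
```

The sparse variant simply skips materializing the one-hot vector, which saves memory for large label sets.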
If you enjoyed this post and want to buy me a cup of coffee...
The thing is, I'll always accept a cup of coffee. So feel free to buy me one.
Cheers! ☕️
A one-hot encoding is a representation of categorical variables (e.g. `cat`, `dog`, `rat`) as binary vectors (e.g. `[1,0,0]`, `[0,1,0]`, `[0,0,1]`).
Suppose `cat` is mapped to `1`, `dog` is mapped to `2`, and `rat` is mapped to `3`. Then `1` is mapped to `[1,0,0]`, `2` is mapped to `[0,1,0]`, and `3` is mapped to `[0,0,1]`.

The wonderful Keras library offers a function called `to_categorical()` that allows you to one-hot encode your integer data. Here's how:
import numpy as np
from keras.utils import to_categorical
data = np.array([1, 5, 3, 8])
print(data)
[1 5 3 8]
def encode(data):
print('Shape of data (BEFORE encode): %s' % str(data.shape))
encoded = to_categorical(data)
print('Shape of data (AFTER encode): %s\n' % str(encoded.shape))
return encoded
encoded_data = encode(data)
print(encoded_data)
Shape of data (BEFORE encode): (4,)
Shape of data (AFTER encode): (4, 9)

[[0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
def decode(datum):
return np.argmax(datum)
for i in range(encoded_data.shape[0]):
datum = encoded_data[i]
print('index: %d' % i)
print('encoded datum: %s' % datum)
decoded_datum = decode(encoded_data[i])
print('decoded datum: %s' % decoded_datum)
print()
index: 0
encoded datum: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
decoded datum: 1

index: 1
encoded datum: [0. 0. 0. 0. 0. 1. 0. 0. 0.]
decoded datum: 5

index: 2
encoded datum: [0. 0. 0. 1. 0. 0. 0. 0. 0.]
decoded datum: 3

index: 3
encoded datum: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
decoded datum: 8
Google Colab (https://colab.research.google.com/) is Google's collaborative version of the Jupyter/IPython notebook-based editing environment. They released the tool to the general public with the noble goal of disseminating machine learning education and research.
You should be excited because even Chris Olah is excited:
Wow, my favorite internal Google tool is now public! https://t.co/eq7Pu9VtHf (think iPython + Google Drive)
— Chris Olah (@ch402) October 27, 2017
So much of my life is in colab.
Its newest feature is the ability to use a GPU as a backend, for free, for up to 12 hours at a time. In practice, this means we can keep using a GPU beyond the 12-hour limit simply by connecting to a different VM.
1. Go to `File → Upload Notebook` to upload your notebook, or simply enter your code in the cells.
2. Go to `Edit → Notebook Settings` and set `Hardware accelerator` to `GPU`.
3. Enter these lines of code into the cells:
!pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl
!pip3 install torchvision
The output should look something like this:
And that's it!
Take a look at my Colab Notebook that uses PyTorch to train a feedforward neural network on the MNIST dataset with an accuracy of 98%.
Link to my Colab notebook: https://goo.gl/4U46tA
The focus here isn't on the DL/ML part, but on the Colab + GPU setup itself.
Here's a preview of the aforementioned notebook:
Most of the state-of-the-art NLP applications, e.g. machine translation and summarization, are now based on recurrent neural networks (RNNs). And more often than not, we'll need to choose a word representation beforehand.
Here are two ways of creating word representations:
One-hot Encoding: A simple method is to represent each word using a one-hot vector. Suppose your vocabulary contains 50K words; then the n-th word would be represented as a 50K-dimensional vector, full of 0s except for a 1 at the n-th position. However, with such a large vocabulary of 50K words, this sparse representation is very inefficient.
Word Embeddings (⭐️): Ideally, you'd want similar words to have similar representations, making it easy for the model to generalize what it learns about a word to all similar words. For example, the representation for "car" should be more similar to "lorry" than, say, "pasta". This is the idea behind word embeddings.
In a Nutshell:
Word embeddings provide a dense representation of words and their relative meanings.
They are an improvement over the sparse representations used in simpler bag-of-words models.
Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.
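As a toy illustration of the difference, here's a sketch with a three-word vocabulary and made-up embedding values (all numbers are purely illustrative):

```python
vocab = ['car', 'lorry', 'pasta']
word2index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: a 1 at the word's index, 0 elsewhere."""
    v = [0] * len(vocab)
    v[word2index[word]] = 1
    return v

# Dense representation: a lookup into an embedding matrix
# (these 2-dimensional values are made up for illustration)
embedding_matrix = [
    [0.90, 0.80],   # car
    [0.85, 0.75],   # lorry -- close to "car"
    [-0.70, 0.10],  # pasta -- far from both
]

def embed(word):
    return embedding_matrix[word2index[word]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Distinct one-hot vectors are always orthogonal (similarity 0)...
print(dot(one_hot('car'), one_hot('lorry')))          # 0
# ...whereas embeddings let similar words have similar vectors
print(dot(embed('car'), embed('lorry')) > dot(embed('car'), embed('pasta')))  # True
```

A real embedding layer is exactly this kind of lookup table, except the matrix values are learned (or loaded from pre-trained vectors like GloVe).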
Let's explore two different ways to add an embedding layer in Keras:
import re
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
# Define documents
docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']
# Define class labels
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
Note that we're using a Keras Sequential Model here to do the job.
One-hot encode the documents in `docs`:
own_embedding_vocab_size = 10
encoded_docs_oe = [one_hot(d, own_embedding_vocab_size) for d in docs]
print(encoded_docs_oe)
Output:
[[2, 6], [5, 9], [2, 9], [4, 9], [2], [7], [2, 9], [3, 5], [2, 9], [5, 9, 6, 3]]
Pad each document to ensure they are of the same length:
maxlen = 5
padded_docs_oe = pad_sequences(encoded_docs_oe, maxlen=maxlen, padding='post')
print(padded_docs_oe)
Output:
[[2 6 0 0 0]
[5 9 0 0 0]
[2 9 0 0 0]
[4 9 0 0 0]
[2 0 0 0 0]
[7 0 0 0 0]
[2 9 0 0 0]
[3 5 0 0 0]
[2 9 0 0 0]
[5 9 6 3 0]]
Define the model:
model = Sequential()
model.add(Embedding(input_dim=own_embedding_vocab_size, # 10
output_dim=32,
input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Compile and train the model:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc']) # Compile the model
print(model.summary()) # Summarize the model
model.fit(padded_docs_oe, labels, epochs=50, verbose=0) # Fit the model
loss, accuracy = model.evaluate(padded_docs_oe, labels, verbose=0) # Evaluate the model
print('Accuracy: %0.3f' % accuracy)
> _________________________________________________________________
> Layer (type) Output Shape Param #
> =================================================================
> embedding_1 (Embedding) (None, 5, 32) 320
> _________________________________________________________________
> flatten_1 (Flatten) (None, 160) 0
> _________________________________________________________________
> dense_1 (Dense) (None, 1) 161
> =================================================================
> Total params: 481
> Trainable params: 481
> Non-trainable params: 0
> _________________________________________________________________
> None
> Accuracy: 0.800
Note that we're using a Keras Functional Model here to do the job.
(⭐️) Download and use the `load_glove_embeddings()` function:
from load_glove_embeddings import load_glove_embeddings
word2index, embedding_matrix = load_glove_embeddings('data_embeddings/en/glove.6B.50d.txt', embedding_dim=50)
Encode the documents in `docs` with our special `custom_tokenize()` function, which requires the `word2index` variable from the previous step:
def custom_tokenize(docs):
output_matrix = []
for d in docs:
indices = []
for w in d.split():
indices.append(word2index[re.sub(r'[^\w\s]','',w).lower()])
output_matrix.append(indices)
return output_matrix
# Encode docs with our special "custom_tokenize" function
encoded_docs_ge = custom_tokenize(docs)
print(encoded_docs_ge)
Output:
[[143, 751], [219, 161], [353, 968], [3082, 161], [4345], [2690], [992, 968], [36, 219], [992, 161], [94, 33, 751, 439]]
Pad each document to ensure they are of the same length:
# Pad documents to a max length of 5 words
maxlen = 5
padded_docs_ge = pad_sequences(encoded_docs_ge, maxlen=maxlen, padding='post')
print(padded_docs_ge)
Output:
[[ 143 751 0 0 0]
[ 219 161 0 0 0]
[ 353 968 0 0 0]
[3082 161 0 0 0]
[4345 0 0 0 0]
[2690 0 0 0 0]
[ 992 968 0 0 0]
[ 36 219 0 0 0]
[ 992 161 0 0 0]
[ 94 33 751 439 0]]
Define the model (note that the `embedding_matrix` variable is required here):
from keras.models import Model
from keras.layers import Input
embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
output_dim=embedding_matrix.shape[1],
input_length=maxlen,
weights=[embedding_matrix],
trainable=False,
name='embedding_layer')
i = Input(shape=(maxlen,), dtype='int32', name='main_input')
x = embedding_layer(i)
x = Flatten()(x)
o = Dense(1, activation='sigmoid')(x)
model = Model(inputs=i, outputs=o)
Compile and train the model:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc']) # Compile the model
print(model.summary()) # Summarize the model
model.fit(padded_docs_ge, labels, epochs=50, verbose=0) # Fit the model
loss, accuracy = model.evaluate(padded_docs_ge, labels, verbose=0) # Evaluate the model
print('Accuracy: %0.3f' % accuracy)
> _________________________________________________________________
> Layer (type) Output Shape Param #
> =================================================================
> main_input (InputLayer) (None, 5) 0
> _________________________________________________________________
> embedding_layer (Embedding) (None, 5, 50) 20000050
> _________________________________________________________________
> flatten_2 (Flatten) (None, 250) 0
> _________________________________________________________________
> dense_2 (Dense) (None, 1) 251
> =================================================================
> Total params: 20,000,301
> Trainable params: 251
> Non-trainable params: 20,000,050
> _________________________________________________________________
> None
> Accuracy: 1.000
Here's the main difference:

(1) Learning your own embedding:

embedding_layer_1 = Embedding(input_dim=own_embedding_vocab_size,
                              output_dim=32,
                              input_length=maxlen)

(2) Using pre-trained GloVe embeddings (note the extra `weights=[embedding_matrix]` and `trainable=False` arguments):

embedding_layer_2 = Embedding(input_dim=embedding_matrix.shape[0],
                              output_dim=embedding_matrix.shape[1],
                              input_length=maxlen,
                              weights=[embedding_matrix],
                              trainable=False)
A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the wisdom of the crowd.
Likewise, if you aggregate the predictions of a group of predictors (e.g. a decision tree classifier, an SVM, logistic regression), you will often get better predictions than with the best individual predictor.
We'll cover three popular ensemble methods: Voting Classifiers, Bagging (including Random Forests), and Boosting.
To create a `VotingClassifier`, simply aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.
Here's how it's done in Scikit-Learn:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
# Gather a set of predictors
estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)]
# Initialize the Voting Classifier
voting_clf = VotingClassifier(estimators=estimators,
voting='hard')
# Fit the data
voting_clf.fit(X_train, y_train)
If all classifiers are able to estimate class probabilities (i.e. they have a `predict_proba()` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.
This is called soft voting, and it often achieves higher performance than hard voting because it gives more weight to highly confident votes.
To perform soft voting, all you need to do is replace `voting='hard'` with `voting='soft'` and ensure that all classifiers can estimate class probabilities.
Note: The `SVC` class can't estimate class probabilities by default, so you'll need to set its `probability` hyperparameter to `True`. This makes the `SVC` class use cross-validation to estimate class probabilities (which slows down training) and adds a `predict_proba()` method.
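Here's a small pure-Python sketch of the difference between hard and soft voting for a single sample, using made-up class probabilities from three hypothetical classifiers:

```python
def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical class probabilities (classes 0/1/2) from three classifiers
p_lr  = [0.1, 0.8, 0.1]    # very confident in class 1
p_rf  = [0.4, 0.3, 0.3]    # barely prefers class 0
p_svc = [0.45, 0.35, 0.2]  # barely prefers class 0

# Hard voting: each classifier casts one vote for its most likely class
votes = [argmax(p) for p in (p_lr, p_rf, p_svc)]
hard_pred = max(set(votes), key=votes.count)   # class 0 wins with two votes

# Soft voting: average the probabilities, then take the argmax
avg = [sum(ps) / 3 for ps in zip(p_lr, p_rf, p_svc)]
soft_pred = argmax(avg)                        # class 1 wins on confidence

print(hard_pred, soft_pred)
```

Note how the single highly confident classifier sways the soft vote but not the hard one; this is exactly why soft voting gives more weight to confident votes.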
Another approach is to use the same training algorithm (e.g. logistic regression) for every predictor, but to train them on different random subsets of the training set.
For example (where `lr_predictor` refers to an instance of logistic regression):

- `lr_predictor_1` trained on `subset_1_of_training_set`
- `lr_predictor_2` trained on `subset_2_of_training_set`
- `lr_predictor_3` trained on `subset_3_of_training_set`
This is called bagging, which is short for bootstrap aggregating.
For bagging, sampling is performed with replacement.
When sampling is performed without replacement, it is called pasting.
In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
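The sampling difference can be sketched in a few lines with Python's standard library (`random.choices` samples with replacement, `random.sample` without):

```python
import random

training_set = list(range(10))   # ten training-instance indices
rng = random.Random(42)

# Bagging: sample WITH replacement -- the same instance can appear twice
bag = rng.choices(training_set, k=10)

# Pasting: sample WITHOUT replacement -- every sampled instance is distinct
paste = rng.sample(training_set, k=7)

print('bag:  ', bag)     # duplicates are possible here
print('paste:', paste)   # no duplicates
```

Each predictor in the ensemble would be trained on its own such sample.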
The main advantages of bagging are that the resulting ensemble usually has a lower variance than a single predictor trained on the whole training set, and that the predictors can all be trained in parallel.
Armed with the above knowledge, let's look into Random Forests, which is a specific instance of bagging.
A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method, typically with `max_samples` set to the size of the training set.
In other words: Random Forests = Bagging of Decision Trees
However, instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, you can use the `RandomForestClassifier` class, which is more convenient and optimized for Decision Trees.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500,
max_leaf_nodes=16,
n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
The following `BaggingClassifier` is roughly equivalent to the previous `RandomForestClassifier`:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter='random',
                           max_leaf_nodes=16),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=True,
    n_jobs=-1)
One of the great qualities of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.
You can access the result using the `feature_importances_` variable. Here's an example:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
pd.DataFrame({'feature': iris['feature_names'],
              'importance': rnd_clf.feature_importances_}).set_index('feature')
| feature           | importance |
|-------------------|------------|
| sepal length (cm) | 0.096814   |
| sepal width (cm)  | 0.022546   |
| petal length (cm) | 0.433662   |
| petal width (cm)  | 0.446977   |
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.
Gradient Boosting is by far the most popular boosting method. It works by sequentially adding predictors to an ensemble, with each new predictor fit to the residual errors made by its predecessor.
To better illustrate this, let's go through a simple regression example using Decision Trees as the base predictors.
First, let's fit a `DecisionTreeRegressor` to the training set:
from sklearn.tree import DecisionTreeRegressor
tree_reg_1 = DecisionTreeRegressor(max_depth=5)
tree_reg_1.fit(X, y)
Now train a second `DecisionTreeRegressor` on the residual errors made by the first predictor:
y2 = y - tree_reg_1.predict(X)
tree_reg_2 = DecisionTreeRegressor(max_depth=5)
tree_reg_2.fit(X, y2)
Then we train a third regressor on the residual errors made by the second predictor:
y3 = y2 - tree_reg_2.predict(X)
tree_reg_3 = DecisionTreeRegressor(max_depth=5)
tree_reg_3.fit(X, y3)
We now have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:
trees = [tree_reg_1, tree_reg_2, tree_reg_3]
y_pred = sum(t.predict(X_new) for t in trees)
We can use Scikit-Learn's `GradientBoostingRegressor` to condense the above code:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=5,
n_estimators=3,
learning_rate=1.0)
gbrt.fit(X, y)
The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y)
# Initialize and fit GBRT
gbrt = GradientBoostingRegressor(max_depth=5, n_estimators=120)
gbrt.fit(X_train, y_train)
# Measure the validation error at each stage of training
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
# Find the optimal number of trees (argmin returns a 0-based index, so add 1)
best_n_estimators = np.argmin(errors) + 1
# Train another GBRT ensemble using the optimal number of trees
gbrt_best = GradientBoostingRegressor(max_depth=5,
n_estimators=best_n_estimators)
gbrt_best.fit(X_train, y_train)
The `GradientBoostingRegressor` class also supports a `subsample` hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if `subsample=0.25`, then each tree is trained on 25% of the training instances, selected randomly.
This technique is known as Stochastic Gradient Boosting: it trades a higher bias for a lower variance, and it also speeds up training considerably.
This post explores the use of a causal analysis tool that allows us to generate a partial causal graph from purely observational data.
And because cryptocurrency is the new investment hype, let's attempt to make sense of crypto prices using causal graph inference! 🤑
People often confuse correlation with causation. The following charts should illustrate my point:
Note: You can find more entertaining correlations from Spurious Correlations.
In short:
We'll talk about causality from the perspective of the Pearlian causal framework, which uses graphs as the language of causality.
In other words, to represent X causes Y explicitly, we simply use graphs:
It's intuitive and pictorial, and lets you talk about causal pathways from one variable to another: if you can put together a chain of cause and effect going from X to Y, then X might have a causal effect on Y. In that framework, it’s easy to enumerate the consequences of actions.
But how do you find the graph? We'll use a causal analysis tool that's able to give you a partial causal graph from purely observational data. However, before we proceed to the next section, be sure to install the following:
pip install causality
pip install networkx
To retrieve crypto data, load the code from this blog post. We're only interested in the variable `combined_df`, which is a `pandas.DataFrame` object that contains the value of each cryptocurrency (in USD) over time.

You should see the following table (i.e. the last 5 rows) when you execute `combined_df.tail()`:
| date | DASH | ETC | ETH | LTC | SC | STR | XEM | XMR | XRP | BTC |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-18 | 1093.617265 | 37.237390 | 743.536101 | 329.416791 | 0.014948 | 0.265881 | 0.755417 | 350.778285 | 0.731127 | 18684.557924 |
| 2017-12-19 | 1155.879642 | 39.012279 | 811.639229 | 346.905402 | 0.015313 | 0.272750 | 0.923820 | 363.436691 | 0.762403 | 18015.201393 |
| 2017-12-20 | 1347.509546 | 38.341707 | 780.718146 | 315.366494 | 0.018790 | 0.235122 | 0.889440 | 403.312159 | 0.701708 | 16628.158457 |
| 2017-12-21 | 1402.437089 | 38.373994 | 791.643752 | 302.663058 | 0.022633 | 0.251669 | 0.945312 | 425.916859 | 0.943081 | 15938.493055 |
| 2017-12-22 | 1247.069623 | 32.871011 | 748.259744 | 283.881877 | 0.019761 | 0.245783 | 0.867034 | 365.363276 | 1.127947 | 15438.632917 |
As always, we'll need to import the required packages.
from causality.inference.search import IC
from causality.inference.independence_tests import RobustRegressionTest
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
# Set figure width to 15 and height to 12
plt.rcParams["figure.figsize"] = [15., 12.]
We'll proceed to perform causal analysis on our cryptocurrency prices. The output will be a variable called `graph`.
# Define the variable types: 'c' is 'continuous'.
# The variables defined here are the ones the search is performed over,
# i.e. the columns in our DataFrame that represent various cryptocurrencies.
variable_types = {x:'c' for x in combined_df.columns}
# Run the IC* algorithm (IC = Inductive Causation)
ic_algorithm = IC(RobustRegressionTest, alpha=0.1)
graph = ic_algorithm.search(data=combined_df, variable_types=variable_types)
Let's take a peek at the nodes in the `graph`:
graph.nodes()
Output:
['DASH', 'ETH', 'SC', 'XEM', 'XRP', 'BTC', 'STR', 'XMR', 'LTC', 'ETC']
And then the edges in the `graph`:
graph.edges(data=True)
Output:
[('DASH', 'ETH', {'arrows': ['ETH', 'ETH'], 'marked': False}),
('DASH', 'ETC', {'arrows': ['ETC'], 'marked': False}),
('DASH', 'XMR', {'arrows': [], 'marked': False}),
('ETH', 'XEM', {'arrows': ['ETH', 'XEM'], 'marked': False}),
('ETH', 'ETC', {'arrows': ['ETH', 'ETH'], 'marked': True}),
('ETH', 'XRP', {'arrows': ['ETH', 'ETH', 'XRP', 'XRP'], 'marked': False}),
('SC', 'XRP', {'arrows': ['XRP', 'XRP', 'XRP'], 'marked': False}),
('SC', 'ETC', {'arrows': ['ETC'], 'marked': False}),
('XEM', 'XRP', {'arrows': ['XRP', 'XRP'], 'marked': True}),
('XEM', 'LTC', {'arrows': ['XEM', 'LTC'], 'marked': False}),
('XRP', 'STR', {'arrows': ['XRP', 'XRP', 'STR'], 'marked': False}),
('BTC', 'STR', {'arrows': ['BTC', 'STR'], 'marked': False}),
('BTC', 'XMR', {'arrows': ['BTC'], 'marked': False}),
('STR', 'LTC', {'arrows': ['LTC'], 'marked': True}),
('XMR', 'LTC', {'arrows': ['LTC'], 'marked': False})]
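The `'marked': True` flag identifies the edges the IC* algorithm considers genuinely causal. For illustration, here's how you could filter them out of the edge list (using a hand-copied subset of the output above):

```python
# A hand-copied subset of graph.edges(data=True) from the output above
edges = [
    ('DASH', 'XMR', {'arrows': [], 'marked': False}),
    ('ETH', 'ETC', {'arrows': ['ETH', 'ETH'], 'marked': True}),
    ('XEM', 'XRP', {'arrows': ['XRP', 'XRP'], 'marked': True}),
    ('STR', 'LTC', {'arrows': ['LTC'], 'marked': True}),
]

# Keep only the edges the IC* algorithm marked as genuinely causal
causal = [(u, v) for u, v, attr in edges if attr['marked']]
print(causal)   # [('ETH', 'ETC'), ('XEM', 'XRP'), ('STR', 'LTC')]
```

The `arrows` attribute records which endpoint each detected arrowhead points to; the marked edges are the ones we'll label CAUSAL in the plot below.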
In order to generate a visually appealing graph, we'll perform the following:
4.1 Nodes: Generate unique colors per node. This is based on a previous blog post on generating a range of "n" colors.
from colour import Color
def get_color_range(n, output_type='hex'):
red = Color('red')
blue = Color('blue')
color_range = list(red.range_to(blue, n))
if output_type == 'hex':
return [c.get_hex_l() for c in color_range]
else:
return [c.get_rgb() for c in color_range]
n_nodes = len(graph.nodes())
n_colors = get_color_range(n_nodes)
4.2 Edges: Sanitize the edges by removing duplicated arrows.
# Sanitize graph edges: remove duplicated arrows
sanitized_edges = []
for t in graph.edges(data=True):
attr = t[2]
attr['arrows'] = list(set(attr['arrows']))
sanitized_edges.append((t[0], t[1], attr))
4.3 Edge Labels: Sanitize edge labels by removing labels marked as False, while converting labels marked as True into CAUSAL labels.
edge_labels = [((u,v,),'CAUSAL') if d['marked'] else ((u,v,),'') for u,v,d in graph.edges(data=True)]
edge_labels = dict(edge_labels)
# Draw the graph; G refers to the causal graph obtained earlier
G = graph
pos = nx.spring_layout(G, k=2000, iterations=1000)
# Add nodes
nx.draw_networkx_nodes(G, pos, node_color=n_colors, node_size=2000)
# Add labels to nodes
nx.draw_networkx_labels(G, pos, font_size=12)
# Add edges
nx.draw_networkx_edges(G, pos, edgelist=sanitized_edges,
arrows=True,
width=2.0,
edge_color='#1C2833',
style='dotted',
alpha=0.8)
# Add labels to edges
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels,
font_size=12,
font_weight='bold')
# plt.draw()
plt.show()
Analysis:
Thus, based on historical crypto prices, the algorithm believes that the relationships it marked above (ETH–ETC, XEM–XRP, and STR–LTC) are genuinely causal.
I hope that by now, you should realize the power and importance of causal graph inference — and its potential to make you rich 😛.
I don't usually share articles on this blog, but when I do, they are must reads.
http://www.fast.ai/2017/12/18/personal-brand/
In particular (on blogging):
- It’s like a resume, only better. I know of a few people who have had blog posts lead to job offers!
- Helps you learn. Organizing knowledge always helps me synthesize my own ideas. One of the tests of whether you understand something is whether you can explain it to someone else. A blog post is a great way to do that.
- I’ve gotten invitations to conferences and invitations to speak from my blog posts. I was invited to the TensorFlow Dev Summit (which was awesome!) for writing a blog post about how I don’t like TensorFlow.
- Meet new people. I’ve met several people who have responded to blog posts I wrote.
- Saves time. Any time you answer a question multiple times through email, you should turn it into a blog post, which makes it easier for you to share the next time someone asks.
- It can be intimidating to start blogging, but remember that your target audience is you-6-months-ago, not Geoffrey Hinton. What would have been most helpful to your slightly younger self? You are best positioned to help people one step behind you. The material is still fresh in your mind. Many experts have forgotten what it was like to be a beginner (or an intermediate). The context of your particular background, your particular style, and your knowledge level will give a different twist to what you’re writing about.
Thank you, Deb-san, for sharing this awesome article.
When training machine learning algorithms, one technique that will speed up training is feature scaling.
In this post, we'll explore four feature-scaling methods implemented in scikit-learn:

- `StandardScaler`
- `MinMaxScaler`
- `RobustScaler`
- `Normalizer`
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
matplotlib.style.use('ggplot')
The `StandardScaler` assumes that your data is normally distributed within each feature, and it scales each feature such that the distribution is now centered on `0` with a standard deviation of `1`.

The mean and standard deviation are calculated separately for each feature, and the feature is then scaled based on:
$$\frac{x_i - \text{mean($\boldsymbol{x}$)}}{\text{stdev($\boldsymbol{x}$)}}$$
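As a sanity check, here's a pure-Python sketch of that formula (note that scikit-learn's `StandardScaler` uses the population standard deviation):

```python
import statistics

def standardize(xs):
    """Subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.mean(xs)
    stdev = statistics.pstdev(xs)
    return [(x - mean) / stdev for x in xs]

scaled = standardize([2.0, 4.0, 6.0, 8.0])
print(scaled)   # the result has mean 0 and standard deviation 1
```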
Let's start coding.
# Create data samples x1, x2, x3
np.random.seed(1)
df = pd.DataFrame({
'x1': np.random.normal(0, 2, 10000),
'x2': np.random.normal(5, 3, 10000),
'x3': np.random.normal(-5, 5, 10000)
})
# Use StandardScaler
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['x1', 'x2', 'x3'])
# Plot and visualize
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(df['x1'], ax=ax1)
sns.kdeplot(df['x2'], ax=ax1)
sns.kdeplot(df['x3'], ax=ax1)
ax2.set_title('After Standard Scaler')
sns.kdeplot(scaled_df['x1'], ax=ax2)
sns.kdeplot(scaled_df['x2'], ax=ax2)
sns.kdeplot(scaled_df['x3'], ax=ax2)
plt.show()
As you can see, all features are now on the same scale (relative to one another).
Tip: If you use this to scale your training data, make sure to use the same mean and standard deviation to normalize your test set.
The `MinMaxScaler` is probably the most famous scaling algorithm, and it follows this formula for each feature:
$$\frac{x_i - \text{min}(\boldsymbol{x})}{\text{max}(\boldsymbol{x}) - \text{min}(\boldsymbol{x})}$$
Basically, it shrinks the range such that it now lies between `0` and `1` (by default; you can choose a different target range via the `feature_range` parameter).
The `MinMaxScaler` works well for cases when the distribution is not Gaussian or when the standard deviation is very small. However, it is sensitive to outliers; this can be rectified by the `RobustScaler`, which we will see soon.
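A pure-Python sketch of the min-max formula, which also shows the outlier sensitivity:

```python
def min_max_scale(xs):
    """Map min(xs) to 0 and max(xs) to 1, linearly in between."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(min_max_scale([5.0, 10.0, 15.0, 20.0]))
# A single large outlier squashes all the other values toward 0:
print(min_max_scale([5.0, 10.0, 15.0, 20.0, 1000.0]))
```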
# Create data samples x1, x2, x3
df = pd.DataFrame({
# positive skew
'x1': np.random.chisquare(8, 1000),
# negative skew
'x2': np.random.beta(8, 2, 1000) * 40,
# no skew
'x3': np.random.normal(50, 3, 1000)
})
# Use MinMaxScaler
scaler = preprocessing.MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['x1', 'x2', 'x3'])
# Plot and visualize
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(df['x1'], ax=ax1)
sns.kdeplot(df['x2'], ax=ax1)
sns.kdeplot(df['x3'], ax=ax1)
ax2.set_title('After Min-Max Scaling')
sns.kdeplot(scaled_df['x1'], ax=ax2)
sns.kdeplot(scaled_df['x2'], ax=ax2)
sns.kdeplot(scaled_df['x3'], ax=ax2)
plt.show()
While the skewness of the distributions is maintained, the three distributions are brought onto the same scale so that they overlap.
The `RobustScaler` uses a method similar to the `MinMaxScaler`, but it centers on the median and scales by the interquartile range (IQR) instead of the min and max, which makes it robust to outliers. It follows this formula for each feature, where $Q_2$ is the median:

$$\frac{x_i - Q_2(\boldsymbol{x})}{Q_3(\boldsymbol{x}) - Q_1(\boldsymbol{x})}$$
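A pure-Python sketch of that formula (note that `statistics.quantiles` uses a slightly different quartile convention than scikit-learn's interpolation, so values may differ marginally from `RobustScaler`'s output):

```python
import statistics

def robust_scale(xs):
    """Center each value on the median (Q2), scale by the IQR (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method='inclusive')
    return [(x - q2) / (q3 - q1) for x in xs]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # one large outlier
scaled = robust_scale(data)
print(scaled)
```

Because the median and IQR barely move when an outlier is added, the bulk of the data stays on a compact scale while the outlier remains clearly separated.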
As usual, let's look at a few visualizations to get a better understanding.
# Create data samples x1, x2
x = pd.DataFrame({
# Distribution with lower outliers
'x1': np.concatenate([np.random.normal(20, 1, 1000), np.random.normal(1, 1, 25)]),
# Distribution with higher outliers
'x2': np.concatenate([np.random.normal(30, 1, 1000), np.random.normal(50, 1, 25)]),
})
# Use RobustScaler
scaler = preprocessing.RobustScaler()
robust_scaled_df = scaler.fit_transform(x)
robust_scaled_df = pd.DataFrame(robust_scaled_df, columns=['x1', 'x2'])
# Use MinMaxScaler
scaler = preprocessing.MinMaxScaler()
minmax_scaled_df = scaler.fit_transform(x)
minmax_scaled_df = pd.DataFrame(minmax_scaled_df, columns=['x1', 'x2'])
# Plot and visualize
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(9, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(x['x1'], ax=ax1)
sns.kdeplot(x['x2'], ax=ax1)
ax2.set_title('After Robust Scaling')
sns.kdeplot(robust_scaled_df['x1'], ax=ax2)
sns.kdeplot(robust_scaled_df['x2'], ax=ax2)
ax3.set_title('After Min-Max Scaling')
sns.kdeplot(minmax_scaled_df['x1'], ax=ax3)
sns.kdeplot(minmax_scaled_df['x2'], ax=ax3)
plt.show()
Note that after applying RobustScaler, the two distributions are brought into the same scale and actually overlap, while the outliers remain outside the bulk of the new distributions. With MinMaxScaler, on the other hand, the outliers define the extremes of the 0-to-1 range, so the two normal distributions are compressed into different parts of that range and remain separate.
The Normalizer
scales each sample by dividing each of its values by the sample's magnitude in n-dimensional space, for n features.
For example, if your features were x
, y
, and z
, your scaled value for x
would be:
$$\frac{x_i}{\sqrt{x_i^2 + y_i^2 + z_i^2}}$$
Essentially, it normalizes samples individually to unit norm: each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of the other samples so that its norm (l1 or l2) equals one.
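As a quick numeric check (illustrative, not from the original post), a row [3, 4, 0] has l2 norm 5, so it rescales to [0.6, 0.8, 0]:

```python
import numpy as np

row = np.array([3., 4., 0.])

# l2 normalization: divide the row by its Euclidean norm (sqrt(9 + 16 + 0) = 5)
unit_row = row / np.linalg.norm(row)

print(unit_row)  # [0.6, 0.8, 0.0], which itself has norm 1
```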
from mpl_toolkits.mplot3d import Axes3D
df = pd.DataFrame({
    'x1': np.random.randint(-100, 100, 1000).astype(float),
    'y1': np.random.randint(-80, 80, 1000).astype(float),
    'z1': np.random.randint(-150, 150, 1000).astype(float),
})
scaler = preprocessing.Normalizer()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
fig = plt.figure(figsize=(9, 5))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')
ax1.scatter(df['x1'], df['y1'], df['z1'])
ax2.scatter(scaled_df['x1'], scaled_df['y1'], scaled_df['z1'])
plt.show()
You probably won't go wrong if you use StandardScaler
to scale your features.
If you enjoyed this post and want to buy me a cup of coffee...
The thing is, I'll always accept a cup of coffee. So feel free to buy me one.
Cheers! ☕️
Say, we want to convolve a 7 X 7
image with a 3 X 3
filter with a stride length of 2
. What's the final size of the output?
Before going to the equation, let's lay down some definitions:
n X n : the size of the input image
f X f : the size of the filter
p : the amount of zero padding
s : the stride length
The output size will be:
$$\lfloor \frac{n + 2p - f} {s} + 1 \rfloor \times \lfloor \frac{n + 2p - f} {s} + 1 \rfloor$$
Note: Should the fraction not be an integer, we'll have to round it down.
And since:
n = 7
f = 3
p = 0
s = 2
This gives us 3 X 3
because:
$$\lfloor \frac{7 + 0 - 3} {2} + 1 \rfloor \times \lfloor \frac{7 + 0 - 3} {2} + 1 \rfloor = 3 \times 3$$
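The formula is easy to wrap in a small helper. Here is an illustrative sketch (the function name is mine, not from the post); floor division handles the rounding down:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length of convolving an n x n image with an
    f x f filter, padding p and stride s (rounded down)."""
    return (n + 2 * p - f) // s + 1

# The 7 x 7 image, 3 x 3 filter, stride 2 example from above:
print(conv_output_size(7, 3, p=0, s=2))  # 3
```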
A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model that is capable of performing linear or nonlinear classification, regression, and even outlier detection.
SVMs are particularly well-suited for classification of complex but small- or medium-sized datasets.
Here are a few ways to customize more advanced versions of them.
Although linear SVM classifiers are efficient and work well in many cases, many datasets are not even close to being linearly separable.
One approach to handling nonlinear datasets is to add more features, such as polynomial features, which may result in a linearly separable dataset.
To implement this in Scikit-Learn, we simply create a Pipeline
containing a PolynomialFeatures
transformer, followed by a StandardScaler
and a LinearSVC
.
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# A toy nonlinear dataset, assumed here for illustration
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
polynomial_svm_clf.fit(X, y)
Note: If your SVM model is overfitting, you can try regularizing it by reducing
C
.
The LinearSVC
class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler
.
Also make sure that you set the loss
hyperparameter to "hinge"
as it is not the default value (the default value is "squared_hinge"
).
Finally, for better performance, you should set the dual hyperparameter to False when there are more training instances than features (i.e. n_samples > n_features). Note, however, that dual=False is only supported with the default "squared_hinge" loss, so it cannot be combined with loss="hinge".
As we have seen previously, adding polynomial features via the PolynomialFeatures
transformer is simple and straightforward to implement. However, the polynomial degree should not be too high as this will create a huge number of features, making the model too slow.
An alternative is to perform the kernel trick. It makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# With a polynomial kernel
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])

# With an RBF kernel
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
LinearSVC vs. SVC

The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features; its training time complexity is roughly O(m × n).

On the other hand, the SVC class is based on the libsvm library, which implements an algorithm that supports the kernel trick. The training time complexity is usually between O(m^2 × n) and O(m^3 × n). Unfortunately, this means that it gets dreadfully slow when the number of training instances gets large.
With so many kernels to choose from, how can you decide which one to use?
As a rule of thumb, you should always try the linear kernel first (remember that LinearSVC
is much faster than SVC(kernel="linear")
), especially if the training set is very large or if it has plenty of features.
And if the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases.
Finally, should you have spare time and computing power, you can also experiment with a few other kernels using cross-validation and grid search, especially if there are kernels specialized for your training set’s data structure.
Activation functions are an extremely important part of neural networks, as they decide whether a neuron should be activated or not.
Essentially, the role of an activation function is to produce a non-linear decision boundary via non-linear combinations of the weighted inputs.
And with so many different activation functions available, a natural follow-up question would be: which activation function should I use?
Listed in the order of recommendation:
1. ELU: keras.layers.ELU(alpha=1.0)
2. Leaky ReLU: keras.layers.LeakyReLU(alpha=0.01)
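As a refresher on what these two activations actually compute, here is a NumPy sketch (for illustration only; the Keras layers above are the real implementations):

```python
import numpy as np

def elu(z, alpha=1.0):
    # ELU: identity for z > 0, alpha * (exp(z) - 1) otherwise
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def leaky_relu(z, alpha=0.01):
    # Leaky ReLU: identity for z > 0, a small slope alpha otherwise
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(elu(z))         # approx [-0.8647, 0.0, 3.0]
print(leaky_relu(z))  # [-0.02, 0.0, 3.0]
```

Unlike ReLU, both keep a non-zero gradient for negative inputs, which helps avoid "dying" neurons.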
There are two ways to build Keras models: sequential and functional.
The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs.
Alternatively, the functional API allows you to create models that have a lot more flexibility, as you can easily define models where layers connect to more than just the previous and next layers. In fact, you can connect layers to (literally) any other layer. As a result, creating complex networks such as Siamese networks and residual networks becomes possible.
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(2, input_dim=1))
model.add(Dense(1))
In the example above, layers are added piecewise via the Sequential
object.
The Sequential model API is great for developing deep learning models in most situations, but it also has some limitations.
For example, it is not straightforward to define models that may have:
- Multiple different input sources
- Multiple separate output destinations
- Shared or reused layers
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
# Define the input
# Unlike the Sequential model, you must create and define
# a standalone "Input" layer that specifies the shape of input
# data. The input layer takes a "shape" argument, which is a
# tuple that indicates the dimensionality of the input data.
# When input data is one-dimensional, such as the MLP, the shape
# must explicitly leave room for the shape of the mini-batch size
# used when splitting the data when training the network. Hence,
# the shape tuple is always defined with a hanging last dimension.
# For instance, "(2,)", as in the example below:
visible = Input(shape=(2,))
# Connecting layers
# The layers in the model are connected pairwise.
# This is done by specifying where the input comes from when
# defining each new layer. A bracket notation is used, such that
# after the layer is created, the layer from which the input to
# the current layer comes from is specified.
# Note how the "visible" layer connects to the "Dense" layer:
hidden = Dense(2)(visible)
# Create the model
# After creating all of your model layers and connecting them
# together, you must then define the model.
# As with the Sequential API, the model is the thing that you can
# summarize, fit, evaluate, and use to make predictions.
# Keras provides a "Model" class that you can use to create a model
# from your created layers. It requires that you only specify the
# input and output layers. For example:
model = Model(inputs=visible, outputs=hidden)
The Keras functional API provides a more flexible way for defining models.
Specifically, it allows you to define multiple input or output models as well as models that share layers. More than that, it allows you to define ad hoc acyclic network graphs.
Models are defined by creating instances of layers and connecting them directly to each other in pairs, and then defining a Model
that specifies the layers to act as the input and output to the model, via the parameters inputs
and outputs
, respectively.
Take a look at how Sequential and Functional models are being used in the examples featured in the post "Embeddings in Keras: Train vs. Pretrained".
Are you confused by the confusion matrix? This should help:
Generally:
- NO = 0.
- YES = 1.
Regarding each of the 4 elements in the confusion matrix:
- True Positive (TP): predicted YES, and the actual answer was YES.
- True Negative (TN): predicted NO, and the actual answer was NO.
- False Positive (FP): predicted YES, but the actual answer was NO (a Type I error).
- False Negative (FN): predicted NO, but the actual answer was YES (a Type II error).
Check out Kevin's Simple Guide to Confusion Matrix for more details.
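To make the four cells concrete, here is a small pure-Python sketch (the labels and variable names are illustrative) that tallies them from paired actual and predicted values:

```python
# Tally the four confusion-matrix cells from paired labels (1 = YES, 0 = NO)
actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 2 0 1
```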