Introduction to Machine Learning with Scikit-Learn

Resources for Machine Learning

  • Texts
  • Online classes
  • Websites
  • Online sklearn videos

Outline

  • Goal:
    • Introduction to the sklearn API
    • Basic ML concepts
  • What are some questions that scikit-learn can address?
  • Supervised Learning
  • Model Selection
    • Cross validation
    • Bias-variance trade-off
    • Regularization
  • Unsupervised learning
    • PCA
    • Clustering
  • What we are not covering.

Why sklearn?

  • Consistent interface
  • Great documentation
  • More data $\rightarrow$ ML is more valuable
  • Opportunity everywhere (almost)
In [80]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
try:
    import smash  # optional notebook-styling helper; silently skipped if unavailable
    reload(smash)
except:
    pass
colors = plt.rcParams['axes.color_cycle']

Machine Learning Questions


Let's look at this from a data perspective.

Input data:

  • The $X$ matrix has $n$ samples and $m$ features (shape $(n, m)$).
  • We refer to $X$ as the features.

Value to estimate:

  • The $y$ vector has $n$ values (shape $(n,)$).
  • Called the response vector.

Boston Housing

In [26]:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target

print X.shape
print y.shape
(506, 13)
(506,)

In [27]:
print boston.DESCR[165:1200]
   
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    

In [35]:
plt.hist(y, alpha=0.75)
plt.xlabel("median value of owner-occupied homes $1000's")
plt.show()

Questions:

  • Can we predict the median value of owner-occupied homes?
    • regression
  • Which characteristics are most important when predicting home values?
    • inference; see statsmodels (sketch below)
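
For the inference question, statsmodels reports coefficient estimates with standard errors and p-values. A minimal sketch, assuming statsmodels is installed and boston is loaded as above:

import statsmodels.api as sm

# ordinary least squares with an explicit intercept term
X_const = sm.add_constant(boston.data)
ols = sm.OLS(boston.target, X_const).fit()
print(ols.summary())   # coefficients, standard errors, p-values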

Handwritten Digits

In [36]:
digits = datasets.load_digits()
X = digits.data
y = digits.target

print X.shape
print y.shape
(1797, 64)
(1797,)

In [44]:
print digits.DESCR[:595]
 Optical Recognition of Handwritten Digits Data Set

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

In [71]:
fig, ax = plt.subplots(4,4, sharex=True, sharey=True)
for r in range(4):
    for c in range(4):
        ax[r,c].imshow(X[r*4+c].reshape(8,8), cmap=plt.cm.gray_r)
fig.show()

Questions:

  • Given an image, can we assign it to the correct numeric category (digit)?
    • classification
  • Can we improve the performance (and accuracy) of our prediction by reducing the number of features?
    • dimensionality reduction (see the sketch below)
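
A sketch of the dimensionality-reduction question, reusing the digits X and y from above (16 components is an arbitrary choice):

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# project the 64 pixel features onto 16 principal components
pca = PCA(n_components=16)
X_reduced = pca.fit_transform(X)

clf = KNeighborsClassifier()
clf.fit(X_reduced, y)
print(X_reduced.shape)           # (1797, 16)
print(clf.score(X_reduced, y))   # training accuracy in the reduced space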

Supervised and Unsupervised Learning

Supervised Learning

  • Given a set of labeled data.
  • Goal: estimate $\hat{f}(X)$
  • Predict
    • quantitative: Regression
    • qualitative: Classification

Unsupervised Learning

  • We do not have labeled data.
  • Clustering
  • Dimensionality reduction

The estimator object

Every algorithm in sklearn is an instance of the estimator class and shares a consistent interface (a minimal usage sketch follows the method lists below).

Supervised methods:

  • fit(X,y): given features $X$ and response $y$, estimate $\hat{f}(X)$.
  • predict(X): predict $\hat{y}$ given $X$ (i.e. evaluate $\hat{f}(X)$).
  • score(X,y): a measure of fit ($R^2$ for regressors, accuracy for classifiers).
  • predict_proba(X): for classifiers, returns class probabilities instead of labels.

Unsupervised methods:

  • fit(X): fit the training data.
  • transform(X): returns the new representation of $X$ based on the fit.
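
The pattern is the same for every estimator. A minimal usage sketch (the dataset and estimator choices here are arbitrary):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target

# supervised: fit / predict / predict_proba / score
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:3]))          # predicted class labels
print(clf.predict_proba(X[:3]))    # class probabilities
print(clf.score(X, y))             # mean accuracy

# unsupervised: fit / transform
pca = PCA(n_components=2)
pca.fit(X)
print(pca.transform(X).shape)      # (1797, 2)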

Regression

  • Linear Regression
  • Nearest Neighbor Regression
In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
try:
    import smash
    reload(smash)
except:
    pass
colors = plt.rcParams['axes.color_cycle']
In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

from sklearn.datasets.samples_generator import make_regression
In [106]:
X, y = make_regression( n_samples=30, n_features=1, n_informative=1,
                        random_state=0, noise=75) 
X = X + 5
y = y + 300

def plot():
    fig, ax = plt.subplots()
    ax.scatter(X[:,0], y, edgecolor='none', s=100, alpha=0.75, c=colors[0])
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    fig.show()
In [107]:
plot()

Linear Regression

In [108]:
regr = LinearRegression()
regr.fit(X, y)

print regr.intercept_, regr.coef_
print regr.score(X,y)
-220.348130992 [ 100.96917215]
0.759670081702
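
For regressors, score(X,y) returns the coefficient of determination $R^2$. A quick check, reusing regr, X, and y from the cell above:

# R^2 = 1 - SS_residual / SS_total
y_hat = regr.predict(X)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
print(1 - ss_res / ss_tot)   # matches regr.score(X, y)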

In [109]:
X_new = np.linspace(2,8,10)[:,np.newaxis] # to make it (n,m)
y_pred = regr.predict(X_new)
In [110]:
def plot():
    fig, ax = plt.subplots()

    ax.plot(X_new[:,0], y_pred, color='darkgrey', linewidth=2)
    ax.scatter(X[:,0], y, edgecolor='none', s=100, alpha=0.75, c=colors[0])

    ax.set_xlabel('X')
    ax.set_ylabel('y')

    fig.show()
In [111]:
plot()

Nearest Neighbors

In [112]:
regr_nn = KNeighborsRegressor()
regr_nn.fit(X, y)

print regr_nn.score(X,y)
0.781550018011

In [113]:
def plot():
    fig, ax = plt.subplots()

    ax.scatter(X[:,0], y, edgecolor='none', s=100, alpha=0.75, c=colors[0])
    ax.plot(X_new[:,0], regr_nn.predict(X_new), color='darkgrey', linewidth=2)

    ax.set_xlabel('X')
    ax.set_ylabel('y')
    fig.show()
In [114]:
plot()

How do they work?

In [115]:
X, y = make_regression( n_samples=15, n_features=1, n_informative=1,
                        random_state=1, noise=35) 
X = X + 5
y = y + 250
In [116]:
regr.fit(X,y)
Out[116]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
In [119]:
def plot():
    fig, ax = plt.subplots()

    ax.vlines(X[:,0], regr.predict(X), y, color='grey')
    ax.scatter(X[:,0], y, edgecolor='none', s=100, alpha=1., c=colors[0])
    ax.plot(X_new[:,0], regr.predict(X_new) , color='darkgrey', linewidth=2)

    ax.set_xlabel('X')
    ax.set_ylabel('y')
    fig.show()
In [120]:
plot()
In [121]:
from sklearn import metrics

h = 50
# grid of candidate (intercept, coef) pairs around the fitted values
xx, yy = np.meshgrid(np.linspace(0, regr.intercept_*2.0, h),
                     np.linspace(0, regr.coef_[0]*2.0, h))

Z = np.c_[xx.ravel(), yy.ravel()]
# mean squared error of the line y = intercept + coef*X at every grid point
res = map(lambda z: metrics.mean_squared_error(y, z[0] + z[1]*X), Z)
res2 = np.array(res).reshape(xx.shape)

# one arbitrary sub-optimal (intercept, coef) pair plus the least-squares fit, for plotting
points = np.c_[ Z[np.argsort(res)[650]], np.array([regr.intercept_, regr.coef_[0]])].T
In [122]:
def contour(ax):
    ax.contour(xx, yy, res2, 50, alpha=0.20, cmap=plt.cm.Greys_r)
    ax.contourf(xx, yy, res2, 50, alpha=0.20, cmap=plt.cm.Greys)
    ax.scatter(regr.intercept_, regr.coef_[0] , s=70, 
               alpha=0.75, edgecolor='none', c=colors[0])
    ax.scatter(points[0][0], points[0][1], c=colors[1], s=70)

    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())

    ax.set_ylabel('coef_')
    ax.set_xlabel('intercept_')

def line(ax):
    ax.vlines(X[:,0], regr.predict(X), y, color='grey')
    ax.plot(X_new[:,0], points[0][0] + points[0][1]*X_new[:,0], color=colors[1])
    ax.plot(X_new[:,0], points[1][0] + points[1][1]*X_new[:,0], color=colors[0])
    ax.scatter(X[:,0], y, edgecolor='none', s=70, alpha=1., c=colors[0])

    ax.set_xlabel('X')
    ax.set_ylabel('y')
 
def plot():    
    fig, ax = plt.subplots(1,2, figsize=(12,6))
    contour(ax[0])
    line(ax[1])
    fig.show()
In [123]:
plot()
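
The contour shows the mean-squared-error surface over (intercept, coef) pairs; the least-squares fit sits at its minimum. A quick check with the normal equations, reusing X, y, and regr from above:

# solve min ||A w - y||^2 for w = (intercept, slope)
A = np.c_[np.ones(len(X)), X[:, 0]]
w = np.linalg.lstsq(A, y)[0]
print(w)                                 # (intercept, slope)
print([regr.intercept_, regr.coef_[0]])  # should agree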
In [124]:
regr_nn.fit(X,y)
Out[124]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          n_neighbors=5, p=2, weights='uniform')
In [125]:
fig, ax = plt.subplots()

ax.scatter(X[:,0], y, edgecolor='none', s=100, alpha=0.75, c=colors[0])
ax.plot(X_new[:,0], regr_nn.predict(X_new), color='darkgrey', linewidth=2)

pt = 3.7
neighbors = np.argsort(np.sqrt(np.square(X[:,0] - pt)))[:5]
for i in range(5):
    ax.plot( [ pt, X[neighbors[i]][0]] , [np.mean(y[neighbors]), y[neighbors[i]]], color='black' )
ax.scatter(X[neighbors], y[neighbors], c='black', s=40)
ax.scatter(pt, np.mean(y[neighbors]), c='black', s=40)

pt = 6
neighbors = np.argsort(np.sqrt(np.square(X[:,0] - pt)))[:5]
for i in range(5):
    ax.plot( [ pt, X[neighbors[i]][0]] , [np.mean(y[neighbors]), y[neighbors[i]]], color='black' )
ax.scatter(X[neighbors], y[neighbors], c='black', s=40)
ax.scatter(pt, np.mean(y[neighbors]), c='black', s=40)

ax.set_xlabel('X')
ax.set_ylabel('y')
fig.show()
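
KNeighborsRegressor with the default settings (n_neighbors=5, uniform weights) predicts the mean response of the five nearest training points. A quick check, reusing X, y, and regr_nn from above:

pt = 3.7
nearest = np.argsort(np.abs(X[:, 0] - pt))[:5]   # indices of the 5 closest samples
print(np.mean(y[nearest]))                       # manual prediction
print(regr_nn.predict(np.array([[pt]]))[0])      # should agree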

Classification

In [126]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.datasets.samples_generator import make_classification
from sklearn.datasets import make_circles
In [130]:
def plotXY():
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=70, alpha=0.75, edgecolor='none')
    ax.set_xlabel('X[:,0]')
    ax.set_ylabel('X[:,1]')
    fig.show()
In [167]:
def plot(clf):
    h = 50
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, h),
                         np.linspace(y_min, y_max, h))

    clf.fit(X, y)
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)

    fig, ax = plt.subplots()
    ax.contourf(xx, yy, Z, alpha=0.15, levels=[0,0.5,1], cmap=plt.cm.Greys)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=70, alpha=0.75, edgecolor='none')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('X[:,0]')
    ax.set_ylabel('X[:,1]')
    fig.show()

Madelon-style synthetic data

In [168]:
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=6, n_clusters_per_class=1)
In [169]:
plotXY()
In [170]:
clf = KNeighborsClassifier()
clf.fit(X, y)
print clf.score(X,y)
0.95

In [171]:
plot(clf)
In [172]:
clf = LogisticRegression()
clf.fit(X, y)
print clf.score(X,y)
0.95
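
For classifiers, score(X,y) is the mean accuracy; the same number can be computed directly (reusing clf, X, and y from above):

# fraction of samples whose predicted label matches the true label
print(np.mean(clf.predict(X) == y))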

In [173]:
plot(clf)

Circles

In [174]:
X, y = make_circles(noise=0.2, factor=0.5, random_state=1)
In [175]:
plotXY()
In [176]:
clf = KNeighborsClassifier()
clf.fit(X, y)
print clf.score(X,y)
0.9

In [177]:
plot(clf)
In [178]:
clf = LogisticRegression()
clf.fit(X, y)
print clf.score(X,y)
0.48

In [179]:
plot(clf)
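
Logistic regression draws a single linear boundary, which cannot separate concentric circles, hence the near-chance score above. A hedged sketch, not part of the original flow: adding the squared radius as an extra feature makes the classes (approximately) linearly separable again.

# squared distance from the origin as a third feature
r2 = (X ** 2).sum(axis=1)[:, np.newaxis]
X_aug = np.hstack([X, r2])

clf_aug = LogisticRegression()
clf_aug.fit(X_aug, y)
print(clf_aug.score(X_aug, y))   # should improve substantially on 0.48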

Model Selection

  • Cross validation
  • Bias-Variance Trade-Off
  • Regularization
    • Ridge
    • Lasso
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
try:
    import smash
    reload(smash)
except:
    pass
colors = plt.rcParams['axes.color_cycle']
In [2]:
from sklearn import datasets
from sklearn.datasets.samples_generator import make_regression

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

Sample data

In [3]:
X, y = make_regression( n_samples=60, n_features=1, n_informative=1,
                        random_state=0, noise=20) 
X = X + 5
y = y + 300
In [4]:
def plot():
    fig, ax = plt.subplots()
    ax.scatter(X[:,0], y, edgecolor='none', s=100, alpha=0.75, c=colors[0])
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    fig.show()
In [5]:
plot()
In [6]:
regr = LinearRegression()
regr.fit(X, y)
print regr.score(X,y)
0.659251562827

In [7]:
regr_nn = KNeighborsRegressor(1)
regr_nn.fit(X, y)
print regr_nn.score(X,y)
1.0

What's going on?

  • KNeighborsRegressor(1): we are just memorizing the data
  • We are likely overfitting to the data.
    • Not likely to perform well on data we have not seen.
  • Solution: cross validate
In [8]:
def plot():
    fig, ax = plt.subplots()
    ax.scatter(X[:,0], y, edgecolor='none', s=100, alpha=0.75, c=colors[0])
    ax.plot(sorted(X[:,0]), regr_nn.predict(X[np.argsort(X[:,0])]), linewidth=2)
    ax.plot(X[:,0], regr.predict(X), linewidth=2)
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    fig.show()
In [9]:
plot()

Cross Validation

In [10]:
from sklearn import cross_validation

X, y = make_regression( n_samples=60, n_features=1, n_informative=1,
                        random_state=0, noise=10) 
X = X + 5
y = y + 300

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, 
                                        test_size=0.2, random_state=0)

print X.shape, X_train.shape, X_test.shape

def plot():
    fig, ax = plt.subplots()
    ax.scatter(X_train[:,0], y_train, edgecolor='none', 
               s=100, alpha=0.5, c=colors[0], label='train')
    ax.scatter(X_test[:,0], y_test, edgecolor='none', 
               s=100, alpha=1., c=colors[1], label='test')
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.legend(loc='upper left')
    fig.show()
(60, 1) (48, 1) (12, 1)

In [11]:
plot()
In [12]:
regr = LinearRegression()
regr.fit(X_train, y_train)
print regr.score(X_train, y_train)
print regr.score(X_test, y_test)
0.888219653513
0.939492295451

In [13]:
regr_nn = KNeighborsRegressor(1)
regr_nn.fit(X_train, y_train)
print regr_nn.score(X_train, y_train)
print regr_nn.score(X_test, y_test)
1.0
0.809129476496

In [14]:
regr_nn = KNeighborsRegressor(2)
regr_nn.fit(X_train, y_train)
print regr_nn.score(X_train, y_train)
print regr_nn.score(X_test, y_test)
0.937726017289
0.938673349488

In [21]:
def plot():
    X = X_train
    y = y_train
    fig, ax = plt.subplots()
    ax.scatter(X_test[:,0], y_test, edgecolor='none', 
               s=100, alpha=1, c=colors[1], label='test points')
    ax.plot(sorted(X[:,0]), regr_nn.predict(X[np.argsort(X[:,0])]), 
            linewidth=2, label='KNeighborsRegressor(2)', color=colors[2])
    ax.plot(X[:,0], regr.predict(X), linewidth=2, 
            label='LinearRegression()', color=colors[3])
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.legend(loc='upper left')
    fig.show()
In [22]:
plot()
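
A single train/test split can be sensitive to where the split falls; K-fold cross validation averages the score over several splits. A minimal sketch using cross_val_score (it lives in sklearn.cross_validation in this release, sklearn.model_selection in newer ones):

# 5-fold cross validation: five train/test splits, five R^2 scores
scores = cross_validation.cross_val_score(KNeighborsRegressor(2), X, y, cv=5)
print(scores)
print(scores.mean())

scores = cross_validation.cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())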

Bias-Variance Trade-Off

Bias

Refers to the error in our prediction that is introduced by our choice of model.

  • e.g. Linear models assume a linear relationship

Variance

The amount by which our prediction would change if we used a different training set.

Good predictions

  • will simultaneously minimize bias and variance.
  • More flexible methods result in less bias, but typically in more variance.

Trade-off?

Easy to find a low-bias method.

  • KNeighborsRegressor(1) passes through every point.
  • But... high variance when cross validated.

Easy to find a low-variance method.

  • A horizontal line (always predicting the mean; see the sketch below).
  • This has very high bias.

Challenge: find a method that minimizes both
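
For example, the "horizontal line" above can be written as a DummyRegressor that always predicts the training mean: almost no variance, but large bias. A sketch reusing the X_train/X_test split from the cross-validation section (DummyRegressor lives in sklearn.dummy):

from sklearn.dummy import DummyRegressor

# always predict the mean of y_train: a horizontal line
flat = DummyRegressor(strategy='mean')
flat.fit(X_train, y_train)
print(flat.score(X_train, y_train))   # exactly 0 on the training set by construction
print(flat.score(X_test, y_test))     # near zero (or negative): high bias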

In [23]:
from sklearn.metrics import mean_squared_error
from ipywidgets import StaticInteract, RangeWidget

The test function and PolynomialRegression class are taken from Jake Vanderplas' excellent tutorial.

In [24]:
def test_func(x, err=0.5):
    y = 10 - 1. / (x + 0.1)
    if err > 0:
        y = np.random.normal(y, err)
    return y

def make_data(N=40, error=1.0, random_seed=1):
    # randomly sample the data
    np.random.seed(random_seed)
    X = np.random.random(N)[:, np.newaxis]
    y = test_func(X.ravel(), error)
    
    return X, y
In [25]:
class PolynomialRegression(LinearRegression):
    """Simple Polynomial Regression to 1D data"""
    def __init__(self, degree=1, **kwargs):
        self.degree = degree
        LinearRegression.__init__(self, **kwargs)
        
    def fit(self, X, y):
        if X.shape[1] != 1:
            raise ValueError("Only 1D data valid here")
        if self.degree > 1:
            Xp = X ** (1 + np.arange(self.degree))
            return LinearRegression.fit(self, Xp, y)
        else:
            return LinearRegression.fit(self, X, y)
         
    def predict(self, X):
        if self.degree > 1:
            Xp = X ** (1 + np.arange(self.degree))
            return LinearRegression.predict(self, Xp)
        else:
            return LinearRegression.predict(self, X)
In [26]:
X_true, y_true = make_data(100, error=0.0)
X, y = make_data(100, error=1.0)
MSE = mean_squared_error(y, y_true)
print MSE
0.815400074255

In [27]:
def plot():
    plt.scatter(X, y, c=colors[0], s=70, alpha=0.75)
    plt.plot(sorted(X_true[:,0]), y_true[np.argsort(X_true[:,0])], 
         color="darkgrey", linewidth=2)
    plt.show()
In [28]:
plot()
In [29]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, 
                                        test_size=0.2, random_state=0)
test = []
train = []
shape = []
X_tmp = np.linspace(0,1,100)[:, np.newaxis]
In [30]:
degree = range(1,20)
for d in degree:
    model = PolynomialRegression(degree=d)
    model.fit(X_train, y_train)
    test.append(mean_squared_error(model.predict(X_test), y_test))
    train.append(mean_squared_error(model.predict(X_train), y_train))
    shape.append(model.predict(X_tmp))
In [31]:
def plot():
    plt.plot(np.array(degree)/20., test,'-')
    plt.plot(np.array(degree)/20., train,'-')
    plt.hlines(MSE, 0, 1)
    plt.xlabel('increasing flexibility')
    plt.ylabel('MSE')
    plt.show()
In [32]:
plot()
In [33]:
def plot(index):
    fig, ax = plt.subplots(1,2, figsize=(12,6))