WAI Intro to ML - Session 1 Lab

🖋️ Written by Jan and Laura from Warwick AI

Section 1 - Linear Regression

Doctors often need to predict how a patient’s condition will develop over time. For someone newly diagnosed with diabetes, will their condition progress slowly or rapidly? In this section, we’ll explore how we can use real data to make predictions like this.

Exercise 1.1

To introduce these concepts, we’ll use scikit-learn. Let’s start by looking at a real dataset of simple measurements that would be available at diagnosis. Run the cells below.

from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
diabetes['data']

If you are curious you can read more about this dataset and see what the indicators s1-s6 relate to here.
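
If you want a quick overview before following the link, the sketch below prints the table's size, its column names and a summary of the target. It only uses attributes of the object returned by load_diabetes; the variable names features and target are our own choice for illustration.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)

features = diabetes['data']    # pandas DataFrame, one row per patient
target = diabetes['target']    # pandas Series with the progression score

print(features.shape)           # (rows, columns)
print(list(features.columns))   # feature names, including s1-s6
print(target.describe())        # summary statistics of the target
print(diabetes['DESCR'][:500])  # start of the dataset's own description
```

The DESCR text is the description bundled with the dataset, so it is a handy place to look up what each column means.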

Exercise 1.2

We call data like this, or any data from which we make predictions, features. And the thing we are trying to predict (in our case, disease progression) is called the target. In code, we often use these shorthand variable names:

X = <the features we want to use>
y = <the target>

# NOTE: capitals denote matrices (multiple columns), lower case vectors (one column)

In this case, BMI is the most predictive feature, so we’ll focus on it first. Finish the first cell (watch out for case sensitivity).

# to_frame() makes it matrix-like, which is expected by scikit-learn
X = diabetes['data']['bmi'].to_frame()
y = diabetes['target']
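
To see why to_frame() matters, here is a small sketch comparing the shape of a single column before and after the call. The names bmi_series and bmi_frame are ours, used only for this illustration.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)

bmi_series = diabetes['data']['bmi']   # 1-D pandas Series
bmi_frame = bmi_series.to_frame()      # 2-D DataFrame with a single column

print(bmi_series.shape)  # one dimension: (n_samples,)
print(bmi_frame.shape)   # two dimensions: (n_samples, 1)
```

scikit-learn estimators expect the features as a two-dimensional table (samples by features), which is why X uses the frame, while the target y can stay one-dimensional.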

We can now visualise the data we have.

import matplotlib.pyplot as plt

plt.scatter(X, y, alpha=0.5)
plt.xlabel('BMI (normalised)')
plt.ylabel('Disease progression')
plt.show()

Exercise 1.3

Let’s make our first predictions! This data looks fairly linear, so let’s try linear regression first. Have a look at the docs here to figure out how to complete the cell below.

from sklearn.linear_model import LinearRegression
import numpy as np

model = LinearRegression().fit(X, y) # <- we still need to pass in the training data

# You can ignore the following lines, but use plot_X for plotting.
# To plot our prediction line we want evenly spaced points along the x-axis;
# if we used the data samples, our line would look broken when plotted.
plot_X = np.arange(X.min().min(), X.max().max(), 0.001).reshape(-1, 1)

y_pred = model.predict(plot_X)

plt.scatter(X, y, alpha=0.5)
plt.plot(plot_X, y_pred, color='red')
plt.show()
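
The red line is just a slope and an intercept. As a minimal sketch (the variable names slope, intercept and manual_pred are our own), you can read both off the fitted model and check that they reproduce its predictions:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes(as_frame=True)
X = diabetes['data']['bmi'].to_frame()
y = diabetes['target']

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # one coefficient per feature; we have one feature
intercept = model.intercept_  # value predicted when BMI (normalised) is 0
print(f"progression ~ {slope:.1f} * bmi + {intercept:.1f}")

# Reproduce the model's predictions by hand from the line equation
manual_pred = slope * X['bmi'].to_numpy() + intercept
print(np.allclose(manual_pred, model.predict(X)))
```

If the last line prints True, the model really is nothing more than that straight-line equation.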

Exercise 1.4

We can also make our own features from the data we have. In the example above, the data is not perfectly linear, so we can “engineer” polynomial features to capture that part of the relationship too. At degree=1, we end up with the same model as earlier, but as you increase the degree you will see the accuracy increase slightly and then drop off. Try it for yourself.
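
To make “engineering polynomial features” concrete, this small sketch applies PolynomialFeatures to two made-up values (2 and 3, chosen only so the powers are easy to check by eye) and prints the expanded table.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])  # two samples, one feature

poly = PolynomialFeatures(degree=3)
expanded = poly.fit_transform(x)

# Each row is [1, x, x^2, x^3]
print(expanded)
```

Linear regression then fits one weight per column, which is how a “linear” model ends up drawing a curve.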

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import numpy as np

model = make_pipeline(
    PolynomialFeatures(degree=1), # <-- try, e.g., 1, 3, 10, 25
    LinearRegression()
)
model.fit(X, y)

y_pred = model.predict(plot_X)

plt.scatter(X, y, alpha=0.5)
plt.plot(plot_X, y_pred, color='red')
plt.show()

This is an important aspect of ML: models can learn to follow the training data too closely (often just memorising it) and then perform poorly on unseen data.
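
Here is a tiny illustration of memorisation on made-up data (not the diabetes dataset). It assumes a hypothetical straight-line relationship with noise and uses NumPy's polyfit rather than scikit-learn to keep it short. With 6 training points, a degree-5 polynomial can pass through every one of them.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_line(x):
    # Hypothetical ground truth: a straight line plus random noise
    return 2.0 * x + rng.normal(scale=0.3, size=x.shape)

x_train = np.linspace(0.0, 1.0, 6)
y_train = noisy_line(x_train)
x_new = np.linspace(0.05, 0.95, 6)   # unseen points between the training points
y_new = noisy_line(x_new)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with these coefficients
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

line = np.polyfit(x_train, y_train, deg=1)      # simple model
wiggly = np.polyfit(x_train, y_train, deg=5)    # passes through all 6 training points

print(f"degree 1: train {mse(line, x_train, y_train):.4f}, unseen {mse(line, x_new, y_new):.4f}")
print(f"degree 5: train {mse(wiggly, x_train, y_train):.4f}, unseen {mse(wiggly, x_new, y_new):.4f}")
```

The degree-5 fit has essentially zero error on the points it was trained on, yet it cannot do the same on new points, because it has fitted the noise as well as the trend.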

Section 2 - Evaluating Our Model

We’ve inspected the dataset and made a model, but how good is it? We’re fortunate, in this case, that we can view our data and model and see for ourselves whether it’s any good. However, in the majority of problems, we will not be so fortunate. We need other tools to evaluate our models besides simply visualising them.

Exercise 2.1

For regression tasks, we are trying to create a model which can predict the outcome on unseen data. In the example we’re using, we’ve trained a model to predict the progression of diabetes using some data we have about the patient. Now we hope that, given a new unseen patient, we can accurately predict the progression.

Notice we want our model to do well on “unseen” data. So, to evaluate our model, we can mimic this scenario: separate some data to be our “unseen” data, and leave the rest to be trained on. Often a 70:30 split is used between the training data and the “unseen”/test data.

Luckily, this is such a common technique that scikit-learn has a function that does exactly this split for us!

from sklearn.model_selection import train_test_split

# test_size is the proportion of the data that goes into the test set.
# random_state seeds the random shuffle, so the same value always gives the same
# split and different values give different random splits.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)
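
To check what the split actually did, this sketch prints the size of each set and compares the test rows chosen under the same and under a different random_state. The extra variable names (test_X_again, test_X_other) are ours, for illustration only.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes(as_frame=True)
X = diabetes['data']['bmi'].to_frame()
y = diabetes['target']

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(train_X), len(test_X))  # sizes of the two sets

# Same random_state -> exactly the same rows end up in the test set
_, test_X_again, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(test_X.index.equals(test_X_again.index))

# A different random_state -> a different random selection of rows
_, test_X_other, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
print(test_X.index.equals(test_X_other.index))
```

Fixing random_state is what makes your results repeatable from one run to the next.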

We should find that we now have two different sets of data with the same underlying structure.

plt.scatter(train_X, train_y, alpha=0.5)
plt.xlabel('BMI (normalised)')
plt.ylabel('Disease progression')
plt.show()

plt.scatter(test_X, test_y, alpha=0.5)
plt.xlabel('BMI (normalised)')
plt.ylabel('Disease progression')
plt.show()

Exercise 2.2

Now that we’ve split the data we need to evaluate the accuracy of our model on the test data. First we train the model using our training data.

model = make_pipeline(
    PolynomialFeatures(degree=15), # <-- try, e.g., 1, 3, 10, 25
    LinearRegression()
)
model.fit(train_X, train_y)

y_pred = model.predict(plot_X)

plt.scatter(train_X, train_y, alpha=0.5)
plt.plot(plot_X, y_pred, color='red')
plt.show()

And now we can predict on our test set, and evaluate these predictions using mean squared error.

from sklearn.metrics import mean_squared_error

y_pred = model.predict(test_X)
mse_test = mean_squared_error(test_y, y_pred)  # Take the mean squared error between the true values and our predicted ones.

y_pred = model.predict(train_X)
mse_train = mean_squared_error(train_y, y_pred)

print(f"Test Error : {mse_test}, Train Error : {mse_train}")
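
Mean squared error is simple enough to compute by hand. The sketch below uses four made-up true values and predictions (not taken from the diabetes data) and checks the hand calculation against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

squared_errors = (y_true - y_hat) ** 2  # [0.25, 0.25, 0.0, 1.0]
mse_by_hand = squared_errors.mean()     # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375

print(mse_by_hand)
print(mean_squared_error(y_true, y_hat))  # same value from scikit-learn
```

Because the errors are squared, one large mistake counts for much more than several small ones.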

Have a look at the test error vs train error for different polynomial degrees. What do you notice?

Very high-order polynomials are in danger of overfitting the data: a model that perfectly memorises the training data might perform terribly on unseen data. We say that a model which performs well on seen data but poorly on unseen data has poor generalisation.

Exercise 2.3

Let’s plot what we’re noticing: train models of varying degree and plot the training error against the test error in each case.

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42) # FILL THIS IN

degrees = [1, 2, 3, 5, 10, 25]
train_errors = []
test_errors = []

for degree in degrees:
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        LinearRegression()
    )
    model.fit(train_X, train_y)

    # Calculate training error
    train_pred = model.predict(train_X)
    train_error = mean_squared_error(train_y, train_pred)
    train_errors.append(train_error)

    # Calculate test error
    test_pred = model.predict(test_X)
    test_error = mean_squared_error(test_y, test_pred)
    test_errors.append(test_error)

# Plot Training and Test errors for each degree
plt.plot(degrees, train_errors, marker='o', label='Training Error')
plt.plot(degrees, test_errors, marker='s', label='Test Error')
plt.xlabel('Polynomial Degree')
plt.ylabel('Error (MSE)')
plt.legend()
plt.show()
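
If the curves are hard to read off the plot, the same numbers can be printed as a table. This is a sketch that repeats the loop above in compact form and then reports the degree with the lowest test error. The "gap" column (test error minus training error) and the name best_degree are our own additions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

diabetes = load_diabetes(as_frame=True)
X = diabetes['data']['bmi'].to_frame()
y = diabetes['target']
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)

degrees = [1, 2, 3, 5, 10, 25]
train_errors, test_errors = [], []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(train_X, train_y)
    train_errors.append(mean_squared_error(train_y, model.predict(train_X)))
    test_errors.append(mean_squared_error(test_y, model.predict(test_X)))

# One row per degree: training error, test error and the gap between them
for degree, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree {degree:>2}: train {tr:10.1f}  test {te:10.1f}  gap {te - tr:10.1f}")

best_degree = degrees[int(np.argmin(test_errors))]
print("Lowest test error at degree", best_degree)
```

A growing gap between the two columns is the numerical signature of overfitting.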

Section 3 - Regularisation

We saw that high-degree polynomials overfit the training data. Regularisation is a technique that prevents this by penalising overly complex models. Ridge regression is a type of linear regression with regularisation.

The strength of regularisation is controlled by the parameter alpha:

Small alpha: less regularisation (model can be complex)
Large alpha: more regularisation (model forced to be simpler)
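
Ridge adds alpha times the sum of the squared weights to the squared error it minimises, so larger alpha pushes the weights towards zero. The sketch below shows this on made-up data using the textbook closed-form solution without an intercept. This is an assumption made for brevity; it is not how scikit-learn's Ridge is implemented internally, and ridge_weights is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                     # synthetic features
true_w = np.array([3.0, -2.0, 0.5, 0.0, 1.0])    # made-up "true" coefficients
y = X @ true_w + rng.normal(scale=0.5, size=50)  # noisy targets

def ridge_weights(X, y, alpha):
    # Closed-form ridge solution (no intercept): w = (X^T X + alpha * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.001, 0.1, 10.0, 1000.0]
norms = [np.linalg.norm(ridge_weights(X, y, a)) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:>8}: ||w|| = {n:.3f}")
```

The size of the weight vector shrinks as alpha grows, which is what "forced to be simpler" means in practice.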

We can adjust our pipeline to include Ridge regularisation like so:

from sklearn.linear_model import Ridge

model = make_pipeline(
    PolynomialFeatures(degree=10),
    Ridge(alpha=0.1)
)

Exercise 3.1

Train two models of degree 10, with and without ridge regularisation, and plot the resulting functions.

model = make_pipeline(
    PolynomialFeatures(degree=10), # <-- try, e.g., 1, 3, 10, 25
    LinearRegression()
)

# Fit on the training split so the curve matches the points plotted below
model.fit(train_X, train_y)

y_pred = model.predict(plot_X)

plt.scatter(train_X, train_y, alpha=0.5)
plt.plot(plot_X, y_pred, color='red')
plt.show()

model_with_regularisation = make_pipeline(
    PolynomialFeatures(degree=10), # <-- try, e.g., 1, 3, 10, 25
    Ridge(alpha=0.1)
)

# Fit on the same training split so the two curves are comparable
model_with_regularisation.fit(train_X, train_y)

y_pred = model_with_regularisation.predict(plot_X)

plt.scatter(train_X, train_y, alpha=0.5)
plt.plot(plot_X, y_pred, color='red')
plt.show()

Exercise 3.2

Let’s observe how different alpha values affect training and test errors.

Plot the training and test error for different values of alpha on a high-order polynomial.

alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

train_errors = []
test_errors = []

for alpha in alphas:
    # Use a high polynomial degree that we know overfits
    model = make_pipeline(
        PolynomialFeatures(degree=10),
        Ridge(alpha=alpha) # Use ridge regression
    )

    # Fit the model
    model.fit(train_X, train_y)

    # Get training error
    train_pred = model.predict(train_X)
    train_error = mean_squared_error(train_y, train_pred)
    train_errors.append(train_error)

    # Get test errors
    test_pred = model.predict(test_X)
    test_error = mean_squared_error(test_y, test_pred)
    test_errors.append(test_error)

# Plot train and test errors for each alpha (see Ridge regression)
plt.plot(alphas, train_errors, marker='o', label='Training Error')
plt.plot(alphas, test_errors, marker='s', label='Test Error')
plt.xscale('log')
plt.xlabel('Regularisation Strength (alpha)')
plt.legend()
plt.show()
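
The alpha values above are spaced by factors of ten, which is why the plot uses a logarithmic x-axis. As a small sketch, NumPy's logspace can generate such a grid for you; the finer grid (fine_alphas) is our own suggestion in case you want a smoother curve.

```python
import numpy as np

# 6 values evenly spaced in the exponent, from 10^-3 to 10^2
alphas = np.logspace(-3, 2, num=6)
print(alphas)

# A finer grid over the same range: 3 extra points inside each factor of ten
fine_alphas = np.logspace(-3, 2, num=21)
print(len(fine_alphas), fine_alphas[0], fine_alphas[-1])
```

Either array can be used in place of the hand-written alphas list in the loop above.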

Exercise 3.3

Right at the start we defined our data X to be the bmi data, but there are many other variables we could look at.

Try changing the line in Exercise 1.2, so that we look at ‘s2’ for example. Look back over the graphs you created from the exercises. Investigate the graphs for each different variable to get an idea of what they look like.

Exercise 3.4

Notice that every variable gives a scattered plot: none of them correlates strongly with disease progression on its own. Perhaps, when considered together, they can provide a better picture.

We can do linear regression on many variables at once! It just becomes almost impossible to visualise. We’ll have to rely on the training and test error alone from now on.

Start by training models without regularisation.

You will notice that very high-degree polynomials now show much more dramatic differences between test and train errors.

Try a range of different polynomial degrees.

Warning: fitting high-order polynomial models will take a minute or so now that we are using all the features, so be patient.
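
The reason training slows down is that the number of polynomial features grows very quickly once there are several input columns. The sketch below counts them with the standard combinatorial formula (the number of terms of total degree at most d in n variables is "n + d choose d") and cross-checks one case against scikit-learn.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10  # the diabetes dataset has 10 feature columns

for degree in [1, 2, 3, 5, 10]:
    # Number of polynomial terms (including the constant term) of total degree <= `degree`
    n_terms = comb(n_features + degree, degree)
    print(f"degree {degree:>2}: {n_terms:>7} polynomial features")

# Cross-check the formula against scikit-learn for a small degree
poly = PolynomialFeatures(degree=2).fit(np.zeros((1, n_features)))
print(poly.n_output_features_)
```

With 10 inputs, degree 10 already produces far more features than there are patients in the dataset, so it is worth starting with small degrees here.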

X = diabetes['data']
y = diabetes['target']

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=10)

# Create the model
model = make_pipeline(
    PolynomialFeatures(degree=2), # <-- try, e.g., 1, 2, 3, 5 (the feature count grows very quickly with 10 inputs)
    LinearRegression()
)

# Fit the model
model.fit(train_X, train_y)

# Get test error
y_pred = model.predict(test_X)
mse_test = mean_squared_error(test_y, y_pred)  # Take the mean squared error between the true values and our predicted ones.

# Get training error
y_pred = model.predict(train_X)
mse_train = mean_squared_error(train_y, y_pred)

print(f"Test Error : {mse_test}, Train Error : {mse_train}")

Exercise 3.5

Repeat Exercise 3.2 using ALL the features from the dataset, and notice that we start to prefer a more complex model over a simple one (but not too complex!).

X = diabetes['data']
y = diabetes['target']

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=10)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

train_errors = []
test_errors = []

for alpha in alphas:
    # Use a high polynomial degree that we know overfits
    model = make_pipeline(
        PolynomialFeatures(degree=10),
        Ridge(alpha=alpha)
    )

    model.fit(train_X, train_y)

    train_pred = model.predict(train_X)
    train_error = mean_squared_error(train_y, train_pred)
    train_errors.append(train_error)

    test_pred = model.predict(test_X)
    test_error = mean_squared_error(test_y, test_pred)
    test_errors.append(test_error)

# Plot the results
plt.plot(alphas, train_errors, marker='o', label='Training Error')
plt.plot(alphas, test_errors, marker='s', label='Test Error')
plt.xscale('log')
plt.xlabel('Regularisation Strength (alpha)')
plt.legend()
plt.show()

Extension

Awesome! Let’s recap using the California housing dataset. This time, we’re using median income to predict the value of the house a family owns. This sort of prediction has lots of uses in insurance.

Train a model for this dataset. Everything you need is in the previous exercises so please do go back and review them if you feel stuck (and don’t forget you can ask us for help!)

Here’s some code to get you started:

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)

X = housing['data']['MedInc'].to_frame()
y = housing['target']

# Unlike the diabetes features, these columns are not normalised;
# the target is the median house value in units of 100,000 USD.
plt.scatter(X, y, alpha=0.1)
plt.xlabel('Median income (MedInc)')
plt.ylabel('Median house value (units of 100,000 USD)')
plt.show()

# train_X, test_X, train_y, test_y = ...
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=10)

# model = make a pipeline with polynomial features and a ridge model
model = make_pipeline(
    PolynomialFeatures(degree=5),
    Ridge(alpha=1.0)
)

model.fit(train_X, train_y)

mse = mean_squared_error(test_y, model.predict(test_X))
print(f"MSE: {mse:.2f}")

Don’t forget to visualise the variables, but also use all the variables for your final model. Even if you can’t visualise the final output, you can still check its accuracy using the test error.
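
When you are comparing many settings by test error alone, it can help to wrap the split, fit and scoring steps in one function. The sketch below is one possible way to do that: fit_and_score is a hypothetical helper of our own, and it is demonstrated on small synthetic data (not the housing data) so that it runs without downloading anything.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_and_score(X, y, degree, alpha, test_size=0.2, random_state=10):
    """Hypothetical helper: split, fit a polynomial ridge model, return (train MSE, test MSE)."""
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = make_pipeline(PolynomialFeatures(degree=degree), Ridge(alpha=alpha))
    model.fit(train_X, train_y)
    train_mse = mean_squared_error(train_y, model.predict(train_X))
    test_mse = mean_squared_error(test_y, model.predict(test_X))
    return train_mse, test_mse

# Synthetic stand-in data (NOT the housing data): y is roughly 3 * x
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

for degree in [1, 2, 3]:
    train_mse, test_mse = fit_and_score(X_demo, y_demo, degree=degree, alpha=0.01)
    print(f"degree {degree}: train {train_mse:.4f}, test {test_mse:.4f}")
```

In the notebook you could call it with the real data instead, for example fit_and_score(housing['data'], housing['target'], degree=2, alpha=1.0), and try a few degrees and alphas to see which gives the lowest test error.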
