Improving and evaluating our classifier#
We start this activity by looking at how decision trees tend to overfit the data if they are left unchecked. Overfitting means the model tries to account for outlying data points at the cost of the prediction accuracy of the general trend.
We will also look at k-fold cross validation. This is a more robust method of validation than the held-out method we used previously.
In k-fold cross validation, we can test every example once. This is done by splitting the data set into \(k\) subsets and training/testing the model \(k\) times using different combinations of the subsets.
Finally, we look at how accurate our model is on QSOs compared with other galaxies. As mentioned in the lectures, QSOs are galaxies that have an Active Galactic Nucleus (AGN). The AGN makes the galaxy brighter and as such they are detectable with the SDSS instruments out to much higher redshifts.
We will use the same data set as in the first activity, and even some of the functions we wrote in previous questions.
Decision trees have many advantages: they are simple to implement, easy to interpret, the data doesn’t require too much preparation, and they are reasonably efficient computationally.
Decision trees do have some limitations though, one of the biggest being that they tend to overfit the data. This means that, if left unchecked, they will create an overly complicated tree that attempts to account for outliers in the data, at the expense of the accuracy of the general trend.
Part of the reason for this overfitting is that the algorithm works by trying to optimise the decision locally at each node. There are ways in which this can be mitigated, and in the next problem we will see how constraining the number of decision node rows (the tree depth) impacts the accuracy of our predictions.
In order to see how the tree is overfitting we would like to examine how our decision tree performs for different tree depths. Specifically, we would like to see how it performs on test data compared to the data that was used to train it.
Naïvely, we'd expect that the deeper the tree, the better it should perform. However, as the model overfits, we see a difference in its accuracy on the training data and on the more general testing data.
We can control the depth of the learned decision tree using the max_depth argument to DecisionTreeRegressor. For example, to set the maximum depth to 5:
dtr = DecisionTreeRegressor(max_depth=5)
Complete the function accuracy_by_treedepth. The function should return the median difference for both the testing and training data sets for each of the tree depths in depths.
accuracy_by_treedepth should take the following arguments:
features and targets (as in previous problems);
depths: an array of tree depths to be used as the max_depth of the decision tree regressor.
Your function should return two lists (or arrays) containing the median_diff values for the predictions made on the training and test sets using the maximum tree depths given by the depths.
For example, if depths is [3, 5, 7], then your function should return two lists of length 3. You can choose the size of the split between your testing and training data (if in doubt, 50:50 is fine).
We’ve included code to plot the differences as a function of tree depths. You should take a moment to familiarise yourself with what each line is doing. If your code is working well then your plot should look a bit like the following:
%run 5a.ipynb
import numpy as np
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeRegressor
# Complete the following function
def accuracy_by_treedepth(features, targets, depths):
    # split the data into testing and training sets
    # Initialise arrays or lists to store the accuracies for the below loop
    # Loop through depths
    for depth in depths:
        # initialize model with the maximum depth.
        dtr = DecisionTreeRegressor(max_depth=depth)
        # train the model using the training set
        # Get the predictions for the training set and calculate their med_diff
        # Get the predictions for the testing set and calculate their med_diff
    # Return the accuracies for the training and testing sets
data = np.load('data/sdss_galaxy_colors.npy')
features, targets = get_features_targets(data)
# Generate several depths to test
tree_depths = [i for i in range(1, 36, 2)]
# Call the function
train_med_diffs, test_med_diffs = accuracy_by_treedepth(features, targets, tree_depths)
print("Depth with lowest median difference : {}".format(tree_depths[test_med_diffs.index(min(test_med_diffs))]))
# Plot the results
train_plot = plt.plot(tree_depths, train_med_diffs, label='Training set')
test_plot = plt.plot(tree_depths, test_med_diffs, label='Validation set')
plt.xlabel("Maximum Tree Depth")
plt.ylabel("Median of Differences")
plt.legend()
plt.show()
Click here for a solution
Here we split the data in half by slicing with an index, creating four explicit arrays: train_features, test_features, train_targets and test_targets. It would have been equally valid to index into features and targets directly within the loop.
We construct two lists, train_diffs and test_diffs, to which we append the med_diff values for each depth in the for loop.
The decision tree is instantiated on each iteration of the loop with a new max_depth, taken from the for loop variable. You could also pass random_state=0 when instantiating the tree. This controls the pseudo-random numbers used to build the tree, and fixing it ensures that all depths are tested in a consistent manner. This is not a requirement of your solution, but it is certainly recommended.
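For reference, a minimal sketch of the instantiation with the random state fixed (the depth of 5 is an arbitrary placeholder for illustration):
from sklearn.tree import DecisionTreeRegressor

# fixing random_state means repeated runs at the same depth build the same tree
dtr = DecisionTreeRegressor(max_depth=5, random_state=0)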
We then train the tree and get predictions for both the training and testing features. As previously mentioned, it is generally bad practice to use the training data to assess the accuracy of a model; however, as mentioned in the problem's context, we are using it here for comparison.
Finally, we use the predictions for each set to calculate the med_diff (using our previously written function) and append it to the corresponding *_diffs list from earlier.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeRegressor
# Complete the following function
def accuracy_by_treedepth(features, targets, depths):
    # split the data into testing and training sets
    split = features.shape[0]//2
    train_features, test_features = features[:split], features[split:]
    train_targets, test_targets = targets[:split], targets[split:]
    # Initialise arrays or lists to store the accuracies for the below loop
    train_diffs = []
    test_diffs = []
    # Loop through depths
    for depth in depths:
        # initialize model with the maximum depth.
        dtr = DecisionTreeRegressor(max_depth=depth)
        # train the model using the training set
        dtr.fit(train_features, train_targets)
        # Get the predictions for the training set and calculate their med_diff
        predictions = dtr.predict(train_features)
        train_diffs.append(median_diff(train_targets, predictions))
        # Get the predictions for the testing set and calculate their med_diff
        predictions = dtr.predict(test_features)
        test_diffs.append(median_diff(test_targets, predictions))
    # Return the accuracies for the training and testing sets
    return train_diffs, test_diffs
Discussion of results#
We can see that the accuracy of the decision tree on the training set gets better as we allow the tree to grow to greater depths. In fact, at a depth of 27 our error goes to zero!
Conversely, the accuracy of the predictions for the test set gets better initially and then worse at larger tree depths. At a tree depth of ~19 the decision tree starts to overfit the data. This means it tries to take into account outliers in the training set and loses its general predictive accuracy.
Overfitting is a common problem with decision trees and can be circumvented by adjusting parameters like the tree depth or setting a minimum number of cases at each node. For now, we will set a maximum tree depth of 19 to prevent over-fitting in our redshift problem.
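For example, in scikit-learn both of these constraints can be passed to the constructor. The line below is illustrative rather than part of the exercise, and the leaf size of 20 is an arbitrary choice:
from sklearn.tree import DecisionTreeRegressor

# cap the depth and require at least 20 examples in each leaf node
dtr = DecisionTreeRegressor(max_depth=19, min_samples_leaf=20)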
Cross validation#
The method we have used to validate our model so far is known as hold-out validation. Hold-out validation splits the data in two: one set to test with and the other to train with. It is the most basic form of validation.
While hold-out validation is better than no validation, the measured accuracy (i.e. our median of differences) will vary depending on how we split the data into testing and training subsets. The med_diff that we get from one randomly sampled training set will differ from that of a different random training set of the same size.
In order to be more certain of our model's accuracy we should use \(k\)-fold cross validation. \(k\)-fold validation works in a similar way to hold-out validation, except that we split the data into \(k\) subsets. We train and test the model \(k\) times, recording the accuracy each time. Each time we use a different combination of \(k-1\) subsets to train the model and the final \(k^\text{th}\) subset to test. We take the average of the \(k\) accuracy measurements to be the overall accuracy of the model.
The KFold class from sklearn.model_selection is designed to split the data into training and testing subsets. It does this by offering an iterable object that can be initialised with
kf = KFold(n_splits=k, shuffle=True)
The n_splits=k specifies the number of subsets to use. By default, shuffle is set to False.
It is generally good practice to shuffle the data for cross validation as sometimes during collection and storage, data of a similar type can be stored adjacently which would lead to some learning bias when training the tree.
For example, if the data was sorted by redshift, on the first iteration the model might be trained with redshifts 0 to 3 and tested on galaxies with redshifts ~4.
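To see the mechanics concretely, here is a small standalone illustration on a toy array (not the SDSS data); the fixed random_state just makes the printed folds repeatable:
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(6)  # six dummy examples
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_indices, test_indices in kf.split(toy):
    print('train:', train_indices, 'test:', test_indices)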
In the next couple of problems we will use KFold from the sklearn library to help us split our data into our \(k-1\) training subsets and remaining test subset.
In the first problem we will use the convenience of KFold to help us calculate the \(k\)-fold cross validated accuracy of our model. In the second we will extend this to provide a \(k\)-fold cross validated prediction for every galaxy in our data set.
Exercise: KFold cross validation#
Your task is to complete the function cross_validate_model.
The function takes 4 arguments:
model, features, and targets as in previous problems;
k in our \(k\)-fold. This is the number of subsets to train and test.
Your function should return a list containing the \(k\) median differences, one for each of the \(k\) folds, calculated using median_diff.
Note that we have set max_depth=19 when we initialise the decision tree to prevent the model from overfitting.
KFold usage
We have created the KFold object to give you a set of training and testing indices for each of the k
runs.
It is worth taking a moment to understand this.
Specifically, the object is initialised with
kf = KFold(n_splits=k, shuffle=True)
The n_splits=k
passes our desired number of subsets/folds.
We want to shuffle the data (as previously explained).
The iterator is then used with:
for train_indices, test_indices in kf.split(features):
kf.split(features) is an iterator that, for each of the \(k\) iterations, returns two arrays of indices to be used with our feature and target arrays, i.e. features[train_indices], targets[train_indices].
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
# complete this function
def cross_validate_model(model, features, targets, k):
    kf = KFold(n_splits=k, shuffle=True)
    # initialise a list to collect median_diffs for each iteration of the loop below
    for train_indices, test_indices in kf.split(features):
        train_features, test_features = features[train_indices], features[test_indices]
        train_targets, test_targets = targets[train_indices], targets[test_indices]
        # fit the model for the current set
        # predict using the model
        # calculate the median_diff from predicted values and append to results array
    # return the list with your median difference values
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
# complete this function
def cross_validate_model(model, features, targets, k):
    kf = KFold(n_splits=k, shuffle=True)
    # initialise a list to collect median_diffs for each iteration of the loop below
    diffs = []
    for train_indices, test_indices in kf.split(features):
        train_features, test_features = features[train_indices], features[test_indices]
        train_targets, test_targets = targets[train_indices], targets[test_indices]
        # fit the model for the current set
        model.fit(train_features, train_targets)
        # predict using the model
        predictions = model.predict(test_features)
        # calculate the median_diff from predicted values and append to results array
        diffs.append(median_diff(predictions, test_targets))
    # return the list with your median difference values
    return diffs
Click here for explanation
This is quite similar to our earlier implementation for validation in the last tutorial.
The main differences are that we are using the train_indices
and test_indices
to split our features and target arrays.
Another difference is that we are collecting our calculated med_diff values in a list to be returned at the end of the function.
data = np.load('data/sdss_galaxy_colors.npy')
features, targets = get_features_targets(data)
# initialize model with a maximum depth of 19
dtr = DecisionTreeRegressor(max_depth=19)
# call your cross validation function
diffs = cross_validate_model(dtr, features, targets, 10)
# Print the values
print('Differences: {}'.format(', '.join(['{:.3f}'.format(val) for val in diffs])))
print('Mean difference: {:.3f}'.format(np.mean(diffs)))
Cross validation of predictions#
Cross validation is an important part of ensuring that our model is returning values that are at least partially accurate. The problem with hold-out validation is that we are only able to get prediction values for the data in our test set.
With \(k\)-fold cross validation each galaxy is tested exactly once, and because of this we are able to get a prediction value for every galaxy. We'll do this in the next question.
Exercise: KFold Cross Validated Predictions#
Complete the function cross_validate_predictions.
This is very similar to the previous question except instead of returning the med_diff
accuracy measurements we would like to return a predicted value for each of the galaxies.
The function takes the same 4 arguments as the previous question, i.e. model, features, targets and k.
Your function should return a single variable: a 1-D NumPy array of length \(m\), where \(m\) is the number of galaxies in our data set.
You should make sure that you maintain the order of galaxies when giving your predictions, such that the first prediction in your array corresponds to the first galaxy in the features
and targets
arrays.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
# complete this function
def cross_validate_predictions(model, features, targets, k):
    kf = KFold(n_splits=k, shuffle=True)
    # declare an array for predicted redshifts from each iteration
    all_predictions = np.zeros_like(targets)
    for train_indices, test_indices in kf.split(features):
        # split the data into training and testing
        # fit the model for the current set
        # predict using the model
        # put the predicted values in the all_predictions array defined above
        all_predictions[test_indices] = predictions
    # return the predictions
    return all_predictions
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
# complete this function
def cross_validate_predictions(model, features, targets, k):
    kf = KFold(n_splits=k, shuffle=True)
    # declare an array for predicted redshifts from each iteration
    all_predictions = np.zeros_like(targets)
    for train_indices, test_indices in kf.split(features):
        # split the data into training and testing
        train_features, test_features = features[train_indices], features[test_indices]
        train_targets, test_targets = targets[train_indices], targets[test_indices]
        # fit the model for the current set
        model.fit(train_features, train_targets)
        # predict using the model
        predictions = model.predict(test_features)
        # put the predicted values in the all_predictions array defined above
        all_predictions[test_indices] = predictions
    # return the predictions
    return all_predictions
Click here for explanation
This is very similar to the previous problem. Here, instead of using the predicted values to calculate the accuracy, we simply put them into an array which the function returns. We initialise an array to store the predictions from each validation with
all_predictions = np.zeros_like(targets)
We then use the test_indices to keep the correct order when populating the array:
all_predictions[test_indices] = predictions
This ensures that we can compare the predictions with their corresponding target values later when calculating the median difference and plotting the predicted values against the actual values.
data = np.load('data/sdss_galaxy_colors.npy')
features, targets = get_features_targets(data)
# initialize model
dtr = DecisionTreeRegressor(max_depth=19)
# call your cross validation function
predictions = cross_validate_predictions(dtr, features, targets, 10)
# calculate and print the median difference as a sanity check
diffs = median_diff(predictions, targets)
print('Median difference: {:.3f}'.format(diffs))
# plot the results to see how well our model looks
plt.scatter(targets, predictions, s=0.4)
plt.xlim((0, targets.max()))
plt.ylim((0, predictions.max()))
plt.xlabel('Measured Redshift')
plt.ylabel('Predicted Redshift')
plt.show()
K-Fold discussion#
K-fold cross validation is an important part of assessing the accuracy of any machine learning model. When we plotted our predicted vs measured redshifts we were able to see that for many of our galaxies we get a reasonably accurate prediction of redshift. However, there are also several outliers for which our model does not give a good prediction.
We have learnt the inner workings of \(k\)-fold cross validation with the help of the KFold class.
Now that you have a working understanding of \(k\)-fold cross validation, you should be aware that there are several functions in the sklearn.model_selection module that provide off-the-shelf versions of some of the routines we have just written.
The cross_val_predict
function performs the same actions as the cross_validate_predictions
function you wrote in the previous question.
It can be called with
predictions = cross_val_predict(dtr, features, targets, cv=k)
where dtr is our decision tree regressor object, cv=k allows us to specify the number of folds (k) to use, and features/targets are as we have used them so far.
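Putting that together with our existing helpers, a minimal sketch (assuming data, get_features_targets and median_diff from the earlier cells are available):
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

features, targets = get_features_targets(data)
dtr = DecisionTreeRegressor(max_depth=19)
# cross_val_predict returns one out-of-fold prediction per galaxy, in the original order
predictions = cross_val_predict(dtr, features, targets, cv=10)
print('Median difference: {:.3f}'.format(median_diff(predictions, targets)))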
There is one other tool in the sklearn.model_selection module that is worth noting: the cross_val_score function.
This provides a score of how well the model performed, similar to the med_diff we have been using so far.
We will not go into the usage here, but note that you need to specify which metric is used to score the model.
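For completeness, a minimal sketch of what a call might look like, again assuming features and targets from the earlier cells; the 'neg_median_absolute_error' scorer is scikit-learn's built-in metric closest in spirit to our med_diff (its sign convention means higher, i.e. less negative, is better):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor(max_depth=19)
# one score per fold; take the mean for an overall figure
scores = cross_val_score(dtr, features, targets, cv=10, scoring='neg_median_absolute_error')
print('Mean score: {:.3f}'.format(scores.mean()))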
QSO vs Galaxies#
You might be surprised to learn that our sample of galaxies consists of two different populations: regular galaxies and quasi-stellar objects (QSOs). QSOs are a type of galaxy that contain an actively (and intensely) accreting supermassive black hole. This is often referred to as an Active Galactic Nucleus (AGN).
The light emitted from the AGN is significantly brighter than the rest of the galaxy and we are able to detect these QSOs out to much higher redshifts. In fact, most of the normal galaxies we have been using to create our models have redshifts less than \(z \approx 0.4\), while the QSOs have redshifts all the way out to \(z \approx 6\). Due to this contribution from the AGN, the flux magnitudes measured at different wavelengths might not follow the typical profile we assumed when predicting redshifts.
In the next question we are going to look at whether there is a difference in the accuracy of the decision trees between QSOs and regular galaxies.
Exercise: QSO and Galaxy#
Write a function split_galaxies_qsos
that splits our data containing both galaxies and QSOs into two arrays that contain only galaxies and QSOs respectively.
Your function should take a single data argument.
The function should return two NumPy arrays, the first galaxies
containing only rows from data
that are galaxies and the second qsos
containing only rows that are QSOs.
The data array contains a column data['spec_class'] whose values will be either b'GALAXY' or b'QSO'.
Gotcha: use b’GALAXY’ and not ‘GALAXY’
The spectral class is stored as a byte string (not a Unicode string), so the literals must be prefixed with a b. Comparing against 'GALAXY' will not match any rows, whereas b'GALAXY' will.
Hint: use masking to select the rows
We can use masking to select particular rows:
import numpy as np
data = np.load('sdss_galaxy_colors.npy')
galaxies = data[data['spec_class'] == b'GALAXY']
The inner data['spec_class'] == b'GALAXY' returns a boolean array that is True for the rows with a galaxy spectral type. This mask is then used by the outer data[...] to select those rows.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
# complete this function
def split_galaxies_qsos(data):
    # split the data into galaxies and qsos arrays

    # return the separated galaxies and qsos arrays
    pass
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
# Solution
def split_galaxies_qsos(data):
    # split the data into galaxies and qsos arrays
    galaxies = data[data['spec_class'] == b'GALAXY']
    qsos = data[data['spec_class'] == b'QSO']
    # return the separated galaxies and qsos arrays
    return galaxies, qsos
Click here for explanation
This solution uses masking.
qso_mask = data['spec_class'] == b'QSO'
This creates an array of boolean values: True where data['spec_class'] == b'QSO' and False otherwise.
We do the same for the galaxies and then create our two return arrays using the masks.
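Written out with explicit masks, an equivalent variant of the solution looks like this (the function name split_galaxies_qsos_masks is just a placeholder to distinguish it from the version above):
def split_galaxies_qsos_masks(data):
    # boolean masks: True where the spectral class matches each population
    galaxy_mask = data['spec_class'] == b'GALAXY'
    qso_mask = data['spec_class'] == b'QSO'
    # use the masks to select the matching rows
    return data[galaxy_mask], data[qso_mask]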
def cross_validate_median_diff(data):
    features, targets = get_features_targets(data)
    dtr = DecisionTreeRegressor(max_depth=19)
    return np.mean(cross_validate_model(dtr, features, targets, 10))

data = np.load('data/sdss_galaxy_colors.npy')
# Split the data set into galaxies and QSOs
galaxies, qsos = split_galaxies_qsos(data)
# Cross validate the model to get the cross-validated median difference
# for each population, using cross_validate_median_diff defined above
galaxy_med_diff = cross_validate_median_diff(galaxies)
qso_med_diff = cross_validate_median_diff(qsos)
# Print the results
print("Median difference for Galaxies: {:.3f}".format(galaxy_med_diff))
print("Median difference for QSOs: {:.3f}".format(qso_med_diff))
QSO discussion#
So our QSOs have a greater median residual \((\approx 0.074)\) than the galaxies \((\approx 0.016)\). There are a couple of possibilities why this is the case.
There are far fewer QSOs (8,525) than galaxies (41,475).
Galaxies aren't as bright as QSOs, so they become too faint to be detected with SDSS at redshifts beyond \(\approx 0.4\). This creates a measurement bias.
When I take a random sample of galaxies the same size as the QSO data set I get a med_diff
of \(\approx 0.018\), which is slightly higher than the full set, but not enough to account for the gap between the two populations.
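For reference, one way to reproduce that check is sketched below. It assumes the data array and the helpers split_galaxies_qsos and cross_validate_median_diff from the cells above, and the seed is arbitrary, so the exact value will vary slightly:
import numpy as np

# draw a random galaxy subsample the same size as the QSO set and cross validate it
galaxies, qsos = split_galaxies_qsos(data)
rng = np.random.default_rng(0)
idx = rng.choice(len(galaxies), size=len(qsos), replace=False)
galaxy_sample = galaxies[idx]
print('Median difference for galaxy subsample: {:.3f}'.format(cross_validate_median_diff(galaxy_sample)))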
The figure below shows the normalised distribution function of the two populations.
We can see that the majority of galaxies form a peak around redshift \(0.10\), while the QSOs are reasonably evenly distributed out to redshift \(\approx 2.5\). This can lead to a measurement bias. In the case of the galaxies we have trained our decision tree with target redshifts approximately less than \(0.4\). As such, the predictions from this model will not be larger than the maximum target value, so the maximum difference (or residual) for each galaxy in this set will be a lot smaller than the maximum residual for the QSOs.
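A sketch of how such a distribution plot could be produced, assuming the catalogue stores the spectroscopic redshift in a column named redshift as in the earlier activities:
from matplotlib import pyplot as plt

# normalised redshift distributions for the two populations
plt.hist(galaxies['redshift'], bins=50, density=True, alpha=0.5, label='Galaxies')
plt.hist(qsos['redshift'], bins=50, density=True, alpha=0.5, label='QSOs')
plt.xlabel('Redshift')
plt.ylabel('Normalised count')
plt.legend()
plt.show()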
We can often get a clearer view of this by looking at the predicted redshifts vs actual redshifts in a plot.
Summary#
We have looked at how decision trees are prone to overfitting the data and how limiting the maximum depth of the tree can be used to prevent this. By comparing the accuracy of the model on the training set with that of the test set for different tree depths, we found that a maximum tree depth of 19 was suitable for our model.
We looked at \(k\)-fold cross validation and the various methods that can be used to implement it. \(k\)-fold cross validation mitigates the risk that the training set contains a unique or specific sub-population of the data set: for example, if all the training data contained QSOs and the testing set only regular galaxies. \(k\)-fold cross validation also allows you to get a prediction for every point in your data set.
We concluded by looking at the sub-population of QSOs and how their accuracy measurement was significantly worse than that of the other galaxies. On closer inspection we found that this was a measurement bias resulting from the difference in the range of redshifts in each population.
You have hopefully learnt all the tools necessary to implement a decision tree on a regression problem of your own. In the next module we will look at how decision trees can be used for classification. We will be using them to classify galaxies as either an elliptical, a spiral or a merger.