Unit 3: Supervised Learning: Classification

KNN, Naive Bayes, decision trees, random forests and SVM

Learning Outcomes
  • Apply KNN, decision trees and Naive Bayes
  • Reduce variance with random forests
  • Use support vector machines and kernels
  • Detect and limit overfitting

KNN and Naive Bayes

K-nearest neighbours (KNN) classifies a point by majority vote among its k closest training points, while Naive Bayes applies Bayes' theorem under the assumption that features are conditionally independent given the class.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
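
To make the majority vote concrete, the following is a minimal pure-Python sketch of KNN prediction. It is not how scikit-learn implements the classifier; the function name knn_predict and the toy data are hypothetical and chosen only for illustration, and Euclidean distance is assumed.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, paired with that point's label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance only and keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbours and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.0), (4.2, 3.9), (3.8, 4.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # a
print(knn_predict(train_points, train_labels, (4.0, 3.8), k=3))  # b
```

Small values of k follow the training data closely and can fit noise, while large values smooth the decision boundary, so k is usually chosen by validation.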

Decision Trees and Random Forests

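A decision tree chooses each split by picking the feature and threshold that maximise information gain or minimise the Gini index (Gini impurity). A random forest combines many such trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split, and aggregates their predictions by voting; this reduces variance and improves generalisation.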
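
The split criteria can be sketched in a few lines of pure Python. This is an illustrative sketch for a single numeric feature, not the algorithm used inside any particular library; the helper names (gini, entropy, split_score, best_threshold) and the toy data are assumptions made for this example.

```python
from collections import Counter
import math


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; used to compute information gain."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def split_score(values, labels, threshold, impurity):
    """Impurity decrease obtained by splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted


def best_threshold(values, labels, impurity=gini):
    """Try midpoints between consecutive distinct values and keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: split_score(values, labels, t, impurity))


# Hypothetical toy feature: low values are "no", high values are "yes".
values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]

print(best_threshold(values, labels))             # 4.5
print(split_score(values, labels, 4.5, gini))     # Gini decrease: 0.5
print(split_score(values, labels, 4.5, entropy))  # information gain: 1.0
```

With entropy as the impurity measure, split_score is the information gain; with gini it is the decrease in Gini impurity. On this toy data both criteria pick the same threshold.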

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
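
One common way to detect overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below does this for a single unpruned tree and a random forest. The synthetic dataset from make_classification, the 70/30 split and the random_state values are assumptions chosen only for illustration; exact scores depend on the data and the library version, so none are shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so that memorising the training set hurts.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unpruned tree grows until its leaves are pure; the forest aggregates 100 such trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, model in [("tree", tree), ("forest", forest)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # A large gap between the two numbers is a sign of overfitting.
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```

Limiting tree depth (max_depth) or requiring more samples per leaf (min_samples_leaf) are typical ways to narrow the gap for a single tree.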

Support Vector Machines

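A support vector machine finds the hyperplane that maximises the margin between classes. Kernel functions let it handle data that is not linearly separable by implicitly mapping the inputs into a higher-dimensional feature space in which a separating hyperplane may exist.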

from sklearn.svm import SVC
clf = SVC(kernel="rbf").fit(X_train, y_train)
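
The role of the kernel can be illustrated with a small pure-Python sketch. It uses a degree-2 polynomial kernel rather than the RBF kernel, because the polynomial kernel has a short explicit feature map that can be written out; the function names and the XOR-style toy data are assumptions made for this example.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs whose inner product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


# XOR-style data: no straight line in the original 2-D space separates the classes.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# The kernel gives the inner product in the mapped space without building the map.
kernel_matches = all(
    math.isclose(
        poly2_kernel(x, z),
        dot(poly2_features(x), poly2_features(z)),
        abs_tol=1e-9,
    )
    for x in points
    for z in points
)

# In the mapped space the middle coordinate alone has the same sign as the label,
# so the hyperplane "middle coordinate = 0" separates the two classes.
separable_after_mapping = all(
    label * poly2_features(p)[1] > 0 for p, label in zip(points, labels)
)

print(kernel_matches)           # True
print(separable_after_mapping)  # True
```

The SVM only ever needs inner products between points, so it can work with the kernel value directly and never compute the mapped coordinates.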

Summary

This unit surveyed the main classification algorithms, from simple neighbour-based and probabilistic methods to tree ensembles and margin-based support vector machines, with attention to overfitting.

Exercises

  • Train a KNN classifier and study the effect of changing k.
  • Explain how a decision tree chooses a split.
  • Describe how a random forest reduces overfitting.
  • State the role of the kernel in a support vector machine.