Unit 4: Unsupervised Learning

Clustering and dimensionality reduction

Learning Outcomes
  • Cluster data with k-means and hierarchical methods
  • Choose an appropriate number of clusters
  • Reduce dimensions using PCA
  • Interpret unsupervised results

K-Means Clustering

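K-means partitions data into k clusters by iteratively assigning each point to its nearest centroid and then moving each centroid to the mean of its assigned points, which reduces the within-cluster variance at every step. The result depends on the initial centroids, so the algorithm is usually run from several starting points. The elbow method helps select k: plot the within-cluster sum of squares against k and look for the point beyond which adding more clusters gives only small improvements.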
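
The assign-and-update loop described above can be sketched in NumPy as follows. This is an illustrative sketch only, not scikit-learn's implementation; the function name kmeans_sketch, its parameters, and the synthetic data X_demo are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative Lloyd-style k-means (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data (an assumption for illustration): two separated blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X_demo, k=2)
print(centroids)
```

In practice a library implementation is preferable, since it typically adds better initialisation and multiple restarts.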

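from sklearn.cluster import KMeans

# X is assumed to be an array of shape (n_samples, n_features)
km = KMeans(n_clusters=3, random_state=42).fit(X)
print(km.labels_)  # cluster index assigned to each sample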
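
One possible way to apply the elbow method is to fit k-means for a range of k and record the inertia (within-cluster sum of squared distances) of each fit. The sketch below assumes synthetic data, X_demo, made of three Gaussian blobs; the range of k and the n_init value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for X (an assumption): three Gaussian blobs in 2-D.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# Fit k-means for each candidate k and record the inertia of the fit.
ks = range(1, 9)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertias against ks should show a sharp drop followed by a flatter tail; the bend between the two is the candidate value of k. On real data the bend is often less clear than on synthetic blobs, so the choice remains a judgement call.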

Hierarchical Clustering

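Hierarchical (agglomerative) clustering starts with every point in its own cluster and repeatedly merges the two closest clusters, recording the merges in a tree called a dendrogram. The number of clusters does not have to be fixed in advance: it is chosen afterwards by cutting the tree at a chosen level.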
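
A minimal sketch of this idea using SciPy is shown below: the tree is built once and then cut at two different levels. The synthetic data X_demo and the choice of Ward linkage are assumptions for illustration; other linkage methods (for example "single", "complete", or "average") define "closest" differently and can give different trees.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in data (an assumption): three small blobs in 2-D.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(20, 2))
    for center in ([0, 0], [4, 4], [0, 4])
])

# Build the merge tree bottom-up. Ward linkage merges the pair of clusters
# that gives the smallest increase in within-cluster variance.
Z = linkage(X_demo, method="ward")

# Cut the same tree at two levels without refitting:
# ask for at most 3 flat clusters, then at most 2.
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_2 = fcluster(Z, t=2, criterion="maxclust")

print(np.unique(labels_3))
print(np.unique(labels_2))
```

The linkage matrix Z can also be passed to scipy.cluster.hierarchy.dendrogram to draw the tree and pick a cut height by eye.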

Principal Component Analysis

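PCA projects data onto a set of orthogonal directions, the principal components, ordered by how much of the data's variance each one captures. Keeping only the first few components reduces the number of dimensions while retaining as much variance as possible, which aids visualisation and can speed up downstream learning.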

from sklearn.decomposition import PCA

# X is assumed to be an array of shape (n_samples, n_features)
X2 = PCA(n_components=2).fit_transform(X)  # shape (n_samples, 2)
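
To check how much variance a projection retains, the fitted PCA object can be inspected. The sketch below assumes synthetic data, X_demo, with five features generated from two latent directions plus a little noise; the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X (an assumption): 200 samples, 5 features,
# built from 2 latent directions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X2 = pca.fit_transform(X_demo)

# Fraction of total variance captured by each retained component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

# Rows of components_ are the principal directions; they are orthonormal,
# so this product should be close to the 2x2 identity matrix.
print(np.round(pca.components_ @ pca.components_.T, 6))
```

If the summed ratio is low for a real dataset, two components may not be enough to represent it faithfully, and a 2-D plot should be interpreted with caution.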

Summary

This unit covered clustering with k-means and hierarchical methods, and dimensionality reduction with PCA. Together these are core tools for finding structure in unlabelled data.

Exercises

  • Apply k-means and use the elbow method to choose k.
  • Explain the difference between k-means and hierarchical clustering.
  • Describe what the principal components represent.
  • Reduce a dataset to two dimensions with PCA and plot it.