Unit 4: Unsupervised Learning

Learning Outcomes

Cluster data with k-means and hierarchical methods
Choose an appropriate number of clusters
Reduce dimensions using PCA
Interpret unsupervised results

K-Means Clustering

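K-means partitions data into k clusters by alternating two steps until the assignments stop changing: assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it. Neither step can increase the within-cluster sum of squared distances, so the algorithm converges, though only to a local minimum that depends on the initial centroids. The elbow method helps select k: plot the within-cluster sum of squares against k and choose the value beyond which further increases in k bring only small improvements.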

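from sklearn.cluster import KMeans

# X is assumed to be a numeric array of shape (n_samples, n_features)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_)  # cluster index (0, 1 or 2) assigned to each row of X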
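
The elbow method described above can be sketched as follows. This is a minimal illustration rather than a definitive recipe: the data are synthetic (X_demo is generated with make_blobs as a stand-in for a real dataset), and the range of k values tried is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for real data: 300 points around 3 centres (assumption for illustration)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Fit k-means for each candidate k and record the inertia, i.e. the sum of
# squared distances from each point to its assigned centroid
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias.append(model.inertia_)

# Print the curve; the "elbow" is where the decrease flattens out
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia generally keeps falling as k grows, so the smallest value is not the target. The useful signal is the k after which the drop becomes small; plotting k_values against inertias usually makes that point easier to see than the printed numbers.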

Hierarchical Clustering

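Agglomerative hierarchical clustering starts with every point in its own cluster and repeatedly merges the two closest clusters, where "closest" is defined by a linkage criterion such as single, complete, average or Ward. The sequence of merges forms a tree that can be drawn as a dendrogram, so the number of clusters does not have to be fixed in advance: it is chosen afterwards by cutting the tree at a chosen height.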
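
The merge-then-cut procedure can be sketched with SciPy's hierarchy module. This is one possible illustration, not the only approach: the data are synthetic, and Ward linkage and the choice of three clusters are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for real data: 60 points around 3 centres (assumption for illustration)
X_demo, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.6, random_state=42)

# Build the merge tree bottom-up; each row of Z records one merge
# (the two clusters joined, the distance between them, the new cluster size)
Z = linkage(X_demo, method="ward")

# Cut the tree so that at most 3 flat clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")

# Cluster labels start at 1; show how many points fall in each cluster
print(np.unique(labels, return_counts=True))
```

Because the tree is built once, a different number of clusters can be tried by calling fcluster again with another value of t, without recomputing the linkage. The same Z can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree.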

Principal Component Analysis

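PCA projects data onto orthogonal directions called principal components, ordered so that each one captures as much of the remaining variance as possible. Keeping only the first few components reduces the number of dimensions while retaining most of the variance, which aids visualisation and can speed up downstream learning. Because PCA is sensitive to feature scale, features measured in different units are usually standardised first.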

from sklearn.decomposition import PCA

# Keep the two leading principal components; X2 has shape (n_samples, 2)
X2 = PCA(n_components=2).fit_transform(X)
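
To check how much variance a projection retains, the fitted PCA object can be inspected. The sketch below uses synthetic data (X_demo is built from two hidden factors plus a little noise, purely as an assumption for illustration) and reads the explained_variance_ratio_ attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for real data: 200 samples with 5 features that vary
# mostly along two hidden directions, plus small noise (assumption for illustration)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Fit PCA and project onto the two leading components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_demo)

print(X_reduced.shape)                       # (n_samples, 2)
print(pca.explained_variance_ratio_)         # share of variance per component
print(pca.explained_variance_ratio_.sum())   # total share retained
```

The rows of pca.components_ are the principal directions expressed in the original feature space. If the retained share is low for a real dataset, more components may be needed before the projection is a faithful summary.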

Summary

This unit covered clustering with k-means and hierarchical methods and dimensionality reduction with PCA, the main tools for finding structure in unlabelled data.

Exercises

Apply k-means and use the elbow method to choose k.
Explain the difference between k-means and hierarchical clustering.
Describe what the principal components represent.
Reduce a dataset to two dimensions with PCA and plot it.
