8 Clustering Methods From Scikit Learn We Should Know

Prosenjit Chakraborty
9 min read · Mar 18, 2021

Courtesy of www.VincentVanGogh.org

Scikit Learn is a very popular open-source, Python-based machine learning library. It supports various supervised (regression and classification) and unsupervised learning models.

In this blog, we’ll apply 8 clustering methods (unsupervised machine learning models) to the Iris Plants database (download from here; for details, refer here). Instead of going deep into the algorithms or their mathematical details, we limit our discussion to using the Scikit Learn clustering methods.

The database contains the following details:

1. Sepal Length in cm
2. Sepal Width in cm
3. Petal Length in cm
4. Petal Width in cm
5. Class: Iris Setosa or Iris Versicolour or Iris Virginica

Let’s load the data and use it as a Spark table…

# Assumes an active SparkSession named `spark` (e.g. in a notebook)
df = spark.table('iris_data_set')
print(f"There are {df.count()} records in the dataset.")
labelCol = "Class"
df.show(5)

…and convert the Spark DataFrame into a pandas DataFrame:

import pandas as pd
dataset = df.toPandas()

Data Analysis

Mean & Standard Deviation

For a quick look at how the data are distributed around the mean, we can describe the dataset:
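A minimal sketch of that call follows. It assumes pandas and Scikit Learn are installed and, so that it runs without the Spark table, rebuilds `dataset` from Scikit Learn's bundled copy of Iris. The column names are chosen here to match the feature names used in this post and may differ from those in the actual table.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of the Spark table.
iris = load_iris()

# Assumption: column names picked to mirror the feature names used in this post.
dataset = pd.DataFrame(
    iris.data,
    columns=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"],
)

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = dataset.describe()
print(summary.loc[["mean", "std"]])
```

The `mean` and `std` rows give the center and spread of each feature, which is usually enough for a first sanity check before clustering.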

Scatter Matrix

Looking at the correlation among the features, we can see that Petal_Length and Petal_Width have a high positive correlation with each other (highlighted below).
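The relationship can also be checked numerically. The sketch below is one way to do it, assuming pandas and Scikit Learn are installed. It again uses Scikit Learn's bundled Iris data in place of the Spark table, with column names assumed to match the ones used in this post.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the Spark table.
iris = load_iris()
dataset = pd.DataFrame(
    iris.data,
    columns=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"],
)

# Pairwise Pearson correlation between all four features.
corr = dataset.corr()
petal_corr = corr.loc["Petal_Length", "Petal_Width"]

print(corr.round(2))
print(f"Petal_Length vs Petal_Width: {petal_corr:.2f}")
```

For the plot itself, `pandas.plotting.scatter_matrix(dataset)` draws the pairwise scatter plots, provided matplotlib is available.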

Here, we’ll use only these two features to cluster the input dataset with the following methods.

1. K-means Clustering

K-Means clustering partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean, which is called the cluster centroid or center of the cluster. This algorithm…
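Here is a minimal sketch of K-means with Scikit Learn on the two petal features. It assumes Scikit Learn's bundled Iris data in place of the Spark table. The values `n_clusters=3` (one cluster per Iris class), `n_init=10` and `random_state=42` are illustrative choices, not settings taken from this post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the Spark table.
iris = load_iris()

# Columns 2 and 3 are petal length and petal width (in cm).
X = iris.data[:, 2:4]

# Illustrative settings: 3 clusters, 10 random restarts, fixed seed.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# fit_predict() fits the centroids and returns a cluster index per observation.
labels = kmeans.fit_predict(X)

print("Centroids (petal length, petal width):")
print(kmeans.cluster_centers_)
print("First 10 cluster labels:", labels[:10])
```

The cluster indices are arbitrary, so they need not line up with the order of the Iris classes. Only the grouping itself carries meaning.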
