
9 Classification Methods From Spark MLlib We Should Know

Prosenjit Chakraborty
10 min read · Jan 12, 2021


How about a classifier to classify human characters into Good, Bad or Ugly? Image source.

Spark MLlib is a distributed machine learning framework comprising a set of popular machine learning libraries and utilities. Because it uses Spark Core for parallel computing, it is really useful for applying these algorithms to big data sets.

In this blog, we’ll use 9 well-known classifiers to classify the Banknote dataset (download from here and for details, refer here). Rather than going deep into the algorithms or their mathematical details, we limit the discussion to using the Spark MLlib classification methods. Model optimization and hyper-parameter tuning are left out; they can be part of a more detailed discussion later.

The dataset contains the following attributes:

1. Variance of Wavelet Transformed image
2. Skewness of Wavelet Transformed image
3. Kurtosis of Wavelet Transformed image
4. Entropy of image
5. Class variable (0 or 1)

Let’s load the data and use it as a Spark table…

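# `spark` is the SparkSession that notebook environments usually provide
df = spark.table('data_banknote_authentication')
print(f"There are {df.count()} records in the dataset.")
labelCol = "Class"  # name of the target column
df.show(5)  # preview the first five rows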

Data Analysis

