9 Classification Methods From Spark MLlib We Should Know

How about a classifier that sorts human characters into Good, Bad, or Ugly? Image source.

Spark MLlib is a distributed machine learning framework comprising a set of popular machine learning libraries and utilities. Because it uses Spark Core for parallel computing, it is well suited to applying these algorithms to big data sets.

In this blog, we’ll use 9 well-known classifiers to classify the Banknote dataset (download from here and for details, refer here). Instead of going deep into the algorithms or mathematical details, we limit the discussion to using the Spark MLlib classification methods. We haven’t included model optimization or hyper-parameter tuning, which can be part of a further detailed discussion.

The dataset contains the following details:

Let’s load the data and use it as a Spark table…

Data Analysis

If we want a quick look at how the data are distributed around the mean, we can describe the dataset:

To see whether there is any opportunity to reduce dimensions, we can plot the scatter matrix.

Now, let’s look at quick definitions of the 3 main components of MLlib:

Estimator, Transformer & Pipeline

Estimator: An Estimator is an algorithm that fits or trains on data. It implements a fit() method, which accepts a Spark DataFrame and produces a Model. E.g., pyspark.ml.classification.LogisticRegression is an Estimator.

Transformer: A Transformer is an abstraction that includes feature transformers and learned models. It implements a transform() method, which transforms one DataFrame into another DataFrame. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.

Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. Using one is not mandatory, but it really helps to connect the dots, i.e. to run the stages in the appropriate order.

Data Preprocessing

In the dataset, the last attribute is the dependent variable (label) and the rest are independent attributes (features). Let’s extract the features & labels:

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models.

As we’re using Pipeline, we don’t need to transform the stages individually, e.g. VectorAssembler or StandardScaler. But, just to show how the features would appear after transforming, let’s execute the following:

Some algorithms need the features to be on the same scale, while others (e.g. tree-based algorithms) are invariant to it. Bringing the features onto a common scale is called Feature Scaling. In this blog we’ll use StandardScaler.

The following code block is optional.

1. Logistic Regression

Logistic Regression is a classification algorithm built on the logistic (sigmoid) function, which turns the model’s raw output into a probability that is then mapped to a categorical value. The function produces an S-shaped curve: it takes any real number as input and returns a value between 0 and 1 (in the case of Binary Logistic Regression).

Example: Graph of a logistic regression curve showing probability of passing an exam versus hours studying (reference: Wikipedia).
Reference: https://www.saedsayad.com/logistic_regression.htm
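The S-shaped curve can be illustrated with a few lines of plain Python. This is a sketch of the standard logistic function, written in a numerically stable form; it is not Spark's internal implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative inputs, use the equivalent form that avoids overflow in exp().
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

An input of 0 gives exactly 0.5, large positive inputs approach 1, and large negative inputs approach 0. A binary classifier predicts class "1" when this probability exceeds a threshold, commonly 0.5.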

Logistic regression can be of three types:

  1. Binomial / Binary: Dependent variable can have only two possible types, “0” and “1”.
  2. Multinomial: Dependent variable can have three or more possible types.
  3. Ordinal: Dependent variable can have three or more possible types that are ordered.

Here, we’ll use only the Binomial one to predict the output.

Training the Logistic Regression model (reference: library) on the training set:

The parameters used (reference: library):

  • maxIter: max number of iterations (>= 0).
  • regParam: regularization parameter.
  • elasticNetParam: the ElasticNet mixing parameter (alpha), in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
  • family: the name of family which is a description of the label distribution to be used in the model; options: auto, binomial, multinomial.

Predicting the test set results:

Measuring accuracy: we’ll use MulticlassClassificationEvaluator to evaluate the models. It can calculate different metrics such as f1, accuracy, and precisionByLabel. Here, we’ll only measure the accuracy.

Once the evaluator is created, we can use it to measure the performance of different models.

2. Linear Support Vector Machine

This model returns a best-fit hyperplane that divides, or categorizes, the input data. It works best when the classes are linearly separable, or nearly so.


Training the SVC model (reference: library) on the training set:

Predicting the test set results:

Measuring accuracy:

3. Naïve Bayes

Naïve Bayes classifiers are a collection of algorithms based on Bayes’ Theorem.

Bayes’ Theorem — mathematical formula, source: Wikipedia

All naïve Bayes classifiers assume that the value of a particular feature is independent of the value of the other features, given the class variable y, for a feature vector x1 through xn.
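In standard notation, Bayes' theorem, the naïve independence assumption, and the resulting posterior can be written as follows. This is the textbook formulation, given here as a reconstruction since the original formula images are not available.

```latex
\begin{align*}
% Bayes' theorem for class y and features x_1, ..., x_n
P(y \mid x_1, \dots, x_n) &= \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \\
% Naive assumption: each feature is conditionally independent of the others given y
P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) &= P(x_i \mid y) \\
% Posterior after applying the assumption to every feature
P(y \mid x_1, \dots, x_n) &= \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
\end{align*}
```

Because the denominator is the same for every class, the predicted class is the y that maximizes the numerator.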

Refer to the Probabilistic model for further reading.

A few of the available classifiers (in this blog we’ll use Gaussian NB):

  • Multinomial (default),
  • Bernoulli and
  • Gaussian

Training the NB model (reference: library) on the training set:

Predicting the test set results:

Measuring accuracy:

4. Decision Tree

A decision tree is a series of if-then-else rules learned from your data for classification or regression tasks. This method is commonly used in data mining.

Image source: Learning Spark, 2nd Edition, figure 10–9. Decision tree example.

Training the Decision Tree model (reference: library) on the training set:

The impurity parameter selects the function used to calculate information gain, which in turn is used to decide how to split a node:

  • gini
  • entropy
Reference: source
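To make the two impurity options concrete, here is a plain-Python sketch of the standard definitions of Gini impurity, entropy, and information gain for a binary split. It illustrates the idea only and is not Spark's internal code; the function names are chosen for this example.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted
```

For a perfectly mixed node such as `[0, 0, 1, 1]`, Gini impurity is 0.5 and entropy is 1.0; a split that separates the classes completely recovers all of that as information gain.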

Predicting the test set results:

Measuring accuracy:

5. Random Forest

Random Forest is an ensemble of Decision Trees, each trained on a different sub-sample of the dataset, that uses averaging or majority voting to improve predictive accuracy. It’s based on the idea that averaging or voting over the predictions of different models is more robust than the prediction of any individual model.

Image source: Wikipedia

Training the Random Forest model (reference: library) on the training set:

A few of the parameters (reference: library):

  • featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: ‘auto’ (choose automatically for task: If numTrees == 1, set to ‘all’. If numTrees > 1 (forest), set to ‘sqrt’ for classification and to ‘onethird’ for regression), ‘all’ (use all features), ‘onethird’ (use 1/3 of the features), ‘sqrt’ (use sqrt(number of features)), ‘log2’ (use log2(number of features)), ‘n’ (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = ‘auto’.
  • impurity: Criterion used for information gain calculation (case-insensitive), options: ‘entropy’, ‘gini’.
  • maxDepth: Maximum depth of the tree.
  • numTrees: Number of trees to train.

Measuring accuracy:

6. Gradient-Boosted Tree

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. GBTs train the trees iteratively, one after another, in order to minimize a loss function.

Training the GBT classifier model (reference library) on the training set:

Predicting the test set results:

Measuring accuracy:

7. MLP Classifier

Multi-Layer Perceptron (MLP) is a class of artificial neural network (ANN). It has at least three layers of nodes: an input layer, one or more hidden layers, and an output layer.

Image source: Scikit Learn

Training the MLP classifier model (reference library) on the training set:

Predicting the test set results:

Measuring accuracy:

8. One-vs-Rest

OneVsRest is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as “One-vs-All.” — reference.

Training the OvR classifier model (reference library) on the training set:

Predicting the test set results:

Measuring accuracy:

9. Factorization Machines

Factorization Machines are able to estimate interactions between features even in problems with huge sparsity… The spark.ml implementation supports factorization machines for binary classification and for regression. reference.

Training the FM classifier model (reference library) on the training set:

One of the parameters we’re passing here (reference library):

  • stepSize: Step size to be used for each iteration of optimization.

Predicting the test set results:

Measuring accuracy:


If we need to work with massive amounts of data, Spark MLlib is a good choice. Otherwise, for limited data, we can try a single-node framework such as Scikit Learn.

To use classifiers from Scikit Learn, check out 10 Classification Methods From Scikit Learn We Should Know.

Thanks for reading! If you enjoyed this post, clap and share it. To see similar posts, follow me on Medium and LinkedIn.

Tech enthusiast, Azure Big Data Architect.
