10 Classification Methods From Scikit Learn We Should Know
Scikit Learn is a very popular open-source, Python-based machine learning library. It supports various supervised (regression and classification) and unsupervised learning models.
In this blog, we’ll use 10 well-known classifiers to classify the Pima Indians Diabetes dataset (download from here; for details, refer here). Instead of going deep into the algorithms or their mathematical details, we limit our discussion to using the Scikit Learn classification methods. We haven’t included further model optimization/hyper-parameter tuning, which could be part of a more detailed discussion.
The dataset contains the following attributes:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
Let’s load the data and use it as a Spark table…
df = spark.table('pima_indians_diabetes')
print(f"""There are {df.count()} records in the dataset.""")
df.show(5)

…and convert the Spark DataFrame into a pandas DataFrame:
import pandas as pd
dataset = df.toPandas()
Data Analysis
Mean & Standard Deviation
If we want a quick look at how the data are distributed around the mean, we can describe the dataset as follows:
dataset.describe().transpose()

Scatter Matrix
A scatter matrix is another important tool: a set of scatter plots (a scatter plot is a diagram showing the relationship between two variables or features) covering several feature pairs, arranged in a matrix format. It is used to determine whether a pair of features is correlated and whether the correlation (linear relationship) is positive or negative.
- Positive correlation = if feature 1 increases, feature 2 increases as well.
- Negative correlation = if feature 1 increases, feature 2 decreases.
A scatter matrix also helps verify how independent the input features are and whether dimensionality reduction is possible.
sampled_data = df.drop("Class").sample(False, 0.8).toPandas()
axs = pd.plotting.scatter_matrix(sampled_data, figsize=(10, 10))
num_cols = len(sampled_data.columns)
for cur_col in range(num_cols):
    # rotate the y-axis labels of the first column and hide their ticks
    ax = axs[cur_col, 0]
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha('right')
    ax.set_yticks(())
    # rotate the x-axis labels of the bottom row and hide their ticks
    h = axs[num_cols - 1, cur_col]
    h.xaxis.label.set_rotation(90)
    h.set_xticks(())
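The scatter matrix gives a visual impression only; as an optional numeric complement (an addition to the original flow), pandas can compute the pairwise Pearson correlation coefficients directly:
# pairwise Pearson correlations between the sampled features
corr = sampled_data.corr()
print(corr.round(2))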

Data Preprocessing
Train & Test Datasets
In the dataset, the last attribute is the dependent variable/label and the rest are independent attributes or features. Let’s extract the features & labels:
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, -1].values
We’ll now split the dataset into random train and test subsets using the sklearn library.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=0)
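As an optional sanity check: the Pima dataset has 768 records, so a 25% test split should yield 192 test rows. (train_test_split also accepts a stratify argument if we want both subsets to preserve the class ratio.)
# verify the split sizes and the fraction of positive labels in each subset
print(X_train.shape, X_test.shape)    # expect (576, 8) and (192, 8)
print(y_train.mean(), y_test.mean())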
Standard Scaling
Some algorithms require the features to be brought onto the same scale, while others (e.g. tree-based algorithms) are invariant to it. This process is called feature scaling. In this blog we’ll use StandardScaler.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
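Note that the scaler is fitted on the training set only and merely applied to the test set, so no test-set statistics leak into training. An optional sanity check that standardization worked:
import numpy as np
# every scaled training feature should now have mean ~0 and std ~1
print(np.round(X_train.mean(axis=0), 6))
print(np.round(X_train.std(axis=0), 6))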
Now, our dataset is ready to be used by different classifiers.
1. Logistic Regression
Logistic Regression is a classification algorithm built on the logistic (sigmoid) function, which converts the outcome into a categorical value. This function produces an S-shaped curve that takes any real number as input and produces an output between 0 and 1 (in the case of binary logistic regression).
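A minimal sketch of the sigmoid itself (purely illustrative; not part of the classifier code below):
import numpy as np

def sigmoid(z):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))               # 0.5, the decision midpoint
print(sigmoid(-6), sigmoid(6))  # ~0.0025 and ~0.9975, the tails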


Logistic regression can be of three types:
- Binomial / Binary: Dependent variable can have only two possible types, “0” and “1”.
- Multinomial: Dependent variable can have three or more possible types.
- Ordinal: Dependent variables that are ordered.
Here, we’ll use only the Binary one to predict the output.
Implementation
Training the Logistic Regression model (reference: library) on the training set:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(X_train, y_train)

Predicting the test set results:
y_pred_lr = lr.predict(X_test)
Creating the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm_lr = confusion_matrix(y_test, y_pred_lr)
print(cm_lr)
acc_lr = accuracy_score(y_test, y_pred_lr)
print (acc_lr)

Confusion Matrix
Confusion Matrix is used to evaluate the accuracy of a classification model. The method we used is: sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None), where y_true holds the ground-truth target values and y_pred holds the predicted values returned by the classifier.
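For a binary problem the returned matrix is 2×2 (rows are true classes, columns are predicted classes). An optional, convenient way to unpack it:
# tn = true negatives, fp = false positives, fn = false negatives, tp = true positives
tn, fp, fn, tp = cm_lr.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")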

Accuracy Score
If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0 — refer: Scikit-Learn
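A tiny, self-contained illustration with made-up labels:
from sklearn.metrics import accuracy_score
# 3 of the 4 predictions match the true labels, so the accuracy is 0.75
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))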
2. K-Nearest Neighbours (K-NN)
This model determines the class membership of a sample by computing the distances to its k nearest neighbors. The K-NN model is very compute-intensive for larger datasets.

Implementation
Training the K-NN model (reference: library) on the training set:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)

Here, we have chosen minkowski as the distance metric; with p = 2 it is equivalent to the standard Euclidean distance.
Predicting the test set results:
y_pred_knn = knn.predict(X_test)
Creating the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm_knn = confusion_matrix(y_test, y_pred_knn)
print(cm_knn)
acc_knn = accuracy_score(y_test, y_pred_knn)
print (acc_knn)
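Although we are deliberately skipping hyper-parameter tuning in this blog, here is a minimal sketch of how one might scan a few values of k (evaluating on the test set, purely for illustration):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# fit one model per k and report its test accuracy
for k in (3, 5, 7, 9, 11):
    model = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    model.fit(X_train, y_train)
    print(k, accuracy_score(y_test, model.predict(X_test)))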

3. SVC (Support Vector Classifier) with Linear Kernel
This model finds a best-fit hyperplane that divides, or categorizes, the input data. The linear kernel assumes that the data are linearly separable.


Implementation
Training the SVC model (reference: library) on the training set:
from sklearn.svm import SVC
svc = SVC(kernel = 'linear', random_state = 0)
svc.fit(X_train, y_train)

Predicting the test set results:
y_pred_svc = svc.predict(X_test)
Creating the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm_svc = confusion_matrix(y_test, y_pred_svc)
print(cm_svc)
acc_svc = accuracy_score(y_test, y_pred_svc)
print (acc_svc)

4. Kernel SVM (Support Vector Machine)
In SVC, the kernel parameter can be any of the following:
- linear
- poly
- rbf
- sigmoid
- precomputed
The polynomial, RBF (Radial Basis Function) and sigmoid kernels are especially useful when the data points are not linearly separable. In this section, we’ll use the RBF kernel.

Implementation
from sklearn.svm import SVC
svc_rbf = SVC(kernel = 'rbf', random_state = 0)
svc_rbf.fit(X_train, y_train)

Predicting the test set results:
y_pred_svc_rbf = svc_rbf.predict(X_test)
Creating the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_svc_rbf)
print(cm)
acc_svc_rbf = accuracy_score(y_test, y_pred_svc_rbf)
print (acc_svc_rbf)
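Optionally, the same few lines can compare all four built-in kernels (again a sketch, not a tuned comparison):
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# fit one SVC per kernel and report its test accuracy
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = SVC(kernel=kernel, random_state=0)
    model.fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, model.predict(X_test)))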

5. Naïve Bayes
Naïve Bayes classifiers are a collection of algorithms based on Bayes’ Theorem.

All naïve Bayes classifiers assume that the value of a particular feature is independent of the value of every other feature, given the class variable. Given a class variable y and a dependent feature vector x₁ through xₙ, Bayes’ theorem together with this independence assumption gives

P(y | x₁, …, xₙ) ∝ P(y) ∏ᵢ P(xᵢ | y)
Scikit Learn provides a few such classifiers, e.g. GaussianNB, MultinomialNB, ComplementNB and BernoulliNB (in this blog we’ll use GaussianNB).
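GaussianNB additionally assumes that, within each class, every feature follows a normal distribution, i.e. the per-feature likelihood is

P(xᵢ | y) = (1 / √(2πσ²ᵧ)) · exp(−(xᵢ − μᵧ)² / (2σ²ᵧ)),

where μᵧ and σ²ᵧ are estimated from the training data for each class.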
Implementation
Training the NB model (reference: library) on the training set:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)

Predicting the test set results:
y_pred = nb.predict(X_test)
Creating the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc_nb = accuracy_score(y_test, y_pred)
print(acc_nb)

6. Decision Tree
A decision tree learns a series of if-then-else rules from the data for classification or regression tasks. This method is commonly used in data mining.

Implementation
Training the Decision Tree model (reference: library) on the training set:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
dt.fit(X_train, y_train)

The criterion parameter selects the function used to measure the quality of a split (both measures are defined just after this list):
- gini
- entropy
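For a node whose samples have class proportions p₁, …, p_K, the two measures are defined as

Gini: G = 1 − Σₖ p²ₖ
Entropy: H = −Σₖ pₖ log₂ pₖ

A pure node (all samples belonging to one class) scores 0 under both measures.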

Predicting the test set results:
y_pred = dt.predict(X_test)
Making the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc_dt = accuracy_score(y_test, y_pred)
print (acc_dt)

7. Random Forest
Random Forest is an ensemble of decision trees, each trained on a different sub-sample of the dataset, that uses averaging or majority voting to improve predictive accuracy. It is based on the idea that averaging/voting over predictions from several models is more robust than the prediction of any individual model.

Implementation
Training the Random Forest model on the training set:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

A few of the parameters (reference: library):
- n_estimators: the number of trees in the forest, default=100.
- criterion: the function to measure the quality of a split, either “gini” or “entropy”, default=“gini”.
- max_depth: the maximum depth of the tree.
- max_features: the number of features to consider when looking for the best split — “auto” (sqrt(n_features)) / “sqrt” (sqrt(n_features)) / “log2” (log2(n_features)), default=“auto”
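Once fitted, the forest also exposes impurity-based feature importances, which we can optionally inspect (the column names come from the pandas DataFrame built earlier):
# impurity-based importance score of each input feature
feature_names = dataset.columns[:-1]
for name, importance in zip(feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")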
Predicting the test set results:
y_pred_rf = rf.predict(X_test)
Creating the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_rf)
print(cm)
acc_rf = accuracy_score(y_test, y_pred_rf)
print (acc_rf)

8. AdaBoost Classifier
AdaBoost (Adaptive Boosting) fits a sequence of weak learners on repeatedly re-weighted versions of the training data, and it can be used in conjunction with many other types of learning algorithms to improve performance.
Implementation
Training the AdaBoost model (reference library) on the training set:
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier()
abc.fit(X_train, y_train)

The algorithm parameter takes either:
- SAMME or,
- SAMME.R (default; this typically converges faster than SAMME)
Predicting the test set results:
y_pred_abc = abc.predict(X_test)
Creating the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_abc)
print(cm)
acc_abc = accuracy_score(y_test, y_pred_abc)
print (acc_abc)
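Because boosting is sequential, AdaBoostClassifier can also report how the test accuracy evolves as estimators are added, via its staged_predict method. An optional sketch:
from sklearn.metrics import accuracy_score

# test accuracy after every 10th boosting round
for i, y_staged in enumerate(abc.staged_predict(X_test), start=1):
    if i % 10 == 0:
        print(i, accuracy_score(y_test, y_staged))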
9. Quadratic Discriminant Analysis
A quadratic classifier is a statistical classifier that uses a quadratic decision surface to separate measurements of two or more classes of objects or events. It is a more general version of the linear classifier. — Wikipedia
Apart from an optional regularization parameter (reg_param), it has essentially no hyperparameters to tune.
Implementation
Training the quadratic discriminant analysis model (reference library) on the training set:
from sklearn.discriminant_analysis \
import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

Predicting the test set results:
y_pred_qda = qda.predict(X_test)
Making the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_qda)
print(cm)
acc_qda = accuracy_score(y_test, y_pred_qda)
print (acc_qda)

10. MLP Classifier
Multi-Layer Perceptron (MLP) is a class of feedforward artificial neural network (ANN). It has at least three layers of nodes: an input layer, a hidden layer and an output layer.

Implementation
Training the MLP classifier model (reference library) on the training set:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(alpha=1, max_iter=1000)
mlp.fit(X_train, y_train)
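A few notes on the parameters (from the scikit-learn documentation): alpha is the L2 regularization strength and max_iter caps the number of optimizer iterations; the architecture is controlled by hidden_layer_sizes (default (100,), i.e. a single hidden layer of 100 neurons), with activation defaulting to 'relu' and the solver to 'adam'.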

Predicting the test set results:
y_pred_mlp = mlp.predict(X_test)
Making the confusion matrix & calculating accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_mlp)
print(cm)
acc_mlp = accuracy_score(y_test, y_pred_mlp)
print (acc_mlp)

Conclusion
Scikit Learn is a single-node framework and contains various effective tools for predictive data analysis. It works well for data of limited size. In case we need to work with massive amounts of data, we can try Apache Spark MLlib.