10 Classification Methods From Scikit Learn We Should Know

We'll use the Pima Indians Diabetes dataset, where each record describes one patient through the following attributes:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
import pandas as pd

# Load the dataset from the Spark table and convert it to a Pandas DataFrame.
df = spark.table('pima_indians_diabetes')
print(f"""There are {df.count()} records in the dataset.""")
df.show(5)
dataset = df.toPandas()

Data Analysis

Mean & Standard Deviation

For a quick look at how the data are distributed around the mean, we can describe the dataset as follows:

dataset.describe().transpose()

Scatter Matrix

Correlations between features are another important metric:

  • Positive correlation = if feature 1 increases, feature 2 also increases.
  • Negative correlation = if feature 1 increases, feature 2 decreases.
# Sample 80% of the rows (without the label column) and plot a scatter matrix.
sampled_data = df.drop("Class").sample(False, 0.8).toPandas()
axs = pd.plotting.scatter_matrix(sampled_data, figsize=(10, 10))

# Rotate the axis labels and hide the ticks for readability.
num_cols = len(sampled_data.columns)
for cur_col in range(num_cols):
    ax = axs[cur_col, 0]
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha('right')
    ax.set_yticks(())
    h = axs[num_cols - 1, cur_col]
    h.xaxis.label.set_rotation(90)
    h.set_xticks(())
The scatter matrix shows that the input features are not strongly correlated.

Data Preprocessing

Train & Test Datasets

In the dataset, the last attribute is the dependent variable (label) and the rest are independent attributes (features). Let's extract the features and labels:

from sklearn.model_selection import train_test_split

# Features are all columns except the last; the label is the last column.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Hold out 25% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

Standard Scaling

Some algorithms need the features to be brought onto the same scale, while others (e.g. tree-based algorithms) are invariant to it. This process is called Feature Scaling. In this blog we'll use StandardScaler.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply it to both sets,
# so no test-set statistics leak into training.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
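As an optional sanity check, each scaled training feature should now have roughly zero mean and unit standard deviation:

import numpy as np
# Means ~0 and standard deviations ~1 after standard scaling.
print(np.round(X_train.mean(axis=0), 6))
print(np.round(X_train.std(axis=0), 6))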

1. Logistic Regression

Logistic Regression is a classification algorithm built on the logistic (sigmoid) function, which converts the outcome into a categorical value. The sigmoid produces an S-shaped curve that takes any real number as input and produces an output between 0 and 1 (in the case of Binary Logistic Regression).

Example: Graph of a logistic regression curve showing probability of passing an exam versus hours studying (reference: Wikipedia).
Reference: https://www.saedsayad.com/logistic_regression.htm
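As a minimal sketch of that S-shaped curve, we can implement the sigmoid directly (the helper name sigmoid is ours, not part of scikit-learn):

import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# ~[0.0000454, 0.2689, 0.5, 0.7311, 0.9999546]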
Besides the binary case, there are two other types of logistic regression:

  1. Multinomial: the dependent variable can have three or more unordered types.
  2. Ordinal: the dependent variable's categories are ordered.

Implementation

Training the Logistic Regression model on the training set:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Train, predict, then evaluate with a confusion matrix and accuracy.
lr = LogisticRegression(random_state=0)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
cm_lr = confusion_matrix(y_test, y_pred_lr)
print(cm_lr)
acc_lr = accuracy_score(y_test, y_pred_lr)
print(acc_lr)

Confusion Matrix

The Confusion Matrix is used to evaluate the accuracy of a classification model. The method we used is sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None), where y_true holds the ground-truth target values and y_pred the values predicted by the classifier.

Reference: Wikipedia
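A toy example of the layout for binary labels (rows are true classes, columns are predicted classes; the demo labels are made up):

from sklearn.metrics import confusion_matrix
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]
print(confusion_matrix(y_true_demo, y_pred_demo))
# [[2 1]   i.e. [[TN FP]
#  [1 2]]        [FN TP]]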

Accuracy Score

Accuracy is the fraction of predictions the classifier got right, i.e. (TP + TN) / total number of samples.
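On the same made-up demo labels as above, 4 of the 6 predictions are correct:

from sklearn.metrics import accuracy_score
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]
print(accuracy_score(y_true_demo, y_pred_demo))  # 0.666...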

2. K-Nearest Neighbours (K-NN)

This model determines the class membership of a sample by majority vote among its k nearest neighbours, found by computing distances to the training points. The KNN model is very compute-intensive for larger datasets.

Image source: Wikipedia

Implementation

Training the K-NN model on the training set:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Minkowski distance with p = 2 is the ordinary Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
cm_knn = confusion_matrix(y_test, y_pred_knn)
print(cm_knn)
acc_knn = accuracy_score(y_test, y_pred_knn)
print(acc_knn)
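We used k = 5 above; a quick sketch (reusing the scaled split from the earlier cells) shows how sensitive the accuracy is to that choice:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Scan a few odd values of k and compare held-out accuracy.
for k in (1, 3, 5, 7, 9, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(k, accuracy_score(y_test, model.predict(X_test)))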

3. SVC (Support Vector Classifier) with Linear Kernel

This model finds a best-fit hyperplane that divides, or categorizes, the input data. It assumes that the data are linearly separable.

Image source: Scikit Learn

Implementation

Training the SVC model on the training set:

from sklearn.svm import SVC
svc = SVC(kernel='linear', random_state=0)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm_svc = confusion_matrix(y_test, y_pred_svc)
print(cm_svc)
acc_svc = accuracy_score(y_test, y_pred_svc)
print(acc_svc)
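As an optional inspection, the fitted model exposes the support vectors that define the hyperplane:

# Number of support vectors per class, and their overall shape.
print(svc.n_support_)
print(svc.support_vectors_.shape)  # (n_support_vectors, n_features)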

4. Kernel SVM (Support Vector Machine)

In SVC the kernel function can be any of the following:

  • linear
  • poly (polynomial)
  • rbf (radial basis function, the default)
  • sigmoid
  • precomputed (a user-supplied kernel matrix)

Image source: Scikit Learn; using linear, polynomial & RBF kernels respectively.
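A short sketch, reusing the split from the earlier cells, to compare the built-in kernels side by side:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train one SVC per kernel and compare held-out accuracy.
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = SVC(kernel=kernel, random_state=0)
    model.fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, model.predict(X_test)))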

Implementation

from sklearn.svm import SVC
svc_rbf = SVC(kernel='rbf', random_state=0)
svc_rbf.fit(X_train, y_train)
y_pred_svc_rbf = svc_rbf.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_svc_rbf)
print(cm)
acc_svc_rbf = accuracy_score(y_test, y_pred_svc_rbf)
print(acc_svc_rbf)

5. Naïve Bayes

Naïve Bayes classifiers are a collection of algorithms based on Bayes’ Theorem.

Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B) (source: Wikipedia)

Refer to the probabilistic model for further reading.
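A quick worked example of the theorem in plain Python, with made-up numbers: suppose 1% of patients have a disease, the test detects it 90% of the time, and it gives a false positive 5% of the time.

# P(A|B) = P(B|A) * P(A) / P(B), where
# A = "has disease" and B = "test is positive" (illustrative numbers).
p_a = 0.01              # P(A): prior probability
p_b_given_a = 0.90      # P(B|A): sensitivity
p_b_given_not_a = 0.05  # P(B|not A): false-positive rate

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # total probability
print(round(p_b_given_a * p_a / p_b, 4))  # P(A|B) ~ 0.1538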

Implementation

Training the NB model on the training set:

from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_nb)
print(cm)
acc_nb = accuracy_score(y_test, y_pred_nb)
print(acc_nb)

6. Decision Tree

A decision tree is a series of if-then-else rules learned from the data for classification or regression tasks. This method is commonly used in data mining.

Image source: Learning Spark, 2nd Edition, figure 10–9. Decision tree example.

Implementation

Training the Decision Tree model on the training set:

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy', random_state=0)
dt.fit(X_train, y_train)

The criterion parameter measures the quality of a split; it can be either "gini" (the default) or "entropy".

y_pred_dt = dt.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_dt)
print(cm)
acc_dt = accuracy_score(y_test, y_pred_dt)
print(acc_dt)
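Since the tree really is a set of if-then-else rules, we can print what it learned. This sketch assumes scikit-learn 0.21+ (for export_text) and that the feature names below match the attribute list at the top:

from sklearn.tree import export_text
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure',
                 'SkinThickness', 'Insulin', 'BMI',
                 'DiabetesPedigree', 'Age']  # assumed column names
print(export_text(dt, feature_names=feature_names, max_depth=2))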

7. Random Forest

Random Forest is an ensemble of Decision Trees trained on various sub-samples of the dataset; it uses averaging or majority voting to improve predictive accuracy. It's based on the idea that averaging or voting over predictions from different models is more robust than the prediction of any individual model.

Image source: Wikipedia

Implementation

Training the Random Forest model on the training set:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

The key parameters are:

  • criterion: the function to measure the quality of a split, either "gini" or "entropy", default="gini".
  • max_depth: the maximum depth of each tree.
  • max_features: the number of features to consider when looking for the best split: "auto" (sqrt(n_features)), "sqrt" (sqrt(n_features)), or "log2" (log2(n_features)); default="auto".

y_pred_rf = rf.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_rf)
print(cm)
acc_rf = accuracy_score(y_test, y_pred_rf)
print(acc_rf)
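A useful by-product of the fitted forest is its impurity-based feature importance. A small sketch, again assuming the feature names match the attribute list at the top:

import numpy as np

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure',
                 'SkinThickness', 'Insulin', 'BMI',
                 'DiabetesPedigree', 'Age']  # assumed column names
# Rank the features from most to least important.
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(f"{feature_names[idx]}: {rf.feature_importances_[idx]:.3f}")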

8. AdaBoost Classifier

AdaBoost (Adaptive Boosting) can be used in conjunction with many other types of learning algorithms to improve performance. It fits a sequence of weak learners on repeatedly re-weighted copies of the data, so each new learner concentrates on the samples its predecessors misclassified.

Implementation

Training the AdaBoost model on the training set:

from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier()
abc.fit(X_train, y_train)

The algorithm parameter can be "SAMME" or "SAMME.R" (the default; it typically converges faster than SAMME).

y_pred_abc = abc.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_abc)
print(cm)
acc_abc = accuracy_score(y_test, y_pred_abc)
print(acc_abc)
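If we want to tune it, the two most useful knobs are n_estimators and learning_rate (both standard AdaBoostClassifier parameters); a sketch reusing the earlier split:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Vary the number of boosting rounds and compare held-out accuracy.
for n in (25, 50, 100):
    model = AdaBoostClassifier(n_estimators=n, learning_rate=1.0,
                               random_state=0)
    model.fit(X_train, y_train)
    print(n, accuracy_score(y_test, model.predict(X_test)))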

9. Quadratic Discriminant Analysis

Quadratic Discriminant Analysis (QDA) fits a Gaussian density to each class and classifies new samples with the resulting quadratic decision boundary.

Implementation

Training the quadratic discriminant analysis model on the training set:

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
y_pred_qda = qda.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_qda)
print(cm)
acc_qda = accuracy_score(y_test, y_pred_qda)
print(acc_qda)

10. MLP Classifier

Multi-Layer Perceptron (MLP) is a class of artificial neural network (ANN). It has at least three layers of nodes: an input layer, a hidden layer and an output layer.

Image source: Scikit Learn

Implementation

Training the MLP classifier model on the training set:

from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(alpha=1, max_iter=1000)
mlp.fit(X_train, y_train)
y_pred_mlp = mlp.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_mlp)
print(cm)
acc_mlp = accuracy_score(y_test, y_pred_mlp)
print(acc_mlp)
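The network architecture is controlled by hidden_layer_sizes; for example, a sketch with two hidden layers (the sizes 16 and 8 are arbitrary, for illustration):

from sklearn.neural_network import MLPClassifier

# An MLP with two hidden layers of 16 and 8 neurons.
mlp_two_layers = MLPClassifier(hidden_layer_sizes=(16, 8),
                               max_iter=1000, random_state=0)
mlp_two_layers.fit(X_train, y_train)
print(mlp_two_layers.n_layers_)  # 4: input + 2 hidden + output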

Conclusion

Scikit Learn is a single-node framework that provides effective tools for predictive data analysis. It works well for data that fits on one machine; when we need to work with massive amounts of data, we can try Apache Spark MLlib.
