9 Classification Methods From Spark MLlib We Should Know
Spark MLlib is a distributed machine learning framework comprising a set of popular machine learning algorithms and utilities. Because it uses Spark Core for parallel computing, it is well suited to applying these algorithms to big data sets.
In this blog, we’ll use 9 well-known classifiers to classify the Banknote dataset (download from here and, for details, refer here). Instead of going deep into the algorithms or their mathematical details, we limit our discussion to using the Spark MLlib classification methods. We haven’t included model optimization or hyper-parameter tuning, which can be part of a further, more detailed discussion.
The dataset contains the following attributes:
1. Variance of Wavelet Transformed image
2. Skewness of Wavelet Transformed image
3. Kurtosis of Wavelet Transformed image
4. Entropy of image
5. Class variable (0 or 1)
Let’s load the data and use it as a Spark table…
# Load the pre-registered banknote table
df = spark.table('data_banknote_authentication')
print(f"There are {df.count()} records in the dataset.")

# The column we want the classifiers to predict
labelCol = "Class"

# Peek at the first few rows
df.show(5)