15 Minutes to sklearn and Machine Learning - Classification Algorithm Articles

Author: Ho Ching

Introduction

Welcome to this article on sklearn and machine learning classification algorithms. In this article, we will take a 15-minute journey through some of the most common classification algorithms, including logistic regression, naive Bayes, KNN, SVM, and decision tree. We will also explore the interfaces of these algorithms in sklearn and provide code snippets to demonstrate their usage.

Logistic Regression

Logistic regression is a classification algorithm that is commonly used in machine learning. Although its name contains “regression,” it is actually a classification algorithm rather than a linear regression model. Logistic regression is also known as logit regression, maximum entropy classifier, or log-linear classifier.
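The “logistic” in the name refers to the logistic (sigmoid) function, which squashes a real-valued linear score into a probability in (0, 1); the class is then predicted by thresholding that probability. A minimal sketch of the idea (the 0.5 threshold matches the usual binary decision rule):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score of 0 sits exactly on the decision boundary.
print(sigmoid(0.0))         # 0.5
# Large positive scores approach probability 1; negative scores approach 0.
print(sigmoid(4.0) > 0.5)   # True -> predicted class 1
print(sigmoid(-4.0) < 0.5)  # True -> predicted class 0
```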

sklearn Interface for Logistic Regression

The sklearn interface for logistic regression is as follows:

class sklearn.linear_model.LogisticRegression(
    penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True,
    intercept_scaling=1, class_weight=None, random_state=None, solver='warn',
    max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None):

Common Parameters for Logistic Regression

  • penalty: The penalty (regularization) term, generally “l1” or “l2.”
  • dual: Dual or primal formulation. Only applicable with the liblinear solver and the “l2” penalty. When the number of samples is greater than the number of features, this is usually set to False.
  • C: Inverse of regularization strength (smaller values mean stronger regularization); must be a positive float.
  • solver: The optimization algorithm, one of {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}.
  • multi_class: How multi-class problems are handled. With “ovr,” the multi-class problem is decomposed into multiple one-vs-rest binary problems; with “multinomial,” the loss minimized is the multinomial loss fit across the entire probability distribution.
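Since C is the inverse of regularization strength, its effect can be seen by comparing coefficient magnitudes at two settings. A small sketch on synthetic data (the dataset and parameter values here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Smaller C = stronger L2 regularization = coefficients shrunk toward zero.
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

print(np.linalg.norm(strong.coef_))  # noticeably smaller
print(np.linalg.norm(weak.coef_))
```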

Case Study for Logistic Regression

Here is an example of using logistic regression to classify the iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)

# Predict the labels of the first two samples
clf.predict(X[:2, :])

# Predict class probabilities for the first two samples
clf.predict_proba(X[:2, :])

# Mean accuracy of the model on the training data
clf.score(X, y)

Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. Here are the Naive Bayes algorithms available in sklearn:

  • Gaussian Naive Bayes (GaussianNB): Assumes the likelihood of each feature is Gaussian. The underlying principle can be found in http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf.
  • Multinomial Naive Bayes (MultinomialNB / MNB): Implements Naive Bayes for multinomially distributed data, such as word counts in text classification.
  • Complement Naive Bayes (ComplementNB / CNB): An adaptation of the standard multinomial Naive Bayes (MNB) algorithm, particularly suited to imbalanced data sets.
  • Bernoulli Naive Bayes (BernoulliNB): Implements Naive Bayes training and classification for data distributed according to multivariate Bernoulli distributions.

Case Study for Naive Bayes

Here are some examples of using the Naive Bayes algorithms to classify the iris dataset:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points out of a total %d points: %d" % (iris.data.shape[0], (iris.target != y_pred).sum()))

# The discrete variants below expect non-negative count-like features, so we
# use random integer data purely to demonstrate the interface.
import numpy as np
X = np.random.randint(50, size=(1000, 100))
y = np.random.randint(6, size=(1000))

# Using MultinomialNB
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))

# Using ComplementNB
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))

# Using BernoulliNB (features are binarized at threshold 0.0 by default)
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))

K-Nearest Neighbors (KNN)

KNN classifies each query point based on the labels of its k nearest neighbors, where k is an integer value specified by the user. It is one of the most classic machine learning algorithms. The sklearn interface of KNN is as follows:

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs):

Common Parameters for KNN

  • n_neighbors: The number of neighbors k; the most important parameter of KNN.
  • algorithm: The algorithm used to compute the nearest neighbors, one of {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}.
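Since n_neighbors is the most important parameter, a common practice is to select it by cross-validation. A minimal sketch on iris (the candidate k values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score a range of k values with 5-fold cross-validation.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```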

Case Study for KNN

Here is an example of using KNN to classify the iris dataset:

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(iris.data, iris.target)

# Predicted labels and class probabilities on the training data
print(neigh.predict(iris.data))
print(neigh.predict_proba(iris.data))

Support Vector Machine (SVM)

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. Here I will introduce only classification. The advantages of SVMs are that they are effective in high-dimensional spaces and remain effective even when the number of dimensions exceeds the number of samples, so they can perform well on small data sets.

sklearn Interface for SVM

The most commonly used interface is SVC. The sklearn interface of SVM is as follows:

class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None):

Common Parameters for SVM

  • C: Penalty parameter C of the error term.
  • kernel: The kernel function. Commonly used kernels: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’.
  • probability: Whether to enable probability estimates (required for predict_proba).
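The choice of kernel matters most when the classes are not linearly separable. On concentric-ring data, for instance, an ‘rbf’ kernel can separate what a ‘linear’ kernel cannot; a small sketch (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)

linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', gamma='scale').fit(X, y).score(X, y)
print(linear_acc, rbf_acc)  # rbf should be far higher
```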

Case Study for SVM

Here is a simple example of using SVM for binary classification on a toy dataset:

import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(C=1, kernel='rbf', gamma='auto')
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))

Expansion: SVM Binary Classification Problem Solving

SVM has unique advantages in solving binary classification problems, but it cannot handle multi-class problems directly. The common solution is the “one-vs-one” strategy, which decomposes the multi-class problem into a binary problem for every pair of classes.
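In sklearn, SVC applies the one-vs-one strategy internally: for n classes it trains n*(n-1)/2 pairwise binary classifiers. With decision_function_shape='ovo', the raw pairwise decision values are exposed, one column per class pair; a quick sketch on iris (3 classes, hence 3 pairwise classifiers):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf', gamma='scale',
          decision_function_shape='ovo').fit(X, y)

# 3 classes -> 3 * (3 - 1) / 2 = 3 pairwise classifiers
print(clf.decision_function(X[:2]).shape)  # (2, 3)
```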

Decision Tree

The decision tree is one of the top ten classic machine learning algorithms and can naturally handle multi-class problems. The sklearn interface of the decision tree is as follows:

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False):

Common Parameters for Decision Tree

  • criterion: The function used to measure the quality of a split. Common choices are “gini” for Gini impurity and “entropy” for information gain.
  • max_depth: The maximum depth of the tree.
  • min_samples_split: The minimum number of samples required to split an internal node.
  • min_samples_leaf: The minimum number of samples required at a leaf node.
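max_depth is the simplest way to control overfitting, and a shallow tree is also easy to inspect with sklearn's export_text helper. A minimal sketch on iris:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Capping the depth at 2 keeps the tree small and interpretable.
clf = DecisionTreeClassifier(max_depth=2,
                             random_state=0).fit(iris.data, iris.target)

print(clf.get_depth())  # 2
print(export_text(clf, feature_names=list(iris.feature_names)))
```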

Case Study for Decision Tree

Here is an example of using decision tree to classify the iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
clf.fit(iris.data, iris.target)

# Predicted labels and class probabilities on the training data
clf.predict(iris.data)
clf.predict_proba(iris.data)

# 5-fold cross-validated accuracy
print(cross_val_score(clf, iris.data, iris.target, cv=5))