15 Minutes to sklearn and Machine Learning: Classification Algorithms
Author: Ho Ching
Introduction
Welcome to this article on sklearn and machine learning classification algorithms. In this article, we will take a 15-minute journey through some of the most common classification algorithms, including logistic regression, naive Bayes, KNN, SVM, and decision tree. We will also explore the interfaces of these algorithms in sklearn and provide code snippets to demonstrate their usage.
Logistic Regression
Logistic regression is a classification algorithm that is commonly used in machine learning. Although its name contains “regression,” it is a classification algorithm rather than a linear regression model. Logistic regression is also known as logit regression, maximum entropy classifier, or log-linear classifier.
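The “logistic” in the name refers to the logistic (sigmoid) function, which maps a linear combination of the features to a probability in (0, 1). A minimal sketch (the weights and bias below are arbitrary illustrative values, not fitted ones):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative weights and bias (not fitted values)
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 1.0])

# Probability of the positive class; predict class 1 if p >= 0.5
p = sigmoid(np.dot(w, x) + b)
print(round(p, 4))  # → 0.7006
```

Fitting a logistic regression model means learning the weights and bias that maximize the likelihood of the training labels under this probability model.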
sklearn Interface for Logistic Regression
The sklearn interface for logistic regression is as follows:
class sklearn.linear_model.LogisticRegression(
penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True,
intercept_scaling=1, class_weight=None, random_state=None, solver='warn',
max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None):
Common Parameters for Logistic Regression
penalty: The penalty term, generally “l1” or “l2.”
dual: Only applicable when using the liblinear solver with the “l2” penalty. Usually set to False when the number of samples is greater than the number of features.
C: Inverse of regularization strength (smaller values indicate stronger regularization); must be a positive floating point number.
solver: The optimization algorithm, one of {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}.
multi_class: How multi-class problems are handled. With “ovr”, the multi-class problem is decomposed into multiple binary (one-vs-rest) problems; with “multinomial”, the multinomial loss is minimized over the entire probability distribution.
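To see the effect of C, the following sketch fits the same model on iris with a strong and a weak regularization setting and compares the total coefficient magnitude (smaller C shrinks the coefficients more):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Strong regularization (small C) vs. weak regularization (large C)
strong = LogisticRegression(C=0.01, solver='lbfgs', max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, solver='lbfgs', max_iter=1000).fit(X, y)

# The more regularized model has smaller coefficients overall
print(np.abs(strong.coef_).sum(), np.abs(weak.coef_).sum())
```

In practice, C is typically tuned by cross-validation rather than set by hand.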
Case Study for Logistic Regression
Here is an example of using logistic regression to classify the iris dataset:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)
# Predict class labels for the first two samples
clf.predict(X[:2, :])
# Predict class probabilities for the first two samples
clf.predict_proba(X[:2, :])
# Mean accuracy of the model on the training data
clf.score(X, y)
Naive Bayes
Naive Bayes is a family of supervised learning algorithms based on Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. Here are the interfaces of the Naive Bayes algorithms in sklearn:
- Gaussian Naive Bayes (GaussianNB): Assumes the likelihood of the features is Gaussian. The underlying principle is described in http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf.
- Multinomial Naive Bayes (MultinomialNB / MNB): Implements Naive Bayes for multinomially distributed data, such as word counts in text classification.
- Complement Naive Bayes (ComplementNB / CNB): An adaptation of the standard multinomial Naive Bayes (MNB) algorithm, particularly suited to imbalanced data sets.
- Bernoulli Naive Bayes (BernoulliNB): Implements Naive Bayes for data distributed according to multivariate Bernoulli distributions, i.e., each feature is binary-valued.
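Since MultinomialNB is a common choice for text classification with word-count features, here is a minimal sketch of that use case (the tiny corpus and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: label 1 = about sports, 0 = not
docs = ["the match was a great game",
        "the team won the game",
        "the budget report is due",
        "please review the report"]
labels = [1, 1, 0, 0]

# Convert documents to word-count vectors, then fit MultinomialNB
vec = CountVectorizer()
X_counts = vec.fit_transform(docs)
clf = MultinomialNB().fit(X_counts, labels)

print(clf.predict(vec.transform(["the game was great"])))
```

The words “game” and “great” only occur in the sports documents, so the new sentence is assigned label 1.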
Case Study for Naive Bayes
Here are some examples of using the Naive Bayes algorithms to classify the iris dataset:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
iris = load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points out of a total %d points: %d" % (iris.data.shape[0], (iris.target != y_pred).sum()))
# Using MultinomialNB
import numpy as np
X = np.random.randint(50, size=(1000, 100))
y = np.random.randint(6, size=(1000))
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))
# Using ComplementNB
import numpy as np
X = np.random.randint(50, size=(1000, 100))
y = np.random.randint(6, size=(1000))
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))
# Using BernoulliNB
import numpy as np
X = np.random.randint(50, size=(1000, 100))
y = np.random.randint(6, size=(1000))
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))
K-Nearest Neighbors (KNN)
KNN classifies each query point based on the labels of its k nearest neighbors in the training data, where k is an integer value specified by the user. It is one of the most classic machine learning algorithms. The sklearn interface of KNN is as follows:
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs):
Common Parameters for KNN
n_neighbors: The number of neighbors; the most important parameter of KNN.
algorithm: The algorithm used to compute the nearest neighbors, one of {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}.
Case Study for KNN
Here is an example of using KNN to classify the iris dataset:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(iris.data, iris.target)
print(neigh.predict(iris.data))
print(neigh.predict_proba(iris.data))
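Since n_neighbors is the most important parameter, a common practice is to pick k by cross-validation. A minimal sketch:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Evaluate several candidate values of k with 5-fold cross-validation
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             iris.data, iris.target, cv=5)
    print(k, round(scores.mean(), 3))
```

The k with the highest mean cross-validated accuracy is then used for the final model.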
Support Vector Machine (SVM)
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. Here I will introduce only classification. The advantages of SVMs are: they are effective in high-dimensional spaces, and remain effective even when the number of dimensions is greater than the number of samples, so SVMs can perform well on small data sets.
sklearn Interface for SVM
The most commonly used interface is SVC. The sklearn interface of SVM is as follows:
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None):
Common Parameters for SVM
C: Penalty parameter of the error term.
kernel: The kernel function. Commonly used kernels: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’.
probability: Whether to enable probability estimates for prediction.
Case Study for SVM
Here is an example of using SVM to classify a small toy dataset:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(C=1, kernel='rbf', gamma='auto')
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
Expansion: SVM Binary Classification Problem Solving
SVM has a unique advantage on binary classification problems; however, it does not handle multi-class problems directly. The common solution is the “one-vs-one” strategy, which decomposes a multi-class problem into multiple binary classification problems.
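sklearn’s SVC implements the one-vs-one strategy internally: for n classes it trains n*(n-1)/2 binary classifiers. A minimal sketch on iris, showing the shape of the raw one-vs-one decision function:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# decision_function_shape='ovo' exposes the raw one-vs-one values:
# 3 classes -> 3 * (3 - 1) / 2 = 3 pairwise classifiers
clf = SVC(kernel='rbf', gamma='auto', decision_function_shape='ovo').fit(X, y)
print(clf.decision_function(X[:1]).shape)  # → (1, 3)
```

With the default decision_function_shape='ovr', these pairwise values are aggregated into one score per class instead.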
Decision Tree
The decision tree is one of the top ten classic data mining algorithms, and it can handle multi-class problems naturally. The sklearn interface of the decision tree is as follows:
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False):
Common Parameters for Decision Tree
criterion: The function used to measure the quality of a split. Common values are “gini” (Gini impurity) and “entropy” (information gain).
max_depth: The maximum depth of the tree.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required at a leaf node.
Case Study for Decision Tree
Here is an example of using decision tree to classify the iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
clf.fit(iris.data, iris.target)
clf.predict(iris.data)
clf.predict_proba(iris.data)
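Predicting on the training data, as above, overstates accuracy; cross-validation gives a more honest estimate of how the tree generalizes. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validated accuracy on iris
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
print(round(scores.mean(), 3))
```

Parameters such as max_depth and min_samples_leaf can then be tuned against this cross-validated score to control overfitting.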