Dimensionality Reduction and Principal Component Analysis
Dimensionality reduction aims to reduce the number of features of a high-dimensional dataset in order to overcome the difficulties that arise from the curse of dimensionality.
There are two approaches: feature selection and feature extraction. Feature selection focuses on finding a subset of the original attributes, whereas feature extraction transforms the original high-dimensional space into a lower-dimensional one. Ideally, the structure in the data should be preserved so that enough information is retained. Algorithms can be unsupervised (principal component analysis or PCA, independent component analysis or ICA) or supervised (linear discriminant analysis or LDA). In feature extraction, transformations can be linear (PCA, LDA) or non-linear (t-SNE, autoencoders).
There are plenty of applications, such as visualizing hidden patterns (by removing highly correlated attributes), noise reduction (removing irrelevant features), further exploration, data compression and storage, etc. In fact, dimensionality reduction is usually applied as a preprocessing step for other machine learning and data ...
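To make feature extraction concrete, here is a minimal PCA sketch using NumPy's SVD. The random data and the choice of two components are made up for the example; this is not a full PCA implementation (no explained-variance ratios, no inverse transform):

```python
import numpy as np

def pca(X, n_components):
    # Center the data: principal directions are defined for mean-zero features.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions,
    # ordered by how much variance they explain.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Project onto the top n_components directions.
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, 5 features
X_reduced = pca(X, n_components=2)   # reduced to 2 features
print(X_reduced.shape)               # (100, 2)
```

In practice you would use a library implementation such as sklearn's PCA, but the idea is exactly this: center, find the directions of maximum variance, project.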
Dropout is a widely used regularization technique for neural networks. Neural networks, especially deep neural networks, are flexible machine learning algorithms and hence prone to overfitting. In this tutorial, we'll explain what dropout is and how it works, including a sample TensorFlow implementation.
If you [have] a deep neural net and it's not overfitting, you should probably be using a bigger one and using dropout, ... - Geoffrey Hinton 
Dropout is a regularization technique where, during each iteration of gradient descent, we drop a set of neurons selected at random. By drop, we mean that we essentially act as if they do not exist.
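Before getting to the TensorFlow version, the mechanism can be sketched in a few lines of NumPy. This is the "inverted dropout" variant (the one most frameworks implement); the drop probability and input shape are arbitrary choices for the example:

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    # Zero each neuron independently with probability drop_prob, then scale
    # the survivors by 1/(1 - drop_prob) so the expected activation is
    # unchanged and no rescaling is needed at test time.
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

rng = np.random.default_rng(42)
layer_output = np.ones((3, 8))   # a batch of 3 examples, 8 neurons each
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
```

Each surviving activation of 1.0 becomes 2.0 here, and roughly half are zeroed out; at test time the function is simply skipped.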
Logistic regression is a variant of linear regression where the dependent or output variable is categorical. Don't be confused by the name: logistic regression is a classification algorithm. In particular, we can use it for binary classification. For example, we might want to predict whether or not a person has diabetes, or whether or not an email is spam.
The logistic (or sigmoid) function
The term logistic in logistic regression comes from the logistic function (also known as the sigmoid function), which can be written as:

sigmoid(z) = 1 / (1 + e^(-z))

It maps any real number to a value between 0 and 1, which we can interpret as a probability.
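In code, the logistic function is a one-liner:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

Large positive inputs saturate near 1 and large negative inputs near 0, which is exactly the behavior we want for a probability of class membership.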
Overfitting is one of the most important problems (and concepts) in machine learning.
Generalization, Overfitting and Under-fitting
It's not a good idea to test a machine learning model on a dataset which we used to train it, since it won't give any indication of how well our model performs on unseen data. The ability to perform well on unseen data is called generalization, and is the desirable characteristic we want in a model.
When a model performs well on training data (the data on which the algorithm was trained) but does not perform well on test data (new or unseen data), we say that it has overfit the training data or that the model is overfitting. This happens because the model learns the noise present in the training data as if it was a reliable pattern.
Conversely, when a model does not perform well even on the training data, we say that it has underfit the data or that the model is underfitting.
Types of Machine Learning problems: Supervised, Unsupervised and Reinforcement Learning
Currently, most machine learning products use supervised learning. In supervised learning, we have a set of features or inputs X (for example, an image) and our model predicts a target or output variable y (for example, a caption for the image).
y = f(X)
In other words, our model learns a function that maps inputs to desired outputs. Features are independent variables and targets are the dependent variable.
Classification and Regression
Supervised learning problems can be further grouped into classification and regression problems.
This tutorial describes the important components of a learning algorithm: representation (what the model looks like), evaluation (how do we differentiate good models from bad ones), and optimization (what is our process for finding the good models among all the possible models).
Linear Regression is a simple machine learning model for regression problems, i.e., when the target variable is a real value.
Let's start with an example: suppose we have a dataset with information about the area of a house (in square feet) and its price (in thousands of dollars), and our task is to build a machine learning model which can predict the price given the area. Here is what our dataset looks like:
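The actual dataset values are not shown above, so as a sketch with made-up numbers, fitting a line price ≈ w * area + b by least squares looks like this:

```python
import numpy as np

# Hypothetical data: area in square feet, price in thousands of dollars.
area = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([200.0, 290.0, 410.0, 500.0, 590.0])

# Fit price ~ w * area + b by ordinary least squares.
A = np.column_stack([area, np.ones_like(area)])
(w, b), *_ = np.linalg.lstsq(A, price, rcond=None)

# Predict the price of an 1800 sq ft house.
predicted = w * 1800.0 + b
print(round(predicted, 1))
```

The learned weight w is the price increase per additional square foot, and b is the intercept; later sections show how such parameters can be found with gradient descent instead of a closed-form solve.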
Gradient descent is one of the most popular and widely used optimization algorithms. Given a machine learning model with parameters (weights and biases) and a cost function to evaluate how good a particular model is, our learning problem reduces to finding a set of weights for our model which minimizes the cost function.
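The core update rule is easy to state in code. Here is a minimal sketch on a toy one-parameter cost f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the learning rate and step count are arbitrary choices for the example:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    # Repeatedly step in the direction opposite the gradient,
    # which locally decreases the cost.
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize f(w) = (w - 3)^2; the minimum is at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=0.0)
print(w_star)  # close to 3
```

In a real model, w is a vector of all weights and biases and grad is computed over (a batch of) the training data, but the loop is the same.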
There is no explicit training phase in KNN! In other words, for classifying new data points, we'll directly use our dataset (in some sense, the dataset is the model).
To classify a new data point, we find the k points in the training data closest to it, and make a prediction based on whichever class is most common among these k points (i.e., we simulate a vote). Here, closest is defined by a suitable distance metric, such as Euclidean distance. Other distance metrics are discussed below.
For example, if we want to classify the blue point shown in the following figure, we consider the k nearest data points and assign the class that has the majority.
If k = 3, we get two data points with the green class and one data point with the red class. Hence, we'll predict the green class for the new point.
Here's another example: let us change the position of the new point (the blue point) as shown below.
If we take k = 5, we get four neighbors with the red class and one neighbor with the green class. Hence, the new point will be classified as red.
KNN as regression algorithm
In the case of regression (when the target variable is a real value), we take the average of the target values of the k nearest neighbors.
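This averaging step is short enough to write out directly. Here is a NumPy sketch of KNN regression on made-up one-dimensional data:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # Euclidean distance from the new point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # The prediction is the average target value of those neighbors.
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
pred = knn_regress(X_train, y_train, np.array([2.1]), k=3)
print(pred)  # 2.0, the mean of the targets 1, 2 and 3
```

sklearn provides the same behavior via KNeighborsRegressor.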
Tuning the hyperparameter k
A small value of k means that noise will have a higher influence on the result, while a large value makes the algorithm computationally expensive. Usually, we perform cross-validation to find the best value of k (or to choose the value of k that best suits our accuracy/speed trade-off). If you don't want to try multiple values of k, a rule of thumb is to set k equal to the square root of the total number of data points. For more on choosing the best value of k, refer to this Stack Overflow thread.
There are various options available for the distance metric, such as Euclidean or Manhattan distance. The most commonly used metric is Euclidean distance.
Minkowski distance is a generalization of the Euclidean and Manhattan distances: p = 2 gives Euclidean distance, and p = 1 gives Manhattan distance.
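A quick sketch makes the relationship concrete (the two example points are arbitrary):

```python
import numpy as np

def minkowski(a, b, p):
    # (sum over dimensions of |a_i - b_i|^p) ^ (1/p)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # 7.0  (Manhattan: 3 + 4)
print(minkowski(a, b, p=2))  # 5.0  (Euclidean: sqrt(9 + 16))
```

This is why sklearn's KNeighborsClassifier with metric="minkowski" and p=2 is just KNN with Euclidean distance.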
Note that you'll want to do some pre-processing on the input data (for example, make sure each dimension has 0 mean and unit variance) so that the distance metrics above are meaningful.
# load the dataset
from sklearn.datasets import load_iris
dataset = load_iris()
X = dataset.data
y = dataset.target
# split the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# standardize the data so each feature contributes equally to the distance;
# fit the scaler on the training set only to avoid leaking test-set statistics
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# fit a k-nearest-neighbors model (Minkowski with p=2 is Euclidean distance)
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
model.fit(X_train, y_train)
# evaluate accuracy on the held-out test set
print(model.score(X_test, y_test))
In a parametric model, we learn a fixed number of parameters, updated during training, to obtain a function that can classify new data points without requiring the training data (for example, logistic regression). In a non-parametric model, the effective number of parameters grows with the size of the training data. This is what happens in KNN.
Practical Machine Learning With Python [Part 3b]: K-means Clustering
K-means is one of the simplest unsupervised learning algorithms, used for clustering problems. Clustering is the process of finding groups of similar objects; in other words, our goal is to group objects based on the similarity of their features. K-means clustering is easy to understand, easy to implement, and computationally efficient. Now, let us see how it works.
The basic idea behind K-means is that we define k centroids, one for each cluster. Here, k is a hyperparameter, and we should choose it carefully; usually, you should try a range of values to determine the best value of k. Where do we place the centroids initially? A common choice is to place them as far apart as possible. Next, we assign each data point to the nearest centroid. Once each data point has been assigned to one of the centroids, our next step is to recalculate the k centroids. How do we do that? We move each (old) centroid to the center of the data samples that were assigned to it. And how do we find the center? We take the mean of the data points in that cluster. The assignment and update steps are then repeated until the centroids stop moving.
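The assignment and update steps described above can be sketched in NumPy. The two synthetic blobs, the fixed number of iterations, and the simple "pick k random points" initialization are all simplifications made for the example:

```python
import numpy as np

def kmeans(X, k, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(steps):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs, so two clusters are easy to recover.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])
centroids, labels = kmeans(X, k=2)
```

A production implementation (for example sklearn's KMeans) additionally runs multiple restarts and uses a smarter initialization such as k-means++, but the loop body is this pair of steps.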
K-means clustering aims to find positions μi, i = 1, 2, ..., k, of the cluster centers that minimize the distance from each data point to its nearest cluster center.