Commonlounge


Marta Enesco

Data Scientist, Graduate Research Assistant at University of Potsdam

Active In

Machine Learning

1 discussion. Member

TED Talks

Member

Featured Contributions

Contributed 76%

tutorial

Conditional Random Fields

Conditional Random Fields (CRFs) are **probabilistic graphical models for sequential or structured data**. They let us perform classification while taking into account the context provided by the sequence. This is known as *structured prediction*, where the segments are assumed to be related to each other.

By doing so, valuable contextual information, which would be lost in individual classifications, can be given to the model. For example, words in a sentence are grammatically connected: a noun is more likely than a verb to follow an adjective. This hint can be used to label the noun *books* in the sentence “The woman carefully carried the two red __books__”.

CRFs are mostly used in NLP tasks such as part-of-speech tagging or sequence labeling (for extracting specific words). They also have applications in computer vision: from image segmentation (lab...

Read more…(1505 words)

Contributed 48%

tutorial

Feature Engineering: Techniques, examples and case studies

Let us say you have a machine learning problem that you'd like to solve. You try a few machine learning algorithms, and they give okay results, say in the range 55%-65% accuracy. You'd like to do better.

In practice, the most important thing to do is what is known as **feature engineering**. Feature engineering is the process of finding the best set of features to give as input to the machine learning model.

Read more…(1449 words)

Contributed 27%

tutorial

Theoretical guarantees are not what they seem

Machine learning has a number of popular theorems. However, one should keep in mind that theorems are stated in a mathematically strict sense. Extrapolating from these theorems to make decisions in practice is prone to errors, and one needs to be very careful. In this article, we will discuss some examples of how theoretical guarantees might not be what they seem, and how we can avoid reaching incorrect conclusions.

This article does not intend to say *theory is bad*. Its main point is that there is often a wide gap between theory and practice. When practitioners with less experience in theory come across widely applicable theorems, they often extrapolate from them in undesirable ways.

**Example A**: The **universality theorem** asserts that __neural networks can compute any computable function__. Moreover, even a NN ...

Read more…(1022 words)

Contributed 95%

tutorial

Anomaly Detection

**Anomaly detection** refers to the technique of identifying **unusual patterns** and finding **outliers** in a set of observations. Outliers are data points that differ considerably from the rest of the dataset. Usually, they are extreme values that diverge from the normal or expected behavior.

Historically, statistics was applied to find and remove outliers, for example from the tails of a Gaussian distribution. The idea was that outliers which result from errors (noise, human error, etc.) may give rise to misleading interpretations. In addition, filtering them out can improve the accuracy of modern supervised learning algorithms. On the other hand, anomalies are nowadays also an object of interest in their own right, as is the case with *“rare events”* in physics, medicine, business or cybersecurity.
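
The classical idea of flagging values in the tails of a Gaussian can be sketched with a z-score rule. This is only an illustrative sketch, not the tutorial's own method: the function name `zscore_outliers`, the sample readings and the threshold of 2.5 standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the sample mean (hypothetical helper)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up sensor readings with one obviously extreme value.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.7, 10.3, 25.0]
outliers = zscore_outliers(readings, threshold=2.5)
print(outliers)
```

Note that the extreme value itself inflates the mean and the standard deviation, so in small samples a strict threshold such as 3 can miss it. Robust alternatives (for example, rules based on the median) are often preferred for that reason.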

Datasets vary in their nature, but the most typical ones are time series and spatial data. There are three main types of outliers:

**Points** are single occurrences that are anomalous with respect to the complete dataset. For example, the stock v...

Read more…(1019 words)

Contributed 66%

tutorial

Dimensionality Reduction and Principal Component Analysis

**Dimensionality reduction** aims to reduce the number of features of a high-dimensional dataset in order to overcome the difficulties that arise due to the curse of dimensionality.

There are two approaches: **feature selection** and **feature extraction**. Feature selection focuses on finding a subset of the original attributes, whereas feature extraction transforms the original high-dimensional space into a lower-dimensional one. Ideally, some structure in the data should remain in order to preserve enough information. Algorithms can be unsupervised (principal component analysis or PCA, independent component analysis or ICA) or supervised (linear discriminant analysis or LDA). In feature extraction, transformations can be linear (PCA, LDA) or non-linear (t-SNE, autoencoders).

There are plenty of applications, such as visualization of hidden patterns (by removing highly correlated attributes), noise reduction (removing irrelevant features), further exploration, data compression and storage, etc. In fact, dimensionality reduction is usually applied as a **preprocessing step** for other machine learning and data ...

Read more…(951 words)

Contributed 94%

tutorial

Bayesian Machine Learning

The idea behind the Bayesian approach is to incorporate into machine learning algorithms some **prior beliefs about the model θ** by applying *Bayes' rule*. It is highly useful when data is scarce or difficult to obtain, which is often the case in practice. In Bayesian analysis, data D is not assumed to be right, but is allowed to become “less wrong with size”. The process consists of recursively updating our initial belief or knowledge (*prior*) as more evidence is obtained (*data*). The goal can be either to find the most probable model **θ*** (*Bayesian inference*) or to directly compute optimal predictions y* (*Bayesian prediction*).

Let's say we have an empirical dataset D = {(x_{1}, y_{1}), ..., (x_{n}, y_{n})} and a model θ. Then, by Bayes' theorem, we have

P(\theta | D) = \dfrac{P(D|\theta) \times P(\theta)}{P(D)}
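
As a rough illustration of the formula, the sketch below applies Bayes' theorem over a small discrete set of candidate models. Everything concrete here is an assumption made for the example: θ is taken to be the bias of a coin, only three candidate values are considered, the prior is uniform, and the data D is a made-up count of 7 heads and 3 tails.

```python
def posterior(prior, likelihood):
    """Compute P(theta | D) for each candidate theta.

    prior:      dict mapping theta -> P(theta)
    likelihood: dict mapping theta -> P(D | theta)
    """
    unnormalized = {theta: likelihood[theta] * p for theta, p in prior.items()}
    # P(D) is the sum of P(D | theta) * P(theta) over all candidates.
    evidence = sum(unnormalized.values())
    return {theta: u / evidence for theta, u in unnormalized.items()}

# Hypothetical setup: three candidate coin biases with a uniform prior.
prior = {0.25: 1 / 3, 0.5: 1 / 3, 0.75: 1 / 3}

# Made-up data D: 7 heads and 3 tails, so P(D | theta) = theta^7 * (1 - theta)^3
# (up to a constant factor that cancels out in the normalization).
heads, tails = 7, 3
likelihood = {theta: theta**heads * (1 - theta) ** tails for theta in prior}

post = posterior(prior, likelihood)
theta_map = max(post, key=post.get)  # most probable model given the data
print(post, theta_map)
```

Feeding `post` back in as the prior for the next batch of observations gives the recursive belief update described above.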

Read more…(1142 words)

Contributed 91%

tutorial

Ensemble Methods (Part 3): Meta-learning, Stacking and Mixture of Experts

Ensemble methods were introduced in a previous tutorial. In this tutorial we will explore two more ensemble learning algorithms, namely **stacking** and **mixture of experts**. Both methods can be viewed as examples of **meta learning**, in which machine learning models are trained on the predictions output by other machine learning models.

Let us continue with the scenario where *m* models are trained on a dataset of *n* samples. **Stacking** (or stacked generalization) builds the models in the ensemble using different learning algorithms (e.g. one neural network, one decision tree, ...), as opposed to **bagging** or **boosting**, which train multiple instances of the same learner (e.g. all decision trees).

The outputs of the models are combined to compute the ultimate prediction of any instance *x*:

\hat{y}(x) = \sum_{j=1}^m \beta_j h_j(x)
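
A minimal sketch of this weighted combination is shown below. The three base models and the weights β_j are hypothetical and fixed by hand purely for illustration; in an actual stacking setup the weights would be learned, and the base models would be trained learners rather than simple functions.

```python
def stacked_prediction(models, betas, x):
    """Compute y_hat(x) = sum_j beta_j * h_j(x) for base models h_j."""
    if len(models) != len(betas):
        raise ValueError("need exactly one weight per model")
    return sum(beta * h(x) for beta, h in zip(betas, models))

# Hypothetical base models h_1, h_2, h_3 standing in for trained learners.
models = [
    lambda x: 2.0 * x,   # h_1
    lambda x: x + 1.0,   # h_2
    lambda x: 3.0,       # h_3 (constant predictor)
]
betas = [0.5, 0.3, 0.2]  # assumed weights, not learned here

y_hat = stacked_prediction(models, betas, 2.0)
print(y_hat)
```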

Read more…(496 words)

Contributed 18%

tutorial

Improving Results After Implementing a Working Model

In this article, we'll discuss the importance of collecting more data and of using more complex machine learning models for prediction in different situations. Which helps more (data or algorithm) depends on the specific problem at hand. In the discussion, we'll also highlight caveats that might affect the desirability of pursuing each of these directions.

Collecting more data is one of the most promising ways of improving performance. In fact, you might have come across quotes like "*We don’t have better algorithms. We just have more data.*" or the article "*The Unreasonable Effectiveness of Data*", co-authored by Peter Norvig (Director of Research, Google).

One of the most effective ways of improving algorith...

Read more…(1121 words)

Contributed 80%

tutorial

K-Means Clustering

**K-means clustering** is an algorithm for clustering data. It is simple to understand and implement, which makes it one of the most popular methods and often the first algorithm to be tried when performing clustering.

The idea behind **clustering** is to segregate the data into groups called *clusters*, so that instances with similar behavior are assigned to the same cluster. It is used in data mining, pattern recognition and anomaly detection. Clustering is one of the most popular **unsupervised learning** methods. Unsupervised learning is a set of techniques for identifying patterns and underlying characteristics in data without labeled examples.

K-means clustering partitions the dataset into *k* clu...

Read more…(848 words)

Contributed 71%

tutorial

Hidden Markov Models

Let us imagine a dynamic system whose future states only depend on the current one. Such a stochastic process is called a **Markov process**, and it is said to satisfy the **Markov property (“memoryless”)**:

p(h_{t+1} |h_1,..., h_t) = p(h_{t+1}|h_t)
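
To make the Markov property concrete, here is a small simulation sketch in which the next state is sampled using only the current state. The two weather states and their transition probabilities are invented for the example, and `simulate_markov_chain` is a hypothetical helper name.

```python
import random

def simulate_markov_chain(transitions, start, steps, rng):
    """Sample a path of `steps` transitions starting from `start`.

    transitions maps each state to a dict of next_state -> probability.
    Only the current state is consulted, which is the Markov property.
    """
    state = start
    path = [state]
    for _ in range(steps):
        next_states = list(transitions[state].keys())
        weights = list(transitions[state].values())
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path

# Made-up transition probabilities p(h_{t+1} | h_t) for two states.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

path = simulate_markov_chain(transitions, "sunny", 10, random.Random(0))
print(path)
```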

Now, let us assume that the real state is not directly available, and that we only have an observation of it that works as an indicator. This is a realistic scenario: for example, a system with noise-corrupted measurements or a process that cannot be completely measured. There is uncertainty about the real state of the world, which is therefore referred to as **hidden**. A **Hidden Markov Model (HMM)** serves as a probabilistic model of such a system.

Let H be the *latent, hidden* variable that evolves over time. Let O be the random variable over its *observations*, also known as the *output sequence*. Graphically, the system at time steps {1, …,...

Read more…(692 words)
