End-to-End Example: Using Logistic Regression for predicting Diabetes
In this tutorial, we will see how to use data science to predict whether a person has diabetes, based on information such as blood pressure, body mass index (BMI), and age.
The data was collected and made available by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.
We will be using Python as our programming language, making use of some popular Python data science packages. First of all, we will import pandas to read our data from a CSV file and manipulate it for further use. We will also use numpy to convert our data into a format suitable for feeding our classification model. We'll ...
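As a minimal sketch of this setup: in practice the data would be loaded with `pd.read_csv` from the dataset's CSV file, but here we build a tiny sample frame inline (the column names and values are illustrative assumptions, not the real dataset) and convert it to numpy arrays for a classifier.

```python
import numpy as np
import pandas as pd

# In practice: df = pd.read_csv("diabetes.csv")
# Here, a tiny hand-made sample with a few Pima-style columns
# (names and values are assumptions for illustration only).
df = pd.DataFrame({
    "BloodPressure": [72, 66, 64, 70],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Age": [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],  # 1 = diabetic, 0 = not diabetic
})

# Separate the features (X) from the label we want to predict (y),
# converting both to NumPy arrays to feed a classification model.
X = df.drop(columns="Outcome").to_numpy()
y = df["Outcome"].to_numpy()

print(X.shape)  # (4, 3)
print(list(y))  # [1, 0, 1, 0]
```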
MapReduce is a programming framework that allows us to perform parallel processing on large data sets in a distributed environment. It is a data processing paradigm for condensing large volumes of data into useful aggregated results.
As the name suggests, Map and Reduce functions are performed on the data. The Map function processes a block of data and generates a set of intermediate key/value pairs. The Reduce function merges all the intermediate values associated with the same intermediate key into a single value.
The two operations are each designed so that they can be implemented efficiently in distributed systems. Moreover, together the two types of operations provide a great deal of flexibility and expressiveness, so that a variety of real-world tasks can be implemented within the MapReduce paradigm.
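The classic word-count example can be sketched in a single process to show the three phases (map, shuffle/group, reduce); in a real system the map calls would run in parallel on different workers.

```python
from collections import defaultdict

def map_fn(block):
    """Map: emit an intermediate (word, 1) pair for every word in a block."""
    return [(word, 1) for word in block.split()]

def reduce_fn(key, values):
    """Reduce: merge all intermediate counts for one word into a total."""
    return (key, sum(values))

# Two "blocks" of input data, as they might be split across workers.
blocks = ["the quick brown fox", "the lazy dog and the fox"]

# Map phase: apply the map function to every block.
intermediate = [pair for block in blocks for pair in map_fn(block)]

# Shuffle phase: group intermediate values by their key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase: one reduce call per distinct key.
counts = dict(reduce_fn(k, v) for k, v in groups.items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```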
Trillions of gigabytes of data are produced yearly, and the number is still growing exponentially. It is estimated that by 2020, 1.7 megabytes of data will be produced every second for every person, and digital data accumulation will reach about 44 zettabytes, or 44 trillion gigabytes. This explosion of data is also shown in the graph below.
Data is only a raw material, and extracting information from it requires further work. Our society is increasingly becoming data dependent, and data science is the field that helps us make sense of this huge quantity of data.
Probability: Conditional and Marginal Probabilities, Bayes' Rule and Random Variables
Introduction: What is Probability?
Probability is used to mathematically describe the chance of occurrence of an event. It quantifies randomness and uncertainty. For example, probability tells us the chance of it raining on a particular day, or of someone winning a lottery. The probability that an event occurs is always between 0 and 1, where 1 represents absolute certainty and 0 represents impossibility.
Probability can be determined in two basic ways: theoretically and empirically. The theoretical (also called classical) method is typically used to determine probabilities in games of chance such as lotteries, roulette wheels, and coin flips, while the empirical (also called observational) method is used to determine probabilities of events whose outcomes can't be predetermined.
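The two methods can be contrasted with a simple die roll: the theoretical probability of rolling a six follows from counting equally likely outcomes, while the empirical estimate comes from observing many trials. A small simulation (the number of trials here is an arbitrary choice) shows the two agreeing:

```python
import random

# Theoretical (classical) probability of rolling a 6 with a fair die:
# 1 favourable outcome out of 6 equally likely outcomes.
theoretical = 1 / 6

# Empirical (observational) probability: estimate the same chance
# by observing the relative frequency over many simulated rolls.
random.seed(42)
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 6)
empirical = hits / trials

# With enough trials, the empirical estimate approaches the
# theoretical value (the law of large numbers).
print(round(theoretical, 4))  # 0.1667
```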
To describe probability mathematically, we need a few basic elements, which are discussed below:
Statistics: Central Tendency metrics, Dispersion and Correlation
Statistics is a very broad branch of mathematics that deals with everything related to data, from collection and organization of data to its analysis, interpretation and presentation. With the ever increasing amount of data, statistics has become an indispensable tool in every field where one has to work with data.
When the amount of data we are dealing with is fairly small, it might be possible to talk about all the data items individually. However, when we are dealing with large quantities of data, which is almost always the case in real-world situations, we need some characteristic values that can represent the data.
In this tutorial, we'll introduce such measures first for a single variable. For example, say the weight of students in a particular school. These measures will include measures of central tendency and measures of dispersion. Then, we'll look at measures for und...
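For the student-weight example, the standard library's `statistics` module already provides these measures; a small sketch (the weights are made-up sample values) computes central tendency and dispersion side by side:

```python
import statistics

# Hypothetical weights (in kg) of a few students in a school.
weights = [52, 55, 55, 60, 63, 70, 95]

# Measures of central tendency: one value that represents the data.
mean = statistics.mean(weights)      # arithmetic average
median = statistics.median(weights)  # middle value, robust to the outlier 95
mode = statistics.mode(weights)      # most frequent value

# Measures of dispersion: how spread out the data is around the centre.
spread = statistics.stdev(weights)        # sample standard deviation
value_range = max(weights) - min(weights)

print(median, mode, value_range)  # 60 55 43
```

Note how the outlier 95 pulls the mean above the median; this gap between the two is itself a common first check on a variable's distribution.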
Linear Algebra: Vectors, Matrices and their properties
Large datasets often comprise hundreds to millions of individual data items. It is easier to work with this data and operate on it when it is represented in the form of vectors and matrices. Broadly speaking, linear algebra is the branch of mathematics that deals with vectors and operations on them. Linear algebra is also extremely important in various machine learning and data processing algorithms.
Vectors, Vector Addition and Multiplication
Vectors can simply be thought of as arrays of numbers in which the order of the numbers also matters. They are typically represented by a lowercase bold letter such as x. The individual members are denoted by writing the vector name with a subscript indicating the member's position, like x1 for the first member, x2...
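Treating a vector as an ordered array makes the basic operations easy to sketch in plain Python: addition combines corresponding members, and scalar multiplication scales every member by the same number.

```python
# A vector as a plain list of numbers; the order of the members matters.
x = [1, 2, 3]
y = [4, 5, 6]

# Vector addition is element-wise: the i-th member of the sum is x_i + y_i.
def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Scalar multiplication scales every member by the same constant c.
def scale(c, u):
    return [c * a for a in u]

print(add(x, y))    # [5, 7, 9]
print(scale(2, x))  # [2, 4, 6]
```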
A database is a collection of logically interrelated data, together with a description of this data, designed to meet the information needs of an organization. The primary objective of creating a database is to store, update and retrieve data efficiently and use it for analysis. The data stored can also include transient data such as input documents, reports and intermediate results obtained during processing. To achieve this, a database must have several characteristics, some of which are:
It should have a structure which deals with data types and data behavior.
It should have a proper retrieval method, which could be a declarative query language (like SQL, which we'll talk about in this article) or a procedural programming language (like Java or Python).
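The declarative style can be sketched with Python's built-in `sqlite3` module and an in-memory database (the table and rows here are made up for illustration): the query states *what* data we want, and the database engine decides *how* to fetch it.

```python
import sqlite3

# An in-memory SQLite database; no server or file needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("Alice", 34), ("Bob", 51), ("Carol", 29)],
)

# A declarative SQL query: describe the result, not the retrieval steps.
rows = conn.execute(
    "SELECT name FROM patients WHERE age > 30 ORDER BY name"
).fetchall()
print(rows)  # [('Alice',), ('Bob',)]
conn.close()
```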
Network Analysis, metrics for Centrality and The PageRank Algorithm
Introduction to Network Analysis
Networks are everywhere. Be it the network of friends in a university, online social networks such as Facebook and Twitter, transportation networks, or financial networks, networks are ubiquitous in human society. A network is a collection of objects (nodes) with relationships or interconnections (edges) between them.
Since networks are so universal, analyzing them gives us valuable information about different complex phenomena. For example, by analyzing a network of people in an organization, we can find out the most influential people in that organization. Or, by analyzing a network of different airports or train stations, we can find out which places have the maximum risk of spreading a particular virus. Network analysis provides us with tools that help us answer these types of questions.
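As a first taste of the "most influential node" question, a toy network can be stored as an adjacency list and ranked by degree (the number of edges at each node); this is the simplest centrality metric, a precursor to the more refined measures like PageRank discussed later. The network here is a made-up example.

```python
# A tiny undirected network as an adjacency list: node -> set of neighbours.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

# Degree centrality: count each node's edges. The node with the most
# connections is a simple first proxy for the most "influential" one.
degree = {node: len(neighbours) for node, neighbours in network.items()}
most_influential = max(degree, key=degree.get)

print(most_influential)  # A
print(degree["A"])       # 3
```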
Graphs and types of graphs
The most common and easy method to represent network...