# An Overview of Data Science

February 20, 2018

Welcome to your first Data Science Tutorial on CommonLounge. This tutorial will be quite different from the rest of the tutorials in this course, and will give you a broad overview of the field of data science.

# What is Data Science? Why Data Science?

Trillions of gigabytes of data are produced every year, and the volume is still growing exponentially. It is estimated that by 2020, 1.7 megabytes of data will be produced every second for every person, and that accumulated digital data will reach about 44 zettabytes, or 44 trillion gigabytes. The graph below shows this explosion of data.

Growth of Data. Source: Patrick Cheesman

Data is only a raw material; extracting information from it requires further work. Our society is becoming increasingly data-dependent, and data science is the field that helps us make sense of this huge quantity of data.

Data Science is an interdisciplinary field, and makes use of methods and technologies from different fields such as computer science, mathematics, statistics and machine learning.

Data Science covers the collection, preparation, analysis, visualization, management and preservation of data. This data is often available in very large quantities and comes in a variety of types.

# Examples of Data Science around us

Data science is widely used by companies and other organizations to get insights about their customers, staff, products and processes.

For example,

• Uber uses data science to calculate how much to charge for a particular ride, to decide which riders to give discounts to, and to test which loyalty programs work best for its drivers.
• Airbnb (an online marketplace for lodging and short-term rentals) uses data science to help people estimate the prices at which they should rent out their homes.

For any data-centric organization, the data is the voice of the customer and data science is the interpretation of that voice.

Besides the commercial sector, government and non-government organizations also depend heavily on data science. By using data science, a government can detect fraud and criminal activity, optimize investment and funding, and much more.

Similarly, NGOs use data science to strengthen their cause by providing reliable evidence. For example, the World Wildlife Fund (WWF) increases the effectiveness of its fundraising by using data science to present information about different wild animals and birds.

In addition to these institutions, several other organizations are using data science for a multitude of tasks and its use will only increase with time.

# Opportunities in Data Science

The exponential growth of data has also led to a rapid increase in the number of data science jobs available. An analysis by LinkedIn, based on its huge database of professional profiles, shows the growth in the number of Data Analyst and Data Science roles (see figure below).

Growth of Data Science and Data Analysis jobs

# Key Components of Data Science

## Programming (Python, R)

As mentioned before, data science deals with large amounts of data. In data science, this data is managed and analyzed using computer programming. Other non-programming ways to analyze the data are studied in fields such as Data Analytics / Business Analytics.

In the data science community, the following two programming languages are most popular:

• Python: A large number of third-party packages such as NumPy, SciPy, scikit-learn and Matplotlib make data science projects easier to implement, and have led to Python's immense popularity. In addition, IDEs and editors like PyCharm, Vim and Emacs, and interactive Python environments like IPython and Jupyter, make Python convenient to work with.
• R: R is a programming language designed and created by statisticians, for statistics, and it supports a wide variety of statistical and graphical techniques. Like Python, R has packages for data wrangling, data visualization and machine learning. It is an open source language with an active community of statisticians and programmers who constantly enrich it by adding new libraries for new statistical methods.

## Data (and its various types)

Data science uses programming to analyze data, and this data can be of various types. Some important categories of data are discussed below:

Structured Data: Data that is easy to represent in tabular form, and to store and manipulate in databases and Excel files. Structured data has a clearly defined data model. For example, Airbnb has a database of places available for rent, which consists of variables like size of home (in square feet), number of guests it can accommodate, number of beds, number of bathrooms, per day cost of renting the home, and so on.
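To make the idea of a data model concrete, here is a minimal sketch that loads a few rental listings from CSV text using Python's standard library. The column names and values are made up for illustration and are not Airbnb's actual schema.

```python
import csv
import io

# Hypothetical listings: column names and values are invented for illustration.
RAW_CSV = """size_sqft,max_guests,beds,bathrooms,price_per_day
850,4,2,1,120
400,2,1,1,75
1500,6,3,2,210
"""

def load_listings(text):
    """Parse CSV text into a list of dicts with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{column: int(value) for column, value in row.items()} for row in reader]

listings = load_listings(RAW_CSV)

# Every row follows the same data model, so filtering and aggregating are simple.
family_sized = [row for row in listings if row["max_guests"] >= 4]
average_price = sum(row["price_per_day"] for row in listings) / len(listings)

print(len(family_sized), "listings fit four or more guests")
print("Average price per day:", average_price)
```

Because each row has the same named fields, questions like "which homes fit four guests?" become one-line filters. This is what makes structured data easy to work with.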

Unstructured Data: Data which doesn’t fit into a data model easily is called unstructured data. Examples of unstructured data include emails, PDF files, images, videos, etc.

Natural language: Data that is directly written in languages we use to communicate with each other such as English, Chinese, French, etc. Natural language data is a sub-type of unstructured data.

Image, Video, Audio: Images, videos and audio recordings are generated in large volumes by sensors like cameras and microphones. They are unstructured in nature, and extracting information from them can be quite a challenge.

Graphs: A graph is a mathematical structure consisting of nodes and edges, which models pairwise relations between entities. For example, information about Facebook friends can be represented as a graph, where people are nodes, and an edge between two nodes denotes that those two people are friends.
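As a rough sketch of the friendship example, a graph can be stored as a dictionary that maps each person to the set of their friends. The names and the helper functions `build_graph` and `mutual_friends` are hypothetical, chosen only for this illustration.

```python
# Hypothetical friendship data: each pair is one edge in the graph.
friendships = [("Ana", "Ben"), ("Ana", "Chen"), ("Ben", "Chen"), ("Chen", "Dee")]

def build_graph(edges):
    """Build an undirected graph as a dict mapping each node to its set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def mutual_friends(graph, person_a, person_b):
    """Return the friends that two people have in common."""
    return graph.get(person_a, set()) & graph.get(person_b, set())

graph = build_graph(friendships)

print(sorted(graph["Chen"]))
print(sorted(mutual_friends(graph, "Ana", "Dee")))
```

Since friendship goes both ways, each edge is added in both directions. A one-way relation such as "follows" would only add it once.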

Machine Generated: Machine generated data is any information created by a computer, application or other machine without direct human involvement.

## Statistics and Probability

Statistics: Statistics is a branch of mathematics that deals with collection, organization, analysis and interpretation of data. Statistical methods and techniques are implemented via programming to analyze data. Some commonly used concepts include mean, mode, median, standard deviation, hypothesis testing, skewness, etc.
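Several of these concepts are available in Python's built-in `statistics` module. The sketch below uses a small made-up sample of nightly rental prices, with one unusually expensive listing included on purpose.

```python
import statistics

# Made-up sample: nightly rental prices in dollars, with one outlier (400).
prices = [80, 95, 95, 110, 120, 150, 400]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)
mode_price = statistics.mode(prices)
spread = statistics.stdev(prices)  # sample standard deviation

print("mean:", mean_price)
print("median:", median_price)
print("mode:", mode_price)
print("standard deviation:", round(spread, 1))
```

Notice that the single outlier pulls the mean well above the median. This is one reason why looking at only one summary number can be misleading.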

Probability: Probability is used to mathematically describe the likelihood of an event occurring. It quantifies randomness and uncertainty. For example, probability tells us the chance of it raining on a particular day, or of someone winning a lottery. The probability that an event occurs is always between 0 and 1, where 1 represents absolute certainty and 0 represents complete impossibility. Some commonly used concepts include random variables, different probability distributions, conditional probability, Bayes' theorem, z-tests, etc.
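To see conditional probability and Bayes' theorem in action, here is a small worked sketch. The numbers are invented: we assume it rains on 1 day in 5, that 9 out of 10 rainy days start cloudy, and that 3 out of 10 dry days start cloudy. The function name `bayes_posterior` is our own.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) using Bayes' theorem."""
    evidence = prior * likelihood + (1 - prior) * likelihood_given_not
    return prior * likelihood / evidence

# Made-up probabilities for illustration.
p_rain = Fraction(1, 5)
p_cloudy_given_rain = Fraction(9, 10)
p_cloudy_given_dry = Fraction(3, 10)

p_rain_given_cloudy = bayes_posterior(p_rain, p_cloudy_given_rain, p_cloudy_given_dry)

print(p_rain_given_cloudy)  # 3/7
```

Under these assumptions, seeing clouds in the morning raises the probability of rain from 1/5 to 3/7, which is still less than one half.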

Relation with Data Science: Statistics and probability form the mathematical foundation of data science. Without a clear understanding of statistics and probability, it is very easy to misinterpret data and reach incorrect conclusions.

## Machine Learning

Introduction: Arthur Samuel defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. A machine learns whenever it changes its structure or program in such a way that its expected future performance improves. The change can occur due to its inputs or in response to external information. For example, when the performance of a machine learning model being trained for object recognition improves after looking at several pictures of the object, it is reasonable to say that the machine has learned to identify the object.

In simple terms machine learning involves two goals: generalization and improvement.

• Learning leads to generalization: the machine learning model must be able to make predictions on data that it has not previously seen.
• Learning leads to improvement: as data or computational resources increase, the model should be able to make more accurate predictions.
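Both goals can be illustrated with a toy example. The sketch below uses a very simple nearest-neighbor classifier (our choice for illustration) on synthetic data, where numbers below 50 are labeled "small" and the rest "large". The model is always evaluated on numbers it has never seen.

```python
def true_label(x):
    """Ground-truth rule for the synthetic data: values below 50 are 'small'."""
    return "small" if x < 50 else "large"

def predict(train, x):
    """Predict the label of x as the label of the closest training example."""
    _, nearest_label = min(train, key=lambda example: abs(example[0] - x))
    return nearest_label

def accuracy(train, test):
    """Fraction of test examples whose predicted label matches the true label."""
    hits = sum(predict(train, x) == label for x, label in test)
    return hits / len(test)

# Test data: 3, 13, ..., 93. None of these values appear in the training sets.
test_set = [(x, true_label(x)) for x in range(3, 100, 10)]

small_train = [(x, true_label(x)) for x in (0, 60)]              # only 2 examples
large_train = [(x, true_label(x)) for x in range(0, 100, 10)]    # 10 examples

acc_small = accuracy(small_train, test_set)
acc_large = accuracy(large_train, test_set)

print("accuracy with 2 training examples:", acc_small)
print("accuracy with 10 training examples:", acc_large)
```

With only two training examples the model misclassifies some unseen numbers near the boundary; with ten examples it gets all of them right. Real datasets are rarely this clean, but the pattern of testing on unseen data is the same.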

Machine learning systems perform a variety of tasks that involve recognition, diagnosis, planning, robot control, prediction, etc.

Machine Learning in Data Science: Data scientists use a number of machine learning algorithms to make predictions from available data. For example, using the sales data of a shopping mall from previous years, we can predict the approximate sales for the coming years with regression methods like linear regression. Similarly, classifying data into known classes, like classifying birds based on their calls, requires classification algorithms like logistic regression, decision trees, etc.
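As a minimal sketch of the sales example, the code below fits a straight line to made-up yearly sales figures using ordinary least squares, and then uses the line to predict the next year. In practice you would likely use a library such as scikit-learn; the hand-written `fit_line` helper here is only meant to show the idea.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up yearly sales (in thousands of dollars) for illustration.
years = [2014, 2015, 2016, 2017]
sales = [100.0, 110.0, 125.0, 135.0]

slope, intercept = fit_line(years, sales)
predicted_2018 = slope * 2018 + intercept

print("sales grow by about", slope, "per year")
print("predicted sales for 2018:", predicted_2018)
```

The slope tells us how much sales change per year on average, and extending the line one step gives a rough forecast. The forecast is only as good as the assumption that the trend continues.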

## Big Data

Introduction: When a data set becomes so large or so complex that it is difficult to process using traditional data management approaches, we turn to Big Data. Storing or processing this data usually requires a large number of computers (from tens for small companies to tens of thousands for large companies). Big Data is characterized by three Vs:

• Volume: Big Data is large in volume: it can range from terabytes to zettabytes.
• Variety: Big Data is diverse in nature. It can be in different formats and types. Most companies have a mix of structured and unstructured data.
• Velocity: Large amounts of data are generated on an ongoing basis. For example, data may be coming in from users interacting with a website, or from sensors that are constantly collecting measurements.

Three Vs of Big Data: Volume, Velocity and Variety

Big Data and Data Science: The emergence of big data has raised the importance of Data Science. Data scientists use tools such as Hadoop, Spark, R, Pig and Java to process big data, depending on their needs. As our technology and society become more data driven, big data and data science will become even more intricately related.
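Tools like Hadoop are built around the idea of splitting work across many machines and then combining the partial results. The sketch below imitates this "map and reduce" pattern on a single machine with made-up text. It is not Hadoop's actual API, and the function names are our own.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """'Map' step: count words in one chunk of data, independently of other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """'Reduce' step: merge two partial results into one."""
    return left + right

# Made-up data, split into chunks as if each chunk lived on a different machine.
chunks = [
    ["big data is big", "data science"],
    ["science uses data"],
]

partial_counts = [map_chunk(chunk) for chunk in chunks]
total_counts = reduce(reduce_counts, partial_counts, Counter())

print(total_counts.most_common(3))
```

Each chunk is processed without looking at any other chunk, so in a real cluster the map step could run on many computers at once.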

# Conclusion

Congrats on completing your first Data Science tutorial on CommonLounge!

We hope you had a great time learning. In the next tutorial, we will jump into some practical and hands-on topics.