Advanced Data Preparation

March 21, 2018

Introduction

In this tutorial, we are going to learn about a few Advanced Data Preparation concepts, including:

What Format Your Data Should Be In
Data Interpreter
Pivot
Splitting a Column into Multiple Columns
MetaData Grid

Excellent data analysis and visualization depend on the quality of your data. It often has to be wrangled and cleaned before we can commence analysis. Let’s learn how to do exactly that!

Formatting

The first concept that we will cover is formatting.

Formatting is an important consideration in both analysis and presentation. Data needs to be in a proper format for Tableau to be able to parse and display it. Incorrectly formatted data can prevent Tableau from reading it or cause Tableau to interpret the data in ways other than what is required.

Hence, it is important to understand how to ensure that the data is well formatted and conducive to creating great visualizations and analysis in Tableau.

Tableau Public can accept data in different formats such as CSV, Excel, JSON, etc.

Formatting for Excel and CSV files

Let’s start with the general best practices for well-formatted data in Excel or CSV.

Ensure that the data begins from the first column and row.

Many report based data worksheets will have text in the first row. Here we see that the first row is explanatory and that field names begin at row 2. You can simply remove this row and then import the sheet in Tableau.

Make the very first row the column headers

Data in Excel and CSV formats sometimes can have data beginning from the first row itself. Ensure that the first row consists only of meaningful columns names.

Have one record in every subsequent row

If your Excel sheet contains cross tab* data, we will need to exclude it or alter it. Tableau is essentially row oriented — it expects each record to be stored in a row. But cross tabs make the data column oriented. In such cases, exclude the cross tabs or aggregate and store.

*Cross tab: A crosstab query is a type of select query. When you run a crosstab query, the results display in a datasheet that has a different structure from other types of datasheets.

Also known as contingency tables or cross tabs, cross tabulation groups the variables to understand the relation between different variables. It is usually used in statistical analysis to find patterns, trends, and probabilities within raw data.

When you can use cross tabulation

Cross tabulation is usually performed on categorical data — data that can be divided into mutually exclusive groups.

If you have notes or aggregations in any row, remove them as they will not be read or incorrectly parsed.

Plenty of data sheets have rows of aggregations like Totals, Averages using in-built Excel functions. While useful, these pre-aggregations make it difficult to read the files into Tableau. Also, in light of Tableau’s great feature set, pre-aggregations are redundant.

Make sure there are no blank rows or columns within the data. If there are merged cells or blank cells within the dataset, ensure that they are either filled with appropriate data or values that showcase missing data. [The values can be None, Null or NA. This is similar to Programming languages where missing data is explicitly stated with special values like Null to distinguish it from 0].

For example, in a customer database, email id and phone number field values might not be present for all records. We can add Null instead of leaving these values blank. Or in case of Sales data for a company, we might not have records for a newly launched category. We can add NA values here to indicate that there is no record for this category. This is especially important to differentiate the Nulls from 0 sales value, which can occur for failing or seasonal goods category.

Formatting for JSON files

Another common type of files that we can import into Tableau Public is the JSON format. JavaScript Object Notation is an open-standard, language-independent data format and is commonly the response output when extracting data from APIs. Tableau Public can accept the data in JSON format too but has some guidelines to ensure that the data is read correctly and displayed accurately.

Do not exceed the 10x10 limit for nested arrays

Since JSON data is nested, and each level creates a new row, deeply nested arrays create a massive number of rows, which cause memory issues and result in an error. Work around this issue by reducing the nesting or selecting fewer schema levels from the output.

Tableau explicitly asks for the levels of schema to be selected on loading JSON files. This step occurs after the data is loaded but before the blank worksheets are shown.

Loading time may be slow for nested or deep JSON files

Consider for example, a file with over 50 levels of JSON objects. Note this is different from the 10x10 limit for nested arrays in the JSON format. The processing time shoots up for this file and it can take a long time to load. In this case as well, limit the number of schema levels. As showcased above, select only the schema levels that are required for analysis.

Remove comments

Ensure your JSON file is saved with only the JSON objects and not comments or other text. This can be the case if you received the JSON files as a report and includes explanatory text.

Text Files

Another common format for our data is a text file. Tableau can connect to delimited text files such as *.txt, *.csv, *.tsv . Note: A delimited text file is simply a text file used to store data, in which each line represents a single entity or record [analogous to row], and each line has fields [analogous to columns] separated by the delimiter. The delimiter is commonly a comma, space, or semicolon.

Example

Here, we shall try loading a sample TSV text file from the data.gov website into Tableau.

We will use the Rotten Tomatoes movie ratings dataset — you can download it here.

Open the text file from link above into Tableau.

From the menu, navigate to the directory where the file can be found.

Observe the left panel that shows the Data Interpreter.

Data Interpreter

The Data Interpreter is an excellent tool that can be used to specify delimiters and instruct Tableau how to read the text files. It can prevent most of the issues outlined previously.

Data Interpreter is used when our dataset contains features such as sub-tables, extra headers or footers, empty tables, etc. In short, Data Interpreter can make our data look pretty and cleaner without changing the underlying data. Please note that on selecting the Interpreter, based on file size and complexity, load times might be large.

However, data interpreter can throw errors or be unavailable in some cases such as:

If Tableau Desktop doesn’t need extra help from Data Interpreter to handle unique formatting or extraneous information, Data Interpreter option is not available.
Data contains more than 2000 columns or in cases where data contains more than 3000 rows and more than 150 columns
Data Interpreter is only available when your connector supports it. For Excel, your data must be in the .xls or .xlsx format.

Example

Let us first download the dataset from the link here. Go to the bottom and click on ”Download data sets”. Once you’ve downloaded the zip file and unzipped it, load the refugee dataset into Tableau. The filename will be 3 - UNHCR syrian refugees.

Open the dataset in Tableau and choose the Microsoft Excel option. You will be prompted with the following dialog:

Select the checkbox next to Use Data Interpreter. Observe the changes. We can see here that additional tabs are added as datasets to our Tableau ‘sheets’ list under Connections. This is because Tableau has changed the Excel sheet to rectify the issues outlined in the previous section. The additional data, such as cross tabs, pivot tables, etc. are separated to new tabs of the same Excel sheet.

Click on Review the results to see the changes that were made.

The color codes on the first tab indicate the changes done for cleaning the data. View these alterations by clicking on the second tab.

The extra columns headers and values are appropriately understood and handled by Tableau.

On clicking each tab, we see that the sub tables are added as separate tables in the new Excel sheet. For instance there were a few rows in the middle of our original Excel as:

If we observe the above section, we find that this is data about Funding Requirements. We further note that it is in tabular format. The title or header of this information is “Funding Requirements (2016)“. Then we have 5 lines of data followed by a link.

Since, we used Data Interpreter, we study the transformation. This entire section, being a table of data, has been shifted to a new tab. Screenshot is as below.

As seen here, we have Row 1, which is the title of the table (“Funding Requirements”) reposed here. It is hence, labeled as “Header”. The remaining rows represent ‘Data points’ such as Gap, Coverage, etc. from original table. Hence, they are labelled as “Data”.

Thus, we see that the data has been cleaned up and made more functional for analysis and visualization purposes by Tableau. Our data for visualization looks like:

Pivot Tables

Next, we move on to the concept of Pivot tables in Tableau.

First, what is a Pivot table?

Excel users might be familiar with the concept. A pivot is a transformation of data structure from a wide view composed of many columns into a long view with condensed columns. Pivoting can be a useful method for building charts more efficiently and displaying visualizations with little effort.

Pivots are generally used when we have different columns that we wish to represent clubbed together. The columns necessarily need to have some relationship where they are individual items of a variable or have one overarching superset.

For instance, if we have Sales data in separate columns for each Quarter of different years, we can combine them to get aggregate Sales figures for each year. Since, they all have a common superset of Sales, we can use pivoting. But, Sales and Discounts while both being numeric cost-related figures, cannot be combined into a pivot as they do not have any common binding term.

Example

For this example, we take the World Indicators data from Kaggle. Link here: World Development Indicators

To pivot the data, first we will load the data. Then we will need to pick two or more columns to club.

Here, we see that different population groups are listed as separate columns.

We can combine these with the help of a Pivot table.

Select the three columns.

Right click on the third columns and select the Pivot option.

Our values are merged as ‘Pivot Field Values’ as follows.

Now, we can proceed with creating the visualizations as usual.

Splitting a Column into Multiple Columns

Moving along, we come to the concept of splitting a column into multiple columns.

To illustrate this concept, we can use our Sales dataset. Load this data.

We can split the customer name into first name and last name columns.

In Tableau, we can use split or custom split options to separate column values based on a specific separator or a repeated pattern of values present in each row of the field. Here, we shall utilize the space character.

We can simply right click on the column header and use the Split function, by clicking on Custom Split .

Select the data to be split into two columns based on space character. We choose the Customer Name.

The Customer Name has the first name and last name separated by a space character. Automatically, Tableau splits this customer name into two with one column containing only first names and other with only last names. Notice how the new column names are prefixed with ‘Customer Name-’ and also contain an ‘equals’ sign to highlight that they are derived columns, i.e. =Abc in blue below.

Our data is then split into two columns as above. Here, we could also have simply used the Split function with same results.

Metadata Grid

Now, we progress to the last concept in our tutorial — Metadata Grid.

We have often looked at our data source page, which holds the dataset under consideration. This page consists of four main areas. These are the left pane, canvas, data grid, and metadata grid. Of these, the first three are commonly used.

Left Pane: This holds the connected datasources and sub-tables if any.

Canvas: The central section that holds the dataset under consideration along with any joins or blends.

Data Grid: The actual dataset under consideration and shows all the included variables and values.

Metadata grid

This section helps to quickly examine the structure of our Tableau data source and perform routine management tasks, such as renaming fields or hiding multiple fields at once.

Click on the list icon atop the data grid.

It changes the data grid to show us the underlying metadata of the dataset.

As seen above, we can view the fields in our data source as rows. We also see the columns as a result of split. In my case, since I used my previously saved Sales dataset, I can also scroll down to see the calculated fields and parameters I created as follows:

Congratulations!

You have now mastered the essential of concepts that will make understanding and wrangling datasets for analysis and visualization easy.