In this tutorial, we will learn the following advanced topics in Tableau:
- Understanding how LEFT, RIGHT, INNER, and OUTER Joins Work
- Joining Data in Tableau
- Joins With Duplicate Values
- Joining on Multiple Fields
- Joining Data v.s. Blending Data in Tableau
- Data Blending in Tableau
- Creating Calculated Fields in a Blend (Advanced Topic)
- Creating Table Calculations
- Creating Bins and Distributions
- Leveraging the Power of Parameters
First, we will try and understand the concept of joins.
Often our data is not a single table but a collection of tables. These are related to each other by single or multiple fields (or columns). Joins are a method for combining this related data based on the common fields. The resultant table is typically a virtual table that contains columns from all of the parent tables (that resulted in the join).
Note: Depending on the kind of join, the number of records may or may not be identical to the number of rows in original tables.
For instance, if we take our Sales data: data from both Orders and Returns table can be combined via a Join. This will add more details to the sales orders that were returned and allow us to investigate the potential causes.
There are four types of Joins in Tableau or any database programming language like SQL, Postgres, etc.
These are inner, left, right, and full outer joins. The type of join depends on the line of exploration and tables under consideration.
Inner Join: When you use an inner join to combine tables, the result is a table that contains values that have matches in both tables.
Representative of Inner Join’s corresponding Venn diagram as per set theory. Notice that the intersection of the two tables is highlighted.
Left Join: When you use a left join to combine tables, the result is a table that contains all values from the left table and corresponding matches from the right table.
When a value in the left table doesn’t have a corresponding match in the right table, you see a null value in the data grid.
Right Join: When you use a right join to combine tables, the result is a table that contains all values from the right table and corresponding matches from the left table.
When a value in the right table doesn’t have a corresponding match in the left table, you see a null value in the data grid.
Full Outer Join: When you use a full outer join to combine tables, the result is a table that contains all values from both tables.
When a value from either table doesn’t have a match with the other table, you see a null value in the data grid.
To analyze joins, let us once again open Tableau and load the Sales dataset in Excel format. Here, we shall try to holistically analyze all our transactions — that means we will look at not only our sales transactions but also any returns of our orders. Since this data is housed in two separate datasets, we necessarily have to combine them using Joins.
Orders into the section that says “Drag Sheets here”.
As observed, the Orders data was loaded. Drag
Returns to this sheet as well. Tableau will automatically join the two tables with an Inner Join. Hover to see the details of the Join.
As can be seen, the Join happened on the common variable —
Order ID. Click on the concentric circles (representing the Inner Join).
We can view the different types of joins and select accordingly. We can also add new clauses i.e. conditions on which the join occurs.
A point worth remembering is that Tableau will only make the Join automatically if it finds common variables with identical names. In case the matching parameter does not have identical field names, we will have to manually add the Join condition. E.g. If Returns table had Order ID as
Order_ID, we would have had to specify the condition by stating that
Order ID (data source which here is Orders table) =
Order_ID (from Returns table).
Now, in this Join, we only have records for orders that got matched with orders on Returns table. This means only records for returned order are available in this dataset. Since, we would like to inspect all the Orders with the matched records of those that got returned, we make a left join. Simply click on the Left Join option and select.
The data will get loaded. Now open a new worksheet for visualization.
Notice how the visualization has a
Order ID from Orders and Returns table both. Also, all variables from both tables are present in Dimensions and Measures section.
We can also add multiple joins by adding another table. For instance, we shall try to see profits made by each representative as well as losses due to returned orders. For this, we need:
- The data of representatives from the People table
- The Profits values from Orders sheet
- The Returns table → If data exists in Returns table, the transaction is unsuccessful, else it is profitable.
To make this visualization, we shall join the data from the People table, Orders table, and Returns table. As we have already made the join with Returns data, just add the People table to the worksheet. Go to Data Source on the bottom section.
Drag the People table to other worksheets. An inner join is automatically performed between Orders and People, as the
Region variable is common to both.
Now, head back to the visualization worksheet. All variables from the three tables will auto-populate in the Dimensions and Measures sections.
As we can see above, we have no common factor between People and the Returns table.
This newly joined dataset now has all the transactions that have contributed to revenue — both the actual Sales that led to profit (or loss) but also the nullifying transactions through Returns. This is significant as many seemingly profitable sales transactions might actually not accrue revenue if the corresponding orders were returned.
We also add the additional data of the in-charge person for these regions to analyze if any particular individual (or teams in case of larger personnel data) contributed more to profit or loss.
We can create a simple viz to see the data in action.
Sub-category to to the Rows field. Add
Profit to the Columns section in the center of the worksheet from Measures.
Returned from the Returns dataset. Add it to the Colors Card as well as the Filter section. Our viz is transformed as below.
If any value is not displayed due to negative values, add it to the sheet using absolute values.
Here, we showcase the overall Profits and sales figures in the backdrop of Profits.
Add the Returned Dimension from Returns table to the Color Card in Marks panel.
Our viz transforms on adding to color as follows.
Tableau has automatically split the number based on the categorization of being Returned or not. Referring to the Color Legend on the side, we can see that 800 orders were returned while the remaining 9195 orders were completed.
Now, we shall look at another important concept called Data Blending in Tableau.
Data blending is a method for combining data where we supplement a table of data from one data source with columns of data from another data source.
Data blending is used instead of Joins in the following instances:
- When we have to join data from different databases that are not supported by cross-database joins
- Data is at coded at different levels of detail in the various individual tables being considered for the join
- Data needs to be cleaned or transformed
- When Joins might cause duplicate data or performance issues
In order to use data blending features, we need to:
- Assign one data source as primary and one or more as secondary.
- Define the common dimension(s) between the two data sources
In general, we use data blending instead of Joins when we have our data in different types of sources: say an Excel sheet and the result from a database query in JSON format or need to aggregate data from secondary data source to first.
Here, we will try and compare Profits from two sources — our Sales dataset and the Coffee chain dataset. This is similar to comparing revenue from say, two different companies or departments while performing an annual review. Or you can imagine comparing the growth of two different stocks to see which of the two to purchase. Another useful analogy is viewing expenditure from two different bank accounts. In all these examples, while the variable being compared is similar, the actual dataset and even format can be different.
Now, let’s start practicing.
Open a new worksheet in your Tableau workbook.
Select from the Menu options: Data → New Data Source.
Add the Sample coffee chain data in Excel format. You can download it from here. Drag all the three table as inner Joins in the worksheet. This is additional practice for Joins.
Now head over to Sheet 2 by clicking on the Sheet 2 tab from the bottom panel. We can see the variables from this Excel sheet in Dimensions and Measures card. Also, seen in the screenshot below are the three data sources loaded until this point.
State to the Rows section from the Sales dataset. Add
Profit to the Columns section. The visualization is automatically created as a bar chart.
Now select the Coffee dataset from the loaded data sources in top left panel.
When we select this completely disparate data source, we can still view our viz on the sheet. In other words, even though we added a new data source, it did not replace the original data source. Instead the two data sources seamlessly blended. This is significant as we want to combine the two different different data sources on the same viz.
As seen here, we have a red link/chain symbol on the State variable as this is detected as a common variable by Tableau for the two data sources.
Profit form Measures and add to Columns section.
Our visual is now transformed as below.
Now swap the rows and columns using the Swap symbol.
Our X-axis labels, which are State names are now not visible fully.
Click on the X-axis to see the list of Menu options.
Rotate Label option. Voila, the labels are now readable. Our visualization is now aesthetically pleasing and functional too. We can see the Profits both from the Superstore dataset as well as Coffee Shop data on the same visualization.
Save the file to Tableau public and publish it on your profile. Congratulations! You have now successfully performed data blending and created a visualization comparing variables from the blended sources.
Calculated fields allow us to create new data from records and data that already exists in our data source. When we create a calculated field, we are essentially creating a new field (or variable) in the data source. Its constituent values or members are determined by the calculation that we specify. This new calculated field is saved to our data source in Tableau, and can be used to create more robust visualizations.
Now, right click on the Profit parameter again. Select the Calculated field from Menu options.
This opens a new window where we can create a custom computation.
We create a variable that calculates the Profit over Sales metric.
Do this for the Coffee dataset as well.
Now add these new variables to the Columns section.
Click on the Y-axis for second plot.
Click on Edit axis and reverse the scale as below.
Our final visual is as below.
Observe that we used the reversed scale to allow easy comparison of the two images.
Table Calculations are another feature available in Tableau that allow us to perform computations on values in a table.
Open a new worksheet after saving this file. Add the
Profit from the Coffee data to the Columns and Rows respectively.
In the pop-up that opens, assign table calculation as:
Let’s dissect this popup for a moment — we are creating running totals of Profit Sums for every state, hence we will add up all of the profits for all transactions for each state.
Note: Running Total table calculation aggregates values cumulatively in a partition. Thus, instead of absolute values, we will see a running total of profits for each year, such that each month’s profits are added to all previous months’ values.
Similarly, add the
Product Id to our plot. Our visual now has moving totals as below.
Next on our list is understanding bins and distributions. For this, we need to create a histogram.
First of all, a little background about histograms, bins and distributions.
A histogram is a plot that lets us discover, and show the underlying frequency distribution (shape) of a set of continuous data. To construct a histogram from a continuous variable you first need to split the data into intervals, called bins. Each bin contains the number of occurrences of values in the data set that are contained within that bin.
Open a new worksheet and select our Sales data. We can view the sales figures using a histogram plot.
Sales to Columns in the sheet.
Click on the Show Me button on top right and select Histogram from the plots types.
Our histogram is ready.
As we can see that the sales distribution has automatically been rendered. The buckets in which counts are placed are based on the value of sales.
However, there are too few buckets. We can change these bins by specifying values of the bin ranges ourselves.
Simply right click on Sales in Measures and click on Create → Bins in the menu options.
A window pops up as follows.
Enter value as 15.
The histogram changes as follows.
Another way to create bins is by using Parameters.
What are parameters? A parameter is a variable that allows us to pass a value. They allow us to add values into our visualization that are currently not present in the dataset. They can be used in filters and calculations, and subsequently in visualizations.
Once again, we shall create a histogram of profit variable using Parameters. For creating a parameter, click on any variable to get the Menu options and select Create → Parameter. Note how, even though we are creating a parameter for use on the Profit variable, we can get the Menu options by selecting right clicking on the Sales variable.
Next, select options from the window as:
The fields in this part of the Create Bins dialog box show us the data that Tableau uses to suggest a bin size. Here, we also encounter the minimum and maximum values. We can use these values to set the bin size dynamically. While doing so, the user of our dashboard will be able to use the parameter control to adjust the size of the bins in increments, with a minimum size of 50 and a maximum size of 100.
Please note that we are using the parameter as the size variable. Change this variable to set the bin size and change the histogram dynamically.
Our histogram changes as:
Thus, we have used Parameters and bins to change our histograms.
Our histogram clearly shows that maximum profits lie within the bins of -1K to 1K since we are grouping by the profits.