Unit II Notes
With the wide variety of verticals, use-cases, types of users, and systems utilizing
enterprise data today, the specifics of munging can take on myriad forms.
Data exploration: Munging usually begins with data exploration. Whether an analyst is
merely peeking at completely new data in initial data analysis (IDA), or a data scientist
begins the search for novel associations in existing records in exploratory data analysis
(EDA), munging always begins with some degree of data discovery.
Data transformation: Once a sense of the raw data’s contents and structure has been
established, it must be transformed to new formats appropriate for downstream processing.
This step typically falls to the data scientist and includes, for example, un-nesting hierarchical JSON data,
denormalizing disparate tables so relevant information can be accessed from one place, or
reshaping and aggregating time series data to the dimensions and spans of interest.
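As a rough illustration of un-nesting hierarchical JSON, the following Python sketch flattens nested objects into a single-level record with dotted keys. The flatten helper and the sample record are hypothetical, made up for these notes, and lists inside the JSON are left untouched.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Hypothetical nested record, made up for illustration
raw = '{"order_id": 17, "customer": {"name": "Asha", "address": {"city": "Pune"}}, "total": 42.5}'
flat_record = flatten(json.loads(raw))
print(flat_record)
```

Each flattened record can then be written as one row of a table, which is usually what downstream tools expect.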
Data enrichment: Optionally, once data is ready for consumption, data mungers might
choose to perform additional enrichment steps. This involves finding external sources of
information to expand the scope or content of existing records. For example, an open-source
weather data set could be used to add daily temperature to an ice-cream shop’s sales figures.
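The ice-cream example can be sketched in a few lines of Python. The sales rows, the weather readings, and the enrich helper below are all assumptions made up for illustration; a real enrichment step would read both from actual sources.

```python
# Hypothetical sales figures and weather readings, keyed by ISO date
sales = [
    {"date": "2023-07-01", "units_sold": 120},
    {"date": "2023-07-02", "units_sold": 95},
    {"date": "2023-07-03", "units_sold": 143},
]
weather = {"2023-07-01": 31.0, "2023-07-02": 27.5}  # date -> temperature in Celsius

def enrich(sales_rows, temperature_by_date):
    """Return new rows with a temp_c field; None when no reading exists."""
    return [{**row, "temp_c": temperature_by_date.get(row["date"])} for row in sales_rows]

enriched = enrich(sales, weather)
for row in enriched:
    print(row)
```

The original rows are left unchanged, and dates without a weather reading get None rather than a guessed value, so the gap stays visible for the validation step.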
Data validation: The final, perhaps most important, munging step is validation. At this
point, the data is ready to be used, but certain common-sense or sanity checks are critical if
one wishes to trust the processed data. This step allows users to discover typos, incorrect
mappings, problems with transformation steps, even the rare corruption caused by
computational failure or error.
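A minimal sketch of such sanity checks in Python is shown below. The three rules (no negative counts, valid dates, no duplicate dates) and the sample rows are assumptions chosen for illustration; real checks depend on the data set.

```python
from datetime import date

def validate(rows):
    """Return a list of (row_index, problem) tuples; empty means all checks passed."""
    problems = []
    seen_dates = set()
    for i, row in enumerate(rows):
        if row["units_sold"] is None or row["units_sold"] < 0:
            problems.append((i, "units_sold missing or negative"))
        try:
            date.fromisoformat(row["date"])
        except ValueError:
            problems.append((i, "date is not a valid ISO date"))
        if row["date"] in seen_dates:
            problems.append((i, "duplicate date"))
        seen_dates.add(row["date"])
    return problems

# Hypothetical rows with deliberate errors
rows = [
    {"date": "2023-07-01", "units_sold": 120},
    {"date": "2023-07-01", "units_sold": 95},
    {"date": "2023-13-40", "units_sold": -4},
]
issues = validate(rows)
for index, problem in issues:
    print(index, problem)
```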
The most basic munging operations can be performed in generic tools like Excel or Tableau
—from searching for typos to using pivot tables, or the occasional informational
visualization and simple macro. But for regular mungers and wranglers, a more flexible,
powerful programming language is far more effective.
Python is often lauded as the most flexible popular programming language, and this is no
exception when it comes to data munging. With one of the largest collections of third-party
libraries, especially rich data processing and analysis tools like Pandas, NumPy, and SciPy,
Python simplifies many complex data munging tasks. Pandas in particular is one of the fastest-
growing and best-supported data munging libraries, while still only a tiny part of the massive
Python ecosystem.
Python is also easier to learn than many other languages thanks to simpler, more intuitive
formatting, as well as a focus on legible, English-language-adjacent syntax. Given Python’s wide
applicability, rich libraries, and online support, new practitioners will additionally find the
language useful far beyond data processing use cases, everywhere from web development
to workflow automation.
Data science is the study of data. Just as the biological sciences are the study of biology
and the physical sciences are the study of physical reactions, data science studies data. Data is real, data
has real properties, and we need to study them if we’re going to work on them. Data
science involves data and some science.
Data wrangling is the process of cleaning and unifying messy and complex data sets for easy
access and analysis.
With the amount of data and data sources rapidly growing and expanding, it is getting
increasingly essential for large amounts of available data to be organized for analysis. This
process typically includes manually converting and mapping data from one raw form into
another format to allow for more convenient consumption and organization of the data.
The Goals of Data Wrangling
• Reveal a "deeper intelligence" by gathering data from multiple sources
• Put accurate, actionable data in the hands of business analysts in a timely manner
• Reduce the time spent collecting and organizing unruly data before it can be utilized
• Enable data scientists and analysts to focus on the analysis of data, rather than the
wrangling
• Drive better decision-making by senior leaders in an organization
Joining Data: Combine the edited data for further use and analysis.
Data Cleansing: Redesign the data into a usable and functional format and correct/remove
any bad data.
Package Managers are tools that help you manage the dependencies for your project. A
dependency is code that is required for your program to function properly. These often come
in the form of packages.
Packages can also have their own dependencies. Managing all these dependencies can be
hard because packages may require specific versions of their dependencies. It’s easy to break
something by modifying dependencies manually.
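To see why version requirements make manual changes risky, consider this small Python sketch. The package names, version ranges, and helper functions are hypothetical; real package managers use much richer version rules.

```python
def parse_version(text):
    """Turn '1.24.3' into (1, 24, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def satisfies(installed, minimum, below):
    """True when minimum <= installed < below."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)

# Hypothetical project: two packages need the same dependency in different ranges
installed_version = "1.26.0"
requirements = {
    "package_a": ("1.20.0", "2.0.0"),
    "package_b": ("1.10.0", "1.25.0"),
}
conflicts = [
    name
    for name, (minimum, below) in requirements.items()
    if not satisfies(installed_version, minimum, below)
]
print(conflicts)
```

Upgrading the dependency by hand to satisfy one package silently breaks the other, which is the kind of conflict a package manager is meant to detect.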
Data science blends various tools, algorithms, and machine learning
principles. Most simply, it involves obtaining meaningful information or insights from
structured or unstructured data through a combination of analytical, programming, and business
skills. It is a field containing many elements, such as mathematics, statistics, and computer science.
Those who are good at these fields and have enough knowledge of the domain in
which they want to work can call themselves data scientists. It is not easy,
but it is not impossible either. You need to start with data and its visualization, then move on to programming,
formulation, development, and deployment of your model. Demand for data scientists
is expected to be high in the future, so keep that in mind and prepare yourself to fit into this
field.
In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.
Melting and casting are two of the more interesting operations in R programming for changing the shape
of data into a desired form. R has many
methods to reshape data using the reshape package; melt() and cast() are functions that
reshape data efficiently. Many packages in R require data to be reshaped first. When each
observation is spread over multiple rows of a data frame, with a different detail in each row, the
data is said to be in long format.
Melting in R
Melting in R programming is done to organize the data. It is performed using the melt() function,
which takes the dataset and the columns whose values have to be kept constant. melt() converts
the data frame into long format, stretching it into more rows.
Syntax:
melt(data, na.rm = FALSE, value.name = "value")
Parameters:
data: represents dataset that has to be reshaped
na.rm: if TRUE, removes NA values from dataset
value.name: represents name of variable used to store values
Example:
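# Required package for melt() and cast()
# install.packages("reshape")
library(reshape)
# Create dataframe
n <- c(1, 1, 2, 2)
time <- c(1, 2, 1, 2)
x <- c(6, 3, 2, 5)
y <- c(1, 4, 6, 9)
df <- data.frame(n, time, x, y)
# Original data frame
cat("Original data frame:\n")
print(df)
# Melt the data frame, keeping n and time as identifier columns
molten.data <- melt(df, id.vars = c("n", "time"))
cat("\nAfter melting data frame:\n")
print(molten.data)
Output:
Original data frame:
  n time x y
1 1    1 6 1
2 1    2 3 4
3 2    1 2 6
4 2    2 5 9

After melting data frame:
  n time variable value
1 1    1        x     6
2 1    2        x     3
3 2    1        x     2
4 2    2        x     5
5 1    1        y     1
6 1    2        y     4
7 2    1        y     6
8 2    2        y     9
Casting in R
Casting in R programming reshapes the molten data into the desired form. It is performed using
the cast() function, which takes the molten dataset, a formula, and an aggregate function.
Syntax: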
cast(data, formula, fun.aggregate)
Parameters:
data: represents dataset
formula: represents the form in which data has to be reshaped
fun.aggregate: represents aggregate function
Example:
# Print recasted dataset using cast() function
cast.data <- cast(molten.data, n~variable, sum)
print(cast.data)
cat("\n")
time.cast <- cast(molten.data, time~variable, mean)
print(time.cast)
Output:
  n x  y
1 1 9  5
2 2 7 15

  time x   y
1    1 4 3.5
2    2 4 6.5
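For readers who work in Python rather than R, the same melt and cast idea can be sketched with the standard library alone. The melt and cast functions below are simplified stand-ins written for these notes, not the R functions themselves, and they assume the same small data frame used in the R example.

```python
from collections import defaultdict
from statistics import mean

# Same values as the R data frame above, one dict per row
rows = [
    {"n": 1, "time": 1, "x": 6, "y": 1},
    {"n": 1, "time": 2, "x": 3, "y": 4},
    {"n": 2, "time": 1, "x": 2, "y": 6},
    {"n": 2, "time": 2, "x": 5, "y": 9},
]

def melt(rows, id_vars):
    """Wide to long: one output row per (input row, measured column)."""
    molten = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for key, value in row.items():
            if key not in id_vars:
                molten.append({**ids, "variable": key, "value": value})
    return molten

def cast(molten, index, aggregate):
    """Long to wide: group by `index`, then aggregate the values of each variable."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in molten:
        groups[row[index]][row["variable"]].append(row["value"])
    return {
        key: {var: aggregate(vals) for var, vals in variables.items()}
        for key, variables in sorted(groups.items())
    }

molten = melt(rows, id_vars=("n", "time"))
by_n = cast(molten, "n", sum)         # like cast(molten.data, n~variable, sum)
by_time = cast(molten, "time", mean)  # like cast(molten.data, time~variable, mean)
print(by_n)
print(by_time)
```

The sums per n and the means per time agree with the R output shown above.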
2.5 Tableau
Tableau is a leading data visualization tool used for data analysis and business intelligence.
Gartner’s Magic Quadrant classified Tableau as a leader for analytics and business
intelligence. Tableau has a lot of features, including the ability to create different charts and
graphs and to visualize data through reports and dashboards in order to derive
meaningful insights.
What is Tableau?
Tableau is a data visualization and business intelligence tool used for reporting
and analyzing vast volumes of data. Tableau Software is an American company founded in 2003;
it was acquired by Salesforce in 2019. The tool helps users create different charts, graphs, maps,
dashboards, and stories for visualizing and analyzing data, to help in making business
decisions. Tableau has a lot of unique, exciting features that make it one of the most popular
tools in business intelligence (BI). Now that we know what Tableau is, let us look at some of its
salient features.
Tableau Features
• Tableau supports powerful data discovery and exploration that enables users to answer
important questions in seconds.
• No prior programming knowledge is needed; users without relevant experience can start
creating visualizations with Tableau immediately.
• It can connect to several data sources that other BI tools do not support. Tableau enables
users to create reports by joining and blending different datasets.
• Tableau Server supports a centralized location to manage all published data sources within an
organization.
Once you have downloaded the file, go ahead and install it.
Launch Tableau Desktop, and it will show you the Tableau registration form, which is where
you can register and activate your product.
Enter your product key, or sign in to Tableau Server or Tableau Online to activate your
Tableau license. Upon successful activation, you can start using Tableau Desktop.
Tableau offers several products to suit different business needs: Tableau Desktop,
Tableau Public, and Tableau Online. All of them support creating data visualizations, and the choice of product
depends on the type of work.
• Tableau also gives us some flexibility to create new columns, rename, split, edit aliases,
and join tables, which allows some preprocessing before loading the data into Tableau.
• The below image will demonstrate to you how to load data and perform some
preprocessing.
Tableau supports various data formats, which can be loaded by choosing the corresponding options.
• Under “To a File” we see various options to load data from the local directory, and under
“To a Server” we see options to load data from cloud servers.
• For loading CSV files we select the Text file option; for Excel and SQL files we choose
their respective options.
• To open the application, click the Tableau icon on your desktop (or in your Start
menu).
• In the Connect panel at the left side of the Start page, click the Excel link under the
“To a File” heading to open the file selection dialog.
• Using the file selection box, select the Excel file that you want to open, and
then click the Open button to continue.
• Select the Orders sheet from the navigation menu on the left and drag it onto the Drag
Sheets Here area, as shown in the above gif.
• After loading, we can perform data cleaning, data preprocessing, and feature extraction to
some extent.
• With the global-superstore data loaded, we can now see the Tableau work-page.
• The Tableau work-page consists of different sections. Let’s understand them first before
plotting our graphs.
• Menu Bar: Here you’ll find various commands such as File, Data, and Format.
• Toolbar Icon: The toolbar contains a number of buttons that enable you to perform
various tasks with a click, such as Save, Undo, and New Worksheet.
• Dimension Shelf: This shelf contains all the categorical columns, for example
categories, segments, gender, and name.
• Measure Shelf: This shelf contains all the numerical columns, such as profit, total
sales, and discount.
• Page Shelf: This shelf is used for breaking a view into pages and creating animations. We will come
back to it later.
• Filter Shelf: You can choose which data to include and exclude using the Filters shelf.
For example, you might want to analyze the profit for each customer segment, but only for
certain shipping containers and delivery times. You can make a view like this by placing
fields on the Filters shelf.
• Marks Card: The visualization can be designed using the Marks card. The Marks
card can be used to change the data components of the visualization, such as color, size,
shape, path, label, and tooltip.
• Worksheet: In the workbook, the worksheet is where the actual visualization is
displayed. The worksheet contains information about the visual’s design and functionality.
• Data Source: Using Data Source, we can add new data and modify or remove existing data.
• Current Sheet: The current sheets are the sheets we have already created, and we can
give them names.
• New Sheet: If we want to create a new worksheet (blank canvas), we can do so using
this tab.
• Number: These are values that are numeric. Values can be integers or floating-point
numbers (numbers with decimals).
• String: This is a sequence of characters encased in single or double quotation marks.
• Drag a dimension and a measure onto the Rows and Columns shelves, and Tableau will
automatically suggest the graph best fitted to the data.
• You can change the graph by clicking the Show Me button and selecting whichever
graph you want.
• You can also remove an axis by dragging its field off the shelf and dropping it under the Marks
card (remove field).
• Show Me: When you click this label, a palette appears, giving you rapid access to
many options for showing the selected types of fields. The palette changes depending on the
fields that are selected or active in the worksheet.
• From the above image, you might have observed that the default aggregation on a
measure is sum, but you can change the aggregation to avg, min, max, etc. You can also
customize the axis name, orientation, and size, and show or hide the axis, as shown in the above image.
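What changing the aggregation means can be sketched in Python. The order rows, the AGGREGATIONS table, and the aggregate helper are hypothetical and only mimic what Tableau does when a measure is placed against a dimension.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order rows: segment is a dimension, sales is a measure
orders = [
    {"segment": "Consumer", "sales": 200.0},
    {"segment": "Consumer", "sales": 100.0},
    {"segment": "Corporate", "sales": 400.0},
    {"segment": "Corporate", "sales": 50.0},
    {"segment": "Corporate", "sales": 150.0},
]

AGGREGATIONS = {"sum": sum, "avg": mean, "min": min, "max": max}

def aggregate(rows, dimension, measure, how="sum"):
    """Group rows by a dimension and reduce the measure with the chosen aggregation."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {key: AGGREGATIONS[how](values) for key, values in groups.items()}

totals = aggregate(orders, "segment", "sales")               # default: sum
averages = aggregate(orders, "segment", "sales", how="avg")  # switched to average
print(totals)
print(averages)
```

The same rows give a different picture depending on the aggregation, which is why it is worth checking which one a chart is using.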
Create a hierarchy
A hierarchy in Tableau is an arrangement where entities are presented at various levels. So,
there's an entity or dimension under which there are further entities present as levels. In
Tableau, we can create hierarchies by bringing one dimension as a level under the principal
dimension.
To create a hierarchy:
In the Data pane, drag a field and drop it directly on top of another field.
Note: When you want to create a hierarchy from a field inside a folder, right-click
(control-click on a Mac) the field and then select Create Hierarchy.
When prompted, enter a name for the hierarchy and click OK.
Drag additional fields into the hierarchy as needed. You can also re-order fields in the
hierarchy by dragging them to a new position.
Drill up or down in a hierarchy
When you add a field from a hierarchy to the visualization, you can quickly drill up or
down in the hierarchy to add or subtract more levels of detail.
To drill up or down in a hierarchy in Tableau Desktop or in web authoring:
In the visualization, click the + or - icon on the hierarchy field.
When you are editing or viewing the visualization on the web, you have the option of
clicking the + or - icon next to a field label.
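Drilling down can be thought of as grouping the same rows by more levels of the hierarchy. The Python sketch below assumes a hypothetical two-level hierarchy (category, then sub_category) and made-up sales rows.

```python
from collections import defaultdict

# Hypothetical hierarchy: Category > Sub-Category
HIERARCHY = ["category", "sub_category"]
rows = [
    {"category": "Furniture", "sub_category": "Chairs", "sales": 300},
    {"category": "Furniture", "sub_category": "Tables", "sales": 200},
    {"category": "Technology", "sub_category": "Phones", "sales": 500},
    {"category": "Technology", "sub_category": "Phones", "sales": 100},
]

def total_sales(rows, depth):
    """Sum sales grouped by the first `depth` levels of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[level] for level in HIERARCHY[:depth])
        totals[key] += row["sales"]
    return dict(totals)

drilled_up = total_sales(rows, depth=1)    # category only
drilled_down = total_sales(rows, depth=2)  # category and sub-category
print(drilled_up)
print(drilled_down)
```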
Remove a hierarchy
To remove a hierarchy:
In the Data pane, right-click (control-click on a Mac) the hierarchy and select Remove
Hierarchy.
The fields in the hierarchy are removed from the hierarchy and the hierarchy
disappears from the Data pane.
Tableau performs actions on your view in a very specific order; this is called the Order of
Operations. Filters are executed in the following order:
Extract filters
Data source filters
Context filters
Filters on dimensions (whether on the Filters shelf or in filter cards in the view)
Filters on measures (whether on the Filters shelf or in filter cards in the view)
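The effect of this ordering can be sketched as a pipeline in which each filter only sees the rows that survived the previous one. The rows and the filter rules below are assumptions made up for illustration. As a simplification, the measure filter here tests individual rows, whereas in Tableau measure filters usually act on aggregated values.

```python
# Hypothetical rows; region is a dimension, profit is a measure
rows = [
    {"region": "South", "year": 2021, "profit": -50},
    {"region": "South", "year": 2022, "profit": 120},
    {"region": "West", "year": 2022, "profit": 300},
    {"region": "East", "year": 2020, "profit": 80},
]

# Filters listed in the order described above; each rule is an assumption
FILTER_PIPELINE = [
    ("extract", lambda r: r["year"] >= 2021),
    ("data source", lambda r: r["region"] != "East"),
    ("context", lambda r: r["region"] in {"South", "West"}),
    ("dimension", lambda r: r["region"] == "South"),
    ("measure", lambda r: r["profit"] > 0),
]

def apply_filters(rows, pipeline):
    """Apply each filter to the output of the previous one, recording row counts."""
    counts = []
    for name, keep in pipeline:
        rows = [r for r in rows if keep(r)]
        counts.append((name, len(rows)))
    return rows, counts

remaining, counts = apply_filters(rows, FILTER_PIPELINE)
print(counts)
print(remaining)
```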
After creating some plots, you might want to apply different filters. To do so, follow these steps:
• Drag any measure or dimension you want to filter on onto the Filters shelf.
• When you drop the field, a dialog box appears. You can select particular categories,
select the top N rows according to a measure’s values, or write rules or use parameters
to select rows.
• To apply multiple filters, add the earlier filters to context by
clicking Add to Context. A context filter is a Tableau filter that is applied before
dimension and measure filters.
• You can choose among the Standard, Fit Width, Fit Height, and Entire View options from the
toolbar in order to fit the visualization into the worksheet.
Select to keep or exclude data points in your view
You can filter individual data points (marks), or a selection of data points from your view.
For example, if you have a scatter plot with outliers, you can exclude them from the view so
you can better focus on the rest of the data.
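To filter marks from the view, select a single mark (data point) or click and drag in the view
to select several marks. On the tooltip that appears, you can select Keep Only to keep only the
selected marks in the view, or Exclude to remove the selected marks from the view.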
Create a group
There are multiple ways to create a group. You can create a group from a field in the Data
pane, or by selecting data in the view and then clicking the group icon.
The selected members are combined into a single group. A default name is created using
the combined member names.
To rename the group, select it in the list and click Rename.
Include an Other Group
When you create groups in Tableau, you have the option to group all remaining, or
non-grouped, members in an Other group.
The Include Other option is useful for highlighting certain groups or comparing specific
groups against everything else. For example, if you have a view that shows sales versus profit
by product category, you might want to highlight the high and low performing categories in
the view, and group all the other categories into an "Other" group.
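The idea of an Other group can be sketched in Python as a lookup with a fallback. The group names, their members, and the assign_group helper are hypothetical.

```python
def assign_group(category, groups, other_label="Other"):
    """Return the group containing `category`, or the Other label if none does."""
    for group_name, members in groups.items():
        if category in members:
            return group_name
    return other_label

# Hypothetical groups highlighting high and low performers
groups = {
    "High performers": {"Phones", "Copiers"},
    "Low performers": {"Tables", "Bookcases"},
}
categories = ["Phones", "Tables", "Paper", "Chairs", "Copiers"]
grouped = {category: assign_group(category, groups) for category in categories}
print(grouped)
```

Every category that is not named in a group falls into "Other", so the highlighted groups can be compared against everything else.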
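To include an Other group:
In the Data pane, right-click the group field and select Edit Group.
In the Edit Group dialog box, select Include 'Other'.
To add members to an existing group:
In the Data pane, right-click the group field, and then click Edit Group.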
In the Edit Group dialog box, select one or more members and drag them into the
group you want.
Click OK.
To remove members from an existing group:
In the Data pane, right-click the group field, and then click Edit Group.
In the Edit Group dialog box, select one or more members, and then click Ungroup.
The members are removed from the current group. If you have an Other group, the
members are added to it.
Click OK.
To create a new group in a group field:
In the Data pane, right-click the group field, and then click Edit Group.
In the Edit Group dialog box, select one or more members, and then click Group.
Click OK.
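Create a set
To create a dynamic set, right-click a dimension in the Data pane and select Create > Set. In the
Create Set dialog box, configure the set using the following tabs:
General: Use the General tab to select one or more values that will be
considered when computing the set.
You can alternatively select the Use all option to always consider all members,
even when new members are added or removed.
Condition: Use the Condition tab to define rules that determine what members to include in the set.
For example, you might specify a condition that is based on total sales that only includes
products with sales over $100,000.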
Top: Use the Top tab to define limits on what members to include in the set.
For example, you might specify a limit that is based on total sales that only includes the top
5 products based on their sales.
When finished, click OK.
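The two kinds of rule described above (a condition on total sales, and a top N limit) can be sketched in Python. The product totals and the two helper functions are hypothetical and only illustrate the logic.

```python
# Hypothetical total sales per product
sales_by_product = {
    "Copier": 150_000,
    "Phone": 120_000,
    "Chair": 90_000,
    "Binder": 30_000,
}

def set_by_condition(totals, threshold):
    """Members whose total exceeds the threshold (a condition-style rule)."""
    return {name for name, total in totals.items() if total > threshold}

def set_by_top(totals, n):
    """The n members with the highest totals (a top-style limit)."""
    ranked = sorted(totals, key=totals.get, reverse=True)
    return set(ranked[:n])

over_100k = set_by_condition(sales_by_product, 100_000)
top_3 = set_by_top(sales_by_product, 3)
print(over_100k)
print(top_3)
```

Because both rules are computed from the data, the membership changes when the underlying sales change, which is what makes such a set dynamic.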
The new set is added to the bottom of the Data pane, under the Sets section. A set icon
indicates the field is a set.
To add or remove data points from a set:
In the visualization, select the data points you want to add or remove.
In the tooltip that appears, click the Sets drop-down menu icon, and then select Add to [set
name] or Remove from [set name] to add or remove data from a particular set.
Combine sets
You can combine two sets to compare the members. When you combine sets you create a
new set containing either the combination of all members, just the members that exist in
both, or members that exist in one set but not the other.
Combining sets allows you to answer complex questions and compare cohorts of your data.
For example, to determine the percentage of customers who purchased both last year and
this year, you can combine two sets containing the customers from each year and return
only the customers that exist in both sets.
To combine two sets, they must be based on the same dimensions. That is, you can combine
a set containing the top customers with another set containing the customers that purchased
last year. However, you cannot combine the top customers set with a top products set.
To combine sets:
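In the Data pane, under Sets, select the two sets you want to combine.
Right-click the sets and select Create Combined Set.
In the Create Set dialog box, type a name for the new combined set.
Verify that the two sets you want to combine are selected in the two drop-down menus.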
Select one of the following options for how to combine the sets:
All Members in Both Sets - the combined set will contain all of the members from both
sets.
Shared Members in Both Sets - the combined set will only contain members that exist in
both sets.
Except Shared Members - the combined set will contain all members from the specified set
that don't exist in the second set. These options are equivalent to subtracting one set from
another. For example, if the first set contains Apples, Oranges, and Pears and the second set
contains Pears and Nuts; combining the first set except the shared members would contain
just Apples and Oranges. Pears is removed because it exists in the second set.
Optionally specify a character that will separate the members if the sets represent multiple
dimensions.
When finished, click OK.
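The three options correspond to ordinary set operations, which Python's built-in set type shows directly. The sketch below reuses the fruit example from the text; the variable names are chosen for these notes.

```python
first = {"Apples", "Oranges", "Pears"}
second = {"Pears", "Nuts"}

all_members = first | second          # All Members in Both Sets (union)
shared_members = first & second       # Shared Members in Both Sets (intersection)
first_except_shared = first - second  # first set, Except Shared Members (difference)

print(sorted(all_members))
print(sorted(shared_members))
print(sorted(first_except_shared))
```

Note that the difference depends on which set comes first: second - first would give just Nuts.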
To switch a set to list the individual members:
In the visualization workspace, right-click the set and select Show Members in Set
Create a calculated field
In Tableau, select Analysis > Create Calculated Field.
In the Calculation Editor that opens, give the calculated field a name.
Enter a formula, for example:
SUM([Profit])/SUM([Sales])
Formulas use a combination of functions, fields, and operators. To learn more about
creating formulas in Tableau, see Formatting Calculations in Tableau
and Functions in Tableau.
The new calculated field is added to the Data pane. If the new field computes quantitative
data, it is added to Measures. If it computes qualitative data, it is added to Dimensions.
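To see what SUM([Profit])/SUM([Sales]) computes, here is a small Python sketch with two made-up rows. The profit_ratio helper is hypothetical. It also shows that a ratio of sums is not the same as the average of the per-row ratios.

```python
# Hypothetical rows with sales and profit measures
rows = [
    {"sales": 100.0, "profit": 50.0},
    {"sales": 900.0, "profit": 90.0},
]

def profit_ratio(rows):
    """Equivalent of SUM([Profit]) / SUM([Sales]) over the given rows."""
    total_sales = sum(r["sales"] for r in rows)
    total_profit = sum(r["profit"] for r in rows)
    return total_profit / total_sales if total_sales else None

ratio_of_sums = profit_ratio(rows)
average_of_ratios = sum(r["profit"] / r["sales"] for r in rows) / len(rows)
print(ratio_of_sums)
print(average_of_ratios)
```

Here the ratio of sums is 140 / 1000, while the average of the row ratios is pulled upward by the small, highly profitable order. Aggregating before dividing, as the formula does, weights each order by its sales.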
2.11 Map based visualisations
If you want to analyze your data geographically, you can plot your data on a map in Tableau.
This topic explains why and when you should put your data on a map visualization. It also
describes some of the types of maps you can create in Tableau.
There are many reasons to put your data on a map. Perhaps you have some location data in
your data source? Or maybe you think a map could really make your data pop? Both of those
are good enough reasons to create a map visualization, but it’s important to keep in mind that
maps, like any other type of visualization, serve a particular purpose: they answer spatial
questions.
You make a map in Tableau because you have a spatial question, and you need to use a map
to understand the trends or patterns in your data.
If you have a spatial question, a map view might be a great way to answer it. However, that
might not always be the case.
Take, for example, the question: Which state has the most farmers
markets?
If you had a data source with a list of farmers markets per state, you might create a map view
like the one below. Can you easily tell the difference between New York and California?
Which one has more farmers markets?
With Tableau, you can create the following common map types:
Proportional symbol maps are great for showing quantitative data for individual locations.
For example, you can plot earthquakes around the world and size them by magnitude.
Heatmaps, or density maps, can be used when you want to show a trend for visual clusters of
data. For example, if you want to find out which areas of Manhattan have the most taxi
pickups, you can create a density map to see which areas are most popular.
Flow maps (path maps)
You can use flow maps to connect paths across a map and to see where something went over
time. For example, you can track the paths of major storms across the world over a period of
time.
Spider maps (origin-destination maps)
You can use a spider map to show how an origin location and one or more destination
locations interact. For example, you can connect paths between metro stations to plot them on
a map, or you can track bike share rides from an origin to one or more destinations.
2.12 Build interactive dashboards
You’ve created four worksheets, and they're communicating important information that your boss needs
to know. Now you need a way to show the negative profits in Tennessee, North Carolina, and
Florida and explain some of the reasons why profits are low.
To do this, you can use dashboards to display multiple worksheets at once, and—if you
want—make them interact with one another.
You want to emphasize that certain items sold in certain places are doing poorly. Your bar
graph view of profit and your map view demonstrate this point nicely.
It's not easy to see details for each item under Sub-Category from your Sales in the South bar
chart. Also, because we have the map in view, we probably don't need the South region
column in Sales in the South, either.
Resolving these issues will give you more room to communicate the information you need.
On Sales in the South, right-click in the column area under the Region column header, and
clear Show header.
Right-click the Profit Map title and select Hide Title.
The title Profit Map is hidden from the dashboard and even more space is created.
Repeat this step for the Sales in the South view title.
Select the first Sub-Category filter card on the right side of your view, and at the top of the
card, click the Remove icon.
Repeat this step for the second Sub-Category filter card and one of the Year of Order Date
filter cards.
Click on the Profit color legend and drag it from the right to below Sales in the South.
Finally, select the remaining Year of Order Date filter, click its drop-down arrow, and then
select Floating. Move it to the white space in the map view. In this example, it is placed just
off the East Coast, in the Atlantic Ocean.
Try selecting different years on the Year of Order Date filter. Your data is quickly filtered to
show that state performance varies year by year. That's nice, but it could be made even easier
to compare.
Click the drop-down arrow at the top of the Year of Order Date filter, and select Single Value
(Slider).
Add interactivity
Wouldn't it be great if you could view which sub-categories are profitable in specific states?
Select Profit Map in the dashboard, and click the Use as Filter icon in
the upper right corner.
The Sales in the South bar chart automatically updates to show just the sub-category sales in
the selected state. You can quickly see which sub-categories are profitable.
Click an area of the map other than the colored Southern states to clear your selection.
You also want viewers to be able to see the change in profits based on the order date.
Select the Year of Order Date filter, click its drop-down arrow, and select Apply to
Worksheets > Selected Worksheets.
In the Apply Filter to Worksheets dialog box, select All in dashboard, and then click OK.
This option tells Tableau to apply the filter to all worksheets in the dashboard that use this
same data source.
Rename and go
You show your boss your dashboard, and she loves it. She's named it "Regional Sales and
Profit," and you do the same by double-clicking the Dashboard 1 tab and typing Regional
Sales and Profit.
In her investigations, your boss also finds that the decision to introduce machines in the North
Carolina market in 2021 was a bad idea.
Your boss is glad she has this dashboard to explore, but she also wants you to present a clear
action plan to the larger team. She asks you to create a presentation with your findings.
2.13 Data Stories
A story is a sheet, so the methods you use to create, name, and manage worksheets and
dashboards also apply to stories (for more details, see Workbooks and Sheets). At the same
time, a story is also a collection of sheets, arranged in a sequence. Each individual sheet in a
story is called a story point.
When you share a story —for example, by publishing a workbook to Tableau Public, Tableau
Server, or Tableau Online—users can interact with the story to reveal new findings or ask
new questions of the data.