
Introduction to Data Science

• Data Science Terminology
• Data Science Process
• Data Science Project Roles
What is Data Science?

• Data Science is the area of study which involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes.
• It helps you to discover hidden patterns in raw data.
• The term Data Science has emerged because of the evolution of mathematical statistics, data analysis, and big data.
What is Big Data?

• Big Data is a collection of data that is huge in volume, yet growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently.
• Data Science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.
• Data science enables you to translate a business problem into a research project and then translate it back into a practical solution.
Why Data Science?
• Data is the oil of today’s world. With the right tools, technologies, and algorithms, we can convert data into a distinct business advantage.
• Data Science can help you detect fraud using advanced machine learning algorithms.
• It helps you prevent significant monetary losses.
• It allows you to build intelligence into machines.
• You can perform sentiment analysis to gauge customer brand loyalty.
• It enables you to take better and faster decisions.
• It helps you recommend the right product to the right customer to enhance your business.
Evolution of Data Science
Data Science Components
Statistics

• Statistics is the most critical unit of Data Science basics; it is the method or science of collecting and analyzing numerical data in large quantities to get useful insights (a short R sketch follows).
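As a small illustration (not part of the original slide text), the following R sketch computes a few basic descriptive statistics on the built-in airquality data set that is used again later in this unit:

# Descriptive statistics on the built-in airquality data set
data(airquality)
mean(airquality$Temp)        # average daily temperature
median(airquality$Temp)      # middle value
sd(airquality$Temp)          # spread around the mean
summary(airquality$Ozone)    # min, quartiles, mean, max (missing values reported)
cor(airquality$Temp, airquality$Ozone, use = "complete.obs")   # linear association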
Visualization

• Visualization techniques help you present huge amounts of data in easy-to-understand, digestible visuals.
Machine Learning

• Machine Learning is a system of computer algorithms that can learn from examples and improve themselves without being explicitly coded by a programmer. Machine learning is a part of artificial intelligence that combines data with statistical tools to predict an output, which can be turned into actionable insights.
• Machine Learning explores the building and study of algorithms that learn to make predictions about unforeseen/future data.
Deep Learning

• Deep Learning is computer software that mimics the network of neurons in a brain. It is a subset of machine learning based on artificial neural networks with representation learning, and it is called deep learning because it makes use of deep neural networks. This learning can be supervised, semi-supervised, or unsupervised.
• Deep Learning is a newer branch of machine learning research in which the algorithm itself selects the analysis model to follow.
Data Science Process
Discovery

• The Discovery step involves acquiring data from all the identified internal and external sources, which helps you answer the business question. The data can be:
• Logs from web servers
• Data gathered from social media
• Census datasets
• Data streamed from online sources using APIs
Preparation

• Data can have many inconsistencies, such as missing values, blank columns, and incorrect data formats, which need to be cleaned. You need to process, explore, and condition data before modelling. The cleaner your data, the better your predictions.
Model Planning

• In this stage, you determine the methods and techniques for drawing relationships between the input variables. Planning for a model is performed using different statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
• Data visualization tools are often cloud-based applications that help you represent raw data in easy-to-understand graphical formats. You can use these programs to produce customizable bar charts, pie charts, column charts, and more.
Top Data Visualization Tools

• FusionCharts
• Power BI
• Whatagraph
• Tableau
• Qlik
Model Building

• In this step, the actual model building process starts. The data scientist splits the dataset into training and testing sets. Techniques such as association, classification, and clustering are applied to the training set. The model, once prepared, is tested against the “testing” dataset.
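As a minimal sketch of this step (the 70/30 split ratio and the example model below are illustrative assumptions, not part of the original slides), the data can be divided into training and testing sets in base R as follows:

# Random 70/30 train/test split on the airquality data set (illustrative)
data(airquality)
set.seed(42)                                         # make the split reproducible
n         <- nrow(airquality)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- airquality[train_idx, ]                 # used to build the model
test_set  <- airquality[-train_idx, ]                # held back for testing

fit  <- lm(Ozone ~ Temp + Wind, data = train_set)    # example model fit on the training set
pred <- predict(fit, newdata = test_set)             # evaluated against the "testing" dataset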
Operationalize

• In this stage, you deliver the final baseline model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.
Communicate Results

• In this stage, the key findings are communicated to all stakeholders. This helps you decide whether the project results are a success or a failure based on the inputs from the model.
Data Science Jobs Roles

• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager
Data Scientist

• Role: A Data Scientist is a professional who manages enormous amounts of data to come up with compelling business insights using various tools, techniques, methodologies, algorithms, etc.
• Languages: R, SAS, Python, SQL, Hive, MATLAB, Pig, Spark
Data Engineer

• Role: A data engineer works with large amounts of data. He or she develops, constructs, tests, and maintains architectures such as large-scale processing systems and databases.
• Languages: SQL, Hive, R, SAS, MATLAB, Python, Java, Ruby, C++, and Perl
Data Analyst

• Role: A data analyst is responsible for mining vast amounts of data. They look for relationships, patterns, and trends in the data, and then deliver compelling reporting and visualization so the business can take the most viable decisions.
• Languages: R, Python, HTML, JS, C, C++, SQL
Statistician

• Role: The statistician collects, analyses, and interprets qualitative and quantitative data using statistical theories and methods.
• Languages: SQL, R, MATLAB, Tableau, Python, Perl, Spark, and Hive
Data Administrator

• Role: The data admin ensures that the database is accessible to all relevant users, is performing correctly, and is kept safe from hacking.
• Languages: Ruby on Rails, SQL, Java, C#, and Python
Business Analyst
• Role: This professional improves business processes and acts as an intermediary between the business executive team and the IT department.
• Languages: SQL, Tableau, Power BI, and Python
Tools for Data Science

• Data Analysis: R, Spark, Python, SAS
• Data Warehousing: Hadoop, SQL, Hive
• Data Visualization: R, Tableau, Raw
• Machine Learning: Spark, Azure ML Studio, Mahout
Difference Between Data Science and BI (Business Intelligence)

• Perception: Business Intelligence looks backward; Data Science looks forward.
• Data Sources: Business Intelligence uses structured data (mostly SQL, sometimes a data warehouse); Data Science uses structured and unstructured data, like logs, SQL, NoSQL, or text.
• Approach: Business Intelligence uses statistics and visualization; Data Science uses statistics, machine learning, and graph analysis.
• Emphasis: Business Intelligence emphasizes the past and present; Data Science emphasizes analysis and neuro-linguistic programming.
• Tools: Business Intelligence uses Pentaho, Microsoft BI, and QlikView; Data Science uses R and TensorFlow.
Applications of Data Science
Internet Search:
• Google Search uses Data Science technology to return a specific result within a fraction of a second.
Recommendation Systems:
• Recommendation systems such as “suggested friends” on Facebook or “suggested videos” on YouTube are built with the help of Data Science.
Image & Speech Recognition:
• Speech recognition systems like Siri, Google Assistant, and Alexa run on Data Science techniques. Moreover, Facebook recognizes your friends when you upload a photo with them, with the help of Data Science.
Gaming world:
• EA Sports, Sony, and Nintendo use Data Science technology to enhance your gaming experience. Games are now developed using Machine Learning techniques, and they can update themselves as you move to higher levels.
Online Price Comparison:
• PriceRunner, Junglee, and Shopzilla work on Data Science mechanisms: data is fetched from the relevant websites using APIs.
Challenges of Data Science Technology

• A high variety of information & data is required for accurate


analysis
• Not adequate data science talent pool available
• Management does not provide financial support for a data
science team
• Unavailability of/difficult access to data
• Business decision-makers do not effectively use data Science
results
• Explaining data science to others is difficult
• Privacy issues
• Lack of significant domain expert
• If an organization is very small, it can’t have a Data Science team
Data Collection and Management

• Introduction
• Sources of data
• Data collection and APIs
• Exploring and fixing data
• Data storage and management
• Using multiple data sources
• Data Preparation
• Feature Engineering
• Data Visualization in R
Data collection is the process of gathering and
measuring information on variables of interest, in
an established systematic fashion that enables one
to answer stated research questions, test
hypotheses, and evaluate outcomes.
• The data collection component of research is
common to all fields of study including physical and
social sciences, humanities, business, etc.
• While methods vary by discipline, the emphasis on
ensuring accurate and honest collection remains
the same.
What is Data Collection?
• In Statistics, data collection is a process of
gathering information from all the relevant
sources to find a solution to the research problem.
• It helps to evaluate the outcome of the problem.
The data collection methods allow a person to
conclude an answer to the relevant question.
• Most organizations use data collection methods to make assumptions about future probabilities and trends. Once the data is collected, it must go through the data organization process.
• The data gathered through these methods can be classified into two types, namely primary data and secondary data.
• The primary importance of data collection in
any research or business process is that it
helps to determine many important things
about the company, particularly the
performance. So, the data collection process
plays an important role in all the streams.
Depending on the type of data, the data collection
method is divided into two categories namely,
• Primary Data Collection methods
• Secondary Data Collection methods
Primary Data Collection Methods
• Primary data or raw data is a type of information that
is obtained directly from the first-hand source through
– Experiments,
– Surveys or
– Observations.
• The primary data collection method is further
classified into two types.
They are
• Quantitative Data Collection Methods
• Qualitative Data Collection Methods
Quantitative Data Collection Methods

• It is based on mathematical calculations, using various formats such as close-ended questions and measures like
– Correlation and regression methods,
– Mean, median, or mode.
This method is cheaper than qualitative data collection methods and can be applied in a short duration of time.
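A minimal R sketch of these quantitative measures, using a small made-up sample (the numbers below are hypothetical, for illustration only):

# Hypothetical close-ended survey ratings (1-5) and respondent ages
responses <- c(4, 5, 3, 4, 5, 2, 4, 5, 4, 3)
ages      <- c(23, 31, 27, 35, 29, 22, 40, 33, 28, 26)

mean(responses)                        # mean
median(responses)                      # median
names(which.max(table(responses)))     # mode (most frequent value)
cor(ages, responses)                   # correlation between the two variables
lm(responses ~ ages)                   # simple linear regression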
Qualitative Data Collection Methods

• It does not involve any mathematical calculations.
• This method is closely associated with elements that are not quantifiable.
• Qualitative data collection methods include interviews, questionnaires, observations, case studies, etc. There are several methods to collect this type of data.
Observation Method
• Observation method is used when the study
relates to behavioral science. This method is
planned systematically. It is subject to many
controls and checks. The different types of
observations are:
– Structured and unstructured observation
– Controlled and uncontrolled observation
– Participant, non-participant and disguised
observation
Interview Method
This method collects data in the form of verbal responses. It is achieved in two ways:
• Personal Interview – In this method, a person known
as an interviewer is required to ask questions face to
face to the other person. The personal interview can
be structured or unstructured, direct investigation,
focused conversation, etc.
• Telephonic Interview – In this method, an interviewer
obtains information by contacting people on the
telephone to ask the questions or views, verbally.
Questionnaire Method
In this method, a set of questions is mailed to the respondents, who should read, reply, and subsequently return the questionnaire. The questions are printed in a definite order on the form. A good survey should have the following features:
• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
• Should have good physical appearance such as colour, quality
of the paper to attract the attention of the respondent
Schedules
• This method is similar to the questionnaire method, with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove misunderstandings, if any come up. Enumerators should be trained to perform their job with diligence and patience.
Secondary Data Collection Methods

• Secondary data is data collected by someone other than the actual user. It means that the information is already available, and someone else analyses it. Secondary data includes magazines, newspapers, books, journals, etc. It may be either published or unpublished data.
Published data are available in various resources
including
• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals
Unpublished data includes
• Diaries
• Letters
• Unpublished biographies, etc.
Sources of data

• Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, for use in later stages of data analysis.
• In the process of big data analysis, “Data
collection” is the initial step before starting
to analyze the patterns or useful information
in data. The data which is to be analyzed
must be collected from different valid
sources.
• The data that is collected is known as raw data. Raw data is not directly useful; once it is cleaned and used for further analysis it becomes information, and the insight obtained from that information is known as “knowledge”. Knowledge can take many forms, such as business knowledge about sales of enterprise products, disease treatment, etc. The main goal of data collection is to collect information-rich data.
• Data collection starts with asking some questions, such as what type of data is to be collected and what the source of collection is. Most of the data collected is of two types: “qualitative data”, a group of non-numerical data such as words and sentences that mostly focuses on the behavior and actions of a group, and “quantitative data”, which is in numerical form and can be calculated using different scientific tools and sampling methods.
The actual data is then further divided mainly
into two types known as:
• Primary data
• Secondary data
1. Primary data:
• Raw, original data extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden in data processing.
• Few methods of collecting primary data:
• 1. Interview method
• 2. Survey method
• 3. Observation method
• 4. Experimental method
1. Interview method:
• The data is collected by interviewing the target audience: the person asking the questions is called the interviewer, and the person who answers is known as the interviewee. Some basic business- or product-related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, such as personal or formal interviews conducted by telephone, face to face, email, etc.
2. Survey method:
• The survey method is the process of research in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video. Surveys can be carried out both online and offline, for example through website forms and email, and the answers are then stored for analysis. Examples are online surveys or surveys through social media polls.
3. Observation method:
• The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using some data-collecting tool, and stores the observed data in the form of text, audio, video, or other raw formats. In this method, the data is collected directly by posing a few questions to the participants; for example, observing a group of customers and their behavior towards the products. The data obtained is then sent for processing.
4. Experimental method:
• The experimental method is the process of
collecting data through performing experiments,
research, and investigation. The most frequently
used experiment methods are
1. CRD
2. RBD
3. LSD
4. FD
• CRD – Completely Randomized Design is a simple experimental design used in data analytics that is based on randomization and replication. It is mostly used for comparing experiments.
• RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector (a short R sketch of an RBD analysis appears after this list).
• LSD – Latin Square Design is an experimental design that is similar to CRD and RBD but arranges the blocks in rows and columns. It is an arrangement of N×N squares with an equal number of rows and columns, containing letters that occur only once in each row, so differences can be found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
• FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with possible values, and trials are performed over the combinations of those factor levels.
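As an illustration of how an RBD experiment can be analysed with ANOVA, here is a minimal R sketch on a small hypothetical yield experiment (three treatments observed in four blocks; the numbers are made up):

# Hypothetical randomized block design: 3 treatments x 4 blocks
yield     <- c(20, 22, 19, 21, 25, 27, 26, 24, 30, 29, 31, 28)
treatment <- factor(rep(c("A", "B", "C"), each = 4))
block     <- factor(rep(1:4, times = 3))

rbd_model <- aov(yield ~ treatment + block)   # block is removed as a nuisance factor
summary(rbd_model)                            # ANOVA table with treatment and block effects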
2. Secondary data:
Secondary data is data that has already been collected and is reused for some valid purpose. This type of data was previously derived from primary data, and it has two types of sources, namely internal and external.
• Internal source
• External source
Internal source:
• These types of data can easily be found
within the organization such as market
record, a sales record, transactions, customer
data, accounting resources, etc. The cost and
time consumption is less in obtaining internal
sources.
External source:
• The data which can’t be found at internal
organizations and can be gained through external
third party resources is external source data. The
cost and time consumption is more because this
contains a huge amount of data. Examples of
external sources are Government publications,
news publications, Registrar General of India,
planning commission, international labor bureau,
syndicate services, and other non-governmental
publications
Other sources:
• Sensors data: With the advancement of IoT devices, the
sensors of these devices collect data which can be used
for sensor data analytics to track the performance and
usage of products.
• Satellite data: Satellites collect large volumes of images and data, running into terabytes each day, which can be processed to extract useful information.
• Web traffic: Thanks to fast and cheap internet, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data about the keywords and queries searched most often.
Technology forecasting method
Data collection and APIs
https://www.r-bloggers.com/2022/03/how-to-get-twitter-data-using-r/
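The link above covers one specific example (Twitter data in R). As a general sketch of the usual pattern for collecting data from a web API in R using the httr and jsonlite packages (the endpoint, query parameters, and token below are hypothetical placeholders, not a real service):

# Generic API data collection pattern (hypothetical endpoint and token)
library(httr)       # for HTTP requests
library(jsonlite)   # for parsing JSON responses

resp <- GET("https://api.example.com/v1/posts",                  # hypothetical URL
            query = list(q = "data science", limit = 100),       # request parameters
            add_headers(Authorization = "Bearer <YOUR_TOKEN>"))  # authentication, if required

stop_for_status(resp)                                             # stop early on HTTP errors
records <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
head(records)                                                     # inspect the collected data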
Exploring and fixing data
What is Data Exploration?

• Data exploration is the first step of data analysis, used to explore and visualize data to uncover insights from the start or to identify areas or patterns to dig into further. Using interactive dashboards and point-and-click data exploration, users can better understand the bigger picture and get to insights faster.
• Data exploration is the first step in data analysis
involving the use of data visualization tools and
statistical techniques to uncover data set
characteristics and initial patterns.
• During exploration, raw data is typically reviewed
with a combination of manual workflows and
automated data-exploration techniques to
visually explore data sets, look for similarities,
patterns and outliers and to identify the
relationships between different variables.
• This is also sometimes referred to as exploratory
data analysis, which is a statistical technique
employed to analyze raw data sets in search of their
broad characteristics.
Why is Data Exploration Important?
• Humans are visual learners, able to process visual
data much more easily than numerical data.
Consequently, it's challenging for data scientists
to review thousands of rows of data points and
infer meaning without assistance.
• Data visualization tools and elements like colors,
shapes, lines, graphs and angles aid in effective
data exploration of metadata, enabling
relationships or anomalies to be detected
What industries use data exploration?
• Any business or industry that collects or utilizes data can
benefit from data exploration. A few common industries
include software development, healthcare and education.
• The advanced visualization techniques employed by data
exploration and business intelligence tools enable businesses
and stakeholders to better understand performance metrics
by
making raw data more comprehensible and creating a "story"
around it.
• By visualizing patterns and finding commonalities in complex
data flows, data exploration can help enterprises make data-
driven decisions to streamline processes, better target their
ideal audience, increase productivity and achieve greater
returns.
Data exploration vs. data mining
• In data science, there are two primary methods for extracting data
from disparate sources: data exploration and data mining.
• Data exploration is a broad process performed by business users and an increasing number of citizen data scientists with no formal training in data science or analytics, but whose jobs depend on understanding data trends and patterns. Visualization tools help this wide-ranging group to better explore and examine a variety of metrics and data sets.
• Data mining is a specific process, usually undertaken by data professionals. Data analysts create association rules and parameters to sort through extremely large data sets and identify patterns and future trends.
• Typically, data exploration is performed first to assess the relationships between variables. Then the data mining begins, and through this process data models are created to gather additional insights.
How machine learning is applied to data exploration
• Machine learning can significantly aid in data exploration
when large quantities of data are involved. However, for a
machine learning model to be accurate, data analysts must
take the following steps before performing the analysis:
• Identify and define all variables in the data set.
• Conduct univariate analysis for single variables, using a
histogram, box plot or scatter plot. For categorical variables
(those that can be grouped by category), bar charts can be
used.
• Conduct bivariate analysis, to determine the relationship
between pairs of variables. This can be completed using
data visualization tools, such as Tableau.
• Account for any missing values and outliers
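A minimal R sketch of the steps listed above, using the built-in airquality data set (base R graphics are used here instead of a tool such as Tableau):

# Exploration steps on the built-in airquality data set
data(airquality)
str(airquality)                            # identify and define the variables
hist(airquality$Ozone)                     # univariate analysis: distribution of one variable
boxplot(airquality$Wind)                   # univariate analysis: spread and possible outliers
plot(airquality$Temp, airquality$Ozone)    # bivariate analysis: relationship between two variables
colSums(is.na(airquality))                 # account for missing values in each column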
What is the best language for data exploration?

• The languages most commonly used for statistical work in data exploration are the R programming language and Python; both are open source data analytics languages.
• While R is best for statistical analysis, Python is better suited for machine learning algorithms.
Data exploration tools
• Data exploration tools make data analysis easier to present
and understand through interactive, visual elements, making
it easier to share and communicate key insights.
• Data exploration tools include data visualization software
and business intelligence platforms, such as
Microsoft Power BI, Qlik and Tableau.
• Available open source data exploration tools can also
incorporate regression functionality, data profiling and
visualization capabilities, which enables businesses to
integrate various, disparate data sources for faster data
exploration.
• Some popular open source tools include Knime, OpenRefine,
NodeXL, Pentaho, R programming and RapidMiner.
Data Preparation
• Data preparation is the process of cleaning and
transforming raw data prior to processing and
analysis.
• It is an important step prior to processing and often
– involves reformatting data,
– making corrections to data and
– the combining of data sets to enrich data.
• Data preparation is often a lengthy undertaking for
data professionals or business users, but it is
essential as a prerequisite to put data in context in
order to turn it into insights and eliminate bias
resulting from poor data quality.
• The data preparation process usually includes
– standardizing data formats,
– enriching source data, and/or
– removing outliers.
• Data Preparation is the process of
– collecting,
– cleaning, and
– consolidating data into one file or data table,
primarily for use in analysis.
Why Prepare Data?
• There are several reasons why we need to prepare the data.
By preparing data, we actually prepare the miner so that
when using prepared data, the miner produces better
models faster.
1. Good data is essential for producing efficient models of any type.
2. Data should be formatted according to the required software tool.
3. Data needs to be made adequate for the given method.
4. Data in the real world is dirty.
• Incomplete data: Some data lack attribute values,
lacking certain attributes of interest, or containing
only aggregate data.
For example, First name = “” or Last name = “”
• Noisy: Some data contains errors.
For example, Age = -10
• Inconsistent: Some data contain discrepancies in
codes and names
For example, Age = 56, Birthdate = ’04–05–1995’
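A minimal R sketch of spotting the three kinds of dirty data described above, using a small hypothetical table (the records are made up for illustration):

# Hypothetical records with incomplete, noisy and inconsistent values
people <- data.frame(
  first_name = c("Asha", "", "Ravi"),      # incomplete: missing first name
  age        = c(34, -10, 56),             # noisy: an impossible age
  birth_date = as.Date(c("1989-02-11", "1990-07-30", "1995-05-04"))
)

people$first_name[people$first_name == ""] <- NA    # mark empty strings as missing
people$age[people$age < 0] <- NA                    # remove impossible ages

# inconsistent: recorded age disagrees with the age implied by the birth date
implied_age <- as.integer(format(Sys.Date(), "%Y")) - as.integer(format(people$birth_date, "%Y"))
which(!is.na(people$age) & abs(people$age - implied_age) > 1)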
Benefits of Data Preparation:
• Fix errors quickly — Data preparation helps catch
errors before processing. After data has been
removed from its original source, these errors
become more difficult to understand and correct.
• Produce top-quality data — Cleaning and
reformatting datasets ensures that all data used in
analysis will be high quality.
• Make better business decisions — Higher-quality data that can be processed and analyzed more quickly and efficiently leads to more timely, efficient, and high-quality business decisions.
Data preparation steps:
The major tasks in data preparation are as follows:
1) Data discretization
2) Data cleaning
3) Data integration
4) Data transformation
5) Data reduction
1) Data discretization:
• It is a part of data reduction that has particular importance, especially for numerical data.
2) Data cleaning:
• Manual data prep is error-prone, time-consuming, and costly. Business decisions rely on analytics, but if the data is inaccurate or incomplete, your analytics inform wrong business decisions, and bad analytics means poor business decisions. Altair Monarch is programmed with over 80 pre-built data preparation functions to speed up arduous data cleansing projects.
3) Data integration:
• Access data from any source, no matter the origin, format, or narrative, and integrate it together. Monarch excels at intelligently and automatically extracting data from complex unstructured and semi-structured sources, like PDFs. Increased access to data means less manual work, faster insights, and faster time to value realized by your organization.
4) Data transformation:
• Being able to quickly change the way data is summarized and presented enables business analysts and executives to quickly consider different perspectives and views of data. Monarch makes it easy to package your clean and blended data for insightful reporting you can confidently share with the rest of your organization.
• 5) Data reduction:
• It obtains reduced representation in volume but
produces the same or similar analytical results.
Feature Engineering

• Feature engineering refers to manipulation — addition, deletion, combination, mutation — of your data set to improve machine learning model training, leading to better performance and greater accuracy. Effective feature engineering is based on sound knowledge of the business problem and the available data sources.
Feature Engineering in ML Lifecycle
Some common types of feature engineering include:

• Scaling and normalization
• Filling missing values
• Feature selection
• Feature coding
• Feature construction
• Feature extraction
• Scaling and normalization means adjusting the
range and center of data to ease learning and
improve the interpretation of the results.
• Feature coding involves choosing a set of symbolic values to represent
different categories. Concepts can be captured with a single column that
comprises multiple values, or they can be captured with multiple
columns, each of which represents a single value and has a true or false
in each field. For example, feature coding can indicate whether a
particular row of data was collected on a holiday. This is a form of
feature construction.
• Feature construction creates a new feature(s) from one or more other
features. For example, using the date you can add a feature that
indicates the day of the week. With this added insight, the algorithm
could discover that certain outcomes are more likely on a Monday or a
weekend.
• Feature extraction means moving from low-level features that are
unsuitable for learning — practically speaking, you get poor testing
results — to higher-level features that are useful for learning. Often
feature extraction is valuable when you have specific data formats — like
images or text — that have to be converted to a tabular row-column,
example-feature format.
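A minimal R sketch of three of the feature engineering ideas described above (scaling, feature coding, and feature construction), on a small hypothetical orders table:

# Hypothetical data set for feature engineering
orders <- data.frame(
  amount   = c(120, 35, 560, 89),
  category = c("food", "travel", "food", "books"),
  date     = as.Date(c("2023-01-02", "2023-01-07", "2023-01-09", "2023-01-14"))
)

# Scaling and normalization: centre the numeric feature and give it unit variance
orders$amount_scaled <- as.numeric(scale(orders$amount))

# Feature coding: one TRUE/FALSE column per category
orders$is_food   <- orders$category == "food"
orders$is_travel <- orders$category == "travel"

# Feature construction: derive the day of the week from the date
orders$weekday <- weekdays(orders$date)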
Filling missing values
Feature selection
Feature coding
Feature construction
Combine existing features
Feature extraction
Feature Extraction
Facial Expression Detection
https://sefiks.com/2018/01/10/real-time-facial-expression-recognition-on-streaming-data/
Data Visualization in R.

• Data visualization is the technique used to deliver insights in data using visual cues such as graphs, charts, maps, and many others. This is useful as it aids intuitive and easy understanding of large quantities of data and thereby helps make better decisions regarding it.
Data Visualization in R Programming Language

• Popular data visualization tools include Tableau, Plotly, R, Google Charts, Infogram, and Kibana. The various data visualization platforms have different capabilities, functionality, and use cases, and they also require different skill sets. This section discusses the use of R for data visualization.
• R is a language designed for statistical computing, graphical data analysis, and scientific research. It is usually preferred for data visualization as it offers flexibility and requires minimal coding through its packages.
Types of Data Visualizations
Some of the various types of visualizations offered by
R are:
Bar Plot
Histogram
Box Plot
Scatter Plot
Heat Map
3D Graphs in R
• Consider the following airquality data set for visualization in R:

Ozone  Solar.R  Wind  Temp  Month  Day
41     190      7.4   67    5      1
36     118      8.0   72    5      2
12     149      12.6  74    5      3
18     313      11.5  62    5      4
NA     NA       14.3  56    5      5
28     NA       14.9  66    5      6
Bar Plot
• There are two types of bar plots, horizontal and vertical, which represent data points as horizontal or vertical bars of certain lengths proportional to the value of the data item. They are generally used for continuous and categorical variable plotting. By setting the horiz parameter to TRUE or FALSE, we can get horizontal or vertical bar plots respectively.
• Example:

# Horizontal Bar Plot for
# Ozone concentration in air
barplot(airquality$Ozone,
        main = 'Ozone Concentration in air',
        xlab = 'ozone levels', horiz = TRUE)

# Vertical Bar Plot for
# Ozone concentration in air
barplot(airquality$Ozone,
        main = 'Ozone Concentration in air',
        xlab = 'ozone levels',
        col = 'blue', horiz = FALSE)
Bar plots are used for the following scenarios:
• To perform a comparative study between the
various data categories in the data set.
• To analyze the change of a variable over time in
months or years.
Histogram
• A histogram is like a bar chart as it uses bars of varying height to represent data distribution. However, in a histogram values are grouped into consecutive intervals called bins. In a histogram, continuous values are grouped and displayed in these bins, whose size can be varied.

# Histogram for Maximum Daily Temperature
data(airquality)

hist(airquality$Temp,
     main = "La Guardia Airport's Maximum Temperature (Daily)",
     xlab = "Temperature (Fahrenheit)",
     xlim = c(50, 125), col = "yellow",
     freq = TRUE)
Histograms are used in the following scenarios:
• To verify an equal and symmetric distribution
of the data.
• To identify deviations from expected values.
• Box Plot
• The statistical summary of the given data is
presented graphically using a boxplot. A
boxplot depicts information like the minimum
and maximum data point, the median value,
first and third quartile, and interquartile
range.
# Box plot for average wind speed
data(airquality)

boxplot(airquality$Wind,
        main = "Average wind speed at La Guardia Airport",
        xlab = "Miles per hour", ylab = "Wind",
        col = "orange", border = "brown",
        horizontal = TRUE, notch = TRUE)
• Multiple box plots can also be generated at
once through the following code:

# Multiple Box plots, each representing
# an Air Quality Parameter
boxplot(airquality[, 1:4],
        main = 'Box Plots for Air Quality Parameters')
Box Plots are used for:
• To give a comprehensive statistical description of
the data through a visual cue.
• To identify the outlier points that do not lie in the
inter-quartile range of data.
Scatter Plot
• A scatter plot is composed of many points on a
Cartesian plane. Each point denotes the value
taken by two parameters and helps us easily
identify the relationship between them.
# Scatter plot for Ozone Concentration per month
data(airquality)

plot(airquality$Ozone, airquality$Month,
     main = "Scatterplot Example",
     xlab = "Ozone Concentration in parts per billion",
     ylab = "Month of observation", pch = 19)
Scatter Plots are used in the following
scenarios:
• To show whether an association exists
between bivariate data.
• To measure the strength and direction of such
a relationship.
Heat Map
• Heatmap is defined as a graphical representation of
data using colors to visualize the value of the
matrix. heatmap() function is used to plot heatmap.
• Syntax: heatmap(data)
• Parameters: data: It represent matrix data, such as
values of rows and columns
• Return: This function draws a heatmap.
# Set seed for reproducibility
set.seed(110)

# Create example data
data <- matrix(rnorm(50, 0, 5), nrow = 5, ncol = 5)

# Column and row names
colnames(data) <- paste0("col", 1:5)
rownames(data) <- paste0("row", 1:5)

# Draw a heatmap
heatmap(data)
Map visualization in R
• Here we are using the maps package to visualize and display geographical maps in the R programming language.
• install.packages("maps")
# Read dataset and convert it into
# a data frame
data <- read.csv("worldcities.csv")
df <- data.frame(data)

# Load the required libraries
library(maps)
map(database = "world")

# marking points on the map (longitude on the x-axis, latitude on the y-axis)
points(x = df$lng[1:500], y = df$lat[1:500], col = "Red")
• 3D Graphs in R
• Here we will use the persp() function, which creates 3D surfaces in perspective view. This function draws perspective plots of a surface over the x–y plane.
• Syntax: persp(x, y, z)
• Parameters: This function accepts different parameters, i.e. x, y, and z, where x and y are vectors defining the locations along the x- and y-axes, and z is a matrix giving the height of the surface above the (x, y) grid.
• Return Value: persp() returns the viewing transformation matrix for projecting 3D coordinates (x, y, z) into the 2D plane using homogeneous 4D coordinates (x, y, z, t).
# Define the surface: a cone
cone <- function(x, y) {
  sqrt(x ^ 2 + y ^ 2)
}

# prepare variables
x <- y <- seq(-1, 1, length = 30)
z <- outer(x, y, cone)

# plot the 3D surface, adding titles and labeling axes
persp(x, y, z,
      main = "Perspective Plot of a Cone",
      zlab = "Height",
      theta = 30, phi = 15,
      col = "orange", shade = 0.4)
Advantages of Data Visualization in R:

R has the following advantages over other tools for data visualization:
• R offers a broad collection of visualization libraries along with extensive online guidance on their usage.
• R also offers data visualization in the form of 3D
models and multipanel charts.
• Through R, we can easily customize our data
visualization by changing axes, fonts, legends,
annotations, and labels.
Disadvantages of Data Visualization in R:

R also has the following disadvantages:
• R is only preferred for data visualization when done on an individual standalone server.
• Data visualization using R is slow for large amounts of data as compared to other counterparts.
Application Areas:

• Presenting analytical conclusions of the data to the non-analyst departments of your company.
• Health monitoring devices use data visualization to track
any anomaly in blood pressure, cholesterol and others.
• To discover repeating patterns and trends in consumer and
marketing data.
• Meteorologists use data visualization for assessing
prevalent weather changes throughout the world.
• Real-time maps and geo-positioning systems use
visualization for traffic monitoring and estimating travel
time.
