
Introduction to Data Science

• Data Science Terminology
• Data Science Process
• Data Science Project Roles
What is Data Science?

• Data Science is the area of study which involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes.
• It helps you to discover hidden patterns in raw data.
• The term Data Science has emerged because of the evolution of mathematical statistics, data analysis, and big data.
What is Big Data?

• Big Data is a collection of data that is huge in volume, yet growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently.
• Data Science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.
• Data science enables you to translate a business problem into a research project and then translate it back into a practical solution.
Why Data Science?
• Data is the oil of today’s world. With the right tools, technologies, and algorithms, we can convert data into a distinct business advantage.
• Data Science can help you detect fraud using advanced machine learning algorithms.
• It helps you prevent significant monetary losses.
• It allows you to build intelligence into machines.
• You can perform sentiment analysis to gauge customer brand loyalty.
• It enables you to take better and faster decisions.
• It helps you recommend the right product to the right customer to enhance your business.
Evolution of Data Science
Data Science Components
Statistics

• Statistics is the most critical unit of Data Science basics; it is the method or science of collecting and analyzing numerical data in large quantities to get useful insights (a short R sketch follows).
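As a small illustration (not part of the original slide text), the following R sketch computes a few basic descriptive statistics on the built-in airquality data set that is used again later in this unit:

# Descriptive statistics on the built-in airquality data set
data(airquality)
mean(airquality$Temp)        # average daily temperature
median(airquality$Temp)      # middle value
sd(airquality$Temp)          # spread around the mean
summary(airquality$Ozone)    # min, quartiles, mean, max (missing values reported)
cor(airquality$Temp, airquality$Ozone, use = "complete.obs")   # linear association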
Visualization

• Visualization techniques help you present huge amounts of data in easy-to-understand, digestible visuals.
Machine Learning

• Machine Learning is a system of computer algorithms that can learn from examples and improve themselves without being explicitly coded by a programmer. Machine learning is a part of artificial intelligence that combines data with statistical tools to predict an output, which can be turned into actionable insights.
• Machine Learning explores the building and study of algorithms that learn to make predictions about unforeseen/future data.
Deep Learning

• Deep Learning is computer software that mimics the network of neurons in a brain. It is a subset of machine learning based on artificial neural networks with representation learning, and it is called deep learning because it makes use of deep neural networks. This learning can be supervised, semi-supervised, or unsupervised.
• Deep Learning is a newer branch of machine learning research in which the algorithm itself selects the analysis model to follow.
Data Science Process
Discovery

• The Discovery step involves acquiring data from all the identified internal and external sources, which helps you answer the business question. The data can be:
• Logs from web servers
• Data gathered from social media
• Census datasets
• Data streamed from online sources using APIs
Preparation

• Data can have many inconsistencies, such as missing values, blank columns, and incorrect data formats, which need to be cleaned. You need to process, explore, and condition data before modelling. The cleaner your data, the better your predictions.
Model Planning

• In this stage, you determine the methods and techniques for drawing relationships between the input variables. Planning for a model is performed using different statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
• Data visualization tools are often cloud-based applications that help you represent raw data in easy-to-understand graphical formats. You can use these programs to produce customizable bar charts, pie charts, column charts, and more.
Top Data Visualization Tools

• FusionCharts
• Power BI
• Whatagraph
• Tableau
• Qlik
Model Building

• In this step, the actual model building process starts. The data scientist splits the dataset into training and testing sets. Techniques such as association, classification, and clustering are applied to the training set. The model, once prepared, is tested against the “testing” dataset.
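As a minimal sketch of this step (the 70/30 split ratio and the example model below are illustrative assumptions, not part of the original slides), the data can be divided into training and testing sets in base R as follows:

# Random 70/30 train/test split on the airquality data set (illustrative)
data(airquality)
set.seed(42)                                         # make the split reproducible
n         <- nrow(airquality)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- airquality[train_idx, ]                 # used to build the model
test_set  <- airquality[-train_idx, ]                # held back for testing

fit  <- lm(Ozone ~ Temp + Wind, data = train_set)    # example model fit on the training set
pred <- predict(fit, newdata = test_set)             # evaluated against the "testing" dataset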
Operationalize

• In this stage, you deliver the final baseline model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.
Communicate Results

• In this stage, the key findings are communicated to all stakeholders. This helps you decide whether the project results are a success or a failure based on the inputs from the model.
Data Science Jobs Roles

• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager
Data Scientist

• Role: A Data Scientist is a professional who manages enormous amounts of data to come up with compelling business insights using various tools, techniques, methodologies, algorithms, etc.
• Languages: R, SAS, Python, SQL, Hive, MATLAB, Pig, Spark
Data Engineer

• Role: A data engineer works with large amounts of data. He or she develops, constructs, tests, and maintains architectures such as large-scale processing systems and databases.
• Languages: SQL, Hive, R, SAS, MATLAB, Python, Java, Ruby, C++, and Perl
Data Analyst

• Role: A data analyst is responsible for mining vast amounts of data. They look for relationships, patterns, and trends in the data, and then deliver compelling reporting and visualization so the business can take the most viable decisions.
• Languages: R, Python, HTML, JS, C, C++, SQL
Statistician

• Role: The statistician collects, analyses, and interprets qualitative and quantitative data using statistical theories and methods.
• Languages: SQL, R, MATLAB, Tableau, Python, Perl, Spark, and Hive
Data Administrator

• Role: The data admin ensures that the database is accessible to all relevant users, is performing correctly, and is kept safe from hacking.
• Languages: Ruby on Rails, SQL, Java, C#, and Python
Business Analyst
• Role: This professional improves business processes and acts as an intermediary between the business executive team and the IT department.
• Languages: SQL, Tableau, Power BI, and Python
Tools for Data Science

• Data Analysis: R, Spark, Python, SAS
• Data Warehousing: Hadoop, SQL, Hive
• Data Visualization: R, Tableau, Raw
• Machine Learning: Spark, Azure ML Studio, Mahout
Difference Between Data Science and BI (Business Intelligence)

• Perception: Business Intelligence looks backward; Data Science looks forward.
• Data Sources: Business Intelligence uses structured data (mostly SQL, sometimes a data warehouse); Data Science uses structured and unstructured data, like logs, SQL, NoSQL, or text.
• Approach: Business Intelligence uses statistics and visualization; Data Science uses statistics, machine learning, and graph analysis.
• Emphasis: Business Intelligence emphasizes the past and present; Data Science emphasizes analysis and neuro-linguistic programming.
• Tools: Business Intelligence uses Pentaho, Microsoft BI, and QlikView; Data Science uses R and TensorFlow.
Applications of Data Science
Internet Search:
• Google Search uses Data Science technology to return a specific result within a fraction of a second.
Recommendation Systems:
• Recommendation systems such as “suggested friends” on Facebook or “suggested videos” on YouTube are built with the help of Data Science.
Image & Speech Recognition:
• Speech recognition systems like Siri, Google Assistant, and Alexa run on Data Science techniques. Moreover, Facebook recognizes your friends when you upload a photo with them, with the help of Data Science.
Gaming world:
• EA Sports, Sony, and Nintendo use Data Science technology to enhance your gaming experience. Games are now developed using Machine Learning techniques, and they can update themselves as you move to higher levels.
Online Price Comparison:
• PriceRunner, Junglee, and Shopzilla work on Data Science mechanisms: data is fetched from the relevant websites using APIs.
Challenges of Data Science Technology

• A high variety of information & data is required for accurate


analysis
• Not adequate data science talent pool available
• Management does not provide financial support for a data
science team
• Unavailability of/difficult access to data
• Business decision-makers do not effectively use data Science
results
• Explaining data science to others is difficult
• Privacy issues
• Lack of significant domain expert
• If an organization is very small, it can’t have a Data Science team
Data Collection and Management

• Introduction
• Sources of data
• Data collection and APIs
• Exploring and fixing data
• Data storage and management
• Using multiple data sources
• Data Preparation
• Feature Engineering
• Data Visualization in R
Data collection is the process of gathering and
measuring information on variables of interest, in
an established systematic fashion that enables one
to answer stated research questions, test
hypotheses, and evaluate outcomes.
• The data collection component of research is
common to all fields of study including physical and
social sciences, humanities, business, etc.
• While methods vary by discipline, the emphasis on
ensuring accurate and honest collection remains
the same.
What is Data Collection?
• In Statistics, data collection is a process of
gathering information from all the relevant
sources to find a solution to the research problem.
• It helps to evaluate the outcome of the problem.
The data collection methods allow a person to
conclude an answer to the relevant question.
• Most organizations use data collection methods to make assumptions about future probabilities and trends. Once the data is collected, it must go through the data organization process.
• The data gathered through these methods can be classified into two types, namely primary data and secondary data.
• The primary importance of data collection in
any research or business process is that it
helps to determine many important things
about the company, particularly the
performance. So, the data collection process
plays an important role in all the streams.
Depending on the type of data, the data collection
method is divided into two categories namely,
• Primary Data Collection methods
• Secondary Data Collection methods
Primary Data Collection Methods
• Primary data or raw data is a type of information that
is obtained directly from the first-hand source through
– Experiments,
– Surveys or
– Observations.
• The primary data collection method is further
classified into two types.
They are
• Quantitative Data Collection Methods
• Qualitative Data Collection Methods
Quantitative Data Collection Methods

• It is based on mathematical calculations, using various formats such as close-ended questions and measures like
– Correlation and regression methods,
– Mean, median, or mode.
This method is cheaper than qualitative data collection methods and can be applied in a short duration of time.
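A minimal R sketch of these quantitative measures, using a small made-up sample (the numbers below are hypothetical, for illustration only):

# Hypothetical close-ended survey ratings (1-5) and respondent ages
responses <- c(4, 5, 3, 4, 5, 2, 4, 5, 4, 3)
ages      <- c(23, 31, 27, 35, 29, 22, 40, 33, 28, 26)

mean(responses)                        # mean
median(responses)                      # median
names(which.max(table(responses)))     # mode (most frequent value)
cor(ages, responses)                   # correlation between the two variables
lm(responses ~ ages)                   # simple linear regression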
Qualitative Data Collection Methods

• It does not involve any mathematical calculations.
• This method is closely associated with elements that are not quantifiable.
• Qualitative data collection methods include interviews, questionnaires, observations, case studies, etc. There are several methods to collect this type of data.
Observation Method
• Observation method is used when the study
relates to behavioral science. This method is
planned systematically. It is subject to many
controls and checks. The different types of
observations are:
– Structured and unstructured observation
– Controlled and uncontrolled observation
– Participant, non-participant and disguised
observation
Interview Method
This method collects data in the form of verbal responses. It is achieved in two ways:
• Personal Interview – In this method, a person known
as an interviewer is required to ask questions face to
face to the other person. The personal interview can
be structured or unstructured, direct investigation,
focused conversation, etc.
• Telephonic Interview – In this method, an interviewer
obtains information by contacting people on the
telephone to ask the questions or views, verbally.
Questionnaire Method
In this method, a set of questions is mailed to the respondents, who should read, reply, and subsequently return the questionnaire. The questions are printed in a definite order on the form. A good survey should have the following features:
• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
• Should have good physical appearance such as colour, quality
of the paper to attract the attention of the respondent
Schedules
• This method is similar to the questionnaire method, with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove misunderstandings, if any come up. Enumerators should be trained to perform their job with diligence and patience.
Secondary Data Collection Methods

• Secondary data is data collected by someone other than the actual user. It means that the information is already available, and someone else analyses it. Secondary data includes magazines, newspapers, books, journals, etc. It may be either published or unpublished data.
Published data are available in various resources
including
• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals
Unpublished data includes
• Diaries
• Letters
• Unpublished biographies, etc.
Sources of data

• Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, for use in later stages of data analysis.
• In the process of big data analysis, “Data
collection” is the initial step before starting
to analyze the patterns or useful information
in data. The data which is to be analyzed
must be collected from different valid
sources.
• The data that is collected is known as raw data. Raw data is not directly useful; once it is cleaned and used for further analysis it becomes information, and the insight obtained from that information is known as “knowledge”. Knowledge can take many forms, such as business knowledge about sales of enterprise products, disease treatment, etc. The main goal of data collection is to collect information-rich data.
• Data collection starts with asking some questions, such as what type of data is to be collected and what the source of collection is. Most of the data collected is of two types: “qualitative data”, a group of non-numerical data such as words and sentences that mostly focuses on the behavior and actions of a group, and “quantitative data”, which is in numerical form and can be calculated using different scientific tools and sampling methods.
The actual data is then further divided mainly
into two types known as:
• Primary data
• Secondary data
1. Primary data:
• Raw, original data extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden in data processing.
• Few methods of collecting primary data:
• 1. Interview method
• 2. Survey method
• 3. Observation method
• 4. Experimental method
1. Interview method:
• The data is collected by interviewing the target audience: the person asking the questions is called the interviewer, and the person who answers is known as the interviewee. Some basic business- or product-related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, such as personal or formal interviews conducted by telephone, face to face, email, etc.
2. Survey method:
• The survey method is the process of research in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video. Surveys can be carried out both online and offline, for example through website forms and email, and the answers are then stored for analysis. Examples are online surveys or surveys through social media polls.
3. Observation method:
• The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using some data-collecting tool, and stores the observed data in the form of text, audio, video, or other raw formats. In this method, the data is collected directly by posing a few questions to the participants; for example, observing a group of customers and their behavior towards the products. The data obtained is then sent for processing.
4. Experimental method:
• The experimental method is the process of
collecting data through performing experiments,
research, and investigation. The most frequently
used experiment methods are
1. CRD
2. RBD
3. LSD
4. FD
• CRD – Completely Randomized Design is a simple experimental design used in data analytics that is based on randomization and replication. It is mostly used for comparing experiments.
• RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector (a short R sketch of an RBD analysis appears after this list).
• LSD – Latin Square Design is an experimental design that is similar to CRD and RBD but arranges the blocks in rows and columns. It is an arrangement of N×N squares with an equal number of rows and columns, containing letters that occur only once in each row, so differences can be found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
• FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with possible values, and trials are performed over the combinations of those factor levels.
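As an illustration of how an RBD experiment can be analysed with ANOVA, here is a minimal R sketch on a small hypothetical yield experiment (three treatments observed in four blocks; the numbers are made up):

# Hypothetical randomized block design: 3 treatments x 4 blocks
yield     <- c(20, 22, 19, 21, 25, 27, 26, 24, 30, 29, 31, 28)
treatment <- factor(rep(c("A", "B", "C"), each = 4))
block     <- factor(rep(1:4, times = 3))

rbd_model <- aov(yield ~ treatment + block)   # block is removed as a nuisance factor
summary(rbd_model)                            # ANOVA table with treatment and block effects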
2. Secondary data:
Secondary data is data that has already been collected and is reused for some valid purpose. This type of data was previously derived from primary data, and it has two types of sources, namely internal and external.
• Internal source
• External source
Internal source:
• These types of data can easily be found
within the organization such as market
record, a sales record, transactions, customer
data, accounting resources, etc. The cost and
time consumption is less in obtaining internal
sources.
External source:
• The data which can’t be found at internal
organizations and can be gained through external
third party resources is external source data. The
cost and time consumption is more because this
contains a huge amount of data. Examples of
external sources are Government publications,
news publications, Registrar General of India,
planning commission, international labor bureau,
syndicate services, and other non-governmental
publications
Other sources:
• Sensors data: With the advancement of IoT devices, the
sensors of these devices collect data which can be used
for sensor data analytics to track the performance and
usage of products.
• Satellite data: Satellites collect large volumes of images and data, running into terabytes each day, which can be processed to extract useful information.
• Web traffic: Thanks to fast and cheap internet, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data about the keywords and queries searched most often.
Technology forecasting method
Data collection and APIs
https://www.r-bloggers.com/2022/03/how-to-get-twitter-data-using-r/
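The link above covers one specific example (Twitter data in R). As a general sketch of the usual pattern for collecting data from a web API in R using the httr and jsonlite packages (the endpoint, query parameters, and token below are hypothetical placeholders, not a real service):

# Generic API data collection pattern (hypothetical endpoint and token)
library(httr)       # for HTTP requests
library(jsonlite)   # for parsing JSON responses

resp <- GET("https://api.example.com/v1/posts",                  # hypothetical URL
            query = list(q = "data science", limit = 100),       # request parameters
            add_headers(Authorization = "Bearer <YOUR_TOKEN>"))  # authentication, if required

stop_for_status(resp)                                             # stop early on HTTP errors
records <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
head(records)                                                     # inspect the collected data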
Exploring and fixing data
What is Data Exploration?

• Data exploration is the first step of data analysis, used to explore and visualize data to uncover insights from the start or to identify areas or patterns to dig into further. Using interactive dashboards and point-and-click data exploration, users can better understand the bigger picture and get to insights faster.
• Data exploration is the first step in data analysis
involving the use of data visualization tools and
statistical techniques to uncover data set
characteristics and initial patterns.
• During exploration, raw data is typically reviewed
with a combination of manual workflows and
automated data-exploration techniques to
visually explore data sets, look for similarities,
patterns and outliers and to identify the
relationships between different variables.
• This is also sometimes referred to as exploratory
data analysis, which is a statistical technique
employed to analyze raw data sets in search of their
broad characteristics.
Why is Data Exploration Important?
• Humans are visual learners, able to process visual
data much more easily than numerical data.
Consequently, it's challenging for data scientists
to review thousands of rows of data points and
infer meaning without assistance.
• Data visualization tools and elements like colors,
shapes, lines, graphs and angles aid in effective
data exploration of metadata, enabling
relationships or anomalies to be detected
What industries use data exploration?
• Any business or industry that collects or utilizes data can
benefit from data exploration. A few common industries
include software development, healthcare and education.
• The advanced visualization techniques employed by data
exploration and business intelligence tools enable businesses
and stakeholders to better understand performance metrics
by
making raw data more comprehensible and creating a "story"
around it.
• By visualizing patterns and finding commonalities in complex
data flows, data exploration can help enterprises make data-
driven decisions to streamline processes, better target their
ideal audience, increase productivity and achieve greater
returns.
Data exploration vs. data mining
• In data science, there are two primary methods for extracting data
from disparate sources: data exploration and data mining.
• Data exploration is a broad process performed by business users and an increasing number of citizen data scientists with no formal training in data science or analytics, but whose jobs depend on understanding data trends and patterns. Visualization tools help this wide-ranging group to better explore and examine a variety of metrics and data sets.
• Data mining is a specific process, usually undertaken by data professionals. Data analysts create association rules and parameters to sort through extremely large data sets and identify patterns and future trends.
• Typically, data exploration is performed first to assess the relationships between variables. Then the data mining begins, and through this process data models are created to gather additional insights.
How machine learning is applied to data exploration
• Machine learning can significantly aid in data exploration
when large quantities of data are involved. However, for a
machine learning model to be accurate, data analysts must
take the following steps before performing the analysis:
• Identify and define all variables in the data set.
• Conduct univariate analysis for single variables, using a
histogram, box plot or scatter plot. For categorical variables
(those that can be grouped by category), bar charts can be
used.
• Conduct bivariate analysis, to determine the relationship
between pairs of variables. This can be completed using
data visualization tools, such as Tableau.
• Account for any missing values and outliers
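A minimal R sketch of the steps listed above, using the built-in airquality data set (base R graphics are used here instead of a tool such as Tableau):

# Exploration steps on the built-in airquality data set
data(airquality)
str(airquality)                            # identify and define the variables
hist(airquality$Ozone)                     # univariate analysis: distribution of one variable
boxplot(airquality$Wind)                   # univariate analysis: spread and possible outliers
plot(airquality$Temp, airquality$Ozone)    # bivariate analysis: relationship between two variables
colSums(is.na(airquality))                 # account for missing values in each column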
What is the best language for data exploration?

• The languages most commonly used for statistical work in data exploration are the R programming language and Python; both are open source data analytics languages.
• While R is best for statistical analysis, Python is better suited for machine learning algorithms.
Data exploration tools
• Data exploration tools make data analysis easier to present
and understand through interactive, visual elements, making
it easier to share and communicate key insights.
• Data exploration tools include data visualization software
and business intelligence platforms, such as
Microsoft Power BI, Qlik and Tableau.
• Available open source data exploration tools can also
incorporate regression functionality, data profiling and
visualization capabilities, which enables businesses to
integrate various, disparate data sources for faster data
exploration.
• Some popular open source tools include Knime, OpenRefine,
NodeXL, Pentaho, R programming and RapidMiner.
Data Preparation
• Data preparation is the process of cleaning and
transforming raw data prior to processing and
analysis.
• It is an important step prior to processing and often
– involves reformatting data,
– making corrections to data and
– the combining of data sets to enrich data.
• Data preparation is often a lengthy undertaking for
data professionals or business users, but it is
essential as a prerequisite to put data in context in
order to turn it into insights and eliminate bias
resulting from poor data quality.
• The data preparation process usually includes
– standardizing data formats,
– enriching source data, and/or
– removing outliers.
• Data Preparation is the process of
– collecting,
– cleaning, and
– consolidating data into one file or data table,
primarily for use in analysis.
Why Prepare Data?
• There are several reasons why we need to prepare the data.
By preparing data, we actually prepare the miner so that
when using prepared data, the miner produces better
models faster.
1. Good data is essential for producing efficient models of any type.
2. Data should be formatted according to the required software tool.
3. Data needs to be made adequate for the given method.
4. Data in the real world is dirty.
• Incomplete data: Some data lack attribute values,
lacking certain attributes of interest, or containing
only aggregate data.
For example, First name = “” or Last name = “”
• Noisy: Some data contains errors.
For example, Age = -10
• Inconsistent: Some data contain discrepancies in
codes and names
For example, Age = 56, Birthdate = ’04–05–1995’
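A minimal R sketch of spotting the three kinds of dirty data described above, using a small hypothetical table (the records are made up for illustration):

# Hypothetical records with incomplete, noisy and inconsistent values
people <- data.frame(
  first_name = c("Asha", "", "Ravi"),      # incomplete: missing first name
  age        = c(34, -10, 56),             # noisy: an impossible age
  birth_date = as.Date(c("1989-02-11", "1990-07-30", "1995-05-04"))
)

people$first_name[people$first_name == ""] <- NA    # mark empty strings as missing
people$age[people$age < 0] <- NA                    # remove impossible ages

# inconsistent: recorded age disagrees with the age implied by the birth date
implied_age <- as.integer(format(Sys.Date(), "%Y")) - as.integer(format(people$birth_date, "%Y"))
which(!is.na(people$age) & abs(people$age - implied_age) > 1)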
Benefits of Data Preparation:
• Fix errors quickly — Data preparation helps catch
errors before processing. After data has been
removed from its original source, these errors
become more difficult to understand and correct.
• Produce top-quality data — Cleaning and
reformatting datasets ensures that all data used in
analysis will be high quality.
• Make better business decisions — Higher-quality data that can be processed and analyzed more quickly and efficiently leads to more timely, efficient, and high-quality business decisions.
Data preparation steps:
The major tasks in data preparation are as follows:
1) Data discretization
2) Data cleaning
3) Data integration
4) Data transformation
5) Data reduction
1) Data discretization:
• It is a part of data reduction that has particular importance, especially for numerical data.
2) Data cleaning:
• Manual data prep is error-prone, time-consuming, and costly. Business decisions rely on analytics, but if the data is inaccurate or incomplete, your analytics inform wrong business decisions, and bad analytics means poor business decisions. Altair Monarch is programmed with over 80 pre-built data preparation functions to speed up arduous data cleansing projects.
3) Data integration:
• Access data from any source, no matter the origin, format, or narrative, and integrate it together. Monarch excels at intelligently and automatically extracting data from complex unstructured and semi-structured sources, like PDFs. Increased access to data means less manual work, faster insights, and faster time to value realized by your organization.
4) Data transformation:
• Being able to quickly change the way data is summarized and presented enables business analysts and executives to quickly consider different perspectives and views of data. Monarch makes it easy to package your clean and blended data for insightful reporting you can confidently share with the rest of your organization.
• 5) Data reduction:
• It obtains reduced representation in volume but
produces the same or similar analytical results.
Feature Engineering

• Feature engineering refers to manipulation — addition, deletion, combination, mutation — of your data set to improve machine learning model training, leading to better performance and greater accuracy. Effective feature engineering is based on sound knowledge of the business problem and the available data sources.
Feature Engineering in ML Lifecycle
Some common types of feature engineering include:

• Scaling and normalization
• Filling missing values
• Feature selection
• Feature coding
• Feature construction
• Feature extraction
• Scaling and normalization means adjusting the
range and center of data to ease learning and
improve the interpretation of the results.
• Feature coding involves choosing a set of symbolic values to represent
different categories. Concepts can be captured with a single column that
comprises multiple values, or they can be captured with multiple
columns, each of which represents a single value and has a true or false
in each field. For example, feature coding can indicate whether a
particular row of data was collected on a holiday. This is a form of
feature construction.
• Feature construction creates a new feature(s) from one or more other
features. For example, using the date you can add a feature that
indicates the day of the week. With this added insight, the algorithm
could discover that certain outcomes are more likely on a Monday or a
weekend.
• Feature extraction means moving from low-level features that are
unsuitable for learning — practically speaking, you get poor testing
results — to higher-level features that are useful for learning. Often
feature extraction is valuable when you have specific data formats — like
images or text — that have to be converted to a tabular row-column,
example-feature format.
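A minimal R sketch of three of the feature engineering ideas described above (scaling, feature coding, and feature construction), on a small hypothetical orders table:

# Hypothetical data set for feature engineering
orders <- data.frame(
  amount   = c(120, 35, 560, 89),
  category = c("food", "travel", "food", "books"),
  date     = as.Date(c("2023-01-02", "2023-01-07", "2023-01-09", "2023-01-14"))
)

# Scaling and normalization: centre the numeric feature and give it unit variance
orders$amount_scaled <- as.numeric(scale(orders$amount))

# Feature coding: one TRUE/FALSE column per category
orders$is_food   <- orders$category == "food"
orders$is_travel <- orders$category == "travel"

# Feature construction: derive the day of the week from the date
orders$weekday <- weekdays(orders$date)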
Filling missing values
Feature selection
Feature coding
Feature construction
Combine existing features
Feature extraction
Feature Extraction
Facial Expression Detection
https://sefiks.com/2018/01/10/real-time-facial-expression-recognition-on-streaming-data/
Data Visualization in R.

• Data visualization is the technique used to deliver insights in data using visual cues such as graphs, charts, maps, and many others. This is useful as it aids intuitive and easy understanding of large quantities of data and thereby helps make better decisions regarding it.
Data Visualization in R Programming Language

• Popular data visualization tools include Tableau, Plotly, R, Google Charts, Infogram, and Kibana. The various data visualization platforms have different capabilities, functionality, and use cases, and they also require different skill sets. This section discusses the use of R for data visualization.
• R is a language designed for statistical computing, graphical data analysis, and scientific research. It is usually preferred for data visualization as it offers flexibility and requires minimal coding through its packages.
Types of Data Visualizations
Some of the various types of visualizations offered by
R are:
Bar Plot
Histogram
Box Plot
Scatter Plot
Heat Map
3D Graphs in R
• Consider the following airquality data set for visualization in R:

Ozone  Solar.R  Wind  Temp  Month  Day
41     190      7.4   67    5      1
36     118      8.0   72    5      2
12     149      12.6  74    5      3
18     313      11.5  62    5      4
NA     NA       14.3  56    5      5
28     NA       14.9  66    5      6
Bar Plot
• There are two types of bar plots, horizontal and vertical, which represent data points as horizontal or vertical bars of certain lengths proportional to the value of the data item. They are generally used for continuous and categorical variable plotting. By setting the horiz parameter to TRUE or FALSE, we can get horizontal or vertical bar plots respectively.
• Example:

# Horizontal Bar Plot for
# Ozone concentration in air
barplot(airquality$Ozone,
        main = 'Ozone Concentration in air',
        xlab = 'ozone levels', horiz = TRUE)

# Vertical Bar Plot for
# Ozone concentration in air
barplot(airquality$Ozone,
        main = 'Ozone Concentration in air',
        xlab = 'ozone levels',
        col = 'blue', horiz = FALSE)
Bar plots are used for the following scenarios:
• To perform a comparative study between the
various data categories in the data set.
• To analyze the change of a variable over time in
months or years.
Histogram
• A histogram is like a bar chart as it uses bars of varying height to represent data distribution. However, in a histogram values are grouped into consecutive intervals called bins. In a histogram, continuous values are grouped and displayed in these bins, whose size can be varied.

# Histogram for Maximum Daily Temperature
data(airquality)

hist(airquality$Temp,
     main = "La Guardia Airport's Maximum Temperature (Daily)",
     xlab = "Temperature (Fahrenheit)",
     xlim = c(50, 125), col = "yellow",
     freq = TRUE)
Histograms are used in the following scenarios:
• To verify an equal and symmetric distribution
of the data.
• To identify deviations from expected values.
• Box Plot
• The statistical summary of the given data is
presented graphically using a boxplot. A
boxplot depicts information like the minimum
and maximum data point, the median value,
first and third quartile, and interquartile
range.
# Box plot for average wind speed
data(airquality)

boxplot(airquality$Wind,
        main = "Average wind speed at La Guardia Airport",
        xlab = "Miles per hour", ylab = "Wind",
        col = "orange", border = "brown",
        horizontal = TRUE, notch = TRUE)
• Multiple box plots can also be generated at
once through the following code:

# Multiple Box plots, each representing
# an Air Quality Parameter
boxplot(airquality[, 1:4],
        main = 'Box Plots for Air Quality Parameters')
Box Plots are used for:
• To give a comprehensive statistical description of
the data through a visual cue.
• To identify the outlier points that do not lie in the
inter-quartile range of data.
Scatter Plot
• A scatter plot is composed of many points on a
Cartesian plane. Each point denotes the value
taken by two parameters and helps us easily
identify the relationship between them.
# Scatter plot for Ozone Concentration per month
data(airquality)

plot(airquality$Ozone, airquality$Month,
     main = "Scatterplot Example",
     xlab = "Ozone Concentration in parts per billion",
     ylab = "Month of observation", pch = 19)
Scatter Plots are used in the following
scenarios:
• To show whether an association exists
between bivariate data.
• To measure the strength and direction of such
a relationship.
Heat Map
• Heatmap is defined as a graphical representation of
data using colors to visualize the value of the
matrix. heatmap() function is used to plot heatmap.
• Syntax: heatmap(data)
• Parameters: data: It represent matrix data, such as
values of rows and columns
• Return: This function draws a heatmap.
# Set seed for reproducibility
set.seed(110)

# Create example data
data <- matrix(rnorm(50, 0, 5), nrow = 5, ncol = 5)

# Column and row names
colnames(data) <- paste0("col", 1:5)
rownames(data) <- paste0("row", 1:5)

# Draw a heatmap
heatmap(data)
Map visualization in R
• Here we are using the maps package to visualize and display geographical maps in the R programming language.
• install.packages("maps")
# Read dataset and convert it into
# a data frame
data <- read.csv("worldcities.csv")
df <- data.frame(data)

# Load the required libraries
library(maps)
map(database = "world")

# marking points on the map (longitude on the x-axis, latitude on the y-axis)
points(x = df$lng[1:500], y = df$lat[1:500], col = "Red")
• 3D Graphs in R
• Here we will use the persp() function, which creates 3D surfaces in perspective view. This function draws perspective plots of a surface over the x–y plane.
• Syntax: persp(x, y, z)
• Parameters: This function accepts different parameters, i.e. x, y, and z, where x and y are vectors defining the locations along the x- and y-axes, and z is a matrix giving the height of the surface above the (x, y) grid.
• Return Value: persp() returns the viewing transformation matrix for projecting 3D coordinates (x, y, z) into the 2D plane using homogeneous 4D coordinates (x, y, z, t).
# Define the surface: a cone
cone <- function(x, y) {
  sqrt(x ^ 2 + y ^ 2)
}

# prepare variables
x <- y <- seq(-1, 1, length = 30)
z <- outer(x, y, cone)

# plot the 3D surface, adding titles and labeling axes
persp(x, y, z,
      main = "Perspective Plot of a Cone",
      zlab = "Height",
      theta = 30, phi = 15,
      col = "orange", shade = 0.4)
Advantages of Data Visualization in R:

R has the following advantages over other tools for data visualization:
• R offers a broad collection of visualization libraries along with extensive online guidance on their usage.
• R also offers data visualization in the form of 3D
models and multipanel charts.
• Through R, we can easily customize our data
visualization by changing axes, fonts, legends,
annotations, and labels.
Disadvantages of Data Visualization in R:

R also has the following disadvantages:
• R is only preferred for data visualization when done on an individual standalone server.
• Data visualization using R is slow for large amounts of data as compared to other counterparts.
Application Areas:

• Presenting analytical conclusions of the data to the non-analyst departments of your company.
• Health monitoring devices use data visualization to track
any anomaly in blood pressure, cholesterol and others.
• To discover repeating patterns and trends in consumer and
marketing data.
• Meteorologists use data visualization for assessing
prevalent weather changes throughout the world.
• Real-time maps and geo-positioning systems use
visualization for traffic monitoring and estimating travel
time.
