BECE352E Module 1-2 Slides

BECE352E - IoT Domain Analyst

1
Topics
• Module 1: Data Models
• Module 2: Data Preprocessing and EDA
• Module 3: ML & Cloud Computing for IoT
• Module 4: IoT-Cloud Convergence
• Module 5: Smart Computing over IoT-Cloud
• Module 6: User-Centric IoT Architecture
• Module 7: Value Engineering and Analysis
2
Topics in Module-1
• SDLC
• Development Models – Waterfall
• Rapid Application Development
• Agile
• Spiral Models, Object-Relational databases
• Database models

3
Software Development Life Cycle (SDLC)
• A software life cycle model (also termed a process model) is a pictorial and diagrammatic representation of the software life cycle.
• A life cycle model represents all the activities required for a software product to transition through its life cycle stages.
• A life cycle model maps the various activities performed on a software product from its inception to retirement.
• Need: The development team must determine a suitable life cycle model for a particular project and then adhere to it.

4
Software Development Life Cycle (SDLC)
• Stage 1: Planning and requirement analysis: Planning for the quality assurance requirements and identification of the risks associated with the project is also done at this stage.
• The business analyst and project organizer set up a meeting with the client to gather all the data: what the customer wants to build, who the end users will be, and what the objective of the product is.
• Stage 2: Defining requirements: Represent and document the software requirements and get them approved by the project stakeholders.
• "SRS" - the Software Requirement Specification document, which contains all the product requirements to be constructed and developed during the project life cycle.
5
Software Development Life Cycle (SDLC)
• Stage 3: Designing the software: Bring together all the knowledge of requirements and analysis into the design of the software project.
• This phase builds on the outputs of the previous two stages: the inputs from the customer and the gathered requirements.
• Stage 4: Developing the project: The actual development begins, and the code is written.
• The design is implemented by writing code.
• Developers have to follow the coding guidelines described by their management, and programming tools like compilers, interpreters, debuggers, etc. are used to develop and implement the code.
6
Software Development Life Cycle (SDLC)
• Stage 5: Testing: After the code is generated, it is tested against the requirements to make sure that the product solves the needs addressed and gathered during the requirements stage.
• During this stage, unit testing, integration testing, system testing, and acceptance testing are done.
• Stage 6: Deployment: Once the software is certified and no bugs or errors are reported, it is deployed.
• Based on the assessment, the software may be released as it is or with the suggested enhancements.
• Stage 7: Maintenance: Once the client starts using the developed system, the real issues come up and need to be resolved from time to time.
7
SDLC Models
• The Software Development Life Cycle (SDLC) is used in project management to define the stages included in an information system development project, from an initial feasibility study to the maintenance of the completed application.
• There are different SDLC models specified and designed, which are followed during the software development phase.
• These models are also called "Software Development Process Models." Each process model follows a series of phases unique to its type to ensure success in the software development process.

8
Waterfall model
• The phases always follow in this order and do not overlap. The developer must complete every phase before the next phase begins.
• "Waterfall Model": the diagrammatic representation resembles a cascade of waterfalls.
• Circumstances where the Waterfall model is best suited:
• When the requirements are constant and not changed regularly.
• The project is short.
• The project environment is stable.
• The tools and technology used are consistent and not changing.
• When resources are well prepared and available to use.
9
Waterfall model
• This model has five phases: requirements analysis and specification; design; implementation and unit testing; integration and system testing; and operation and maintenance.
• 1. Requirements analysis and specification phase:
• This phase aims to understand the exact requirements of the customer and to document them properly.
• Both the customer and the software developer work together to document all the functions, performance, and interfacing requirements of the software.
• It describes the "what" of the system to be produced and not the "how".
• In this phase, a large document called the Software Requirement Specification (SRS) is created, which contains a detailed description of what the system will do, in plain, natural language.
10
Waterfall model
• 2. Design phase: This phase aims to transform the requirements gathered in the SRS into a suitable form that permits further coding in a programming language.
• It defines the overall software architecture together with the high-level and detailed design.
• All this work is documented as a Software Design Document (SDD).
• 3. Implementation and unit testing: During this phase, the design is implemented.
• If the SDD is complete, the implementation or coding phase proceeds smoothly, because all the information needed by the software developers is contained in the SDD.
• During testing, the code is thoroughly examined and modified. Small modules are tested in isolation initially. After that, these modules are tested by writing some overhead code to check the interaction between the modules and the flow of intermediate outputs.
11
Waterfall model
• 4. Integration and system testing: This phase is highly crucial, as the quality of the end product is determined by the effectiveness of the testing carried out.
• A better output leads to satisfied customers, lower maintenance costs, and accurate results. Unit testing determines the efficiency of individual modules.
• In this phase, however, the modules are tested for their interactions with each other and with the system.
• 5. Operation and maintenance phase: Maintenance covers the activities performed once the software has been delivered to the customer, installed, and made operational.
12
Waterfall model
Advantages of the Waterfall model
• This model is simple to implement, and the number of resources required for it is minimal.
• The requirements are simple and explicitly declared; they remain unchanged during the entire project development.
• The start and end points for each phase are fixed, which makes it easy to track progress.
• The release date for the complete product, as well as its final cost, can be determined before development.
• It gives easy control and clarity for the customer due to a strict reporting system.
13
Waterfall model
Disadvantages of the Waterfall model
• In this model, the risk factor is higher, so it is not suitable for larger and more complex projects.
• This model cannot accommodate changes in requirements during development.
• It becomes tough to go back to a previous phase. For example, if the application has moved to the coding phase and there is a change in requirements, it becomes tough to go back and change it.
• Since testing is done at a later stage, it does not allow identifying the challenges and risks in the earlier phases, so a risk reduction strategy is difficult to prepare.
14
Rapid Application Development Model(RAD)
• RAD is a linear sequential software development process model that emphasizes a concise development cycle using an element-based construction approach.
• If the requirements are well understood and described, and the project scope is constrained, the RAD process enables a development team to create a fully functional system within a concise time.
• When to use the RAD model:
• When the system can be modularized and delivered in a short period (2-3 months).
• When the requirements are well known.
• When the technical risk is limited.
• It should be used only if the budget allows the use of automatic code-generating tools.
15
Rapid Application Development Model(RAD)
• RAD (Rapid Application Development) is a concept that products can be developed faster and with higher quality through:
• Gathering requirements using workshops or focus groups
• Prototyping and early, iterative user testing of designs
• The re-use of software components
• A rigidly paced schedule that defers design improvements to the next product version
• Less formality in reviews and other team communication
16
Rapid Application Development Model(RAD)
• 1. Business Modeling: The information flow among
business functions is defined by answering
questions like what data drives the business process,
what data is generated, who generates it, where the
information goes, who processes it, and so on.
• 2. Data Modeling: The data collected from business modeling is refined into a set of data objects (entities) that are needed to support the business. The attributes (characteristics of each entity) are identified, and the relations between these data objects (entities) are defined.

17
Rapid Application Development Model(RAD)
• 3. Process Modeling: The information objects defined
in the data modeling phase are transformed to achieve
the data flow necessary to implement a business
function. Processing descriptions are created for
adding, modifying, deleting, or retrieving a data
object.
• 4. Application Generation: Automated tools are used to facilitate the construction of the software, often using fourth-generation language (4GL) techniques.
• 5. Testing & Turnover: Many of the programming components have already been tested, since RAD emphasizes reuse. This reduces the overall testing time. However, the new components must be tested, and all interfaces must be fully exercised.
18
Rapid Application Development Model(RAD)
Advantages of the RAD Model
• This model is flexible to change.
• In this model, changes can be accommodated.
• Each phase in RAD delivers the highest-priority functionality to the customer.
• It reduces development time.
• It increases the reusability of features.
Disadvantages of the RAD Model
• It requires highly skilled designers.
• Not all applications are compatible with RAD.
• The RAD model cannot be used for smaller projects.
• It is not suitable when the technical risk is high.
• It requires strong user involvement.

19
Agile Model
• The “Agile process model” refers to a software development approach based on iterative development.
• Agile methods break tasks into smaller iterations, or parts, that do not directly involve long-term planning.
• The project scope and requirements are laid down at the beginning of the development process.
• The plans regarding the number of iterations, the duration, and the scope of each iteration are clearly defined in advance.
• Each iteration is considered a short time “frame” in the Agile process model, typically lasting from one to four weeks.

20
Agile Model
• The division of the entire project into smaller parts helps to minimize the project risk and reduce the overall project delivery time.
• Each iteration involves a team working through a full software development life cycle, including planning, requirements analysis, design, coding, and testing, before a working product is demonstrated to the client.
When to use Agile:
• When frequent changes are required.
• When a highly qualified and experienced team is available.
• When the customer is ready to meet with the software team all the time.
• When the project size is small.
21
Agile Model
• 1. Requirements gathering: In this phase, you must define the requirements. You should explain the business opportunities and plan the time and effort needed to build the project. Based on this information, you can evaluate technical and economic feasibility.
• 2. Design the requirements: When you have identified the project, work with stakeholders to define the requirements. You can use a user flow diagram or a high-level UML diagram to show how new features will work and how they will apply to your existing system.

22
Agile Model
• 3. Construction/iteration: When the team has defined the requirements, the work begins. Designers and developers start working on the project, which aims to deploy a working product. The product will undergo various stages of improvement, so initially it includes simple, minimal functionality.
• 4. Testing: In this phase, the Quality Assurance team examines the product's performance and looks for bugs.
• 5. Deployment: In this phase, the team issues a product for the user's work environment.
• 6. Feedback: After releasing the product, the last step is feedback. Here, the team receives feedback about the product and works through it.
23
Agile Model
Advantages (Pros) of the Agile Method:
• Frequent delivery.
• Face-to-face communication with clients.
• Efficient design that fulfills the business requirements.
• Changes are acceptable at any time.
• It reduces total development time.
Disadvantages (Cons) of the Agile Model:
• Due to the shortage of formal documentation, confusion can arise, and crucial decisions taken throughout the various phases can be misinterpreted at any time by different team members.
• Due to the lack of proper documentation, once the project is completed and the developers are allotted to another project, maintenance of the finished project can become difficult.
24
Spiral Model
• The spiral model is an evolutionary software process model that couples the iterative feature of prototyping with the controlled and systematic aspects of the linear sequential model.
• It provides the potential for rapid development of new versions of the software.
• Using the spiral model, the software is developed in a series of incremental releases.
• During the early iterations, the incremental release may be a paper model or prototype. During later iterations, more and more complete versions of the engineered system are produced.
25
Spiral Model
• Objective setting: Each cycle in the spiral starts with the identification of the purpose for that cycle, the various alternatives that are possible for achieving the targets, and the constraints that exist.
• Risk assessment and reduction: The next phase in the cycle is to evaluate these various alternatives against the goals and constraints. Evaluation in this stage focuses on the risks perceived for the project.
• Development and validation: The next phase is to develop strategies that resolve the uncertainties and risks. This process may include activities such as benchmarking, simulation, and prototyping.
26
Spiral Model
• Planning: Finally, the next step is planned. The project is reviewed, and a choice is made whether to continue with a further loop of the spiral. If the decision is to continue, plans are drawn up for the next phase of the project.
• The development phase depends on the remaining risks. For example, if performance or user-interface risks are considered more significant than the program development risks, the next phase may be an evolutionary development that includes developing a more detailed prototype for resolving the risks.
27
Spiral Model
When to use the Spiral Model:
• When frequent releases are required.
• When the project is large.
• When requirements are unclear and complex.
• When changes may be required at any time.
• For large and high-budget projects.
• Advantages
• A high amount of risk analysis.
• Useful for large and mission-critical projects.
• Disadvantages
• Can be a costly model to use.
• Risk analysis requires highly specific expertise.
• Doesn't work well for smaller projects.
28
Object and Object-Relational Databases
Object-Oriented Databases:
• They are designed to store and manipulate objects.
• They are similar to object-oriented programming languages such as Java and Python. Objects can contain data, methods, and relationships to other objects.
• In an OODB, the object itself is stored, rather than a flattened representation of its data. This allows for more efficient and natural handling of complex data structures and relationships between objects.
29
Object and Object-Relational Databases
Advantages of OODBs
• They work well with object-oriented programming languages.
• OODBs are made to work well with languages like Java and Python. They can work with object-oriented concepts like encapsulation and inheritance.
• They can handle complex data structures well.
• OODBs are good at handling complex data structures because they store objects rather than breaking them down into individual parts.
• They are fast for object-oriented workloads.
• OODBs are fast because they are designed to work with object-oriented programming languages.
• They are easy to model.
• OODBs are easy to model because they allow developers to work with objects instead of worrying about the underlying database structure.
30
Object-Relational Databases
• They are a hybrid between traditional relational databases and OODBs.
• They are designed to handle both structured and unstructured data, much like OODBs, but they also support SQL queries and transactions, much like traditional relational databases.
● OODBs are designed to store and manipulate objects and are well-suited for complex data structures and object-oriented programming languages.
● ORDBs are a hybrid of traditional relational databases and OODBs. They support SQL queries and transactions and are good for structured data and integration with existing systems.
The decision to choose between OODBs and ORDBs depends on specific project requirements.
31
Object-Relational Databases
• Object-Relational Databases (ORDBs) have many advantages −
• They work with SQL.
• ORDBs can use SQL, which is a language that many developers already know. This makes them
more familiar and easier to use.
• They can be integrated.
• ORDBs can be integrated into existing systems, which makes them a good choice for companies that want to upgrade their infrastructure without starting from scratch.
• They are good at handling structured data.
• ORDBs can handle structured data well. They are good for applications that need to do a lot of
searching and sorting.
• They have good support for transactions.
• ORDBs can handle transactions well. Even if there are errors or problems, the data stays
consistent and accurate.
• Examples of ORDBs include PostgreSQL, Oracle Database, and Microsoft SQL Server.
32
Object vs Object-Relational Databases

33
Topics in Module-2
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Significance of Exploratory Data Analysis
• Making sense of Data

34
What is Data preprocessing?

• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• Whenever data is gathered from different sources, it is collected in a raw format, which is not feasible for analysis.
35
What is Data preprocessing?

• A technique of preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models.
• Any type of processing performed on raw data to prepare it for another data processing procedure.
36
Need for Data preprocessing?

• To achieve better results from the applied model in Machine Learning projects, the data has to be in a proper format.
• Some Machine Learning models need information in a specified format; for example, the Random Forest algorithm does not support null values, so to execute the Random Forest algorithm, null values have to be managed in the original raw data set.
• Another aspect is that the data set should be formatted in such a way that more than one Machine Learning or Deep Learning algorithm can be executed on the same data set, and the best of them is chosen.
37
Forms of Data preprocessing

38
Forms of Data preprocessing

Missing Data

Noisy Data

39
Sequence of Data preprocessing

40
Effective Data preprocessing for a ML model

41
Data and related definitions

42
Data and related definitions
Categories of Data

43
Data Cleaning

44
Data Cleaning

45
Data Cleaning

46
Data Cleaning-Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• being inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being recorded
• Missing data may need to be inferred
47
Data Cleaning-How to handle missing data?
• Ignore the tuple: usually done when class label is missing
(assuming the task is classification—not effective in certain cases)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill in the
missing value: smarter
• Use the most probable value to fill in the missing value: inference-
based such as regression, Bayesian formula, decision tree
48
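A minimal pandas sketch of the handling options listed above, using a hypothetical sales table with a missing customer income attribute (column names and values are illustrative, not from the slides):

import pandas as pd

# Hypothetical sales data with missing customer income values.
df = pd.DataFrame({
    "class":  ["gold", "gold", "silver", "silver", "gold"],
    "income": [52000, None, 31000, None, 48000],
})

# Option 1: ignore (drop) tuples with a missing value.
dropped = df.dropna(subset=["income"])

# Option 2: fill with a global constant (here -1 stands in for "unknown").
constant_filled = df["income"].fillna(-1)

# Option 3: fill with the overall attribute mean.
mean_filled = df["income"].fillna(df["income"].mean())

# Option 4: fill with the attribute mean of samples in the same class (smarter).
class_mean_filled = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(class_mean_filled)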
Data Cleaning-Noisy data?
• Random error in a measured variable.
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which requires data cleaning
• duplicate records
• incomplete data
• inconsistent data
50
Data Cleaning-How to handle Noisy data?
• Binning method (bins: intervals that group the sorted values):
• first sort the data and partition it into (equi-depth) bins
• then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
• also used for discretization
• Clustering
• detect and remove outliers
• Semi-automated method: combined computer and human
inspection
• detect suspicious values and check manually
• Regression
• smooth by fitting the data into regression functions
51
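A small NumPy sketch of equi-depth binning with smoothing by bin means and by bin boundaries; the attribute values and the bin depth are assumptions chosen for illustration:

import numpy as np

# Example attribute values (e.g., prices); sort them first.
values = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

depth = 3                             # equi-depth: each bin holds 3 values
bins = values.reshape(-1, depth)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.repeat(bins.mean(axis=1), depth)

# Smoothing by bin boundaries: each value snaps to the nearer bin edge.
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo < hi - bins, lo, hi)

print(by_means)
print(by_bounds)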
Data Cleaning-How to handle Inconsistent
data?
• Manual correction using external references
• Semi-automatic using various tools
• To detect violation of known functional dependencies
and data constraints
• To correct redundant data

52
Data Cleaning
•Step 1: Remove duplicate or
irrelevant observations. Remove
unwanted observations from your
dataset, including duplicate
observations or irrelevant
observations. ...
•Step 2: Fix structural errors. ...
•Step 3: Filter unwanted outliers. ...
•Step 4: Handle missing data. ...
•Step 5: Validate and QA.
53
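The five steps above can be sketched with pandas; the DataFrame, its columns, and the IQR outlier rule are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 2, 3, 4],
    "city":  ["NY", "NY", "ny ", "LA", None],
    "value": [10.0, 10.0, 12.0, 500.0, 11.0],
})

# Step 1: remove duplicate or irrelevant observations.
df = df.drop_duplicates()

# Step 2: fix structural errors (inconsistent casing, stray whitespace).
df["city"] = df["city"].str.strip().str.upper()

# Step 3: filter unwanted outliers (here a simple IQR rule on 'value').
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Step 4: handle missing data (drop, or impute as in the earlier sketch).
df = df.dropna(subset=["city"])

# Step 5: validate and QA (basic sanity checks).
assert df["id"].is_unique
assert df["value"].between(0, 1000).all()
print(df)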
Data Cleaning-Benefits
• Removal of errors when there are multiple sources of data.
• Ability to map the different functions and what your data is intended to do.
• Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data in future applications.
• Using tools for data cleaning makes for more efficient business practices and quicker decision-making.
54
Data Integration
• Data integration:
combines data from
multiple sources into a
coherent store
• Schema integration:
• Integrate metadata
from different
sources
• Entity identification
problem: identify
real world entities
from multiple data
sources 55
Data Integration
• Detecting and resolving data
value conflicts
• for the same real world
entity, attribute values
from different sources are
different
• possible reasons: different
representations, different
scales, e.g., metric vs.
British units, different
currency

56
Why is Data Integration important?
• Enormous volumes of data gathered from various sources need to be made meaningful.
• Ease of access for analysis, even when fresh data enters the database every second.
• Integrated data unlocks a layer of connectivity, thereby improving productivity.
• By connecting the systems that contain valuable data and integrating them, data continuity and seamless knowledge transfer can be achieved.
57
Data Integration

58
Ways to create Data Integration
• Creating a data warehouse: Data warehouses allow you to integrate
different sources of data into a master relational database.
• When critical data is collected, stored and easily available, it’s much easier
to assess micro and macro processes, manage operations and make
strategic decisions based on this business intelligence.

59
Ways to create Data Integration
• In this case, data integration works by providing a cohesive and centralized look at the entirety of an organization's information, streamlining the process of gaining business intelligence insights.
• To achieve this, the managed service provider would use a process called ETL.

60
Ways to create Data Integration
• ETL (Extract, Transform, Load): ETL is the process of sending
data from source systems an organization possesses to the data
warehouse where this information will be viewed and used.
• Most data integration systems involve one or more ETL pipelines,
which make data integration easier, simpler, and quicker.

62
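A minimal sketch of an ETL step in Python; the CSV source file, its field names, and the SQLite table standing in for the data warehouse are hypothetical stand-ins for an organization's source systems and warehouse:

import csv
import sqlite3

def extract(path):
    """Extract: read raw records from a source system (here a CSV export)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean and reshape records into the warehouse schema."""
    out = []
    for r in rows:
        temp_f = float(r["temperature_f"])
        out.append((
            r["device_id"].strip(),       # clean stray whitespace
            temp_f,                       # keep the raw reading
            (temp_f - 32) * 5 / 9,        # unify units (Celsius)
        ))
    return out

def load(records, db_path="warehouse.db"):
    """Load: write transformed records into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS readings "
                "(device_id TEXT, temp_f REAL, temp_c REAL)")
    con.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    # Create a tiny illustrative source file, then run the pipeline end to end.
    with open("sensor_export.csv", "w", newline="") as f:
        f.write("device_id,temperature_f\n d1 ,98.6\nd2,212\n")
    load(transform(extract("sensor_export.csv")))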
Ways to create Data Integration
• Building Data Pipelines:
• A variety of built-in data connectors (for data ingestion), pre-defined transformations, and a built-in job scheduler for automating the ETL pipeline.
• Such tools make data integration easier, faster, and more cost-effective by reducing the dependency on human expertise for manual operation.

63
Different types of Data Integration methods
Uniform Data Access:
• With Uniform Data Access, enterprise data can be accessed from very disparate sets and presented uniformly.
• Uniform Data Access does this while allowing the data to stay in its original location.
• It leaves the data in the source system and defines a set that can provide a unified view to various customers across a platform.
• There is zero latency from the source system to the consolidated view.
64
Different types of Data Integration methods
Common Data Storage:
• Common Data Storage (CDS, or a Data Warehouse) is a storage space that enables you to manage and securely store data used by multiple applications or programs.
• It helps collect data from various sources (database files, mainframes, and flat files) and combine them into a central space for management.

65
Different types of Data Integration methods
Application based Integration:
• Application Based Integration solutions are specialized programs that locate, retrieve
and integrate your data.
• Application Based Integration accesses various data sources and returns integrated
results to the user.
• It has limitations if you are handling large volumes of data and large numbers of
sources because it requires the applications to implement all the integration efforts.

66
Different types of Data Integration methods
Common User Interface: Common User Interface means manually locating the information in each data source and comparing or cross-referencing it yourself in order to get the insight you need.
• The users must deal with different user interfaces and query languages and therefore need detailed knowledge of the location, logical data representation, and data semantics.

An example for Common User Interface

67
Different types of Data Integration methods
Middleware Data Integration:
• Middleware is a layer of software that creates a common platform for all interactions internal and external to the organization: system-to-system, system-to-database, human-to-system, web-based, and mobile-device-based interactions.
• Middleware integration refers to applications that connect two or more other applications.

68
Handling redundant data in Data Integration
• Redundant data occur often when integrating multiple DBs
• The same attribute may have different names in different
databases
• One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
• Redundant data may be able to be detected by correlational
analysis
r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A\,\sigma_B}
• Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
69
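The correlation coefficient above can be computed directly to flag redundant attributes; a NumPy sketch with two illustrative attributes, the second derived from the first:

import numpy as np

# Two attributes suspected of being redundant, e.g. monthly and annual revenue.
A = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
B = np.array([120.0, 144.0, 108.0, 180.0, 132.0])   # here B = 12 * A (a derived attribute)

n = len(A)
r_AB = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)
)

print(r_AB)                      # close to +1.0 -> A and B are highly correlated
print(np.corrcoef(A, B)[0, 1])   # same result via NumPy's built-in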
Advantages of Data Integration
• Elimination of errors
• Saves time
• Better sense of all the available information.
• Streamline the processes and improve the
efficacy of data usage.
• Inter-system cooperation
• Seamless knowledge transfer between systems.
• Data integrity and data quality.
70
Five challenges of Data Integration and their solutions
Solutions to the challenges:
• Clean up your data
• Introduce clear processes for data management
• Back up your data
• Choose the right software to assist you with data integration
• Manage and maintain your data
71
Data Transformation

72
Data Transformation
• Smoothing: Remove noise from the data (binning, clustering, regression).
• Aggregation: Summarization, data cube construction.
• Generalization: Concept hierarchy climbing (low-level, "primitive" data values are replaced by higher-level concepts).

73
Data Transformation
• Normalization: Scaled to fall within a small,
specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given
ones

74
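A NumPy sketch of the three normalization methods listed above, applied to an illustrative attribute (the values and target range are assumptions):

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 986.0])

# Min-max normalization to a new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# Z-score normalization: (value - mean) / standard deviation.
zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# the largest absolute scaled value is below 1.
j = 0
while (np.abs(x) / 10 ** j).max() >= 1:
    j += 1
decimal = x / 10 ** j

print(minmax, zscore, decimal, sep="\n")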
Uses of Data
Transformation

75
Data Transformation

76
Data Reduction

77
Data Reduction-Types

78
Data Reduction-Types

79
Data Reduction-Strategies
• Data cube aggregation: aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection: irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
• Dimensionality reduction: encoding mechanisms are used to reduce the data set size.
• Numerosity reduction: data are replaced or estimated by alternative, smaller data representations such as parametric models, or by non-parametric methods such as clustering, sampling, and the use of histograms.
• Discretization and concept hierarchy generation: raw data values for attributes are replaced by ranges or higher conceptual levels.
80
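A brief sketch of two of the numerosity-reduction ideas above, sampling and histograms, on illustrative synthetic data:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=10_000)   # illustrative attribute values

# Sampling: keep a small random subset that stands in for the full data.
sample = rng.choice(data, size=500, replace=False)

# Histogram: replace raw values by bucket counts (equal-width bins).
counts, edges = np.histogram(data, bins=10)

print(sample.mean(), data.mean())   # the sample approximates the full data
print(counts, edges)                # 10 buckets summarize 10,000 values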
Principal Component Analysis (PCA)
Dimension reduction is defined as:
• The process of converting a data set having vast dimensions into a data set with fewer dimensions.
• It ensures that the converted data set conveys similar information concisely.
In machine learning:
• When two dimensions convey similar information, using both of them also introduces a lot of noise into the system.
• So it is better to use just one dimension. Using dimension reduction techniques:
• We convert the dimensions of the data from 2 dimensions (x1 and x2) to 1 dimension (z1).
• This makes the data relatively easier to explain.

81
Principal Component Analysis (PCA)
Benefits of Dimension reduction:
•It reduces the time required for computation, since fewer dimensions require less computation.
•It eliminates redundant features.
•It improves model performance.
•It compresses the data and thus reduces the storage space requirements.

82
Principal Component Analysis (PCA)
•Principal Component Analysis transforms the variables into a new set of variables called principal components.
•Principal components are linear combinations of the original variables and are orthogonal.

•The first principal component accounts for most of the possible variation in the original data.
•The second principal component captures as much of the remaining variance as possible.
•There can be only two principal components for a two-dimensional data set.
83
Principal Component Analysis (PCA)
The steps involved in the PCA algorithm are as follows:
• Step-01: Get data.
• Step-02: Compute the mean vector (µ).
• Step-03: Subtract the mean from the given data.
• Step-04: Calculate the covariance matrix.
• Step-05: Calculate the eigenvectors and eigenvalues of the covariance matrix.
• Step-06: Choose components and form a feature vector.
• Step-07: Derive the new data set.

84
Principal Component Analysis (PCA)#Solved Example-1
Consider the two dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).
Compute the principal component using PCA Algorithm.
Get data.

85
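A NumPy sketch that follows the PCA steps above for the six patterns of this example; note that the worked slides may divide the covariance by n rather than n-1, which scales the eigenvalues but does not change the direction of the principal component:

import numpy as np

# Step 1: get data (the six two-dimensional patterns).
X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

# Steps 2-3: compute the mean vector and subtract it from the data.
mu = X.mean(axis=0)
Xc = X - mu

# Step 4: calculate the covariance matrix.
cov = np.cov(Xc, rowvar=False)

# Step 5: eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 6: choose the component with the largest eigenvalue (feature vector).
pc1 = eigvecs[:, np.argmax(eigvals)]

# Step 7: derive the new (1-D) data set by projecting onto the principal component.
Z = Xc @ pc1

print(mu, cov, pc1, Z, sep="\n")

For Example 2 further below, the transformed value of the pattern (2, 1) is typically obtained by projecting its mean-subtracted form onto this principal component, i.e. (np.array([2, 1]) - mu) @ pc1.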
Principal Component Analysis (PCA)#Solved Example-1

86
Principal Component Analysis (PCA)#Solved Example-1

87
Principal Component Analysis (PCA)#Solved Example-1

88
Principal Component Analysis (PCA)#Solved Example-1

89
Principal Component Analysis (PCA)#Solved Example-1

90
Principal Component Analysis (PCA)#Solved Example-1

91
Principal Component Analysis (PCA)#Solved Example-1

92
Principal Component Analysis (PCA)#Solved Example-1

93
Principal Component Analysis (PCA)#Solved Example-1

94
Principal Component Analysis (PCA)#Solved Example-2
Use the PCA algorithm to transform the pattern (2, 1) onto the eigenvector from the previous question.

95
Principal Component Analysis (PCA)-Summary

96
Significance of Exploratory Data Analysis (EDA)
• Exploratory Data Analysis provides the context needed to develop an appropriate model and interpret the results correctly.
• Exploratory Data Analysis is essential for tackling specific tasks such as:
•Spotting missing and erroneous data
•Mapping and understanding the underlying structure of your data
•Identifying the most important variables in your dataset
•Testing a hypothesis or checking assumptions related to a specific model
•Establishing a parsimonious model (one that can explain your data using minimum variables)

97
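A minimal pandas pass that covers the EDA tasks listed above; the DataFrame, its columns, and the values are hypothetical IoT readings:

import pandas as pd

# Hypothetical IoT sensor readings.
df = pd.DataFrame({
    "device_id":   ["d1", "d2", "d3", "d4"],
    "temperature": [21.5, 22.0, None, 85.0],
    "humidity":    [40, 42, 41, 10],
})

# Spot missing and erroneous data.
print(df.isna().sum())                    # missing values per column
print(df.describe())                      # ranges reveal implausible values

# Map and understand the underlying structure.
print(df.dtypes)
print(df.head())

# Identify the most important / most related variables.
print(df.corr(numeric_only=True))         # pairwise correlations

# Check an assumption for a specific model, e.g. a roughly linear relation
# between two columns before fitting a linear model.
print(df[["temperature", "humidity"]].corr())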
How important is Exploratory Data Analysis (EDA)?

98
Making sense of Data
Importance of quality data
5 characteristics of quality data
• Validity. The degree to which your data conforms to defined rules or constraints.
• Accuracy. Ensure your data is close to the true values.
• Completeness. The degree to which all required data is known.
• Consistency. Ensure your data is consistent within the same dataset and/or across multiple data sets.
• Uniformity. The degree to which the data is specified using the same unit of measure.
99
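These five characteristics can be expressed as simple automated checks; a sketch with hypothetical columns and rules:

import pandas as pd

# Hypothetical device records.
df = pd.DataFrame({
    "device_id":     ["d1", "d1", "d2"],
    "model":         ["A", "A", "B"],
    "status":        ["active", "inactive", "active"],
    "temperature_c": [21.0, 22.5, 19.0],
    "unit":          ["C", "C", "C"],
    "timestamp":     ["2024-01-01", "2024-01-02", "2024-01-01"],
})

checks = {
    # Validity: values conform to defined rules or constraints.
    "validity":     df["status"].isin(["active", "inactive"]).all(),
    # Accuracy: values are close to plausible true values (range check as a proxy).
    "accuracy":     df["temperature_c"].between(-40, 85).all(),
    # Completeness: all required fields are known.
    "completeness": df[["device_id", "timestamp"]].notna().all().all(),
    # Consistency: the same fact agrees within the dataset.
    "consistency":  (df.groupby("device_id")["model"].nunique() <= 1).all(),
    # Uniformity: one unit of measure per column.
    "uniformity":   df["unit"].nunique() == 1,
}
print(checks)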
Making sense of Data

100
Making sense of Data
Use Case and Examples

101
