AIA 6550 Module 5


Module 5 Guiding Question

This module will teach you about data mining, statistical modeling,
and optimization techniques. These techniques help you gain deeper insights into the
data you analyze.

Guiding Question
How will a comprehensive understanding of the primary data mining techniques make
statistical modeling and optimization valuable for business problem-solving?

Guiding Question Answer


Organizations use qualitative data mining techniques to develop even more granular
insights into their analyzed data and content. These techniques help gather and
organize data and ensure data and content integrity. This normalization process leads
to statistical modeling: a mathematical representation of observed data applied to
understanding and interpreting the information. From there, we can apply the
supporting application technologies. Many of these qualitative data mining techniques
are hybrids drawn from other disciplines, for example, applied statistics and
mathematics of continuous, linear, and discrete types. In Module 6, we will discuss the
quantitative techniques applied to the analyzed data and content.
Using statistical modeling and optimization techniques has dramatically improved an
organization's decision-making ability. Based on the deployment of Big Data Analytics
applications and supporting technologies, the ability to model different views of the data
and content and to utilize one or more modeling optimization techniques has made it
possible for organizations to extract immediate value from this data and content.
Whether the goal is designing new products, streamlining a production process, or
evaluating current customer data as opposed to potential customers, these techniques
have positioned today's organizations to meet the complex business challenges they
face. They allow businesses to learn how data integrity is verified and validated, how
modeling is applied, and how application technologies are used to develop Machine
Learning Algorithms for gaining further insights and guidance. This is compelling for an
organization.

Data Mining Prerequisites


The preceding modules gave us a full understanding of Big Data Analytics applications
and technologies. The second aspect of Big Data Analytics is the Information
Systems and Technology Infrastructure, comprised of Engineered Systems and Cloud-
based technologies. Based on this detailed understanding, and in support of the
knowledge gained, we acquired an applied knowledge of how Big Data Analytics
applications and technologies have been architected, developed, and deployed by
many organizations across different industries and in the public sector.

Data Mining Prerequisite Video


While reviewing the video, consider the following questions:

 What is the goal of applying dimensionality reduction?


 You are analyzing time-ordered data. What process allows you to determine that another
event will likely happen next if a particular event occurs?

Data Mining Prerequisites

With this background, we will focus on Big Data Analytics and qualitative data mining
techniques, specifically for the many organizations that hold massive volumes of
unstructured and structured data and content spanning decades, to help them
understand the monetary value of that data and content and how it can strategically
position them in the marketplace. Before an organization can realize a Big Data Value
Chain model, it must actively assess that data and content, as defined in detail in the
previous module. Machine Learning Algorithms, supported by Hadoop and MapReduce
as stated in the last module, are applied to help prepare this data and content. At that
point, the analyzed data and content can begin to provide the series of actionable steps
needed to generate value and deliver the insights an organization needs to implement
improvements. This can be highly challenging because of the large volumes of analyzed
data and content. From this state, the process of Data Mining, and the technologies
needed to accomplish it, can help organizations detect patterns in the analyzed data
and content and gain insights relevant to their business needs.
There are many data mining techniques that organizations have applied, and continue
to apply, to develop even more granular insights into the analyzed data and content.
Let's learn about those next.

Data Cleaning and Preparation


Data Cleaning and Preparation is a critical component of data mining because raw data
and content must be cleansed and formatted before applying one or more analytical
methods. Data cleaning and preparation includes different elements of data modeling,
transformation, data migration, and data movement for large volumes of heterogeneous
data through the deployment of Extraction, Transformation, and Load (ETL) or
Extraction, Load, and Transformation (ELT). In ETL, data moves from the data source
to staging and then into the data warehouse, which can assist with data privacy and
compliance, cleansing sensitive data, and securing data. ELT, by contrast, uses a data
pipeline to replicate data from a source system directly into a target system, such as a
cloud data warehouse, where it is then transformed. Prepared and cleansed data is
invaluable to an organization relative to the potential monetization of that data for
decision-making.
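As an illustration (not part of the module's materials), the following minimal Python sketch walks through the three ETL steps with pandas; the table and column names are hypothetical, and SQLite stands in for a data warehouse.

```python
# Minimal ETL sketch in Python with pandas (hypothetical column names).
import sqlite3
import pandas as pd

# Extract: raw records as they might arrive from a source system.
raw = pd.DataFrame({
    "customer_id": ["C1001", "C1002", None, "C1004"],
    "amount": ["25.00", "oops", "40.00", "19.99"],
})

# Transform (staging): quarantine incomplete rows, coerce types, mask IDs.
staged = raw.dropna(subset=["customer_id"]).copy()
staged["amount"] = pd.to_numeric(staged["amount"], errors="coerce")
staged = staged.dropna(subset=["amount"])
staged["customer_id"] = "***" + staged["customer_id"].str[-2:]

# Load: write the cleansed data into a warehouse table (SQLite stands in).
with sqlite3.connect("warehouse.db") as conn:
    staged.to_sql("sales_clean", conn, if_exists="replace", index=False)
```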

Tracking Patterns
Tracking Patterns is a fundamental data mining technique. It involves identifying and
monitoring trends or patterns in data to make intelligent inferences about business
outcomes. Once an organization identifies a trend in sales data, for example, there’s a
basis for capitalizing on that insight. Suppose it is determined that a specific product is
selling more than others for a particular demographic. In that case, an organization can
use this knowledge to create similar products or services or simply better stock the
original product for this demographic.
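To make the idea concrete, here is a minimal Python sketch (the data is hypothetical) that surfaces the dominant sales pattern by demographic segment with a simple aggregation:

```python
import pandas as pd

# Toy transactions (hypothetical data): product sales by demographic segment.
sales = pd.DataFrame({
    "segment": ["18-25", "18-25", "26-40", "18-25", "26-40"],
    "product": ["shoes", "shoes", "jacket", "shoes", "shoes"],
    "units":   [3, 5, 2, 4, 1],
})

# Aggregate units by segment and product to surface the dominant pattern.
trend = sales.groupby(["segment", "product"])["units"].sum().sort_values(ascending=False)
print(trend.head(1))  # e.g., shoes dominate the 18-25 segment
```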

Classifications
Classification is a data mining technique that analyzes the various
attributes of different data types. Once an organization identifies the main characteristics of
these data types, it can categorize or classify related data. For example,
this is critical for identifying personally identifiable information that organizations may want
to protect or redact from documents.

Association
Association is a data mining technique related to statistics. It indicates that specific data or
events are linked to other data or data-driven events. It is similar to the notion of co-occurrence
in machine learning, where the occurrence of one data-driven event indicates the likelihood of
another. The statistical concept of correlation is also closely related to the idea of association.
Association means that the data analysis shows a relationship between two data events; for
example, the purchase of a house or an automobile is often followed by the purchase of an
insurance policy.
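The following minimal Python sketch (toy baskets, hypothetical items) estimates one such association as a conditional probability from co-occurrence counts:

```python
from collections import Counter
from itertools import combinations

# Toy purchase baskets (hypothetical): count how often item pairs co-occur.
baskets = [
    {"house", "insurance"},
    {"automobile", "insurance"},
    {"house", "insurance", "automobile"},
    {"automobile"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Estimate P(insurance | automobile) as a simple association measure.
auto = sum("automobile" in b for b in baskets)
both = sum({"automobile", "insurance"} <= b for b in baskets)
print(f"P(insurance | automobile) = {both / auto:.2f}")
```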
Outlier Detection
Outlier Detection identifies anomalies in datasets. Once organizations find
abnormalities in their data, they can work to understand why these anomalies occur
and how to mitigate future occurrences. For example, suppose there is an electrical
power surge in a geographical area due to an intentional or unintentional increase in
voltage in the electrical power supply. Compared to the data points over a 24-hour
cycle, that surge is not part of the period's standard system data trend. A Control Chart
for Mean and Range has a centerline, called the Mean, along with an Upper Control
Limit (UCL) and a Lower Control Limit (LCL); anomalies can be seen in data points
beyond the UCL and LCL.
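As a minimal sketch of that idea, the following Python snippet computes the centerline and three-sigma UCL/LCL for a hypothetical 24-hour voltage series and flags the surge:

```python
import numpy as np

# Hourly voltage readings over a 24-hour cycle (hypothetical values).
readings = np.array([120, 121, 119, 120, 122, 118, 120, 121, 119, 120,
                     121, 120, 158, 119, 120, 121, 120, 119, 122, 120,
                     121, 119, 120, 121], dtype=float)

# Control limits: centerline at the mean, UCL/LCL at mean +/- 3 standard deviations.
mean = readings.mean()
sigma = readings.std(ddof=1)
ucl, lcl = mean + 3 * sigma, mean - 3 * sigma

# Flag anomalies beyond the control limits (the surge at hour 12 here).
outliers = np.where((readings > ucl) | (readings < lcl))[0]
print(f"Mean={mean:.1f}, UCL={ucl:.1f}, LCL={lcl:.1f}, anomalies at hours {outliers}")
```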
Clustering
Clustering is an analytics technique that relies on visual approaches for understanding
data. Clustering mechanisms use graphics to show the distribution of data with respect
to different metrics, often using different colors to distinguish the clusters. Graphical
approaches and their models are best suited to cluster analytics. In the referenced
example, all the data points sit near their cluster mean.
Watch this video on clustering data for additional insights. Make sure to take notes on
important points.

Clustering Data
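For a hands-on illustration, here is a minimal k-means sketch, assuming scikit-learn is installed; the points are hypothetical:

```python
# A minimal k-means clustering sketch using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of 2-D points (hypothetical metrics).
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.1], [7.9, 8.3], [8.2, 7.7]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster membership for each point
print(kmeans.cluster_centers_)  # the mean of each cluster
```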

Regression Techniques
Regression techniques help identify the nature of the relationship between variables in a
dataset. Those relationships could be causal in some instances and correlational in others.
Regression is a straightforward white box technique that reveals how variables are
related, and regression techniques are used in forecasting and data modeling.
The referenced model represents a linear regression that relates two variables.
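A minimal least-squares sketch in Python (hypothetical data) shows how the fitted coefficients make regression a white box:

```python
import numpy as np

# Fit a simple linear model y = a*x + b with least squares (hypothetical data).
x = np.array([1, 2, 3, 4, 5], dtype=float)      # e.g., advertising spend
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])         # e.g., revenue

slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")    # the coefficients are directly readable
```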
Machine Learning and Artificial Intelligence
Machine Learning and Artificial Intelligence (AI) represent some of the most advanced
developments in data mining. Advanced forms of Machine Learning, like Deep
Learning, provide highly accurate predictions when working with vast amounts of data.
Machine Learning can be used for processing data such as imaging, language
recognition, and text-based analytics using natural language processing.

Predictions
Prediction is a powerful aspect of data mining, representing one of four analytics
branches. Predictive analytics uses patterns found in current or historical data to extend
them into the future, providing organizations with insight into what trends will occur next
in their data. There are several different approaches to predictive analytics; for
example, it can be facilitated by Machine Learning and Artificial Intelligence learning
algorithms.

Sequential Patterns
Sequential Patterns are data mining techniques that focus on uncovering a series of
events in a sequence. This is particularly useful for data mining transactional data. This
technique can reveal what items of clothing customers are more likely to buy after an
initial purchase, let’s say, a pair of shoes. In addition, understanding sequential patterns
can help organizations recommend additional items to customers to increase sales.
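As an illustrative sketch (toy purchase histories, hypothetical items), the following Python counts which item most often follows a shoe purchase:

```python
from collections import Counter

# Ordered purchase histories (hypothetical): what follows an initial shoe purchase?
histories = [
    ["shoes", "socks", "jacket"],
    ["shoes", "socks"],
    ["shirt", "shoes", "belt"],
]

followers = Counter()
for h in histories:
    for first, nxt in zip(h, h[1:]):   # consecutive purchase pairs
        if first == "shoes":
            followers[nxt] += 1

print(followers.most_common(1))  # e.g., [('socks', 2)] -> recommend socks after shoes
```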

Decision Trees
Decision Trees are a specific type of predictive model, part of Machine Learning, that
allows organizations to mine data effectively. A decision tree is called a white box
Machine Learning technique because of its transparent logic: it gives users a context for
understanding how the data inputs affect the outputs. When various decision tree
models are combined, they create a predictive analytics model known as a random
forest. Complicated random forest models are considered black box Machine Learning
techniques because of the difficulty of tracing their outputs back to their inputs. The
referenced model depicts a decision tree where decisions are made, with outcomes
being the by-products of those decisions; think of decisions as the cause and outcomes
as the effect.
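To contrast the white box and black box behavior described above, here is a minimal sketch, assuming scikit-learn is installed; the Iris dataset stands in for business data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# A single decision tree: its if/then logic can be printed and inspected.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))

# A random forest combines many trees; its aggregate logic is far harder to trace.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))
```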
Statistical Techniques
Statistical Techniques are at the core of most analytics involved in the data mining
process. The different analytics models are based on statistical concepts, which output
numerical values that apply to specific business objectives. For example, Artificial
Neural Networks use complex statistics, based on learned weights, to distinguish sets
of spatial images, such as recognizing the payload differences of enemy missiles
through image recognition systems.

Data Visualizations
Data Visualizations are another essential element of data mining. They provide users
insights into data in a visual form people can readily perceive. Dynamic visualizations
can stream data in real time, surfacing trends and patterns in the data. The referenced
figure is a graphical statistical model generated with the R language.

Artificial Neural Network (ANN)


An Artificial Neural Network is a specific type of Machine Learning model, used most
often with Deep Learning algorithms. It is one of the most accurate Machine Learning
models deployed, but these models are very complex; applied knowledge of Deep
Learning algorithms is needed to understand how an Artificial Neural Network model
determines an output.
Read this article to learn more about the concept of artificial neural networks.

How Artificial Neural Networks can be Used for Data Mining

Data Warehousing
Data Warehousing is integral to data mining because it stores structured data in
relational database management systems. For informational reporting, the data can
be analyzed through Business Intelligence technologies, for example, Oracle Business
Intelligence or Cognos. These provide an understanding of the data, such as the
'Who,' 'What,' 'Where,' and 'Why,' compared to a data report of rows and columns of
data records. With the advent of Cloud-based technologies, data warehouses and
relational and object-oriented databases have become autonomous and self-managing.
These technologies are built on Artificial Neural Networks and can address
semi-structured and unstructured data stores like Hadoop. The referenced technical
architecture model details the technical aspects of data warehousing and integration
through Robotic Process Automation agents and pseudo-like APIs, which apply varied
data mining techniques, for example, data cleansing and preparation, data visualization
supported by R applications, pattern matching, and Python scripts. The model depicts a
Clinical Systems Information and Technology Infrastructure.
Long-Term Memory Processing
Long-Term Memory Processing is the ability to analyze data over extended periods,
providing well-defined and accurate patterns that extend across time sequences. It can
identify hard-to-detect patterns, for example, by analyzing monetary increases in the
maintenance of legacy Information Technology systems over extended periods.
Many of these qualitative data mining techniques are hybrids drawn from many other
disciplines, such as applied statistics and mathematics of continuous, linear, and
discrete types. Module 6 will discuss the quantitative techniques applied to the analyzed
data and content. The referenced hybrid model provides an overview of the multiple
disciplines contributing to data mining.

Data Mining Techniques Reflection


The following ungraded activity allows you to think critically about which data mining
techniques may provide more significant insights into the data and content.
Consider a scenario in which you work for a major electronics manufacturer and have
been given a critical set of goals and objectives relative to a Big Data Analytics initiative
involving several sub-projects that must be successfully deployed. As the program
manager for these sub-projects, you have prioritized the projects that must be
implemented soon. The first project is Data Cleansing and Preparation.
Upon completing your assessment of the types of data and content your organization
has retained over many years of business operations, you need to go through the
four-step process depicted in the model under that data mining technique so that the
Big Data Analytics applications and technologies can be applied against that data to
provide insights into operations.
As the program manager, complete the following steps and answer the prompts.

1. List four types of data and content that may be present: Relational, Information, Static and
Dynamic Content, and Multi-Media. Describe these examples.
2. Construct a Data Cleansing and Preparation model using the example developed for you, and
assign the four types of data and content you listed.
3. Would the data-mining techniques of Patterns and Classifications provide any value from the
perspective of obtaining a different understanding of the data?

Ensure you put this in your journal/notes for this class.

Statistical Analysis and Modeling
Statistical modeling and optimization techniques have dramatically improved
an organization’s decision-making abilities. Based on the deployment of Big Data
Analytics applications and supporting technologies to address large amounts and varied
types of data and content, this ability to model different views of the data and content
and utilize one or more modeling optimization techniques has made it possible
for organizations to extract immediate value from this data and content.

Statistical Analysis
Statistical analysis of a representative group of customers can provide a
reasonably accurate and cost-effective view of the market and gauge the marketplace
competition. Applied statistics has three sub-disciplines: descriptive, predictive, and
inferential, and it provides the capability for statistical modeling. Applying optimization
techniques to these statistical models is valuable from a business decision-making
perspective and also for developing the strategic and tactical business strategies
needed to compete with your basis of competition.
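To make the sub-disciplines concrete, here is a minimal Python sketch, assuming SciPy is installed and using hypothetical customer spend figures, that pairs a descriptive summary with an inferential test:

```python
import numpy as np
from scipy import stats

# Hypothetical average spend for two representative customer segments.
segment_a = np.array([52.0, 48.5, 55.2, 60.1, 47.9, 58.3])
segment_b = np.array([41.2, 44.8, 39.5, 46.1, 42.7, 40.9])

# Descriptive: summarize each representative group.
print(f"A: mean={segment_a.mean():.1f}, std={segment_a.std(ddof=1):.1f}")
print(f"B: mean={segment_b.mean():.1f}, std={segment_b.std(ddof=1):.1f}")

# Inferential: test whether the segments differ in average spend.
t_stat, p_value = stats.ttest_ind(segment_a, segment_b)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```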

Statistical Modeling
Statistical modeling also offers value by defining immediately actionable objectives,
such as setting the direction for the actionable steps, understanding potential business
risks that need to be mitigated, and defining a solid correlation of data and content to
the current problem areas of the organization. For example, the relationship between
sales offers and changes in revenue or the relationship between dissatisfied customers
and products purchased. Statistical modeling provides the means to measure and
control production processes to minimize variations that lead to error or waste. This
ensures consistency throughout the process, saving money by reducing the materials
required to make or remake products, the materials lost to overage and scrap, and the
cost of honoring warranties on defective shipped products.
The rationale for organizations to deploy Big Data Analytics applications and supporting
technologies for achieving the value that statistical modeling provides is that Data
Analysts can construct these statistical models and utilize these optimization techniques
in business functional groups. These groups own the related data and content, in
contrast to a Data Scientist, who is responsible for developing Machine Learning and
Deep Learning Algorithms, Intelligent SMART Workflow, Cognitive Agents, etc. There
are many different types of statistical models, and a practical data analyst must
understand them comprehensively.
In each scenario, you should be able to identify which model will help best answer the
question at hand and which model is most appropriate for the data you’re working with.
Data is rarely ready for analysis in its raw form; therefore, it must first be cleansed and
prepared to ensure your analysis is accurate and viable. This cleansing and preparing
process often includes organizing the gathered data and content and quarantining
corrupted and incomplete data. You must explore and understand the data before any
statistical model can be completed. Once you know how various statistical models work
and how they leverage data, it is easier to determine which data is most relevant to the
questions you are trying to answer and the insights you seek. An organization must
thoroughly understand statistical modeling to obtain those answers, and data
visualizations are very helpful in communicating complex solutions and insights.

Data Mining Techniques


Data mining techniques are commonly referred to as statistical techniques. For
example, regression models examine relationships between variables. Organizations
often use regression models to determine which independent variables most influence
dependent variables, information that can be leveraged to make essential business
decisions. The following is a list of different types of regression models; the first three
are standard models that have been used extensively over time, and a brief sketch of
the first two appears after the list.

1. Logistic Regression
2. Linear Regression
3. Polynomial Regression
4. Stepwise Regression
5. Ridge Regression
6. Lasso Regression
7. Elastic Net Regression
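As referenced above, a minimal sketch of the first two listed models, assuming scikit-learn is installed and using hypothetical inputs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5]], dtype=float)

# Linear regression predicts a continuous value.
linear = LinearRegression().fit(X, np.array([2.0, 4.1, 6.2, 7.9, 10.1]))
print(linear.coef_, linear.intercept_)

# Logistic regression predicts a class probability (e.g., churn yes/no).
logistic = LogisticRegression().fit(X, np.array([0, 0, 0, 1, 1]))
print(logistic.predict_proba([[2.5]]))
```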

Classification Models
Classification models apply an algorithm that analyzes an existing data set
of known points. The understanding achieved through that analysis is then leveraged to
classify new data appropriately. Classification is a form of machine learning that can be
particularly helpful in analyzing massive, complex data sets and content to help make
more accurate predictions.
Classification models are a form of Supervised Machine Learning used to understand
how each data point was derived from historical data. These models provide
information that can be used to explain the results of specific predictive models. Some
of the most common classification models include the following; a brief sketch of two of
them follows the list.

1. Decision Trees
2. Random Forests
3. Naive Bayes Nearest Neighbor (NBNN)
4. Artificial Neural Networks
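As referenced above, a minimal sketch of two of these classifiers, assuming scikit-learn is installed; the Iris dataset stands in for business data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit and score a Naive Bayes and a nearest-neighbor classifier side by side.
for model in (GaussianNB(), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```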

Modeling Optimization Techniques


The following modeling optimization techniques are used to develop and refine
statistical models:

Continuous Optimization
A branch of optimization in applied mathematics in which, in contrast to discrete
optimization, the variables used in the objective function must be continuous.

Bound Constrained Optimization


The problem of optimizing an objective function subject to bound constraints on
the values of the variables.

Constrained Optimization
The process of optimizing an objective function concerning some variables in the
presence of constraints on those variables.

Derivative-Free Optimization
A discipline in mathematical optimization that does not use derivative information in the
classical sense to find optimal solutions: sometimes, information about the derivative of
the objective function f is unavailable, unreliable, or impractical to obtain.

Discrete Optimization or Combinatorial Optimization


The search for an optimal solution in a finite or countably infinite set of potential
solutions, where optimality is defined with respect to some criterion function that is to
be minimized or maximized.

Global Optimization
Finding the best set of admissible conditions to achieve your objective, formulated in
mathematical terms; it is part of Non-Linear Programming.

Linear Programming
A method to achieve the best outcome (such as maximum profit or lowest cost) in a
mathematical model whose requirements are represented by linear relationships.
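As a minimal sketch of a linear program, assuming SciPy is installed and using hypothetical profit and capacity figures:

```python
# Maximize profit 3x + 2y subject to x + y <= 4, x <= 3, x, y >= 0.
from scipy.optimize import linprog

# linprog minimizes, so negate the profit coefficients to maximize.
result = linprog(c=[-3, -2],
                 A_ub=[[1, 1], [1, 0]],
                 b_ub=[4, 3],
                 bounds=[(0, None), (0, None)])
print(result.x, -result.fun)  # optimal (x, y) and the maximum profit
```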

Non-differentiable Optimization
A category of optimization that deals with an objective that, for various reasons, is
non-differentiable and frequently non-convex. These functions, although continuous,
often contain sharp points or corners that do not admit a tangent and are thus
non-differentiable.
An essential step in optimization is classifying your optimization model, since algorithms
for solving optimization problems are tailored to particular types of problems. The
following guide helps you classify your optimization model based on the various
optimization problem types.
Problem Type 1: Continuous Optimization vs. Discrete Optimization: In some models,
the variables take on values from a discrete set, often a subset of integers. Other
models contain variables that can take on any real value. Models with discrete variables
are discrete optimization problems; models with continuous variables are continuous
optimization problems.
Problem Type 2: Unconstrained Optimization vs. Constrained Optimization: Some
problems place no constraints on the variables, while others do. Unconstrained
optimization problems arise directly in many practical applications; they also occur
when constrained optimization problems are reformulated so that a penalty term in the
objective function replaces the constraints. Constrained optimization problems arise
from applications with explicit constraints on the variables.
Problem Type 3: Deterministic Optimization vs. Stochastic Optimization: In deterministic
optimization, it is assumed that the data for the given problem are known accurately;
however, for many actual problems, the data cannot be known accurately for various
reasons, which leads to stochastic optimization.
Statistical Modeling and Optimization Techniques Practice Activity

The following ungraded activity allows you to think critically about which statistical
modeling techniques may be applicable and which optimization problem types they
may involve. Use your work from the previous practice activity and select a statistical
modeling technique, or set of techniques, with which you can now begin to optimize
those four types of cleansed and prepared data.

Complete the following steps and then answer the prompt.

1. List at least one statistical modeling technique for data and content optimization
and explain why you selected this technique.
2. Classify your Optimization Model based on at least one optimization problem
type.

Would there be additional statistical modeling techniques that would also benefit you,
and why?

Python
Python is a general-purpose, high-level programming language designed to be easy
to read and simple to implement. Python can be used to develop desktop GUI
applications, websites, and web applications, for example with the Django framework.
As a high-level language, it lets a Software Engineer focus on an application's core
functionality by taking care of everyday programming tasks. It is an open-source
language and a scripting language, like Ruby or Perl.
Python can be summarized as follows:

 It is interpreted, which means that Python code is translated and executed by an interpreter
one statement at a time; in a compiled language, by contrast, the entire source code is
compiled and then executed.
 It is an object-oriented programming language. Data in Python are objects created from
classes. A class is a type that defines objects of the same kind, with properties and methods
for manipulating those objects. Object-oriented programming is a powerful tool for developing
reusable software. A minimal sketch of both points appears below.
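The following short example (the class and values are hypothetical) sketches the two points above: the interpreter executes it one statement at a time, and the class defines objects with properties and methods.

```python
# A minimal sketch of Python's object-oriented style.
class Customer:
    def __init__(self, name: str, purchases: list[float]):
        self.name = name
        self.purchases = purchases

    def total_spend(self) -> float:
        return sum(self.purchases)

# The interpreter executes this one statement at a time, with no separate compile step.
alice = Customer("Alice", [19.99, 5.50])
print(alice.name, alice.total_spend())
```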

Python Tutorial
The tutorial provides many examples as it explains in detail the syntax and constructs
of this scripting language.

Learning Python

Companies that Use Python


Read this to learn more about companies that use Python.

Companies that use Python

R
R is a language and environment for statistical computing and graphics. The R
language provides a Software Engineer with a wide variety of statistical techniques:
linear and non-linear modeling, classical statistical tests, time-series analysis,
classification, clustering, and graphical methods. One of R's strengths is the ease with
which well-designed, publication-quality plots can be produced, including mathematical
symbols and formulae where needed. R is also open-source software and runs on
major operating systems such as open-source Linux, UNIX, etc.
R can be summarized as follows:

 It is a highly extensible and flexible language
 It makes it straightforward to implement high-end and complex statistical methods
 It has flexible graphics and intelligent defaults
 It has a steep learning curve
 It is slow when processing large datasets

Choosing Python or R for Data Analysis


While reviewing the article, consider the following questions:

 Based on the article, which program is recommended for someone with little coding
experience?
 Do you have any Python or R experience? If so, which do you prefer?

Python vs. R: What Should You Choose?

SQL (Structured Query Language)


SQL (Structured Query Language) is the standard language for defining, querying, and
manipulating data in relational databases. The following tutorial allows beginners to
learn and apply the syntax in structured exercises.

SQL Tutorial by W3C

SQL's Relationship to Python and R


The relationship between Python, R, and SQL is as follows:

 Python is an application language with scripting capabilities that can be brought to bear on
several statistical models.
 R is a programming language used to build well-defined, multi-dimensional models that can
be graphically displayed and consumed by Data Visualization technologies; data analysts
can change the logic of these models using Boolean and implied Boolean logic.
 SQL’s relationship to Python and R is that it is used to extract data sets from databases
to examine data integrity and to understand the relationships and constructs of application
tables, reference tables, etc. SQL is beneficial when used with several data-mining
techniques, for example, data cleansing and preparation; a short sketch appears below.
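As a minimal sketch of SQL working alongside Python (standard library only; the table and column names are hypothetical):

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 25.0), (2, None), (3, 40.0)])

    # SQL verifies integrity: count rows with missing amounts.
    bad = conn.execute("SELECT COUNT(*) FROM orders WHERE amount IS NULL").fetchone()[0]
    print(f"{bad} incomplete row(s) to quarantine")

    # Extract only the clean rows for downstream statistical modeling.
    rows = conn.execute("SELECT id, amount FROM orders WHERE amount IS NOT NULL").fetchall()
    print(rows)
```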

Python, R, and SQL Reflection


The following ungraded activity is an opportunity for you to think critically about how you
can apply Python, R, and SQL to statistical models. Use your work from the previous
practice activities. Complete the following steps and then answer the prompt.

1. List at least one example of a Python application or script that may be helpful and explain the
logic for it.
2. List at least one example of an R application that may be helpful and explain its logic.
3. List at least one example of a fundamental SQL statement that can be used to verify and
validate data and content integrity.

Would R provide a better graphical representation of the information derived from
statistical modeling?
Answer these questions in your journal for the course.

Module 5 Milestone: Business Recommendation Framework for Big Data Analytics in E-Commerce
Assignment Overview
Using the knowledge from this module, complete this assignment.
For this assignment, you will use the same e-commerce company from Milestone 1,
which is Amazon, and create an 8-10 slide PowerPoint presentation in which you
recommend a framework for analyzing big data, specifically for supplier relationship
management (SRM). Your framework will then assist you in showcasing a set of use
cases that address your forecasting and processing of SRM data.

Assignment Description
Use Amazon and create a PowerPoint presentation. Each slide must include a
50-150 word description in the presenter notes. The presenter notes are found at the
bottom of each slide in PowerPoint, where it says "Notes."
Be sure to address important terms, including supplier relationship management, goods
categories, Hadoop, Apache Spark, predictive analytics, and workflow. Make sure you
include citations and references to support your work.

Assignment Instructions
Step 1: Create your PowerPoint Presentation with the Following Sections

 Presentation Title
 E-Commerce Entity Description
 Business Framework (Plans and goals for forecasting, processing, and analyzing SRM
data)
 Use Case #1
 Use Case #2
 Use Case #3
 Use Case #4
 Use Case #5
 Conclusion
 References
Step 2: Research the Company
Research topics include data science platforms, authoritative sources, and AI content
related to your selected company. Make sure you recommend a framework for
analyzing big data and use this framework to showcase a set of use cases that address
the forecasting and processing of SRM data.
Step 3: Prepare your Presentation
Create the presentation according to the outline. Make sure to include the 50-150 word
descriptions in the notes section of each slide. Include references at the end of the
presentation. You must include a minimum of two sources.

Assignment Tips
 Include a minimum of two references using in-text citations and listing them on the reference
list. The NXU citation style can be found in your writing lab.
 Review the rubric before submitting the assignment.
 If you need help with creating and delivering an effective presentation, review this resource
on creating and delivering an effective presentation.
