
IoT data management system

(Module-II)
Dr Shola UshaRani
Associate Professor
SCOPE, VIT Chennai
What is IoT Data Management?
IoT data management enables users to track, monitor and manage devices to ensure they work properly and securely after deployment.
Why is Data Management needed?
● IoT sensors interact with people, homes, cities, farms,
factories, workplaces, vehicles, wearables and medical
devices, and beyond.
● IoT is changing our lives, from managing home appliances to vehicles. Smart devices can now advise us about what to do, when to do it and where to go.
● Industrial IoT applications assist us in managing the data
for processes, and predicting faults and disasters.
● The IoT platforms help set and maintain parameters to refine
and store data accordingly.
Data Management Process
• the process of taking the overall available data and refining it down to important information
• Different devices from different applications send large volumes and varieties of information. Managing all this IoT data means developing and executing architectures, policies, practices and procedures that can meet the full data lifecycle needs.
• Things are controlled by smart devices to automate tasks, so we can save our time.
Intelligent things can collect, transmit and understand information, but a tool will
be required to aggregate data and draw out inferences, trends and patterns.
What is the IoT data management requirement?
• We need to design a data management framework compatible with all the software and hardware that play a role in collecting, managing and distributing data.
• The design needs to be efficient to accelerate time-to-market of the end product.
• A good IoT data management solution will be able to filter out erroneous records coming from the IoT systems, such as negative temperature readings, before ingesting them into the data lake.
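As a rough illustration of such filtering, the sketch below drops implausible sensor readings before they are ingested; the field names and the validity range are assumptions made for this example, not part of any specific product.

# Minimal sketch: drop obviously erroneous IoT records before they reach the data lake.
# The field names ("device_id", "temperature_c") and the valid range are illustrative assumptions.
def is_valid(record: dict) -> bool:
    temp = record.get("temperature_c")
    # Reject missing readings and physically implausible values for this assumed sensor.
    return temp is not None and -40.0 <= temp <= 85.0

def filter_readings(readings):
    return [r for r in readings if is_valid(r)]

readings = [
    {"device_id": "s1", "temperature_c": 21.4},
    {"device_id": "s2", "temperature_c": -273.0},  # erroneous record
    {"device_id": "s3", "temperature_c": None},    # missing reading
]
print(filter_readings(readings))  # only the s1 record survives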
IoT data management techniques
Cloud computing
● Data processing happens in a centralised data storage location.
● Sensors and devices can connect indirectly through the cloud, where data is centrally managed, or send data directly to other devices that locally collect, store and analyse the data, and then share selected findings or information with the cloud.
Edge computing
● Data is processed near the data source or at the edge of the network. By processing some data locally, the IoT saves storage space for data, processes information faster and meets security challenges.
● Sensors produce a large amount of data for edge gateway devices so that these can make decisions by analysing the data.
● Edge devices for data management help secure the most valuable data and reduce bandwidth cost.
Data Management challenges
• Space optimization: the number of IoT devices will keep increasing, which increases the challenge of processing and analysing data in real time to reduce the amount that has to be stored.
• Identification of tools: functions such as adaptive maintenance, predictive repair, security monitoring and process optimization rely on real-time data. Selecting the right tools is a challenge because integration between different sensors should be proven and compatibilities confirmed. When there is no connection, devices must still gain insights, make decisions and prepare for data distribution.
• Data security: data must be protected from unauthorized access and tampering. Organizations also need to comply with national rules and regulations on securing data.
• Secure gateway device: having many different devices connected directly to cloud services presents a huge attack surface, which can be mitigated by channelling data through a secure gateway device.
Data Engineering and Data
Exploration
Data Exploration
• the initial step in data analysis, in which data analysts use data visualization and statistical techniques to describe dataset characteristics, such as size, quantity, and accuracy, in order to better understand the nature of the data.
• Two techniques:
– Manual analysis
– Automated analysis, through data exploration software solutions
• visually explore and identify relationships between different data
variables, the structure of the dataset, the presence of outliers,
and the distribution of data values in order to reveal patterns and
points of interest, enabling data analysts to gain greater insight
into the raw data.
Why we need Data Exploration?
• Data is often gathered in large, unstructured volumes
from various sources and data analysts must first
understand and develop a comprehensive view of the
data before extracting relevant data for further
analysis, such as univariate, bivariate, multivariate,
and principal components analysis.
• Humans process visual data better than numerical
data.
• Challenge: making sense of thousands of rows and columns of data points, and communicating that meaning, without any visual components
Data Exploration Tools
• Manual data exploration methods
– writing scripts to analyze raw data or manually filtering data into
spreadsheets.
– A popular tool for manual data exploration is a Microsoft Excel spreadsheet
• To identify the correlation between two continuous variables in Excel, use the CORREL() function to return the correlation (a pandas equivalent is sketched after this list)
• Automated data exploration tools
– data visualization software, help data scientists easily monitor data sources
and perform big data exploration on otherwise overwhelmingly large datasets
– Graphical displays of data, such as bar charts and scatter plots, are valuable
tools in visual data exploration.
– variety of proprietary automated data exploration solutions,
including business intelligence tools, data visualization software, data
preparation software vendors, and data exploration platforms
– open source data exploration tools that include regression capabilities and
visualization features, which can help businesses integrate diverse data
sources
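As referenced in the list above, the same correlation check can be done outside Excel; this is a small pandas sketch with invented values, shown only as an equivalent to CORREL():

import pandas as pd

# Two continuous variables (illustrative values only).
df = pd.DataFrame({
    "temperature_c": [18.0, 21.5, 23.0, 26.4, 30.1],
    "energy_kwh":    [42.0, 39.5, 37.2, 33.8, 30.0],
})

# Pearson correlation, analogous to Excel's CORREL() function.
print(df["temperature_c"].corr(df["energy_kwh"]))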
Data Engineering
Business Intelligence
• Collects, integrates, analyses data using
reports and dashboards to support decision
making
Roles required in Data Management
Process
• Data Analyst
• Data Engineer
• Data Scientist
Data Analyst or Data Integration
• Analyses all kinds of data and helps the organization understand it in plain English
• Helps in making better business decisions for
upper management
• Responsibilities :
– Collection
– Correlation
– Analysis
– Reporting
Data Engineer
• Preparing data for analytics and operational
usage
• Develops, constructs, tests and maintains the
complete architecture of the large-scale
processing system.
• Preparing a data pipeline
– To pull and integrate information from different sources
– The integrated data is then consolidated, cleaned and structured for further analytics
Data Scientist
• Analyses and interprets complex digital data
– e.g., the statistics of a website
• A professional who deals with an enormous mass of structured/unstructured data and uses skills in math, statistics, programming, machine learning, etc.
• Uses the above techniques to implement strategic plans
Skills needed
Data Analyst: data handling; cleaning & modelling; business & reporting
Data Engineer: creating & integrating APIs; in-depth ML algorithms; data pipelines & performance optimization
Data Scientist: advanced statistical analyses; predictive algorithms; data-driven problem solving, developing operational models & data conditioning
More detailed skills
Roles & Responsibilities: pre-processing large datasets (ETL techniques, Data Lake topology), tagging the data, and building prediction models for forecasting
Steps in Data Engineering
Data Integration methods
• ETL (Extract,Transform,Load)
• ELT (Extract,Load,Transform)
• Data Lake
What is a pipeline
• A set of processes that allows data to flow
• ETL aims to solve the problem of having data
in disparate places and formats by allowing
you to pull data from various sources into a
centralized location with a standardized
format. It’s called a “pipeline”.
What is an ETL
• Extract, Transform and Load
• can query hundreds of database rows at once without waiting for
each query to complete before moving on to the next one
• process of Extracting, Transforming, and Loading data from one or
multiple sources into a destination
• process to convert large amounts of data from one format to another.
• allows data to flow from one destination to another, while passing
through the three different stages.
• An ETL pipeline is implemented to periodically move data into a new database or destination in a format different from the one in the data source.
• ETL pipeline has
– Extract
– Transform
– Load
Reference: https://medium.com/@osiolabs/what-is-an-etl-extract-transform-load-pipeline-in-node-js-9a1a17de30f1
Parts of ETL Pipeline
• Extract — Retrieve raw data from wherever it is,
be that a database, an API, or any other source.
• Transform — Alter the data in some way. This
could be restructuring the data, renaming keys in
the data, removing invalid or unnecessary data
points, computing new values, or any other type
of data processing which converts what you have
extracted into the format you need it to be.
• Load — Move the formatted data into its final
destination (such as a database, flat file, or other
structure) where it can be accessed by others.
Extract
• retrieve the data from the data source
• input for the pipeline
• The data might be asynchronous in nature.
• Interacting with the data source and getting
the right data is the main focus of the extract
step, and frees up the other parts of the
pipeline from having to worry about how to
retrieve the data.
Extract
• Data management teams can extract
data from a variety of data sources,
which can be structured or unstructured.
Those sources include but are not limited
to:
• SQL or NoSQL servers
• CRM and ERP systems
• Flat files
• Email
• Web pages
Transform
• Can be simple, but sometimes complex too.
• Operates on the data flowing through the pipeline.
• Converts the raw data from the extract step into the data structures or values that are desired.
• Performed with functions that operate on the data, taking the raw data as input and outputting new records in the desired format.
• It takes an input and returns an appropriate output based on the data it received.
Tasks of Transform
• Filtering, cleansing, de-duplicating, validating, and
authenticating the data.
• Performing calculations, translations, or summarizations
based on the raw data. This can include changing row and
column headers for consistency, converting currencies or
other units of measurement, editing text strings, and
more.
• Conducting audits to ensure data quality and compliance
• Removing, encrypting, or protecting data governed by
industry or governmental regulators
• Formatting the data into tables or joined tables to match
the schema of the target data warehouse.
Loading
• sending the data onward.
• Loading the data, in the desired format, into the destination, such as a database, a data warehouse, or a flat file, where it can be accessed for further use.
ETL for Retail company
• Problem: Retail company with stores around the world that
transact in different local currencies. Every store reports its
revenue to the head office at the end of the month.
• Need :
– the stores report revenue in different currencies
– analyze all the stores’ performance in direct relation to each
other
– to analyze the revenue from each store, an ETL process is used to standardize each report so they all use the same currency
• ETL process:
– extract the data from the reports sent by the stores,
– transform the currency amounts from their local currency to a
single base currency, and
– then load the modified report data to a final reporting database
or other location
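A hedged sketch of the transform step for this scenario; the exchange rates and the report layout are invented for illustration (a real pipeline would obtain current rates from a finance service):

# Transform step only: convert each store's monthly revenue to a single base currency (USD).
# The rate table and record layout are assumptions made for this example.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "INR": 0.012, "JPY": 0.0067}

def to_base_currency(report: dict) -> dict:
    rate = RATES_TO_USD[report["currency"]]
    return {"store": report["store"], "revenue_usd": round(report["revenue"] * rate, 2)}

reports = [
    {"store": "Chennai", "revenue": 2_500_000, "currency": "INR"},
    {"store": "Berlin",  "revenue": 40_000,    "currency": "EUR"},
]
print([to_base_currency(r) for r in reports])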
Why Data Lake?
• Industries need large scale data infrastructure to support
analytics goals.
• Traditional databases are good at collecting data from applications and storing it for reference, but poor at running analytical queries on the same data.
• Traditional technologies only support linear placement of data and are not suitable for representing complex relationships among the data.
• The data warehouse quickly became associated with its
limitations, as companies became hungrier to leverage
more and more data.
• The data warehouse became crowded and bogged down
with requests which killed its performance and tested its
ability to deliver on service level agreements.
What is DATA LAKE
• The fundamental principle for building a data platform for
analytics.
• The invention of the data lake is aimed to solve the
problem of scalability in data.
Data Lake
● A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data, including machine-to-machine data and logs flowing through in real time. It is a place to store every type of data in its native format with no fixed limits on account size or file size
● The main objective of building a data lake is to offer an unrefined view of data to data scientists.
Data Lake Topology
● Used by AWS, Databricks, Azure, Oracle, etc.
Differences between Data Lake & Data
Warehouse (traditional system)
When to Use a Data Warehouse
• Query performance
• Transactional reporting
• Dashboards
• Structured data
• Data integrity
When to Use a Data Lake
• Large data volumes
• Unstructured and semi-structured data
• Streaming and time relevant data
• Data archive
Data preprocessing
• The set of techniques used prior to the application
of a data mining method is named as data
preprocessing for data mining.
• Since data will likely be imperfect, containing inconsistencies and redundancies, it is not directly applicable for starting a data mining process.
• The bigger the amounts of data collected, the more sophisticated the mechanisms required to analyze them.
• Data preprocessing is able to adapt the data to the
requirements posed by each data mining algorithm.
Why do we need to preprocess data?
By preprocessing data, we:
• Make our database more accurate. We eliminate
the incorrect or missing values that are there as a
result of the human factor or bugs.
• Boost consistency. When there are inconsistencies
in data or duplicates, it affects the accuracy of the
results.
• Make the database more complete. We can fill in
the attributes that are missing if needed.
• Smooth the data. This way we make it easier to use
and interpret.
Data preprocessing tasks
(Figure: overview of data preprocessing tasks. Source: https://serokell.io/blog/data-preprocessing)
Pre-processing 1 : Data Cleaning
What is Data cleaning
• Applying different techniques based on the
problem and the data type.
• Incorrect data is either removed,
corrected, or imputed.
Different Steps performed in Data
Cleaning
• Duplicates,
• Type conversion,
• Syntax errors,
• Standardize,
• Scaling / Transformation,
• Normalization,
• Missing Values,
• Outlier Treatment,
• Irrelevant Data.
Duplicates (step 1 of 9)
• Duplicates are data points that are
repeated in your dataset.
• This often happens when, for example, data are combined from different sources
Type Conversion (step 2 of 9)
• Make sure numbers are stored as
numerical data types.
• A date should be stored as a date object,
or a Unix timestamp (number of seconds),
and so on.
• Categorical values can be converted into
and from numbers if needed.
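A small pandas sketch of such conversions; the column names and values are illustrative only:

import pandas as pd

df = pd.DataFrame({
    "reading":  ["21.4", "22.0", "bad"],  # numbers stored as strings
    "captured": ["2024-01-05", "2024-01-06", "2024-01-07"],
    "status":   ["ok", "fault", "ok"],
})

df["reading"]  = pd.to_numeric(df["reading"], errors="coerce")  # invalid text becomes NaN
df["captured"] = pd.to_datetime(df["captured"])                 # store dates as date objects
df["status"]   = df["status"].astype("category").cat.codes      # categorical -> numeric codes
print(df.dtypes)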
Syntax Errors
(step 3 of 9)
• Remove white spaces: Extra white spaces
at the beginning or the end of a string should
be removed.
• Pad strings: Strings can be padded with
spaces or other characters to a certain width.
– For example, some numerical codes are often
represented with prepending zeros to ensure they
always have the same number of digits.
• Fix typos: strings can be entered in many different ways and, unsurprisingly, can contain mistakes.
Standardize (step 4 of 9)
• Our duty is to not only recognize the typos
but also put each value in the same
standardized format.
• For strings, make sure all values are
either in lower or upper case.
• For numerical values, make sure all values use the same measurement unit.
Scaling &
Transformation (step 5 of 9)
• Scaling means to transform
your data so that it fits within
a specific scale, such as 0–
100 or 0–1.
– For example, exam scores of
a student can be re-scaled to
be percentages (0–100)
instead of GPA (0–10).
• It can also help in making
certain types of data easier
to plot.
Normalization (step 6 of 9)
• In most cases, we
normalize the data
if we’re going to be
using statistical
methods that rely
on normally
distributed data.
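A short sketch of steps 5 and 6 on a single numeric column; the example values and the 0-100 target range are assumptions for illustration:

import pandas as pd

scores = pd.Series([6.5, 7.2, 8.9, 9.4, 5.1])  # e.g. GPA on a 0-10 scale

# Scaling (step 5): map the values onto a fixed 0-100 range.
scaled = (scores - scores.min()) / (scores.max() - scores.min()) * 100

# Normalization (step 6): z-score, so the values have mean 0 and standard deviation 1.
normalized = (scores - scores.mean()) / scores.std()

print(scaled.round(1).tolist())
print(normalized.round(2).tolist())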
Missing Values (step 7 of 9)
• Ways to resolve missing values:
– Drop: if the missing values in a column rarely happen and occur at random, then the easiest and most straightforward solution is to drop the observations (rows) that have missing values.
– Impute: calculate the missing value based on other observations. There are quite a lot of methods to do that.
• Types: statistical values; linear regression; hot-deck.
– Flag: mark the missing values and flag them out for later analysis.
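The three options above can be sketched with pandas as follows; the column names and values are invented for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 41, 36], "income": [52_000, 48_000, np.nan, 61_000]})

dropped = df.dropna()                                    # Drop rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))          # Impute with a statistical value (mean)
flagged = df.assign(income_missing=df["income"].isna())  # Flag missing values for later analysis
print(imputed)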
Outlier Treatment (step 8 of 9)
• They are values that are significantly
different from all other observations.
– Any data value that lies more than 1.5 * IQR below Q1 or above Q3 is considered an outlier.
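The 1.5 * IQR rule can be written out as a small check; the data values are invented for illustration:

import pandas as pd

values = pd.Series([12, 14, 15, 15, 16, 17, 18, 95])  # 95 is suspicious

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # -> [95]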
Irrelevant Data (step 9 of 9)
• Irrelevant data are those that are not
actually needed, and don’t fit under the
context of the problem we’re trying to
solve.
– For example, if we were analyzing data about the general health of the population, the phone number column would not be necessary.
Other steps (Data cleaning)
To handle noisy data, apply one of the following methods:
• Binning. Use binning if you have a pool of sorted data.
Divide all the data into smaller segments of the same
size and apply your dataset preparation methods
separately on each segment. For example, you
can bin the values for Age into categories such as 21-35, 36-59, and 60-79 (a short sketch of bin-based smoothing follows this list).
• Regression. Regression analysis helps to decide what
variables do indeed have an impact. Apply regression
analysis to smooth large volumes of data. This will
allow you to only work with the key features instead of
trying to analyze an overwhelming amount of variables.
• Clustering. Use clustering algorithms to group the data.
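As referenced in the binning item above, a short sketch of bin-based smoothing; the age values are invented, and the bins follow the ranges mentioned in the slide:

import pandas as pd

ages = pd.Series([22, 25, 31, 35, 40, 44, 58, 61, 70, 79])

# Cut the sorted values into the bins mentioned in the slide.
bins = pd.cut(ages, bins=[20, 35, 59, 79], labels=["21-35", "36-59", "60-79"])

# Smooth by bin means: every value in a bin is replaced by that bin's mean.
smoothed = ages.groupby(bins).transform("mean")
print(pd.DataFrame({"age": ages, "bin": bins, "smoothed": smoothed}))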
Pre-processing 2: Data
transformation
• By cleaning and smoothing the data, we have already performed data modification.
• Data transformation are the methods of
turning the data into an appropriate
format.
• Aggregation
– the data is pooled together and presented in a unified format for data
analysis.
• Working with a large amount of high-quality data allows for getting more
reliable results from the ML model.
• Normalization
– Normalization helps you to scale the data within a range to avoid
building incorrect ML models while training and/or executing data
analysis.
– Why?
• If the data range is very wide, it will be hard to compare the figures.
• With various normalization techniques, you can transform the original data
linearly, perform decimal scaling or Z-score normalization.
• Feature selection
– It is the process of automatically choosing the features that are most relevant to what your machine learning model wants to predict
– Feature selection is the selection of variables in data that are the
best predictors for the output variable
• Discretization
– During discretization, a programmer transforms the
data into sets of small intervals. For example,
putting people in categories “young”, “middle age”,
“senior” rather than working with continuous age
values. Discretization helps to improve efficiency.
• Concept hierarchy generation
– If you use the concept hierarchy generation method,
you can generate a hierarchy between the attributes
where it was not specified.
• For example, if you have the location information that
includes a street, city, province, and country but they have
no hierarchical order, this method can help you transform
the data.
Pre-processing 3: Data reduction
• When you work with large amounts of data, it
becomes harder to come up with reliable
solutions.
• Data reduction can be used to reduce the
amount of data and decrease the costs of
analysis.
• Researchers really need data reduction when
working with verbal speech datasets.
– Attribute feature selection
– Dimensionality reduction
– Numerosity reduction
Attribute feature selection
• Techniques for data transformation can also be used for data
reduction.
• If you construct a new feature combining the given
features in order to make the data mining process more
efficient, it is called an attribute selection.
Dimensionality reduction
• Datasets that are used to solve real-life tasks have a huge
number of features.
• Computer vision, speech generation, translation, and many
other tasks cannot sacrifice the speed of operation for the
sake of quality.
• It’s possible to use dimensionality reduction to cut the
number of features used.
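One common way to cut the number of features is principal component analysis (mentioned earlier under data exploration); a brief scikit-learn sketch on synthetic data, with the choice of two components as an illustrative assumption:

import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 10 correlated features (synthetic, for illustration only).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])  # 10 columns, mostly redundant

pca = PCA(n_components=2)             # keep only two derived features
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # how much information the two components keep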
Numerosity reduction
• Numerosity reduction is a method of data reduction
that replaces the original data by a smaller form
of data representation.
• There are two types of numerosity reduction
methods – Parametric and Non-Parametric.
• Parametric Methods
– Parametric methods use models to represent data,
regression is used to build such models.
• Non-parametric methods
– These techniques allow for storing reduced
representations of the data through histograms, data
sampling, and data cube aggregation.
References
• https://www.kdnuggets.com/2018/06/difference-between-data-integration-data-engineering.html
• https://www.omnisci.com/learn/data-exploration
• https://www.simplilearn.com/tutorials/data-science-tutorial/data-scientist-vs-data-analyst-vs-data-engineer
• https://iot.electronicsforu.com/content/tech-trends/data-management-systems-iot-devices/
• https://streamsets.com/blog/edw-or-edh-data-lake-warehouse-or-lakehouse/
Tagging of Data
Data forms
• Text data on paper
• Bytes as electronic memory storage.
• Twitter data
• image data
• structured data
• semi structured data
• unstructured data
• views
• likes
• tagged data, meta data
Example-I
• Google Data power
– Through an email, you can see a pop-up ad for something you previously looked at in the browser.
– How is this happening?
– Google software read your email, saw that you
mentioned unicorn hoodies, found an advertisement
for one, and then offered them to you.
– Google collects billions of data points every single
second from users around the world.
– Data points come from Gmail, Google searches, G Suite, Google Chrome, YouTube, and more.
Example -II
• Zillow is a real estate website that provides home values for free.
• They started making agreements with local real
estate associations to port their local sales
information directly to Zillow’s database.
• Zillow gets data from realtors. It uses that data to
attract homebuyers online, generating leads for
realtors, and then sells this data back to the
realtors, who gave them the data in the first
place.
Data organization
• Data is more powerful when it is organized
properly
– It’s the ability, as a business owner, to see and understand the trends in your customers’ behavior and make proactive, rather than reactive, changes in order to increase your bottom line
Advantages of Data collectors
• Monetization of data to generate revenue.
– Social media networks such as Facebook, Twitter,
and LinkedIn use data to make money.
• Collecting data on your customers and their
trends
Tags
• Organizing the data
– Collecting data
– Learn how to track
– Using the data
• Tags are something you can assign to any set of data. A set of data can be
tagged with one, or multiple tags.
– social media post with hashtags.
• Process of Tagging
– is a kind of indexing, a process of labelling and categorizing information made
to support resource discovery for users.
• Social tagging generally means the practice whereby internet users generate keywords to
describe, categorize or comment on digital content
• Tags are a very efficient way to organize data sets, such as customers
– they improve the experience businesses have with categorical data
• Some data can carry multiple tags.
– A customer might have more than one tag.
• CRMs have multiple data fields that you can fill in for each customer. Any field where you can add a category that is common across different customers, and where a customer can fall under more than one of those category options, is effectively using tags.
HashTag
• Assigning the set of data of social media data
with hashtags.
• A hashtag is a metadata tag that is prefaced by
the hash symbol, #. Hashtags are widely used on
microblogging and photo-sharing services such
as Twitter and Instagram as a form of user-
generated tagging that enables cross-referencing
of content sharing a subject or theme.
• A hashtag is used to draw attention, organize, promote, and connect. It started on Twitter as a way of making it easier for people to find, follow, and contribute to a conversation
• On Instagram, every post can be tagged with a specific hashtag.
– a hashtag called #chicagofood that is attached to
about 791,000 posts.
– #ginoseastpizza (a restaurant in Chicago) which is
attached to just over 1000 posts
Why use hashtags?
• Hashtags help your content get found by the target audience.
• Hashtags improve your click through rates
(CTR)
• Hashtags are great for research
• Hashtags bring like-minded people together.
• Hashtags can be used for humor
Contd.,
• #business, #entrepreneur or #marketing
• Specifying
– put the pound sign directly in front of the word or
phrase you want to turn into a hashtag and follow
these simple rules:
– No spaces
– No punctuation
– No special characters
– Capitalization for readability (optional)
Types of Tags
• One Tag
• Multiple Tags
• Hyper-targeted advertising tags
One Tag
• Only one tag is associated with the customer’s data
– Buying only beauty products by the customer
– Buying only sports goods from the same company
Without the tags, you are sending everyone in your database everything you have to offer, which is not as effective.
Multiple tags
• Assigning more than one tag to a customer by
the company
– the customer who buys beauty products and
sporting goods
– With tags, a company can assign as many
different tags to a customer as they need and
desire.
– Now it’s up to the company’s marketing team to
market to customers based on how they are
tagged;
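A small sketch of how one or multiple tags per customer might be represented and queried; the customers and tag names are invented for illustration:

# Each customer record carries a set of tags; one customer can have several.
customers = {
    "alice": {"beauty"},
    "bala":  {"beauty", "sports"},  # multiple tags on the same customer
    "chen":  {"sports"},
}

def with_tag(tag):
    # Return only the customers carrying the given tag, instead of mailing the whole database.
    return [name for name, tags in customers.items() if tag in tags]

print(with_tag("sports"))  # -> ['bala', 'chen']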
Targeted advertising
• Targeted advertising is a form of advertising, typically online, that is directed towards an audience with certain traits (quality parameters), based on the product or person the advertiser is promoting.
• These traits can either be demographic with a focus on
race, economic status, gender, age, generation, level of
education, income level, and employment, or there
can be a psychographic focus which is based on the
consumer values, personality, attitude, opinion,
lifestyle and interest.
Importance of target advertising
• Digital data is increasing radically, and it spreads cohesively across numerous ICT channels.
• With the emergence of new online channels, the need for targeted advertising is increasing
• Companies aim to minimize wasted
advertising by means of information
technology.
Issues with targeted advertising
• the lack of 'new' advertisements of goods or
services.
• Seeing as all ads are tailored to be based on
user preferences, no different products will be
introduced to the consumer.
– In this case the consumer will be at a loss as they
are not exposed to anything new.
Hyper-targeted advertising
• Hyper-targeting is a marketing strategy where you clearly identify a
target customer and deliver extremely relevant messages in the places
where they will be most likely to see it.
• Definitions in Hyper-targeted advertising
– Buyer persona
• A detailed customer profile that outlines who your target customer is by defining their demographics, sociographics, professional roles, values, goals, challenges, influences and buying habits
– Segmentation
• Dividing the audience into smaller groups based on their interactions with your brand
or their demographics, sociographic, professional roles, values, goals, challenges,
influences and buying habits.
– Geo-targeting
• The act of targeting a customer based on their location. This could be as broad as
targeting by country or as narrow as targeting by zip-code or mile-radius around a
location or business.
– Retargeting
• The act of sending messages to audiences who have already engaged with your
brand. This could include remarketing to people who have visited your website,
visited your store, made a purchase, followed your brand on social, signed up for an
online based offer, or had any other interaction with your brand.
Benefits
• Deliver personalized marketing messages that are relevant to what customers actually want and need, and stop sending customers marketing messages about products and services they have no interest in.
• Get a better value for your ad spend because you can reach the
right audiences and not spend money on ad campaigns that are
reaching the wrong audiences.
• Get the qualified leads from people who are likely to become
customers, and avoid filling your funnel or pipeline with
uninterested audiences.
• Generate more sales by marketing the right offerings to the right
customers on the right platforms.
• When executed properly, hypertargeting can make it easier for you
to connect with customers, generate leads, and increase your
sales.
Scenario
• Consider a situation where a shopkeeper wants to know which products he needs to buy more of for the coming seasons.
• Who are his regular customers?
• Which items are seasonal?
• Ordering the stock based on predictions.
Data usage
Forecasting
• ancestors observed the sky to forecast the
weather
• Data Scientists develop and train machine
learning models to predict sales, risks,
events, trends, etc
• Decisions need to be made about what should be forecast and when something can be forecast accurately
– e.g., the real estate market
Forecasting types
• Based on time horizon
– Short term forecasts : a normal range between one and
three months
– Medium term forecasts : the time period is normally one
year
– Long term forecasts: predict results over periods greater than two years
• Based on data availability
– Qualitative Forecasts :If there are no data available, or if
the data available are not relevant to the forecasts
– Quantitative Forecasts (Time Series Forecasting): If
numerical information about the past is available; and the
past patterns will continue into the future
Forecasting Methods
• Judgmental Forecasting is the only option for Qualitative Forecasts due to the lack of historical data; examples include a new policy, a new product, or a new competitor
• Time Series Forecasting methods only use
historical values on the variable to be forecast
and exclude the factors that could affect its
behavior such as competitor activities, changes in
environment or economic conditions, and so on
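A tiny sketch of a quantitative (time series) forecast using a simple moving average, which is only one of many possible methods; the monthly sales figures are invented:

import pandas as pd

# Past monthly sales (illustrative numbers only).
sales = pd.Series([120, 132, 128, 141, 150, 158],
                  index=pd.period_range("2023-01", periods=6, freq="M"))

# Short-term forecast: use the mean of the last three observed months for the next month.
forecast_next_month = sales.tail(3).mean()
print(forecast_next_month)  # -> about 149.7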
Predictive Modelling
• the process of using known results to create, process, and
validate a model that can be used to forecast future
outcomes or predictions.
• a tool used in predictive analytics, a data mining technique that attempts to answer the question “what might possibly happen in the future?”
• Mostly used predictive modeling techniques are
:regression and neural networks
– For example, predictive modeling could help identify
customers who are likely to purchase new product over the
next 90 days.
• Companies can use predictive modeling to forecast events,
customer behavior, as well as financial, economic, and
market risks.
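As a hedged illustration of the 90-day purchase example, a small logistic regression on made-up customer features; this is not the slides’ actual model or data:

from sklearn.linear_model import LogisticRegression

# Known results: [visits last month, past purchases] -> bought within 90 days (1) or not (0).
X = [[1, 0], [3, 1], [8, 4], [2, 0], [10, 6], [7, 2]]
y = [0, 0, 1, 0, 1, 1]

model = LogisticRegression().fit(X, y)

# Score a new customer: probability they purchase the new product in the next 90 days.
print(model.predict_proba([[6, 3]])[0][1])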
• Predictive modeling is a process that uses
data and statistics to predict outcomes with
data models.
– These models can be used to predict anything
from sports outcomes and TV ratings to
technological advances and corporate earnings.
• Predictive modeling is also often referred to
as:
– Predictive analytics
– Predictive analysis
– Machine learning
Why predictive modelling
• analysing historical events.
• increase the probability of forecasting events,
customer behavior, as well as financial,
economic, and market risks
How the data is retrieved?
• rapid migration to digital products has created a
sea of data that is readily available to
businesses.
• Big data is utilized by companies to improve the
dynamics of the customer-to-business
relationship.
• This vast amount of real-time data is retrieved
from sources such as social media, internet
browsing history, cell phone data, and cloud
computing platforms.
Examples
• The bank wants to identify which of its customers are likely
to engage in money laundering activities.
• Using the bank’s customer data, a predictive model is built around the dollar amount of money transfers that customers made during a period of time.
• The model is taught to recognize the difference between a
money laundering transaction and a normal transaction.
The optimal outcome from the model should be a pattern
that signals which customer laundered money and which
didn’t. If the model perceives that a pattern of fraud is
emerging for a particular customer, it will create a signal
for action, which will be attended to by the bank’s fraud
prevention unit.
Process of Predictive Analysis
• Predictive analysis consists of 7 processes, as follows:
• Define project: Defining the project, scope, objectives and
result.
• Data collection: Data is collected through data mining
providing a complete view of customer interactions.
• Data Analysis: It is the process of cleaning, inspecting,
transforming and modelling the data.
• Statistics: This process enables validating the assumptions and
testing the statistical models.
• Modelling: Predictive models are generated using statistics
and the most optimized model is used for the deployment.
• Deployment: The predictive model is deployed to automate
the production of everyday decision-making results.
• Model monitoring: Keep monitoring the model to review
performance which ensures expected results. (Source: https://www.geeksforgeeks.org)
Predictive Modelling Tools
• Tools should have the ability to handle linear and non-linear data relationships.
– Neural networks
– decision trees,
– time series data mining, and
– Bayesian analysis
– big data
– Ordinary Least Squares
– Generalized Linear Models (GLM)
– Logistic Regression
– Random Forests
– Decision Trees
– Neural Networks
– Multivariate Adaptive Regression Splines (MARS)
What are the Biggest Challenges of Predictive Modeling?
• Predictive modeling presents a number of challenges in practice.
• These challenges include:
– Sufficiently large and comprehensive datasets
– Adaptability of models to new problems
– Data organization and hygiene
– Data privacy and security
Need of Predictive Analysis
• Understanding customer behavior: Predictive analysis uses
data mining feature which extracts attributes and behavior of
customers. It also finds out the interests of the customers so
that business can learn to represent those products which can
increase the probability or likelihood of buying.
• Gain competition in the market: With predictive analysis,
businesses or companies can make their way to grow fast and
stand out as a competition to other businesses by finding out
their weakness and strengths.
• Learn new opportunities to increase revenue: Companies
can create new offers or discounts based on the pattern of the
customers providing an increase in revenue.
• Find areas of weakening: Using these methods, companies
can gain back their lost customers by finding out the past
actions taken by the company which customers didn’t like.
(Source: https://www.geeksforgeeks.org)
Applications of Predictive
Analysis
• Health care: Predictive analysis can be used to determine the
history of patient and thus, determining the risks.
• Financial modelling: Financial modelling is another aspect
where predictive analysis plays a major role in finding out the
trending stocks helping the business in decision making
process.
• Customer Relationship Management: Predictive analysis
helps firms in creating marketing campaigns and customer
services based on the analysis produced by the predictive
algorithms.
• Risk Analysis: While forecasting the campaigns, predictive
analysis can show an estimation of profit and helps in
evaluating the risks too.
(Source: https://www.geeksforgeeks.org)
References
• https://www.godaddy.com/garage/what-is-hypertargeting/
• https://textsanity.com/text-message-marketing/why-tags-are-the-best-way-to-organize-your-data/
• https://www.investopedia.com/terms/p/predictive-modeling.asp
• https://medium.com/@osiolabs/what-is-an-etl-extract-transform-load-pipeline-in-node-js-9a1a17de30f1
• https://towardsdatascience.com/forecasting-fundamentals-you-should-know-before-building-predictive-models-299a18c2093b