
UNIT-II: Data Analytics

Introduction to Analytics, Introduction to Tools and Environment, Application of Modeling
in Business, Databases & Types of Data and Variables, Data Modeling Techniques, Missing
Imputations, etc., Need for Business Modeling.
Introduction to Analytics:
Analytics involves the systematic computational analysis of data or statistics. It's used
to discover, interpret, and communicate meaningful patterns in data.
Types of Analytics:
Analytics is broadly categorized into four main types: Descriptive, Diagnostic,
Predictive, and Prescriptive. Each type serves a different purpose and provides insights at
various levels of analysis.

1. Descriptive Analytics
Descriptive analytics answers the question, "What happened?" by summarizing historical
data to understand trends, patterns, and relationships. It involves data aggregation,
summarization, and visualization.
Example:
o Sales Reporting: A retail company uses descriptive analytics to generate daily
sales reports. These reports summarize the total sales, number of transactions,
and average transaction value across different stores. By examining these
reports, the company can identify peak sales periods, popular products, and
regional differences in sales performance.
 Tools Commonly Used:
o Dashboards (e.g., Tableau, Power BI)
o Reports (e.g., Excel, Google Data Studio)
2. Diagnostic Analytics
Diagnostic analytics answers the question, "Why did it
happen?" by drilling down into data to identify causes and correlations. It often involves
more detailed exploration and the use of statistical techniques.

Example:
Customer Churn Analysis: A telecommunications company uses diagnostic analytics to
understand why certain customers are canceling their subscriptions. By analyzing data such
as customer service interactions, billing history, and usage patterns, the company identifies
factors that are highly correlated with churn, such as frequent billing issues or poor service
coverage in certain areas.
 Tools Commonly Used:
o Statistical software (e.g., SPSS, R)
o Data mining tools (e.g., SAS, RapidMiner)
3. Predictive Analytics
Predictive analytics answers the question, "What is likely to happen?" by using historical
data to make forecasts or predictions about future outcomes. It involves techniques like
machine learning, statistical modeling, and data mining.
Example:
Demand Forecasting: An e-commerce company uses predictive analytics to forecast
demand for products during the holiday season. By analyzing historical sales data,
promotional activities, and external factors like economic conditions, the company predicts
which products will be in high demand. This allows the company to optimize inventory levels
and reduce the risk of stockouts or overstocking.
 Tools Commonly Used:
o Machine learning platforms (e.g., Python with scikit-learn, TensorFlow)
o Forecasting tools (e.g., ARIMA models in R, Prophet by Facebook)
4. Prescriptive Analytics
Prescriptive analytics answers the question, "What should we do?" by providing
recommendations for actions that can optimize outcomes. It involves the use of optimization
techniques, simulation models, and machine learning algorithms.
Example:
Supply Chain Optimization: A manufacturing company uses prescriptive analytics to
optimize its supply chain operations. By integrating predictive models with optimization
algorithms, the company determines the best sourcing strategy, production schedule, and
logistics routes to minimize costs while meeting customer demand. The system might
recommend adjusting order quantities or rerouting shipments in real time based on current
inventory levels and demand forecasts.
 Tools Commonly Used:
o Optimization software (e.g., IBM CPLEX, Gurobi)
o Simulation tools (e.g., AnyLogic, Arena)
Introduction to Tools and Environment:
Data analytics involves extracting valuable insights from data, which requires a robust
set of tools and a well-configured environment. The choice of tools and environment depends
on the type of data, the complexity of the analysis, and the specific goals of the project.

1. Programming Languages
Programming languages form the backbone of data analytics, allowing analysts and data
scientists to manipulate data, perform calculations, and build models.

 Python:
Python is the most widely used language for data analytics due to its simplicity and extensive
libraries.

Libraries:
 Pandas: For data manipulation and analysis.
 NumPy: For numerical computations and handling large datasets.
 Matplotlib and Seaborn: For data visualization.
 Scikit-learn: For machine learning and predictive modeling.
Use Case: Data cleaning, exploratory data analysis, machine learning model development
(see the sketches at the end of this section).
 R:
R is a powerful language for statistical analysis and visualization.
Libraries:
 ggplot2: For creating advanced visualizations.
 dplyr: For data manipulation.
 caret: For machine learning.
 Shiny: For building interactive web applications.
Use Case: Statistical analysis, hypothesis testing, creating detailed plots.
 SQL:
o Description: Structured Query Language (SQL) is essential for querying and
managing data stored in relational databases.
o Use Case: Extracting, transforming, and loading (ETL) data from databases,
performing complex queries.
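As a quick illustration of the Python workflow described above, here is a minimal
pandas/NumPy sketch; the store names and sales figures are invented for illustration.

```python
# A minimal data-cleaning and summarization sketch with pandas and NumPy.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [120.0, np.nan, 95.0, 130.0],   # one missing value to clean
})

df["sales"] = df["sales"].fillna(df["sales"].mean())       # simple cleaning step
print(df.groupby("store")["sales"].agg(["mean", "sum"]))   # quick exploratory summary
```

And a self-contained SQL sketch using Python's built-in sqlite3 module; the "orders"
table and its rows are hypothetical.

```python
# A typical analytical query: total sales per region, highest first.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "North", 250.0), (2, "South", 90.0), (3, "North", 40.0)])

for row in con.execute("SELECT region, SUM(amount) AS total "
                       "FROM orders GROUP BY region ORDER BY total DESC"):
    print(row)
con.close()
```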
2. Data Visualization Tools
Visualization tools are used to create charts, graphs, and dashboards that make data more
understandable and actionable.
 Tableau:
A leading data visualization tool that enables the creation of interactive and shareable
dashboards.
o Features: Drag-and-drop interface, integration with various data sources, real-time
data analysis.
o Use Case: Creating dashboards for business intelligence, interactive data
exploration.
 Power BI:
Microsoft’s powerful business analytics tool that allows users to visualize and share insights.
o Features: Integration with Microsoft products, real-time analytics, natural
language queries.
o Use Case: Business reporting, interactive data dashboards, sharing insights
across organizations.
 Matplotlib/Seaborn (Python) & ggplot2 (R):
Libraries for creating static, animated, and interactive visualizations in Python and R.
o Use Case: Custom visualizations in data science projects, exploratory data
analysis.
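As a small illustration, here is a minimal matplotlib sketch of the kind of chart these
libraries produce; the monthly sales figures are invented.

```python
# A simple bar chart of hypothetical monthly sales.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 160]

plt.bar(months, sales)
plt.title("Monthly Sales (hypothetical data)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```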

3. Data Storage and Management Tools


These tools are crucial for storing, managing, and retrieving large volumes of data efficiently.
 Relational Databases (e.g., MySQL, PostgreSQL, Oracle):
Traditional databases that store data in tables and are accessed using SQL.
o Use Case: Structured data storage, transactional data, financial records.
 NoSQL Databases (e.g., MongoDB, Cassandra, Redis):
Databases designed to handle unstructured and semi-structured data with flexible schemas.
o Use Case: Big data, real-time web applications, storing documents, key-value
pairs, graphs.
 Data Warehouses (e.g., Amazon Redshift, Google BigQuery, Snowflake):
Central repositories of integrated data from various sources, optimized for query and analysis.
o Use Case: Enterprise-wide analytics, reporting, business intelligence.

4. Big Data Processing Tools


Big data tools are designed to handle vast amounts of data, often in distributed computing
environments.
 Apache Hadoop:
An open-source framework that allows for the distributed processing of large data sets across
clusters of computers.
o Components:
 HDFS (Hadoop Distributed File System): For storing large datasets.
 MapReduce: For processing data in parallel across a distributed cluster.
o Use Case: Batch processing of large data sets, data storage for large-scale
applications.
 Apache Spark:
A fast and general-purpose cluster-computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance.
o Features: In-memory processing, real-time data streaming, machine learning
library (MLlib).
o Use Case: Real-time data processing, machine learning, iterative algorithms.
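A minimal PySpark sketch of the Spark DataFrame API described above; it assumes a local
Spark installation, and the regions and amounts are invented.

```python
# Aggregate hypothetical sales by region on a (local) Spark cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.createDataFrame(
    [("North", 250.0), ("South", 90.0), ("North", 40.0)],
    ["region", "amount"],
)
df.groupBy("region").agg(F.sum("amount").alias("total")).show()
spark.stop()
```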

5. Machine Learning and Statistical Tools


These tools are used to build predictive models and perform advanced statistical analysis.
 Scikit-learn (Python):
o Description: A library for machine learning that integrates with Python’s
ecosystem of scientific libraries.
o Features: Support for classification, regression, clustering, dimensionality
reduction.
o Use Case: Building and evaluating machine learning models (see the sketch at the
end of this section).
 TensorFlow & PyTorch:
o Description: Open-source machine learning frameworks used for developing
deep learning models.
o Features: Support for neural networks, training large-scale models, flexibility in
model design.
o Use Case: Image recognition, natural language processing, deep learning
applications.
 SAS:
o Description: A software suite developed by SAS Institute for advanced
analytics, business intelligence, data management, and predictive analytics.
o Features: Extensive statistical analysis tools, strong integration with business
processes.
o Use Case: Statistical analysis in large enterprises, predictive modeling.
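A minimal sketch of the scikit-learn fit/predict/evaluate cycle mentioned above, using
synthetic data from make_regression; a real project would start from a cleaned dataset
instead.

```python
# Train a linear regression on synthetic data and score it on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))
```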

6. Cloud Platforms
Cloud platforms provide scalable and flexible environments for data storage, processing, and
analytics.
 Amazon Web Services (AWS):
o Tools: Amazon S3 (storage), Redshift (data warehousing), SageMaker (machine
learning).

o Use Case: Scalable cloud-based analytics, data storage, machine learning model
deployment.
 Google Cloud Platform (GCP):
o Tools: BigQuery (data warehousing), AutoML (machine learning), Dataflow
(data processing).
o Use Case: Real-time data analytics, big data processing, AI-driven insights.
 Microsoft Azure:
o Tools: Azure Data Lake (storage), Azure Machine Learning, Azure Synapse
Analytics.
o Use Case: End-to-end data analytics solutions, integrating with Microsoft
ecosystem.

7. Integrated Development Environments (IDEs) and Notebooks


IDEs and notebooks are essential for writing, testing, and running code, particularly in data
science and analytics.
 Jupyter Notebook:
o Description: An open-source web application that allows you to create and
share documents containing live code, equations, visualizations, and narrative
text.
o Use Case: Data exploration, interactive data analysis, sharing insights with
others.
 RStudio:
o Description: An integrated development environment for R that provides tools
for plotting, history, debugging, and workspace management.
o Use Case: Statistical computing, data visualization, reproducible research.
 VS Code:
o Description: A lightweight but powerful source code editor with support for
various programming languages, including Python.
o Use Case: Writing and debugging code, integrating with version control systems
(e.g., Git), developing data pipelines.

8. Data Integration and ETL Tools


ETL (Extract, Transform, Load) tools are used to integrate data from multiple sources,
transform it into a suitable format, and load it into a target database or data warehouse.

 Talend:
o Description: An open-source data integration platform that allows for data
migration, profiling, and quality management.
o Use Case: Data integration, data quality management, ETL processes.
 Apache Nifi:

o Description: A tool for automating the flow of data between software systems,
with real-time data processing and integration.
o Use Case: Data ingestion from various sources, real-time data flow
management.
 Informatica PowerCenter:
o Description: An enterprise data integration platform used to extract, transform,
and load data.
o Use Case: Large-scale data integration, ETL processes, data warehousing.
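A toy end-to-end ETL sketch with pandas and sqlite3, independent of any particular ETL
product; the CSV content and column names are invented.

```python
# Extract from a CSV source, transform (clean nulls, fix types), load to SQLite.
import io
import sqlite3
import pandas as pd

raw = io.StringIO("order_id,order_date,amount\n1,2024-01-05,250\n2,2024-01-06,\n")
df = pd.read_csv(raw)                                 # Extract (stand-in for a real source)
df["amount"] = df["amount"].fillna(0.0)               # Transform: handle missing values
df["order_date"] = pd.to_datetime(df["order_date"])   # Transform: fix data types

con = sqlite3.connect(":memory:")                     # Load into a target store
df.to_sql("sales_clean", con, index=False)
print(con.execute("SELECT COUNT(*) FROM sales_clean").fetchone())
con.close()
```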

9. Version Control Systems


Version control is essential for managing changes to code, data models, and analytics
projects, especially in collaborative environments.
 Git:
o Description: A distributed version control system used to track changes in
source code during software development.
o Platforms: GitHub, GitLab, Bitbucket.
o Use Case: Collaborating on data science projects, tracking changes to code and
documentation, managing project versions.

10. Collaborative Environments


Collaboration tools are essential for teams working on data analytics projects, allowing them
to share insights, code, and data in a seamless manner.
 Google Colab:
o Description: A cloud-based Jupyter notebook environment that supports Python
and allows for collaborative work.
o Use Case: Collaborative data analysis, machine learning experiments, sharing
notebooks.
 Microsoft Teams:
o Description: A collaboration platform that integrates with Microsoft Office and
other tools.
o Use Case: Sharing reports, conducting meetings, collaborative work on data
analytics projects.
 Slack:
o Description: A messaging platform that allows teams to communicate and
integrate with other tools like GitHub, Jenkins, and JIRA.
o Use Case: Team communication, sharing updates on analytics projects,
integrating notifications from various tools.

Application of Modeling in Business:
Modeling in business involves creating mathematical representations of real-world
processes to predict outcomes, optimize operations, and inform decision-making. Various
types of models are used in data analytics to solve business problems, from forecasting
demand to optimizing supply chains.
1. Sales Forecasting
Sales forecasting models predict future sales based on historical data, market trends, and
other relevant factors. Accurate sales forecasts enable businesses to make informed decisions
about inventory management, staffing, and budgeting.
 Model Types:
o Time Series Models: ARIMA, Exponential Smoothing.
o Machine Learning Models: Random Forest, Gradient Boosting Machines.
Example:
Retail Industry: A clothing retailer uses a time series forecasting model to predict
sales during the holiday season. By analyzing past sales data, promotional activities, and
external factors like weather, the model forecasts demand for each product category. This
helps the retailer optimize inventory levels, reducing the risk of stockouts or overstocking.
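A minimal seasonal-forecasting sketch using Holt-Winters exponential smoothing from
statsmodels; the three years of monthly sales are synthetic.

```python
# Fit an additive trend + seasonality model and forecast six months ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.date_range("2021-01-01", periods=36, freq="MS")
sales = pd.Series(
    100 + np.arange(36) + 20 * np.sin(np.arange(36) * 2 * np.pi / 12), index=idx
)

fit = ExponentialSmoothing(
    sales, trend="add", seasonal="add", seasonal_periods=12
).fit()
print(fit.forecast(6))  # demand forecast for the next six months
```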

2. Customer Segmentation
Customer segmentation models group customers based on shared characteristics, such as
demographics, behavior, or purchasing patterns. This allows businesses to tailor marketing
strategies and improve customer engagement.
 Model Types:
o Clustering Models: K-Means, Hierarchical Clustering.
o Classification Models: Logistic Regression, Decision Trees.
Example:
Telecommunications: A telecom company uses clustering models to segment its
customer base into distinct groups based on usage patterns, demographics, and service
preferences. The company then creates targeted marketing campaigns for each segment,
offering customized plans and promotions to improve customer retention and satisfaction.
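A minimal segmentation sketch with scikit-learn's K-Means; the two usage features
(monthly voice minutes, monthly data in GB) and their values are invented.

```python
# Scale the features, then assign each customer to one of three segments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.array([[900, 2], [850, 1], [120, 15], [100, 18], [500, 8], [520, 7]])
X_scaled = StandardScaler().fit_transform(X)  # put features on a common scale

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # segment id per customer, e.g. heavy-voice vs heavy-data users
```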

3. Risk Management
Risk management models assess the likelihood of adverse events, such as loan defaults or
market downturns, and estimate their potential impact on the business. These models help
businesses mitigate risks and make more informed decisions.

 Model Types:
o Credit Scoring Models: Logistic Regression, Support Vector Machines.

o Market Risk Models: Value at Risk (VaR), Monte Carlo Simulation.
Example:
Banking: A bank uses a credit scoring model to evaluate the risk of lending to
individual customers. By analyzing factors such as credit history, income level, and
employment status, the model predicts the probability of default. The bank uses this
information to approve or reject loan applications and to set appropriate interest rates.
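A minimal credit-scoring sketch: logistic regression on synthetic applicant data (via
make_classification); in practice the features would be credit history, income level,
and so on.

```python
# Fit a classifier and output each applicant's estimated probability of default.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)  # y: 1 = default
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
default_prob = clf.predict_proba(X_test)[:, 1]
print(default_prob[:5])  # could be thresholded to approve/reject and to set rates
```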

4. Supply Chain Optimization


Supply chain optimization models focus on minimizing costs and maximizing efficiency
across the supply chain. These models consider factors like production schedules, inventory
levels, and logistics to optimize the flow of goods and services.
 Model Types:
o Linear Programming: For optimizing production schedules and resource
allocation.
o Simulation Models: For testing different supply chain scenarios.

Example:
Manufacturing: A manufacturing company uses a linear programming model to
optimize its production schedule and minimize costs. The model considers factors like
production capacity, raw material availability, and delivery deadlines. As a result, the
company reduces production costs, minimizes inventory holding costs, and ensures timely
delivery of products to customers.
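A toy production-planning sketch using SciPy's linear programming solver; the unit
costs, machine-hour capacity, and minimum order quantities are invented.

```python
# Minimize production cost subject to capacity and demand constraints.
from scipy.optimize import linprog

c = [4.0, 6.0]                    # unit production costs for products P1, P2
A_ub = [[2.0, 3.0]]               # machine-hours consumed per unit...
b_ub = [100.0]                    # ...must fit within 100 available hours
bounds = [(10, None), (5, None)]  # minimum quantities demanded by customers

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x, res.fun)             # optimal quantities and minimum total cost
```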

5. Marketing Mix Modeling


Marketing mix models analyze the effectiveness of various marketing activities (e.g.,
advertising, promotions, pricing) on sales and other key performance indicators. These
models help businesses allocate marketing budgets more effectively.
 Model Types:
o Regression Models: Multiple Linear Regression, Ridge Regression.
o Time Series Models: ARIMA, Vector Autoregression (VAR).

Example:
Consumer Goods: A consumer goods company uses a marketing mix model to assess
the impact of different marketing activities on sales. By analyzing historical data on
advertising spend, pricing, and promotions, the model identifies which activities drive the
most sales. The company uses this insight to allocate its marketing budget more effectively,
focusing on the channels that provide the highest return on investment.
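A minimal marketing-mix sketch: regress sales on per-channel spend and read the
coefficients as a rough estimate of each channel's effectiveness; all figures are
invented, and a real model would also control for seasonality and other factors.

```python
# Estimate the sales lift per extra unit of spend in each channel.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: TV spend, online spend, promotion spend (per period)
X = np.array([[50, 20, 10], [60, 25, 5], [40, 30, 15],
              [70, 10, 10], [55, 35, 20], [65, 15, 5]])
sales = np.array([300, 330, 310, 320, 360, 325])

model = LinearRegression().fit(X, sales)
print(model.coef_)  # one coefficient per channel
```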

6. Churn Prediction
Churn prediction models identify customers who are likely to leave a service or stop
purchasing a product. These models enable businesses to take proactive measures to retain
customers, such as targeted offers or personalized communication.
 Model Types:
o Classification Models: Logistic Regression, Random Forest.
o Survival Analysis: Cox Proportional Hazards Model.
Example:
Subscription Services: A subscription-based video streaming service uses a churn prediction
model to identify customers at risk of canceling their subscriptions. By analyzing factors like
viewing patterns, customer support interactions, and payment history, the model predicts
churn risk. The service then targets at-risk customers with personalized offers or incentives to
encourage them to stay.
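A minimal survival-analysis sketch for churn using the third-party lifelines library's
Cox proportional hazards model; the tenures, churn flags, and support-call counts are
invented.

```python
# Estimate how support calls relate to the hazard (risk) of cancelling.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "tenure_months": [2, 5, 7, 12, 3, 20, 9, 15],  # observed time with the service
    "churned":       [1, 1, 0, 0, 1, 0, 1, 0],     # 1 = cancelled, 0 = still active
    "support_calls": [5, 1, 1, 0, 6, 4, 3, 0],     # covariate
})

cph = CoxPHFitter().fit(df, duration_col="tenure_months", event_col="churned")
print(cph.summary[["coef", "exp(coef)"]])  # exp(coef) is the hazard ratio
```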

7. Fraud Detection
Fraud detection models identify suspicious activities that may indicate fraudulent
behavior. These models are particularly valuable in industries like banking, insurance, and
e-commerce, where fraud can have significant financial impacts.
 Model Types:
o Anomaly Detection Models: Isolation Forest, One-Class SVM.
o Supervised Learning Models: Logistic Regression, Neural Networks.
Example:
Banking: A bank uses a fraud detection model to monitor transactions in real-time.
The model flags unusual activities, such as large transactions from an unfamiliar location,
which could indicate fraudulent activity. The bank can then block the transaction and contact
the customer for verification, reducing the risk of fraud.
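A minimal anomaly-detection sketch with scikit-learn's Isolation Forest; the transaction
amounts and hours are invented, with one deliberately unusual transaction.

```python
# Flag transactions that look unlike the rest (-1 = likely anomaly).
import numpy as np
from sklearn.ensemble import IsolationForest

# columns: transaction amount, hour of day
X = np.array([[25, 14], [40, 10], [30, 16], [35, 12],
              [28, 15], [5000, 3], [32, 11], [27, 13]])

iso = IsolationForest(contamination=0.1, random_state=0).fit(X)
print(iso.predict(X))  # the 5000-unit transaction at 3 a.m. should score -1
```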

8. Product Recommendation
Product recommendation models suggest products or services to customers based on their
preferences, past behavior, and similar customer profiles. These models enhance the customer
experience and drive sales.
 Model Types:
o Collaborative Filtering: User-based or Item-based.
o Content-Based Filtering: Based on product attributes.
Example:
E-commerce: An online retailer uses a product recommendation model to suggest
products to customers based on their browsing and purchase history. The model analyzes
patterns across customers with similar preferences and recommends products that the
customer is likely to be interested in. This personalization increases the likelihood of
purchase and enhances the customer’s shopping experience.
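A minimal item-based collaborative-filtering sketch: items whose rating columns look
alike are treated as related; the user-item rating matrix is invented.

```python
# Compute item-to-item similarity and list the items most similar to item 0.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = items; 0 means "not rated"
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]])

item_sim = cosine_similarity(ratings.T)    # item-to-item similarity matrix
print(np.argsort(item_sim[0])[::-1][1:3])  # candidates to recommend alongside item 0
```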

9. Price Optimization
Price optimization models help businesses determine the best pricing strategy to maximize
profits while remaining competitive. These models consider factors like demand elasticity,
competitor pricing, and cost structures.
 Model Types:
o Optimization Models: Linear Programming, Conjoint Analysis.
o Machine Learning Models: Random Forest, Gradient Boosting.
Example:
Hospitality: A hotel chain uses a price optimization model to adjust room rates
dynamically based on demand, seasonality, and competitor pricing. The model helps the hotel
maximize revenue by increasing prices during high-demand periods and offering discounts
during low-demand periods.
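A toy price-optimization sketch: assuming a simple linear demand curve, scan a grid of
prices for the revenue-maximizing point; the demand curve and price range are invented.

```python
# Find the price that maximizes revenue = price * demand(price).
import numpy as np

prices = np.linspace(50, 200, 151)
demand = np.maximum(0, 400 - 2 * prices)  # assumed linear demand curve
revenue = prices * demand

best = np.argmax(revenue)
print(prices[best], revenue[best])  # optimal rate and the revenue it yields
```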

10. Inventory Management


Inventory management models help businesses maintain optimal inventory levels,
minimizing holding costs while ensuring that products are available to meet customer
demand.
 Model Types:
o Inventory Control Models: Economic Order Quantity (EOQ), Just-in-Time
(JIT).
o Predictive Models: Time Series Forecasting, Regression Analysis.
Example:
Retail: A retail company uses an inventory management model to determine the
optimal reorder points and quantities for each product. By analyzing sales data, lead times,
and holding costs, the model ensures that the company maintains adequate inventory levels
without overstocking, thereby reducing inventory costs and improving cash flow.
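A minimal Economic Order Quantity (EOQ) computation, Q* = sqrt(2DS/H); the annual demand
D, per-order cost S, and per-unit holding cost H are hypothetical.

```python
# Classic EOQ formula: balances ordering cost against holding cost.
import math

D = 12_000  # units demanded per year
S = 50.0    # fixed cost per order
H = 2.5     # holding cost per unit per year

eoq = math.sqrt(2 * D * S / H)
print(round(eoq))  # order roughly 693 units per replenishment
```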

Databases & Types of Data and Variables:

In data analytics, databases, types of data, and variables play a foundational role.
Understanding these concepts is essential for effectively managing and analyzing data.

1. Databases in Data Analytics

A database is an organized collection of data, typically structured to be easily
accessible, manageable, and updated. In data analytics, databases are crucial for storing,
retrieving, and manipulating data. The choice of a database depends on the nature of the data,
the scale of the analysis, and the specific requirements of the business.

Types of Databases

1. Relational Databases (RDBMS):

Store data in tables (relations) with predefined schemas. Data is organized in rows and
columns, with each table representing an entity (e.g., customers, orders).

Examples: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server.

Use Case: Storing structured data such as customer records, transaction logs, and product
catalogs.

Query Language: SQL (Structured Query Language).

2. NoSQL Databases:

Designed to handle unstructured or semi-structured data with flexible schemas. These
databases are often used for big data applications.

o Types:
 Document Stores: Store data as documents (e.g., JSON, BSON).

 Examples: MongoDB, CouchDB.
 Use Case: Content management systems, user profiles, and log
data.
 Key-Value Stores: Store data as key-value pairs.
 Examples: Redis, DynamoDB.
 Use Case: Caching, session management, real-time data
processing.
 Column Stores: Store data in columns rather than rows.
 Examples: Cassandra, HBase.
 Use Case: Analytical queries, time-series data, large-scale data
processing.
 Graph Databases: Store data in graph structures, representing
relationships between entities.
 Examples: Neo4j, Amazon Neptune.
 Use Case: Social networks, fraud detection, recommendation
engines.
3. Data Warehouses:

Centralized repositories that store integrated data from multiple sources, optimized for
query and analysis.

o Examples: Amazon Redshift, Google BigQuery, Snowflake.


o Use Case: Enterprise-wide analytics, business intelligence, large-scale data
reporting.
4. Data Lakes:

Store raw, unprocessed data in its native format (structured, semi-structured, or
unstructured) until it's needed for analysis.

o Examples: Hadoop Distributed File System (HDFS), Amazon S3 (with AWS
Lake Formation).
o Use Case: Big data analytics, machine learning, storing diverse data types for
future use.

2. Types of Data in Data Analytics

Data can be classified into different types based on its nature and structure. This
classification is essential for selecting appropriate analytical methods and tools.

Types of Data

1. Structured Data:

Data that is organized in a tabular format with rows and columns, making it easy to
search, filter, and analyze.

o Examples:
 Customer Database: Contains fields like customer ID, name, email, and
purchase history.
 Transaction Logs: Contains transaction ID, date, amount, and product
details.
o Storage: Relational databases (RDBMS).
o Use Case: Traditional business applications like CRM systems, ERP systems,
and financial records.

2. Unstructured Data:

Data that doesn’t have a predefined structure, making it more challenging to analyze
directly.

o Examples:
 Text Documents: Emails, reports, social media posts.
 Multimedia: Images, videos, audio files.
 Logs: Server logs, application logs.
o Storage: NoSQL databases, data lakes.
o Use Case: Sentiment analysis, natural language processing, image and video
analysis.
3. Semi-Structured Data:

Data that doesn’t conform to a rigid structure but still contains tags or markers to separate
elements.

o Examples:
 XML/JSON Files: Used for data exchange between systems.
 HTML: Web pages with embedded data.
o Storage: NoSQL databases (document stores), data lakes.
o Use Case: Data integration, API responses, web scraping.
4. Time-Series Data:

Data points collected or recorded at specific time intervals, often used for tracking
changes over time.

o Examples:
 Stock Prices: Daily closing prices of stocks.
 Sensor Data: Temperature readings from IoT devices.
 Sales Data: Daily or monthly sales figures.
o Storage: Specialized time-series databases (e.g., InfluxDB), columnar
databases, data lakes.
o Use Case: Trend analysis, forecasting, monitoring and anomaly detection.
5. Spatial Data:

Data that represents the physical location and shape of objects in space.

o Examples:
 GIS Data: Geographic coordinates, maps.
 GPS Data: Location tracking of vehicles or individuals.
o Storage: Spatial databases (e.g., PostGIS), NoSQL databases with spatial
capabilities.
o Use Case: Geographic information systems (GIS), location-based services,
mapping applications.

3. Types of Variables in Data Analytics

Variables are attributes or characteristics that represent different aspects of the data. They are
fundamental in building models and performing analysis.

Types of Variables

1. Numerical Variables:

Variables that represent quantitative values or measurements.

o Types:
 Continuous Variables: Can take any value within a range (e.g., height,
weight, temperature).
 Example: The price of a product, measured in dollars.
 Discrete Variables: Can take only specific, distinct values (e.g., count of
items).
 Example: The number of products sold in a day.
o Use Case: Statistical analysis, regression modeling, forecasting.
2. Categorical Variables:

Variables that represent qualitative attributes or categories.

o Types:
 Nominal Variables: Categories without a specific order (e.g., gender,
color).
 Example: Customer’s preferred payment method (e.g., credit card,
cash, PayPal).
 Ordinal Variables: Categories with a meaningful order (e.g., satisfaction
level, ranking).
 Example: Customer satisfaction rating (e.g., poor, fair, good,
excellent).
o Use Case: Classification models, market segmentation, decision trees.

3. Binary Variables:

A special type of categorical variable with only two categories.

o Examples:
 Yes/No: Whether a customer made a purchase (yes or no).
 True/False: Whether a transaction was fraudulent (true or false).
o Use Case: Logistic regression, binary classification tasks.
4. Ordinal Variables:

Variables that have a clear ordering but no fixed interval between the values.

o Examples:
 Education Level: High school, bachelor’s, master’s, doctorate.
 Customer Satisfaction: Very dissatisfied, dissatisfied, neutral, satisfied,
very satisfied.
o Use Case: Ordinal regression, ranking analysis.
5. Time Variables:

Variables that represent a specific point in time or duration.

o Examples:
 Date/Time: Date of purchase, timestamp of a transaction.
 Duration: Time taken to complete a task or delivery time.
o Use Case: Time series analysis, trend analysis, forecasting.
6. Derived Variables:

Variables created from existing variables through mathematical transformations,
aggregations, or combinations.

o Examples:
 Total Spend: Sum of all purchases made by a customer.
 Age: Derived from the customer’s date of birth.
o Use Case: Feature engineering, data transformation, improving model accuracy.
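A small sketch mapping the variable types above onto pandas dtypes; the customer records
are invented.

```python
# One column per variable type: numeric, nominal, ordinal, binary, time, derived.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, 51],                           # numerical (discrete)
    "spend": [120.50, 80.00, 310.75],              # numerical (continuous)
    "gender": pd.Categorical(["F", "M", "F"]),     # nominal categorical
    "satisfaction": pd.Categorical(
        ["good", "poor", "fair"],
        categories=["poor", "fair", "good"], ordered=True),  # ordinal
    "purchased": [True, False, True],              # binary
    "signup": pd.to_datetime(["2023-01-05", "2023-03-10", "2023-07-22"]),  # time
})
df["tenure_days"] = (pd.Timestamp("2024-01-01") - df["signup"]).dt.days  # derived
print(df.dtypes)
```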

Data Modeling Techniques:

What is Data Modeling?


Data modeling is the process of analyzing data objects and their relationships to other
objects. It is used to identify the data requirements of business processes. Data models are
created for the data to be stored in a database. A data model's main focus is on what data is
needed and how it should be organized, rather than on what operations will be performed.
A data model is essentially an architect's building plan: it documents a complex software
system's design as a diagram that can be easily understood. The diagram is created using text
and symbols to represent how the data will flow. It is also known as the blueprint for
constructing new software or re-engineering an existing application.
There are three types of data models:
1. Conceptual Model
2. Logical Model
3. Physical Data Model

1. Conceptual Model
The conceptual data model is a view of the data that is required to support business
processes. It also keeps track of business events and related performance measures. The
conceptual model defines what the system contains, focusing on the data used in a business
rather than on processing flow. The main purpose of this data model is to organize and define
business rules and concepts. For example, it helps business people view data such as market
data, customer data, and purchase data.
2. Logical Model
The logical data model maps out the rules and data structures, including the required
tables and columns. Data architects and business analysts create the logical model, which can
later be translated into a database design. The logical model is always present in the root
package object and forms the base for the physical model. In this model, no primary or
secondary keys are defined.
3. Physical Data Model
In a physical data model, the implementation is described using a specific database
system. It defines all the components and services required to build a database and is created
using the database language and queries. The physical data model represents each table and
column along with constraints such as primary key, foreign key, and NOT NULL. Its main
purpose is to create the actual database. This model is created by database administrators
(DBAs) and developers. It describes a particular implementation of the data model, including
the column keys, constraints, and other RDBMS features needed to create the schema.

What are the Data Modeling types?

Given below are five different types of data models used to organize data:
1. Hierarchical Model
The hierarchical model is a tree-like structure: there is one root (parent) node, and the
other child nodes are sorted in a particular order. The hierarchical model is very rarely used
now, but it can represent real-world hierarchical relationships.
2. Object-oriented Model
The object-oriented approach creates objects that contain stored values. The
object-oriented model supports data abstraction, inheritance, and encapsulation.
3. Network Model
The network model provides a flexible way of representing objects and the
relationships between them. Its schema represents the data in the form of a graph: an object
is represented as a node and a relationship as an edge, enabling a record to maintain multiple
parent and child records in a generalized manner.
4. Entity-relationship Model
The ER model (entity-relationship model) is a high-level relational model used to
define data elements and the relationships among the entities in a system. This conceptual
design provides a better view of the data, making it easier to understand. In this model, the
entire database is represented in a diagram called an entity-relationship diagram, consisting of
entities, attributes, and relationships.

5. Relational Model
The relational model describes the different relationships between entities, such as
one-to-one and one-to-many.

Missing Imputations:

Missing data imputation is a critical step in data preprocessing, addressing gaps in datasets
where some values are absent. Proper handling of missing data ensures the accuracy and
reliability of statistical analyses and machine learning models.

Types of Missing Data

1. Missing Completely at Random (MCAR):

The probability of missing data on a variable is independent of both observed and
unobserved data.

Example: Missing data on a survey question where the missingness is unrelated to the
respondent's characteristics.

2. Missing at Random (MAR):

The probability of missing data on a variable is related to observed data but not the
missing data itself.

Example: Missing income data based on the respondent's age, where the missingness is
related to age but not to the income itself.

3. Missing Not at Random (MNAR):

The probability of missing data is related to the value of the missing data itself.

Example: High-income individuals are less likely to report their income, making income
data missing due to its own value.

Imputation Techniques

1. Mean/Median/Mode Imputation:

Replaces missing values with the mean or median (for continuous data) or the mode (for
categorical data) of the observed values.

Example: Replacing missing age values in a dataset with the average age of observed
individuals.
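A minimal mean/mode imputation sketch with pandas; the gaps and values are invented.

```python
# Fill a numeric column with its mean and a categorical column with its mode.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "city": ["Hyderabad", "Delhi", None, "Hyderabad"]})

df["age"] = df["age"].fillna(df["age"].mean())        # mean for continuous data
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode for categorical data
print(df)
```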

2. K-Nearest Neighbors (KNN) Imputation:

Imputes missing values using the values from the K-nearest neighbors based on a distance
metric.

Example: Imputing missing values in a dataset by averaging the values from the nearest K
neighbors with similar attributes.
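A minimal sketch of scikit-learn's KNNImputer, which fills each gap from the most
similar rows; the matrix is invented.

```python
# Impute the missing value from the two nearest neighbors' values.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [8.0, 16.0]])
print(KNNImputer(n_neighbors=2).fit_transform(X))
```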

3. Regression Imputation:

Uses a regression model to predict missing values based on other variables.

Example: Using a regression model to predict missing income based on age, education level,
and job type.
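A minimal regression-based imputation sketch using scikit-learn's (still experimental)
IterativeImputer, which models each incomplete feature as a regression on the others;
the ages and incomes are invented.

```python
# Predict the missing income from the observed age/income relationship.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[20, 40_000], [30, np.nan], [40, 80_000], [50, 100_000]])
print(IterativeImputer(random_state=0).fit_transform(X))
```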

4. Multiple Imputation:

Creates multiple imputed datasets using a statistical model, performs analysis on each
dataset, and combines results to account for uncertainty in imputation.

Example: Imputing missing values in a survey dataset with multiple imputation to analyze
patterns across several imputed datasets.

5. Expectation-Maximization (EM) Algorithm:

Uses an iterative process to estimate missing values by maximizing the likelihood of the
observed data.

Example: Using EM to estimate missing values in a dataset with missing entries under a
multivariate normal distribution.

6. Interpolation:

Estimates missing values based on the values of neighboring data points in a sequence or
time series.

Example: Filling missing time points in a temperature dataset using linear or spline
interpolation.
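A minimal time-series interpolation sketch with pandas; the hourly temperature readings
are invented.

```python
# Linearly interpolate two missing hourly readings between 21.0 and 24.0.
import numpy as np
import pandas as pd

temps = pd.Series([21.0, np.nan, np.nan, 24.0],
                  index=pd.date_range("2024-06-01", periods=4, freq="h"))
print(temps.interpolate(method="linear"))  # fills 22.0 and 23.0
```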

7. Data Augmentation:

Generates new data points based on existing data to supplement missing values, often
used in conjunction with other methods.

Example: Generating synthetic data points for missing values in a dataset using techniques
like SMOTE (Synthetic Minority Over-sampling Technique).

Choosing the Right Imputation Method

 Data Type: Consider the type of data (continuous, categorical, time-series) when
selecting an imputation method.

 Missing Data Pattern: Assess whether the data is MCAR, MAR, or MNAR to choose
an appropriate technique.
 Computational Resources: Some methods, like multiple imputation and EM, may be
computationally intensive.
 Analysis Objectives: Align the imputation method with the goals of the analysis, such
as prediction accuracy or data completeness.

Example Scenarios

1. Healthcare Dataset:
o Situation: Missing values in patient records for certain health indicators.
o Imputation Method: Use mean imputation for continuous variables and mode
imputation for categorical variables, or employ KNN for more complex
relationships.
2. Financial Transactions:
o Situation: Missing values in transaction amounts due to incomplete records.
o Imputation Method: Use regression imputation based on other transaction
characteristics, or apply multiple imputation to account for uncertainty.
3. Time Series Data:
o Situation: Missing temperature readings in a climate dataset.
o Imputation Method: Apply interpolation techniques, such as linear or spline
interpolation, to estimate missing temperature values.

Need for Business Modeling:

Business modeling is a crucial aspect of data analytics that involves creating abstract
representations of business processes, systems, and relationships. It helps organizations make
informed decisions, optimize operations, and achieve strategic goals.

1. Enhanced Decision-Making

Business modeling provides a structured approach to understanding complex business
processes and relationships, enabling data-driven decision-making.

Example: A retail company uses a sales forecasting model to predict future sales and
optimize inventory levels, ensuring they meet customer demand without overstocking.

2. Improved Operational Efficiency

Business models help identify inefficiencies and optimize business processes by
providing a clear picture of how different components interact.

Example: A manufacturing firm uses process simulation models to streamline production
workflows, reducing lead times and costs.

3. Strategic Planning and Forecasting

Business modeling supports long-term planning and forecasting by analyzing historical
data and predicting future trends.

Example: A financial institution uses predictive modeling to forecast market trends and plan
investment strategies.

4. Risk Management and Mitigation

Business models help in identifying and managing risks by simulating different scenarios
and assessing potential impacts.

Example: An insurance company uses risk models to assess the likelihood of claims and set
appropriate premiums.

5. Customer Insights and Personalization

Business modeling enables a deeper understanding of customer behavior and preferences,
facilitating personalized marketing and customer engagement.

Example: An e-commerce platform uses customer segmentation models to create
personalized recommendations and marketing campaigns.

6. Product and Service Development

Business models assist in the development and optimization of products and services by
analyzing customer needs, market conditions, and competitive landscape.

Example: A tech company uses market analysis models to design a new product that
addresses customer pain points and trends.

7. Financial Management and Analysis

Business modeling supports financial analysis and management by providing insights into
revenue, costs, profitability, and investment opportunities.

Example: A corporation uses financial modeling to analyze different investment options and
their potential returns.

8. Enhanced Communication and Alignment

Business models provide a common framework for communicating insights and aligning
stakeholders with organizational goals.

Example: A project manager uses business models to present project status and forecasts to
executive leadership, ensuring alignment with strategic objectives.

9. Competitive Advantage

Effective business modeling can provide a competitive edge by leveraging data to make
smarter business decisions and respond to market changes.

Example: A company uses competitive analysis models to identify market trends and
develop strategies to outperform competitors.
