DA Unit II
1. Descriptive Analytics
Descriptive analytics answers the question, "What happened?" by summarizing historical
data to understand trends, patterns, and relationships. It involves data aggregation,
summarization, and visualization.
Example:
o Sales Reporting: A retail company uses descriptive analytics to generate daily
sales reports. These reports summarize the total sales, number of transactions,
and average transaction value across different stores. By examining these
reports, the company can identify peak sales periods, popular products, and
regional differences in sales performance.
Tools Commonly Used:
o Dashboards (e.g., Tableau, Power BI)
o Reports (e.g., Excel, Google Data Studio)
2. Diagnostic Analytics
Diagnostic analytics answers the question, "Why did it
happen?" by drilling down into data to identify causes and correlations. It often involves
more detailed exploration and the use of statistical techniques.
Example:
Customer Churn Analysis: A telecommunications company uses diagnostic analytics to
understand why certain customers are canceling their subscriptions. By analyzing data such
as customer service interactions, billing history, and usage patterns, the company identifies
factors that are highly correlated with churn, such as frequent billing issues or poor service
coverage in certain areas.
Tools Commonly Used:
o Statistical software (e.g., SPSS, R)
o Data mining tools (e.g., SAS, RapidMiner)
3. Predictive Analytics
Predictive analytics answers the question, "What is likely to happen?" by using historical
data to make forecasts or predictions about future outcomes. It involves techniques like
machine learning, statistical modeling, and data mining.
Example:
Demand Forecasting: An e-commerce company uses predictive analytics to forecast
demand for products during the holiday season. By analyzing historical sales data,
promotional activities, and external factors like economic conditions, the company predicts
which products will be in high demand. This allows the company to optimize inventory levels
and reduce the risk of stockouts or overstocking.
Tools Commonly Used:
o Machine learning platforms (e.g., Python with scikit-learn, TensorFlow)
o Forecasting tools (e.g., ARIMA models in R, Prophet by Facebook)
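To make the forecasting step concrete, here is a minimal sketch in Python using statsmodels; the monthly sales series, the ARIMA(1, 1, 1) order, and the 12-month horizon are illustrative assumptions, not a tuned model.

```python
# Illustrative demand-forecasting sketch on a synthetic monthly sales series.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic data: three years of monthly sales with trend and seasonality (assumed values).
idx = pd.date_range("2021-01-01", periods=36, freq="MS")
sales = pd.Series(
    1000 + 10 * np.arange(36) + 150 * np.sin(2 * np.pi * np.arange(36) / 12)
    + np.random.default_rng(0).normal(0, 30, 36),
    index=idx,
)

# Fit a simple ARIMA model; the (1, 1, 1) order is an assumption, not a tuned choice.
fitted = ARIMA(sales, order=(1, 1, 1)).fit()

# Forecast the next 12 months to support inventory planning.
print(fitted.forecast(steps=12).round(1))
```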
4. Prescriptive Analytics
Prescriptive analytics answers the question, "What should we do?" by providing
recommendations for actions that can optimize outcomes. It involves the use of optimization
techniques, simulation models, and machine learning algorithms.
Example:
Supply Chain Optimization: A manufacturing company uses prescriptive analytics to
optimize its supply chain operations. By integrating predictive models with optimization
algorithms, the company determines the best sourcing strategy, production schedule, and
logistics routes to minimize costs while meeting customer demand. The system might
recommend adjusting order quantities or rerouting shipments in real time based on current
inventory levels and demand forecasts.
Tools Commonly Used:
o Optimization software (e.g., IBM CPLEX, Gurobi)
o Simulation tools (e.g., AnyLogic, Arena)
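As a hedged illustration of the optimization step (not a full supply-chain system), the sketch below uses scipy.optimize.linprog to choose shipment quantities from two plants to two regions at minimum cost; all costs, capacities, and demands are invented numbers.

```python
# Toy supply-chain optimization: ship from 2 plants to 2 regions at minimum cost.
from scipy.optimize import linprog

# Decision variables: x = [p1->r1, p1->r2, p2->r1, p2->r2] (units shipped).
cost = [4, 6, 5, 3]                      # shipping cost per unit (assumed)

# Capacity constraints: each plant ships at most its capacity (A_ub x <= b_ub).
A_ub = [[1, 1, 0, 0],                    # plant 1 total shipments
        [0, 0, 1, 1]]                    # plant 2 total shipments
b_ub = [80, 70]                          # plant capacities (assumed)

# Demand constraints: each region receives exactly its demand (A_eq x = b_eq).
A_eq = [[1, 0, 1, 0],                    # region 1 receives
        [0, 1, 0, 1]]                    # region 2 receives
b_eq = [60, 50]                          # regional demand (assumed)

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 4)
print(res.x, res.fun)                    # optimal shipment plan and total cost
```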
Introduction to Tools and Environment:
Data analytics involves extracting valuable insights from data, which requires a robust
set of tools and a well-configured environment. The choice of tools and environment depends
on the type of data, the complexity of the analysis, and the specific goals of the project.
1. Programming Languages
Programming languages form the backbone of data analytics, allowing analysts and data
scientists to manipulate data, perform calculations, and build models.
Python:
Python is the most widely used language for data analytics due to its simplicity and extensive
libraries.
Libraries:
Pandas: For data manipulation and analysis.
NumPy: For numerical computations and handling large datasets.
Matplotlib and Seaborn: For data visualization.
Scikit-learn: For machine learning and predictive modeling.
Use Case: Data cleaning, exploratory data analysis, machine learning model development.
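A minimal pandas sketch of the cleaning and exploratory steps listed above, on a small invented transactions table:

```python
# Minimal data-cleaning / exploratory-analysis sketch with pandas (invented records).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "store":   ["North", "North", "South", "South", "South"],
    "product": ["shirt", "shirt", "jeans", "jeans", "shoes"],
    "amount":  [25.0, 25.0, 40.0, np.nan, 60.0],
})

df = df.drop_duplicates()                                   # remove exact duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())   # fill missing amounts

# Quick summaries: overall statistics and revenue by store.
print(df.describe())
print(df.groupby("store")["amount"].agg(["count", "sum", "mean"]))
```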
R:
R is a powerful language for statistical analysis and visualization.
Libraries:
ggplot2: For creating advanced visualizations.
dplyr: For data manipulation.
caret: For machine learning.
Shiny: For building interactive web applications.
Use Case: Statistical analysis, hypothesis testing, creating detailed plots.
SQL:
o Description: Structured Query Language (SQL) is essential for querying and
managing data stored in relational databases.
o Use Case: Extracting, transforming, and loading (ETL) data from databases,
performing complex queries.
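The sketch below runs a simple aggregation query against an in-memory SQLite database from Python; the orders table and its columns are invented for illustration.

```python
# Illustrative SQL aggregation against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (store, amount) VALUES (?, ?)",
    [("North", 120.0), ("North", 80.5), ("South", 200.0)],
)

# Total and average order value per store, largest total first.
for row in conn.execute(
    "SELECT store, COUNT(*), SUM(amount), AVG(amount) "
    "FROM orders GROUP BY store ORDER BY SUM(amount) DESC"
):
    print(row)
conn.close()
```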
2. Data Visualization Tools
Visualization tools are used to create charts, graphs, and dashboards that make data more
understandable and actionable.
Tableau:
A leading data visualization tool that enables the creation of interactive and shareable
dashboards.
o Features: Drag-and-drop interface, integration with various data sources, real-time data analysis.
o Use Case: Creating dashboards for business intelligence, interactive data
exploration.
Power BI:
Microsoft’s powerful business analytics tool that allows users to visualize and share insights.
o Features: Integration with Microsoft products, real-time analytics, natural
language queries.
o Use Case: Business reporting, interactive data dashboards, sharing insights
across organizations.
Matplotlib/Seaborn (Python) & ggplot2 (R):
Libraries for creating static, animated, and interactive visualizations in Python and R.
o Use Case: Custom visualizations in data science projects, exploratory data
analysis.
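A short visualization sketch with matplotlib and seaborn, plotting an invented sales-by-region table:

```python
# Minimal visualization sketch: bar chart of sales by region (invented data).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "sales":  [240, 310, 180, 275],
})

sns.barplot(data=df, x="region", y="sales")
plt.title("Sales by region")
plt.tight_layout()
plt.savefig("sales_by_region.png")   # or plt.show() in an interactive session
```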
6. Cloud Platforms
Cloud platforms provide scalable and flexible environments for data storage, processing, and
analytics.
Amazon Web Services (AWS):
o Tools: Amazon S3 (storage), Redshift (data warehousing), SageMaker (machine
learning).
o Use Case: Scalable cloud-based analytics, data storage, machine learning model
deployment.
Google Cloud Platform (GCP):
o Tools: BigQuery (data warehousing), AutoML (machine learning), Dataflow
(data processing).
o Use Case: Real-time data analytics, big data processing, AI-driven insights.
Microsoft Azure:
o Tools: Azure Data Lake (storage), Azure Machine Learning, Azure Synapse
Analytics.
o Use Case: End-to-end data analytics solutions, integrating with Microsoft
ecosystem.
7. Data Integration Tools
Talend:
o Description: An open-source data integration platform that allows for data
migration, profiling, and quality management.
o Use Case: Data integration, data quality management, ETL processes.
Apache NiFi:
o Description: A tool for automating the flow of data between software systems,
with real-time data processing and integration.
o Use Case: Data ingestion from various sources, real-time data flow
management.
Informatica PowerCenter:
o Description: An enterprise data integration platform used to extract, transform,
and load data.
o Use Case: Large-scale data integration, ETL processes, data warehousing.
Application of Modeling in Business:
Modeling in business involves creating mathematical representations of real-world
processes to predict outcomes, optimize operations, and inform decision-making. Various
types of models are used in data analytics to solve business problems, from forecasting
demand to optimizing supply chains.
1.Sales Forecasting
Sales forecasting models predict future sales based on historical data, market trends, and
other relevant factors. Accurate sales forecasts enable businesses to make informed decisions
about inventory management, staffing, and budgeting.
Model Types:
o Time Series Models: ARIMA, Exponential Smoothing.
o Machine Learning Models: Random Forest, Gradient Boosting Machines.
Example:
Retail Industry: A clothing retailer uses a time series forecasting model to predict
sales during the holiday season. By analyzing past sales data, promotional activities, and
external factors like weather, the model forecasts demand for each product category. This
helps the retailer optimize inventory levels, reducing the risk of stockouts or overstocking.
2. Customer Segmentation
Customer segmentation models group customers based on shared characteristics, such as
demographics, behavior, or purchasing patterns. This allows businesses to tailor marketing
strategies and improve customer engagement.
Model Types:
o Clustering Models: K-Means, Hierarchical Clustering.
o Classification Models: Logistic Regression, Decision Trees.
Example:
Telecommunications: A telecom company uses clustering models to segment its
customer base into distinct groups based on usage patterns, demographics, and service
preferences. The company then creates targeted marketing campaigns for each segment,
offering customized plans and promotions to improve customer retention and satisfaction.
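A small K-Means segmentation sketch with scikit-learn; the usage features and the choice of four segments are assumptions made for illustration.

```python
# Toy customer-segmentation sketch with K-Means on synthetic usage data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical features: monthly voice minutes, data usage (GB), monthly bill.
rng = np.random.default_rng(42)
X = rng.normal(loc=[300, 5, 40], scale=[100, 2, 10], size=(200, 3))

# Scale features so no single unit dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Four segments is an assumed choice; in practice it would be validated (e.g., elbow method).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(kmeans.labels_))   # number of customers in each segment
```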
3. Risk Management
Risk management models assess the likelihood of adverse events, such as loan defaults or
market downturns, and estimate their potential impact on the business. These models help
businesses mitigate risks and make more informed decisions.
Model Types:
o Credit Scoring Models: Logistic Regression, Support Vector Machines.
o Market Risk Models: Value at Risk (VaR), Monte Carlo Simulation.
Example:
Banking: A bank uses a credit scoring model to evaluate the risk of lending to
individual customers. By analyzing factors such as credit history, income level, and
employment status, the model predicts the probability of default. The bank uses this
information to approve or reject loan applications and to set appropriate interest rates.
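A minimal credit-scoring sketch with logistic regression in scikit-learn; the applicant features and default labels are synthetic.

```python
# Toy credit-scoring sketch: predict probability of default from synthetic features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 500
income = rng.normal(50_000, 15_000, n)          # annual income (assumed)
credit_history = rng.integers(0, 10, n)         # years of credit history (assumed)
X = np.column_stack([income, credit_history])

# Synthetic label: lower income and shorter history raise default risk.
logit = (income - 40_000) / 10_000 + credit_history / 5
default = (rng.random(n) < 1 / (1 + np.exp(logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, default, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of default for new applicants drives approve/reject and pricing decisions.
print(model.predict_proba(X_te[:5])[:, 1].round(3))
```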
4. Production and Supply Chain Optimization
Example:
Manufacturing: A manufacturing company uses a linear programming model to
optimize its production schedule and minimize costs. The model considers factors like
production capacity, raw material availability, and delivery deadlines. As a result, the
company reduces production costs, minimizes inventory holding costs, and ensures timely
delivery of products to customers.
5. Marketing Mix Modeling
Example:
Consumer Goods: A consumer goods company uses a marketing mix model to assess
the impact of different marketing activities on sales. By analyzing historical data on
advertising spend, pricing, and promotions, the model identifies which activities drive the
most sales. The company uses this insight to allocate its marketing budget more effectively,
focusing on the channels that provide the highest return on investment.
6. Churn Prediction
Churn prediction models identify customers who are likely to leave a service or stop
purchasing a product. These models enable businesses to take proactive measures to retain
customers, such as targeted offers or personalized communication.
Model Types:
o Classification Models: Logistic Regression, Random Forest.
o Survival Analysis: Cox Proportional Hazards Model.
Example:
Subscription Services: A subscription-based video streaming service uses a churn prediction
model to identify customers at risk of canceling their subscriptions. By analyzing factors like
viewing patterns, customer support interactions, and payment history, the model predicts
churn risk. The service then targets at-risk customers with personalized offers or incentives to
encourage them to stay.
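A compact churn-prediction sketch using a random forest classifier; the behavioural features and churn labels are synthetic stand-ins.

```python
# Toy churn-prediction sketch with a random forest on synthetic behavioural features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 1000
hours_watched = rng.gamma(2.0, 5.0, n)            # weekly viewing hours (assumed)
support_tickets = rng.poisson(0.5, n)             # support contacts last month (assumed)
failed_payments = rng.poisson(0.2, n)             # failed payments last quarter (assumed)
X = np.column_stack([hours_watched, support_tickets, failed_payments])

# Synthetic label: low engagement and payment/support friction raise churn probability.
p_churn = 1 / (1 + np.exp(0.3 * hours_watched - support_tickets - 2 * failed_payments))
y = (rng.random(n) < p_churn).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank customers by predicted churn risk and pick the top 10 for a retention offer.
risk = model.predict_proba(X)[:, 1]
top_at_risk = np.argsort(risk)[::-1][:10]
print(top_at_risk, risk[top_at_risk].round(2))
```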
7. Fraud Detection
Fraud detection models identify suspicious activities that may indicate fraudulent
behavior. These models are particularly valuable in industries like banking, insurance, and e-commerce, where fraud can have significant financial impacts.
Model Types:
o Anomaly Detection Models: Isolation Forest, One-Class SVM.
o Supervised Learning Models: Logistic Regression, Neural Networks.
Example:
Banking: A bank uses a fraud detection model to monitor transactions in real-time.
The model flags unusual activities, such as large transactions from an unfamiliar location,
which could indicate fraudulent activity. The bank can then block the transaction and contact
the customer for verification, reducing the risk of fraud.
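A minimal anomaly-detection sketch with Isolation Forest; the transaction features and the assumed fraud rate are invented.

```python
# Toy fraud-detection sketch: flag anomalous transactions with Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
# Normal transactions: modest amounts close to the customer's home location.
normal = np.column_stack([rng.normal(60, 20, 980), rng.normal(5, 2, 980)])
# A few unusual ones: very large amounts far from home (assumed pattern).
unusual = np.column_stack([rng.normal(2500, 500, 20), rng.normal(800, 100, 20)])
X = np.vstack([normal, unusual])   # columns: amount, distance from home (km)

# contamination is the assumed share of fraudulent transactions.
clf = IsolationForest(contamination=0.02, random_state=0).fit(X)
flags = clf.predict(X)             # -1 = flagged as anomalous, 1 = normal
print("flagged:", int((flags == -1).sum()), "of", len(X))
```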
8. Product Recommendation
Product recommendation models suggest products or services to customers based on their
preferences, past behavior, and similar customer profiles. These models enhance the customer
experience and drive sales.
Model Types:
o Collaborative Filtering: User-based or Item-based.
o Content-Based Filtering: Based on product attributes.
Example:
E-commerce: An online retailer uses a product recommendation model to suggest
products to customers based on their browsing and purchase history. The model analyzes
patterns across customers with similar preferences and recommends products that the
customer is likely to be interested in. This personalization increases the likelihood of
purchase and enhances the customer’s shopping experience.
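A small item-based collaborative-filtering sketch using cosine similarity over a toy ratings matrix; the customers, products, and ratings are invented.

```python
# Toy item-based collaborative filtering with cosine similarity (invented ratings).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = customers, columns = products; 0 means "not rated / not purchased".
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
    [0, 0, 0, 4, 5],
], dtype=float)

item_sim = cosine_similarity(ratings.T)          # product-to-product similarity

# Score unseen products for customer 0 by a similarity-weighted sum of their ratings.
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf                       # don't recommend already-rated items
print("recommend product index:", int(np.argmax(scores)))
```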
9. Price Optimization
Price optimization models help businesses determine the best pricing strategy to maximize
profits while remaining competitive. These models consider factors like demand elasticity,
competitor pricing, and cost structures.
Model Types:
o Optimization Models: Linear Programming, Conjoint Analysis.
o Machine Learning Models: Random Forest, Gradient Boosting.
Example:
Hospitality: A hotel chain uses a price optimization model to adjust room rates
dynamically based on demand, seasonality, and competitor pricing. The model helps the hotel
maximize revenue by increasing prices during high-demand periods and offering discounts
during low-demand periods.
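A very small price-optimization sketch: a linear demand curve (an assumed relationship estimated elsewhere) is scanned over candidate room rates to find the revenue-maximizing price.

```python
# Toy price optimization: pick the room rate that maximizes expected revenue,
# assuming a simple linear demand curve estimated elsewhere (values are invented).
import numpy as np

prices = np.arange(80, 301, 5)                            # candidate nightly rates ($)
expected_demand = np.clip(200 - 0.6 * prices, 0, None)    # rooms sold per night (assumed)
capacity = 150
rooms_sold = np.minimum(expected_demand, capacity)

revenue = prices * rooms_sold
best = int(np.argmax(revenue))
print(f"best rate: ${prices[best]}, expected revenue: ${revenue[best]:.0f}")
```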
In data analytics, databases, types of data, and variables play a foundational role.
Understanding these concepts is essential for effectively managing and analyzing data.
Types of Databases
1. Relational Databases:
Store data in tables (relations) with predefined schemas. Data is organized in rows and
columns, with each table representing an entity (e.g., customers, orders).
Use Case: Storing structured data such as customer records, transaction logs, and product
catalogs.
2. NoSQL Databases:
o Types:
Document Stores: Store data as documents (e.g., JSON, BSON).
Examples: MongoDB, CouchDB.
Use Case: Content management systems, user profiles, and log
data.
Key-Value Stores: Store data as key-value pairs.
Examples: Redis, DynamoDB.
Use Case: Caching, session management, real-time data
processing.
Column Stores: Store data in columns rather than rows.
Examples: Cassandra, HBase.
Use Case: Analytical queries, time-series data, large-scale data
processing.
Graph Databases: Store data in graph structures, representing
relationships between entities.
Examples: Neo4j, Amazon Neptune.
Use Case: Social networks, fraud detection, recommendation
engines.
3. Data Warehouses:
Centralized repositories that store integrated data from multiple sources, optimized for
query and analysis.
Data can be classified into different types based on its nature and structure. This
classification is essential for selecting appropriate analytical methods and tools.
Types of Data
1. Structured Data:
Data that is organized in a tabular format with rows and columns, making it easy to
search, filter, and analyze.
o Examples:
Customer Database: Contains fields like customer ID, name, email, and
purchase history.
Transaction Logs: Contains transaction ID, date, amount, and product
details.
o Storage: Relational databases (RDBMS).
o Use Case: Traditional business applications like CRM systems, ERP systems,
and financial records.
2. Unstructured Data:
Data that doesn’t have a predefined structure, making it more challenging to analyze
directly.
o Examples:
Text Documents: Emails, reports, social media posts.
Multimedia: Images, videos, audio files.
Logs: Server logs, application logs.
o Storage: NoSQL databases, data lakes.
o Use Case: Sentiment analysis, natural language processing, image and video
analysis.
3. Semi-Structured Data:
Data that doesn’t conform to a rigid structure but still contains tags or markers to separate
elements.
o Examples:
XML/JSON Files: Used for data exchange between systems.
HTML: Web pages with embedded data.
o Storage: NoSQL databases (document stores), data lakes.
o Use Case: Data integration, API responses, web scraping.
4. Time-Series Data:
Data points collected or recorded at specific time intervals, often used for tracking
changes over time.
o Examples:
Stock Prices: Daily closing prices of stocks.
Sensor Data: Temperature readings from IoT devices.
Sales Data: Daily or monthly sales figures.
o Storage: Specialized time-series databases (e.g., InfluxDB), columnar
databases, data lakes.
o Use Case: Trend analysis, forecasting, monitoring and anomaly detection.
5. Spatial Data:
Data that represents the physical location and shape of objects in space.
o Examples:
GIS Data: Geographic coordinates, maps.
GPS Data: Location tracking of vehicles or individuals.
o Storage: Spatial databases (e.g., PostGIS), NoSQL databases with spatial
capabilities.
o Use Case: Geographic information systems (GIS), location-based services,
mapping applications.
Variables are attributes or characteristics that represent different aspects of the data. They are
fundamental in building models and performing analysis.
Types of Variables
1. Numerical Variables:
o Types:
Continuous Variables: Can take any value within a range (e.g., height,
weight, temperature).
Example: The price of a product, measured in dollars.
Discrete Variables: Can take only specific, distinct values (e.g., count of
items).
Example: The number of products sold in a day.
o Use Case: Statistical analysis, regression modeling, forecasting.
2. Categorical Variables:
o Types:
Nominal Variables: Categories without a specific order (e.g., gender,
color).
Example: Customer’s preferred payment method (e.g., credit card,
cash, PayPal).
Ordinal Variables: Categories with a meaningful order (e.g., satisfaction
level, ranking).
Example: Customer satisfaction rating (e.g., poor, fair, good,
excellent).
o Use Case: Classification models, market segmentation, decision trees.
3. Binary Variables:
Variables that can take only two possible values.
o Examples:
Yes/No: Whether a customer made a purchase (yes or no).
True/False: Whether a transaction was fraudulent (true or false).
o Use Case: Logistic regression, binary classification tasks.
4. Ordinal Variables:
Variables that have a clear ordering but no fixed interval between the values.
o Examples:
Education Level: High school, bachelor’s, master’s, doctorate.
Customer Satisfaction: Very dissatisfied, dissatisfied, neutral, satisfied,
very satisfied.
o Use Case: Ordinal regression, ranking analysis.
5. Time Variables:
o Examples:
Date/Time: Date of purchase, timestamp of a transaction.
Duration: Time taken to complete a task or delivery time.
o Use Case: Time series analysis, trend analysis, forecasting.
6. Derived Variables:
o Examples:
Total Spend: Sum of all purchases made by a customer.
Age: Derived from the customer’s date of birth.
o Use Case: Feature engineering, data transformation, improving model accuracy.
Data Modeling Techniques:
1. Conceptual Model
The conceptual data model is a high-level view of the data required to support business processes; it also tracks business events and the related performance measures. The conceptual model defines what the system contains and focuses on the data used in a business rather than on processing flow. Its main purpose is to organize data and define business rules and concepts. For example, it helps business users view data such as market data, customer data, and purchase data.
2. Logical Model
The logical data model maps the rules and data structures for the data required, such as tables, columns, and relationships. Data architects and business analysts create the logical model, which can later be transformed into a database design. The logical model is always present in the root package object and forms the base for the physical model. At this level, no primary or secondary keys are defined.
3. Physical Data Model
The physical data model describes the implementation using a specific database system. It defines all the components and services required to build a database and is expressed in the database's own language and queries. The physical data model represents each table and column, along with constraints such as primary keys, foreign keys, and NOT NULL. Its main purpose is to create the actual database schema. This model is created by the database administrator (DBA) and developers, and it captures the concrete details of columns, keys, constraints, and other RDBMS features.
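As a small, hedged illustration of a physical model, the sketch below creates two SQLite tables with explicit columns, a primary key, a foreign key, and NOT NULL constraints; the schema is invented.

```python
# Physical-model sketch: concrete tables, keys, and constraints in SQLite (invented schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount      REAL NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
""")
print("tables:", [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
conn.close()
```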
What are the Data Modeling types?
Five different types of data models are commonly used to organize data:
1. Hierarchical Model
The hierarchical model is a tree-like structure with one root (parent) node, and the child nodes are sorted in a particular order. Although it can represent real-world one-to-many relationships, the hierarchical model is rarely used today.
2. Object-oriented Model
The object-oriented approach creates objects that contain stored values. The object-oriented model supports data abstraction, inheritance, and encapsulation.
3. Network Model
The network model provides a flexible way of representing objects and the relationships between them. Its schema represents the data in the form of a graph: objects are nodes and relationships are edges, which allows a record to maintain multiple parent and child records in a generalized manner.
4. Entity-relationship Model
The entity-relationship (ER) model is a high-level conceptual model used to define the data elements and relationships for the entities in a system. This conceptual design provides a clearer, easier-to-understand view of the data. The entire database is represented in a diagram called an entity-relationship diagram, consisting of entities, attributes, and relationships.
5. Relational Model
The relational model describes the different relationships between entities, represented as tables. Relationships between entities can take different forms, such as one-to-one, one-to-many, and many-to-many.
Missing Imputations:
Missing data imputation is a critical step in data preprocessing, addressing gaps in datasets
where some values are absent. Proper handling of missing data ensures the accuracy and
reliability of statistical analyses and machine learning models.
Types of Missing Data
1. Missing Completely at Random (MCAR): The probability of missing data is unrelated to both the observed data and the missing values themselves.
Example: Missing data on a survey question where the missingness is unrelated to the respondent's characteristics.
2. Missing at Random (MAR): The probability of missing data on a variable is related to observed data but not to the missing data itself.
Example: Missing income data based on the respondent's age, where the missingness is related to age but not to the income itself.
3. Missing Not at Random (MNAR): The probability of missing data is related to the value of the missing data itself.
Example: High-income individuals are less likely to report their income, making income data missing due to its own value.
Imputation Techniques
1. Mean/Median/Mode Imputation:
Replaces missing values with the mean or median (for continuous data) or the mode (for categorical data) of the observed values.
Example: Replacing missing age values in a dataset with the average age of observed
individuals.
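A short pandas sketch of mean, median, and mode imputation on a toy table with missing entries:

```python
# Mean / median / mode imputation on a toy DataFrame with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "income": [30_000, 45_000, np.nan, 52_000, 61_000],
    "city":   ["Pune", "Delhi", None, "Delhi", "Pune"],
})

df["age"] = df["age"].fillna(df["age"].mean())               # mean for continuous data
df["income"] = df["income"].fillna(df["income"].median())    # median is robust to outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])         # mode for categorical data
print(df)
```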
2. K-Nearest Neighbors (KNN) Imputation:
Imputes missing values using the values from the K nearest neighbors based on a distance metric.
Example: Imputing missing values in a dataset by averaging the values from the nearest K
neighbors with similar attributes.
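The same idea with scikit-learn's KNNImputer; the choice of two neighbours is an assumption.

```python
# KNN imputation: fill gaps from the k most similar rows (k=2 is an assumed choice).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [25.0, 30_000.0],
    [np.nan, 45_000.0],
    [40.0, np.nan],
    [35.0, 52_000.0],
])
print(KNNImputer(n_neighbors=2).fit_transform(X))
```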
3. Regression Imputation:
Predicts missing values using a regression model built from the other variables in the dataset.
Example: Using a regression model to predict missing income based on age, education level,
and job type.
4. Multiple Imputation:
Creates multiple imputed datasets using a statistical model, performs analysis on each
dataset, and combines results to account for uncertainty in imputation.
Example: Imputing missing values in a survey dataset with multiple imputation to analyze
patterns across several imputed datasets.
5. Expectation-Maximization (EM) Imputation:
Uses an iterative process to estimate missing values by maximizing the likelihood of the observed data.
Example: Using EM to estimate missing values in a dataset with missing entries under a
multivariate normal distribution.
6. Interpolation:
Estimates missing values based on the values of neighboring data points in a sequence or
time series.
Example: Filling missing time points in a temperature dataset using linear or spline
interpolation.
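A minimal time-series interpolation sketch with pandas; the hourly temperature readings are invented.

```python
# Filling missing temperature readings in a time series by interpolation.
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="h")
temps = pd.Series([21.0, np.nan, np.nan, 24.0, np.nan, 26.0], index=idx)

print(temps.interpolate(method="time"))   # "linear" and "spline" are other options
```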
7. Data Augmentation:
Generates new data points based on existing data to supplement missing values, often
used in conjunction with other methods.
Example: Generating synthetic data points for missing values in a dataset using techniques
like SMOTE (Synthetic Minority Over-sampling Technique).
Considerations for Choosing an Imputation Technique
Missing Data Pattern: Assess whether the data is MCAR, MAR, or MNAR to choose an appropriate technique.
Computational Resources: Some methods, like multiple imputation and EM, may be
computationally intensive.
Analysis Objectives: Align the imputation method with the goals of the analysis, such
as prediction accuracy or data completeness.
Example Scenarios
1. Healthcare Dataset:
o Situation: Missing values in patient records for certain health indicators.
o Imputation Method: Use mean imputation for continuous variables and mode
imputation for categorical variables, or employ KNN for more complex
relationships.
2. Financial Transactions:
o Situation: Missing values in transaction amounts due to incomplete records.
o Imputation Method: Use regression imputation based on other transaction characteristics, or apply multiple imputation to account for uncertainty.
3. Time Series Data:
o Situation: Missing temperature readings in a climate dataset.
o Imputation Method: Apply interpolation techniques, such as linear or spline
interpolation, to estimate missing temperature values.
Need for Business Modeling:
Business modeling is a crucial aspect of data analytics that involves creating abstract
representations of business processes, systems, and relationships. It helps organizations make
informed decisions, optimize operations, and achieve strategic goals.
1. Enhanced Decision-Making
Example: A retail company uses a sales forecasting model to predict future sales and
optimize inventory levels, ensuring they meet customer demand without overstocking.
Example: A financial institution uses predictive modeling to forecast market trends and plan
investment strategies.
Business models help in identifying and managing risks by simulating different scenarios
and assessing potential impacts.
Example: An insurance company uses risk models to assess the likelihood of claims and set
appropriate premiums.
Example: An e-commerce platform uses customer segmentation models to create
personalized recommendations and marketing campaigns.
Business models assist in the development and optimization of products and services by
analyzing customer needs, market conditions, and competitive landscape.
Example: A tech company uses market analysis models to design a new product that
addresses customer pain points and trends.
Business modeling supports financial analysis and management by providing insights into
revenue, costs, profitability, and investment opportunities.
Example: A corporation uses financial modeling to analyze different investment options and
their potential returns.
Business models provide a common framework for communicating insights and aligning
stakeholders with organizational goals.
Example: A project manager uses business models to present project status and forecasts to
executive leadership, ensuring alignment with strategic objectives.
9. Competitive Advantage
Effective business modeling can provide a competitive edge by leveraging data to make
smarter business decisions and respond to market changes.
Example: A company uses competitive analysis models to identify market trends and
develop strategies to outperform competitors.