Unit 1
Overview of Data Analytics: Types of Data Analysis – Steps in Data Analysis Process – Data
Repositories – ETL process – Roles, Responsibilities and Skill Sets of Data Analysts – Data
Analytic
Data Analytics is the process of analyzing raw data to identify trends, patterns, and
actionable insights to support decision-making. In today’s data-driven world, businesses and
organizations rely on analytics to make strategic, informed decisions that improve efficiency,
reduce costs, and drive growth.
"Data Analytics is the science of examining raw data to uncover meaningful trends, patterns,
and insights. It involves various processes, tools, and techniques to analyze data and draw
conclusions for better decision-making."
1. Improved Decision-Making:
o Data-driven insights allow organizations to make informed, evidence-based
decisions.
o Example: A company identifies the most profitable customer segment and targets
marketing campaigns effectively.
2. Identifying Patterns and Trends:
o Detect patterns that may not be visible manually.
o Example: Retailers analyze purchase data to identify peak buying times and
optimize inventory.
3. Competitive Advantage:
o Businesses that use data analytics can gain an edge over competitors who rely on intuition alone.
o Example: Amazon uses recommendation engines to personalize product
suggestions.
4. Cost Reduction:
o Streamline operations and identify inefficiencies.
o Example: Energy companies use sensor data to monitor and optimize power
consumption.
5. Enhancing Customer Experience:
o Data analytics helps in understanding customer needs and preferences.
o Example: Netflix uses viewing history to recommend personalized content.
1. Data:
Data refers to raw facts and figures. It can be structured (like databases) or
unstructured (text, images, videos).
Types of Data:
Data Analytics can be categorized into five main types, each serving a distinct purpose. These
types range from analyzing historical data to making future predictions and providing
actionable recommendations.
1. Descriptive Analysis
Definition
Descriptive analysis focuses on summarizing historical data to identify patterns and trends. It
answers “What happened?”
Purpose
Techniques
Scenario
Insights
2. Diagnostic Analysis
Definition
Diagnostic analysis identifies why something happened. It explains causes and relationships.
Purpose
Example:
Scenario
A company observes that advertising costs impact sales. Diagnostic analysis finds a strong
correlation between them.
Correlation Analysis:
Correlation Matrix:
          Ad_Spend     Sales
Ad_Spend  1.000000  0.998053
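A correlation matrix like the one above can be produced with pandas. The following is a minimal sketch, assuming a small made-up dataset; the figures and the variable names (df, corr_matrix, ad_sales_corr) are illustrative and will not reproduce the exact value 0.998053.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
df = pd.DataFrame({
    "Ad_Spend": [1000, 1500, 2000, 2500, 3000],
    "Sales": [10500, 15200, 19800, 25500, 30100],
})

# Pearson correlation between every pair of numeric columns
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix)

# Pull out the single coefficient of interest
ad_sales_corr = corr_matrix.loc["Ad_Spend", "Sales"]
print(f"Correlation between Ad_Spend and Sales: {ad_sales_corr:.6f}")
```

A coefficient close to 1 indicates a strong positive linear relationship. It does not by itself prove that advertising causes the change in sales.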
Insights
3. Predictive Analysis
Definition
Predictive analysis uses statistical models and machine learning to forecast future outcomes.
It answers “What will happen?”
Techniques
Example
Scenario
A retail store predicts sales for the next month based on historical data.
4. Prescriptive Analysis
Definition
Example:
A real estate agency wants to predict the price of houses based on their square footage. Given
historical data of house sizes and prices, we can build a linear regression model to forecast
prices for new homes.
Scenario
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
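The house-price scenario above can be sketched with scikit-learn's LinearRegression. This is a minimal illustration, assuming hypothetical training data (the sizes and prices are invented, and deliberately follow an exact straight line so the result is easy to check by hand).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: size in square feet and sale price
square_feet = np.array([[800], [1000], [1200], [1500], [1800], [2000]])
prices = np.array([160000, 200000, 240000, 300000, 360000, 400000])

# Fit a straight line: price = slope * square_feet + intercept
model = LinearRegression()
model.fit(square_feet, prices)

# Forecast the price of a new home (1600 sq ft is an arbitrary example)
new_home = np.array([[1600]])
predicted_price = model.predict(new_home)[0]
print(f"Predicted price for 1600 sq ft: {predicted_price:,.0f}")
```

Real housing data is noisy, so in practice the fitted line would only approximate the observed prices.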
Definition
EDA involves visually exploring and summarizing data to discover patterns and anomalies.
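As a small illustration of spotting anomalies during EDA, the sketch below summarizes a series and flags outliers with the common 1.5 × IQR rule. The data and the name daily_orders are hypothetical, and the IQR rule is one of several possible choices.

```python
import pandas as pd

# Hypothetical daily order counts; the last value is an obvious anomaly
daily_orders = pd.Series([12, 14, 13, 15, 14, 13, 95], name="daily_orders")

# Summary statistics: count, mean, std, min, quartiles, max
summary = daily_orders.describe()
print(summary)

# Flag values outside 1.5 * IQR from the quartiles
q1 = daily_orders.quantile(0.25)
q3 = daily_orders.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = daily_orders[(daily_orders < lower) | (daily_orders > upper)]
print("Outliers:")
print(outliers)
```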
Techniques
Purpose: Clearly define the problem or question to be answered through data analysis.
Why It Matters: A well-defined problem ensures the analysis stays focused and aligned
with the goals.
Example:
o Problem: "What factors influence the sales of a product?"
o Objective: "Predict product sales based on advertising spend."
2. Data Collection
Purpose: Clean the raw data to remove errors, inconsistencies, and missing values,
ensuring quality and reliability.
Steps:
o Remove duplicates.
o Handle missing values (imputation or removal).
o Correct inconsistencies in data (e.g., typos, wrong formats).
o Normalize or scale data.
Tools: Python libraries (Pandas, NumPy), Excel.
Example:
import pandas as pd
# Load Data
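The cleaning steps listed above can be sketched with pandas as follows. The dataset is built inline and is hypothetical, and median imputation and min-max scaling are just one reasonable choice for each step.

```python
import pandas as pd

# Hypothetical raw data with typical quality problems
raw = pd.DataFrame({
    "customer": ["Alice", "Bob", "Bob", "carol ", "Dave"],
    "region": ["North", "south", "south", "North", "SOUTH"],
    "sales": [250.0, 400.0, 400.0, None, 150.0],
})

# 1. Remove exact duplicate rows
clean = raw.drop_duplicates().copy()

# 2. Handle missing values (here: impute with the median)
clean["sales"] = clean["sales"].fillna(clean["sales"].median())

# 3. Correct inconsistencies (stray spaces, mixed capitalization)
clean["customer"] = clean["customer"].str.strip().str.title()
clean["region"] = clean["region"].str.strip().str.title()

# 4. Scale the numeric column to the range [0, 1] (min-max scaling)
s_min, s_max = clean["sales"].min(), clean["sales"].max()
clean["sales_scaled"] = (clean["sales"] - s_min) / (s_max - s_min)

print(clean)
```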
5. Data Transformation
6. Data Modeling
Purpose: Apply statistical or machine learning models to derive insights and predictions.
Techniques:
o Descriptive Models: Summarize historical data.
o Predictive Models: Predict future outcomes (e.g., Linear Regression, Decision
Trees).
o Prescriptive Models: Recommend actions (e.g., Optimization).
Example (Predictive Analysis):
o Predict sales based on advertising spend using Linear Regression:
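A minimal sketch of this predictive model, assuming hypothetical advertising and sales figures (both in thousands). The variable names and the 60-thousand forecast point are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend and resulting sales (in thousands)
ad_spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([26, 44, 66, 84, 105])

# Fit sales = slope * ad_spend + intercept
model = LinearRegression().fit(ad_spend, sales)
slope = model.coef_[0]
intercept = model.intercept_

# R^2 measures how much of the variation in sales the line explains
r_squared = model.score(ad_spend, sales)

# Forecast sales for a planned spend of 60 (thousand)
forecast = model.predict(np.array([[60]]))[0]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.4f}")
print(f"Forecast sales at ad spend 60: {forecast:.1f}")
```

The slope can be read as the expected change in sales per additional unit of advertising spend, under the assumption that the relationship is linear.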
7. Evaluation of Results
Data Repositories
Data repositories are centralized locations where data is stored, managed, and accessed.
They keep data organized, secure, and easily retrievable, and so act as the foundation for
data analysis and processing.
1. Data Warehouse
Definition: A centralized repository for storing structured data from multiple sources,
designed for business intelligence and reporting.
Characteristics:
o Stores large volumes of historical data.
o Optimized for querying and analysis rather than transaction processing.
o Data is extracted from transactional databases (OLTP), processed, and loaded into
the warehouse.
Examples:
o Amazon Redshift
o Google BigQuery
o Snowflake
2. Data Lakes
Definition: A repository that stores raw data in its native format (structured, semi-
structured, and unstructured).
Characteristics:
o Highly scalable and flexible.
o Supports big data frameworks like Apache Spark and Hadoop.
o Allows data scientists to analyze raw data using tools like Python, R, and SQL.
Examples:
o Amazon S3 (AWS Data Lake)
o Azure Data Lake
o Hadoop Distributed File System (HDFS)
3. Relational Databases
Definition: A repository that stores structured data in tables with rows and columns,
enabling relationships between tables.
Characteristics:
o Uses SQL (Structured Query Language) for data access.
o Designed for structured data and transactional operations.
Examples:
o MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server
4. NoSQL Databases
5. Cloud-Based Repositories
Definition: Cloud platforms provide scalable storage solutions for data, accessible over
the internet.
Characteristics:
o Pay-as-you-go pricing.
o Highly scalable and secure.
o Accessible from anywhere.
Examples:
o Amazon Web Services (AWS)
o Google Cloud Storage (GCS)
o Microsoft Azure Blob Storage
6. Data Marts
The ETL process is a critical step in data management and analytics workflows. It involves
extracting data from various sources, transforming it into a usable format, and loading it into a
data repository like a data warehouse or database for analysis and reporting.
What is ETL?
ETL Workflow
Data Sources → Extract → Transform → Load → Data Warehouse → Analysis & Reporting
Steps in ETL Process
1. Extract
The extract step retrieves raw data from various heterogeneous sources like:
Key Tasks:
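import pandas as pd
# Extract: load the raw data (the CSV file name below is a placeholder)
data = pd.read_csv('sales_data.csv')
print("Extracted Data:")
print(data.head())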
2. Transform
The transform step involves cleaning, formatting, and enriching the data to ensure it's ready
for analysis.
Key Tasks:
Data Cleaning: Remove duplicates, handle missing values, correct data types.
Data Aggregation: Summarize and group data (e.g., sum of sales by region).
Normalization/Standardization: Scale numerical data for consistency.
Feature Engineering: Create new features or columns.
Data Validation: Ensure data quality and accuracy.
# Clean Data
data.dropna(inplace=True) # Remove missing values
data['sales'] = data['sales'].astype(float) # Convert data type
# Aggregate Data
sales_by_region = data.groupby('region')['sales'].sum().reset_index()
print("Transformed Data:")
print(sales_by_region)
3. Load
The load step stores the transformed data into a target system where it can be queried and
analyzed.
Target Systems:
Key Tasks:
1. Extract: Pull sales data from the POS system (CSV) and product data from a database.
2. Transform:
o Merge the sales and product data.
o Remove missing or invalid values.
o Aggregate total sales by region.
3. Load: Save the cleaned and transformed data into a data warehouse like Snowflake for
analysis.
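The three steps above can be sketched end to end with pandas. This is an illustration under stated assumptions: the POS export is simulated with an in-memory CSV, and in-memory SQLite databases stand in for both the product database and the data warehouse (a real Snowflake load would use its own connector). All table names, column names, and values are hypothetical.

```python
import io
import sqlite3
import pandas as pd

# --- Extract ---
# Hypothetical POS export, simulated as an in-memory CSV file
pos_csv = io.StringIO(
    "product_id,region,amount\n"
    "1,North,100.0\n"
    "2,North,250.0\n"
    "1,South,\n"
    "3,South,300.0\n"
    "2,South,75.0\n"
)
sales = pd.read_csv(pos_csv)

# Hypothetical product table in a source database (SQLite as a stand-in)
source_db = sqlite3.connect(":memory:")
pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Pen", "Notebook", "Bag"],
}).to_sql("products", source_db, index=False)
products = pd.read_sql_query(
    "SELECT product_id, product_name FROM products", source_db
)

# --- Transform ---
merged = sales.merge(products, on="product_id", how="left")  # merge sources
merged = merged.dropna(subset=["amount"])                    # drop invalid rows
sales_by_region = merged.groupby("region", as_index=False)["amount"].sum()

# --- Load ---
# SQLite again stands in for the target warehouse
warehouse = sqlite3.connect(":memory:")
sales_by_region.to_sql("sales_by_region", warehouse, index=False,
                       if_exists="replace")

# Read the table back to confirm the load
loaded = pd.read_sql_query(
    "SELECT * FROM sales_by_region ORDER BY region", warehouse
)
print(loaded)
```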
1. ETL Tools:
o Apache NiFi
o Talend
o Microsoft SSIS (SQL Server Integration Services)
o Informatica PowerCenter
3. Python Libraries:
o Pandas (Data Transformation)
o SQLAlchemy (Load to Databases)
o PySpark (Big Data ETL)
Data Integration: Combines data from multiple sources into a single repository.
Data Quality: Ensures cleaned and accurate data.
Efficiency: Automates data pipelines for faster decision-making.
Scalability: Supports handling large-scale data.
Data analysts play a crucial role in helping organizations make data-driven decisions. They are
responsible for analyzing and interpreting data to uncover valuable insights, trends, and
patterns. Below is an overview of the roles, responsibilities, and skills required for data
analysts.
A data analyst's role involves several key functions and tasks, depending on the organization
and project. Here are the core roles:
1. Data Collection:
o Collecting raw data from various internal and external sources (e.g., databases,
APIs, flat files, etc.).
o Ensuring the accuracy and quality of the data.
2. Data Cleaning and Preprocessing:
o Cleaning the data to remove noise, duplicates, and errors.
o Ensuring data is in a suitable format for analysis, including handling missing
values, outliers, and inconsistencies.
6. Collaboration:
o Collaborating with different teams (e.g., business, engineering, marketing) to
understand data requirements and provide insights.
o Working closely with data engineers to ensure data pipelines are working
efficiently.
6. Visualization:
o Design and develop visualizations (charts, graphs, heatmaps) to simplify complex
data for better understanding.
o Use tools like Power BI, Tableau, or Matplotlib (in Python) for data visualization.
7. Collaboration:
o Work with cross-functional teams (marketing, finance, operations) to understand
their data needs.
o Act as a liaison between technical teams (e.g., data engineers) and business
stakeholders.
1. Technical Skills
Data Wrangling:
o Python and R: Popular programming languages for data manipulation and
analysis.
o Pandas (Python) and dplyr (R) for data cleaning and preprocessing.
o SQL: Essential for querying and manipulating relational databases.
o Excel: Advanced Excel skills (pivot tables, VLOOKUP, macros) for analyzing and
organizing data.
Data Visualization:
o Matplotlib and Seaborn (Python) for visualizing data.
o Power BI and Tableau: Tools for building interactive dashboards and reports.
o Plotly: For creating interactive and web-based visualizations.
2. Soft Skills
Problem-Solving:
o Ability to identify business problems and translate them into data analysis
questions.
o Creative thinking to find innovative ways to solve data-related problems.
Communication:
o Excellent written and verbal communication skills to convey insights to non-
technical stakeholders.
o Ability to present complex data in an understandable and actionable way.
Attention to Detail:
o High attention to detail to ensure data integrity and accuracy.
o Detect and correct discrepancies in the data during analysis.
Collaboration:
o Teamwork skills to collaborate with data engineers, business teams, and
management.
3. Business Skills
Domain Knowledge:
o Understanding the industry in which the business operates (e.g., finance, retail,
healthcare).
o Ability to ask the right questions and identify the most relevant data to analyze.
Critical Thinking:
o Ability to analyze data from different perspectives and understand the
implications of the findings.
o Evaluate data in terms of its business impact.
1. Extract Data: The analyst extracts data from the POS system, customer database, and
social media platforms.
2. Clean Data: Remove duplicate records, handle missing customer demographic
information, and standardize date formats.
3. Analyze: Perform EDA to discover customer trends and perform a regression analysis to
identify factors influencing purchasing decisions.
4. Visualize: Create dashboards to show purchasing patterns, age group preferences, and
sales by region.
5. Report: Present insights to the marketing team for targeted advertising campaigns.
Data Analytics: An Overview
Data analytics refers to the process of examining and interpreting raw data in order to
draw conclusions, identify patterns and trends, and make data-driven decisions. It
involves various techniques, tools, and methodologies to extract meaningful insights from
data. Data analytics is a critical component of modern businesses as it helps organizations
improve operational efficiency, identify market opportunities, and develop strategies for
growth.
Data analytics is often categorized into four major types based on the purpose and method
used:
1. Descriptive Analytics
This type answers the "what happened" question by summarizing past data and events.
It uses historical data to describe patterns and trends, helping businesses understand
their past performance.
Example: A retail store analyzing sales data from the last year to determine the most
popular products.
Tools/Techniques:
o Pivot Tables
o Summary Statistics (Mean, Median, Mode)
o Data Visualization (Bar charts, Pie charts)
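The descriptive techniques listed above can be tried on a small table with pandas. The sketch below assumes a hypothetical sales table; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "product": ["A", "B", "A", "C", "B", "A"],
    "region": ["North", "North", "South", "South", "South", "North"],
    "units": [10, 4, 7, 3, 4, 12],
})

# Summary statistics
mean_units = sales["units"].mean()
median_units = sales["units"].median()
mode_units = sales["units"].mode().iloc[0]
print(f"mean={mean_units:.2f}, median={median_units}, mode={mode_units}")

# Pivot table: total units per product and region
pivot = sales.pivot_table(index="product", columns="region",
                          values="units", aggfunc="sum", fill_value=0)
print(pivot)

# Most popular product by total units sold
top_product = sales.groupby("product")["units"].sum().idxmax()
print("Most popular product:", top_product)
```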
2. Diagnostic Analytics
This type answers the "why did it happen" question by examining data to uncover
reasons behind trends or patterns. It uses historical data and statistical techniques to
identify relationships and causes.
Example: A company noticing a drop in sales and analyzing customer reviews and
feedback to determine why sales have declined.
Tools/Techniques:
o Correlation analysis
o Hypothesis testing
o Regression analysis
3. Predictive Analytics
Predictive analytics answers the "what could happen" question by using historical data
to predict future outcomes. It uses statistical models and machine learning algorithms
to make forecasts about future trends.
Example: A finance company using past transaction data to predict the likelihood of
loan defaults.
Tools/Techniques:
o Linear Regression
o Decision Trees
o Time Series Forecasting (ARIMA)
o Machine Learning Models (Random Forest, Support Vector Machines)
4. Prescriptive Analytics
Prescriptive analytics answers the "what should we do" question by suggesting actions
or strategies to optimize outcomes. It uses data, algorithms, and simulation models to
recommend the best course of action.
Tools/Techniques:
o Optimization algorithms
o Simulation modeling
o Decision analysis (Monte Carlo Simulation, Linear Programming)
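As one illustration of prescriptive analytics, the sketch below uses linear programming (SciPy's linprog) to recommend a production plan. The scenario is hypothetical: two products with profits of 40 and 30 per unit, limited by 40 assembly hours and 60 machine hours. All coefficients are assumptions made for the example.

```python
from scipy.optimize import linprog

# Decision variables: x = units of product 1, y = units of product 2
# Goal: maximize 40x + 30y. linprog minimizes, so negate the profits.
profit = [-40, -30]

# Constraints (hypothetical resource limits):
#   x + y  <= 40   (assembly hours)
#   2x + y <= 60   (machine hours)
A_ub = [[1, 1], [2, 1]]
b_ub = [40, 60]

# Production quantities cannot be negative
bounds = [(0, None), (0, None)]

result = linprog(c=profit, A_ub=A_ub, b_ub=b_ub, bounds=bounds)

x_units, y_units = result.x
max_profit = -result.fun  # undo the sign flip
print(f"Recommended plan: {x_units:.1f} units of product 1, "
      f"{y_units:.1f} units of product 2")
print(f"Expected profit: {max_profit:.1f}")
```

The output is a recommended action (how much of each product to make), which is what separates prescriptive analytics from prediction alone.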
1. Data Collection
Data is gathered from various sources such as databases, APIs, spreadsheets, and third-
party services. The goal is to collect relevant and accurate data.
2. Data Cleaning and Preparation
Raw data is often messy, with missing values, duplicates, and errors. Data cleaning
involves preparing the data for analysis by handling issues like missing values, correcting
inaccuracies, and formatting the data.
Example: Converting all text to lowercase, filling missing values, or removing duplicate
records.
1. Programming Languages:
o Python: Python is one of the most widely used languages for data analysis due to
its simplicity and the wide range of libraries available such as Pandas, NumPy,
Matplotlib, and Scikit-learn.
o R: R is another powerful language for statistical analysis and data visualization. It
is especially popular in academia and research.
3. Databases:
o SQL: SQL (Structured Query Language) is used for querying and managing
relational databases like MySQL, PostgreSQL, and Oracle.
o NoSQL: Used for non-relational databases like MongoDB, Cassandra, and
Couchbase.
Let’s consider a simple example of Predictive Analytics where a company wants to predict
future sales based on past sales data.
Steps:
1. Data Collection: Collect past sales data (e.g., sales per month).
2. Data Cleaning: Ensure there are no missing values and the data is consistent.
3. Data Exploration: Use summary statistics to understand the sales trend.
4. Data Modeling: Use a linear regression model to predict future sales based on the past
sales data.
5. Data Visualization: Plot the predicted vs. actual sales in a graph for easier
interpretation.
6. Decision-Making: Use the predictions to adjust marketing strategies or manage
inventory.
In the example above, we predict the sales for month 7 based on the data from the previous
six months using linear regression.
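A minimal sketch of the month 7 forecast described above, assuming hypothetical sales figures for months 1 to 6. It fits the line with NumPy's least-squares polyfit (degree 1), which is one simple way to run a linear regression; the plotting step is omitted.

```python
import numpy as np

# Hypothetical monthly sales for the previous six months
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([200, 220, 250, 270, 300, 320])

# Least-squares straight-line fit: sales = slope * month + intercept
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate the fitted line to month 7
predicted_month_7 = slope * 7 + intercept
print(f"Predicted sales for month 7: {predicted_month_7:.1f}")
```

A straight-line extrapolation assumes the recent trend continues. Seasonal data would usually call for a time series model instead.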