Chapter 1
Introduction to Data Analytics
Data:
• It comprises facts, observations and raw information.
Analytics:
• It is systematic computational analysis of data.
• Analytics is the discovery, interpretation and communication of
meaningful patterns in data.
Data Analytics:
• It is defined as the science of extracting meaningful, valuable information from raw data.
• Data is extracted and categorized to identify and analyze behavioural data and patterns; the techniques used vary according to organizational requirements.
• The goal of data analytics is to derive actionable insights from raw data, resulting in better decisions.
Roles in Data Analytics:
1. Data Analyst:
• Collects and processes data to find insights and answer questions.
• Uses tools like Excel, SQL, and visualization software to create reports.
• Works on summarizing data through charts, graphs, and dashboards.
• Focuses on understanding trends and patterns in data.
• Helps businesses make decisions based on historical data.
• Example: A weather reporter who analyzes past weather data to show last
month’s rainfall totals and create a clear report.
2. Data Scientist:
• Uses advanced techniques such as machine learning and deep learning to perform in-depth analysis of input data.
• Works with both structured and unstructured data to solve complex
problems.
• Mainly deals with large and complex data.
• Cleans and preprocesses data, making it ready for analysis.
• Communicates findings through storytelling and visualizations.
• Bridges the gap between analysis and actionable business strategies.
• Example: A detective using past crime data to predict where a crime
might happen next.
3. Data Architect:
• Designs and builds the framework for data management systems.
• Ensures data systems are secure, efficient, and scalable.
• Works closely with data engineers and stakeholders to create data
policies.
• Plans how data is stored, retrieved, and managed in databases or
warehouses.
• Example: An urban planner designing a city map that decides where
roads and buildings should go for efficient traffic flow.
4. Data Engineer:
• Responsible for building and maintaining the data architecture of a data science project.
• Builds and maintains pipelines that transport data from sources to
databases.
• Ensures data is clean, reliable, and available for analysts and scientists.
• Develops and optimizes data systems for performance and scale.
• Works with technologies like ETL (Extract, Transform, Load) tools, cloud services, and big data platforms (a small ETL sketch follows this role's example below).
• Collaborates with data architects and analysts to improve data
workflows.
• Example: A plumber setting up pipes so water (data) flows smoothly
to different parts of a house.
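To make the ETL idea above concrete, here is a minimal Python sketch that extracts rows from a source file, transforms them, and loads them into a small database. The file name sales.csv, its columns, and the warehouse.db table are hypothetical placeholders, not part of the original notes.

```python
# Minimal ETL sketch: extract from CSV, transform, load into SQLite.
# Assumes a hypothetical sales.csv with columns: date, amount
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: convert types and drop rows with a missing amount
    return [(r["date"], float(r["amount"])) for r in rows if r["amount"]]

def load(records, db_path="warehouse.db"):
    # Load: write the cleaned records into a warehouse table
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))
```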
5. Analytics Manager:
• Leads a team of data analysts and data scientists.
• Plans and oversees analytics projects to meet business goals.
• Ensures data-driven strategies are implemented effectively.
• Communicates between the analytics team and upper management.
• Monitors project progress, sets priorities, and ensures deadlines are
met.
• Example: A team coach making sure each player knows their role and
the team plays well together to win the game.
Lifecycle of Data Analytics:
1. Data Discovery:
• It defines the purpose of the data and how to achieve that purpose by the end of the data analytics lifecycle.
• The Data Discovery phase consists of identifying the critical objectives a business is trying to achieve by mapping out the data.
2. Data Preparation:
• Data is prepared by transforming it from legacy systems into a data analytics form using a sandbox platform.
(A sandbox platform is a safe, isolated environment where analysts can experiment with data, models, and tools without affecting live systems. It allows testing and validating analytics workflows before implementation.)
3. Model Planning:
• In this phase, the data analytics team plans the methods to be adopted and the various workflows to be followed during the next phase of model building.
• The team also has to analyze the quality of the data and find a suitable model for the project.
4. Model Building:
• In this phase, the team works on developing datasets for training and testing as well as for production purposes (a minimal train/test split sketch follows below).
• It is the process in which the team deploys the planned model in a real-time environment.
• The environment needed for executing the model is decided and prepared, so that if a more robust environment is required, it can be applied accordingly.
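As a small illustration of the training/testing split mentioned above, the sketch below uses scikit-learn (assuming it is available); the tiny feature matrix X and labels y are placeholders for a project's real data.

```python
# Split a dataset into training and testing portions before model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# X: feature matrix, y: labels (placeholders for the project's real data)
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out 25% of the data to test the trained model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```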
5. Communicate Results:
• This phase checks the results of the project to determine whether it is a success or a failure.
• In this phase, the business/organizational value is quantified and an elaborate narrative on the key findings is prepared.
6. Operationalize:
• The final report is prepared and delivered by the team, along with the briefings, source code and related technical documents.
• It also involves running a pilot project (a real-world project) to implement the model and test it in a real-time environment.
Data Analytics Framework:
1. Data Connection Layer:
• The Data Connection Layer is the "pipeline" that bridges data sources with the
analytics system.
• This layer connects different data sources to the system for analysis.
• Challenges at this layer include:
• Low quality of data: poor or inaccurate data can lead to misleading results and incorrect conclusions.
• Privacy concerns: collecting and analyzing data may involve sensitive information, raising ethical and legal issues around data privacy.
Difference between Data Analysis & Data Analytics:
• Data analysis is the process of extracting information from raw data; data analytics is the process of extracting meaningful, valuable insights from raw data.
• Data analysis involves the collection, manipulation and examination of data to get insights from it; data analytics takes the analysed data and works on it in a meaningful, useful way to make well-informed organizational decisions.
• Data analysis looks backwards, with a historical view of what has happened; data analytics models the future or predicts a result.
• Data analysis also supports decision-making, but less effectively than data analytics; data analytics uses data, machine learning and statistical analysis to gain better insight and make better decisions from the data.
• Data analysis is a subset of data analytics; data analytics uses data analysis as a subcomponent.
TYPES OF DATA ANALYTICS
What happened? (Descriptive) | Why did it happen? (Diagnostic) | What will happen? (Predictive) | How can we make it happen? (Prescriptive)
Descriptive Analytics:
• It examines raw data or content to answer the question "What happened?" by analyzing valuable information found in the available past data.
• The goal is to provide insights into the past leading up to the present, using descriptive statistics, interactive exploration of the data, and data mining.
• Descriptive analytics looks at data and analyzes past events for insights into how to approach the future.
• E.g.: An organization's records give a past review of its financials, operations, customers, stakeholders, sales and so on.
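A tiny sketch of descriptive analytics using only the Python standard library: summarizing a hypothetical list of monthly sales figures with basic descriptive statistics.

```python
# Descriptive analytics: summarize what happened in past data
import statistics

# Hypothetical monthly sales figures (past data)
monthly_sales = [120, 135, 128, 150, 142, 138, 160, 155, 149, 163, 170, 158]

print("Total:  ", sum(monthly_sales))
print("Mean:   ", statistics.mean(monthly_sales))
print("Median: ", statistics.median(monthly_sales))
print("Std dev:", round(statistics.stdev(monthly_sales), 2))
print("Range:  ", max(monthly_sales) - min(monthly_sales))
```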
Diagnostic Analytics:
• It examines the data to answer the question "Why did it happen?"
• It performs root cause analysis, focusing on processes and causes, key factors and unseen patterns.
• Diagnostic analytics tries to gain a deeper understanding of the reasons behind the patterns found in past data.
• Ex: Diagnosing car troubles by tracing the symptoms back to their cause.
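One simple illustration of the drill-down style of diagnostic analytics: counting hypothetical complaint records by root cause to see which factor dominates.

```python
# Diagnostic analytics: drill into past records to rank root causes
from collections import Counter

# Hypothetical complaint records, each tagged with its root cause
complaints = [
    "late delivery", "damaged item", "late delivery", "wrong item",
    "late delivery", "late delivery", "damaged item",
]

# Count occurrences per cause, most frequent first
for cause, count in Counter(complaints).most_common():
    print(f"{cause}: {count}")
```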
Predictive Analytics:
• It deals with predicting the future based on the available current and past data.
• It uses past data to create a model that answers the question "What will happen?"
• Predictive analytics is the use of data, machine learning techniques and statistical algorithms to determine the likelihood of future results based on historical data.
• The primary goal is to help you go beyond just what has happened and provide the best possible assessment of what is likely to happen in the future.
• Ex: Weather forecasting.
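As a minimal sketch of prediction from past data (a stand-in for the statistical algorithms mentioned above), the example below fits a straight-line trend to hypothetical daily temperatures with NumPy and extrapolates one day ahead.

```python
# Predictive analytics: fit a trend to past data and forecast the next value
import numpy as np

# Hypothetical daily temperatures from the past week
days = np.arange(7)  # days 0..6
temps = np.array([21.0, 22.1, 21.8, 23.0, 23.5, 24.1, 24.6])

# Fit a degree-1 polynomial (straight line) to the historical data
slope, intercept = np.polyfit(days, temps, 1)

# Predict tomorrow (day 7) from the fitted trend
print("Forecast for day 7:", round(slope * 7 + intercept, 1))
```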
Prescriptive Analytics:
• It goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions, showing the decision maker the implications of each decision option.
• It not only anticipates what will happen and when it will happen, but also why it will happen.
• It can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of each decision option.
• Prescriptive analytics can continually and automatically process new data to improve prediction accuracy and provide better decision options.
• Ex: You are heading to work and want to avoid traffic. Prescriptive analytics uses real-time data from traffic apps to suggest the fastest route, accounting for accidents, traffic jams, and construction, helping you arrive on time and avoid delays.
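A toy sketch of the route example above: given hypothetical predicted travel times per route, the prescriptive step simply recommends the option with the best predicted outcome.

```python
# Prescriptive analytics: recommend the action with the best predicted outcome

# Hypothetical predicted travel times in minutes, already adjusted for
# real-time conditions such as accidents, jams, and construction
predicted_minutes = {
    "highway": 35,
    "city streets": 28,
    "ring road": 41,
}

# Prescribe the route that minimizes the predicted travel time
best_route = min(predicted_minutes, key=predicted_minutes.get)
print(f"Take the {best_route} route ({predicted_minutes[best_route]} min)")
```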
Exploratory Analytics:
• Exploratory Data Analysis (EDA) is a very important aspect of any data analysis.
• It attempts to find hidden, unseen or previously unknown relationships.
• EDA techniques are used to interactively discover and visualize trends, behaviours and relationships in data.
• The goal of EDA is to maximize the analyst's insight into a data set and into the underlying structure of the dataset.
• EDA is an approach to analysing datasets to summarize their main characteristics, often with visual/graphical methods.
• The purpose of exploratory data analysis is to:
1. Check for missing data and other mistakes.
2. Gain maximum insight into the dataset and its underlying structure.
3. Create a list of outliers or other anomalies.
• Exploratory Data Analysis is the process of analysing and interpreting datasets while summarizing their particular characteristics with the help of data visualization methods.
• Ex: Looking around and exploring your options when you go to a theatre to watch a movie.
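A small EDA sketch covering the three purposes listed above (checking for missing values, summarizing the data, and flagging outliers) on a hypothetical set of measurements.

```python
# Exploratory data analysis: missing values, summary, simple outlier check
import statistics

# Hypothetical measurements; None marks a missing value
values = [10.2, 9.8, None, 10.5, 10.1, 55.0, 9.9, 10.3, None, 10.0]

# 1. Check for missing data
print("Missing values:", sum(1 for v in values if v is None))

# 2. Summarize the main characteristics of the clean data
clean = [v for v in values if v is not None]
mean, stdev = statistics.mean(clean), statistics.stdev(clean)
print(f"Mean: {mean:.2f}, Std dev: {stdev:.2f}")

# 3. Flag outliers more than 2 standard deviations from the mean
print("Outliers:", [v for v in clean if abs(v - mean) > 2 * stdev])
```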
Mechanistic Analytics:
• It allows data scientists to understand the exact alterations in some variables that can result in changes to other variables.
• The results of mechanistic data analytics are determined by equations, as in engineering and the physical sciences.
• Its goal is to understand the exact changes in variables that lead to changes in other variables.
• Ex: If you change the temperature in a factory, mechanistic analytics helps you understand exactly how this change affects the quality of products, based on the science of the materials being used.
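A tiny sketch of an equation-determined, mechanistic relationship, using Ohm's law as a stand-in for the physical-science equations mentioned above: changing the voltage changes the current in an exactly determined way.

```python
# Mechanistic analytics: output determined exactly by a physical equation
# Ohm's law: current I = V / R

resistance = 10.0  # ohms, a fixed property of the circuit

# Changing the input variable (voltage) changes the output (current)
# in a fully determined, equation-governed way
for voltage in [5.0, 10.0, 15.0]:
    current = voltage / resistance
    print(f"V = {voltage:4.1f} V  ->  I = {current:.2f} A")
```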
Confusion Matrix:
• A confusion matrix is a matrix that summarizes the performance of a machine
learning model on a set of test data.
• It is a means of displaying the number of accurate and inaccurate instances based
on the model’s predictions.
• It is often used to measure the performance of classification models, which aim to
predict a categorical label for each input instance.
• The matrix displays the number of instances produced by the model on the test
data:
• True Positive (TP): The model correctly predicted a positive outcome (the actual
outcome was positive).
• True Negative (TN): The model correctly predicted a negative outcome (the actual
outcome was negative).
• False Positive (FP): The model incorrectly predicted a positive outcome (the
actual outcome was negative). Also known as a Type I error.
• False Negative (FN): The model incorrectly predicted a negative outcome (the
actual outcome was positive). Also known as a Type II error.
Example: COVID-19 Test Result
A COVID-19 test classifies individuals as Positive (infected) or Negative (not
infected).
Outcomes in the Confusion Matrix:
True Positive (TP): The test correctly identifies a person with COVID-19 as positive.
True Negative (TN): The test correctly identifies a healthy person as negative.
False Positive (FP): The test wrongly says a healthy person has COVID-19.
False Negative (FN): The test wrongly says a person with COVID-19 is healthy.
• TP: A person who has COVID-19 is detected correctly, ensuring proper treatment.
• TN: A healthy person is confirmed negative, avoiding unnecessary stress or isolation.
• FP: A healthy person is incorrectly labeled as infected, leading to unnecessary quarantine.
• FN: A person with COVID-19 is missed, which can spread the virus.
Metrics based on Confusion Matrix:
1. Accuracy: It measures the overall performance of the model. It is the ratio of correctly classified instances to the total number of instances: Accuracy = (TP + TN) / (TP + TN + FP + FN).
2. Precision: It measures how accurate the model's positive predictions are. It is the ratio of true positive predictions to the total number of positive predictions made by the model: Precision = TP / (TP + FP).
3. Recall: It measures the effectiveness of a classification model in identifying all relevant instances in a dataset. It is the ratio of true positive (TP) instances to the sum of true positive and false negative (FN) instances: Recall = TP / (TP + FN).
4. F-score: It evaluates the overall performance of a classification model as the harmonic mean of precision and recall: F-score = 2 × (Precision × Recall) / (Precision + Recall).
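A minimal sketch computing the confusion-matrix counts and the four metrics above from hypothetical test labels and predictions (1 = positive/infected, 0 = negative/healthy, matching the COVID-19 example).

```python
# Confusion matrix and derived metrics from labels and predictions
# 1 = positive (infected), 0 = negative (healthy); the data is hypothetical
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # Type I error
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # Type II error

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_score   = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} "
      f"Recall={recall:.2f} F-score={f_score:.2f}")
```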