
Noida Institute of Engineering and Technology, Greater Noida

Introduction to Data Science

Unit: 1

Data Analytics ACSAI0512


Ms. Aarushi Thusu
Assistant Professor
B.Tech 5th Semester
CSE-AIML

Ms. Aarushi Thusu Data Analytics ACSAI0512 Unit Number 1


1
05/11/2025
Faculty Introduction

Name: Aarushi Thusu
Qualification: M.Tech. (Artificial Intelligence)
Designation: Assistant Professor
Department: AIML
NIET Experience: 2 years
Subjects Taught: Data Analytics, Introduction to Artificial Intelligence, Social Media Analytics, Cyber Security

Evaluation Scheme

Syllabus

UNIT-I: Introduction to Data Science

Introduction to Data Science, Big Data, the 5 V's, Evolution of Data Science, Datafication, Skill sets needed, Data Science Lifecycle, Types of Data Analysis, Data Science Tools and Technologies, Need for Data Science, Analysis vs Analytics vs Reporting, Big Data Ecosystem, Future of Data Science, Applications of Data Science in various fields, Use cases of Data Science: Facebook, Netflix, Amazon, Uber, Airbnb.



Syllabus

UNIT-II: Data Handling

Types of Data: structured, semi-structured and unstructured data; Numeric, Categorical, Graphical, High-Dimensional Data; Transactional Data, Spatial Data, Social Network Data; standard datasets; Data Classification; Sources of Data; Data manipulation in various formats, for example CSV, PDF, XML, HTML, text, JSON and image files; import and export of data in R/Python.



Syllabus

UNIT-III: Data Preprocessing

Forms of Data Pre-processing, data attributes and their types, understanding and extracting useful variables, the KDD process. Data Cleaning: missing values, noisy data, discretization and concept hierarchy generation (binning, clustering, histograms), inconsistent data, Data Integration and Transformation. Data Reduction: data cube aggregation, data compression, numerosity reduction.



Syllabus

UNIT-IV: Exploratory Data Analysis

Handling missing data, removing redundant variables, variable selection, identifying outliers, removing outliers, time series analysis, data transformation and dimensionality reduction techniques such as Principal Component Analysis (PCA), Factor Analysis (FA) and Linear Discriminant Analysis (LDA), univariate and multivariate exploratory data analysis. Data munging, data wrangling: APIs and other tools for scraping data from the web/internet using R/Python.



Syllabus

UNIT-V: Data Visualization

Introduction and overview; debugging and troubleshooting the installation and configuration of Tableau. Creating your first visualization: getting started with Tableau Software, using data file formats, connecting your data to Tableau, creating basic charts (line charts, bar charts, treemaps), using the Show Me panel. Tableau calculations: overview of SUM, AVG, and aggregate features; creating custom calculations and fields; applying new data calculations to your visualization. Manipulating data in Tableau: cleaning up the data with the Data Interpreter, structuring your data, sorting and filtering Tableau data, pivoting Tableau data. Advanced visualization tools: using filters, using the Detail and Size panels, customizing filters, using and customizing tooltips, formatting your data with colours, creating dashboards and stories, distributing and publishing your visualization.



Branch Wise Applications

1. Security
2. Digital Advertising
3. E-Commerce
4. Publishing
5. Massively Multiplayer Online Games
6. Backend Services and Messaging
7. Project Management & Collaboration
8. Real-time Monitoring Services
9. Live Charting and Graphing
10. Group and Private Chat

Course Objective

The objective of this course is to understand the fundamental concepts of data analytics and to learn about various types of data formats and their manipulation. It helps students learn exploratory data analysis and visualization techniques, along with the R/Python/Tableau programming languages.

Course Outcomes

At the end of the course, the student will be able to:

• Understand the fundamental concepts of data analytics in the areas that play a major role within the realm of data science.
• Explain and exemplify the most common forms of data and their representations.
• Understand and apply data pre-processing techniques.
• Analyse data using exploratory data analysis.
• Illustrate various visualization methods for different types of data sets and application scenarios.

Program Outcomes

Engineering Graduates will be able to:

PO1: Engineering Knowledge
PO2: Problem Analysis
PO3: Design/Development of solutions
PO4: Conduct Investigations of complex problems
PO5: Modern tool usage
PO6: The engineer and society



Program Outcomes

Engineering Graduates will be able to:

PO7: Environment and sustainability
PO8: Ethics
PO9: Individual and teamwork
PO10: Communication
PO11: Project management and finance
PO12: Life-long learning



CO-POs Mapping

CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1 2 2 2 3 3 - - - - - - -

CO2 3 2 3 2 3 - - - - - - -

CO3 3 2 3 2 3 - - - - - - -

CO4 3 2 3 2 3 - - - - - - -

CO5 3 2 3 3 3 - - - - - - -

AVG 2.8 2.0 2.8 2.4 3.0 - - - - - - -



Program Specific Outcomes

Program Specific Outcomes (PSO):

PSO1: Design innovative intelligent systems for the welfare of the people using machine learning and its applications.
PSO2: Demonstrate ethical, professional and team-oriented skills while providing innovative solutions in Artificial Intelligence and Machine Learning for life-long learning.

CO-PSOs Mapping

CO.K PSO1 PSO2 PSO3 PSO4

CO1 3 - - -

CO2 3 2 - -

CO3 3 3 - -

CO4 3 3 - -

CO5 3 3 - -



Program Educational Objectives

Program Educational Objectives (PEOs):

PEO1: Pursue higher education and a professional career to excel in the field of Artificial Intelligence and Machine Learning.
PEO2: Lead by example in innovative research and entrepreneurial zeal for 21st-century skills.
PEO3: Proactively provide innovative solutions for societal problems to promote life-long learning.



Pattern of External Exam Question Paper



Brief Introduction about the Subject and Videos

Data analytics (DA) is the process of examining data sets in order to find trends and draw conclusions about the information they contain. Increasingly, data analytics is done with the aid of specialized systems and software.

YouTube/other Video Links


https://www.youtube.com/watch?v=KxryzSO1Fjs

Unit Content

• What is Data Science


• Big Data, the 5 V’s
• Evolution of Data Science
• Datafication
• Skill sets needed
• Data Science Lifecycle
• Types of Data Analysis
• Data Science Tools and technologies
• Need for Data Science
• Analysis Vs Analytics Vs Reporting
• Big Data Ecosystem
• Future of Data Science
• Applications of Data Science in various fields
• Crowd sourcing analytics
• Data Security Issues
• Use cases of Data Science: Facebook, Netflix, Amazon, Uber, Airbnb.



Unit Objective

⮚ Understand the significance of Data Science in industry.
⮚ Understand the basic concepts of Data Science implementation and its techniques.
⮚ Describe a formal definition of its frameworks.
⮚ Understand the challenges faced during the Data Science implementation process.
⮚ Describe the standards and requirements for implementing Data Science in a cloud environment.

What is Data Science?

Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it.

By using Data Science, companies are able to make:


• Better decisions (should we choose A or B)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find pattern, or maybe hidden information in the data)

Where is Data Science Needed?

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and manufacturing.

Examples where Data Science is needed:

• For route planning: to discover the best routes to ship goods
• To foresee delays for flights/ships/trains etc. (through predictive analysis)
• To create promotional offers
• To find the best-suited time to deliver goods
• To forecast the next year's revenue for a company
• To analyze the health benefits of training
• To predict who will win elections

How Does a Data Scientist Work?


A Data Scientist requires expertise in several backgrounds:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
A Data Scientist must find patterns within the data. Before the patterns can be found, the data must be organized in a standard format.


Here is how a Data Scientist works:


• Ask the right questions - To understand the business problem.
• Explore and collect data - From database, web logs, customer feedback, etc.
• Extract the data - Transform the data to a standardized format.
• Clean the data - Remove erroneous values from the data.
• Find and replace missing values - Check for missing values and replace them with a suitable value (e.g. an average value).
• Normalize data - Scale the values into a practical range (e.g. 140 cm is smaller than 1.8 m; however, the number 140 is larger than 1.8, so scaling is important).
• Analyze data, find patterns and make future predictions.
• Represent the result - Present the result with useful insights in a way the "company" can understand.

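The workflow steps above can be sketched in a few lines of Python with pandas; the column names and values here are hypothetical, purely for illustration.

```python
# Sketch of the cleaning steps above (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "height_cm": [172.0, None, 168.0, 181.0],
    "city": ["Delhi", "Mumbai", "Delhi", None],
})

# Find and replace missing values: fill numeric gaps with the column mean.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Normalize data: min-max scaling brings heights into the 0-1 range,
# so magnitudes measured in different units become comparable.
h = df["height_cm"]
df["height_norm"] = (h - h.min()) / (h.max() - h.min())

print(df)
```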
Big Data

Big Data is a term for extracting meaningful information by analyzing the huge amounts of complex, variously formatted data generated at high speed that cannot be handled or processed by traditional systems.
Data that is very large in size is called Big Data. Normally we work on data of size MB (Word documents, Excel files) or at most GB (movies, code), but data on the order of petabytes, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.

Big Data

Sources

• Social networking sites: Facebook, Google and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge numbers of logs from which users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

Big Data

5 Vs of Big Data:

The 5 V's of big data (velocity, volume, value, variety and veracity) are the five main and innate characteristics of big data. Knowing the 5 V's lets data scientists derive more value from their data while also allowing their organizations to become more customer-centric.
Earlier this century, big data was talked about in terms of the three V's: volume, velocity and variety. Over time, two more V's (value and veracity) were added to help data scientists more effectively articulate and communicate the important characteristics of big data. In some cases, there is even a sixth V: variability.

Types of Digital Data

Data classification is the process of sorting data into relevant categories so that it can be used or applied more efficiently. Classification makes it easy for the user to retrieve data. It is important when it comes to data security and compliance, and also for meeting different types of business or personal objectives. It is a major requirement, as data must be easily retrievable within a specific period of time.
We can divide data into three categories:
1. Structured Data
2. Unstructured Data
3. Semi-Structured Data

Types of Digital Data

Structured Data :

Structured data is created using a fixed schema and is maintained in tabular format. The elements in structured data are addressable, which makes analysis effective. It includes all data that can be stored in a SQL database in tabular form.

Examples: relational data, geo-location, credit card numbers, addresses, etc.

Consider relational data, for example: you have to maintain a record of students for a university, with the name, ID, address and email of each student. To store the student records, you use a relational schema and table such as:

ID | NAME | ADDRESS | EMAIL
101 | A | Delhi | [email protected]
102 | B | Mumbai | [email protected]
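In pandas, the student-record schema above becomes a structured, addressable table (values taken from the sample table; the email column is omitted):

```python
# The fixed student-record schema above, loaded as tabular data.
import pandas as pd

students = pd.DataFrame(
    [(101, "A", "Delhi"), (102, "B", "Mumbai")],
    columns=["ID", "NAME", "ADDRESS"],
)

# A fixed schema means every element is addressable for analysis.
print(students.loc[students["ID"] == 102, "ADDRESS"].item())
```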

Types of Digital Data

Unstructured Data:
Unstructured data does not follow a pre-defined schema or any organized format. This kind of data does not fit the relational model, which expects data in a pre-defined, organized form. Unstructured data is very important in the big data domain, and there are many platforms for storing and managing it, such as NoSQL databases.

Examples: Word documents, PDFs, text, media logs, etc.

Types of Digital Data

Semi - Structured Data :

Semi-structured data is information that does not reside in a relational database but that
have some organizational properties that make it easier to analyze. With some process,
you can store them in a relational database but is very hard for some kind of semi-
structured data, but semi-structured exist to ease space.

Example –
XML data
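A short sketch of handling semi-structured XML with Python's standard library (the tags and values here are made up). Unlike a fixed relational schema, a missing field, here one student's city, is perfectly legal:

```python
# Parse semi-structured XML: tags give organizational properties,
# but records need not all share the same fields.
import xml.etree.ElementTree as ET

xml_doc = """
<students>
  <student id="101"><name>A</name><city>Delhi</city></student>
  <student id="102"><name>B</name></student>
</students>
"""

root = ET.fromstring(xml_doc)
for s in root.findall("student"):
    # findtext() tolerates the missing <city> tag gracefully.
    print(s.get("id"), s.findtext("name"), s.findtext("city", default="unknown"))
```

JSON gets the same treatment with the standard-library json module.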

Evolution of Data Science

The field of data science has transformed greatly in recent times. Initially, it relied mostly on statistical analysis and data mining to draw conclusions from structured data. However, as big data technologies expanded and increased the amount of unstructured information from sources such as social networks and sensors, data science, which deals with massive data sets, became even more in demand. Modern technology and advances in both machine learning and artificial intelligence have empowered data scientists: they can now readily develop complex models for tasks such as predictive analytics, NLP and computer vision. This has created roles across the health, finance, retail and manufacturing industries, where data science is used for personalized medicine, spotting fraud, suggesting products, and predicting when machines need fixing.

Evolution of Data Science

The evolution of Data Science has taken place in many phases. It all started with statistics: simple statistical models were employed to collect, analyze and manage data from the early 1800s. These principles underwent various changes over time until the rise of the digital age. Once computers were introduced as mainstream public devices, the industry shifted to the digital age, and a flood of data and digital information was created. Statistical practices and models were computerized, giving rise to digital analytics. Then came the internet, which exponentially grew the available data, giving rise to what we know as Big Data. This explosion of information available to the masses created the need for expertise to process, manage, analyze and visualize data for decision making through the use of various models. This gave birth to the term Data Science.

Datafication

Datafication, according to Mayer-Schönberger and Cukier, is the transformation of social action into online quantified data, allowing for real-time tracking and predictive analysis. Simply said, it is about taking a previously invisible process or activity and turning it into data that can be monitored, tracked, analyzed and optimized. The latest technologies have enabled many new ways to 'datify' our daily and basic activities.
In summary, datafication is a technological trend that turns many aspects of our lives into computerized data, transforming organizations into data-driven enterprises by converting this information into new forms of value. Datafication refers to the fact that the daily interactions of living things can be rendered into a data format and put to social use.

Datafication

• Social platforms such as Facebook or Instagram collect and monitor data about our friendships to market products and services to us, and to provide surveillance services to agencies, which in turn changes our behavior; the promotions we see daily on social media are also the result of this monitored data. In this model, datafication is used to inform how content is created, not just to drive recommendation systems.

There are other industries where the datafication process is actively used:
• Insurance: data used to update risk profiles and develop business models.
• Banking: data used to establish trustworthiness and the likelihood of a person paying back a loan.
• Human resources: data used to identify, for example, employees' risk-taking profiles.
• Hiring and recruitment: data used to replace personality tests.
• Social science research: datafication replaces sampling techniques and restructures the manner in which social science research is performed.

Skill Set Needed

Data Science Life Cycle


Phase 1—Discovery: Before you begin the project, it is important to understand the various
specifications, requirements, priorities and required budget. You must possess the ability to ask the right
questions. Here, you assess if you have the required resources present in terms of people, technology,
time and data to support the project. In this phase, you also need to frame the business problem and
formulate initial hypotheses (IH) to test.

Phase 2—Data preparation: In this phase, you require an analytical sandbox in which you can perform analytics for the entire duration of the project. You need to explore, preprocess and condition data prior to modeling. Further, you will perform ETLT (extract, transform, load and transform) to get data into the sandbox. You can use R or Python for data cleaning, transformation, and visualization; this helps you spot outliers and establish relationships between the variables. Once you have cleaned and prepared the data, it is time to do exploratory analytics on it.
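Spotting outliers, mentioned above as part of data preparation, can be sketched with a simple two-standard-deviation rule (toy numbers, standard library only):

```python
# Flag values more than two sample standard deviations from the mean.
import statistics

values = [10, 12, 11, 13, 12, 95]  # 95 is an obvious outlier
mean = statistics.mean(values)
stdev = statistics.stdev(values)
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)
```

Note that an extreme outlier inflates the standard deviation itself, which is why robust rules (e.g. median-based) are often preferred in practice.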

Phase 3—Model planning: Here, you will determine the methods and techniques to draw the
relationships between variables. These relationships will set the base for the algorithms which you will
implement in the next phase. You will apply Exploratory Data Analytics (EDA) using various statistical
formulas and visualization tools.
Data Science Life Cycle

Phase 4—Model building: In this phase, you will develop datasets for training and testing purposes. You will consider whether your existing tools will suffice for running the models or whether you will need a more robust environment (such as fast, parallel processing). You will analyze various learning techniques like classification, association and clustering to build the model.

Phase 5—Operationalize: In this phase, you deliver final reports, briefings, code and technical documents. In addition, sometimes a pilot project is implemented in a real-time production environment. This gives you a clear picture of the performance and other related constraints on a small scale before full deployment.

Phase 6—Communicate results: Now it is important to evaluate whether you have achieved the goal you planned in the first phase. In this last phase, you identify all the key findings, communicate them to the stakeholders, and determine whether the results of the project are a success or a failure based on the criteria developed in Phase 1.

Data Science Tools and Technologies


The following data science technology stack is in demand and necessary for a successful career in data science and technology.
1. Amazon Web Services (AWS)
Amazon Web Services (AWS) is a cloud provider, a cloud-based service that allows users to access virtual servers. Compute is offered through Amazon Elastic Compute Cloud (EC2) instances, which can run frameworks such as Apache Spark and provide access to other services that may be used for data processing.
2. Amazon Machine Learning (AML)
Amazon Machine Learning is a specialized ML service inside AWS that can be used to develop ML models with predictive capabilities.
3. Text Mining
Nearly 80% of all data in the world is unstructured, making text mining a crucial analysis and processing method. It is the practice of extracting useful information and finding patterns in large amounts of textual material by surfacing previously unrecognized relationships and trends.
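In miniature, text mining can be illustrated as extracting word frequencies from unstructured text (a toy corpus, not a real pipeline):

```python
# Tokenize free text and count word frequencies.
import re
from collections import Counter

corpus = "Data science turns raw data into insight. Raw data needs cleaning."
words = re.findall(r"[a-z]+", corpus.lower())
freq = Counter(words)
print(freq.most_common(2))
```

Real text-mining pipelines add stop-word removal, stemming or lemmatization, and weighting schemes such as TF-IDF on top of this basic counting step.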

Data Science Tools and Technologies

The following data science tools are widely used:


1. Pandas: pandas makes data cleaning, manipulation, analysis, and feature engineering seamless in Python. It is
the most used library by data professionals for all kinds of tasks. You can now use it for data visualization, too.

2. Scikit-learn: Scikit-learn is the go-to Python library for machine learning. This library provides a consistent
interface to common algorithms, including regression, classification, clustering, and dimensionality reduction.
It's optimized for performance and widely used by data scientists.

3. Seaborn: Seaborn is a powerful data visualization library built on top of Matplotlib. It comes with a range of beautiful, well-designed default themes and is particularly useful when working with pandas DataFrames. With Seaborn, you can create clear and expressive visualizations quickly and easily.

4. Jupyter Notebooks: Jupyter Notebooks is a popular open-source web application that allows data scientists to
create shareable documents combining live code, visualizations, equations, and text explanations. Great for
exploratory analysis, collaboration, and reporting.

Types of Data Analysis

1. Descriptive Analysis
Goal: describe or summarize a set of data.
Description:
• The very first analysis performed.
• Generates simple summaries about samples and measurements.
• Uses common descriptive statistics: measures of central tendency, variability, frequency, position, etc.
Example: take the COVID-19 statistics page on Google; the line graph is a pure summary of the cases/deaths, a presentation and description of the population of a particular country infected by the virus.
Summary: descriptive analysis is the first step in analysis, where you summarize and describe the data you have using descriptive statistics; its result is a simple presentation of your data.
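A descriptive analysis in miniature: summarize a sample with common measures of central tendency and variability (the case counts below are illustrative, not real COVID-19 data):

```python
# Simple summary statistics for a small sample.
import statistics

daily_cases = [120, 135, 128, 150, 142, 131, 138]
summary = {
    "count": len(daily_cases),
    "mean": round(statistics.mean(daily_cases), 1),
    "median": statistics.median(daily_cases),
    "stdev": round(statistics.stdev(daily_cases), 1),
}
print(summary)
```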

Types of Data Analysis

2. Exploratory Analysis
Goal: examine or explore data and find relationships between variables that were previously unknown.
Description:
• EDA helps you discover relationships between measures in your data; these are not evidence of causation, as the phrase "correlation doesn't imply causation" reminds us.
• Useful for discovering new connections, forming hypotheses, and driving design planning and data collection.
Example: Climate change is an increasingly important topic, as the global temperature has gradually risen over the years. One example of EDA on climate change is taking the rise in temperature over the years, say 1950 to 2020, together with the increase in human activities and industrialization, and forming relationships from the data: for example, the increasing number of factories, cars on the road and airplane flights correlates with the temperature rise.
Summary: EDA explores data to find relationships between measures, telling us that they exist without establishing the cause. These relationships can be used to formulate hypotheses.

Types of Data Analysis

3. Inferential Analysis
Goal: use a small sample of data to infer about a larger population. The goal of statistical modeling itself is to use a small amount of information to extrapolate and generalize to a larger group.
Description:
• Uses the sample to estimate a value in the population and gives a measure of uncertainty (e.g. the standard error) for the estimate.
• The accuracy of inference depends heavily on the sampling scheme; if the sample is not representative of the population, the generalization will be inaccurate (see the Central Limit Theorem).
Example: The idea of inferring about the population at large from a smaller sample is quite intuitive; many statistics you see in the media and on the internet are inferential, a prediction of an event based on a small sample. For example, consider a psychology study on the benefits of sleep with a total of 500 people involved. When followed up with, the candidates reported better overall attention and well-being with 7-9 hours of sleep, while those with less or more sleep suffered reduced attention and energy. This report from 500 people is just a tiny portion of the roughly 7 billion people in the world, and thus an inference about the larger population.
Summary: inferential analysis extrapolates and generalizes information about a larger group from a smaller sample to generate analysis and predictions.
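The estimate-plus-uncertainty idea above can be sketched as a rough 95% interval for a population mean, using the normal approximation (the sample values are made up):

```python
# Point estimate and uncertainty for a population mean from a small sample.
import math
import statistics

sample_sleep_hours = [7.5, 8.0, 6.5, 7.0, 8.5, 7.5, 6.0, 8.0]
n = len(sample_sleep_hours)
mean = statistics.mean(sample_sleep_hours)
se = statistics.stdev(sample_sleep_hours) / math.sqrt(n)  # standard error

# Rough 95% interval (normal approximation; a t-interval is more exact
# for a sample this small).
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"{mean:.2f} hours, 95% CI ({low:.2f}, {high:.2f})")
```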
Types of Data Analysis

4. Predictive Analysis
Goal: use historical or current data to find patterns and make predictions about the future.
Description:
• The accuracy of the predictions depends on the input variables.
• Accuracy also depends on the type of model: a linear model might work well in some cases and poorly in others.
• Using one variable to predict another does not denote a causal relationship.
Example: The 2020 US election was a popular topic, and many prediction models were built to predict the winning candidate. FiveThirtyEight did a great 2016 election forecast and was back at it again in 2020. Prediction analysis for an election requires input variables such as historical polling data, trends, and current polling data in order to get a good prediction. Something as large as an election would not just use a linear model, but a complex model with certain tunings to best serve its purpose.
Summary: predictive analysis takes data from the past and present to make predictions about the future.

Types of Data Analysis

5. Causal Analysis
Goal: look at the cause and effect of relationships between variables, focused on finding the cause of a correlation.
Description:
• To find the cause, you have to question whether the observed correlations driving your conclusion are valid; just looking at the data (the surface) will not help you discover the hidden mechanisms underlying the correlations.
• Applied in randomized studies focused on identifying causation.
• Considered the gold standard in data analysis: scientific studies where the cause of a phenomenon is extracted and singled out, like separating wheat from chaff.
Challenges: good data is hard to find and requires expensive research and studies. These studies are analyzed in aggregate (multiple groups), and the observed relationships are just average effects (means) across the whole population, meaning the results might not apply to everyone.
Example: say you want to test a new drug that improves human strength and focus. To do that, you perform randomized controlled trials to test the effect of the drug: you compare the sample of candidates receiving the new drug with candidates receiving a mock control, run a few tests of strength and overall focus and attention, and observe how the drug affects the outcome.
Summary: causal analysis is about finding the causal relationship between variables: change one variable and see what happens to another.
Types of Data Analysis

6. Mechanistic Analysis
Goal — Understand exact changes in variables that lead to other changes in other variables
Description:
Applied in physical or engineering sciences, situations that require high precision and little room for error(only
noise in data is measurement error)
Designed to understand a biological or behavioral process, the pathophysiology of a disease, or the mechanism of
action of an intervention. (by NIH)
Example:
Many graduate-level research projects and complex topics are suitable examples, but to put it simply: say an
experiment is done to simulate safe and effective nuclear fusion to power the world. A mechanistic analysis of
the study would entail a precise balance of controlled and manipulated variables, with highly accurate measurement
of both the variables and the desired outcomes. It is this intricate and meticulous modus operandi (strategy) toward
such big topics that allows for scientific breakthroughs and the advancement of society.
Summary:
MA is in some ways a predictive analysis, but modified to tackle studies that require high precision and meticulous
methodologies in the physical or engineering sciences.

15/06/2022 55
Need for Data Science

15/06/2022 56
Need for Data Science

• With the help of data science technology, we can convert massive amounts of raw and unstructured data into
meaningful insights.
• Data science technology is being adopted by various companies, whether big brands or startups. Google, Amazon,
Netflix, etc., which handle huge amounts of data, use data science algorithms to improve the customer
experience.
• Data science is helping to automate transportation, for example by creating self-driving cars, the future of
transportation.
• Data science can help with different predictions, such as surveys, elections, flight ticket confirmation, etc.

15/06/2022 57
Analysis Vs Analytics Vs Reporting

•Analytics and reporting can help a business improve operational efficiency and production in several
ways. Analytics is the process of making decisions based on the data presented, while reporting is used
to make complicated information easier to understand. Let's discuss analytics vs reporting.
•Analytics and reporting are often treated as the same thing. Although both take data as input and
present it in charts, graphs, or dashboards, they have several key differences.

15/06/2022 58
Analysis Vs Analytics Vs Reporting

What is analytics vs reporting?

Analytics is the technique of examining data and reports to obtain actionable insights that can be used to
comprehend and improve business performance. Business users may gain insights from data, recognize trends, and
make better decisions with workforce analytics.

On the one hand, analytics is about finding value or making new data to help you decide. This can be performed
either manually or mechanically. Next-generation analytics uses new technologies like AI or machine learning to
make predictions about the future based on past and present data.
The steps involved in data analytics are as follows:
• Developing a data hypothesis
• Data collection and transformation
• Creating analytical research models to analyze and provide insights
• Utilization of data visualization, trend analysis, deep dives, and other tools.
• Making decisions based on data and insights
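The steps above can be sketched end-to-end on toy data. The data, the threshold, and the helper names are all hypothetical; the point is only to show the shape of a collect, transform, model, and decide pipeline.

```python
# Hedged sketch of a minimal data-analytics workflow.

def transform(raw):
    """Collection/transformation step: drop missing values."""
    return [x for x in raw if x is not None]

def moving_average(xs, window=3):
    """A minimal 'analytical model': smooth the series to expose a trend."""
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]

def decide(trend, threshold=100.0):
    """Decision step: act if the latest smoothed value crosses a threshold."""
    return "scale up" if trend[-1] > threshold else "hold"

raw_visits = [90, None, 95, 102, 110, None, 118]   # hypothetical daily visits
clean = transform(raw_visits)
trend = moving_average(clean)
print(decide(trend))
```

Real pipelines replace each step with heavier machinery (ETL jobs, statistical models, dashboards), but the sequence of steps stays the same.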

15/06/2022 59
Analysis Vs Analytics Vs Reporting

On the other hand, reporting is the process of presenting data from numerous sources clearly and
simply. The procedure is always carefully set out to report correct data and avoid misunderstandings.
Today’s reporting applications offer cutting-edge dashboards with advanced data visualization features.
Companies produce a variety of reports, such as financial reports, accounting reports, operational
reports, market studies, and more. This makes it easier to quickly see how each function is operating.

In general, the procedures needed to create a report are as follows:


• Determining the business requirement
• Obtaining and compiling essential data
• Technical data translation
• Recognizing the data context
• Building dashboards for reporting
• Providing real-time reporting
• Allowing users to drill down into reports
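The reporting steps above can be sketched as a small aggregation that compiles raw records into a readable summary table. The record fields and figures below are hypothetical.

```python
# Hedged sketch of the "compile essential data" step of reporting:
# aggregate raw sales records into a per-department summary table.

from collections import defaultdict

def build_report(records):
    """Total revenue per department, rendered as a simple text table."""
    totals = defaultdict(float)
    for dept, amount in records:
        totals[dept] += amount
    lines = ["Department | Revenue"]
    for dept in sorted(totals):
        lines.append(f"{dept:<10} | {totals[dept]:>7.2f}")
    return "\n".join(lines)

sales_records = [("east", 120.0), ("west", 80.5), ("east", 30.0)]
print(build_report(sales_records))
```

A real reporting tool would render this as a dashboard widget rather than text, but the aggregation underneath is the same.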

15/06/2022 60
Analysis Vs Analytics Vs Reporting

15/06/2022 61
Big Data Eco-System

A big data ecosystem refers to the massive volumes of both structured and unstructured data, whose size
or type is beyond the ability of a traditional relational database, together with the tools used to capture,
manage, and process that data with low latency.
Nearly every successful business relies on quick, agile decisions to stay competitive. This is only
achieved when the enterprise properly handles its data storage, processing, and visualization.

15/06/2022 62
Big Data Eco-System

•The big data ecosystem comprises massive functional components with various enabling tools. Its
capabilities are not only about computing and storing big data, but also the advantages of its systematic
platform and the potential of big data analytics. Hence, according to the solutions proposed in the
reviewed literature and big data capabilities, the maturity of big data ecosystem applications is
categorized into three stages:
•Stage 1: Proposing a big data framework and platform;
•Stage 2: Harvesting cloud computing capacity for big data computing and storage;
•Stage 3: Analysing big data with various algorithms for the applications (prediction, fault detection,
optimization etc.).

15/06/2022 63
Big Data Importance and Applications

Main Technology Components Of Big Data:

1. Data Management
2. Data Mining
3. Hadoop
4. In-Memory Analytics
5. Predictive Analytics
6. Text Mining
Why is big data analytics important?
1. Reduced cost
2. Quick decision making
3. New products and features

15/06/2022 64
Applications of Big Data

15/06/2022 65
Future of Data Science

Data Science has evolved a long way from statistics: from the 19th century onward, organizations
collected, managed, and analyzed data with simple statistical models. Later, once computers emerged,
the digital era began generating massive amounts of data. The internet brought an explosion of data,
and the need to manage this Big Data has led to the growth of Data Science.

Data Scientist skills help organizations to make informed business decisions through effective data
management. Data science technologies trigger personalized healthcare systems, targeted advertising,
risk and fraud detection, airline route management, financial applications, and many other processes of
various industries.
The future of data science is uncertain; however, it will certainly bring further innovation to business
processes alongside the technological revolution. Below are the top 10 predictions for data science.

15/06/2022 66
Future of Data Science

Predictions about the future of Data Science:

1. The tasks of Data Scientists hired to augment business processes could be automated soon.
2. Data Science will incorporate concepts from various fields like sociology and psychology; it will soon become
interdisciplinary.
3. Social Media and other online platforms will become the source for the collection of more data.
4. Data Science will help businesses predict consumer behavior.
5. Data Science is moving into an era of being a team activity. It is not just about creating a model, but about
what you will use it for once it is built.
6. Data Science will grow more conscious of increasing cybersecurity threats.
7. Data Scientists will face the growing prevalence of Cloud Computing.
8. Coding and AI skills will become more essential, and data scientists will need to be more business-minded.
9. Data Scientists' jobs will become more operationalized, with advanced tools to capture their workflows and
train enterprises on best practices.
10. Data Scientists will get the opportunity to initiate a "quantum leap".

15/06/2022 67
Future of Data Science

Data science makes the way forward powerful, with many emerging trends that help organizations
thrive. However, these changes will lead organizations to look for candidates with advanced data
scientist skills. Data science certifications can be one way to build the skills needed to meet this
demand.

15/06/2022 68
Applications of Data Science in Various Fields

Healthcare: Data science can identify and predict disease and personalize healthcare recommendations.
Transportation: Data science can optimize shipping routes in real-time.
Sports: Data science can accurately evaluate athletes’ performance.
Government: Data science can prevent tax evasion and predict incarceration rates.
E-commerce: Data science can automate digital ad placement.
Gaming: Data science can improve online gaming experiences.
Social media: Data science can create matching algorithms, for example to pinpoint compatible partners on dating platforms.
Fintech: Data science can help create credit reports and financial profiles, run accelerated underwriting and create predictive models based on historical payroll data.

15/06/2022 69
Applications of Data Science

15/06/2022 70
Use-Cases of Data Science

Netflix

• Netflix initially started as a DVD rental service in 1998, relying mostly on third-party postal services to deliver its DVDs to users. This resulted in heavy losses, which it soon mitigated with the introduction of its online streaming service in 2007.
• To make this happen, Netflix invested in many algorithms to provide a flawless movie experience to its users. One such algorithm is the recommendation system Netflix uses to provide suggestions to users.
• A recommendation system understands the needs of the users and suggests various cinematographic products.
• A recommendation system is a platform that provides its users with content based on their preferences and likes. It takes information about the user as input.
• This information can be in the form of past usage of a product or the ratings provided for the product. The system then processes this information to predict how much the user would rate or prefer each product. A recommendation system makes use of a variety of machine learning algorithms.

15/06/2022 71
Use-Cases of Data Science

Another important role that a recommendation system plays today is to search for similarity between different products. In the case of Netflix, the recommendation system searches for movies similar to the ones you have watched or liked previously.
• This is an important method for scenarios that involve a cold start, where the company does not have much user data available to generate recommendations.
• Therefore, based on the movies that are watched, Netflix recommends films that share a degree of similarity. There are two main types of recommendation systems:
• 1. Content-based recommendation systems
• In a content-based recommendation system, background knowledge of the products and customer information is taken into consideration. Based on the content you have viewed on Netflix, it provides you with similar suggestions.
• For example, if you have watched a film in the sci-fi genre, the content-based recommendation system will suggest similar films of the same genre.
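A content-based recommender of the kind described can be sketched with genre feature vectors and cosine similarity. The titles and vectors below are invented for illustration; this is not Netflix's actual algorithm.

```python
# Hedged sketch of content-based recommendation: represent each title by
# a genre feature vector and recommend the title most similar to what the
# user watched.

import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Feature order: [sci-fi, drama, comedy] -- hypothetical catalog
catalog = {
    "Star Voyage":   [1, 0, 0],
    "Deep Space":    [1, 1, 0],
    "Office Laughs": [0, 0, 1],
}

def recommend(watched_vec, catalog, exclude=()):
    """Return the unwatched title most similar to the watched profile."""
    ranked = sorted(
        ((cosine(watched_vec, v), title) for title, v in catalog.items()
         if title not in exclude),
        reverse=True)
    return ranked[0][1]

# A user who watched the pure sci-fi title gets the closest sci-fi match.
print(recommend([1, 0, 0], catalog, exclude={"Star Voyage"}))
```

Production systems build these feature vectors from far richer metadata (cast, tags, viewing context), but the similarity ranking works the same way.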

15/06/2022 72
Use-Cases of Data Science

Uber – Using Data to Make Rides Better

Next among the data science use cases is Uber, a popular smartphone application that allows you to book a cab. Uber makes extensive use of Big Data; after all, it has to maintain a large database of drivers, customers, and several other records.
It is therefore rooted in Big Data and uses it to derive insights and provide the best services to its users. Uber shares the big data principle with crowdsourcing: any registered driver in the area can help anyone who wants to go somewhere.
As mentioned above, Uber maintains a database of drivers, so whenever you hail a cab, Uber matches your profile with the most suitable driver. What differentiates Uber from other cab companies is that it charges you based on the time it takes to cover the distance, not the distance itself.
It calculates the time taken through various algorithms that also use data on traffic density and weather conditions.
Uber makes the best use of data science to calculate its surge pricing. When fewer drivers are available for more riders, the price of the ride goes up; this happens during a scarcity of drivers in a given area.
However, if demand for Uber rides is lower, Uber charges a lower rate. This dynamic pricing is rooted in Big Data and makes excellent use of data science to calculate fares based on these parameters.
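The surge-pricing idea can be sketched as a demand-to-supply ratio applied to a base fare. The formula, cap, and numbers below are illustrative assumptions, not Uber's actual pricing model.

```python
# Hedged sketch of dynamic (surge) pricing: scale the base fare by the
# ratio of ride requests to available drivers, clamped to a sane range.

def surge_multiplier(requests, drivers, cap=3.0):
    """Demand/supply ratio, never below 1.0 and never above the cap."""
    if drivers <= 0:
        return cap
    return min(max(requests / drivers, 1.0), cap)

def fare(base_fare, requests, drivers):
    return round(base_fare * surge_multiplier(requests, drivers), 2)

print(fare(10.0, 50, 20))   # demand exceeds supply: fare rises
print(fare(10.0, 10, 20))   # ample drivers: base fare applies
```

A real pricing engine would feed traffic, weather, and historical demand into the multiplier, but the clamp-a-ratio structure is the core idea.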

15/06/2022 73
Use-Cases of Data Science

Facebook – Using Data to Revolutionize Social Networking & Advertising

• Facebook is a social media leader in the world today. With millions of users around the world, Facebook uses large-scale quantitative research through data science to gain insights into people's social interactions.
• Facebook has become a hub of innovation, using advanced techniques in data science to study user behavior and gain insights to improve its product. Facebook makes use of an advanced data science technology called deep learning.
• Using deep learning, Facebook performs facial recognition and text analysis. In facial recognition, Facebook uses powerful neural networks to classify faces in photographs. It uses its own text-understanding engine, called "DeepText", to understand user sentences.
• It also uses DeepText to understand people's interests and to align photographs with text.
• However, more than a social media platform, Facebook is an advertising corporation. It uses deep learning for targeted advertising, deciding what kinds of advertisements users should view.
• It uses the insights gained from the data to cluster users based on their preferences and serves them advertisements that appeal to them.

15/06/2022 74
Use-Cases of Data Science

• Since its inception, Amazon has been working hard to make itself a customer-centric platform. Amazon heavily relies on predictive analytics to increase customer satisfaction. It does so through a personalized recommendation system.
• This recommendation system is a hybrid type that also involves collaborative filtering which is comprehensive in nature. Amazon analyzes the historical purchases of the user to recommend more products.

• This also comes through the suggestions that are drawn from other users who use similar products or provide similar ratings.
• Amazon has an anticipatory shipping model that uses big data to predict the products users are most likely to purchase. It analyzes the pattern of your purchases and sends products you may need in the future to your nearest warehouse.
• Amazon also optimizes the prices on its websites by keeping in mind various parameters like the user activity, order history, prices offered by the competitors, product availability, etc. Using this method, Amazon provides discounts on popular items and earns profits on less popular items.
• Another area that every e-commerce platform is addressing is fraud detection. Amazon has its own novel ways and algorithms to detect fraudulent sellers and fraudulent purchases.
• Other than online platforms, Amazon has been optimizing the packaging of products in warehouses and increasing the efficiency of packaging lines through the data collected from the workers.
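The collaborative-filtering idea mentioned above can be sketched as: find the most similar user, then recommend an item they rated highly. The users, items, ratings, and similarity measure below are all invented for illustration; Amazon's production system is far more elaborate.

```python
# Hedged sketch of user-based collaborative filtering.

def similarity(r1, r2):
    """Fraction of co-rated items where both users agree (like vs dislike)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    agree = sum(1 for i in common if (r1[i] >= 4) == (r2[i] >= 4))
    return agree / len(common)

# Hypothetical ratings on a 1-5 scale
ratings = {
    "alice": {"book": 5, "lamp": 4, "kettle": 1},
    "bob":   {"book": 5, "lamp": 5, "mixer": 4},
    "carol": {"book": 1, "kettle": 5},
}

def recommend(user, ratings):
    """Recommend the best-rated item from the most similar other user."""
    others = [(similarity(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    _, best = max(others)
    seen = set(ratings[user])
    unseen = {i: s for i, s in ratings[best].items() if i not in seen}
    return max(unseen, key=unseen.get) if unseen else None

print(recommend("alice", ratings))
```

Hybrid systems like the one described combine this user-to-user signal with the content-based similarity shown earlier for the Netflix case.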

15/06/2022 75
Use-Cases of Data Science

15/06/2022 76
Weekly/Monthly Assignment

1. Explain the differences between supervised and unsupervised learning?


2. Why is Python used for Data Cleaning in DS?
3. Why is R used in Data Visualization?
4. What do you understand by linear regression?
5. What do you understand by logistic regression?
6. How is Data Science different from traditional application programming?
7. How are Data Science and Machine Learning related to each other?
8. Explain the difference between Data Science and Data Analytics?
9. Define the term deep learning?
10. List out the libraries in Python used for Data Analysis and Scientific Computations.

15/06/2022 77
Weekly/Monthly Assignment

1. Explain Big Data: its characteristics and applications.


2. Explain the building blocks of Hadoop.
3. Explain why Big Data is important.
4. What is data analysis? Why is Python used for data analysis?
5. What are the applications of machine learning in data science?
6. What are the problems faced when handling large data?
7. What do you understand by crowdsourcing analytics?
8. What do you mean by the 5 V's of Big Data?
9. What are the security challenges of Data Science?
10. How can data science be used in the medical industry? Explain briefly.

15/06/2022 78
Glossary Questions

1. Raw Data is original _______ of data.


2.Creating reproducible _______ is performed by Data Scientist.
3. Hadoop is a framework that works with a variety of related tools. Common cohorts include
____________.
4. __________characteristic of big data is relatively more concerned to data science.
5. ____________analytical capabilities are provided by information management company?
6. __________step is performed by data scientist after acquiring the data.
7. _______ are not sufficient to describe big data.
8. _____focuses on the discovery of (previously) unknown properties on the data.
9. As companies move past the experimental phase with Hadoop, many cite the need for additional
capabilities, including _______________.
10. Data in ____ bytes size is called big data.

15/06/2022 79
References

15/06/2022 80
Expected Questions for End Semester Exam

1. What is data science and its benefits?


2. Explain the roles and stages in data science.
3. What are the goals of data science?
4. What is data analysis? Why is Python used for data analysis?
5. Explain supervised and unsupervised machine learning.
6. Why do we need machine learning in data science?
7. Explain general techniques for handling large volumes of data.
8. What are the problems faced when handling large data?
9. Explain the different stages of data science.
10. Differentiate between data analysis and analytics.

15/06/2022 81
