Unit 1
CS3ED06
Medi-Caps University
UNIT-I
Dr.Pramod S. Nair, Dean Engineering & Professor, CSE, Medi-Caps University
Books to Refer
• Text Books
1. Data Science from Scratch: First Principles with Python 1st Edition by Joel Grus
2. Principles of Data Science by Sinan Ozdemir, (2016) PACKT publishing.
• Reference Books
1. Data Science For Dummies by Lillian Pierson (2015)
2. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking by Foster Provost, Tom Fawcett
• https://hackr.io/blog/data-science-books
Data Science Jobs
• The average salary for a data scientist ranges from approximately $95,000 to
$165,000 per annum.
• According to various studies, about 11.5 million jobs will be created by the
year 2026.
Types of Data We Have
• Linear Algebra
• Probability Theory and Statistics
• Multivariate Calculus
• Algorithms and Complex Optimizations
• Real and Complex Analysis
• Information Theory
• Function Spaces and Manifolds.
Data Science
Data Science Evolution
• Data Science helps business sectors improve marketing & customer acquisition.
• It helps business processes solve various challenges & assists in
prompt decision-making.
• It helps derive hidden patterns & trends from the data.
• Businesses use it for innovation & to enrich customer experiences.
• Data Scientist
• They find stories and extract knowledge. They are not reporters.
• Data scientists are the key to realizing the opportunities presented by big data.
• They advise executives on the implications for products, processes, and decisions.
Big data has arisen to be defined as something like: that amount of data that
will not practically fit into a standard (relational) database for analysis and
Big Data Infrastructure
• The amount of data that companies can collect is huge; this is what is meant
by big data.
• But to use that data to extract valuable information, data science is
needed.
• Big data is characterized by the 3Vs: velocity, variety, and volume.
• Data science, in turn, provides the techniques to analyze the data.
• Big data analysis deals with large data sets and is also known
as data mining.
• Data science, however, uses machine learning algorithms to design
and develop statistical models that generate knowledge from the pile of big
data.
▪ Data Strategy
▪ Data Engineering
▪ Data Analysis and Models
▪ Data Visualization and Operationalization
Data Engineering
Data engineering is the use of technology and systems to access, organize,
and utilize data.
• Statistics
• The mathematical foundation of Data Science is statistics.
• Statistics is the discipline that concerns the collection, organization,
analysis, interpretation, and presentation of data.
• Without a clear knowledge of statistics and probability, there is
a high possibility of misinterpreting data and reaching incorrect
conclusions.
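The risk of misinterpreting data can be illustrated with a minimal sketch. The salary figures below are hypothetical and chosen only to show how a single extreme value pulls the mean away from what is typical, while the median stays close to it.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 35, 31, 33, 400]

avg = mean(salaries)    # pulled upward by the single outlier
mid = median(salaries)  # robust to the outlier

print(f"mean   = {avg:.1f}")  # mean   = 93.5
print(f"median = {mid:.1f}")  # median = 32.5
```

Reporting only the mean here would suggest a typical salary nearly three times higher than what most people in the sample earn.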
Data Science Framework
• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment
• Business understanding
• Define the business problem
• Set the criteria for success
• Convert business problem to DS problem
• Categorize the DS problem (Classification/Regression/Anomaly Detection etc)
• Prepare a high-level plan to achieve results
• Visualize the DS pipeline in context of objective (Evaluation Criteria / Algorithms /
Transformations)
• Data understanding
• Collect & integrate initial data
• Understand the attributes & their relationships
• Identify data quality issues
• Perform EDA (Exploratory Data Analysis)
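One simple data quality check during data understanding is measuring how often each field is actually filled in. The sketch below is an illustration only: the records, the field names, and the 0.5 cut-off for calling a field sparse are all assumptions, not part of the course material.

```python
def fill_rates(records):
    """Return the fraction of records with a non-missing value for each field."""
    fields = {name for record in records for name in record}
    total = len(records)
    return {
        name: sum(record.get(name) is not None for record in records) / total
        for name in sorted(fields)
    }

# Hypothetical records; None or an absent key both count as missing.
records = [
    {"amount": 12000, "term": 36, "bankruptcies": None},
    {"amount": 8000, "term": 60, "bankruptcies": 1},
    {"amount": 15000, "term": None, "bankruptcies": None},
    {"amount": 5000, "term": 36},
]

rates = fill_rates(records)
# Assumed threshold: a field filled in fewer than half the records is "sparse".
sparse = [name for name, rate in rates.items() if rate < 0.5]
print(rates)
print(sparse)
```

Fields flagged this way are candidates for closer inspection before any modeling decision is made about them.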
• Data preparation
• Integrate data sources
• Handle missing values, outliers
• Apply Feature Engineering
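Handling missing values and outliers can be sketched as follows. This is one common approach among many, not a prescribed method: missing entries are filled with the median, and extreme values are clipped to Tukey's fences (1.5 times the interquartile range). The ages are hypothetical.

```python
from statistics import median, quantiles

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def clip_outliers(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

# Hypothetical applicant ages: one missing value and one implausible entry.
ages = [25, 31, None, 40, 38, 29, 120]
filled = impute_missing(ages)
cleaned = clip_outliers(filled)
print(filled)
print(cleaned)
```

Whether to clip, remove, or keep an outlier depends on the business problem; clipping is used here only because it keeps the number of records unchanged.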
• Modeling
• Apply multiple models
• Choose the best-performing model
• Create a feedback pipeline
• Ensemble/Stack different models
• Evaluation
• Refine evaluation criteria
• Evaluate the models
• Handle Overfitting/Underfitting
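Overfitting is usually detected by comparing error on the training data with error on held-out data. The sketch below uses a deliberately extreme model, a 1-nearest-neighbour predictor that memorizes its training points, on hypothetical noisy data. The data, the split sizes, and the choice of model are illustrative assumptions.

```python
import random

def one_nn_predict(train, x):
    """Predict y for x by copying the y of the nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def mse(train, points):
    """Mean squared error of the 1-NN model (built on train) over points."""
    errors = [(one_nn_predict(train, x) - y) ** 2 for x, y in points]
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical data: y = 2x plus Gaussian noise.
data = [(x, 2 * x + rng.gauss(0, 1)) for x in range(40)]
rng.shuffle(data)
train, validation = data[:30], data[30:]

train_mse = mse(train, train)            # zero: every point is its own neighbour
validation_mse = mse(train, validation)  # higher: memorization does not generalize
print(train_mse, validation_mse)
```

A large gap between the two errors points to overfitting; high error on both sets would instead point to underfitting.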
• Deployment
• Prepare a detailed deployment plan
• Agree on post-deployment monitoring & support
• Monitor & support
• The bank collects a large amount of data for each approved loan: 146
fields. These fields can be split into a few distinct groups:
• Loan demographics, such as the amount, the term, the interest rate, and the reason for
loan.
• Applicant demographics, such as age, salary, employment length, and home
ownership.
• Numerous risk factors, such as the number of public records, credit card delinquency,
and bankruptcy. Roughly 70 sparsely populated risk fields cover these risk factors.
• The goal is to create a predictive model to identify loans that might be bad
loans.
• However, in the raw data, no field indicates whether the loan is good or
bad.
• A data scientist sits down with the customer to understand the customer's
business problem.
• The first goal is to articulate the overall business problem:
“Can I detect attributes about a person or a type of loan that can be
used to flag a risky loan that needs to be processed by my loan
underwriting team?”
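Because the raw data has no good/bad field, a target label has to be defined before any model can be trained. The sketch below shows one way this might look. The field name `loan_status`, the status values, and the rule for what counts as "bad" are all hypothetical; in practice the definition would be agreed with the bank.

```python
# Hypothetical status values treated as "bad"; the real definition would be
# agreed with the bank's loan underwriting team.
BAD_STATUSES = {"charged_off", "default", "late_120_days"}

def label_loan(record):
    """Return 1 for a bad loan, 0 for a good one, None if the status is unknown."""
    status = record.get("loan_status")  # assumed field name
    if status is None:
        return None
    return 1 if status in BAD_STATUSES else 0

# Hypothetical loan records.
loans = [
    {"loan_id": 1, "amount": 12000, "loan_status": "fully_paid"},
    {"loan_id": 2, "amount": 8000, "loan_status": "charged_off"},
    {"loan_id": 3, "amount": 15000},
]

labels = [label_loan(loan) for loan in loans]
print(labels)  # [0, 1, None]
```

Loans with an unknown label are kept as None rather than guessed, so they can be excluded from training or reviewed separately.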