Data Science

CS3ED06
Medi-Caps University
UNIT-I
Dr.Pramod S. Nair, Dean Engineering & Professor, CSE, Medi-Caps University
Books to Refer
• Text Books
1. Data Science from Scratch: First Principles with Python 1st Edition by Joel Grus
2. Principles of Data Science by Sinan Ozdemir, (2016) PACKT publishing.

• Reference Books
1. Data Science For Dummies by Lillian Pierson (2015)
2. Data Science for Business: What You Need to Know about Data Mining and Data-
Analytic Thinking by Foster Provost, Tom Fawcett

• https://hackr.io/blog/data-science-books
Data Science Jobs

• The average salary for a data scientist ranges from approximately $95,000 to
$165,000 per annum.
• According to various studies, about 11.5 million jobs will be created by the
year 2026.



Types of Data Science Jobs
❖Data Scientist
❖Data Analyst
❖Machine Learning Expert/Engineer
❖Data Engineer
❖Data Architect
❖Data Administrator
❖Business Analyst
❖Business Intelligence Manager
Skills You Should Have
• Programming Skills
• Statistics
• Machine Learning
• Multivariable Calculus & Linear Algebra
• Data Wrangling
• Data Visualization & Communication
• Data Intuition
Some Important Domains of Data Scientists
• National Security
• Finance
• Manufacturing
• Business
• Engineering
• Healthcare
• Education
• Sports
• And many more…
Why Data Science: Data All Around
• Lots of data is being collected and warehoused
• Web data, e-commerce data
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social Network

Types of Data We Have

• Relational Data (Tables/Transaction/Legacy Data)


• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data



Types of Data
Qualitative Data
• Qualitative or categorical data is data that can't be measured or counted in the form of
numbers.
• These types of data are sorted by category, not by number.
• Some qualitative data examples:
• The gender of a person, i.e., male, female, or others
• What language do you speak
• Favorite holiday destination
• Opinion on something (agree, disagree, or neutral)
• Colors
Types of Qualitative Data
• Nominal Data
• Nominal Data is used to label variables without any order or quantitative value
• Examples of Nominal Data :
• Hair colour (Blonde, Red, Brown, Black, etc.)
• Marital status (Single, Widowed, Married)
• Nationality (Indian, German, American)
• Gender (Male, Female, Others)
• Eye Color (Black, Brown, etc.)
Types of Qualitative Data
• Ordinal Data
• Ordinal data have a natural ordering: the values are ranked by their position on a
scale, although the gaps between positions need not be equal.
• Examples of Ordinal Data :
• When companies ask for feedback, experience, or satisfaction on a scale of 1 to 10
• Letter grades in the exam (A, B, C, D, etc.)
• Ranking of people in a competition (First, Second, Third, etc.)
• Economic Status (High, Medium, and Low)
• Education Level (Higher, Secondary, Primary)
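
The difference between nominal and ordinal data can be sketched in Python. This is a minimal illustration using only the standard library; the sample values and the `EducationLevel` name are assumptions made for the example, not part of the course material.

```python
from collections import Counter
from enum import IntEnum

# Nominal data: labels with no order. Counting and finding the most frequent
# label are meaningful; sorting or averaging the labels is not.
hair_colours = ["Brown", "Black", "Blonde", "Black", "Red", "Black"]
colour_counts = Counter(hair_colours)
most_common_colour = colour_counts.most_common(1)[0][0]

# Ordinal data: categories with a natural order. An IntEnum records that
# order so values can be compared and sorted.
class EducationLevel(IntEnum):
    PRIMARY = 1
    SECONDARY = 2
    HIGHER = 3

levels = [EducationLevel.HIGHER, EducationLevel.PRIMARY, EducationLevel.SECONDARY]
sorted_levels = sorted(levels)
highest_level = max(levels)

print(most_common_colour)
print([level.name for level in sorted_levels])
print(highest_level.name)
```

The numbers attached to the ordinal categories only encode rank; the difference between two ranks is not assumed to be a meaningful quantity.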
Quantitative Data
• Quantitative data can be expressed in numerical values.
• Because they are counted or measured, quantitative data support arithmetic and statistical analysis.
• Examples of Quantitative Data :
• Height or weight of a person or object
• Room Temperature
• Scores and Marks (Ex: 59, 80, 60, etc.)
• Time
Types of Quantitative Data
• Discrete Data
• The term discrete means distinct or separate. Discrete data contain values that are
integers or whole numbers.
• Discrete data are countable: they take separate values with nothing in between.
• Examples of Discrete Data :
• Total numbers of students present in a class
• Cost of a cell phone
• Numbers of employees in a company
• The total number of players who participated in a competition
• Days in a week
Types of Quantitative Data
• Continuous Data
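• Continuous data can take any value within a range, including fractional values.
• The key difference between discrete and continuous data is that discrete data
are counted in whole numbers, whereas continuous data are measured and can be
subdivided as finely as the measurement allows.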
• Examples of Continuous Data :
• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price
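
One practical consequence of the discrete/continuous distinction is how the values are summarized. The sketch below, with made-up sample values and a hypothetical helper `bin_label`, tallies discrete counts directly but groups continuous measurements into bins first.

```python
import math
from collections import Counter

# Discrete data: number of students present per class (whole numbers only),
# so exact values can be tallied directly.
students_present = [28, 30, 28, 31, 30, 28]
frequency = Counter(students_present)

# Continuous data: heights in metres. Exact repeats are unlikely, so the
# values are grouped into bins before counting.
heights_m = [1.52, 1.68, 1.71, 1.59, 1.83, 1.66]

def bin_label(value, width=0.10):
    """Return the lower edge of the bin of the given width that contains value."""
    return round(math.floor(value / width) * width, 2)

height_bins = Counter(bin_label(h) for h in heights_m)

print(frequency)
print(height_bins)
```

The bin width of 0.10 m is an arbitrary choice for the illustration; a different width gives a different but equally valid summary.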
Structured Data
• Structured data is data whose elements are addressable for effective analysis. It has been organized
into a formatted repository, typically a database.
• It covers all data that can be stored in a database table with rows and columns.
• Such data have relational keys and can easily be mapped into pre-designed fields.
• Today, structured data is the most widely processed kind of data and the simplest to
manage.
• Example: Relational data.



Semi-Structured Data
• Semi-structured data is information that does not reside in a relational
database but has some organizational properties that make it easier to
analyse.
• With some processing, it can be stored in a relational database (this can be
very hard for some kinds of semi-structured data), but semi-structured formats exist
to save space.

• Example: XML data, Email etc.


Unstructured Data

• Unstructured data is data that is not organized in a predefined manner
or does not have a predefined data model, so it is not a good fit for a
mainstream relational database.
• For unstructured data there are alternative platforms for storing and
managing it. It is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications.
• Example: Word, PDF, Text, Media logs.
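
The three categories above can be contrasted with a short Python sketch. It uses only the standard library; the table name, the student records, and the sample sentence are invented for illustration.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows and columns with a fixed schema in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [(1, "Asha", 80), (2, "Ravi", 59)])
top_scorer = conn.execute(
    "SELECT name FROM student ORDER BY marks DESC LIMIT 1"
).fetchone()[0]
conn.close()

# Semi-structured: tags give some organization, but records need not share
# the same fields (the second student has no email element).
xml_text = """
<students>
  <student id="1"><name>Asha</name><email>asha@example.com</email></student>
  <student id="2"><name>Ravi</name></student>
</students>
"""
root = ET.fromstring(xml_text)
emails = [s.findtext("email") for s in root.findall("student")]  # None when absent

# Unstructured: free text with no data model; any structure has to be extracted.
note = "Asha scored 80 marks and Ravi scored 59 marks in the unit test."
marks_in_text = [int(m) for m in re.findall(r"\b(\d+) marks\b", note)]

print(top_scorer, emails, marks_in_text)
```

The same facts appear in all three forms; what changes is how much work is needed before they can be queried.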
Data Science

• Data Science is the deep study of massive amounts of data. It involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using
the scientific method, different technologies, and algorithms.

• It is a multidisciplinary field that uses tools and techniques to manipulate
the data.
What is Data Science?

• An area that manages, manipulates, extracts, and interprets knowledge from
tremendous amounts of data
• Data Science (DS) is a multidisciplinary field of study whose goal is to address the
challenges in big data
• Data Science principles apply to all data – big and small



What is Data Science?
• Theories and techniques from many fields and disciplines are used to investigate and analyze a
large amount of data to help decision makers in many industries such as science, engineering,
economics, politics, finance, and education
• Computer Science
• Pattern recognition, visualization, data warehousing, High Performance Computing,
Databases, AI
• Mathematics
• Mathematical Modeling
• Statistics
• Statistical and Stochastic Modeling, Probability.
Maths and Statistics Involved in ML

• Linear Algebra
• Probability Theory and Statistics
• Multivariate Calculus
• Algorithms and Complex Optimizations
• Real and Complex Analysis
• Information Theory
• Function Spaces and Manifolds.
In a Nutshell

Data Science is all about …



Need of Data Science

Data Science Evolution

Business Intelligence Vs Business Analytics
Data Science Vs Machine Learning
Importance of Data Science

• Data Science helps sectors improve marketing & customer acquisition.
• It helps businesses solve various challenges & assists in
prompt decision-making.
• It is helpful for deriving hidden patterns & trends from the data.
• Businesses use it for innovation & for enriching customer experiences.



Real Life Examples

• Companies learn your secrets, shopping patterns, and preferences
• Some political parties in India also make use of Data Science and its
possibilities.
• Data Science and elections (2008, 2012)
• 1 million people installed the Obama Facebook app that gave access to info on
“friends”



Data Scientists

• Data Scientist
• They find stories, extract knowledge. They are not reporters



Data Scientists

• Data scientists are the key to realizing the opportunities presented by big data.

• They bring structure to it, find compelling patterns in it, and advise executives
on the implications for products, processes, and decisions.



Big Data

“While definitions of ‘big data’ may differ slightly, at the root of each are
very large, diverse data sets that include structured, semi-structured and
unstructured data from different sources and in different volumes, from terabytes to
zettabytes. It’s about data sets so large and diverse that it’s difficult, if not
impossible, for traditional relational databases to capture, manage, and process
them with low-latency”
Rob Thomas, Senior Vice President of IBM Cloud and Data Platform.
Internet Traffic (Mobile and Desktop)

More than half of the world's population is currently connected to the internet,
a figure of more than 4.13 billion people.



Big Data

Big data has come to be defined as something like: the amount of data that
will not practically fit into a standard (relational) database for analysis and
processing, caused by the huge volumes of information being created by
human and machine-generated processes.



Big Data
Big Data is any data that is expensive to manage and hard to extract value from
due to high:
• Volume
• The amount of data matters. With big data, we have to process high volumes of
low-density, unstructured data.
• This can be data of unknown value, such as Twitter data feeds,
clickstreams on a web page or a mobile app, or sensor-enabled
equipment.
Big Data
• Velocity
• Velocity is the fast rate at which data is received and (perhaps)
acted on.
• Normally, the highest-velocity data streams directly into
memory rather than being written to disk.
• Some internet-enabled smart products operate in real time or near
real time and will require real-time evaluation and action.
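
The idea of acting on data as it streams through memory can be sketched with a Python generator. This is only an illustration: `sensor_feed` is a hypothetical stand-in for a live feed, and the readings and the 25.0 alert threshold are invented values.

```python
from collections import deque

def rolling_mean(readings, window=3):
    """Yield the mean of the last `window` readings as each new one arrives."""
    recent = deque(maxlen=window)  # only the newest `window` values stay in memory
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)

def sensor_feed():
    """Stand-in for a live feed; a real stream would not be a fixed list."""
    yield from [20.0, 22.0, 21.0, 35.0, 23.0]

# Evaluate each reading as it arrives and keep only the averages that
# cross the (assumed) alert threshold.
alerts = [round(avg, 2) for avg in rolling_mean(sensor_feed()) if avg > 25.0]
print(alerts)
```

Because the generator keeps only a fixed-size window, memory use stays constant however long the stream runs.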



Big Data
• Variety and Complexity
• Variety refers to the many types of data that are available.
Traditional data types were structured and fit neatly in a relational
database.
• With the rise of big data, data comes in new unstructured data
types. Unstructured and semi-structured data types, such as text,
audio, and video, require additional preprocessing to derive
meaning and support metadata.
Some Make It 4 V’s



Big Data: 3V’s

Big Data Infrastructure



Field of Data Science



Big Data Vs Data Science

• Big data is used by organisations to improve efficiency, understand
untapped markets, and enhance competitiveness,

• while data science concentrates on providing modelling techniques
and methods to evaluate the potential of big data in a precise way.



Big Data Vs Data Science

• The amount of data that companies can collect is huge, and
this is what big data refers to.
• But to utilise the data and extract valuable information from it, data science is
needed.



Big Data Vs Data Science

• Big data is characterized by the 3 Vs: velocity,
variety, and volume.
• Data science, in contrast, provides the techniques to analyze the data.



Big Data Vs Data Science

• Data science uses theoretical as well as practical approaches to
dig information out of big data, which plays an important role in
utilizing the potential of big data.
• Big data, in turn, can be considered a pool of data that has no
credibility unless analysed with deductive and inductive reasoning.



Big Data Vs Data Science

• Big data analysis, also known as data mining, caters to large
data sets,
• but data science makes use of machine learning algorithms to design
and develop statistical models that generate knowledge from the pile of big
data.



Big Data Vs Data Science

• Data science focuses more on business decisions,
• whereas big data relates more to technology, computer tools, and
software.





Difference between BI and Data Science

Criterion | Business intelligence | Data science
Data source | Deals with structured data, e.g., a data warehouse. | Deals with structured and unstructured data, e.g., weblogs, feedback, etc.
Method | Analytical (historical data) | Scientific (goes deeper to find the reason behind the data report)
Skills | Statistics and visualization are the two required skills. | Statistics, visualization, and machine learning are the required skills.
Focus | Focuses on both past and present data. | Focuses on past data, present data, and also future predictions.
Data Science Components

▪ Data Strategy
▪ Data Engineering
▪ Data Analysis and Models
▪ Data Visualization and Operationalization



Data Strategy
Data Strategy addresses more than the data; it is based on:
• What employees need so that they are empowered to use the data
• Finding the processes that ensure data is accessible and of high quality
• Technology that will enable the storage, sharing and analysis of data



Data Strategy
• Business Requirements
• Sourcing and Gathering Data
• Technology Infrastructure Requirements
• Turning Data into Insights- Visualization
• People and Processes
• Data Governance
• The Roadmap



Conceptual View

Data Engineering

The use of technology and systems to access, organize, and utilize data is
known as data engineering.

• Practical applications of data collection and analysis
• Mechanisms for collecting and validating that information
• Data engineers focus on the applications and harvesting of big data



Data Science Components

• Statistics
• The mathematical foundation of Data Science is statistics.
• Statistics is the discipline that concerns the collection, organization,
analysis, interpretation, and presentation of data.
• Without a clear knowledge of statistics and probability, there is
a high possibility of misinterpreting data and reaching incorrect
conclusions.
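
A small example of how data can be misinterpreted without basic statistics: a single extreme value distorts the mean but not the median. The salary figures below are hypothetical, chosen only to show the effect, and the calculation uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical monthly salaries (in thousands); one extreme value is included.
salaries = [30, 32, 35, 31, 33, 400]

mean_salary = statistics.mean(salaries)      # pulled upward by the extreme value
median_salary = statistics.median(salaries)  # robust to the extreme value
spread = statistics.stdev(salaries)          # sample standard deviation

print(mean_salary, median_salary, round(spread, 1))
```

Reporting only the mean here would suggest a typical salary near 93.5, although five of the six values lie between 30 and 35.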



Data Science Hierarchy of Needs

Data Science Framework

• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment



Data Science Framework

• Business understanding
• Define the business problem
• Set the criteria for success
• Convert business problem to DS problem
• Categorize the DS problem (Classification/Regression/Anomaly Detection, etc.)
• Prepare a high-level plan to achieve results
• Visualize the DS pipeline in the context of the objective (Evaluation Criteria / Algorithms /
Transformations)



Data Science Framework

• Data understanding
• Collect & integrate initial data
• Understand the attributes & their relationships
• Identify data quality issues
• Perform EDA (Exploratory Data Analysis)
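
A first pass at data understanding often amounts to loading the data and summarizing each attribute. The sketch below is a minimal illustration with the standard library; the column names, the sample rows, and the helper `summarize` are assumptions made for the example.

```python
import csv
import io
import statistics

# Hypothetical sample; in practice the rows would come from the collected data.
raw = io.StringIO(
    "age,salary,home_ownership\n"
    "25,30000,RENT\n"
    "40,,OWN\n"
    "31,52000,RENT\n"
    "58,610000,MORTGAGE\n"
)
rows = list(csv.DictReader(raw))

def summarize(rows, column):
    """Report the missing count and basic statistics for one numeric column."""
    present = [float(r[column]) for r in rows if r[column] != ""]
    return {
        "missing": len(rows) - len(present),
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }

salary_summary = summarize(rows, "salary")
print(salary_summary)
```

Even this small summary surfaces two data quality issues: a missing salary and a maximum that is far from the median.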



Data Science Framework

• Data preparation
• Integrate data sources
• Handle missing values, outliers
• Apply Feature Engineering
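
The preparation steps above can be sketched as follows. The choices here are assumptions for illustration, not the method prescribed by the course: missing salaries are filled with the median, outliers are capped at Q3 + 1.5 × IQR (a common rule of thumb), and `loan_to_income` is a made-up engineered feature. All data values are hypothetical.

```python
import statistics

# Hypothetical applicant data; None marks a missing salary.
salaries = [30000, 35000, None, 40000, 45000, 50000, 55000, 60000, 600000]
loan_amounts = [6000, 7000, 9500, 10000, 9000, 5000, 11000, 15000, 18500]

# Missing values: impute with the median of the known salaries.
known = [s for s in salaries if s is not None]
median_salary = statistics.median(known)

# Outliers: cap anything above Q3 + 1.5 * IQR.
q1, _, q3 = statistics.quantiles(known, n=4)
upper_cap = q3 + 1.5 * (q3 - q1)

cleaned_salaries = [
    min(median_salary if s is None else s, upper_cap) for s in salaries
]

# Feature engineering: derive a ratio that relates the loan to the income.
loan_to_income = [loan / sal for loan, sal in zip(loan_amounts, cleaned_salaries)]

print(median_salary, upper_cap)
print(cleaned_salaries)
```

Note that the quartiles are computed from the raw values, so with very few records a single extreme value can inflate the cap itself.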



Data Science Framework

• Modeling
• Apply multiple models
• Choose the best-performing model
• Create a feedback pipeline
• Ensemble/Stack different models
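
Applying several models and choosing among them can be sketched with two very simple candidates: a mean baseline and a least-squares line. The data, the function names, and the use of validation mean squared error as the selection criterion are assumptions made for this illustration.

```python
import statistics

# Hypothetical data: x = years of experience, y = salary in thousands.
train = [(1, 32), (2, 39), (3, 47), (4, 52), (5, 61), (6, 68)]
valid = [(7, 75), (8, 81)]

def fit_mean(data):
    """Baseline model: always predict the training mean of y."""
    mean_y = statistics.mean([y for _, y in data])
    return lambda x: mean_y

def fit_line(data):
    """Simple linear regression fitted by ordinary least squares."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return statistics.mean([(model(x) - y) ** 2 for x, y in data])

# Fit every candidate on the training set, score each on held-out data,
# and keep the one with the lowest validation error.
candidates = {"mean_baseline": fit_mean(train), "linear_regression": fit_line(train)}
scores = {name: mse(model, valid) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

print(best_name)
```

Scoring on data that was not used for fitting is what makes the comparison meaningful; comparing training error alone would favour whichever model memorizes best.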



Data Science Framework

• Evaluation
• Refine evaluation criteria
• Evaluate the models
• Handle Overfitting/Underfitting
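
Evaluating a classifier usually means more than accuracy. The sketch below computes precision, recall, and F1 from binary labels and adds a crude overfitting check; the labels are invented, and the 0.10 tolerance in `looks_overfit` is an arbitrary assumption rather than a standard threshold.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(actual, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def looks_overfit(train_score, valid_score, tolerance=0.10):
    """Flag a model whose training score exceeds its validation score by more than tolerance."""
    return train_score - valid_score > tolerance

actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)
```

A large gap between training and validation scores suggests overfitting; low scores on both suggest underfitting.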



Data Science Framework

• Deployment
• Prepare a detailed deployment plan
• Agree on post-deployment monitoring & support
• Monitor & support



Example

• A bank wants to optimize its loan underwriting procedures.


• Currently, it applies filters on loan applications that automatically reject
the riskier ones.
• However, the bank is still approving too many applications that run into
repayment issues.



Example

• The bank collects a large amount of data for each approved loan: 146
fields. These fields can be split into a few distinct groups:
• Loan demographics, such as the amount, the term, the interest rate, and the reason for
the loan.
• Applicant demographics, such as age, salary, employment length, and home
ownership.
• Numerous risk factors, such as the number of public records, credit card delinquency,
and bankruptcy. Roughly 70 sparsely populated risk fields cover these risk factors.
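
Whether a field is sparsely populated can be measured by its fill rate, the fraction of records in which it has a value. The sketch below uses a few hypothetical field names and records (not the bank's actual 146 fields), and the 0.5 cut-off for calling a field sparse is an assumption.

```python
def fill_rates(rows):
    """Return the fraction of non-empty values for every field seen in rows."""
    fields = {name for row in rows for name in row}
    return {
        name: sum(row.get(name) not in (None, "") for row in rows) / len(rows)
        for name in fields
    }

# Hypothetical loan records with a handful of the kinds of fields described above.
loans = [
    {"amount": 10000, "term": 36, "age": 31, "public_records": None, "bankruptcies": None},
    {"amount": 25000, "term": 60, "age": 45, "public_records": 1, "bankruptcies": None},
    {"amount": 8000, "term": 36, "age": 27, "public_records": None, "bankruptcies": None},
    {"amount": 15000, "term": 36, "age": 52, "public_records": None, "bankruptcies": 1},
]

rates = fill_rates(loans)
sparse_fields = sorted(name for name, rate in rates.items() if rate < 0.5)
print(sparse_fields)
```

Sparse fields are not necessarily useless: for risk factors, the absence of a value may itself carry information.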



Example

• The goal is to create a predictive model to identify loans that might be bad
loans.
• However, in the raw data, no field indicates whether the loan is good or
bad.
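
The slides do not say how the missing good/bad label is obtained. One possibility, assumed here purely for illustration, is to derive it from a repayment-status field. The field name `loan_status` and every status value below are hypothetical.

```python
# Assumption: each record carries a hypothetical `loan_status` field; the
# status values are illustrative and are not taken from the slides.
BAD_STATUSES = {"charged_off", "default", "late"}

def label_loan(record):
    """Return 1 for a bad loan, 0 for a good loan, None if the outcome is unknown."""
    status = record.get("loan_status")
    if status in BAD_STATUSES:
        return 1
    if status == "fully_paid":
        return 0
    return None  # e.g. still being repaid: no outcome to learn from yet

loans = [
    {"id": 1, "loan_status": "fully_paid"},
    {"id": 2, "loan_status": "charged_off"},
    {"id": 3, "loan_status": "current"},
]

# Keep only loans with a known outcome; these form the labelled training set.
labelled = [
    (loan["id"], label_loan(loan)) for loan in loans if label_loan(loan) is not None
]
print(labelled)
```

How "bad" is defined is a business decision, which is one reason the business-understanding step comes before modeling.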



Example

• A data scientist sits down with the customer to understand the customer's
business problem.
• The first goal is to articulate the overall business problem:
“Can I detect attributes about a person or a type of loan that can be
used to flag a risky loan that needs to be processed by my loan
underwriting team?”

Some Data Science Techniques

• Probability and Statistics
• Distribution
• Regression analysis
• Linear Regression
• Logistic Regression
• Descriptive statistics
• Inferential statistics
• Hypothesis testing
• Non-Parametric
• Parametric
• Neural Networks
• K-Means clustering
• Decision Trees
Data Size and Growth
