Data Science 1
MACHINE LEARNING
B. TECH
II YEAR – II SEM (Sec-A & B)
Academic Year 2022-23
Pre-requisite:
Database Management Systems, Data Structures
Course Objectives:
This course will enable students to:
• Know about the fundamental concepts and technologies of Data Science.
• Explore the various Data collection and storage methods.
• Understand the Data Analysis, statistics, and various machine learning algorithms.
• Investigate the visualization of data and apply coding techniques to secure the
data.
• Study the applications of data science, technologies for visualization, and the handling of
variables using Python.
Textbooks:
1. Cathy O’Neil, Rachel Schutt, Doing Data Science, Straight Talk from the Frontline. O’Reilly,
2013.
2. Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of Massive Datasets. v 2.1,
Cambridge University Press, 2014.
Reference Books:
1. Joel Grus, “Data Science from scratch”, O'Reilly, 2015.
2. Gupta, S.C. and Kapoor, V.K.: “Fundamentals of Mathematical Statistics”, Sultan
Chand & Sons, New Delhi, 11th Ed, 2002.
3. Hastie, Trevor, et al. “The Elements of Statistical Learning”, Springer, 2009.
4. Wes McKinney, “Python for Data Analysis”, O'Reilly Media, 2012.
Course Outcomes:
The student will be able to
• Identify the basic concepts of data science and identify the types of data.
• Analyse how to collect, manage, explore, and store data.
• Implement the basic measures of central tendency and classify data using SVM and
naive Bayes.
• Interpret the visualization of data and apply coding techniques to secure the
data.
• Analyse the various concepts of data science and handle simple
applications of data science using Python.
UNIT– I
Ø DATA SCIENCE
BASICS
Ø Intro to DS
…sounds cool!
What makes a good data scientist?
WHAT IS DATA SCIENCE?
…solving problems with data…
[Figure: scientific, social, or business problem → collect & understand data → clean & format data → use data to create solution]
WHAT IS MACHINE LEARNING?
…creating and using models that learn from data…
WHAT IS MACHINE LEARNING?
• Traditional CS: data + program → output
• Machine Learning: data + output → program, which is then applied to new data → output
WHAT IS MACHINE LEARNING?
…creating and using models that learn from data…
Examples:
• Identifying zip code from handwritten digits
• Detecting communities in social networks
• Predicting the traffic volume at rush hour
LEARNING FROM DATA
• Classification
LEARNING FROM DATA
• Clustering
WHAT IS MACHINE LEARNING?
[PDSH] p 332-342
[DSFS] p 141-142
ACTIVITY 1
…creating and using models that learn from data…
MACHINE LEARNING WORKFLOW
• training phase, test phase, evaluation phase
[Figure: data → model → output; the output is compared with the ground truth to give a performance measure]
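The three phases of the workflow can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are synthetic (two made-up classes drawn from normal distributions), the "model" is a single threshold, and the names `train_threshold`, `predict`, and `accuracy` are hypothetical helpers, not part of any library.

```python
import random

def train_threshold(samples):
    """Training phase: place the decision threshold midway between the class means."""
    class0 = [x for x, label in samples if label == 0]
    class1 = [x for x, label in samples if label == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, x):
    """The model: output class 1 when the measurement exceeds the threshold."""
    return 1 if x > threshold else 0

def accuracy(threshold, samples):
    """Evaluation phase: fraction of outputs that match the ground truth."""
    hits = sum(predict(threshold, x) == label for x, label in samples)
    return hits / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic (measurement, ground-truth label) pairs: an assumption for illustration.
data = [(rng.gauss(2.0, 0.5), 0) for _ in range(50)]
data += [(rng.gauss(6.0, 0.5), 1) for _ in range(50)]
rng.shuffle(data)

train, test = data[:70], data[70:]         # hold out 30% for the test phase
threshold = train_threshold(train)         # training phase
test_accuracy = accuracy(threshold, test)  # test + evaluation phase
print(f"threshold={threshold:.2f}, test accuracy={test_accuracy:.2f}")
```

The point of the split is that the performance measure is computed on data the model never saw during training.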
ACTIVITY 2
• Example: Census Data
SUMMARY & READING math &
statistics
INTRODUCTION TO DATA SCIENCE
UNIT– I
Ø Terminologies
Ø DS Process
Ø Data Scientist
Process
DR. G. ARUN SAMPAUL THOMAS
Associate Professor & HOD – Department of AI&ML
J.B. Institute of Engineering and Technology
Hyderabad, Telangana
Basic Terminologies
• Data
• It can be generated (for example, by simulation), collected, or retrieved.
Similarity Measures
Data Structures
Algorithms
• Data: raw facts with no meaning attached.
• Information: what is learned from facts.
• Knowledge: practical understanding of a subject.
• Understanding: the ability to absorb knowledge and learn to reason.
• Wisdom: the quality of having experience and good judgment; the ability to
think and foresee.
• Validity: ways to confirm truth.
• Cross-sectional data: data observed at a single point in time, with no time dimension.
• Temporal data: data ordered in time (time series).
• Spatial data: considers location, e.g. coordinate determination in touch phones.
• Temporal cum spatial data (GIS): considers change with the passage of time, for example
population density.
• Scales of Measurement
There are 4 scales of measurement:
• Nominal: classifies data into unordered categories, e.g. male/female.
• Ordinal: gives the order of data and can be numerical or non-numerical, e.g. time of
day (dawn, morning, noon, afternoon, evening, night).
• Interval: ordered values with meaningful differences but no true zero, e.g. temperature in °C.
• Ratio: interval data with a true zero, so ratios are meaningful, e.g. weight, height, number of children.
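A short sketch of why the scale matters: each scale supports a different summary. The values below are made up for illustration, and the choice of summary per scale (mode, median, mean, ratio) is the usual textbook convention rather than something stated on the slide.

```python
from statistics import mean, mode

# Nominal: only counting and the most frequent category make sense.
blood_types = ["A", "O", "B", "O", "AB", "O"]
most_common_type = mode(blood_types)

# Ordinal: values can be ordered, so a middle value (median) is meaningful.
sizes = ["small", "large", "medium", "small", "medium"]
rank = {"small": 0, "medium": 1, "large": 2}
ranked = sorted(sizes, key=rank.get)
middle_size = ranked[len(ranked) // 2]

# Interval: differences are meaningful, so the mean is fine,
# but 30 °C is not "1.5 times as hot" as 20 °C (no true zero).
temps_celsius = [20.0, 25.0, 30.0]
mean_temp = mean(temps_celsius)

# Ratio: a true zero exists, so ratios are meaningful.
weights_kg = [50.0, 100.0]
weight_ratio = weights_kg[1] / weights_kg[0]

print(most_common_type, middle_size, mean_temp, weight_ratio)
```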
Why DS Now?
• We have massive amounts of data about many aspects of our lives and,
simultaneously, an abundance of inexpensive computing power. What people might not know is that the “datafication” of our
offline behavior has started as well.
• On the Internet, this means Amazon recommendation systems, friend
recommendations on Facebook, film and music recommendations, and
so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and
assessments coming out of places like Knewton and Khan Academy.
• In government, this means policies based on data.
Datafication
• In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they
discuss the concept of datafication.
• They define datafication as a process of “taking all aspects of
life and turning them into data.”
• They follow up their definition in the article with a line that speaks volumes
about their perspective:
Once we datafy things, we can transform their purpose and
turn the information into new forms of value.
Datafication
Examples:
• We quantify friendships with “likes”.
• Google’s augmented-reality glasses datafy the gaze.
• Twitter datafies stray thoughts.
• LinkedIn datafies professional networks.
• When we “like” someone or something online, we are intending to be
datafied.
• When we merely browse the Web, we are unintentionally datafied through cookies.
• When we walk around in a store, or even on the street, we are being
datafied via sensors, cameras, or Google glasses.
• Datafication ranges from willingly taking part in a social media experiment
to all-out surveillance and stalking.
UNIT– I
Ø Data Science Toolkits
Ø DS Techniques
Tableau http://www.tableausoftware.com/new-features/r-integration
Qlik http://qliksolutions.ru/qlikview/add-ons/r-connector-eng/
Oracle R http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html
JMP http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html
Using R with other software
https://rforanalytics.wordpress.com/useful-links-for-r/using-r-from-other-software/
SAS/IML http://www.sas.com/technologies/analytics/statistics/iml/index.html
Teradata http://developer.teradata.com/applications/articles/in-database-analytics-with-teradata-r
Pentaho http://bigdatatechworld.blogspot.in/2013/10/integration-of-rweka-with-pentaho-data.html
IBM SPSS
https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=ibm-analytics&S_PKG=ov18855&S_TACT=M161003W&dynform=127&lang=en_US
TIBCO TERR
http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr
Some Advantages of R
open source
free
large number of algorithms and packages esp for statistics
flexible
very good for data visualization
superb community
rapidly growing
can be used with other software
Some Disadvantages of R
in memory (RAM) usage
steep learning curve
some IT departments frown on open source
verbose documentation
tech support
evolving ecosystem for corporates
Solutions for Disadvantages of R
in memory (RAM) usage: specialized packages, in-database computing
steep learning curve: TRAINING !!!
some IT departments frown on open source: TRAINING and education!
verbose documentation: CRAN View, R Documentation
tech support: expanding pool of resources
evolving ecosystem for corporates: getting better with MS et al
http://www.sas.com/en_in/software/university-edition/download-software.html
Python
What is Python
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would
be possible in languages such as C++ or Java.
https://www.python.org/about/success/
Object Oriented Programming (OOPS)
A computer program consists of constructs such as variables, expressions, functions or modules.
name = "ajay"                 # variable holding a string
print(name)                   # function call
# import printer              # module import ('printer' is a placeholder module name)
print("Hi I am %s" % name)    # expression using string formatting
Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which are data structures that contain
data, in the form of fields, often known as attributes; and code, in the form of procedures, often known as methods.
Dynamic programming language is a term used in computer science to describe a class of high-level programming languages which, at
runtime, execute many common programming behaviors that static programming languages perform during compilation.
The term "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g.,
assembly language or machine code).
Java
http://introcs.cs.princeton.edu/java/11cheatsheet/
Linux
http://www.linuxstall.com/linux-command-line-tips-that-every-linux-user-should-know/
SQL
http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
Hive QL
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Python
http://www.astro.up.pt/~sousasag/Python_For_Astronomers/Python_qr.pdf
Python
https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf
R
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Pig
HDFS
https://github.com/michiard/CLOUDS-LAB/blob/master/C-S.md
Git
http://overapi.com/static/cs/git-cheat-sheet.pdf
All together now
PIG http://www.slideshare.net/Mathias-Herberts/hadoop-pig-syntax-card
HDFS https://github.com/michiard/CLOUDS-LAB/blob/master/C-S.md
R http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Python https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf
Python http://www.astro.up.pt/~sousasag/Python_For_Astronomers/Python_qr.pdf
Java http://introcs.cs.princeton.edu/java/11cheatsheet/
Linux http://www.linuxstall.com/linux-command-line-tips-that-every-linux-user-should-know/
SQL http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
Git http://overapi.com/static/cs/git-cheat-sheet.pdf
R
R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical
techniques, and is highly extensible.
https://www.r-project.org/about.html
Python
http://python-history.blogspot.in/ and https://www.python.org/
SAS
http://www.sas.com/en_in/home.html
Data Science Techniques
- Machine Learning
- Regression
- Logistic Regression
- K Means Clustering
- Association Analysis
- Decision Trees
- Text Mining
- Social Network Analysis
- Time Series Forecasting
- LTV and RFM Analysis
- Pareto Analysis
What is an algorithm
Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning
system could be trained on email messages to learn to distinguish between spam and non-spam messages.
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a
set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a
desired output value (also called the supervisory signal).
In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a
training set of correctly identified observations is available.
In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the
examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes
unsupervised learning from supervised learning.
The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories
based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional
vector space).
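Clustering by distance can be illustrated with a minimal k-means sketch. The six 2-D points, the choice of two clusters, and the use of two data points as starting centroids are assumptions made for this example; `kmeans` is a hypothetical helper, not a library function.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled instances, treated as vectors in a 2-D space (made-up values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping comes only from the distances between instances.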
CRAN VIEW Machine Learning
Machine Learning in Python
Classification
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a
new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership
is known.
The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features,
etc.
These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type),
ordinal (e.g. "large", "medium" or "small"),
integer-valued (e.g. the number of occurrences of a particular word in an email) or
real-valued (e.g. a measurement of blood pressure).
Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups
(e.g. less than 5, between 5 and 10, or greater than 10).
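Discretization into the three groups mentioned above can be sketched as follows. The function name `discretize` is hypothetical, and treating both 5 and 10 as part of the middle group is an assumption, since the text does not say which group the boundary values belong to.

```python
def discretize(value):
    """Map a real-valued or integer-valued measurement to one of three groups."""
    if value < 5:
        return "less than 5"
    if value <= 10:  # assumption: boundaries 5 and 10 fall in the middle group
        return "between 5 and 10"
    return "greater than 10"

measurements = [3, 5, 7.2, 10, 10.5]
groups = [discretize(v) for v in measurements]
print(groups)
```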
Regression
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the relationship between
a dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable')
changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent
variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent
variables.
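For one dependent and one independent variable, the estimate of the conditional expectation can be sketched with ordinary least squares. The five data points are invented for illustration, and `fit_line` is a hypothetical helper that assumes a straight-line relationship.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]             # independent variable (made-up values)
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # dependent variable (made-up values)
slope, intercept = fit_line(xs, ys)

# Estimated average value of y when x is held at 6.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

The slope answers the question in the text: how much the typical value of the dependent variable changes when the independent variable is increased by one unit.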
kNN
Support Vector Machines
http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf
Association Rules
http://en.wikipedia.org/wiki/Association_rule_learning
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between
products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.
For example, a rule such as {onions, potatoes} ⇒ {hamburger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes
together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing
activities such as promotional pricing or product placements.
In addition to the above example from market basket analysis, association rules are employed today in many application areas
including Web usage mining, intrusion detection, continuous production, and bioinformatics. As opposed to sequence mining,
association rule learning typically does not consider the order of items either within a transaction or across transactions.
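The strength of a rule is commonly measured by its support and confidence. The sketch below computes both for the onions-and-potatoes example; the five transactions are invented for illustration, and `support` and `confidence` are hypothetical helper names.

```python
# Each transaction is a set, because the order of items is not considered.
transactions = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "hamburger meat", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "hamburger meat"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

antecedent = {"onions", "potatoes"}
consequent = {"hamburger meat"}
rule_support = support(antecedent | consequent)
rule_confidence = confidence(antecedent, consequent)
print(rule_support, rule_confidence)
```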
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent,
one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html
http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
Gradient Descent
https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
A standard approach to solving this type of problem is to define an error function (also
called a cost function) that measures how “good” a given line is.
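A minimal sketch of gradient descent applied to line fitting, assuming mean squared error as the error function. The data points (generated from y = 2x + 1), the learning rate, and the number of steps are choices made for this example only.

```python
def mse(m, b, points):
    """Error function: mean squared vertical distance between the line and the points."""
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

def gradient_step(m, b, points, learning_rate):
    """Take one step proportional to the negative of the gradient of the error."""
    n = len(points)
    grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in points) / n
    grad_b = sum(-2 * (y - (m * x + b)) for x, y in points) / n
    return m - learning_rate * grad_m, b - learning_rate * grad_b

# Made-up points that lie exactly on the line y = 2x + 1.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]

m, b = 0.0, 0.0  # arbitrary starting line
for _ in range(5000):
    m, b = gradient_step(m, b, points, learning_rate=0.01)

final_error = mse(m, b, points)
print(m, b, final_error)
```

A learning rate that is too large makes the steps overshoot and diverge; one that is too small makes convergence slow.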
http://select.cs.cmu.edu/class/10701-F09/recitations/recitation4_decision_tree.pdf
Decision Trees
http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
Random Forest
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of
the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the
classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This
sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out
of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
In the original paper on random forests, it was shown that the forest error rate depends on two things:
● The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
● The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of
the individual trees decreases the forest error rate.
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
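The bootstrap sampling, random feature selection, and voting described above can be sketched in plain Python. This is a deliberately simplified illustration, not Breiman's full procedure: each "tree" here is a single-split stump rather than a tree grown to the largest extent possible, the data are made up, and all function names are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def bootstrap(rows, rng):
    """Step 1: draw N cases at random, with replacement, from the N originals."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows, n_features, m, rng):
    """Step 2 at a single node: best threshold split over m randomly chosen features."""
    best = None
    for f in rng.sample(range(n_features), m):
        for threshold in sorted({features[f] for features, _ in rows}):
            left = [label for features, label in rows if features[f] <= threshold]
            right = [label for features, label in rows if features[f] > threshold]
            if not left or not right:
                continue
            left_label, right_label = majority(left), majority(right)
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # degenerate sample with no possible split
        only = majority([label for _, label in rows])
        return (0, float("inf"), only, only)
    return best[1:]

def stump_predict(stump, features):
    f, threshold, left_label, right_label = stump
    return left_label if features[f] <= threshold else right_label

def forest_predict(forest, features):
    """Each stump votes; the forest returns the class with the most votes."""
    return majority([stump_predict(stump, features) for stump in forest])

# Made-up training set: (feature vector, class label), M = 2 input variables.
rows = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"), ((2.5, 1.5), "a"),
        ((7.0, 8.0), "b"), ((8.0, 7.5), "b"), ((7.5, 9.0), "b"), ((9.0, 8.0), "b")]

rng = random.Random(42)
forest = [train_stump(bootstrap(rows, rng), n_features=2, m=1, rng=rng)
          for _ in range(25)]
prediction_a = forest_predict(forest, (0.5, 0.5))
prediction_b = forest_predict(forest, (8.5, 8.2))
print(prediction_a, prediction_b)
```

Because every stump sees a different bootstrap sample and a different random feature, the stumps are less correlated with one another, which is the property the error-rate discussion above relies on.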
Bagging
http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
Boosting
http://xgboost.readthedocs.io/en/latest/model.html#
And http://dmlc.ml/rstats/2016/03/10/xgboost.html
Top 10 Data Analytics Tools 2020
(Currently in use with Various Organizations)
https://www.youtube.com/watch?v=P-bKqfKhqR8
Top 10 Data Science Tools For 2022
Data Science Tools and Libraries
https://www.youtube.com/watch?v=zVBcmTkJqpo
INTRODUCTION TO DATA SCIENCE
UNIT– I
Ø Types of Data
Ø DS Applications
& Use Cases
What To Do With These Data?
Statistical and Critical Thinking
Analyzing Data: Potential Pitfalls
• Misleading Conclusions
When forming a conclusion based on a statistical analysis, we should make statements that are clear
even to those who have no understanding of statistics and its terminology.
• Sample Data Reported Instead of Measured
When collecting data from people, it is better to take measurements yourself instead of asking
subjects to report results.
• Loaded Questions
If survey questions are not worded carefully, the results of a study can be misleading.
• Order of Questions
Sometimes survey questions are unintentionally loaded by the order of the items being considered.
• Nonresponse
A nonresponse occurs when someone either refuses to respond or is unavailable.
• Percentages
Some studies cite misleading percentages. Note that 100% of some quantity is all of it, but if there
are references made to percentages that exceed 100%, such references are often not justified.
Types of Data, Key Concept
A major use of statistics is to collect and use sample data to make conclusions
about populations.
• Parameter
a numerical measurement describing some
characteristic of a population
• Statistic
a numerical measurement describing some
characteristic of a sample
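The difference between a parameter and a statistic can be shown with a small simulation. The population below (10,000 made-up heights in cm) and the sample size of 100 are assumptions for illustration only.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable

# A made-up population of 10,000 heights in cm.
population = [rng.randint(150, 190) for _ in range(10_000)]

# A simple random sample of 100 individuals, drawn without replacement.
sample = rng.sample(population, 100)

parameter = mean(population)  # describes the whole population
statistic = mean(sample)      # describes only the sample

print(f"parameter (population mean) = {parameter:.2f}")
print(f"statistic (sample mean)     = {statistic:.2f}")
```

In practice the parameter is usually unknown, and the statistic computed from the sample is used to draw a conclusion about it.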
Types of Data
Types of Data, Quantitative Data
Data
• Qualitative: categorical
• Quantitative: numerical, can be ranked
  – Discrete: countable (5, 29, 8000, etc.)
  – Continuous: can be decimals (2.59, 312.1, etc.)
Types of Data, Levels of Measurement:
Another way of classifying data: 4 levels of measurement: nominal, ordinal, interval, and ratio.
Example 2:
Example 3:
Parameter or Statistic?
Statistic
Parameter
Example 4:
Discrete or Continuous?
Continuous
Discrete
Example 5:
Determine the measurement level.
Nominal
Ratio
Ordinal
Interval
Example 6:
Determine the measurement level & what’s wrong with the conclusion?
Structured vs Unstructured
https://www.youtube.com/watch?v=WBU7sW1jy2o
Big Data & Data Science
Data Science Applications
Data Science: Case Study
Cancer Research
• Cancer is an incredibly complex disease; a single tumor can have
more than 100 billion cells, and each cell can acquire mutations
individually. The disease is always changing, evolving, and adapting.
• Employ the power of big data analytics and high-performance
computing.
• Leverage sophisticated pattern recognition and machine learning algorithms to
identify patterns that are potentially linked to cancer.
• This requires a huge amount of data processing and recognition.
Data Science: Case Study
Health Care
http://med.stanford.edu/news/all-news/2016/08/stanford-medicine-google-team-up-to-harness-power-of-data-science.html
Data Science: Case Study
Elections
• The Obama campaigns in 2008 and 2012 are credited with the
successful use of social media and data mining.
• Micro-targeting in 2012
– http://www.theatlantic.com/politics/archive/2012/04/the-creepiness-factor-how-obama-and-romney-are-getting-to-know-you/255499/
– http://www.mediabizbloggers.com/group-m/How-Data-and-Micro-Targeting-Won-the-2012-Election-for-Obama---Antony-Young-Mindshare-North-America.html
• Micro-profiles were built from multiple sources accessed by apps, with real-time
updating of data based on door-to-door visits, focused media
buys, and highly targeted e-mails and Facebook messages.
• 1 million people installed the Obama Facebook app that gave
access to info on “friends”.
Data Science: Case Study
Internet of Things (IoT)
• The Internet of Things (IoT) is rapidly growing. It is predicted that more than 25 billion devices
will be connected by 2020.
• The IoT will soon produce a massive volume and variety of data at
unprecedented velocity. If "Big Data" is the product of the IoT, "Data Science" is its
soul.
Data Science: Case Study
Customer Analytics
Case Study - How Recommender Systems Work
(Netflix/Amazon)
https://www.youtube.com/watch?v=n3RKsY2H-NE