Data Science 1

The document provides an introduction to a course on data science, outlining 5 units that will cover topics such as data collection and management, data analysis using statistics and machine learning algorithms, data visualization, and case studies applying data science. It lists learning objectives, textbooks, and course outcomes related to understanding and applying fundamental concepts of data science using Python.

Uploaded by

Akhil Reddy

DEPARTMENT OF ARTIFICIAL INTELLIGENCE &

MACHINE LEARNING

INTRODUCTION TO DATA SCIENCE

LECTURE NOTES – UNIT 1

B. TECH
II YEAR – II SEM (Sec-A & B)
Academic Year 2022-23

Prepared & compiled by

DR.G. ARUN SAMPAUL THOMAS,

ASSOCIATE PROFESSOR & HOD, DEPARTMENT OF AI&ML
J.B.I.E.T
Bhaskar Nagar, Yenkapally(V), Moinabad(M),

Ranga Reddy(D), Hyderabad – 500 075, Telangana, India.

J. B. Institute of Engineering and Technology (UGC Autonomous)
AY 2020-21 onwards | B. Tech: AI & ML | II Year – II Sem
Course: INTRODUCTION TO DATA SCIENCE | Course Code: J22D3
Credits: 2 | L T P D: 2 0 0 0

Pre-requisite:
Database Management Systems, Data Structures

Course Objectives:
This course will enable students to:
• Know about the fundamental concepts and technologies of Data Science.
• Explore the various Data collection and storage methods.
• Understand the Data Analysis, statistics, and various machine learning algorithms.
• Investigate the visualization of data and apply coding techniques to secure the data.
• Study the applications of Data Science, technologies for visualization, and handling of
variables using Python.

UNIT-I - Introduction to Data Science

Introduction to core concepts and technologies: Introduction, Terminology, Data science
Process, data science toolkit, Types of data, Example applications

UNIT-II - Data collection and management:

Introduction, Sources of data, Data collection and APIs, Exploring and fixing data. Data storage
and management, using multiple data sources.

UNIT-III - Data analysis:

Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and
distributions, Variance, Distribution properties and arithmetic, Samples/CLT. Basic machine
learning algorithms, Linear regression, SVM, Naive Bayes.

UNIT-IV - Data visualization:

Introduction, Types of data visualization, Data for visualization:
Data types, Data encodings, Retinal variables, mapping variables to encodings, Visual
encodings.

UNIT-V - Practices and Case Studies in Data Science:

Applications of Data Science, Technologies for visualization, Recent trends in various data
collection and analysis techniques, various visualization techniques, application development
methods used in data science. Demonstrate some case studies like Marketing, Finance, HR,
Manufacturing, Healthcare, etc.

Textbooks:
1. Cathy O’Neil, Rachel Schutt, Doing Data Science: Straight Talk from the Frontline. O’Reilly, 2013.
2. Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of Massive Datasets. v 2.1, Cambridge University Press, 2014.
Reference Books:
1. Joel Grus, “Data Science from Scratch”, O'Reilly, 2015.
2. Gupta, S.C. and Kapoor, V.K.: “Fundamentals of Mathematical Statistics”, Sultan Chand & Sons, New Delhi, 11th Ed, 2002.
3. Hastie, Trevor, et al. “The Elements of Statistical Learning”, Springer, 2009.
4. Wes McKinney, “Python for Data Analysis”, O'Reilly Media, 2012.

Course Outcomes:
The student will be able to
• Identify the basic concepts of data science and the types of data.
• Analyse how to collect, manage, explore, and store data.
• Implement the basic measures of central tendency and classify data using SVM and
naive Bayes.
• Interpret the visualization of data and apply coding techniques to secure the
data.
• Analyse the various concepts of data science and handle simple
applications of data science using Python.

WEBSITE REFERENCES FOR SELF LEARNING

1. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
2. https://www.rstudio.com/online-learning/
INTRODUCTION TO DATA SCIENCE

UNIT– I
• DATA SCIENCE BASICS
• Intro to DS

WHAT IS DATA SCIENCE?
…solving problems with data…
scientific, social, or business problem → collect & understand data → clean & format data → use data to create solution

…sounds cool!
What makes a good data scientist?

…which step is most challenging?

use data to create solution → data analysis or machine learning (or both)
WHAT IS DATA ANALYSIS?
…using data to discover useful information…

• data: anything you can measure or record
• statistics: summarize (and visualize) main characteristics of the data
• algorithms: apply algorithms to find patterns in the data

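As a small illustration of the "statistics" step, the sketch below summarizes the main characteristics of a data set using only Python's standard library. The temperature values are made up for this example and are not part of the course material.

```python
import statistics

# Hypothetical measurements (made up for illustration): daily temperatures in °C.
temperatures = [21.5, 23.0, 19.8, 22.1, 24.3, 20.7, 22.9]

# Summarize the main characteristics of the data.
summary = {
    "count": len(temperatures),
    "mean": statistics.mean(temperatures),
    "median": statistics.median(temperatures),
    "stdev": statistics.stdev(temperatures),  # sample standard deviation
    "min": min(temperatures),
    "max": max(temperatures),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Visualizing the same values (for example as a histogram) would be the natural next step, but that needs a plotting library and is left out of this sketch.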
WHAT IS MACHINE LEARNING?
…creating and using models that learn from data…

• data: anything you can measure or record

• model: specification of a (mathematical) relationship between different variables

• evaluation: how well does the model work?

WHAT IS MACHINE LEARNING?
• Traditional CS: data + program → output

• Machine Learning: data + output → program; then data + program → output

WHAT IS MACHINE LEARNING?
…creating and using models that learn from data…

Examples
• Identifying zip code from handwritten digits
• Detecting communities in social networks
• Predicting the traffic volume at rush hour
• Detecting fraudulent credit card transactions
• Determining the location of distribution centers based on customers’ residence

[DSFS] p 3-13
LEARNING FROM DATA
• Regression
• Classification
• Clustering

WHAT IS MACHINE LEARNING?

…creating and using models that learn from data…

• come up with predictions

• extract knowledge/insights

→ unsupervised learning/data mining

[PDSH] p 332-342
[DSFS] p 141-142
ACTIVITY 1
…creating and using models that learn from data…

Categorize these Examples

• Identifying zip code from handwritten digits
• Detecting communities in social networks
• Predicting the traffic volume at rush hour
• Detecting fraudulent credit card transactions
• Determining the location of distribution centers based on customers’ residence

MACHINE LEARNING WORKFLOW
• training phase, test phase, evaluation phase

[Figure: data → model → output; the output is compared with the ground truth using a performance measure]

→ let’s have a closer look at the data we are using

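The three phases can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation: the data are synthetic, the "model" is a single threshold, and accuracy is just one possible performance measure.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical 1-D data: class 0 values cluster near 2.0, class 1 values near 6.0.
data = [(random.gauss(2.0, 1.0), 0) for _ in range(50)]
data += [(random.gauss(6.0, 1.0), 1) for _ in range(50)]
random.shuffle(data)

# Split into training data and test data.
train, test = data[:70], data[70:]

# Training phase: the "model" is one threshold halfway between the class means.
class0 = [x for x, label in train if label == 0]
class1 = [x for x, label in train if label == 1]
threshold = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def predict(x):
    """Return the predicted class for a single value."""
    return 1 if x > threshold else 0


# Test phase: the model produces outputs for data it has not seen.
outputs = [predict(x) for x, _ in test]

# Evaluation phase: compare the outputs with the ground truth.
accuracy = sum(out == label for out, (_, label) in zip(outputs, test)) / len(test)
print(f"threshold={threshold:.2f}, test accuracy={accuracy:.2f}")
```

Keeping the test data out of the training phase is the point of the split: the performance measure then describes how the model behaves on data it has not seen.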
ACTIVITY 2
• Example: Census Data

• training data and test data

DATA
• Notation:
• D: all observed data
• X: all features
• y: observations
• ☐_TE: test
• ☐_TR: training
• ŷ: predictions

Helper Notation:
n: number of data points
d: number of features
m: number of training points
☐_1,…,i,…,n: indices for data points
☐_1,…,j,…,d: indices for features

• What data structure to use?

• set, list, or array?

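One possible answer to the data-structure question, sketched with plain Python lists. The records below are invented census-style rows, and the names `X_TR`, `X_TE`, `y_TR`, `y_TE` are chosen here only to mirror the notation above.

```python
# Hypothetical census-style records (made up): each row is [age, hours_per_week].
X = [
    [39, 40],
    [50, 13],
    [38, 40],
    [53, 40],
    [28, 45],
    [37, 60],
]
# y holds one observation per row (here: 1 if income is above some threshold, else 0).
y = [0, 0, 0, 1, 1, 1]

n = len(X)     # number of data points
d = len(X[0])  # number of features
m = 4          # number of training points (an arbitrary choice for this sketch)

# Training part (TR) and test part (TE) of the features and the observations.
X_TR, y_TR = X[:m], y[:m]
X_TE, y_TE = X[m:], y[m:]

# Index i selects a data point, index j selects a feature.
i, j = 2, 1
print(f"n={n}, d={d}, m={m}, X[{i}][{j}]={X[i][j]}")
```

A list of lists keeps the rows in order and allows indexing by `i` and `j`; a set would lose both the order and duplicate rows. In practice a numeric array type is the usual choice for larger data.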
SUMMARY & READING

• Data Science is about data, models, and evaluation
• Data Science can solve a wide variety of problems, once we have the right data and model!

[Figure: Venn diagram of hacking skills, math & statistics, and expertise]

INTRODUCTION TO DATA SCIENCE

UNIT– I
• Terminologies
• DS Process
• Data Scientist Process
Basic Terminologies

• Data
• It can be generated, collected, or retrieved.
• Simulation
• Similarity Measures
• Data Structures
• Algorithms
• Data: raw facts that carry no meaning on their own.
• Information: what is learned from facts.
• Knowledge: practical understanding of a subject.
• Understanding: the ability to absorb knowledge and learn to reason.
• Wisdom: the quality of having experience and good judgment; the ability to
think and foresee.
• Validity: ways to confirm truth.
• Cross-sectional data: observations taken without a time dimension.
• Temporal data: observations indexed by time (time series).
• Spatial data: data that considers location, e.g. coordinate determination on touch-screen phones.
• Temporal cum spatial data (GIS): data that considers location and change over time, for example
population density.

• Scales of Measurement
There are 4 scales of measurement:
• Nominal: classifies data into categories, e.g. male/female.
• Ordinal: gives the order of the data and can be numerical or non-numerical, e.g. time of
day (dawn, morning, noon, afternoon, evening, night).
• Interval: ordered values with meaningful differences but no natural zero, e.g. temperature.
• Ratio: ordered values with meaningful differences and a natural zero, so ratios are meaningful, e.g. weight, height, number of children.
Why DS Now?

• We have massive amounts of data about many aspects of our lives. What people might
not know is that the “datafication” of our offline behavior has started as well.
• On the Internet, this means Amazon recommendation systems, friend recommendations
on Facebook, film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and
assessments coming out of places like Knewton and Khan Academy.
• In government, this means policies based on data.
Datafication

• In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they
discuss the concept of datafication.
They define datafication as a process of “taking all aspects of
life and turning them into data.”

• They follow up their definition in the article with a line that speaks volumes
about their perspective:
“Once we datafy things, we can transform their purpose and
turn the information into new forms of value.”
Datafication
Examples:
• How we quantify friendships with “likes”.
• Google’s augmented-reality glasses datafy the gaze.
• Twitter datafies stray thoughts.
• LinkedIn datafies professional networks.
• When we “like” someone or something online, we are intending to be
datafied.
• When we browse the Web, we are unintentionally datafied through cookies.
• When we walk around in a store, or even on the street, we are being
datafied via sensors, cameras, or Google glasses.
• The spectrum ranges from taking part in a social media experiment
to all-out surveillance and stalking.

But it’s all datafication.

Data Science Process
A Data Scientist’s Role in This Process
[Figure: the growth in data scientist job postings on Indeed, from December 2016 to December 2018]
OK, So What Is a Data Scientist, Really?
Perhaps the most concrete approach is to define data science by its usage.
• In Academia
• An academic data scientist is a scientist, trained in anything from social science to
biology, who works with large amounts of data, and must grapple with
computational problems posed by the structure, size, messiness, and the
complexity and nature of the data, while simultaneously solving a real-world
problem.
• In Industry
More generally, a data scientist is someone who knows
• how to design experiments,
• how to handle the process of collecting, cleaning, and munging data,
• the skills that are also necessary for understanding biases in the data and for
debugging logging output from code,
• exploratory data analysis, which combines visualization and data sense,
• how to find patterns and build models and algorithms,
• how to use analyses for decision making.
What Is a Data Scientist
• Data engineers are the data professionals who prepare the “big data” infrastructure to be analyzed by data scientists.
• A data analyst is someone who merely curates meaningful insights from data.
• A data scientist is a professional with the capabilities to gather large amounts of data to analyze and synthesize the information into actionable plans for companies and other organizations.
INTRODUCTION TO DATA SCIENCE

UNIT– I
• Data Science Toolkits
• DS Techniques

Data Science Tools
- R
- Python
- Tableau
- Spark with ML
- Hadoop (Pig and Hive)
- SAS
- SQL
Data Science with R
A popular language in Data Science
What Is R
https://www.r-project.org/about.html
R is an integrated suite of software facilities for data manipulation, calculation
and graphical display. It includes
● an effective data handling and storage facility,
● a suite of operators for calculations on arrays, in particular matrices,
● a large, coherent, integrated collection of intermediate tools for data
analysis,
● graphical facilities for data analysis and display either on-screen
or on hardcopy, and
● a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and
output facilities.
Install R
https://cran.r-project.org/bin/windows/base/
Install RStudio
https://www.rstudio.com/products/rstudio/download/
Statistical Software Landscape
- SAS
- Matlab
- Python (Pandas)
- JMP
- IBM SPSS
- EViews
- R
- Julia
- Clojure
- Octave
Using R with other software
https://rforanalytics.wordpress.com/useful-links-for-r/using-r-from-other-software/

Tableau http://www.tableausoftware.com/new-features/r-integration
Qlik http://qliksolutions.ru/qlikview/add-ons/r-connector-eng/
Oracle R http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html
Rapid Miner https://rapid-i.com/content/view/202/206/lang,en/#r
JMP http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html
SAS/IML http://www.sas.com/technologies/analytics/statistics/iml/index.html
Teradata http://developer.teradata.com/applications/articles/in-database-analytics-with-teradata-r
Pentaho http://bigdatatechworld.blogspot.in/2013/10/integration-of-rweka-with-pentaho-data.html
IBM SPSS https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=ibm-analytics&S_PKG=ov18855&S_TACT=M161003W&dynform=127&lang=en_US
TIBCO TERR http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr
Some Advantages of R
open source
free
large number of algorithms and packages esp for statistics
flexible
very good for data visualization
superb community
rapidly growing
can be used with other software
Some Disadvantages of R
in memory (RAM) usage
steep learning curve
some IT departments frown on open source
verbose documentation
tech support
evolving ecosystem for corporates
Solutions for Disadvantages of R
- in memory (RAM) usage: specialized packages, in-database computing
- steep learning curve: TRAINING !!!
- some IT departments frown on open source: TRAINING and education!
- verbose documentation: CRAN View, R Documentation
- tech support: expanding pool of resources
- evolving ecosystem for corporates: getting better with MS et al
http://www.sas.com/en_in/software/university-edition/download-software.html
Python
What is Python
Python is a widely used general-purpose, high-level programming language.

Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would
be possible in languages such as C++ or Java.

Python is used widely

https://www.python.org/about/success/
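Object Oriented Programming (OOPS)
A computer program consists of elements such as variables, expressions, functions, or modules.

name = "ajay"                # a variable holding a string

print(name)                  # a function call

# import printer             # a module import; "printer" is a placeholder name, not a real module

print("Hi I am %s" % name)   # an expression that formats the variable into a string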

Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which are data structures that contain
data, in the form of fields, often known as attributes; and code, in the form of procedures, often known as methods.

Dynamic programming language is a term used in computer science to describe a class of high-level programming languages which, at
runtime, execute many common programming behaviors that static programming languages perform during compilation.

The term "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower-level language (e.g.,
assembly language or machine code).
Java
http://introcs.cs.princeton.edu/java/11cheatsheet/
Linux
http://www.linuxstall.com/linux-command-line-tips-that-every-linux-user-should-know/
SQL
http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
Hive QL
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Python
http://www.astro.up.pt/~sousasag/Python_For_Astronomers/Python_qr.pdf
Python
https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf
R
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Pig
http://www.slideshare.net/Mathias-Herberts/hadoop-pig-syntax-card
HDFS
https://github.com/michiard/CLOUDS-LAB/blob/master/C-S.md
Git
http://overapi.com/static/cs/git-cheat-sheet.pdf
R
R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical
techniques, and is highly extensible.

https://www.r-project.org/about.html
Python
http://python-history.blogspot.in/ and https://www.python.org/
SAS
http://www.sas.com/en_in/home.html
Data Science Techniques
- Machine Learning
- Regression
- Logistic Regression
- K Means Clustering
- Association Analysis
- Decision Trees
- Text Mining
- Social Network Analysis
- Time Series Forecasting
- LTV and RFM Analysis
- Pareto Analysis
What is an algorithm

● a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.

● a self-contained step-by-step set of operations to be performed.

● a procedure or formula for solving a problem, based on conducting a sequence of specified actions.

● a procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation; broadly: a step-by-step procedure for solving a problem or accomplishing some end, especially by a computer.
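
The greatest-common-divisor example mentioned in the last definition can be written out with Euclid's algorithm, shown here as a short sketch. It finishes in a finite number of steps and repeats one operation (taking a remainder).

```python
def gcd(a, b):
    """Euclid's algorithm for the greatest common divisor of two non-negative integers."""
    while b != 0:
        # Replace (a, b) with (b, a mod b); the gcd stays the same and b shrinks.
        a, b = b, a % b
    return a


print(gcd(48, 18))  # 48 -> 18 -> 12 -> 6 -> 0, so the result is 6
```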
Machine Learning

Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning
system could be trained on email messages to learn to distinguish between spam and non-spam messages.

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a
set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a
desired output value (also called the supervisory signal).

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a
training set of correctly identified observations is available.

In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the
examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes
unsupervised learning from supervised learning.

The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories
based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional
vector space).
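
To make the clustering idea concrete, here is a minimal k-means sketch on one-dimensional data. K-means is only one of many clustering procedures; the points and starting centers are made up, and distance is plain absolute difference.

```python
def k_means_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given initial centers (plain k-means)."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Unlabeled data: no category is given, only the values themselves.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(points, centers=[0.0, 10.0])
print(centers)
print(clusters)
```

No labels are used anywhere: the grouping comes only from the similarity measure, which is what separates this from the supervised case.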
CRAN VIEW Machine Learning
Machine Learning in Python
Classification
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a
new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership
is known.
The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features,
etc.
These properties may variously be
• categorical (e.g. "A", "B", "AB" or "O", for blood type),
• ordinal (e.g. "large", "medium" or "small"),
• integer-valued (e.g. the number of occurrences of a particular word in an email), or
• real-valued (e.g. a measurement of blood pressure).

Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups
(e.g. less than 5, between 5 and 10, or greater than 10).
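
The discretization described above can be sketched as a small function. The group boundaries follow the example in the text; treating exactly 5 and exactly 10 as "between 5 and 10" is an assumption of this sketch, since the text does not say where the boundary values belong.

```python
def discretize(value):
    """Map a real-valued or integer-valued measurement to one of three groups."""
    if value < 5:
        return "less than 5"
    if value <= 10:
        return "between 5 and 10"
    return "greater than 10"


measurements = [3.2, 5.0, 7.5, 10.0, 12.1]  # made-up values
groups = [discretize(v) for v in measurements]
print(groups)
```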
Regression

Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the relationship between
a dependent variable and one or more independent variables.

More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable')
changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent
variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent
variables.
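
As a sketch of the most common case, the code below fits a straight line by ordinary least squares with one independent variable. The fitted line estimates the average value of the dependent variable for a given value of the independent variable. The data (hours studied, exam score) are invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score = {slope:.2f} * hours + {intercept:.2f}")

# Estimated average score for someone who studies 6 hours.
print(f"{slope * 6 + intercept:.1f}")
```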
kNN
Support Vector Machines

http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf
Association Rules

http://en.wikipedia.org/wiki/Association_rule_learning
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between
products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.
For example, a rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes
together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing
activities such as promotional pricing or product placements.
In addition to the above example from market basket analysis, association rules are employed today in many application areas
including Web usage mining, intrusion detection, continuous production, and bioinformatics. As opposed to sequence mining,
association rule learning typically does not consider the order of items either within a transaction or across transactions.

Concepts: Support, Confidence, Lift
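
The three concepts can be computed by hand for the onions-and-potatoes rule. The sketch below uses the standard definitions (support = share of transactions containing the itemset, confidence = support of the whole rule divided by support of the antecedent, lift = confidence divided by support of the consequent) on five made-up transactions; it is not a replacement for an Apriori implementation.

```python
# Hypothetical point-of-sale transactions (made up for illustration).
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burger"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


# Rule: {onions, potatoes} -> {burger}
antecedent = {"onions", "potatoes"}
consequent = {"burger"}

rule_support = support(antecedent | consequent)  # 2 of 5 transactions
confidence = rule_support / support(antecedent)  # (2/5) / (3/5)
lift = confidence / support(consequent)          # confidence / (3/5)

print(f"support={rule_support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 means the consequent appears more often together with the antecedent than it does overall.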

In R
apriori() in arules package
In Python
http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
Gradient Descent

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent,
one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html

Start at some x value, use the derivative at that value to tell us which way to move, and repeat. That is gradient descent.

http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
Gradient Descent

https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how “good” a given line is.

initial_b = 0 # initial y-intercept guess

initial_m = 0 # initial slope guess
num_iterations = 1000
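
The fragment above only sets the starting values. A possible way to complete it is sketched below: gradient descent on the mean squared error of a line y = m*x + b. The `learning_rate` value, the function name, and the four data points are assumptions of this sketch and do not come from the linked article.

```python
def gradient_descent(points, initial_b=0.0, initial_m=0.0,
                     learning_rate=0.05, num_iterations=1000):
    """Fit y = m*x + b by repeatedly stepping against the gradient of the error."""
    b, m = initial_b, initial_m
    n = len(points)
    for _ in range(num_iterations):
        # Partial derivatives of the mean squared error with respect to b and m.
        grad_b = sum(-2 * (y - (m * x + b)) for x, y in points) / n
        grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in points) / n
        # Step proportional to the negative of the gradient.
        b -= learning_rate * grad_b
        m -= learning_rate * grad_m
    return b, m


# Made-up points that lie exactly on y = 2x + 1.
points = [(1, 3), (2, 5), (3, 7), (4, 9)]
b, m = gradient_descent(points)
print(f"b={b:.3f}, m={m:.3f}")
```

If the learning rate is too large the steps overshoot and the values diverge; if it is too small, 1000 iterations may not be enough to get close to the minimum.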
Decision Trees

http://select.cs.cmu.edu/class/10701-F09/recitations/recitation4_decision_tree.pdf

http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
Random Forest

Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of
the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the
classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This
sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out
of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
In the original paper on random forests, it was shown that the forest error rate depends on two things:
● The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
● The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of
the individual trees decreases the forest error rate.

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
Bagging

Bagging, aka bootstrap aggregation, is a relatively simple way to increase the
power of a predictive statistical model by taking multiple random samples (with
replacement) from your training data set, and using each of these samples to
construct a separate model and separate predictions for your test set. These
predictions are then averaged to create a, hopefully more accurate, final
prediction value.

http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
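
The procedure can be sketched in a few lines. To keep the example short, the "model" here is deliberately trivial (it always predicts the mean of its own sample); the training values are made up, and a real use would fit a proper model to each bootstrap sample.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical training targets (made up for illustration).
train_y = [12.0, 15.0, 11.0, 14.0, 13.0, 40.0, 12.5, 13.5]


def bootstrap_sample(data):
    """Draw len(data) items at random, with replacement."""
    return [random.choice(data) for _ in data]


def fit_mean_model(sample):
    """A deliberately trivial model: always predict the mean of its sample."""
    return sum(sample) / len(sample)


# Build one model per bootstrap sample, then average their predictions.
models = [fit_mean_model(bootstrap_sample(train_y)) for _ in range(100)]
bagged_prediction = sum(models) / len(models)
print(f"bagged prediction: {bagged_prediction:.2f}")
```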
Boosting

Boosting is one of several classic methods for creating ensemble models,
along with bagging, random forests, and so forth. Boosting means that each
tree is dependent on prior trees, and learns by fitting the residual of the trees
that preceded it. Thus, boosting in a decision tree ensemble tends to improve
accuracy with some small risk of less coverage.
XGBoost is a library designed and optimized for boosted tree algorithms.
XGBoost is used in more than half of the winning solutions in machine learning
challenges hosted at Kaggle.

http://xgboost.readthedocs.io/en/latest/model.html#
and http://dmlc.ml/rstats/2016/03/10/xgboost.html
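
The idea of "fitting the residual of the trees that preceded it" can be shown with one-split trees (stumps) on a tiny made-up data set. This is a bare sketch of residual fitting for regression, not XGBoost's algorithm; the `learning_rate` and `rounds` values are arbitrary assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = sum(
            (r - (left_mean if x <= threshold else right_mean)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each new stump is fitted to what the previous stumps still get wrong."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)


# Made-up data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"training MSE after boosting: {mse:.4f}")
```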
Top 10 Data Analytics Tools 2020
(Currently in use with Various Organizations)

https://www.youtube.com/watch?v=P-bKqfKhqR8
Top 10 Data Science Tools For 2022
Data Science Tools and Libraries

https://www.youtube.com/watch?v=zVBcmTkJqpo
INTRODUCTION TO DATA SCIENCE

UNIT– I
• Types of Data
• DS Applications & Use Cases

Data All Around

• Lots of data is being collected and warehoused
– Scientific Experiments
– Internet of Things
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social Network
– ……many more!

What To Do With These Data?

• Aggregation and Statistics
– Data warehousing and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
• Data Driven
– Predictive Analytics
– Deep Learning

Statistical and Critical Thinking
Analyzing Data: Potential Pitfalls
• Misleading Conclusions
When forming a conclusion based on a statistical analysis, we should make statements that are clear
even to those who have no understanding of statistics and its terminology.
• Sample Data Reported Instead of Measured
When collecting data from people, it is better to take measurements yourself instead of asking
subjects to report results.
• Loaded Questions
If survey questions are not worded carefully, the results of a study can be misleading.
• Order of Questions
Sometimes survey questions are unintentionally loaded by the order of the items being considered.
• Nonresponse
A nonresponse occurs when someone either refuses to respond or is unavailable.
• Percentages
Some studies cite misleading percentages. Note that 100% of some quantity is all of it, but if there
are references made to percentages that exceed 100%, such references are often not justified.
Types of Data, Key Concept

A major use of statistics is to collect and use sample data to make conclusions
about populations.

Parameter & Statistic

• Parameter
a numerical measurement describing some
characteristic of a population
• Statistic
a numerical measurement describing some
characteristic of a sample
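
The difference between the two definitions can be seen by computing the same measurement twice, once on a whole population and once on a sample drawn from it. The population below is simulated (random ages), purely for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population (simulated): ages of 10,000 people.
population = [random.randint(18, 90) for _ in range(10_000)]

# Parameter: a numerical measurement describing the whole population.
population_mean = statistics.mean(population)

# Statistic: the same measurement computed on a sample of 100 people.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The statistic changes from sample to sample, while the parameter is fixed; using the former to draw conclusions about the latter is the "major use of statistics" named at the top of this slide.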

Types of Data

Quantitative Data & Categorical Data

• Quantitative (or numerical) data
consists of numbers representing counts or measurements.

Example: The weights of supermodels

Example: The ages of respondents

• Categorical (or qualitative or attribute) data

consists of names or labels (not numbers that represent counts or measurements).

Example: The gender (male/female) of professional athletes

Example: Shirt numbers on professional athletes' uniforms, which are substitutes for names
Types of Data, Quantitative Data

Discrete & Continuous types:

• Discrete data
result when the data values are quantitative and the number of values is
finite, or “countable.”

Example: The number of tosses of a coin before getting tails

• Continuous (numerical) data
result from infinitely many possible quantitative values, where the
collection of values is not countable.

Example: The lengths of distances from 0 cm to 12 cm
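
The two examples above can be simulated in Python as a rough sketch. The function names are hypothetical, and the coin is assumed to be fair. Note that floating-point numbers only approximate a continuum, so the second function is an approximation of continuous data.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable


def tosses_before_tails():
    """Count the heads tossed before the first tails (discrete: 0, 1, 2, ...)."""
    count = 0
    while random.random() < 0.5:  # treat values below 0.5 as heads, so toss again
        count += 1
    return count


def random_length_cm():
    """Pick a length between 0 cm and 12 cm (continuous)."""
    return random.uniform(0.0, 12.0)


discrete_values = [tosses_before_tails() for _ in range(5)]
continuous_values = [random_length_cm() for _ in range(5)]

print(discrete_values)    # whole-number counts only
print(continuous_values)  # any value in the range, typically with decimals
```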

Types of Data, Quantitative Data

Data
• Qualitative: categorical
• Quantitative: numerical, can be ranked
– Discrete: countable (5, 29, 8000, etc.)
– Continuous: can be decimals (2.59, 312.1, etc.)

Types of Data, Levels of Measurement:
Another way of classifying data: 4 levels of measurement: nominal, ordinal, interval, and ratio.

• Nominal level of measurement
characterized by data that consist of names, labels, or categories only, and
the data cannot be arranged in some order (such as low to high).
Example: Survey responses of yes, no, and undecided
• Ordinal level of measurement
involves data that can be arranged in some order, but differences (obtained
by subtraction) between data values either cannot be determined or are
meaningless.
Example: Course grades A, B, C, D, or F
• Interval level of measurement
involves data that can be arranged in order, and the differences between
data values can be found and are meaningful. However, there is no
natural zero starting point at which none of the quantity is present.
Example: Years 1000, 2000, 1776, and 1492
• Ratio level of measurement
data can be arranged in order, differences can be found and are
meaningful, and there is a natural zero starting point (where zero indicates
that none of the quantity is present). Differences and ratios are both
meaningful.
Example: Class times of 50 minutes and 100 minutes

Summary:
• Nominal - categories only (names)
• Ordinal - categories with some order (nominal, plus can be ranked)
• Interval - differences but no natural zero point (ordinal, plus intervals are consistent)
• Ratio - differences and a natural zero point (interval, plus ratios are consistent, true zero)
Types of Data, Levels of Measurement:
Example 1:

Determine the measurement level.

Variable          Nominal   Ordinal   Interval   Ratio   Level

Hair Color        Yes       No                           Nominal
Zip Code          Yes       No                           Nominal
Letter Grade      Yes       Yes       No                 Ordinal
ACT Score         Yes       Yes       Yes        No      Interval
Height            Yes       Yes       Yes        Yes     Ratio
Age               Yes       Yes       Yes        Yes     Ratio
Temperature (F)   Yes       Yes       Yes        No      Interval
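
Why temperature in degrees Fahrenheit is interval rather than ratio can be seen numerically. The Python sketch below uses two hypothetical readings and an assumed helper, fahrenheit_to_kelvin, which converts to a scale that does have a natural zero. Differences between the readings are meaningful, but their ratio on the Fahrenheit scale is not.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


cool_f, warm_f = 40.0, 80.0  # hypothetical readings

# Differences are meaningful at the interval level.
difference_f = warm_f - cool_f

# Ratios are not: 80 / 40 is 2, but 80 F is not "twice as hot" as 40 F.
naive_ratio = warm_f / cool_f
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cool_f)

print(difference_f)
print(naive_ratio)
print(round(kelvin_ratio, 3))
```

The ratio computed on the kelvin scale is only slightly above 1, far from the naive value of 2 obtained from the Fahrenheit numbers.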

Example 2:

Example 3:

Parameter or Statistic?

Statistic

Parameter

Example 4:

Discrete or Continuous?

Continuous

Discrete

Example 5:
Determine the measurement level.

Nominal

Ratio

Ordinal

Interval

Example 6:
Determine the measurement level & what’s wrong with the conclusion?

Structured vs Unstructured

https://www.youtube.com/watch?v=WBU7sW1jy2o
Big Data & Data Science

• “… the stylish job in the next 10 years will be statisticians,” Hal Varian, Google Chief Economist
• The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts
by 2018 (McKinsey Global Institute, June 2011).
• New Data Science institutes being created or repurposed: NYU, Columbia, Washington, UCB, ...
• New degree programs, courses, boot-camps:
– e.g., at Berkeley: Stats, I-School, CS, Astronomy…
– One proposal (elsewhere) for an MS in “Big Data Science”
– Plans for Data Science Stream at AUST
– RDA-CODATA School of Research Data Science
Data Science vs. Traditional Analysis vs. Traditional Software Delivery

Tools
• Traditional Analysis: SAS, R, Excel, SQL, in-house tools
• Traditional Software Delivery: Java, source control, Linux, continuous integration, unit
testing, bug reports and project management
• Data Science: R, Java, scientific Python libraries, Excel, SQL, Hadoop, Hive, Pig, Mahout and
other machine learning libraries, GitHub for source control and issue management

Analytical Methods
• Traditional Analysis: regressions, classifications, measuring prediction accuracy and
coverage/error, sampling
• Traditional Software Delivery: N/A
• Data Science: classification, clustering, similarity detection, recommenders, unsupervised and
supervised learning, small- and large-scale computations, measuring prediction accuracy and
coverage/error

Team Structure
• Traditional Analysis: Statisticians, Mathematicians, Scientists
• Traditional Software Delivery: Developers, Project Managers, Systems Engineers
• Data Science: Mathematicians, Statisticians, Scientists, Developers, Systems Engineers

Time Frame
• Traditional Analysis: either usually on-going research and discovery within a team in the
organization, or a specific project to determine answers
• Traditional Software Delivery: regular software release cycle, continuous delivery, etc.
• Data Science: either a discovery/learning phase leading to product development, or on-going
research and product invention/improvement
Contrast: Scientific Computing

[Figure: an image is passed to a general-purpose classifier, which labels it "Supernova" or "Not". Source: Nugent group / C3 LBL]

Scientific Modeling                      Data-Driven Approach

Physics-based models                     General inference engine replaces model
Problem-structured                       Structure not related to problem
Mostly deterministic, precise            Statistical models handle true randomness
                                         and un-modeled complexity
Run on supercomputer or high-end         Run on cheaper computer clusters (EC2)
computing cluster
Contrast: Machine Learning

Machine Learning                         Data Science

Develop new (individual) models          Explore many models, build and tune hybrids
Prove mathematical properties of         Understand empirical properties of models
models
Improve/validate on a few, relatively    Develop/use tools that can handle
clean, small datasets                    massive datasets
Publish a paper                          Take action!
Contrast: Data Engineering

                       Data Science                        Data Engineering

Approach               Scientific (Exploration)            Engineering (Development)
Problems               Unbounded                           Bounded
Path to Solution       Iterative, exploratory, nonlinear   Mostly linear
Education              More is better (PhDs common)        BS and/or self-trained
Presentation Skills    Important                           Not as important
Research Experience    Important                           Not as important
Programming Skills     Not as important                    Important
Data Skills            Important                           Important

Data Science Applications

Business
• Summary: From car design to insurance to pizza delivery, businesses are using data
science to optimize their operations and better meet their customers' expectations.
• What is happening?
– Two-Way Street for the Ford Focus Electric Car
– Better Fraud Detection Boosts Customer Satisfaction
– E-Commerce Insights: Domino's Secret Sauce
• What is possible: Using Social Data to Select Successful Retail Locations

Health Care
• Summary: Tomorrow's healthcare may look more efficient thanks to things like electronic
health records. It also may look a lot more effective. Reduced readmissions, better care, and
earlier detection are on the horizon.
• What is happening?
– Reducing Hospital Readmissions
– Better Point-of-Care Decisions
• What is possible: Medical Exams by Bathroom Mirrors

Urban Living
• Summary: For the first time in human history, more people live in cities than in suburban or
rural areas. An emerging field called "urban informatics" combines data science with the
unique challenges facing the world's growing cities.
• What is happening?
– Taking on Megacity Traffic
– Fighting Crime with Data ("predictive policing")
• What is possible: Instrumenting cities

Data Science: Case Study
Cancer Research
• Cancer is an incredibly complex disease; a single tumor can have
more than 100 billion cells, and each cell can acquire mutations
individually. The disease is always changing, evolving, and adapting.
• Employ the power of big data analytics and high-performance
computing.
• Leverage sophisticated pattern-recognition and machine learning algorithms to
identify patterns that are potentially linked to cancer
• Huge amount of data processing and recognition

Data Science: Case Study
Health Care

• Stanford Medicine, Google team up to harness power of data science for health care
• Stanford Medicine will use the power, security and scale of Google Cloud Platform to
support precision health and more efficient patient care.
• Analyzing genetic data
• Focusing on precision health
• Data as the engine that drives research

http://med.stanford.edu/news/all-news/2016/08/stanford-medicine-google-team-up-to-harness-power-of-data-science.html
Data Science: Case Study
Elections
• The Obama campaigns in 2008 and 2012 are credited with the
successful use of social media and data mining.
• Micro-targeting in 2012
– http://www.theatlantic.com/politics/archive/2012/04/the-creepiness-factor-how-obama-and-romney-are-getting-to-know-you/255499/
– http://www.mediabizbloggers.com/group-m/How-Data-and-Micro-Targeting-Won-the-2012-Election-for-Obama---Antony-Young-Mindshare-North-America.html
• Micro-profiles were built from multiple sources and accessed through apps, with data
updated in real time from door-to-door visits. Media buys, e-mails and Facebook
messages were highly targeted.
• 1 million people installed the Obama Facebook app that gave
access to info on “friends”.
Data Science: Case Study
Internet of Things (IoT)
• The Internet of Things (IoT) is growing rapidly. It is predicted that more than 25 billion devices
will be connected by 2020.

• The IoT will soon produce a massive volume and variety of data at
unprecedented velocity. If "Big Data" is the product of the IoT, "Data Science" is its
soul.
Data Science: Case Study
Customer Analytics

Case Study - How Recommender Systems Work
(Netflix/Amazon)

https://www.youtube.com/watch?v=n3RKsY2H-NE
