Foundations of Data Science
Course Code - 23DS3PCFDS
PREPARED BY - LAKSHMI SHREE K
AI & DS
Overview of the Course
Semester III
Course Title: Foundations of Data Science
Course Code: 23DS3PCFDS
Total Contact Hours: 40 hours
L-T-P: 3-0-0
Total Credits: 3
DEPT OF AI & DS
UNIT I
• Introduction to Data Science: Describing Data science, The data science
Venn diagram, Python for Data Science, Data science case studies
• Types of Data: structured versus unstructured data, quantitative versus
qualitative data, the four levels of data: nominal, ordinal, interval and ratio
• Total information awareness, Bonferroni’s Principle, Rhine’s paradox.
• The Data Science Process: Overview, defining research goals, retrieving
data, Cleansing, integrating and transforming data, exploratory data analysis,
Build the models, Presenting findings. Data Analytics Lifecycle.
UNIT II
• Statistics & Probability: Statistics, Obtaining data, Sampling Data,
Statistical measures, empirical rule. Point estimates, Sampling distributions,
Confidence intervals, Hypothesis Tests: Conducting a hypothesis test, one
sample t-tests, Type I and type II errors, Hypothesis testing for categorical
variables
• Information Gain & Entropy, Probability Theory, Probability Types,
Probability Distribution Functions, Bayes’ Theorem, Inferential Statistics
UNIT III
• Correlation Analysis: Types of correlation, correlation coefficient.
• Regression Analysis: Linear Regression: Simple Linear Regression,
Multilinear Regression, p-values, Logistic Regression, Multinomial logistic
regression, Time-Series Model, Receiver Operating Characteristic
UNIT IV
• Dealing with missing data: single and multiple data imputation, Entropy based techniques, Monte Carlo and
MCMC simulations;
• Correcting inconsistent data: Deduplication, Entity resolution, Pairwise Matching; Fellegi-Sunter Model
• Dimensionality Reduction: Eigenvalues and Eigenvectors of Symmetric Matrices: Definitions, Computing
Eigenvalues and Eigenvectors, Finding Eigen pairs by Power Iteration, Eigenvector matrix
• Principal-Component Analysis: Example, Using Eigenvectors for Dimensionality Reduction, The matrix of
distances
• Singular-Value Decomposition: Definition, interpretation, Dimensionality Reduction Using SVD, Why
Zeroing Low Singular Values Works, Querying Using Concepts, Computing the SVD of a Matrix
UNIT V
• Data Analytics on Text: Major Text Mining Areas – Information Retrieval
– Data Mining – Natural Language Processing (NLP) – Text analytics tasks:
Cleaning and Parsing, Searching, Retrieval, Text Mining, Part-of-Speech
Tagging, Stemming, Text Analytics Pipeline. NLP: Major components of
NLP, stages of NLP, and NLP applications.
Prescribed Text Book
Sl. No. | Book Title | Authors | Edition | Publisher | Year
1. | Principles of Data Science | Sinan Ozdemir, Sunil Kakade & Marco Tibaldeschi | Second Edition | Packt | 2018
2. | Fundamentals of Data Science | Sanjeev Wagh, Manisha Bhende, Anuradha Thakare | First Edition | CRC Press | 2022
3. | Introducing Data Science: Big Data, Machine Learning, and More | Davy Cielen, Arno D.B. Meysman, Mohamed Ali | - | Manning | 2016
Reference Text Book
Sl. No. | Book Title | Authors | Edition | Publisher | Year
1. | Doing Data Science | Rachel Schutt, Cathy O’Neil | - | O’Reilly | 2014
2. | Mining Massive Datasets | Jure Leskovec, Anand Rajaraman, Jeffrey D Ullman | 2nd | Dreamtech Press | 2016
E-Book
Sl. No. | Book Title | Authors | Edition | Publisher | Year | URL
1. | Data Science & Machine Learning | Dirk P. Kroese, Zdravko I. Botev, Thomas Taimre, Radislav Vaisman | - | University of Queensland | 2023 | https://people.smp.uq.edu.au/DirkKroese/DSML/DSML.pdf
2. | Becoming a Data Head | Alex J. Gutman, Jordan Goldmeier | - | Wiley | 2021 | https://32net.id/bukaheula/share/QP2cf2JLdeOPn00y3Nyu8aXHp1Slq1bc6P4YcuI4.pdf
MOOC Course
Sl. No. | Course name | Course Offered By | Year | URL
1. | IBM Data Science | Coursera | 2023 | https://www.coursera.org/professional-certificates/ibm-data-science
2. | Foundations of Data Science | SWAYAM | 2023 | https://onlinecourses.swayam2.ac.in/imb23_mg64/preview
Program Outcomes
PO1: Science and engineering Knowledge
PO2: Problem Analysis
PO3: Design & Development
PO4: Investigations of Complex Problems
PO5: Modern Tool Usage
PO6: Engineer & Society
PO7: Environment and Society
PO8: Ethics
PO9: Individual & Team Work
PO10: Communication
PO11: Project Mgmt. & Finance
PO12: Lifelong Learning
Course Outcomes
At the end of the course the student will be able to
CO1: Gain fundamental knowledge on data science
CO2: Analyze and visualize data for knowledge representation.
CO3: Demonstrate proficiency in data analysis.
CO4: Conduct experiments to demonstrate the use of various data science tools
Overview of UNIT 1
• Introduction to Data Science:
• Describing Data science
• The data science Venn diagram
• Python for Data Science
• Data science case studies
• Types of Data:
• structured versus unstructured data
• quantitative versus qualitative data
• the four levels of data: nominal, ordinal, interval and ratio
• Total information awareness, Bonferroni’s Principle, Rhine’s paradox.
• The Data Science Process:
• Overview
• Defining research goals
• Retrieving data
• Cleansing, integrating and transforming data
• Exploratory data analysis
• Build the models
• Presenting findings
• Data Analytics Lifecycle.
Overview of UNIT 1
PART 1
• Introduction to Data Science
• Describing Data science
• The data science Venn diagram
• Python for Data Science
• Data science case studies
Data
• Industrial Age to Information Age
• Estimates put the volume of data worldwide at around 64 zettabytes (1 zettabyte = 10^21 bytes)
• Data is created when you send a message, tweet, like, share, create an MS Word document
and so on.
• So much data, in every industry!
• Data leaks
• Make sense of the data – Data Age!!!
• Create insights and sources of knowledge that every human can benefit from.
History of Data Storage
How much data is created?
BIG DATA
• Data generated on the internet per minute
History of Data Science
• The art of uncovering insights and trends in data has been around since
ancient times.
• The ancient Egyptians used census data to increase efficiency in tax
collection and accurately predicted the Nile River's flooding every year.
• People have continued to use data to derive insights and predict outcomes.
Makings of a skilled Data Scientist
• A data scientist has to be one with a very curious mind, willing to spend significant time and effort to explore
her hunches.
• Curious, argumentative, judgmental.
• Mathematical Sciences (Linear Algebra, Probability, Statistics, Calculus)
• Subject area knowledge
• Experience programming and analysing data
• Storyteller
• Adept at selecting suitable tools
• Apply expertise to problem-solving
• Diverse Background
Drew Conway’s Data Science Venn Diagram
Data Scientist's role in an organization
[Diagram labels: Data, Stories, Insights; Clarify the problem, Data Collection, Analysis; Recognition, Storytelling, Visualization]
Data Science: The Sexiest Job in the 21st Century
• Because the digital revolution has touched every aspect of our lives, the
opportunity to benefit from learning about our behaviors is greater now
than ever before.
• Given the right data, marketers can take sneak peeks into our habit
formation.
• Research in neurology and psychology is revealing how habits and
preferences are formed and retailers like Target are out to profit from it.
Introduction to Data Science
• Data is a collection of information
• Organized data:
• Data is sorted into a row/column structure, where every row represents a
single observation and the columns represent the characteristics of that
observation.
• Unorganized data:
• Data is in free form, usually text, raw audio/video signals
• Data Science is all about how we take data, use it to acquire knowledge, and then use that
knowledge to do the following:
• Make decisions
• Predict the future
• Understand the past/present
• Create new industries/products
• Data Science is using data in order to gain new insights that you would otherwise have
missed.
Why Data Science?
• Parsing the huge volume of data in a reasonable time frame is difficult with
previous forms of analysis
• Data can be missing, incomplete or wrong
• Data can be on different scales, making it tough to compare
• Analytics enables data-driven decisions in place of go-with-your-gut decisions
The Data Science Venn Diagram
Python Practices
Basic Logical Operators
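The slide's original example is not preserved in this text. As a minimal sketch of the comparison and logical operators usually covered under this heading (the variable names and values are illustrative assumptions):

```python
# Illustrative values (assumed, not from the original slide)
x = 5
y = 10

# Comparison operators evaluate to booleans
is_equal = (x == y)         # equality
is_not_equal = (x != y)     # inequality
is_less = (x < y)           # strictly less than
is_greater_eq = (x >= y)    # greater than or equal to

# Logical operators combine boolean expressions
both = (x > 0) and (y > 0)    # True only if both sides are True
either = (x > 7) or (y > 7)   # True if at least one side is True
negated = not (x > 7)         # inverts the boolean value

print(is_equal, is_not_equal, is_less, is_greater_eq)
print(both, either, negated)
```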
Example for Basic Python
Create a list and access the items
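A minimal sketch of creating a list and accessing its items. The list contents are a hypothetical example, not the one shown on the original slide:

```python
# A hypothetical list of coffee drinks
drinks = ["espresso", "latte", "mocha", "cappuccino"]

first = drinks[0]       # indexing starts at 0
last = drinks[-1]       # negative indices count from the end
middle = drinks[1:3]    # slicing returns the items at index 1 and 2

drinks.append("americano")   # add an item to the end of the list
count = len(drinks)          # number of items after the append

# Iterate over the list with the position of each item
for position, name in enumerate(drinks):
    print(position, name)
```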
Parsing of a Tweet
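A minimal sketch of parsing a tweet, assuming the goal is to pick out the stock tickers (words starting with `$`) that it mentions. The tweet text and user name are made up for illustration:

```python
# A hypothetical tweet (assumed for illustration)
tweet = "RT @example_user: $TWTR is now a top holding, ahead of $AAPL"

# Split the unstructured text into individual words
words = tweet.split(" ")

# Keep the words that look like stock tickers
tickers = []
for word in words:
    if word.startswith("$") and len(word) > 1:
        tickers.append(word.strip(",.!?"))  # drop trailing punctuation

print("This tweet mentions:", tickers)
```

The same pattern (split, then test each word) also works for hashtags or user mentions by changing the prefix being checked.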
Overview of UNIT 1
PART 2
• Types of Data
• Structured versus unstructured data
• Quantitative versus qualitative data
• The four levels of data:
• Nominal
• Ordinal
• Interval and
• Ratio
• Total information awareness, Bonferroni’s Principle, Rhine’s paradox.
Types of Digital Data
Structured Data
Structured data supports the following:
• Data insert, delete, update and append operations.
• Indexing to enable faster data retrieval.
• Scalability, which enables increasing or decreasing capacities and data processing
operations such as storing, processing and analytics.
• Transaction processing that follows the ACID rules (Atomicity, Consistency,
Isolation and Durability).
• Encryption and decryption for data security.
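Several of these features can be sketched with Python's built-in `sqlite3` module, used here only as a stand-in for any relational store. The table, column names and values are hypothetical:

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: a fixed schema of rows and columns
cur.execute("CREATE TABLE shops (id INTEGER PRIMARY KEY, name TEXT, revenue REAL)")
# Index on the name column to speed up lookups by name
cur.execute("CREATE INDEX idx_shops_name ON shops (name)")

# Insert and update operations
cur.executemany(
    "INSERT INTO shops (name, revenue) VALUES (?, ?)",
    [("Bean There", 12000.0), ("Daily Grind", 8500.0)],
)
cur.execute("UPDATE shops SET revenue = ? WHERE name = ?", (9000.0, "Daily Grind"))
conn.commit()

# Atomicity: a transaction that fails part-way is rolled back as a whole
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM shops WHERE name = ?", ("Bean There",))
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# The delete was rolled back, so both rows are still present
rows = cur.execute("SELECT name, revenue FROM shops ORDER BY name").fetchall()
print(rows)
conn.close()
```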
Semi-Structured Data
• Semi-structured data contain tags and other markers.
• The data does not conform to the formal structure of a data model, such as a relational table
• Examples of semi-structured data are XML and JSON documents.
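As a small illustration, a JSON document carries its own keys (the "tags and markers") but no fixed schema. The record below is a hypothetical example, parsed with the standard `json` module:

```python
import json

# A hypothetical JSON document: fields are labeled, but nested and optional
raw = '{"user": "example_user", "tags": ["coffee", "data"], "location": {"city": "Bengaluru"}}'

record = json.loads(raw)          # parse the text into Python dicts and lists

city = record["location"]["city"]  # nested field access
tags = record["tags"]              # a list of values under one key
phone = record.get("phone")        # a missing field yields None instead of an error

print(city, tags, phone)
```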
Unstructured Data
• Unstructured data is information that either is not organized in a pre-defined manner or does not have a
pre-defined data model.
• It exists in its absolute raw form.
• It does not possess data features such as those of a table or a database.
• Some examples of unstructured data
• Mobile data: Text messages, chat messages, tweets, blogs and comments.
• Website content data: YouTube videos, browsing data, e-payments,
user-generated maps.
• Social media data: Images and videos from Instagram, Facebook, LinkedIn,
Flickr (upload, access, organize, edit and share photos from any device from
anywhere in the world).
• Satellite images, atmospheric data, surveillance, traffic videos.
Structured versus unstructured data
• Structured (organized) data – Observations and their characteristics, organized in a
table (rows and columns)
• Example: scientific observations
• Unstructured (unorganized) data – Data that exists as a free entity and does not follow
any standard organization hierarchy.
• Example: text data from server logs and Facebook posts
• Example: a genetic sequence of chemical nucleotides (ACGTATTGCA)
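A minimal sketch of turning unstructured data into a structured observation, using the nucleotide sequence from the slide. The choice of features (length and per-nucleotide counts) is an assumption made for illustration:

```python
from collections import Counter

# Unstructured: a raw string with no rows or columns
sequence = "ACGTATTGCA"

# Count how often each nucleotide occurs
counts = Counter(sequence)

# Structured: one observation (row) with named characteristics (columns)
row = {
    "length": len(sequence),
    "A": counts["A"],
    "C": counts["C"],
    "G": counts["G"],
    "T": counts["T"],
}
print(row)
```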
Quantitative versus qualitative data
• Name of the coffee shop – Qualitative data
• Revenue – Quantitative data
• Zip Code – Qualitative data (written with digits, but arithmetic on it is meaningless)
• Average Monthly customers - Quantitative data
• Country of coffee origin - Qualitative data
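The coffee shop example can be sketched as follows. All records are hypothetical; the point is that arithmetic is meaningful on the quantitative columns only, while qualitative columns are compared or counted:

```python
from statistics import mean

# Hypothetical coffee shop records (values assumed for illustration)
shops = [
    {"name": "Bean There", "revenue": 12000.0, "zip_code": "560019",
     "monthly_customers": 900, "origin": "Ethiopia"},
    {"name": "Daily Grind", "revenue": 8500.0, "zip_code": "560019",
     "monthly_customers": 650, "origin": "Colombia"},
    {"name": "Mocha Mill", "revenue": 9500.0, "zip_code": "560004",
     "monthly_customers": 700, "origin": "Ethiopia"},
]

# Quantitative: averaging revenue is meaningful
avg_revenue = mean(shop["revenue"] for shop in shops)

# Qualitative: zip codes are stored as text and only compared, never averaged
same_zip = shops[0]["zip_code"] == shops[1]["zip_code"]

# Qualitative: distinct countries of origin
origins = sorted({shop["origin"] for shop in shops})

print(avg_revenue, same_zip, origins)
```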
Four Levels of Data
• Nominal Level
• Ordinal Level
• Interval Level
• Ratio Level
Four Levels of Data – ASSESSMENT -1
Four Levels of Data – ASSESSMENT -2
Levels of Data - Nominal Level
• This type of data is described purely by name or category
• Example : gender, nationality
• Mathematical operations are not allowed, except equality and set membership
• Example of membership: being a tech entrepreneur implies being in the tech industry,
but not the other way around
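A minimal sketch of the operations that make sense at the nominal level: equality, membership and counting categories to find the mode. The sample values are assumed for illustration:

```python
from collections import Counter

# Hypothetical nominal data: categories with no order
nationalities = ["Indian", "French", "Indian", "Japanese", "Indian", "French"]

# Equality: are two observations in the same category?
is_same = nationalities[0] == nationalities[2]

# Membership: does a category appear in the data?
has_brazilian = "Brazilian" in nationalities

# The mode (most frequent category) is the usual measure of center
mode_value, mode_count = Counter(nationalities).most_common(1)[0]

print(is_same, has_brazilian, mode_value, mode_count)
```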
Levels of Data - Ordinal Level
• Data provides a rank order
• Example : Likert Scale – An ordinal level scale
• A survey asks users to rate a restaurant on a scale from 1-10
• Mathematical operations allowed
• Ordering – Ex: Spectrum of visible light
• Comparison
• Mathematical operations not allowed
• Addition
• Subtraction
• The appropriate measure of center is the median, because differences between ranks are not defined
• Measures of center
• Median
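A minimal sketch of ordinal operations, assuming a hypothetical set of restaurant ratings on a 1-10 scale. Ordering and comparison are meaningful, and the median serves as the measure of center:

```python
from statistics import median

# Hypothetical survey ratings on a 1-10 scale
ratings = [7, 9, 4, 8, 10, 6, 8]

# Ordering: ranks can be sorted
ordered = sorted(ratings)

# Comparison: one rating can be higher than another
first_beats_third = ratings[0] > ratings[2]

# Median: the middle value of the ordered ratings
center = median(ratings)

print(ordered, first_beats_third, center)
```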
Levels of Data - Interval Level
• Allows more complex mathematical operations, because differences between values are meaningful
• Ordinal data does not support subtraction, but interval data does
• Median and mode can still describe the center of the data, and the arithmetic mean can now be used as well
Example – calculate the variation
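The slide's original data is not preserved in this text. As a sketch, assuming a hypothetical list of daily temperatures and taking "variation" to mean the standard deviation:

```python
from statistics import mean, pstdev

# Hypothetical daily temperatures in degrees Celsius (interval-level data)
temps = [30, 32, 31, 29, 33]

# Subtraction is meaningful at the interval level
difference = temps[1] - temps[0]

# Arithmetic mean as the measure of center
avg = mean(temps)

# Population standard deviation: typical distance of a value from the mean
spread = pstdev(temps)

print(difference, avg, round(spread, 3))
```

`pstdev` treats the list as the whole population; `statistics.stdev` would be the choice if the values were a sample from a larger set.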
Levels of Data - Ratio Level
• Mathematical operations supported are
• Addition and Subtraction
• Multiplication and Division
• The center can be measured with the arithmetic mean, and also with the geometric mean
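A minimal sketch of ratio-level operations, assuming hypothetical account balances. Because ratio data has a true zero, statements such as "twice as much" are meaningful. `statistics.geometric_mean` requires Python 3.8 or later:

```python
from statistics import geometric_mean, mean

# Hypothetical account balances (ratio-level data with a true zero)
balances = [50.0, 100.0, 200.0]

# Division is meaningful: the second balance is a multiple of the first
ratio = balances[1] / balances[0]

# Arithmetic mean: sum of the values divided by their count
arithmetic = mean(balances)

# Geometric mean: n-th root of the product of n positive values
geometric = geometric_mean(balances)

print(ratio, round(arithmetic, 2), round(geometric, 2))
```

The geometric mean is only defined for positive values, so it is not suitable for interval data such as temperatures in Celsius that can be zero or negative.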