Lecture 4 Data Pre-Processing
COURSE OUTCOMES
CO2 Understand data pre-processing techniques and apply these for data cleaning.
Unit-1 Syllabus
Unit-1 Introduction to Machine Learning
• Introduction to Machine Learning: Definition of Machine Learning, Working principles of Machine
Learning; Classification of Machine Learning algorithms: Supervised Learning, Unsupervised Learning,
Reinforcement Learning, Semi-Supervised Learning; Applications of Machine Learning.
• Data Pre-Processing and Feature Extraction: Data Sourcing and Cleaning, Handling Missing data,
Encoding Categorical data, Feature Scaling, Handling Time Series data; Feature Selection techniques,
Data Transformation, Normalization, Dimensionality reduction.
• Data Visualization: Data Frame Basics, Different types of analysis, Different types of plots,
Plotting fundamentals using Matplotlib, Plotting Data Distributions using Seaborn.
SUGGESTIVE READINGS
• TEXT BOOKS:
• There is no single textbook covering the material presented in this course. The following books are
recommended for further reading:
• T1: Tom M. Mitchell, “Machine Learning”, McGraw Hill International Edition.
• T2: Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice Hall of
India, 2005.
• T3: Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly, 2016.
• REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, “Python Machine Learning”, (2014).
• R2: Richard O. Duda, Peter E. Hart, David G. Stork, “Pattern Classification”, 2nd Edition, Wiley.
• R3: Christopher Bishop, “Pattern Recognition and Machine Learning”, illustrated Edition, Springer, 2006.
Data Sourcing
• The pandas library is used for data sourcing.
• Where does the name come from?
• "pandas" combines "Panel Data" and "Python Data Analysis".
• Panel data is a subset of longitudinal data where observations are for
the same subjects each time.
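As a minimal sketch of data sourcing with pandas, the example below loads CSV text into a DataFrame and inspects it. The column names and values are made up for illustration, and an in-memory string stands in for a real file; with a file on disk, the path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,282.4
30,103,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # number of rows and columns
print(df.head())  # first rows of the data set
df.info()         # column types and non-null counts
```

The empty last field in the third data row is read as a missing value (NaN), which `df.info()` reveals through the non-null count of the `Calories` column.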
By: Prof. (Dr.) Vineet Mehan
Data Sourcing
• Why use pandas?
• pandas can clean messy data sets and make them readable and
relevant. Common problems include:
• Wrong data
• Duplicates
• Removing the affected rows is usually OK, since data sets can be very
big, and removing a few rows will not have a big impact on the result.
• Mean: the average value (the sum of all values divided by the number
of values).
• Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.
• If you take a look at our data set, you can see that in row 7, the
duration is 450, but for all the other rows the duration is between 30
and 60.
• To replace wrong data for larger data sets you can create some rules,
e.g. set some boundaries for legal values, and replace any values that
are outside of the boundaries.
• Another option is to remove the rows that contain wrong data. This
way you do not have to find out what to replace them with, and there
is a good chance you do not need them for your analyses.
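The two ways of handling wrong data described above can be sketched with pandas as follows. The data set, the column names, and the upper boundary of 120 are assumptions chosen for illustration; only the out-of-range duration of 450 in row 7 mirrors the example in the text.

```python
import pandas as pd

# Hypothetical data: row 7 holds a duration of 450, all others lie in 30-60.
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60],
    "Pulse": [110, 117, 103, 109, 117, 102, 110, 104, 109, 98],
})

MAX_DURATION = 120  # assumed upper boundary for a legal value

# Option 1: replace values outside the boundary with the boundary itself.
replaced = df.copy()
replaced.loc[replaced["Duration"] > MAX_DURATION, "Duration"] = MAX_DURATION

# Option 2: keep only the rows that satisfy the rule (drops the wrong rows).
removed = df[df["Duration"] <= MAX_DURATION]

print(replaced.loc[7, "Duration"])
print(len(df), len(removed))
```

Working on a copy keeps the original data available for comparison. Which option fits better depends on whether the rest of the row is still trustworthy.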
• By taking a look at our test data set, we can assume that rows 11 and
12 are duplicates.
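Duplicates do not have to be found by eye. The sketch below uses a small made-up DataFrame (not the data set from the lecture) in which the last row repeats the one before it, and shows how pandas can flag and drop such rows.

```python
import pandas as pd

# Hypothetical data in which the last row repeats the previous one.
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60, 60],
    "Pulse": [110, 117, 103, 100, 100],
})

# duplicated() marks each row that repeats an earlier row.
mask = df.duplicated()
print(mask)

# drop_duplicates() returns a new DataFrame without the repeated rows.
deduped = df.drop_duplicates()
print(len(df), len(deduped))
```

By default the first occurrence is kept and only later repeats are flagged, so one copy of each duplicated row survives.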
Task
• Apply the various methods used for sourcing data to suitable
arrays, datasets, etc. (BT-Level 3)
• https://www.tutorialspoint.com/machine_learning/index.htm
• https://www.w3schools.com/python/
THANK YOU