
CB0494 Notes

The document discusses various machine learning techniques, including classification, clustering, and anomaly detection, emphasizing their applications in data science. It covers the use of libraries like Pandas, NumPy, Matplotlib, and Seaborn for data manipulation and visualization, along with methods for data preparation and statistical analysis. Additionally, it touches on linear regression, model fitting, and the importance of understanding fitness indicators in supervised learning.

Uploaded by

ansonsee236
Copyright
© All Rights Reserved

Tutorial 1

Discussion 1

• Classification is a supervised learning technique: a model is trained on historical data and then used to predict future trends, e.g. a property agent uses historical data to build a classifier, inputs the features of a property into the classifier and gets a result.
• Clustering is an unsupervised learning technique: it groups similar data together, e.g. comparing house prices within the same block.
• Anomaly detection, e.g. a bank tracking our transactions (to prevent further losses).
• AI through an adaptive learning process, e.g. aviation co-piloting that uses AI to improve safety.
Relating a problem to the right data science question is the hardest part.
• To know how many types of whales are in a place at a given time: clustering is the most suitable approach (lack of data).
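
The difference between the first two bullets can be sketched on toy data. This is only an illustration: the numbers, the labels "small"/"large" and the choice of a k-nearest-neighbours classifier and k-means clustering are assumptions, not something the tutorial specifies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: [floor area in sqm, number of rooms]
features = np.array([[40, 1], [45, 1], [50, 2], [120, 4], [130, 5], [140, 5]])
labels = np.array(["small", "small", "small", "large", "large", "large"])

# Supervised: learn from labelled historical data, then predict a new case.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(features, labels)
prediction = classifier.predict(np.array([[125, 4]]))

# Unsupervised: no labels are given, the algorithm groups similar rows itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(features)

print(prediction[0])   # label predicted for the new property
print(groups)          # cluster number assigned to each row
```

The classifier needs the `labels` column to learn from, while k-means only ever sees `features`; that is the practical meaning of supervised vs unsupervised here.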

Question 1

• The text box is just HTML code.
• Import libraries: pandas (standard).
• A string value ("red color") is written in quotes, so the computer will not read it as a variable.
• Store the house data in a variable: as a DataFrame variable, it is able to call all the pandas functions directly.
• .head() checks the data and gives a short preview of it (a number can be passed in to see more rows).
• Dimensions of the data: .shape (rows, columns).
• .dtypes gives the data type of each column in the dataset.
• .info() gives the info and the number of non-null values in the dataset.
Data types: float64 (memory used to store a number with decimal places), int64 (integer), object.
• .describe() gives statistical info for the columns with numerical values (not for the object columns).
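
A minimal sketch of these inspection calls, assuming a small hand-made table in place of the real house data (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the house data (not the tutorial's real file).
houseData = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "SalePrice": [208500.0, 181500.0, 223500.0, 140000.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

print(houseData.head(2))      # first 2 rows (the default is 5)
print(houseData.shape)        # (rows, columns)
print(houseData.dtypes)       # data type of each column
houseData.info()              # non-null counts and memory usage
print(houseData.describe())   # statistics for the numeric columns only
```

Note that `.shape` and `.dtypes` are attributes (no brackets), while `.head()`, `.info()` and `.describe()` are functions.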

• Importing data from the internet: remember to check the Wi-Fi connection; it is an easier method to import files.
• type(dataset) checks the validity of the data.
• len() tells how many tables there are.
• () is for functions, [] is for lists (arrays); the computer counts from 0.
• We want [:top] to get the top 20 results.
• Bonus problem: the for-loop method uses memory only one time.
• x is an int; convert x into a string and insert it into the name of the file being imported.
• Check after importing the files.
• .append() inserts the selected dataset into the blank list.
• Integrate the printing of the results into the for-loop code.
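
The for-loop steps above can be sketched as follows. The file names data1.csv, data2.csv, ... and their contents are hypothetical; the sketch first writes them to a temporary folder so that it can run without the tutorial's files.

```python
import os
import tempfile

import pandas as pd

# Assumption: files are named data1.csv, data2.csv, ... (hypothetical names).
# Write three tiny files to a temporary folder so the sketch is runnable.
folder = tempfile.mkdtemp()
for x in range(1, 4):
    pd.DataFrame({"value": [x, x * 10]}).to_csv(
        os.path.join(folder, "data" + str(x) + ".csv"), index=False
    )

datasets = []  # blank list that collects each imported table
for x in range(1, 4):
    # x is an int, so convert it to a string before building the file name
    filename = os.path.join(folder, "data" + str(x) + ".csv")
    table = pd.read_csv(filename)
    print(filename, type(table), table.shape)  # check right after importing
    datasets.append(table)

print(len(datasets))    # how many tables were collected
top = 2
print(datasets[:top])   # slicing counts from 0 and stops before index `top`
```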
Tutorial 2

Discussion 2

 Classification

Question 2

• More libraries
NumPy: library for numeric computations in Python
Pandas: library for data acquisition and preparation
Matplotlib: low-level library for data visualization
Seaborn: higher-level library for data visualization
• Warning messages?
• # Basic Libraries
import matplotlib.pyplot as plt # we only need pyplot (fewer tools loaded)
• # Data Preparation
1(a) Import the CSV file.
1(c) Extract only the needed data: 2 methods

M1: .loc command (very lean and resource efficient, but not so user friendly; needs a loop for multiple string values)
: selects all the rows
== is a matching statement between the LHS and the RHS: each LHS value is compared with the RHS np.int64, and the matching columns are extracted into housedataNUM. Do not write 'np.int64' in quotes, because the quotes turn it into a string value instead of a variable.

M2: select_dtypes (user friendly, but consumes more resources)
The types go in through include= rather than ==, because this is not a mathematical equation; this makes it possible to extract multiple type strings at once (e.g. 'float64').

1(e) Drop the unwanted columns (those that are not 'int64')
axis=1 drops columns; axis=0 drops rows.
Either assign the result of drop back to the original variable (this updates the existing data and uses fewer variables, but everything must be rerun if amendments are made),
or define a new variable.
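
The two extraction methods and the drop step can be sketched as below. The table is a made-up stand-in for the house data, and the names `houseDataNUM2`, `not_int` and `houseDataDropped` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the house data.
houseData = pd.DataFrame({
    "LotArea": np.array([8450, 9600, 11250], dtype="int64"),
    "YearBuilt": np.array([2003, 1976, 2001], dtype="int64"),
    "LotFrontage": np.array([65.0, 80.0, 68.0], dtype="float64"),
    "Street": ["Pave", "Pave", "Grvl"],
})

# M1: .loc with ":" for all rows and a dtype comparison for the columns.
houseDataNUM = houseData.loc[:, houseData.dtypes == np.int64]

# M2: select_dtypes takes the type names as strings through include=.
houseDataNUM2 = houseData.select_dtypes(include=["int64", "float64"])

# 1(e): drop the columns that are not int64; axis=1 means columns.
not_int = houseData.columns[houseData.dtypes != np.int64]
houseDataDropped = houseData.drop(not_int, axis=1)  # stored in a new variable

print(houseDataNUM.columns.tolist())
print(houseDataNUM2.columns.tolist())
print(houseDataDropped.columns.tolist())
```

Storing the result of `drop` in a new variable, as done here, keeps `houseData` intact, so a cell can be rerun on its own without reloading the data.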

• Find statistics using a pd.DataFrame variable
A DataFrame variable (proper table) vs a Series variable (values are not presented as well)
.describe() extracts the statistical values (how to extract solely the median from the dataset?)
Data visualization: plt.figure(figsize=(width, height)) sets the size of the canvas (do not use subplots here)
Call the plotting function, e.g. sb.boxplot (adjust 3 parameters: which data to use, orientation, color; these are assignment statements, so their sequence does not matter)
sb.histplot plots a histogram; sb.violinplot is a combination of the box plot and histplot
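
One possible answer to the median question above, sketched on made-up numbers: the table returned by .describe() labels the median row "50%", so it can be picked out with .loc. The variable names `median_row` and `lot_median` are hypothetical.

```python
import pandas as pd

# Hypothetical numeric house data.
houseDataNUM = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

statistics = houseDataNUM.describe()            # a DataFrame (proper table)
median_row = statistics.loc["50%"]              # a Series: median of every column
lot_median = statistics.loc["50%", "LotArea"]   # a single value

print(statistics)
print(median_row)
print(lot_median, houseDataNUM["LotArea"].median())
```

Calling `.median()` directly on the column gives the same value without going through `.describe()`.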

2(c) LotArea
Present 6 plots together: prepare the figure first
f, axes = plt.subplots(2, 3) returns 2 output variables; 2 is the number of rows and 3 the number of columns (the number of plots cannot be changed afterwards)
For the fifth plot: for the histogram to be green, x='LotArea' needs to be included

2(e) .reindex (only needed when combining data from different tables that have different indexes)
sb.jointplot
A strong relationship will be increasing linearly
Correlation – CRI
Tutorial 3

• statistics = .describe()
• Here statistics is not a function but a variable
• Same libraries, same data
• .skew()
• Find the total number of outliers (for loop)
Temp = pd. (extract data)
Compute Q1 and Q3 using .quantile
Use | (or) to check whether a value is an outlier
*Extract data from .describe*
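
The outlier count can be sketched as below. The notes do not say which cut-off is used, so the usual 1.5 x IQR rule is an assumption here, as are the data values and the name `outlier_counts`.

```python
import pandas as pd

# Hypothetical numeric house data with one very large lot.
houseDataNUM = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 10084, 215245],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 307000, 375000],
})

print(houseDataNUM.skew())   # skewness of each column

outlier_counts = {}
for var in houseDataNUM.columns:
    temp = houseDataNUM[var]          # extract one column as a Series
    q1 = temp.quantile(0.25)
    q3 = temp.quantile(0.75)
    iqr = q3 - q1
    # Assumed rule: below Q1 - 1.5*IQR OR above Q3 + 1.5*IQR is an outlier.
    is_outlier = (temp < q1 - 1.5 * iqr) | (temp > q3 + 1.5 * iqr)
    outlier_counts[var] = int(is_outlier.sum())

print(outlier_counts)
```

The brackets around each comparison are needed because `|` binds more tightly than `<` and `>` in Python.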
• Plot LotArea using a for loop again
count = 0
Additional input (x=var, color=color[count])
count += 1 to choose a different column when moving on to the next plot
*Last week's homework*
• .corr()
sb.heatmap(linewidths=1) (white lines between the boxes)
• Check .dtypes, then change the object columns to categorical data
• sb.catplot
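
A minimal sketch of the correlation and categorical steps, leaving out the plotting calls. The data are made up, and using .astype("category") for the conversion is an assumption about how the tutorial changes the type.

```python
import pandas as pd

# Hypothetical house data with one text column.
houseData = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Grvl"],
})

# Correlation matrix of the numeric columns only.
corr = houseData[["LotArea", "SalePrice"]].corr()
print(corr)

# Change the text column to categorical data and confirm with .dtypes.
houseData["Street"] = houseData["Street"].astype("category")
print(houseData.dtypes)
print(houseData["Street"].cat.categories)
```

The `corr` table is what would be handed to sb.heatmap, and the categorical column is the kind of variable sb.catplot is meant for.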

Tutorial 4

• Indicators of fitness
R² (need to know its upper and lower limits and whether the value is logical in data science; we are only interested in the positive range) > MSE
• Linear regression is supervised learning (historical data is used to train the model)
• 1(c) Split the data into train and test sets (orderly splitting)
Retrieving the rows turns them into individual values
• .fit() does the linear regression on the training set
• linreg.intercept_ extracts the y-intercept
linreg.coef_ extracts the coefficients
HW: ?
• Underfitting vs overfitting
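
The regression steps above can be sketched with scikit-learn. The data are made up to follow y = 2x + 1 exactly so that the fitted values are easy to check, and the 8/2 split is an assumption. R² is at most 1 and can be negative for a poor model, while MSE is never below 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data following y = 2x + 1 exactly.
x = np.arange(10, dtype=float).reshape(-1, 1)   # predictor, one column
y = 2.0 * x + 1.0                               # response

# Orderly split: first 8 rows for training, last 2 rows for testing.
x_train, x_test = x[:8], x[8:]
y_train, y_test = y[:8], y[8:]

linreg = LinearRegression()
linreg.fit(x_train, y_train)    # fit on the training set only

print(linreg.intercept_)        # y-intercept
print(linreg.coef_)             # coefficients

# Indicators of fitness, computed on the test set.
y_test_pred = linreg.predict(x_test)
r2 = r2_score(y_test, y_test_pred)
mse = mean_squared_error(y_test, y_test_pred)
print(r2, mse)
```

Comparing R² and MSE between the train and test sets is one way to spot overfitting: a model that scores well on the training rows but poorly on the test rows has likely overfitted.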
