Lab 01 – Data Science Project 01

How to perform data cleaning for Machine Learning with Python

1. Messy datasets

Data cleaning refers to identifying and correcting errors in a dataset that can negatively impact a predictive model.

# summarize the number of unique values for each column using numpy
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the number of unique values in each column
for i in range(data.shape[1]):
    print(i, len(unique(data[:, i])))

# summarize the number of unique values for each column using pandas
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# summarize the number of unique values in each column
print(df.nunique())

Delete columns that contain a single value

# delete columns with a single unique value
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i, v in enumerate(counts) if v == 1]
print(to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)

# summarize the percentage of unique values for each column using numpy
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the percentage of unique values in each column
for i in range(data.shape[1]):
    num = len(unique(data[:, i]))
    percentage = float(num) / data.shape[0] * 100
    print('%d, %d, %.1f%%' % (i, num, percentage))
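For comparison, the same per-column percentage summary can be computed in one step with pandas; a minimal sketch, assuming the same oil-spill dataset:

# sketch: percentage of unique values per column using pandas
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
df = read_csv(path, header=None)
# nunique() counts distinct values per column; divide by the row count for a percentage
print(df.nunique() / df.shape[0] * 100)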
# report columns where the percentage of unique values is less than 1% of the rows
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the percentage of unique values in each column
for i in range(data.shape[1]):
    num = len(unique(data[:, i]))
    percentage = float(num) / data.shape[0] * 100
    if percentage < 1:
        print('%d, %d, %.1f%%' % (i, num, percentage))

# delete columns where the number of unique values is less than 1% of the rows
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i, v in enumerate(counts) if (float(v) / df.shape[0] * 100) < 1]
print(to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)

Remove columns with low variance

Another approach to the problem of removing columns with few unique values is to consider the variance of each column. Recall that variance is a statistic calculated on a variable as the average squared difference of the sample values from the mean. Variance can be used as a filter to identify columns to remove from the dataset: a column with a single value has a variance of 0.0, and a column with very few unique values will have a small variance. The VarianceThreshold class from the scikit-learn library supports this as a type of feature selection. An instance of the class can be created with a "threshold" argument, which defaults to 0.0 so that columns with a single value are removed. It can then be fit and applied to the dataset by calling the fit_transform() function, which creates a transformed version of the dataset in which columns with variance lower than the threshold are automatically removed.

# example of applying the variance threshold
from pandas import read_csv
from sklearn.feature_selection import VarianceThreshold
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)
# define the transform
transform = VarianceThreshold()
# transform the input data
X_sel = transform.fit_transform(X)
print(X_sel.shape)

# explore the effect of the variance threshold on the number of selected features
from numpy import arange
from pandas import read_csv
from sklearn.feature_selection import VarianceThreshold
from matplotlib import pyplot
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)
# define thresholds to check
thresholds = arange(0.0, 0.55, 0.05)
# apply transform with each threshold
results = list()
for t in thresholds:
    # define the transform
    transform = VarianceThreshold(threshold=t)
    # transform the input data
    X_sel = transform.fit_transform(X)
    # determine the number of input features
    n_features = X_sel.shape[1]
    print('>Threshold=%.2f, Features=%d' % (t, n_features))
    # store the result
    results.append(n_features)
# plot the threshold vs the number of selected features
pyplot.plot(thresholds, results)
pyplot.show()
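To make the variance intuition concrete, a quick sketch (not tied to the dataset above): a constant column has variance 0.0, while a column with very few distinct values has a small variance that a threshold can filter out.

# sketch: variance of a constant column vs. a nearly-constant column
from numpy import array
from numpy import var
single = array([5.0, 5.0, 5.0, 5.0])   # one unique value
few = array([0.0, 0.0, 0.0, 1.0])      # two unique values, mostly constant
print(var(single))   # 0.0 -- removed by the default threshold
print(var(few))      # 0.1875 -- small, removed by a threshold such as 0.2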
Identify rows that contain duplicate data

# locate rows of duplicate data
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
# load the dataset
df = read_csv(path, header=None)
# calculate duplicates
dups = df.duplicated()
# report if there are any duplicates
print(dups.any())
# list all duplicate rows
print(df[dups])

Delete rows that contain duplicate data

# delete rows of duplicate data from the dataset
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# delete duplicate rows
df.drop_duplicates(inplace=True)
print(df.shape)

Feature extraction techniques for data preparation

Data preparation can be challenging. The approach most often prescribed and followed is to analyze the dataset, review the requirements of the algorithm, and transform the raw data to best meet the algorithm's expectations. This can be effective, but it is also slow and can demand deep expertise in both data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configuration. This too can be effective at discovering non-intuitive solutions, and it requires very little expertise, although it can be computationally expensive.

An approach that seeks a middle ground between these two data preparation strategies is to treat the transformation of the input data as a feature engineering or feature extraction procedure. This involves applying a suite of common or commonly useful data preparation techniques to the raw data, aggregating all of the resulting features together into one large dataset, and then fitting and evaluating a model on that data.

The philosophy of this approach treats each data preparation technique as a transform that extracts salient features from the raw data to present to the learning algorithm. Ideally, such transforms disentangle complex relationships and compound input variables, in turn allowing the use of simpler modeling algorithms, such as linear machine learning techniques.

For lack of a better name, we will refer to this as the "Feature Engineering Method" or "Feature Extraction Method" for configuring data preparation in a predictive modeling project. It allows data analysis and algorithm expertise to be used in selecting data preparation methods, and it allows non-intuitive solutions to be found at a much lower computational cost.

The large number of input features this creates can also be addressed explicitly through feature selection techniques, which attempt to rank the importance or value of the many extracted features and select only a small subset of those most relevant to predicting the target. A toy sketch of the aggregation idea follows.
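In the sketch below (hypothetical data, two transforms), each transform produces its own view of the data and the views are column-stacked into one larger feature set; sklearn's FeatureUnion, used in the examples that follow, automates exactly this step.

# toy sketch: column-stacking two transformed views of hypothetical data
from numpy import hstack
from numpy.random import rand
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
X = rand(10, 3)                            # hypothetical raw data: 10 rows, 3 features
X_a = MinMaxScaler().fit_transform(X)      # view 1: scaled to [0, 1]
X_b = StandardScaler().fit_transform(X)    # view 2: standardized
X_all = hstack((X_a, X_b))                 # aggregated feature set
print(X_all.shape)                         # (10, 6): 3 features from each view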
Transforms

We can explore this approach to data preparation with a worked example.

Example 01

# example of loading and summarizing the wine dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
# load the dataset as a data frame
df = read_csv(url, header=None)
# retrieve the numpy array
data = df.values
# split the columns into input and output variables
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the loaded data
print(X.shape, y.shape)

Example 02

# baseline model performance on the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

# data preparation as feature engineering for the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
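As a quick check of what the union produces, a sketch assuming fu and X are defined as in the block above: five full-width transforms on the 13 wine features plus 7 PCA and 7 SVD components should yield 13 * 5 + 7 + 7 = 79 columns.

# sketch: inspect the number of features created by the union
X_union = fu.fit_transform(X)
print(X_union.shape)   # expected (178, 79) for the wine dataset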
encode="ordinal', Liblinear') 0, n_repeate=3, random stat scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print (Accuracy: %.3f (%.3f)' & (mean(scores), std(scores))) Example 03 # data preparation as feature engineering with feature selection for wine dataset from numpy import mean from numpy import std from pandas import read_esv from sklearn.model_selection import RepeatedstratifiedKFold clence Artificial intelligence — Saad Lah Computer Si(Aor pata Science Project o8 from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline from sklearn.pipeline import FeatureUnion from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import RobustScaler from sklearn.preprocessing import QuantileTransformer from sklearn.preprocessing import KBinsDiscretizer from sklearn.feature_selection import RFE from sklearn.decomposition import PCA from sklearn.decomposition import TruncatedsvD # load the dataset url = ‘https: //raw.githubusercontent .con/jbrownlee/Datasets/master/wine.csv’ df = read_csv(url, heade: data = df.values X, y= datal:, :-1], datal:, -1) # minimally prepare dataset X = X.astype('float") y = Labelincoder() . fit_transform(y.astype(‘str')) # transforms for the feature union transforms = list () transforms.append((‘nms', MinMaxScaler())) jone) transforms.append(('ss', StandardScaler())) transforms.append(('rs', RobustScaler())) transforms.append(('at', QuantileTransformer (n_quantile: output_distribution="normal'))) transforms append (("kbd', KBinsDiscretizer(n_bini strategy="uniform'))) transforms.append(("pca', PCA(n_components=7))} transforms.append(('svd", TruncatedSvD(n_components=7))) # create the feature union fu = FeatureUnion (transforms) # define the feature selection rfe = RFE (estimator-LogisticRegression(solver="liblinear'), n_features_to_select=15) # define the model model = LogisticRegression (solve! # define the pipeline steps = List () steps-append(("fu', fu)) steps-append((‘rfe', rfe)) steps.append(('m', model)) 00, 0, encode="ordinal', Liblinear') clence Artificial intelligence — Saad Lah Computer Si(Aor pata Science Project o8 Page 10 of 10 pipeline = Pipeline (steps=steps) # define the cross-validation procedure ev = Repeatedstratifiedkrold(n splite=10, n_repeats=3, randon_state=1) # evaluate model scores = cross_val_score(pipeline, x, y, scorin n_jobs=-1) # report performance print ("Accuracy: §.3f (%.3£)' % (mean(scores), std(scores))) ccuracy', ev=cv, Vige chay vi du sé danh gia higu suat cla m6 hinh va bao cdo d6 chinh xac phan loal trung binh va d6 Iéch chudn Luu y : Két qua ola ban c6 thé thay di ty theo tinh ch&t ngau nhién cia thuat toan hodc quy trinh danh gia hoac su khac biét vé d6 chinh xac bang sé. Hay can nhac viée chay vi du nay mét vai lan va so snh két qua trung binh. Mét lan niva, chung ta cé thé thay higu sudt tang thém ttr 96.8% véi t&t cd cdc tinh nang duoc trich xuat Ién khong 98,9 v6i Iva chon tinh nang duoc sir dung truée khi lap mé hinh, Ban co thé dat duoc hiéu sudt tét hon bang ky thuat Iva chon tinh nang khac hoac voi nhidu hoac it tinh nang dugc chon hon khong? Hay cho t6i biét nh@ng gi ban kham pha. ial Intelligence Seed Lack Computer Science