Lab 01 Ds Project 01
How to Clean Data for Machine Learning with Python

1. Messy datasets: Data cleaning refers to identifying and correcting errors in a dataset that can negatively affect a predictive model.

# summarize the number of unique values for each column using numpy
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the number of unique values in each column
for i in range(data.shape[1]):
    print(i, len(unique(data[:, i])))

# summarize the number of unique values for each column using pandas
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# summarize the number of unique values in each column
print(df.nunique())

Delete columns that contain a single value

# delete columns with a single unique value
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i, v in enumerate(counts) if v == 1]
print(to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)
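As a variation (an addition to the lab, not part of the original listing), the same single-value-column removal can be written as one step with a boolean mask over the nunique() result; a minimal sketch:

# sketch: drop single-value columns with a boolean mask
from pandas import read_csv
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
df = read_csv(path, header=None)
# keep only the columns that have more than one unique value
df = df.loc[:, df.nunique() > 1]
print(df.shape)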
# summarize the percentage of unique values for each column using numpy
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the percentage of unique values in each column
for i in range(data.shape[1]):
    num = len(unique(data[:, i]))
    percentage = float(num) / data.shape[0] * 100
    print('%d, %d, %.1f%%' % (i, num, percentage))

# report only the columns where unique values are less than 1% of the rows
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the percentage of unique values in each column
for i in range(data.shape[1]):
    num = len(unique(data[:, i]))
    percentage = float(num) / data.shape[0] * 100
    if percentage < 1:
        print('%d, %d, %.1f%%' % (i, num, percentage))

# delete columns where the number of unique values is less than 1% of the rows
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i, v in enumerate(counts) if (float(v) / df.shape[0] * 100) < 1]
print(to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)

Remove columns with low variance

Another approach to the problem of removing columns with few unique values is to consider the variance of each column. Recall that variance is a statistic computed on a variable as the average squared difference of the sample values from the mean. Variance can therefore be used as a filter for identifying columns to remove from the dataset: a column with a single value has a variance of 0.0, and a column with very few unique values will have a small variance. The VarianceThreshold class from the scikit-learn library supports this as a type of feature selection. An instance of the class can be created with the "threshold" argument, which defaults to 0.0 so that only columns with a single value are removed. The transform can then be fit and applied to a dataset by calling the fit_transform() function, creating a transformed version of the dataset in which columns with a variance below the threshold are automatically removed.

# example of applying the variance threshold
from pandas import read_csv
from sklearn.feature_selection import VarianceThreshold
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)
# define the transform
transform = VarianceThreshold()
# transform the input data
X_sel = transform.fit_transform(X)
print(X_sel.shape)
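To see the quantity that the threshold is compared against, the per-column variances can also be computed directly with numpy; a minimal sketch (an addition for illustration, reusing X from the example above):

# sketch: compute the per-column variance that the threshold filters on
from numpy import set_printoptions
# variance of each input column (axis=0 computes down the rows)
variances = X.var(axis=0)
set_printoptions(precision=3, suppress=True)
print(variances)
# columns whose variance is exactly 0.0 hold a single value
print([i for i, v in enumerate(variances) if v == 0.0])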
# explore the effect of the variance threshold on the number of selected features
from numpy import arange
from pandas import read_csv
from sklearn.feature_selection import VarianceThreshold
from matplotlib import pyplot
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)
# define thresholds to check
thresholds = arange(0.0, 0.55, 0.05)
# apply transform with each threshold
results = list()
for t in thresholds:
    # define the transform
    transform = VarianceThreshold(threshold=t)
    # transform the input data
    X_sel = transform.fit_transform(X)
    # determine the number of input features
    n_features = X_sel.shape[1]
    print('>Threshold=%.2f, Features=%d' % (t, n_features))
    # store the result
    results.append(n_features)
# plot the threshold vs the number of selected features
pyplot.plot(thresholds, results)
pyplot.show()

Identify rows that contain duplicate data

# locate rows of duplicate data
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
# load the dataset
df = read_csv(path, header=None)
# calculate duplicates
dups = df.duplicated()
# report if there are any duplicates
print(dups.any())
# list all duplicate rows
print(df[dups])

Delete rows that contain duplicate data

# delete rows of duplicate data from the dataset
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# delete duplicate rows
df.drop_duplicates(inplace=True)
print(df.shape)
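Before deleting anything, it can help to inspect every member of each duplicate group, not just the later occurrences that duplicated() marks by default; a minimal sketch using the keep=False option (an addition to the lab, not part of the original listing):

# sketch: show all rows that belong to a duplicate group
from pandas import read_csv
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
df = read_csv(path, header=None)
# keep=False marks every occurrence of a duplicated row, including the first
all_dups = df.duplicated(keep=False)
print(df[all_dups].sort_values(by=list(df.columns)))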
Feature extraction techniques for data preparation

Data preparation can be challenging. The most commonly prescribed and followed approach is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of those algorithms. This can be effective, but it is slow and can require deep expertise in both data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configuration. This too can be effective at finding non-intuitive solutions, and it requires very little expertise, although it can be computationally expensive.

An approach that seeks a middle ground between these two data preparation methods is to treat the transformation of the input data as a feature engineering or feature extraction procedure. This involves applying a suite of common or commonly useful data preparation techniques to the raw data, aggregating all of the resulting features together into one large dataset, and then fitting and evaluating a model on that data. The philosophy of this approach treats each data preparation technique as a transform that extracts salient features from the raw data to present to the learning algorithm. Ideally, such transforms disentangle complex relationships and compound input variables, in turn allowing the use of simpler modeling algorithms, such as linear machine learning techniques.

For lack of a better name, we will refer to this as the "Feature Engineering Method" or the "Feature Extraction Method" for configuring data preparation in a predictive modeling project. It allows data analysis and algorithm expertise to be used in selecting data preparation methods, and it can find non-intuitive solutions at a much lower computational cost.

The explosion in the number of input features can also be addressed explicitly through feature selection techniques, which attempt to rank the importance or value of the large number of extracted features and select only a small subset of those most relevant for predicting the target.

We can explore this approach to data preparation with a worked example.

Example 01
# example of loading and summarizing the wine dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
# load the dataset as a data frame
df = read_csv(url, header=None)
# retrieve the numpy array
data = df.values
# split the columns into input and output variables
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the loaded data
print(X.shape, y.shape)

Example 02
# baseline model performance on the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

# data preparation as feature engineering for the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
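One way to sanity-check the feature union before evaluating the whole pipeline is to fit it on its own and inspect how many features it produces. A minimal sketch (an addition for illustration, reusing fu and the prepared X from the listing above; the count depends on the configured transforms, e.g. five 13-column transforms plus 7 PCA and 7 SVD components gives 5 * 13 + 7 + 7 = 79):

# sketch: inspect the dimensionality created by the feature union
X_fe = fu.fit_transform(X)
# the 13 raw inputs are expanded into the stacked transform outputs
print(X.shape, '->', X_fe.shape)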
Example 03
# data preparation as feature engineering with feature selection for the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the performance of the model and reports the mean and standard deviation of the classification accuracy.

Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Again, we can see a further lift in performance, from 96.8% with all of the extracted features to about 98.9% when feature selection is applied before modeling.

Can you achieve better performance with another feature selection technique, or with more or fewer selected features? Let me know what you discover.
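As a starting point for that challenge (an addition to the lab, not part of the original listing), the pipeline above can be re-evaluated across several values of n_features_to_select; a minimal sketch, assuming the imports and the X, y, and fu objects from Example 03:

# sketch: compare different numbers of selected features
for n in [5, 10, 15, 20, 25]:
    # rebuild the feature selection and pipeline for this feature count
    rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=n)
    pipeline = Pipeline(steps=[('fu', fu), ('rfe', rfe), ('m', LogisticRegression(solver='liblinear'))])
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('>n=%d, Accuracy: %.3f (%.3f)' % (n, mean(scores), std(scores)))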
