PYTHON MACHINE
LEARNING
Why Python?
A large community
tons of machine learning specific libraries
easy to learn
TensorFlow makes Python the leading language in the data science community.
About Python
It is case sensitive
Text indentation is important in logic flow!
Use # symbol to add a code comment
Use ''' ''' to comment a block of code
Prepare your machine
Install Python 2.7.xx
pip install pandas
pip install matplotlib sklearn statistics scipy seaborn ipython
pip install --no-cache-dir pystan  # forces a re-download of an already-installed package in case of errors
Get Data set for training through
http://archive.ics.uci.edu/ml/datasets.html
How to Run Python ML application
Open "IDLE (Python GUI)", then File → New File
Write your code in the new window, then Run → Run Module
Alt+3 : comment the selected lines
Alt+4 : uncomment the selected lines
Anaconda
Anaconda is a Python distribution for data scientists.
It has around 270 packages including the most important ones for most scientific
applications, data analysis, and machine learning such as NumPy, SciPy, Pandas,
IPython, matplotlib, and scikit-learn.
Download it for free from https://www.continuum.io/downloads
Import Statistical libraries
import time
import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
from scipy import stats
import sklearn
import seaborn
from IPython.display import Image
import numpy as np
import pandas as pd
A pandas DataFrame is similar to a Python list, but it has an extra index column, and any operation performed on a DataFrame is applied to every single element inside it.
Example of converting a Python list to a DataFrame:
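A minimal sketch (the values and the column name 'price' are illustrative):
import pandas as pd
prices = [3.5, 4.0, 2.75]                       # a plain Python list
df = pd.DataFrame(prices, columns=['price'])    # one data column plus the extra index column
print(df)
df['price'] = df['price'] * 2                   # the operation is applied to every element
print(df)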
Common Operations
1) Set the dataset column names: data.columns = ['blue', 'green', 'red']
2) Get the value of any cell using the row index and column index/name, like this: data.ix[2, 'red']
3) Replace NaN with zero using: data = data.replace(np.nan, 0)
4) Select a subset of columns into a new DataFrame: NewDF = data[['blue', 'red']]
5) Count the rows that have data in a column: data['red'].value_counts().sum()
(missing values are denoted by NaN)
6) Count the distinct values in a column: data['blue'].nunique()
7) Drop duplicate rows: data = data.drop_duplicates()
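A small sketch that ties a few of the operations above together (the column names and values are made up):
import numpy as np
import pandas as pd
data = pd.DataFrame([[1, np.nan, 3], [1, 5, 3], [1, 5, 3]])
data.columns = ['blue', 'green', 'red']      # 1) set the column names
data = data.replace(np.nan, 0)               # 3) replace NaN with zero
NewDF = data[['blue', 'red']]                # 4) subset of columns in a new DataFrame
print(data['red'].value_counts().sum())      # 5) count of rows that have data in a column
print(data['blue'].nunique())                # 6) count of distinct values in a column
data = data.drop_duplicates()                # 7) drop duplicate rows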
Basic commands
Reading the Data from disk into Memory
data = pd.read_csv('examples/trip.csv')
Reading the Data from HDFS into Memory
# `hd` is assumed to be an HDFS client handle (e.g. from the hdfs3 or pydoop library)
with hd.open("/home/file.csv") as f:
    data = pd.read_csv(f)
Printing Size of the Dataset and Printing First/Last Few Rows
print len(data)
data.head()
data.tail()
data.describe()  # data summary: count, mean, std, min/max and quartiles
Order Dataset and get first and last element of a column
data = data.sort_values(by='birthyear')
data = data.reset_index(drop=True)
print data.ix[0, 'birthyear']   # first element
print data.ix[len(data)-1, 'birthyear']   # last element
DataFrame cleaning
If you need to convert a column with negative values to absolute values:
data['AwaitingTime'] = data['AwaitingTime'].apply(lambda x: abs(x))
Convert string values to integer values (DayOfTheWeek to Integers)
Day_mapping = {'Monday' : 0, 'Tuesday' : 1, 'Wednesday' : 2, 'Thursday' : 3, 'Friday' : 4, 'Saturday' : 5, 'Sunday' : 6}
data['DayOfTheWeek'] = data['DayOfTheWeek'].map(Day_mapping)
Remove rows that have an Age value less than zero
data = data[data['Age'] >= 0]
Delete column from Dataset
del data['from_station_id']
Filter data to remove rows that contain NA/0
df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
df = df[(df.T != 0).any()]   # keep a row if ANY feature contains a value
print df
df = df[(df.T != 0).all()]   # keep a row if ALL features contain values
print df
Original DataFrame:
   a  b
0  0  0
1  0  1
2  1  0
3  1  1
Keep row if ANY feature contains a value, df[(df.T != 0).any()]:
   a  b
1  0  1
2  1  0
3  1  1
Keep row if ALL features contain values, df[(df.T != 0).all()]:
   a  b
3  1  1
Drop the rows where any/all elements are NaN
data.dropna(how='any')   # by default drops rows; pass axis=1 to drop columns instead
how : {'any', 'all'}
any : if any NA values are present, drop that row/column
all : if all values are NA, drop that row/column
Drop duplicate rows
Consider a row a duplicate if the entire row is the same; return the result in a new DataFrame
dataframe1 = df.drop_duplicates(subset=None, inplace=False)
Consider a row a duplicate if Column1 and Column3 are the same, keep the last of the duplicates, and change the DataFrame in place
df.drop_duplicates(subset=['Column1', 'Column3'], keep='last', inplace=True)
Save unique data
df.to_csv(file_name_output)
Split Dataset, Create training, and test dataset
How do we split the dataset 70-30?
[70% of the observations fall into the training dataset and 30% into the test dataset.]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['x'], data['y'], test_size=0.3, random_state=1)
import matplotlib.pyplot as plt
plt.bar – creates a bar chart
plt.scatter – makes a scatter plot
plt.boxplot – makes a box and whisker plot
plt.hist – makes a histogram
plt.plot – creates a line plot
Plot for Data array
import numpy as np
import matplotlib.pyplot as plt
Data=np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
Column1=0
Column2=1
for i in range( len(Data) ):
    plt.plot( Data[i][Column1], Data[i][Column2], 'r.')   # 'r.' = red dot marker
plt.show()
Bar plot for Data array
import numpy as np
import matplotlib.pyplot as plt
Data=np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
Column1=0
Column2=1
BarWidth=1/1.5
for i in range( len(Data) ):
    plt.bar( Data[i][Column1], Data[i][Column2], BarWidth, color="blue")
plt.show()
Scatter plot for Data
import numpy as np
import matplotlib.pyplot as plt
Data=np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
Column1=0
Column2=1
plt.scatter( Data[:, Column1], Data[:, Column2], marker="x", s=150, linewidths=5, zorder=1)
plt.show()
#Where s is the size of the marker
Calculating Pair Plot Between All Features
import numpy as np
import matplotlib.pyplot as plt
data = np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
import pandas
import seaborn
df = pandas.DataFrame(data)
# Calculating pair plot between all features
seaborn.pairplot(df)
plt.show()
Basic commands
Sort, Filter, group, and Plotting the Data
data = data.sort_values(by='birthyear')
data = data[(data['birthyear'] >= 1931) & (data['birthyear']<=1999)]
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title = 'Distribution of birth years', figsize = (15,4))
plt.show()
Stacked column chart
data1 = data.groupby(['birthyear', 'gender']).size().unstack('gender').fillna(0)
data1.plot.bar(title ='Distribution of birth years by Gender', stacked=True, figsize = (15,4))
Convert string to DateTime
Create new column as date/time
data['StartTime1'] = data['starttime'].apply(lambda x: datetime.datetime.strptime(x, "%m/%d/%Y %H:%M") )
Extract Year, Month, Day, and Hour
data['year'] = data['StartTime1'].apply(lambda x: x.year )
data['month'] = data['StartTime1'].apply(lambda x: x.month )
data['day'] = data['StartTime1'].apply(lambda x: x.day )
data['hour'] = data['StartTime1'].apply(lambda x: x.hour )
Split String column to create new columns for Year-Month-Day
for index, component in enumerate(['year', 'month', 'day']):
    data[component] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[index]))
OR
data['year'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[0]))
data['month'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[1]))
data['day'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[2]))
Notes
'abcde'[0:2]  →  'ab'
'abcde'[:2]   →  'ab'
'abcde'[:-1]  →  'abcd'
Concepts
Mean : Average
Median: the middle data point when the observations are sorted; when the number of observations is even, it is the average of the two middle values.
Mode: returns the observation in the dataset with the highest frequency.
Variance: represents variability of data points about the mean.
Standard deviation: just like variance, also captures the spread of data along
the mean. The only difference is that it is a square root of the variance.
Normal distribution (Gaussian distribution): the mean lies at the center of this
distribution with a spread (i.e., standard deviation) around it. Some 68% of the
observations lie within 1 standard deviation from the mean; 95% of the
observations lie within 2 standard deviations from the mean, whereas 99.7% of
the observations lie within 3 standard deviations from the mean.
Outliers: values that are distinct from the majority of the observations. They occur either naturally, due to equipment failure, or because of entry mistakes.
Plot mean
Plot the mean of 'tripduration' grouped by 'starttime_date'
data.groupby('starttime_date')['tripduration'].mean() \
    .plot.bar(title = 'Distribution of Trip duration by date', figsize = (15,4))
Calculate Mean, Standard deviation, Median
print data['tripduration_mean'].mean()
print data['tripduration_mean'].std()
print data['tripduration_mean'].median()
Seasonal pattern vs Cyclic Pattern vs Trend
Seasonal: a pattern that repeats over a fixed period, such as a monthly pattern.
Cycle: a pattern that repeats over non-periodic time spans; we check whether patterns recur without fixing a period.
Trend: whether the continuous variable increases or decreases in the long term.
[Figure: example of a seasonal pattern, a cyclic pattern, and a trend]
Correlation directions
Calculate the correlation
pd.set_option('precision', 3)
correlations = data[['tripduration','age']].corr()
print(correlations)
.corr(method='pearson')
method : {‘pearson’, ‘kendall’, ‘spearman’}
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
For more information http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
Rename the dataset's column names
Get the data from the following URL and save it as "concrete_data.csv"
http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
data = pd.read_csv('examples/concrete_data.csv')
print len(data)
data.head()
Renaming the Columns
data.columns = ['cement_component', 'furnace_slag', 'fly_ash', \
                'water_component', 'superplasticizer', 'coarse_aggregate', \
                'fine_aggregate', 'age', 'concrete_strength']
Loop through dataset
Draw relation between each feature and concrete_strength
plt.figure(figsize=(15,10.5))
plot_count = 1
for feature in list(data.columns)[:-1]:
    plt.subplot(3, 3, plot_count)
    plt.scatter(data[feature], data['concrete_strength'])
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
    plot_count += 1
plt.show()
# Get the correlations between the features
pd.set_option('display.width', 100)
pd.set_option('precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
Dimensionality Reduction
PCA Dimensionality Reduction
Used when we have a big dataset and need to reduce the number of columns for faster training (removing redundant features).
Example: the dataset (data) contains 20 features and we need to reduce them to 4 features.
from sklearn.decomposition import PCA
NewDataFrame = PCA(n_components = 4).fit_transform(data)   # returns a NumPy array with 4 columns
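As a hedged follow-up sketch, fitting the PCA object separately also lets you check how much of the variance the 4 kept components explain (assuming data holds only numeric features):
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
reduced = pca.fit_transform(data)               # same rows, reduced to 4 columns
print(pca.explained_variance_ratio_)            # share of the variance captured by each component
print(pca.explained_variance_ratio_.sum())      # total variance kept by the 4 components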
Association Algorithm
Association Rules
Association rules analysis is a technique to uncover how items are associated with each other.
There are three common ways to measure association:
1) Support: This says how popular an itemset is, as measured by the
proportion of transactions in which an itemset appears.
2) Confidence: This says how likely item Y is purchased when item X is purchased (misleading when Y is popular on its own).
3) Lift: This says how likely item Y is purchased when item X is purchased, while controlling for how popular Y is; a lift greater than 1 means X actually boosts Y. A small worked example follows below.
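A tiny worked example of the three measures for the rule beer → nuts (the transactions below are made up):
transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
    ['beer', 'nuts'],
    ['milk', 'nuts'],
]
n = float(len(transactions))
support_beer_nuts = sum(1 for t in transactions if 'beer' in t and 'nuts' in t) / n   # 2/4 = 0.50
support_beer = sum(1 for t in transactions if 'beer' in t) / n                         # 3/4 = 0.75
support_nuts = sum(1 for t in transactions if 'nuts' in t) / n                         # 3/4 = 0.75
confidence = support_beer_nuts / support_beer     # P(nuts | beer) = 0.67
lift = confidence / support_nuts                  # 0.89 < 1, so beer does not actually boost nuts
print('support=%.2f confidence=%.2f lift=%.2f' % (support_beer_nuts, confidence, lift))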
Association Rules (Apriori Algorithm)
from apyori import apriori
transactions = [
['beer', 'nuts'],
['beer', 'cheese'],
]
results = list(apriori(transactions))
Association Rules (Example)
pip install apyori
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('apriori_data2.csv', header = None)
records = []
for i in range(0, 11):
    records.append([str(dataset.values[i,j]) for j in range(0, 10)])
# Train Apriori Model
from apyori import apriori
rules = apriori(records, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
# Visualising the results
results = list(rules)
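The returned results are apyori RelationRecord objects; a minimal sketch of printing them in readable form (field names as exposed by apyori's namedtuples):
for record in results:
    # each record has .items, .support and a list of .ordered_statistics
    for stat in record.ordered_statistics:
        print('%s -> %s  support=%.3f  confidence=%.3f  lift=%.3f' % (
            list(stat.items_base), list(stat.items_add),
            record.support, stat.confidence, stat.lift))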
Time Series
Supervised machine learning
Time Series
A time series forecast is different from regression in that time acts as an explanatory variable, and the observations should be continuous along equal intervals.
Algorithms
1) Autoregressive Forecast Model: uses observations at previous time steps to predict the observation at a future time step.
2) ARIMA Forecast Model: a linear model that predicts after removing trend and seasonality.
3) Prophet: Facebook's library for forecasting time series data; it is quick and gives very good results.
Related packages: pip install fbprophet
The input to Prophet is always a dataframe with two columns: ds and y.
The ds (datestamp) column must contain a date or datetime (either is fine).
The y column must be numeric and represents the measurement we wish to forecast.
It is good to use log(y) rather than the raw y to damp trends and noise.
FB Prophet
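A minimal sketch of the Prophet workflow; the file name and the original column names are illustrative, while the two-column ds/y format is what Prophet expects:
import numpy as np
import pandas as pd
from fbprophet import Prophet

df = pd.read_csv('examples/daily_counts.csv')          # hypothetical file with 'date' and 'count' columns
df = df.rename(columns={'date': 'ds', 'count': 'y'})
df['y'] = np.log(df['y'])                              # log-transform, as suggested above

model = Prophet()
model.fit(df)
future = model.make_future_dataframe(periods=30)       # extend 30 days past the last date
forecast = model.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())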
COMPUTER VISION
Open CV for computer vision
To install OpenCV: pip install opencv-python
Images are arrays in which each pixel has up to 4 channels (Blue, Green, Red, and Alpha).
It is common to process images in grayscale for faster performance.
A video is just a loop of images, so any code that processes images can process video too.
For more information about open CV please visit
https://www.tutorialspoint.com/opencv/index.htm
https://www.youtube.com/playlist?list=PLQVvvaa0QuDdttJXlLtAJxJetJcqmqlQq
Common OpenCV functions
image = cv2.imread(ImgPATH)   # read an image from disk
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # convert to grayscale
cv2.imshow('Window Title', image)   # show the image in a window
cv2.imwrite(ImgPATH, image)   # write the image to disk
_, image = cv2.threshold(image, threshold, maxval, thresholdType)   # set a pixel to maxval if its value is greater than threshold
image = cv2.adaptiveThreshold(image, maxValue, adaptiveMethod, thresholdType, blockSize, C)   # the threshold is calculated automatically
thresholdType : {THRESH_BINARY, THRESH_BINARY_INV, THRESH_TRUNC, THRESH_TOZERO, THRESH_TOZERO_INV}
[Figure: the original image and the result of each thresholdType]
adaptiveMethod
ADAPTIVE_THRESH_MEAN_C : threshold value is the mean of neighborhood area.
ADAPTIVE_THRESH_GAUSSIAN_C : threshold value is the weighted sum of neighborhood values where weights are a Gaussian window.
blockSize : A variable of the integer type representing the size of the pixel neighborhood used to calculate the threshold value.
C : A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean).
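A small sketch that puts the threshold calls above together (the image path and the parameter values are illustrative):
import cv2
image = cv2.imread('c:/sample.jpg')                           # hypothetical path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, fixed = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)   # pixels above 127 become 255, the rest 0
adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)    # threshold computed per 11x11 neighborhood, minus C=2
cv2.imshow('Fixed threshold', fixed)
cv2.imshow('Adaptive threshold', adaptive)
cv2.waitKey(0)
cv2.destroyAllWindows()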
OpenCV Video Stream Example
import cv2
# VideoStream1 = cv2.VideoCapture('c:/boxed-correct.avi')   # read the stream from an avi video file
VideoStream1 = cv2.VideoCapture(0)   # read the stream from the first connected video cam
while True:
    IsReturnFrame1, ImgFrame1 = VideoStream1.read()
    if IsReturnFrame1 == 0: break
    GrayImgFrame1 = cv2.cvtColor(ImgFrame1, cv2.COLOR_BGR2GRAY)
    cv2.imshow('ImageTitle1', ImgFrame1)
    cv2.imshow('ImageTitle2', GrayImgFrame1)
    if cv2.waitKey(1) & 0xFF == ord('q'): break   # press q to stop
VideoStream1.release()
cv2.destroyAllWindows()
Image processing using OpenCV
[Figure: the original image and its Threshold_Color, ThresholdGray, Threshold_GAUSSIAN, and Threshold_MEAN versions]
Object Recognition (template matching)
Search a main image for a template image that matches with the exact lighting/scale/angle.
[Figure: main image, template, and the matched result]
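A minimal template-matching sketch (paths and the matching method are illustrative); as noted above, this only finds the template when lighting, scale, and angle match:
import cv2
main_image = cv2.imread('c:/main.jpg', cv2.IMREAD_GRAYSCALE)       # hypothetical paths
template = cv2.imread('c:/template.jpg', cv2.IMREAD_GRAYSCALE)
h, w = template.shape
result = cv2.matchTemplate(main_image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)                      # location of the best match
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
output = cv2.cvtColor(main_image, cv2.COLOR_GRAY2BGR)
cv2.rectangle(output, top_left, bottom_right, (0, 0, 255), 2)       # draw the match in red
print('match score: %.2f' % max_val)
cv2.imshow('Result', output)
cv2.waitKey(0)
cv2.destroyAllWindows()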
NLP
Natural Language Processing
NLP (Natural Language Processing)
Why do we need NLP packages?
NLP packages handle a wide range of tasks such as named-entity recognition, part-of-speech (POS) tagging, sentiment analysis, document classification, topic modeling, and much more.
Named-entity recognition: extracts the names of persons, organizations, locations, times, quantities, percentages, etc.
Example: Jim bought 300 shares of Acme Corp. in 2006. → [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
Part-of-speech tagging: identifies words as nouns, verbs, adjectives, adverbs, etc.
Top 5 Python NLP libraries
NLTK is good as an education and research tool. Its modularized structure makes it excellent for learning and exploring NLP concepts, but it's not meant for production.
TextBlob is built on top of NLTK and is more easily accessible. It is a good library for fast prototyping or for building applications that don't require highly optimized performance. Beginners should start here.
Stanford's CoreNLP is a Java library with Python wrappers. It's in many existing production systems due to its speed.
SpaCy is a new NLP library that's designed to be fast, streamlined, and production-ready. It's not as widely adopted.
Gensim is most commonly used for topic modeling and similarity detection. It's not a general-purpose NLP library, but for the
tasks it does handle, it does them well.
NLTK Package for Natural Language Processing
pip install nltk
To download all NLTK data packages through a GUI, run nltk.download()
NLTK concepts
Tokenizing
Word tokenizers : split text into words
Sentence tokenizers : split text into sentences
Lexicon
The words and their actual meanings in the given context
Corpora
A body of text, e.g. medical journals, presidential speeches
Lemmatizing (stemming)
Reduce a word to its root, e.g. (gone, went, going) → go
WordNet
A list of different words that have the same meaning as the given word
NLTK Simple example
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
SentencesArray = sent_tokenize(EXAMPLE_TEXT)   # split the text into sentences
print(SentencesArray)
ListOfAvailableStopWords= set(stopwords.words("english"))
print(ListOfAvailableStopWords)
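A short follow-up sketch (continuing the same example) that word-tokenizes the text, removes the stop words, and stems what is left:
words = word_tokenize(EXAMPLE_TEXT)
filtered_words = [w for w in words if w.lower() not in ListOfAvailableStopWords]
print(filtered_words)
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print([ps.stem(w) for w in filtered_words])   # reduce each word to its stem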
textblob
pip install textblob
python -m textblob.download_corpora
Sentiment Analysis Example
We have a sample of 3,000 random user comments from the imdb.com, amazon.com, and yelp.com websites
http://archive.ics.uci.edu/ml/machine-learning-databases/00331/
Each comment has a score; the score is either 1 (positive) or 0 (negative)
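A minimal TextBlob sketch for this task, assuming one of the downloaded files is tab-separated as sentence<TAB>score (the file name is illustrative):
import pandas as pd
from textblob import TextBlob

data = pd.read_csv('imdb_labelled.txt', sep='\t', header=None, names=['sentence', 'score'])
# TextBlob polarity ranges from -1 (negative) to +1 (positive); map it to the 0/1 labels
data['predicted'] = data['sentence'].apply(lambda s: 1 if TextBlob(s).sentiment.polarity > 0 else 0)
accuracy = (data['predicted'] == data['score']).mean()
print('Accuracy: %.2f%%' % (accuracy * 100))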
Python Deep Learning
Common Deep learning fields
Main fields:
Computer vision
Speech recognition
Natural language processing
Audio recognition
Social network filtering
Machine translation
Visual examples:
Colorization of black and white images: https://www.youtube.com/watch?v=_MJU8VK2PI4
Adding sounds to silent movies: https://www.youtube.com/watch?v=0FW99AQmMc8
Automatic machine translation
Object classification in photographs
Automatic handwriting generation
Character text generation
Image caption generation
Automatic game playing: https://www.youtube.com/watch?v=TmPfTpjtdgg
Common Deep learning development tool (Keras)
Keras is a free Artificial Neural Networks (ANN) library (a deep learning library).
It is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
Deep learning for computer vision (image processing)
Three common ways to detect objects:
median-based features
edge-based features
threshold-based features
How to use Neural Network Models in Keras
Five steps
1. Define Network.
2. Compile Network.
3. Fit Network.
4. Evaluate Network.
5. Make Predictions.
Step 1. Define Network
Neural networks are defined in Keras as a sequence of layers.
The first layer in the network must define the number of inputs to expect. For a Multilayer Perceptron model this is specified by the input_dim attribute.
Example of a small Multilayer Perceptron model (2 inputs, one hidden layer with 5 neurons, 1 output)
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(5, input_dim=2))
model.add(Dense(1))
Re-write after add activation function
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
Available Activation Functions
Optional; an activation function acts like a filter on a layer's output, used to match the network to the predictive modeling problem, and often gives a significant boost in performance.
Sigmoid: used for Binary Classification (2 classes), with one neuron in the output layer. Whatever the input, it maps it to a value between 0 and 1.
Softmax: used for Multiclass Classification (>2 classes), with one output neuron per class value.
Linear: used for Regression, with the number of neurons matching the number of outputs.
Tanh: whatever the input, it converts it to a number between -1 and 1.
ReLU: 0 for a<0 and a for a>0, so it simply removes negative values and passes positive values unchanged.
LeakyReLU: scales negative values down by a small factor and passes positive values unchanged.
elu
selu
softplus
softsign
hard_sigmoid
PReLU
ELU
ThresholdedReLU
Step 2. Compile Network
Compiling the network specifies the optimization algorithm used to train it and the loss function that the optimization algorithm minimizes to evaluate it.
model.compile(optimizer='sgd', loss='mse')
Optimizers are the tools that minimize the loss between the prediction and the real value. Commonly used optimization algorithms:
'sgd' (Stochastic Gradient Descent) requires the tuning of a learning rate and momentum.
ADAM requires the tuning of a learning rate.
RMSprop requires the tuning of a learning rate.
Loss functions:
Regression: Mean Squared Error or 'mse'.
Binary Classification (2 class): 'binary_crossentropy'.
Multiclass Classification (>2 class): 'categorical_crossentropy'.
Finally, you can also specify metrics to collect while fitting the model in addition to the loss function. Generally, the most useful
additional metric to collect is accuracy.
model.compile(optimizer='sgd', loss='mse', metrics=['accuracy'])
Step 3. Fit Network
history = model.fit(X, y, batch_size=10, epochs=100)
The network is trained using the backpropagation algorithm
Batch size is the number of samples that will be propagated through the network at a time.
Epochs is the number of passes over ALL the training examples.
Example: if you have 1000 training examples and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
Step 4. Evaluate Network
The model evaluates the loss across all of the test patterns, as well as any other metrics
specified when the model was compiled.
For example, for a model compiled with the accuracy metric, we could evaluate it on
a new dataset as follows:
loss, accuracy = model.evaluate(X, Y)
print("Loss: %.2f, Accuracy: %.2f% %" % (loss, accuracy*100))
Step 5. Make Predictions
probabilities = model.predict(X)
predictions = [float(round(x)) for x in probabilities]
accuracy = numpy.mean(predictions == Y) #count the number of True and divide by the total size
print("Prediction Accuracy: %.2f% %" % (accuracy*100))
Binary classification using Neural Network in Keras
Diabetes Data Set
Detect Diabetes Disease based on analysis
Dataset Attributes:
1. Number of times pregnant
2. Plasma glucose concentration
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
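A compact, hedged sketch of the five steps on this dataset (the file name, layer sizes, and training settings are illustrative; the CSV is assumed to hold the 8 inputs followed by the class column):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split

dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')   # hypothetical file: 8 inputs + class
X, Y = dataset[:, 0:8], dataset[:, 8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# 1. Define Network
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# 2. Compile Network
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# 3. Fit Network
model.fit(X_train, Y_train, batch_size=10, epochs=100)
# 4. Evaluate Network
loss, accuracy = model.evaluate(X_test, Y_test)
print("Loss: %.2f, Accuracy: %.2f%%" % (loss, accuracy * 100))
# 5. Make Predictions
probabilities = model.predict(X_test)
predictions = [float(round(p)) for p in probabilities.flatten()]
print("Prediction Accuracy: %.2f%%" % (np.mean(np.array(predictions) == Y_test) * 100))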
Save prediction model
After we train our model, i.e. model.fit(X_train, Y_train), we can save the trained model to use later.
This can be done with the pickle package (Python Object Serialization Library), using its dump and load methods. Pickle can save any object, not just the prediction model.
import pickle
…………….
model.fit(X_train, Y_train)
# save the model to disk
pickle.dump(model, open("c:/data.dump", 'wb'))   # wb = write bytes
# some time later... load the model from disk
model = pickle.load(open("c:/data.dump", 'rb'))   # rb = read bytes