Python Data Science Cheat Sheet
                 yticks=[0,2.5,5])

1 Data                                      Also see Lists, NumPy & Pandas
>>> import pandas as pd
>>> import numpy as np
>>> uniform_data = [Link](10, 12)
>>> data = [Link]({'x':[Link](1,101),
                   'y':[Link](0,4,100)})
Seaborn also offers built-in data sets:
>>> titanic = sns.load_dataset("titanic")
>>> iris = sns.load_dataset("iris")

Boxplot
>>> [Link](x="alive",                       Boxplot
           y="age",
           hue="adult_male",
           data=titanic)
>>> [Link](data=iris,orient="h")            Boxplot with wide-form data

Violinplot
>>> [Link](x="age",                         Violin plot
           y="sex",
           hue="survived",
           data=titanic)

Plot
>>> [Link]("A Title")                       Add plot title
>>> [Link]("Survived")                      Adjust the label of the y-axis
>>> [Link]("Sex")                           Adjust the label of the x-axis
>>> [Link](0,100)                           Adjust the limits of the y-axis
>>> [Link](0,10)                            Adjust the limits of the x-axis
>>> [Link](ax,yticks=[0,5])                 Adjust a plot property
>>> plt.tight_layout()                      Adjust subplot params
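As a rough illustration of what the Boxplot entries above summarize, the sketch below computes the quartiles, whisker ends, and outliers of a small sample using only the standard library. It assumes the common 1.5 x IQR whisker rule (the Matplotlib default that Seaborn box plots build on). The `box_stats` helper and the sample values are hypothetical, and the quartile interpolation may differ slightly from what a plotting library uses.

```python
import statistics

def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws: quartiles, whisker ends, outliers."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between the sample min and max.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - whis * iqr
    high_limit = q3 + whis * iqr
    # Whiskers end at the most extreme points that still lie inside the limits.
    inside = [v for v in data if low_limit <= v <= high_limit]
    outliers = [v for v in data if v < low_limit or v > high_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up values, for illustration only.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 100]
stats = box_stats(ages)
print(stats)
```

Points outside the whisker limits are the ones a box plot typically draws as individual markers.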
From RDDs
>>> from [Link] import *

Infer Schema
>>> sc = [Link]
>>> lines = [Link]("[Link]")
>>> parts = [Link](lambda l: [Link](","))
>>> people = [Link](lambda p: Row(name=p[0],age=int(p[1])))
>>> peopledf = [Link](people)

Specify Schema
>>> people = [Link](lambda p: Row(name=p[0],
                                   age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for
              field_name in [Link]()]
>>> schema = StructType(fields)
>>> [Link](people, schema).show()
+--------+---+
|    name|age|
+--------+---+
|    Mine| 28|
|   Filip| 29|
|Jonathan| 30|
+--------+---+

From Spark Data Sources
JSON
>>> df = [Link]("[Link]")
>>> [Link]()
+--------------------+---+---------+--------+--------------------+
|             address|age|firstName|lastName|         phoneNumber|
+--------------------+---+---------+--------+--------------------+
|[New York,10021,N...| 25|     John|   Smith|[[212 555-1234,ho...|
|[New York,10021,N...| 21|     Jane|     Doe|[[322 888-1234,ho...|
+--------------------+---+---------+--------+--------------------+
>>> df2 = [Link]("[Link]", format="json")
Parquet files
>>> df3 = [Link]("[Link]")
TXT files
>>> df4 = [Link]("[Link]")

Inspect Data
>>> [Link]                       Return df column names and data types
>>> [Link]()                     Display the content of df
>>> [Link]()                     Return first n rows
>>> [Link]()                     Return first row
>>> [Link](2)                    Return the first n rows
>>> [Link]                       Return the schema of df
>>> [Link]().show()              Compute summary statistics
>>> [Link]                       Return the columns of df
>>> [Link]()                     Count the number of rows in df
>>> [Link]().count()             Count the number of distinct rows in df
>>> [Link]()                     Print the schema of df
>>> [Link]()                     Print the (logical and physical) plans

>>> [Link]("firstName",          Show firstName, and lastName is
           [Link]("Smith")) \    TRUE if lastName is like Smith
      .show()

Startswith - Endswith
>>> [Link]("firstName",          Show firstName, and TRUE if
           [Link] \              lastName starts with Sm
             .startswith("Sm")) \
      .show()
>>> [Link]([Link]("th")) \       Show last names ending in th
      .show()

Substring
>>> [Link]([Link](1, 3) \        Return substrings of firstName
             .alias("name")) \
      .collect()

Between
>>> [Link]([Link](22, 24)) \     Show age: values are TRUE if between
      .show()                    22 and 24

Add, Update & Remove Columns
Adding Columns
>>> df = [Link]('city',[Link]) \
           .withColumn('postalCode',[Link]) \
           .withColumn('state',[Link]) \
           .withColumn('streetAddress',[Link]) \
           .withColumn('telePhoneNumber',
                       explode([Link])) \
           .withColumn('telePhoneType',
                       explode([Link]))

Updating Columns
>>> df = [Link]('telePhoneNumber', 'phoneNumber')

Removing Columns
>>> df = [Link]("address", "phoneNumber")
>>> df = [Link]([Link]).drop([Link])

Repartitioning
>>> [Link](10)\                  df with 10 partitions
      .rdd \
      .getNumPartitions()
>>> [Link](1).[Link]()           df with 1 partition

Running SQL Queries Programmatically
Registering DataFrames as Views
>>> [Link]("people")
>>> [Link]("customer")
>>> [Link]("customer")

Query Views
>>> df5 = [Link]("SELECT * FROM customer")
>>> df5.show()
>>> peopledf2 = [Link]("SELECT * FROM global_temp.people")
>>> peopledf2.show()

Output
Data Structures
>>> rdd1 = [Link]                Convert df into an RDD
>>> [Link]().first()             Convert df into an RDD of strings
>>> [Link]()                     Return the contents of df as a Pandas DataFrame

Write & Save to Files
>>> [Link]("firstName", "city")\
      .write \
      .save("[Link]")
>>> [Link]("firstName", "age") \
      .write \
      .save("[Link]",format="json")

Stopping SparkSession
>>> [Link]()

DataCamp
Learn Python for Data Science Interactively
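As a Spark-free illustration of the "From RDDs" steps above (split each text line on commas, then build a row with a name and an integer age), the sketch below uses only the Python standard library. The `Person` namedtuple is a hypothetical stand-in for a Spark `Row`, and the input lines are assumed for illustration; the names and ages mirror the table output shown in the Specify Schema example.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row, for illustration only.
Person = namedtuple("Person", ["name", "age"])

# Assumed file contents: one "name, age" record per line.
lines = ["Mine, 28", "Filip, 29", "Jonathan, 30"]

# Step 1: split every line on commas (the role of the first map call).
parts = [line.split(",") for line in lines]

# Step 2: build one row per record, stripping whitespace before int().
people = [Person(name=p[0], age=int(p[1].strip())) for p in parts]

for person in people:
    print(person.name, person.age)
```

In Spark the same two steps run lazily over a distributed collection, whereas this sketch evaluates them eagerly on an in-memory list.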
Python For Data Science Cheat Sheet
Keras
Learn Python for data science Interactively at [Link]

Keras
Keras is a powerful and easy-to-use deep learning library for Theano and TensorFlow that provides a high-level neural networks API to develop and evaluate deep learning models.

A Basic Example
>>> import numpy as np
>>> from [Link] import Sequential
>>> from [Link] import Dense
>>> data = [Link]((1000,100))
>>> labels = [Link](2,size=(1000,1))
>>> model = Sequential()
>>> [Link](Dense(32,
                 activation='relu',
                 input_dim=100))
>>> [Link](Dense(1, activation='sigmoid'))
>>> [Link](optimizer='rmsprop',
           loss='binary_crossentropy',
           metrics=['accuracy'])
>>> [Link](data,labels,epochs=10,batch_size=32)
>>> predictions = [Link](data)

Data                                        Also see NumPy, Pandas & Scikit-Learn
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, you split the data into training and test sets, for which you can also use the train_test_split function of sklearn.model_selection.

Keras Data Sets
>>> from [Link] import boston_housing, mnist, cifar10, imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data()
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000)
>>> num_classes = 10

Other
>>> from [Link] import urlopen
>>> data = [Link](urlopen("[Link]ml/machine-learning-databases/pima-indians-diabetes/[Link]"),delimiter=",")
>>> X = data[:,0:8]
>>> y = data[:,8]

Model Architecture
Sequential Model
>>> from [Link] import Sequential
>>> model = Sequential()
>>> model2 = Sequential()
>>> model3 = Sequential()

Multilayer Perceptron (MLP)
Binary Classification
>>> from [Link] import Dense
>>> [Link](Dense(12,
                 input_dim=8,
                 kernel_initializer='uniform',
                 activation='relu'))
>>> [Link](Dense(8,kernel_initializer='uniform',activation='relu'))
>>> [Link](Dense(1,kernel_initializer='uniform',activation='sigmoid'))
Multi-Class Classification
>>> from [Link] import Dropout
>>> [Link](Dense(512,activation='relu',input_shape=(784,)))
>>> [Link](Dropout(0.2))
>>> [Link](Dense(512,activation='relu'))
>>> [Link](Dropout(0.2))
>>> [Link](Dense(10,activation='softmax'))
Regression
>>> [Link](Dense(64,activation='relu',input_dim=train_data.shape[1]))
>>> [Link](Dense(1))

Convolutional Neural Network (CNN)
>>> from [Link] import Activation,Conv2D,MaxPooling2D,Flatten
>>> [Link](Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
>>> [Link](Activation('relu'))
>>> [Link](Conv2D(32,(3,3)))
>>> [Link](Activation('relu'))
>>> [Link](MaxPooling2D(pool_size=(2,2)))
>>> [Link](Dropout(0.25))
>>> [Link](Conv2D(64,(3,3), padding='same'))
>>> [Link](Activation('relu'))
>>> [Link](Conv2D(64,(3, 3)))
>>> [Link](Activation('relu'))
>>> [Link](MaxPooling2D(pool_size=(2,2)))
>>> [Link](Dropout(0.25))
>>> [Link](Flatten())
>>> [Link](Dense(512))
>>> [Link](Activation('relu'))
>>> [Link](Dropout(0.5))
>>> [Link](Dense(num_classes))
>>> [Link](Activation('softmax'))

Recurrent Neural Network (RNN)
>>> from [Link] import Embedding,LSTM
>>> [Link](Embedding(20000,128))
>>> [Link](LSTM(128,dropout=0.2,recurrent_dropout=0.2))
>>> [Link](Dense(1,activation='sigmoid'))

Inspect Model
>>> model.output_shape                      Model output shape
>>> [Link]()                                Model summary representation
>>> model.get_config()                      Model configuration
>>> model.get_weights()                     List all weight tensors in the model

Compile Model
MLP: Binary Classification
>>> [Link](optimizer='adam',
           loss='binary_crossentropy',
           metrics=['accuracy'])
MLP: Multi-Class Classification
>>> [Link](optimizer='rmsprop',
           loss='categorical_crossentropy',
           metrics=['accuracy'])
MLP: Regression
>>> [Link](optimizer='rmsprop',
           loss='mse',
           metrics=['mae'])
Recurrent Neural Network
>>> [Link](loss='binary_crossentropy',
           optimizer='adam',
           metrics=['accuracy'])

Model Training
>>> [Link](x_train4,
           y_train4,
           batch_size=32,
           epochs=15,
           verbose=1,
           validation_data=(x_test4,y_test4))

Evaluate Your Model's Performance
>>> score = [Link](x_test,
                   y_test,
                   batch_size=32)

Prediction
>>> [Link](x_test4, batch_size=32)
>>> model3.predict_classes(x_test4,batch_size=32)

Save/ Reload Models
>>> from [Link] import load_model
>>> [Link]('model_file.h5')
>>> my_model = load_model('my_model.h5')

Model Fine-tuning
Optimization Parameters
>>> from [Link] import RMSprop
>>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> [Link](loss='categorical_crossentropy',
           optimizer=opt,
           metrics=['accuracy'])
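The `lr` and `decay` arguments in the optimizer above control the step size. As a rough sketch of their effect, the scalar update below follows the RMSprop rule with the time-based learning-rate decay used by older Keras releases. The values `rho=0.9` and `eps=1e-7` are assumed defaults, and `rmsprop_step` and the toy objective are hypothetical names for illustration, not part of Keras.

```python
import math

def rmsprop_step(param, grad, avg_sq, step, lr=0.0001, decay=1e-6,
                 rho=0.9, eps=1e-7):
    """Apply one RMSprop update to a single scalar parameter."""
    # Time-based decay: the learning rate shrinks as the step count grows.
    lr_t = lr / (1.0 + decay * step)
    # Exponential moving average of the squared gradient.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad
    # Scale the step by the root of that running average.
    param = param - lr_t * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq

# Toy objective f(w) = w**2 with gradient 2*w, for illustration only.
w, avg_sq = 1.0, 0.0
for step in range(1000):
    grad = 2.0 * w
    w, avg_sq = rmsprop_step(w, grad, avg_sq, step)

print(w)
```

Because the gradient is divided by its own running magnitude, each step has a size of roughly `lr`, which is why a small learning rate such as 0.0001 moves the parameter only slowly.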
Early Stopping
>>> from [Link] import EarlyStopping
>>> early_stopping_monitor = EarlyStopping(patience=2)
>>> [Link](x_train4,
           y_train4,
           batch_size=32,
           epochs=15,
           validation_data=(x_test4,y_test4),
           callbacks=[early_stopping_monitor])

Preprocessing                               Also see NumPy & Scikit-Learn
Sequence Padding
>>> from [Link] import sequence
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80)
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80)

One-Hot Encoding
>>> from [Link] import to_categorical
>>> Y_train = to_categorical(y_train, num_classes)
>>> Y_test = to_categorical(y_test, num_classes)
>>> Y_train3 = to_categorical(y_train3, num_classes)
>>> Y_test3 = to_categorical(y_test3, num_classes)

Train and Test Sets
>>> from sklearn.model_selection import train_test_split
>>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X,
                                                         y,
                                                         test_size=0.33,
                                                         random_state=42)

Standardization/Normalization
>>> from [Link] import StandardScaler
>>> scaler = StandardScaler().fit(x_train2)
>>> standardized_X = [Link](x_train2)
>>> standardized_X_test = [Link](x_test2)
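To make the Sequence Padding and One-Hot Encoding steps above more concrete, here is a minimal pure-Python sketch of the two transformations. It assumes the Keras defaults for `pad_sequences` (pad and truncate at the front, fill value 0). The helper names `pad_sequence` and `one_hot` are hypothetical, and the real Keras functions return NumPy arrays rather than lists.

```python
def pad_sequence(seq, maxlen, value=0):
    """Front-pad or front-truncate seq so it has exactly maxlen items."""
    # Keep only the last maxlen items (truncation happens at the front).
    kept = list(seq)[-maxlen:]
    # Fill the missing positions at the front with the padding value.
    return [value] * (maxlen - len(kept)) + kept

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 per row."""
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in labels]

# Made-up inputs, for illustration only.
padded = [pad_sequence(s, maxlen=4) for s in [[1, 2], [3, 4, 5, 6, 7]]]
encoded = one_hot([0, 2, 1], num_classes=3)

print(padded)
print(encoded)
```

After padding, every sequence has the same length, so a batch can be stacked into one rectangular array; one-hot rows match the shape of a softmax output layer with `num_classes` units.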
Python For Data Science Cheat Sheet
Bokeh
Learn Bokeh Interactively at [Link],
taught by Bryan Van de Ven, core contributor

Plotting With Bokeh
The Python interactive visualization library Bokeh enables high-performance visual presentation of large datasets in modern web browsers.

                       [21.4,4,109, 'Europe']]),
>>> from [Link] import figure
>>> p1 = figure(plot_width=300, tools='pan,box_zoom')

3 Renderers & Visual Customizations
Glyphs
Scatter Markers
>>> [Link]([Link]([1,2,3]), [Link]([3,2,1]),
           fill_color='white')
>>> [Link]([Link]([1.5,3.5,5.5]), [1,4,3],
           color='blue', size=1)
Line Glyphs
>>> [Link]([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line([Link]([[1,2,3],[5,6,7]]),
                  [Link]([[3,4,5],[3,2,1]]),
                  color="blue")

Customized Glyphs                           Also see Data
Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select')
>>> [Link]('mpg', 'cyl', source=cds_df,
           selection_color='red',
           nonselection_alpha=0.1)
Hover Glyphs
>>> hover = HoverTool(tooltips=None, mode='vline')
>>> p3.add_tools(hover)
Colormapping
>>> color_mapper = CategoricalColorMapper(
                       factors=['US', 'Asia', 'Europe'],

Rows & Columns Layout

>>> p = Scatter(df, x='mpg', y ='hp', marker='square',
                xlabel='Miles Per Gallon',
                ylabel='Horsepower')

5 Show or Save Your Plots