Pandas Cheat Sheet
http://pandas.pydata.org

Tidy Data
In a tidy data set: each variable is saved in its own column, and each observation is saved in its own row. No other format works as intuitively with pandas.
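A minimal illustration of the tidy-data idea above (column names and values here are invented for the example): `pd.melt` turns a wide table, with one column per measurement, into the one-observation-per-row layout that pandas reshaping tools expect.

```python
import pandas as pd

# Wide layout: one row per subject, one column per measurement.
wide = pd.DataFrame({"subject": ["s1", "s2"],
                     "height": [170, 180],
                     "weight": [65, 80]})

# Tidy layout: each row is one (subject, variable, value) observation.
tidy = pd.melt(wide, id_vars="subject",
               var_name="variable", value_name="value")

print(tidy)  # 2 subjects x 2 variables -> 4 observation rows
```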
Syntax – Creating DataFrames

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])
Specify values for each column.

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
Specify values for each row.

Reshaping Data – Change the layout of a data set

df.sort_values('mpg')    Order rows by values of a column (low to high).
df.sort_values('mpg', ascending=False)    Order rows by values of a column (high to low).
pd.melt(df)    Gather columns into rows.
df.pivot(columns='var', values='val')    Spread rows into columns.
df.rename(columns={'y': 'year'})    Rename the columns of a DataFrame.
df.sort_index()    Sort the index of a DataFrame.
df.reset_index()    Reset the index of a DataFrame to row numbers, moving the index to columns.
pd.concat([df1, df2])    Append rows of DataFrames.
pd.concat([df1, df2], axis=1)    Append columns of DataFrames.
df.drop(['Length', 'Height'], axis=1)    Drop columns from a DataFrame.
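The two constructors can be checked against each other; using the toy values from the table above, both calls produce the same frame.

```python
import pandas as pd

# Column-wise construction: a dict of column name -> values.
by_column = pd.DataFrame({"a": [4, 5, 6],
                          "b": [7, 8, 9],
                          "c": [10, 11, 12]},
                         index=[1, 2, 3])

# Row-wise construction: a list of rows plus explicit column labels.
by_row = pd.DataFrame([[4, 7, 10],
                       [5, 8, 11],
                       [6, 9, 12]],
                      index=[1, 2, 3],
                      columns=["a", "b", "c"])

print(by_column.equals(by_row))  # True
```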
Subset Observations (Rows)

df[df.Length > 7]    Extract rows that meet logical criteria.
df.drop_duplicates()    Remove duplicate rows (only considers columns).
df.sample(frac=0.5)    Randomly select fraction of rows.
df.sample(n=10)    Randomly select n rows.
df.iloc[10:20]    Select rows by position.
df.head(n)    Select first n rows.
df.tail(n)    Select last n rows.
df.nlargest(n, 'value')    Select and order top n entries.
df.nsmallest(n, 'value')    Select and order bottom n entries.

Subset Variables (Columns)

df[['width', 'length', 'species']]    Select multiple columns with specific names.
df['width'] or df.width    Select single column with specific name.
df.filter(regex='regex')    Select columns whose name matches regular expression regex.
df.loc[:, 'x2':'x4']    Select all columns between x2 and x4 (inclusive).
df.iloc[:, [1, 2, 5]]    Select columns in positions 1, 2 and 5 (first column is 0).
df.loc[df['a'] > 10, ['a', 'c']]    Select rows meeting logical condition, and only the specific columns.

regex (Regular Expressions) Examples
'\.'    Matches strings containing a period '.'
'Length$'    Matches strings ending with word 'Length'
'^Sepal'    Matches strings beginning with the word 'Sepal'
'^x[1-5]$'    Matches strings beginning with 'x' and ending with 1, 2, 3, 4 or 5
'^(?!Species$).*'    Matches strings except the string 'Species'

Create DataFrame with a MultiIndex

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))

Method Chaining

Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.
df = (pd.melt(df)
        .rename(columns={'variable': 'var',
                         'value': 'val'})
        .query('val >= 200'))

Logic in Python (and pandas)
<     Less than
>     Greater than
==    Equals
<=    Less than or equals
>=    Greater than or equals
!=    Not equal to
df.column.isin(values)    Group membership
pd.isnull(obj)     Is NaN
pd.notnull(obj)    Is not NaN
&, |, ~, ^, df.any(), df.all()    Logical and, or, not, xor, any, all
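The chained melt/rename/query pattern can be run end-to-end on toy data (column names and values below are invented; the threshold matches them):

```python
import pandas as pd

df = pd.DataFrame({"x1": [100, 250], "x2": [300, 50]})

# Gather columns into rows, rename the melt defaults, keep large values.
out = (pd.melt(df)
         .rename(columns={"variable": "var",
                          "value": "val"})
         .query("val >= 200"))

print(out)  # only the (x1, 250) and (x2, 300) observations survive
```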
http://pandas.pydata.org/ This cheat sheet inspired by the RStudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Written by Irv Lustig, Princeton Consultants.
Summarize Data

df['w'].value_counts()    Count number of rows with each unique value of variable.
len(df)    # of rows in DataFrame.
df['w'].nunique()    # of distinct values in a column.
df.describe()    Basic descriptive statistics for each column (or GroupBy).

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see below)) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:
sum()    Sum values of each object.
count()    Count non-NA/null values of each object.
median()    Median value of each object.
quantile([0.25, 0.75])    Quantiles of each object.
apply(function)    Apply function to each object.
min()    Minimum value in each object.
max()    Maximum value in each object.
mean()    Mean value of each object.
var()    Variance of each object.
std()    Standard deviation of each object.

Handling Missing Data

df.dropna()    Drop rows with any column having NA/null data.
df.fillna(value)    Replace all NA/null data with value.

Make New Columns

df.assign(Area=lambda df: df.Length*df.Height)    Compute and append one or more new columns.
df['Volume'] = df.Length*df.Height*df.Depth    Add single column.
pd.qcut(df.col, n, labels=False)    Bin column into n buckets.

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:
max(axis=1)    Element-wise max.
min(axis=1)    Element-wise min.
clip(lower=-10, upper=10)    Trim values at input thresholds.
abs()    Absolute value.

Group Data

df.groupby(by="col")    Return a GroupBy object, grouped by values in column named "col".
df.groupby(level="ind")    Return a GroupBy object, grouped by values in index level named "ind".
All of the summary functions listed above can be applied to a group. Additional GroupBy functions:
size()    Size of each group.
agg(function)    Aggregate group using function.
The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.
shift(1)    Copy with values shifted by 1.
shift(-1)    Copy with values lagged by 1.
rank(method='dense')    Ranks with no gaps.
rank(method='min')    Ranks. Ties get min rank.
rank(pct=True)    Ranks rescaled to interval [0, 1].
rank(method='first')    Ranks. Ties go to first value.
cumsum()    Cumulative sum.
cummax()    Cumulative max.
cummin()    Cumulative min.
cumprod()    Cumulative product.

Windows

df.expanding()    Return an Expanding object allowing summary functions to be applied cumulatively.
df.rolling(n)    Return a Rolling object allowing summary functions to be applied to windows of length n.

Plotting

df.plot.hist()    Histogram for each column.
df.plot.scatter(x='w', y='h')    Scatter chart using pairs of points.

Combine Data Sets

adf:  x1 = A, B, C;  x2 = 1, 2, 3
bdf:  x1 = A, B, D;  x3 = T, F, T

Standard Joins
pd.merge(adf, bdf, how='left', on='x1')    Join matching rows from bdf to adf.
pd.merge(adf, bdf, how='right', on='x1')    Join matching rows from adf to bdf.
pd.merge(adf, bdf, how='inner', on='x1')    Join data. Retain only rows in both sets.
pd.merge(adf, bdf, how='outer', on='x1')    Join data. Retain all values, all rows.

Filtering Joins
adf[adf.x1.isin(bdf.x1)]    All rows in adf that have a match in bdf.
adf[~adf.x1.isin(bdf.x1)]    All rows in adf that do not have a match in bdf.

Set-like Operations
ydf:  x1 = A, B, C;  x2 = 1, 2, 3
zdf:  x1 = B, C, D;  x2 = 2, 3, 4
pd.merge(ydf, zdf)    Rows that appear in both ydf and zdf (intersection).
pd.merge(ydf, zdf, how='outer')    Rows that appear in either or both ydf and zdf (union).
pd.merge(ydf, zdf, how='outer', indicator=True)
    .query('_merge == "left_only"')
    .drop(['_merge'], axis=1)    Rows that appear in ydf but not zdf (setdiff).
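The set-like merge recipes can be run on the ydf/zdf tables (values reproduced from the sheet):

```python
import pandas as pd

ydf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
zdf = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})

inner = pd.merge(ydf, zdf)               # intersection: rows B and C
union = pd.merge(ydf, zdf, how="outer")  # union: rows A, B, C, D

# Set difference: keep rows that came only from the left frame.
setdiff = (pd.merge(ydf, zdf, how="outer", indicator=True)
             .query('_merge == "left_only"')
             .drop(["_merge"], axis=1))

print(setdiff)  # only row A remains
```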
LEARN DATA SCIENCE ONLINE
Start Learning For Free - www.dataquest.io

KEY IMPORTS
We'll use shorthand in this cheat sheet:
df - A pandas DataFrame object
s - A pandas Series object
Import these to start:
import pandas as pd
import numpy as np
Data    Also see Lists, NumPy & Pandas
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> uniform_data = np.random.rand(10, 12)
>>> data = pd.DataFrame({'x': np.arange(1, 101),
...                      'y': np.random.normal(0, 4, 100)})
Seaborn also offers built-in data sets:
>>> titanic = sns.load_dataset("titanic")
>>> iris = sns.load_dataset("iris")

Boxplot
>>> sns.boxplot(x="alive", y="age", hue="adult_male", data=titanic)    Boxplot
>>> sns.boxplot(data=iris, orient="h")    Boxplot with wide-form data

Violinplot
>>> sns.violinplot(x="age", y="sex", hue="survived", data=titanic)    Violin plot

Plot
>>> plt.title("A Title")    Add plot title
>>> plt.ylabel("Survived")    Adjust the label of the y-axis
>>> plt.xlabel("Sex")    Adjust the label of the x-axis
>>> plt.ylim(0, 100)    Adjust the limits of the y-axis
>>> plt.xlim(0, 10)    Adjust the limits of the x-axis
>>> plt.setp(ax, yticks=[0, 5])    Adjust a plot property
>>> plt.tight_layout()    Adjust subplot params
Data    Also see NumPy, Pandas & Scikit-Learn
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, you split the data into training and test sets, for which you can use the train_test_split function from sklearn.model_selection.

Keras Data Sets
>>> from keras.datasets import boston_housing, mnist, cifar10, imdb
>>> (x_train, y_train), (x_test, y_test) = mnist.load_data()
>>> (x_train2, y_train2), (x_test2, y_test2) = boston_housing.load_data()
>>> (x_train3, y_train3), (x_test3, y_test3) = cifar10.load_data()
>>> (x_train4, y_train4), (x_test4, y_test4) = imdb.load_data(num_words=20000)
>>> num_classes = 10

Other
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"), delimiter=",")
>>> X = data[:, 0:8]
>>> y = data[:, 8]

Preprocessing    Also see NumPy & Scikit-Learn

Sequence Padding
>>> from keras.preprocessing import sequence
>>> x_train4 = sequence.pad_sequences(x_train4, maxlen=80)
>>> x_test4 = sequence.pad_sequences(x_test4, maxlen=80)

One-Hot Encoding
>>> from keras.utils import to_categorical
>>> Y_train = to_categorical(y_train, num_classes)
>>> Y_test = to_categorical(y_test, num_classes)
>>> Y_train3 = to_categorical(y_train3, num_classes)
>>> Y_test3 = to_categorical(y_test3, num_classes)

Train and Test Sets
>>> from sklearn.model_selection import train_test_split
>>> X_train5, X_test5, y_train5, y_test5 = train_test_split(X, y, test_size=0.33, random_state=42)

Standardization/Normalization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(x_train2)
>>> standardized_X = scaler.transform(x_train2)
>>> standardized_X_test = scaler.transform(x_test2)

Model Architecture

Binary Classification
>>> model.add(Dense(32, activation='relu', input_dim=100))
>>> model.add(Dense(1, activation='sigmoid'))

Regression
>>> model.add(Dense(64, activation='relu', input_dim=train_data.shape[1]))
>>> model.add(Dense(1))

Convolutional Neural Network (CNN)
>>> from keras.layers import Activation, Conv2D, MaxPooling2D, Flatten, Dropout
>>> model2.add(Conv2D(32, (3, 3), padding='same', input_shape=x_train.shape[1:]))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32, (3, 3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2, 2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Conv2D(64, (3, 3), padding='same'))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64, (3, 3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2, 2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
>>> model2.add(Activation('relu'))
>>> model2.add(Dropout(0.5))
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))

Recurrent Neural Network (RNN)
>>> from keras.layers import Embedding, LSTM
>>> model3.add(Embedding(20000, 128))
>>> model3.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
>>> model3.add(Dense(1, activation='sigmoid'))

Compile & Train
>>> model.compile(optimizer='rmsprop',
...               loss='binary_crossentropy',
...               metrics=['accuracy'])
>>> model.fit(data, labels, epochs=10, batch_size=32)
>>> predictions = model.predict(data)

Model Training
>>> model3.fit(x_train4, y_train4,
...            batch_size=32, epochs=15, verbose=1,
...            validation_data=(x_test4, y_test4))

Evaluate Your Model's Performance
>>> score = model3.evaluate(x_test, y_test, batch_size=32)

Prediction
>>> model3.predict(x_test4, batch_size=32)
>>> model3.predict_classes(x_test4, batch_size=32)

Save/Reload Models
>>> from keras.models import load_model
>>> model3.save('model_file.h5')
>>> my_model = load_model('my_model.h5')

Model Fine-tuning

Optimization Parameters
>>> from keras.optimizers import RMSprop
>>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> model2.compile(loss='categorical_crossentropy',
...                optimizer=opt,
...                metrics=['accuracy'])

Early Stopping
>>> from keras.callbacks import EarlyStopping
>>> early_stopping_monitor = EarlyStopping(patience=2)
>>> model3.fit(x_train4, y_train4,
...            batch_size=32, epochs=15,
...            validation_data=(x_test4, y_test4),
...            callbacks=[early_stopping_monitor])

DataCamp
Learn Python for Data Science Interactively
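to_categorical's one-hot encoding can be sketched in plain NumPy. This is a rough stand-in to show what the transformation does, not the Keras implementation itself:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels to one-hot rows, like keras.utils.to_categorical."""
    labels = np.asarray(labels, dtype=int)
    out = np.zeros((labels.size, num_classes), dtype=np.float32)
    out[np.arange(labels.size), labels] = 1.0  # set the column for each label
    return out

Y = one_hot([0, 2, 1], num_classes=3)
print(Y)  # one 1.0 per row, in columns 0, 2, 1
```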
Python For Data Science Cheat Sheet: PySpark – RDD Basics
Learn Python for data science interactively at www.DataCamp.com

Spark
PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark

SparkContext
>>> from pyspark import SparkContext
>>> sc = SparkContext(master='local[2]')

Inspect SparkContext
>>> sc.version    Retrieve SparkContext version
>>> sc.pythonVer    Retrieve Python version
>>> sc.master    Master URL to connect to
>>> str(sc.sparkHome)    Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())    Retrieve name of the Spark user running SparkContext
>>> sc.appName    Return application name
>>> sc.applicationId    Retrieve application ID
>>> sc.defaultParallelism    Return default level of parallelism
>>> sc.defaultMinPartitions    Default minimum number of partitions for RDDs

Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
...         .setMaster("local")
...         .setAppName("My app")
...         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf=conf)

Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

Loading Data

Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
...                        ("b",["p","r"])])

External Data
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information

Basic Information
>>> rdd.getNumPartitions()    List the number of partitions
>>> rdd.count()    Count RDD instances
3
>>> rdd.countByKey()    Count RDD instances by key
defaultdict(<type 'int'>, {'a': 2, 'b': 1})
>>> rdd.countByValue()    Count RDD instances by value
defaultdict(<type 'int'>, {('b',2): 1, ('a',2): 1, ('a',7): 1})
>>> rdd.collectAsMap()    Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum()    Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()    Check whether RDD is empty
True

Summary
>>> rdd3.max()    Maximum value of RDD elements
99
>>> rdd3.min()    Minimum value of RDD elements
0
>>> rdd3.mean()    Mean value of RDD elements
49.5
>>> rdd3.stdev()    Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()    Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)    Compute histogram by bins
([0, 33, 66, 99], [33, 33, 34])
>>> rdd3.stats()    Summary statistics (count, mean, stdev, max & min)

Applying Functions
>>> rdd.map(lambda x: x + (x[1], x[0])).collect()    Apply a function to each RDD element
[('a',7,7,'a'), ('a',2,2,'a'), ('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x + (x[1], x[0]))    Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x).collect()    Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

Selecting Data

Getting
>>> rdd.collect()    Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)    Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first()    Take first RDD element
('a', 7)
>>> rdd.top(2)    Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling
>>> rdd3.sample(False, 0.15, 81).collect()    Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering
>>> rdd.filter(lambda x: "a" in x).collect()    Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect()    Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect()    Return (key,value) RDD's keys
['a', 'a', 'b']

Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g)    Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data

Reducing
>>> rdd.reduceByKey(lambda x,y: x+y).collect()    Merge the RDD values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b)    Merge the RDD values
('a',7,'a',2,'b',2)

Grouping by
>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect()    Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect()    Group RDD by key
[('a',[7,2]),('b',[2])]

Aggregating
>>> seqOp = (lambda x,y: (x[0]+y, x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0], x[1]+y[1]))
>>> rdd3.aggregate((0,0), seqOp, combOp)    Aggregate RDD elements of each partition and then the results
(4950, 100)
>>> rdd.aggregateByKey((0,0), seqOp, combOp).collect()    Aggregate values of each RDD key
[('a',(9,2)), ('b',(2,1))]
>>> from operator import add
>>> rdd3.fold(0, add)    Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add).collect()    Merge the values for each key
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x).collect()    Create tuples of RDD elements by applying a function

Mathematical Operations
>>> rdd.subtract(rdd2).collect()    Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect()    Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()    Return the Cartesian product of rdd and rdd2

Sort
>>> rdd2.sortBy(lambda x: x[1]).collect()    Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect()    Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]

Repartitioning
>>> rdd.repartition(4)    New RDD with 4 partitions
>>> rdd.coalesce(1)    Decrease the number of partitions in the RDD to 1

Saving
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
...                      'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext
>>> sc.stop()

Execution
$ ./bin/spark-submit examples/src/main/python/pi.py

DataCamp
Learn Python for Data Science Interactively
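What aggregate((0,0), seqOp, combOp) computes can be sketched without Spark: seqOp folds each element into a per-partition (sum, count) accumulator, and combOp merges the partition results. This is a plain-Python simulation of the semantics, not the PySpark API:

```python
from functools import reduce

seqOp = lambda x, y: (x[0] + y, x[1] + 1)          # fold one element into (sum, count)
combOp = lambda x, y: (x[0] + y[0], x[1] + y[1])   # merge two partial (sum, count) pairs

def aggregate(partitions, zero, seq, comb):
    """Fold each partition with seq, then merge the partials with comb."""
    partials = [reduce(seq, part, zero) for part in partitions]
    return reduce(comb, partials, zero)

# range(100) split into two "partitions", mirroring rdd3 = sc.parallelize(range(100))
data = [list(range(50)), list(range(50, 100))]
print(aggregate(data, (0, 0), seqOp, combOp))  # (4950, 100)
```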
Python For Data Science Cheat Sheet: PySpark – SQL Basics
Learn Python for data science interactively at www.DataCamp.com

PySpark & Spark SQL

From RDDs
>>> from pyspark.sql.types import *

Infer Schema
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema
>>> people = parts.map(lambda p: Row(name=p[0],
...                                  age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
...           for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()
+--------+---+
|    name|age|
+--------+---+
|    Mine| 28|
|   Filip| 29|
|Jonathan| 30|
+--------+---+

From Spark Data Sources
JSON
>>> df = spark.read.json("customer.json")
>>> df.show()    (displays the address, age, firstName, lastName and phoneNumber columns)
>>> df2 = spark.read.load("people.json", format="json")
Parquet files
>>> df3 = spark.read.load("users.parquet")
TXT files
>>> df4 = spark.read.text("people.txt")

Duplicate Values
>>> df = df.dropDuplicates()

Queries
>>> from pyspark.sql import functions as F

Select
>>> df.select("firstName").show()    Show all entries in firstName column
>>> df.select("firstName", "lastName").show()
>>> df.select("firstName", "age").show()    Show all entries in firstName and age
>>> df.select("firstName",
...           df.lastName.like("Smith")).show()    Show firstName, and TRUE if lastName is like Smith
Startswith – Endswith
>>> df.select("firstName",
...           df.lastName.startswith("Sm")).show()    Show firstName, and TRUE if lastName starts with Sm
>>> df.select(df.lastName.endswith("th")).show()    Show last names ending in th
Substring
>>> df.select(df.firstName.substr(1, 3)
...             .alias("name")).collect()    Return substrings of firstName
Between
>>> df.select(df.age.between(22, 24)).show()    Show age: values are TRUE if between 22 and 24
Filter
>>> df.filter(df["age"] > 24).show()    Filter entries of age, only keep records whose value is >24

Add, Update & Remove Columns

Adding Columns
>>> df = df.withColumn('city', df.address.city) \
...        .withColumn('postalCode', df.address.postalCode) \
...        .withColumn('state', df.address.state) \
...        .withColumn('streetAddress', df.address.streetAddress) \
...        .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
...        .withColumn('telePhoneType', explode(df.phoneNumber.type))

Updating Columns
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns
>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)

GroupBy
>>> df.groupBy("age") \
...   .count() \
...   .show()    Group by age, count the members in the groups

Inspect Data
>>> df.dtypes    Return df column names and data types
>>> df.show()    Display the content of df
>>> df.head()    Return first n rows
>>> df.first()    Return first row
>>> df.take(2)    Return the first n rows
>>> df.schema    Return the schema of df
>>> df.describe().show()    Compute summary statistics
>>> df.columns    Return the columns of df
>>> df.count()    Count the number of rows in df
>>> df.distinct().count()    Count the number of distinct rows in df
>>> df.printSchema()    Print the schema of df
>>> df.explain()    Print the (logical and physical) plans

Running SQL Queries Programmatically

Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views
>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people").show()

Output

Data Structures
>>> rdd1 = df.rdd    Convert df into an RDD
>>> df.toJSON().first()    Convert df into an RDD of strings
>>> df.toPandas()    Return the contents of df as a pandas DataFrame

Write & Save to Files
>>> df.select("firstName", "city") \
...   .write \
...   .save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
...   .write \
...   .save("namesAndAges.json", format="json")

Repartitioning
>>> df.repartition(10).rdd.getNumPartitions()    df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()    df with 1 partition

Stopping SparkSession
>>> spark.stop()

DataCamp
Learn Python for Data Science Interactively
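The groupBy/count and filter queries above have direct pandas analogues, which are handy for checking expected results on small data. This is a pandas stand-in to illustrate the semantics, not the PySpark API (names and values invented):

```python
import pandas as pd

df = pd.DataFrame({"firstName": ["John", "Jane", "Anna"],
                   "age": [25, 21, 25]})

# df.groupBy("age").count().show() analogue
counts = df.groupby("age").size().reset_index(name="count")

# df.filter(df["age"] > 24).show() analogue
adults = df[df["age"] > 24]

print(counts)
print(adults["firstName"].tolist())
```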
Python For Data Science Cheat Sheet: Bokeh
Learn Bokeh interactively at www.DataCamp.com, taught by Bryan Van de Ven, core contributor

Plotting With Bokeh
The Python interactive visualization library Bokeh enables high-performance visual presentation of large datasets in modern web browsers.
Bokeh's mid-level general-purpose bokeh.plotting interface is centered around two main components: data and glyphs (data + glyphs = plot).

The basic steps to creating plots with the bokeh.plotting interface are:
1. Prepare some data: Python lists, NumPy arrays, pandas DataFrames and other sequences of values
2. Create a new plot
3. Add renderers for your data, with visual customizations
4. Specify where to generate the output
5. Show or save the results

>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show
>>> x = [1, 2, 3, 4, 5]                        # Step 1
>>> y = [6, 7, 2, 4, 5]
>>> p = figure(title="simple line example",    # Step 2
...            x_axis_label='x',
...            y_axis_label='y')
>>> p.line(x, y, legend="Temp.", line_width=2) # Step 3
>>> output_file("lines.html")                  # Step 4
>>> show(p)                                    # Step 5

1  Data    Also see Lists, NumPy & Pandas
Under the hood, your data is converted to ColumnDataSources. You can also do this manually:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[33.9, 4, 65, 'US'],
...                             [32.4, 4, 66, 'Asia'],
...                             [21.4, 4, 109, 'Europe']]),
...                   columns=['mpg', 'cyl', 'hp', 'origin'],
...                   index=['Toyota', 'Fiat', 'Volvo'])
>>> from bokeh.models import ColumnDataSource
>>> cds_df = ColumnDataSource(df)

3  Renderers & Visual Customizations

Glyphs
Scatter Markers
>>> p1.circle(np.array([1,2,3]), np.array([3,2,1]),
...           fill_color='white')
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3],
...           color='blue', size=1)
Line Glyphs
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]),
...               pd.DataFrame([[3,4,5],[3,2,1]]),
...               color="blue")

Customized Glyphs
Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select')
>>> p.circle('mpg', 'cyl', source=cds_df,
...          selection_color='red',
...          nonselection_alpha=0.1)
Hover Glyphs
>>> from bokeh.models import HoverTool
>>> hover = HoverTool(tooltips=None, mode='vline')
>>> p3.add_tools(hover)
Colormapping
>>> from bokeh.models import CategoricalColorMapper
>>> color_mapper = CategoricalColorMapper(
...     factors=['US', 'Asia', 'Europe'],
...     palette=['blue', 'red', 'green'])
>>> p3.circle('mpg', 'cyl', source=cds_df,
...           color=dict(field='origin',
...                      transform=color_mapper),
...           legend='Origin')

Legend Location
Inside Plot Area
>>> p.legend.location = 'bottom_left'
Outside Plot Area
>>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]))
>>> r2 = p2.line([1,2,3,4], [3,4,5,6])
>>> legend = Legend(items=[("One", [p1, r1]), ("Two", [r2])],
...                 location=(0, -30))
>>> p.add_layout(legend, 'right')
Legend Orientation
>>> p.legend.orientation = "horizontal"
>>> p.legend.orientation = "vertical"
Legend Background & Border
>>> p.legend.border_line_color = "navy"
>>> p.legend.background_fill_color = "white"

Rows & Columns Layout
Rows
>>> from bokeh.layouts import row
Grid Layout
>>> from bokeh.layouts import gridplot
>>> row1 = [p1, p2]
>>> row2 = [p3]
>>> layout = gridplot([[p1, p2], [p3]])
Tabbed Layout
>>> from bokeh.models.widgets import Panel, Tabs
>>> tab1 = Panel(child=p1, title="tab1")
>>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])
Linked Plots
Linked Axes
>>> p2.x_range = p1.x_range
>>> p2.y_range = p1.y_range
Linked Brushing
>>> p4 = figure(plot_width=100,
...             tools='box_select,lasso_select')
>>> p4.circle('mpg', 'cyl', source=cds_df)
>>> p5 = figure(plot_width=200,
...             tools='box_select,lasso_select')
>>> p5.circle('mpg', 'hp', source=cds_df)
>>> layout = row(p4, p5)

4  Output & Export
Notebook
>>> from bokeh.io import output_notebook, show
>>> output_notebook()
HTML
Standalone HTML
>>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
>>> html = file_html(p, CDN, "my_plot")
>>> from bokeh.io import output_file, show
>>> output_file('my_bar_chart.html', mode='cdn')
Components
>>> from bokeh.embed import components
>>> script, div = components(p)
PNG
>>> from bokeh.io import export_png
>>> export_png(p, filename="plot.png")
SVG
>>> from bokeh.io import export_svgs
>>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")
DJANGO CHEAT SHEET
Version 1.0
www.mercurytide.co.uk/careers

Template tags    (… – end tag required)
{# one line comment #}
autoescape… on/off
block… name
comment…
cycle "one" "two" "three"
debug
extends "template"
filter… filter1|filter2
firstof var1 var2 "default"
for… item in a_list
if…else…endif    boolean expression
ifchanged… var
ifequal… var1 var2
ifnotequal… var1 var2
include "template"
load tag_library
now "date format"
regroup list_of_dicts by key as var
spaceless…
templatetag openblock    open or close block, brace, variable, comment
url view arg,kwarg=value
widthratio a b c    a÷b×c
with… var1.attr as var2

Template filters

GENERAL
default value
default_if_none value
yesno "yes,no,none"
stringformat "s"    python "%" formatting

ESCAPING
escape
force_escape
safe    don't escape
escapejs    \x20 escapes
iriencode    IRI to URI
urlencode    %20 escapes

LISTS
first
last
random
length
length_is number
join ", "
make_list    makes list of digits/characters
slice "1:5"
dictsort "key"
dictsortreversed "key"
unordered_list    adds <li> tags

NUMBERS
add 5
divisibleby 3
floatformat decimal_places
filesizeformat
get_digit n    nth-rightmost digit from integer
pluralize "y,ies"

DATES & TIMES
date "date_format"
time "date_format"
timesince datetime
timeuntil datetime

TEXT FORMATTING
lower
upper
title
ljust width
wordcount

Template date formats
TIME: h 01 to 12; G 0 to 23; f 1, 1:30
AM & PM: A AM, PM
DAY: d 01 to 31
DAY OF WEEK: l Friday
MONTH: F January; N Jan., Feb., March, May
YEAR: Y four-digit year
MISC: T EST, MDT; Z -43200 to 43200 (seconds)
Model fields

Common options:
null =False
blank =False
choices =list_of_tuples
db_column ="column_name"
db_index =False
db_tablespace ="tablespace_name"
default =value_or_func
editable =True
help_text ="text"
primary_key =False
unique =False
unique_for_date ="date_field"
unique_for_month ="date_field"
unique_for_year ="date_field"

Field types:
BooleanField
NullBooleanField
CharField    max_length
TextField
SlugField    max_length =50
EmailField    max_length =75
FilePathField    path ="/home/images", match =r"\.jpg$", recursive =True, max_length =100
IPAddressField
URLField    verify_exists =True, max_length =200
IntegerField
PositiveIntegerField
SmallIntegerField
PositiveSmallIntegerField
AutoField
DecimalField    max_digits =10, decimal_places =2
FloatField
CommaSeparatedIntegerField    max_length =50
XMLField    schema_path =path_to_RelaxNG_schema
DateField    auto_now =False, auto_now_add =False
DateTimeField    auto_now =False, auto_now_add =False
TimeField    auto_now =False, auto_now_add =False
FileField    upload_to ="uploads/", max_length =100, storage =FileSystemStorage
ImageField    upload_to ="uploads/", max_length =100, storage =FileSystemStorage, height_field ="field_name", width_field ="field_name"
ForeignKey(model)    related_name ="model_set", limit_choices_to =query_kwargs, to_field ="key_field"
ManyToManyField(model)    related_name ="model_set", limit_choices_to =query_kwargs, through ="IntermediateModel", symmetrical =True
OneToOneField(model)    parent_link ="field"
GenericForeignKey("content_type_field", "object_id_field")

Meta class options
abstract =False
db_table ="table_name"
db_tablespace ="tablespace_name"
get_latest_by ="field_name"
order_with_respect_to ="fk_field_name"
ordering =list_of_columns
permissions =list_of_tuples
unique_together =list_of_tuples
verbose_name ="Model"
verbose_name_plural ="Models"

Form fields

Common options:
required =True
label ="Field name"
initial ={}
widget =Widget
help_text ="text"
error_messages ={}

Field types:
BooleanField
NullBooleanField
CharField    max_length, min_length
IntegerField    max_value, min_value
FloatField    max_value, min_value
DecimalField    max_value, min_value, max_digits, decimal_places
ChoiceField    choices =list_of_tuples
MultipleChoiceField    choices =list_of_tuples
FileField
ImageField
FilePathField    path ="/home/images", match =r"\.jpg$", recursive =False
DateField    input_formats =list_of_formats
DateTimeField    input_formats =list_of_formats
TimeField    input_formats =list_of_formats
EmailField    max_length, min_length
URLField    max_length, min_length, verify_exists =False, validator_user_agent
RegexField    regex, max_length, min_length
IPAddressField
ModelChoiceField    queryset, empty_label ="----", cache_choices =False
ModelMultipleChoiceField    queryset, cache_choices =False

Form error_messages keys
required, max_length, min_length, invalid, invalid_choice, max_value, min_value, max_digits, max_decimal_places, max_whole_digits, missing, empty, invalid_image, invalid_list, invalid_link

Copyright 2008 Mercurytide Ltd. Released under the Creative Commons Attribution-Share Alike 2.5 UK: Scotland Licence.
Python Cheat Sheet
by Dave Child (DaveChild) via cheatography.com/1/cs/19/

Python Class Special Methods
__str__()

Python Datetime Methods
replace()
isoformat()
utcoffset()
dst()
tzname()
strftime(format)

strftime directives:
%A    Weekday (Sunday)
%p    AM or PM
%w    Weekday (0 to 6)
%x    Date
%X    Time
%Y    Year (2008)
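The directives above work with Python's built-in datetime (the example date below is invented; 2 March 2008 happens to be a Sunday):

```python
from datetime import datetime

d = datetime(2008, 3, 2, 14, 30)  # Sunday, 2 March 2008, 2:30 in the afternoon

print(d.strftime("%A"))  # full weekday name
print(d.strftime("%w"))  # weekday number, Sunday = 0
print(d.strftime("%Y"))  # four-digit year
print(d.strftime("%p"))  # AM/PM marker (locale-dependent)
```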