Three Practical Use Cases for Databricks on AWS

Solve your big data and AI challenges
Contents

Introduction
Getting started
Conclusion
Introduction

Rather than describe what Databricks does, we're going to show you. In this eBook, you'll find three scenarios where Databricks helps data scientists take on specific challenges and what the outcomes look like. We will cover:

• Churn analysis
• A movie recommendation engine
• Intrusion detection
Getting started

The demos in this eBook show how Databricks notebooks help teams analyze and solve problems. You can read through the demos here, or you can try using Databricks yourself by signing up for a free account.

If you do want to try out the notebooks, once you've set up your free account, use the following initial set-up instructions for any notebook.
Once you have selected Databricks in the AWS portal, you can start running it by creating a cluster. To run these notebooks, you can accept all the default settings in Databricks for creating your cluster.

1. Click the Clusters icon in the left bar
2. Select "Create Cluster"
3. Input a cluster name
4. Click the "Create Cluster" button

You are all set to import the Databricks notebooks. To import the notebooks:

1. Click the Workspace icon
3. Click the dropdown for Import. Drop your notebook files into this dialog.
4. In the notebook, click the dropdown that says "Detached"
5. Select the cluster you created in the previous step
NOTEBOOK 1
Churn analysis demo

Customer churn, also known as customer attrition, customer turnover or customer defection, is the loss of clients or customers. Predicting and preventing customer churn is vital to a range of businesses.
In this notebook, we will use a pre-built model on Databricks to analyze customer churn. With this model, we can predict when a customer is going to churn with 90% accuracy, so we can set up a report to show customers who are about to churn and then provide a remediation strategy, such as a special offer, to try to prevent them from churning. In this example we are looking at customers of cellular carriers, and the goal is to keep them from jumping to another carrier.
This notebook:
• Contains functionality that is relevant to data scientists, data engineers and business users
• Lends itself to a data-driven storytelling approach that demonstrates how notebooks can be
used within Databricks
• Illustrates a simple churn analysis workflow. We use a Customer Churn data set from the
UCI Machine Learning Repository
Step 1: Ingest Churn Data to a Notebook

We download the UCI data set hosted at the UCI site.

%sh
mkdir /tmp/churn
wget http://www.sgi.com/tech/mlc/db/churn.data -O /tmp/churn/churn.data
wget http://www.sgi.com/tech/mlc/db/churn.test -O /tmp/churn/churn.test

--2017-08-25 19:52:36-- http://www.sgi.com/tech/mlc/db/churn.data
Resolving www.sgi.com (www.sgi.com)... 192.48.178.134
Connecting to www.sgi.com (www.sgi.com)|192.48.178.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376493 (368K) [text/plain]
Saving to: ‘/tmp/churn/churn.data’
2017-08-25 19:52:37 (485 KB/s) - ‘/tmp/churn/churn.data’ saved [376493/376493]

--2017-08-25 19:52:37-- http://www.sgi.com/tech/mlc/db/churn.test
Resolving www.sgi.com (www.sgi.com)... 192.48.178.134
Connecting to www.sgi.com (www.sgi.com)|192.48.178.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 188074 (184K) [text/plain]
Saving to: ‘/tmp/churn/churn.test’

From the churn.names metadata file, we can see the meaning of the data columns:

• state: discrete.
• account length: continuous.
• area code: continuous.
• phone number: discrete.
• international plan: discrete.
• voice mail plan: discrete.
• number vmail messages: continuous.
• total day minutes: continuous.
• total day calls: continuous.
• total day charge: continuous.
• total eve minutes: continuous.
• total eve calls: continuous.
• total eve charge: continuous.
• total night minutes: continuous.
• total night calls: continuous.
• total night charge: continuous.
• total intl minutes: continuous.
• total intl calls: continuous.
• total intl charge: continuous.
• number customer service calls: continuous.
• churned: discrete <- This is the label we wish to predict, indicating whether or not the customer churned.
Mount the data locally. The second step is to create the schema for the DataFrame.

df = (spark.read.option("delimiter", ",")
  .option("inferSchema", "true")
  .schema(schema)
  .csv("dbfs:/mnt/churn/churn.data"))
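The cell above reads from dbfs:/mnt/churn/ and passes a schema variable, neither of which is set up in this excerpt. A minimal sketch of what those preparation steps might look like, with the column types taken from the printSchema output further down (the copy step and helper names are assumptions, not the notebook's exact code):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Copy the downloaded files from the driver's local disk into DBFS (path assumed)
dbutils.fs.mkdirs("dbfs:/mnt/churn")
dbutils.fs.cp("file:/tmp/churn/churn.data", "dbfs:/mnt/churn/churn.data")
dbutils.fs.cp("file:/tmp/churn/churn.test", "dbfs:/mnt/churn/churn.test")

# Build the schema: strings for the discrete columns, doubles for the continuous ones
string_cols = {"state", "phone_number", "international_plan", "voice_mail_plan", "churned"}
all_cols = ["state", "account_length", "area_code", "phone_number",
            "international_plan", "voice_mail_plan", "number_vmail_messages",
            "total_day_minutes", "total_day_calls", "total_day_charge",
            "total_eve_minutes", "total_eve_calls", "total_eve_charge",
            "total_night_minutes", "total_night_calls", "total_night_charge",
            "total_intl_minutes", "total_intl_calls", "total_intl_charge",
            "number_customer_service_calls", "churned"]
schema = StructType([
    StructField(c, StringType() if c in string_cols else DoubleType(), True)
    for c in all_cols
])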
display(df)

[Table output: the first rows of the DataFrame, with columns state, account_length, area_code, phone_number, international_plan, voice_mail_plan, number_vmail_messages, total_day_minutes, total_day_calls, total_day_charge, total_eve_minutes, ...]
Step 2: Enrich the Data to Get Additional Insights on the Churn Data Set

We count the number of data points and separate the churned from the unchurned. The data is converted to a Parquet file, which is a data format that is well suited to analytics on large data sets.

from pyspark.sql.functions import col

numCases = df.count()
numChurned = df.filter(col("churned") == ' True.').count()
numUnchurned = numCases - numChurned

print("Total Number of cases: {0:,}".format(numCases))
print("Total Number of cases churned: {0:,}".format(numChurned))
print("Total Number of cases unchurned: {0:,}".format(numUnchurned))
Total Number of cases: 3,333
Total Number of cases churned: 483
Total Number of cases unchurned: 2,850
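The Parquet conversion mentioned above, and the temporary table that the %sql cells below query, are not shown in this excerpt. A minimal sketch of what they might look like (the output path is an assumption; the temp_idsdata view name is taken from the queries that follow):

# Write the churn DataFrame out as Parquet and expose it to SQL cells
df.write.mode("overwrite").parquet("/tmp/churnParquet")

parquet_df = spark.read.parquet("/tmp/churnParquet")
parquet_df.createOrReplaceTempView("temp_idsdata")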
%sql
SELECT state, count(*) as statewise_churn FROM temp_idsdata where churned = " True." group by state
Step 4: Visualization

Show the distribution of the account length.

display(df.select("account_length").orderBy("account_length"))

df.printSchema()
root
|-- state: string (nullable = true)
|-- account_length: double (nullable = true)
|-- area_code: double (nullable = true)
|-- phone_number: string (nullable = true)
|-- international_plan: string (nullable = true)
|-- voice_mail_plan: string (nullable = true)
|-- number_vmail_messages: double (nullable = true)
|-- total_day_minutes: double (nullable = true)
|-- total_day_calls: double (nullable = true)
|-- total_day_charge: double (nullable = true)
|-- total_eve_minutes: double (nullable = true)
|-- total_eve_calls: double (nullable = true)
|-- total_eve_charge: double (nullable = true)
|-- total_night_minutes: double (nullable = true)
|-- total_night_calls: double (nullable = true)
|-- total_night_charge: double (nullable = true)
|-- total_intl_minutes: double (nullable = true)
|-- total_intl_calls: double (nullable = true)
|-- total_intl_charge: double (nullable = true)
|-- number_customer_service_calls: double (nullable = true)
|-- churned: string (nullable = true)
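The next cell uses indexer1 and vecAssembler without showing how they were created. Based on the explainParams output below, which lists the assembled input columns and the churnedIndex label, a minimal sketch might look like this (assumed, not the notebook's exact code):

from pyspark.ml.feature import StringIndexer, VectorAssembler

# Fit a StringIndexer so the string churned label becomes a numeric churnedIndex column
indexer1 = StringIndexer(inputCol="churned", outputCol="churnedIndex").fit(df)

# Assemble the numeric columns listed in the explainParams output into one feature vector
vecAssembler = VectorAssembler(
    inputCols=["account_length", "total_day_calls", "total_eve_calls",
               "total_night_calls", "total_intl_calls",
               "number_customer_service_calls"],
    outputCol="features")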
from pyspark.sql.functions import lit
from pyspark.ml.classification import GBTClassifier

indexed1 = indexer1.transform(df)
finaldf = indexed1.withColumn("censor", lit(1))

aft = GBTClassifier()
aft.setLabelCol("churnedIndex")
print(aft.explainParams())
inputCols: input column names. (current: [‘account_length’, ‘total_day_calls’, ‘total_eve_calls’, ‘total_night_calls’, ‘total_intl_calls’, ‘number_customer_service_calls’])
outputCol: output column name. (default: VectorAssembler_402dae9a2a13c5e1ea7f__output, current: features)
cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance.
Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. (default: 10)
featuresCol: features column name. (default: features)
labelCol: label column name. (default: label, current: churnedIndex)
lossType: Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic (default: logistic)
maxBins: Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature. (default: 32)
maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 5)
maxIter: max number of iterations (>= 0). (default: 20)
maxMemoryInMB: Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed
this size. (default: 256)
minInfoGain: Minimum information gain for a split to be considered at a tree node. (default: 0.0)
minInstancesPerNode: Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than
minInstancesPerNode, the split will be discarded as invalid. Should be >= 1. (default: 1)
predictionCol: prediction column name. (default: prediction)
seed: random seed. (default: 2857134701650851239)
stepSize: Step size to be used for each iteration of optimization (>= 0). (default: 0.1)
subsamplingRate: Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0)
from pyspark.ml import Pipeline

# We will use the new spark.ml pipeline API. If you have worked with scikit-learn this will be very familiar.
lrPipeline = Pipeline()

# Now we'll tell the pipeline to first create the feature vector, and then run the gradient-boosted tree classifier
lrPipeline.setStages([vecAssembler, aft])

predictionsAndLabelsDF = lrPipelineModel.transform(finaldf)
confusionMatrix = predictionsAndLabelsDF.select('churnedIndex', 'prediction')

print(metrics.falsePositiveRate(0.0))
print(metrics.accuracy)
0.0514705882353
0.891689168917
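The cells above reference lrPipelineModel and metrics without showing how they were produced. A minimal sketch of the missing steps, assuming the pipeline is fit on finaldf and evaluated with MLlib's MulticlassMetrics (the notebook may instead fit on a separate training split):

from pyspark.mllib.evaluation import MulticlassMetrics

# Fit the pipeline (feature assembly + GBT classifier) to get the model used above
lrPipelineModel = lrPipeline.fit(finaldf)
predictionsAndLabelsDF = lrPipelineModel.transform(finaldf)

# Build (prediction, label) pairs and compute the metrics that were printed
predictionAndLabels = (predictionsAndLabelsDF
                       .select("prediction", "churnedIndex").rdd
                       .map(lambda row: (float(row[0]), float(row[1]))))
metrics = MulticlassMetrics(predictionAndLabels)

print(metrics.falsePositiveRate(0.0))  # false positive rate for label 0.0 (the majority class)
print(metrics.accuracy)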
Results Interpretation

The plot below shows the confusion matrix for the churn predictions, rendered with Matplotlib.
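The plotting cell below assumes cm, a 2x2 confusion-matrix array, plus the usual imports, none of which appear in this excerpt. A minimal sketch of how cm might be derived from the confusionMatrix DataFrame built earlier (assumed, not the notebook's exact code):

import itertools
import numpy as np
import matplotlib.pyplot as plt

# Aggregate counts of (true label, predicted label) pairs into a 2x2 array
counts = confusionMatrix.groupBy("churnedIndex", "prediction").count().collect()
cm = np.zeros((2, 2))
for row in counts:
    cm[int(row["churnedIndex"]), int(row["prediction"])] = row["count"]

# Draw the base heatmap that the annotation code below writes on top of
plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
plt.title("Churn confusion matrix")
plt.colorbar()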
fmt = '.2f'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], fmt),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
display()
NOTEBOOK 2
Movie recommendation engine

Recommendation engines are used across many industries, from external use on retail sites to internal use on employee sites. A recommendation engine delivers recommendations to end users based on the data points that matter to them.
This demonstration is a simple example of a consumer using a movie website to select a movie to
watch. Recommendation engines look at historical data on what people have selected, and then
predict the selection the user would make.
This notebook:
• Is built on the Databricks platform and uses a machine learning ALS recommendation
algorithm to generate recommendations on movie choices
• Demonstrates a movie recommendation analysis workflow, using movie data from the
Kaggle data set
• Provides one place to create the entire analytical application, allowing users to collaborate
with other participants
Select 10 random movies from the most rated, as those are likely to be
commonly recognized movies. Create Databricks widgets to allow a user to
enter ratings for those movies.
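The widget-creation code is not included in this excerpt. A minimal sketch of what it might look like with the Databricks widgets API (the widget names are hypothetical; the most_rated_movies table and its name column are taken from the SQL cell further down):

# Create a rating widget for each of the 10 selected movies
most_rated = sqlContext.table("most_rated_movies").limit(10).collect()
for i, movie in enumerate(most_rated):
    dbutils.widgets.text("movie_%d" % i, "5", movie["name"])

# Later, read back the ratings the user entered
user_ratings = [float(dbutils.widgets.get("movie_%d" % i)) for i in range(10)]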
Step 2: Enrich the Data and Prep for Modeling

Step 3: Model Creation

Fit an ALS model on the ratings table.

ratings = sqlContext.table("ratings")
ratings = ratings.withColumn("rating", ratings.rating.cast("float"))

rmse = evaluator.evaluate(predictions)
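The ALS fit, the predictions DataFrame and the evaluator used in the cell above are not shown in this excerpt. A minimal sketch, assuming spark.ml's ALS with a train/test split (column names follow the output table below; the hyperparameters and split are assumptions):

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Split the ratings and fit an ALS collaborative-filtering model
training, test = ratings.randomSplit([0.8, 0.2], seed=42)
als = ALS(userCol="user_id", itemCol="movie_id", ratingCol="rating")
model = als.fit(training)

# Score the held-out ratings and measure RMSE
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)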
user_id   movie_id   rating   prediction
0         …          5        NaN
0         …          5        NaN
0         …          3        NaN
0         …          4        NaN
E B O O K : T H R E E P R A C T I C A L U S E C A S E S F O R D ATA B R I C K S O N AW S 23
Results Interpretation

The table shown below gives the top 10 recommended movie choices for the user, based on the predicted outcomes using the movie demographics and the ratings provided by the user.

%sql
SELECT
  name, prediction
from
  myPredictions
join
  most_rated_movies on myPredictions.movie_id = most_rated_movies.movie_id
order by
  prediction desc
LIMIT
  10
name prediction
Casablanca 5.278496
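The myPredictions table queried above is not created anywhere in this excerpt. A minimal sketch of how it might be built: score the most-rated movies for the new user (user_id 0, as in the prediction table shown earlier) with the fitted ALS model and register the result as a temp view (assumed, not the notebook's exact code):

from pyspark.sql.functions import lit

# Score every movie in most_rated_movies for the new user and expose it to SQL
to_score = (sqlContext.table("most_rated_movies")
            .select("movie_id")
            .withColumn("user_id", lit(0)))
my_predictions = model.transform(to_score)
my_predictions.createOrReplaceTempView("myPredictions")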
NOTEBOOK 3

Intrusion detection system demo

An intrusion detection system (IDS) is a device or software application that monitors a network or system for malicious activity or policy violations.

This notebook demonstrates how a user can better detect web threats. We show how to monitor network activity logs in real time to generate suspicious activity alerts, support a security operations center investigation into suspicious activity, and develop network propagation models that map the network surface and entity movement to identify penetration points.
This notebook:

• Is a pre-built solution on top of Apache Spark™, written in Scala inside the Databricks platform
• Uses logistic regression to identify intrusions by looking for deviations in behavior, which helps identify new attacks
• Demonstrates how the first three insights in the results are gained through visualization
• Allows data scientists and data engineers to improve the accuracy by getting more data or improving the model
Unzip the data and upload CIDDS-001/traffic/ExternalServer/*.csv from the unzipped folder to Databricks.
display(idsdata)
Date first seen Duration Proto Src IP Addr Src Pt Dst IP Addr Dst Pt Packets Bytes Flows Flags Tos class attackType attackID attackDescription
2017-03-15T00:01:16.632+0000 0 TCP 192.168.100.5 445 192.168.220.16 58544 1 108 1 .AP... 0 normal --- --- ---
2017-03-15T00:01:16.552+0000 0 TCP 192.168.100.5 445 192.168.220.15 48888 1 108 1 .AP... 0 normal --- --- ---
2017-03-15T00:01:16.551+0000 0.004 TCP 192.168.220.15 48888 192.168.100.5 445 2 174 1 .AP... 0 normal --- --- ---
2017-03-15T00:01:16.631+0000 0.004 TCP 192.168.220.16 58844 192.168.100.5 445 2 174 1 .AP... 0 normal --- --- ---
2017-03-15T00:01:16.552+0000 0 TCP 192.168.100.5 445 192.168.220.15 48888 1 108 1 .AP... 0 normal --- --- ---
2017-03-15T00:01:16.631+0000 0.004 TCP 192.168.220.16 58844 192.168.100.5 445 2 174 1 .AP... 0 normal --- --- ---
2017-03-15T00:01:17.432+0000 0 TCP 192.168.220.9 37884 192.168.100.5 445 1 66 1 .AP... 0 normal --- --- ---
val newNames = Seq("datefirstseen", "duration", "proto", "srcip", "srcpt", "dstip", "dstpt", "packets", "bytes", "flows", "flags", "tos", "transtype", "label", "attackid", "attackdescription")
val dfRenamed = idsdata.toDF(newNames: _*)
val dfReformat = dfRenamed.select("label", "datefirstseen", "duration", "proto", "srcip", "srcpt", "dstip", "dstpt", "packets", "bytes", "flows", "flags", "tos", "transtype", "attackid", "attackdescription")
newNames: Seq[String] = List(datefirstseen, duration, proto, srcip, srcpt, dstip, dstpt, packets, bytes, flows, flags, tos, transtype, label, attackid, attackdescription)
dfRenamed: org.apache.spark.sql.DataFrame = [datefirstseen: timestamp, duration: double ... 14 more fields]
dfReformat: org.apache.spark.sql.DataFrame = [label: string, datefirstseen: timestamp ... 14 more fields]
Step 2: Enrich the Data to Get Additional Insights on the IDS Data Set

We create a temporary table from the file location "/tmp/wesParquet" in Parquet file format. Parquet is the preferred file format because it is optimized for the analytics run from notebooks on the Databricks on AWS platform.

%sql
CREATE TEMPORARY TABLE temp_idsdata
USING parquet
OPTIONS (
  path "/tmp/wesParquet"
)

Calculate statistics on the content sizes returned.

%sql
select min(trim(bytes)) as min_bytes, max(trim(bytes)) as max_bytes, avg(trim(bytes)) as avg_bytes from temp_idsdata

min_bytes   max_bytes   avg_bytes
1.0 M       99995       1980.1018585682032
Step 4: Visualization

Visualizing and finding outliers: view a list of IP addresses that have accessed the server more than N times.

%sql
-- Use the parameterized query option to allow a viewer to dynamically specify a value for N.
-- Note how it's not necessary to worry about limiting the number of results.
-- The number of values returned is automatically limited to 1000,
-- but there are options to view a plot that would contain all the data to view the trends.
SELECT srcip, COUNT(*) AS total FROM temp_idsdata GROUP BY srcip HAVING total > $N order by total desc

Command skipped
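The $N parameter in the query above is supplied by a Databricks widget. A minimal way to define one from a Python cell (the widget name matches $N above; the default value is just an example):

%python
# Create the N widget referenced by $N in the SQL cell above
dbutils.widgets.text("N", "100", "N")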
Explore statistics about the protocol used for the attack using Spark SQL
%sql
-- Display a plot of the distribution of the number of hits across the endpoints.
SELECT Proto, count(*) as num_hits FROM temp_idsdata GROUP BY Proto ORDER BY num_hits DESC
Explore statistics about the protocol used for the attack using Matplotlib

%python
import matplotlib.pyplot as plt
importance = sqlContext.sql("SELECT Proto as protocol, count(*) as num_hits FROM temp_idsdata GROUP BY Proto ORDER BY num_hits DESC")
importanceDF = importance.toPandas()
ax = importanceDF.plot(x="protocol", y="num_hits",
    lw=3, colormap='Reds_r', title='Importance in Descending Order', fontsize=9)
ax.set_xlabel("protocol")
ax.set_ylabel("num_hits")
plt.xticks(rotation=12)
plt.grid(True)
plt.show()
display()

The same exploration can be done in R with ggplot2:

%r
library(SparkR)
library(ggplot2)
importance_df = collect(sql(sqlContext, 'SELECT Proto as protocol, count(*) as num_hits FROM temp_idsdata GROUP BY Proto ORDER BY num_hits DESC'))
ggplot(importance_df, aes(x=protocol, y=num_hits)) + geom_bar(stat='identity') + scale_x_discrete(limits=importance_df[order(importance_df$num_hits), "protocol"]) + coord_flip()
Results Interpretation

The plots above show the measures for each attack type.

1. The most common type of attack was Denial of Service (DoS), followed by port scan
2. IP 192.168.220.16 was the origin for most of the attacks, amounting to at least 14% of all attacks
3. Most of the attacks used the TCP protocol
4. As you can infer from the RMSE on running the model on the test data to …

Conclusion

Find relevant insights in all your data

As you can see from the preceding scenarios, Databricks on AWS was designed to give you more ways to enhance your insights and solve problems. It was built to work for you and your team, giving you more avenues for collaboration, more analytics power and a faster way to solve the problems unique to your business. We hope you found it helpful and will try using Databricks on AWS yourself.

Get started
© Databricks 2021. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. Privacy Policy | Terms of Use