GCP Fund Module 8 Big Data and Machine Learning in The Cloud Coursera
GCP Fundamentals: Core Infrastructure
© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
Google Cloud’s big data services are fully managed and scalable
Google Cloud Big Data solutions are designed to help you transform your business
and user experiences with meaningful data insights. They form an integrated, serverless
platform. “Serverless” means you don’t have to provision compute instances to run
your jobs. The services are fully managed, and you pay only for the resources you
consume. The platform is “integrated” so GCP data services work together to help
you create custom solutions.
Cloud Dataproc is managed Hadoop
Cloud Dataproc is a fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on
Google Cloud Platform. All you have to do is to request a Hadoop cluster. It will be
built for you in 90 seconds or less, on top of Compute Engine virtual machines whose
number and type you can control. If you need more or less processing power while
your cluster’s running, you can scale it up or down. You can use the default
configuration for the Hadoop software in your cluster, or you can customize it. And
you can monitor your cluster using Stackdriver.
Why use Cloud Dataproc?
You can also save money by telling Cloud Dataproc to use preemptible Compute
Engine instances for your batch processing. You have to make sure that your jobs
can be restarted cleanly if they’re terminated; in return, you get a significant break
in the cost of the instances. At the time this video was made, preemptible instances were
around 80% cheaper. Be aware that the cost of the Compute Engine instances isn’t
the only component of the cost of a Dataproc cluster, but it’s a significant one.
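
As a rough illustration of the saving described above, the following sketch compares the Compute Engine portion of a cluster's cost with and without preemptible workers. The hourly rate is a made-up placeholder, not a real GCP price, and the 80% discount is simply the figure quoted above; other Dataproc cost components are left out.

```python
# Hypothetical price: the hourly rate below is a placeholder, not a real GCP price.
REGULAR_HOURLY_RATE = 0.10   # assumed cost of one regular worker VM per hour
PREEMPTIBLE_DISCOUNT = 0.80  # "around 80% cheaper", the figure quoted in the notes


def cluster_compute_cost(regular_workers, preemptible_workers, hours):
    """Return only the Compute Engine portion of a cluster's cost."""
    preemptible_rate = REGULAR_HOURLY_RATE * (1 - PREEMPTIBLE_DISCOUNT)
    return hours * (regular_workers * REGULAR_HOURLY_RATE
                    + preemptible_workers * preemptible_rate)


# Same total number of workers, with and without preemptible instances.
all_regular = cluster_compute_cost(regular_workers=10, preemptible_workers=0, hours=5)
mixed = cluster_compute_cost(regular_workers=2, preemptible_workers=8, hours=5)

print(f"10 regular workers:        {all_regular:.2f}")
print(f"2 regular + 8 preemptible: {mixed:.2f}")
```

The comparison only holds if the batch jobs tolerate having preemptible workers taken away partway through.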
Once your data is in a cluster, you can use Spark and Spark SQL to do data mining,
and you can use MLlib, which is Apache Spark’s machine learning library, to
discover patterns through machine learning.
Cloud Dataflow offers managed data pipelines
Cloud Dataproc is great when you have a dataset of known size, or when you
want to manage your cluster size yourself. But what if your data shows up in
real time? Or it’s of unpredictable size or rate? That’s where Cloud Dataflow is
a particularly good choice. It’s both a unified programming model and a
managed service, and it lets you develop and execute a big range of data
processing patterns: extract-transform-and-load, batch computation, and
continuous computation. You use Dataflow to build data pipelines, and the
same pipelines work for both batch and streaming data.
Resource Management
Cloud Dataflow fully automates management of required processing
resources. No more spinning up instances by hand.
On Demand
All resources are provided on demand, enabling you to scale to meet your
business needs. No need to buy reserved compute instances.
Auto Scaling
Horizontal auto scaling of worker resources to meet optimum throughput
requirements results in better overall price-to-performance.
Open Source
Developers wishing to extend the Dataflow programming model can fork and/or
submit pull requests on the Java-based Cloud Dataflow SDK. Dataflow
pipelines can also run on alternate runtimes like Spark and Flink.
Monitoring
Integrated into the Google Cloud Platform Console, Cloud Dataflow provides
statistics such as pipeline throughput and lag, as well as consolidated worker
log inspection—all in near-real time.
Integrated
Integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud
Bigtable, and BigQuery for seamless data processing. It can also be extended to
interact with other sources and sinks like Apache Kafka and HDFS.
[Figure: example Dataflow pipeline with a BigQuery source, a series of transforms, and a Cloud Storage sink]
This example Dataflow pipeline reads data from a BigQuery table (the “source”),
processes it in various ways (the “transforms”), and writes its output to Cloud Storage
(the “sink”). Some of those transforms you see here are map operations, and some
are reduce operations. You can build really expressive pipelines.
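
The source, transform, and sink roles can be sketched in plain Python. This is not the Dataflow SDK: the functions and the sample rows are hypothetical stand-ins, chosen only to show one map step and one reduce step between a source and a sink.

```python
from collections import defaultdict


def read_source():
    """Stand-in for the source: yields rows as a table read might."""
    yield {"page": "/home", "bytes": 120}
    yield {"page": "/docs", "bytes": 300}
    yield {"page": "/home", "bytes": 80}


def to_key_value(rows):
    """Map step: turn each row into a (key, value) pair."""
    for row in rows:
        yield row["page"], row["bytes"]


def sum_per_key(pairs):
    """Reduce step: combine all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def write_sink(totals):
    """Stand-in for the sink: formats the lines a file write would receive."""
    return [f"{page},{total}" for page, total in sorted(totals.items())]


# Chain the steps together: source -> map -> reduce -> sink.
output_lines = write_sink(sum_per_key(to_key_value(read_source())))
print(output_lines)
```

In a managed pipeline each of these steps would be distributed across workers; here they run in a single process to keep the shape of the pipeline visible.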
Each step in the pipeline is elastically scaled. There is no need to launch and manage
a cluster. Instead, the service provides all resources on demand. It has automated
and optimized work partitioning built in, which can dynamically rebalance lagging
work. That reduces the need to worry about “hot keys” -- that is, situations where
disproportionately large chunks of your input get mapped to the same cluster.
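
To see why hot keys matter, consider what happens without rebalancing. The sketch below assumes a simple static scheme in which each key is hashed to a fixed worker; the key names, the worker count, and the hashing choice are all illustrative and say nothing about how Dataflow partitions work internally.

```python
import zlib
from collections import Counter


def assign_worker(key, num_workers):
    """Static partitioning: the same key always lands on the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


# Hypothetical skewed input: one key accounts for most of the records.
records = ["user-1"] * 90 + [f"user-{i}" for i in range(2, 12)]

# Count how many records each worker would have to process.
load = Counter(assign_worker(key, num_workers=4) for key in records)
hot_worker, hot_load = load.most_common(1)[0]

print(f"records per worker: {dict(load)}")
print(f"worker {hot_worker} handles {hot_load} of {len(records)} records")
```

Whichever worker receives the hot key ends up with at least 90 of the 100 records while the others sit nearly idle, which is the kind of lagging work that dynamic rebalancing is meant to relieve.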
Why use Cloud Dataflow?
People use Dataflow in a variety of use cases. For one, it serves well as a
general-purpose ETL tool.
And its use case as a data analysis engine comes in handy in things like these: fraud
detection in financial services; IoT analytics in manufacturing, healthcare, and
logistics; and clickstream, Point-of-Sale, and segmentation analysis in retail.
And, because those pipelines we saw can orchestrate multiple services, even
external services, it can be used in real-time applications such as personalizing
gaming user experiences.
BigQuery is a fully managed data warehouse
BigQuery is Google's fully managed, petabyte scale, low cost analytics data
warehouse. BigQuery is NoOps: there is no infrastructure to manage and you
don't need a database administrator, so you can focus on analyzing data to
find meaningful insights, use familiar SQL, and take advantage of our
pay-as-you-go model. BigQuery is a powerful big data analytics platform used
by all types of organizations, from startups to Fortune 500 companies.
BigQuery’s features:
Global Availability
You have the option to store your BigQuery data in European locations while
continuing to benefit from a fully managed service, now with the option of
geographic data control, without low-level cluster maintenance.
Security and Permissions
You have full control over who has access to the data stored in BigQuery. If
you share datasets, doing so will not impact your cost or performance; those
you share with pay for their own queries.
Cost Controls
BigQuery provides cost control mechanisms that enable you to cap your daily
costs at an amount that you choose. For more information, see Cost Controls.
Highly Available
Transparent data replication in multiple geographies means that your data is
available and durable even in the case of extreme failure modes.
Fully Integrated
In addition to SQL queries, you can easily read and write data in BigQuery via
Cloud Dataflow, Spark, and Hadoop.
BigQuery can make Create, Replace, Update, and Delete changes to databases,
subject to some limitations and with certain known issues.
BigQuery runs on Google’s high-performance infrastructure
It’s easy to get data into BigQuery. You can load from Cloud Storage or Cloud
Datastore, or stream it into BigQuery at up to 100,000 rows per second.
Cloud Pub/Sub is a fully managed real-time messaging service that allows you
to send and receive messages between independent applications. You can
leverage Cloud Pub/Sub’s flexibility to decouple systems and components
hosted on Google Cloud Platform or elsewhere on the internet. By building on
the same technology Google uses, Cloud Pub/Sub is designed to provide “at
least once” delivery at low latency with on-demand scalability to 1 million
messages per second (and beyond).
Highly Scalable
Any customer can send up to 10,000 messages per second, by default—and
millions per second and beyond, upon request.
Encryption
Encryption of all message data on the wire and at rest provides data security
and protection.
Replicated Storage
Designed to provide “at least once” message delivery by storing every
message on multiple servers in multiple zones.
Message Queue
Build a highly scalable queue of messages using a single topic and
subscription to support a one-to-one communication pattern.
End-to-End Acknowledgement
Building reliable applications is easier with explicit application-level
acknowledgements.
Fan-out
Publish messages to a topic once, and multiple subscribers receive copies to
support one-to-many or many-to-many communication patterns.
REST API
Simple, stateless interface using JSON messages with API libraries in many
programming languages.
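
The fan-out and acknowledgement ideas above can be modeled in a few lines. This is a toy in-memory model, not the Cloud Pub/Sub client library: the `Topic` and `Subscription` classes and their methods are hypothetical names invented for the illustration.

```python
import itertools


class Subscription:
    """Holds messages until the subscriber acknowledges them."""

    def __init__(self, name):
        self.name = name
        self.unacked = {}  # message id -> payload

    def pull(self):
        """Return every outstanding message; unacked ones are delivered again."""
        return list(self.unacked.items())

    def ack(self, message_id):
        """Acknowledge a message so that it is not redelivered."""
        self.unacked.pop(message_id, None)


class Topic:
    """Copies each published message to every attached subscription."""

    def __init__(self):
        self.subscriptions = []
        self._ids = itertools.count(1)

    def subscribe(self, name):
        subscription = Subscription(name)
        self.subscriptions.append(subscription)
        return subscription

    def publish(self, payload):
        message_id = next(self._ids)
        for subscription in self.subscriptions:  # fan-out: one copy each
            subscription.unacked[message_id] = payload
        return message_id


topic = Topic()
billing = topic.subscribe("billing")
audit = topic.subscribe("audit")
topic.publish("order-created")

first_pull = billing.pull()
second_pull = billing.pull()          # not yet acknowledged, so delivered again
for message_id, _ in second_pull:
    billing.ack(message_id)
third_pull = billing.pull()           # nothing left after the acknowledgement
audit_pull = audit.pull()             # the other subscriber still has its copy
```

Because a message can arrive more than once before it is acknowledged, subscribers in an "at least once" system generally need to tolerate duplicates.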
Why use Cloud Pub/Sub?
Cloud Pub/Sub builds on the same technology Google uses internally. It’s an
important building block for applications where data arrives at high and unpredictable
rates, like Internet of Things systems. If you’re analyzing streaming data, Cloud
Dataflow is a natural pairing with Pub/Sub.
Cloud Datalab offers interactive data exploration
Cloud Datalab lets you use Jupyter notebooks to explore, analyze, and
visualize data on the Google Cloud Platform. It runs in a Compute Engine
virtual machine. To get started, you specify the virtual machine type you want
and what GCP region it should run in. When it launches, it presents an
interactive Python environment that’s ready to use. And it orchestrates
multiple GCP services automatically, so you can focus on exploring your data.
You only pay for the resources you use; there’s no additional charge for
Datalab itself.
Integrated
Cloud Datalab handles authentication and cloud computation out of the box
and is integrated with BigQuery, Compute Engine, and Cloud Storage.
Multi-Language Support
Cloud Datalab currently supports Python, SQL, and JavaScript (for BigQuery
user-defined functions).
Notebook Format
Cloud Datalab combines code, documentation, results, and visualizations
together in an intuitive notebook format.
Pay-per-use Pricing
Only pay for the cloud resources you use: the Compute Engine virtual machine,
BigQuery, and any additional resources you decide to use, such as Cloud
Storage.
Collaborative
Git-based source control of notebooks with the option to sync with non-Google
source code repositories like GitHub and Bitbucket.
Open Source
Developers who want to extend Cloud Datalab can fork and/or submit pull
requests on the GitHub hosted project.
Custom Deployment
Specify your minimum VM requirements, the network host, and more.
IPython Support
Cloud Datalab is based on Jupyter (formerly IPython) so you can use a large
number of existing packages for statistics, machine learning, etc. Learn from
published notebooks and swap tips with a vibrant IPython community.
Why use Cloud Datalab?
Cloud Datalab is integrated with BigQuery, Compute Engine, and Cloud Storage, so
accessing your data doesn’t run into authentication hassles.
When you’re up and running, you can visualize your data with Google Charts or
matplotlib. And, because there’s a vibrant interactive Python community, you can
learn from published notebooks. There are many existing packages for statistics,
machine learning, and so on.
You can attach a GPU to a Cloud Datalab instance for faster processing. At the time
of this writing, this feature was in beta, which means that no SLA is available and that
the feature could change in backwards-incompatible ways.
Machine Learning APIs enable apps that see, hear, and understand
Machine learning is one branch of the field of artificial intelligence. It’s a way of
solving problems without explicitly coding the solution. Instead, human coders
build systems that improve themselves over time, through repeated exposure
to sample data, which we call “training data.”
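
A very small example may make this concrete. The sketch below is not tied to any Google service; it assumes a one-parameter model and plain gradient descent, chosen only because they fit in a few lines. The rule behind the data (the output is twice the input) is never coded into the model, which recovers it from the training data alone.

```python
# Training data: inputs paired with the outputs the model should produce.
# The underlying rule (y = 2x) is never written into the model itself.
training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def mean_squared_error(weight):
    """Average squared gap between the model's predictions and the targets."""
    return sum((weight * x - y) ** 2 for x, y in training_data) / len(training_data)


weight = 0.0          # the model starts out knowing nothing
learning_rate = 0.01
initial_error = mean_squared_error(weight)

for _ in range(200):  # repeated exposure to the same training data
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(2 * (weight * x - y) * x for x, y in training_data) / len(training_data)
    weight -= learning_rate * gradient

final_error = mean_squared_error(weight)
print(f"learned weight: {weight:.3f}")
print(f"error before: {initial_error:.3f}, after: {final_error:.6f}")
```

Each pass nudges the weight in the direction that reduces the error, which is the sense in which the system improves itself over time.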
Major Google applications use Machine Learning, like YouTube, Photos, the
Google mobile app, and Google Translate. The Google machine learning
platform is now available as a cloud service, so that you can add innovative
capabilities to your own applications.
Cloud Machine Learning Platform
Open source tool to build and run neural network models
● Wide platform support: CPU or GPU; mobile, server, or cloud
Suppose you want a more managed service. Google Cloud Machine Learning
Engine lets you easily build machine learning models that work on any type of
data, of any size. It can take any TensorFlow model and perform large scale
training on a managed cluster.
● Classification and regression
● Image and video analytics
● Recommendation
● Text analytics
● Anomaly detection
People use the Cloud Machine Learning platform for lots of applications.
Generally, they fall into two categories, depending on whether the data they
work on is structured or unstructured.
Based on structured data, you can use ML for various kinds of classification
and regression tasks, like customer churn analysis, product diagnostics, and
forecasting. It can be the heart of a recommendation engine, for content
personalization and cross-sells and up-sells. You can use ML to detect
anomalies, as in fraud detection, sensor diagnostics, or log metrics.
Based on unstructured data, you can use ML for image analytics, such as
identifying damaged shipments, identifying styles, and flagging content. You
can do text analytics too, like call center log analysis, language identification,
topic classification, and sentiment analysis.
The Cloud Speech API enables developers to convert audio to text. Because
you have an increasingly global user base, the API recognizes over 80
languages and variants. You can transcribe the text of users dictating to an
application’s microphone, enable command-and-control through voice, or
transcribe audio files.
Cloud Natural Language API
It can do entity recognition: in other words, it can parse text and flag mentions
of people, organizations, locations, events, products and media.
Syntax Analysis
● Extract tokens and sentences, identify parts of speech (PoS), and create
dependency parse trees for each sentence.
Entity Recognition
● Identify entities and label by types such as person, organization,
location, events, products and media.
Sentiment Analysis
● Understand the overall sentiment expressed in a block of text.
Multi-Language
● Enables you to easily analyze text in multiple languages including
English, Spanish, and Japanese.
Integrated REST API
● Access via REST API. Text can be uploaded in the request or integrated
with Cloud Storage.
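
The two ways of supplying text can be sketched as JSON request bodies. The block below only builds the bodies and sends nothing. The endpoint and field names reflect the v1 REST API as best understood here and should be checked against the current reference; the sample sentence and the bucket path are hypothetical.

```python
import json

# Assumed endpoint for entity analysis in the v1 REST API; verify before use.
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeEntities"


def inline_request(text):
    """Request body with the text uploaded in the request itself."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }


def cloud_storage_request(gcs_uri):
    """Request body that points at a text file held in Cloud Storage."""
    return {
        "document": {"type": "PLAIN_TEXT", "gcsContentUri": gcs_uri},
        "encodingType": "UTF8",
    }


inline_body = inline_request("The conference in Tokyo was organized by Example Corp.")
stored_body = cloud_storage_request("gs://example-bucket/notes.txt")  # hypothetical bucket

print(f"POST {ENDPOINT}")
print(json.dumps(inline_body, indent=2))
```

A real call would also need credentials, which are omitted from this sketch.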
The Translation API supports the standard Google API Client Libraries in
Python, Java, Ruby, Objective-C, and other languages.
The Google Cloud Video Intelligence API allows developers to use Google
video analysis technology as part of their applications. The REST API enables
users to annotate videos stored in Google Cloud Storage with video and
frame-level (1 fps) contextual information. It helps you identify key entities --
that is, nouns -- within your video, and when they occur. You can use it to make
video content searchable and discoverable.
The API supports the annotation of common video formats, including .MOV,
.MPEG4, .MP4, and .AVI.
Quiz
Quiz Answers
When would you use Cloud Dataproc?
You can use it to migrate on-premises Hadoop jobs to the cloud. You can also use
it for data mining and analysis of cloud-based data.
Name three use cases for the Google machine learning platform.
Fraud detection, sentiment analysis, content personalization
Lab instructions
In this lab, you will load server log data into BigQuery and perform a SQL
query on it.
In this lab, you load a CSV file into a BigQuery table. After loading the data, you
query it using the BigQuery web user interface, the CLI, and the BigQuery shell.
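
The load-then-query flow of the lab can be rehearsed locally. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; it is not BigQuery, its SQL dialect differs, and the CSV contents and column names are invented for the illustration rather than taken from the lab's data file.

```python
import csv
import io
import sqlite3

# Hypothetical server log data in CSV form; the lab's real file is not shown here.
CSV_TEXT = """timestamp,path,status
2017-01-01T00:00:00,/index.html,200
2017-01-01T00:00:05,/missing,404
2017-01-01T00:00:09,/index.html,200
"""

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE logs (timestamp TEXT, path TEXT, status INTEGER)")

# Load step: parse the CSV and insert each row into the table.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
connection.executemany("INSERT INTO logs VALUES (:timestamp, :path, :status)", rows)

# Query step: count requests per HTTP status code.
query = "SELECT status, COUNT(*) FROM logs GROUP BY status ORDER BY status"
status_counts = connection.execute(query).fetchall()
print(status_counts)
```

In the lab itself, the same two steps are carried out with BigQuery's own tools rather than a local database.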
More resources