M2 Ingesting New Datasets Into BigQuery
Evan Jones
So far, we've only queried datasets that already exist within BigQuery. The next
logical step after you're finished with all these courses is to load your own datasets into
BigQuery and analyze them.
So, that's why in this module, we'll cover how you can load external data into
BigQuery and create your very own datasets. First, let's cover the difference between
loading data into BigQuery versus querying it directly from an external data source.
Proprietary + Confidential
[Slide diagram. Labels: Dataflow, Dataprep, BigQuery-managed storage, BigQuery query engine]
As you can see on the left, there are a lot of different file formats and even systems
that you can ingest data from and then load permanently into
BigQuery-managed storage.
To name a few very common staging areas: with Google Cloud Storage, you could
have your massive CSV files stored in Cloud Storage buckets, which is very
common. Or with Dataflow jobs, your data engineering team has set up these beautiful
pipelines, and as one of the steps in a pipeline, you can have that data
written out, or materialized, into a BigQuery table for analysis. That's very common.
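As a rough sketch of the Cloud Storage case, the snippet below builds a BigQuery LOAD DATA statement that would batch load CSV files from a bucket into a table in managed storage. The dataset, table, and bucket names are made up for illustration, and the statement is only built as a string here, not sent to BigQuery.

```python
def build_load_statement(table, uris, skip_header_rows=1):
    """Build a BigQuery LOAD DATA statement for CSV files in Cloud Storage.

    The table and URIs are supplied by the caller; nothing is executed here.
    """
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"LOAD DATA INTO `{table}`\n"
        f"FROM FILES (\n"
        f"  format = 'CSV',\n"
        f"  uris = [{uri_list}],\n"
        f"  skip_leading_rows = {skip_header_rows}\n"
        f")"
    )


# Hypothetical names, used only for illustration.
statement = build_load_statement(
    "my_dataset.sales_raw",
    ["gs://my-example-bucket/sales/*.csv"],
)
print(statement)
```

You could paste the printed statement into the BigQuery query editor, or hand it to whichever client library you use to run queries.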
And as you saw, the Dataprep tool that you got a lot of practice with in the last
course is one of the UI layers for Dataflow, and it does exactly that. It will invoke that
materialization step in Dataflow and then write the output to BigQuery-managed storage.
With other Google Cloud big data tools like Cloud Bigtable, you can export or copy
that data from Bigtable into BigQuery-managed storage.
And of course, you can manually upload files from your desktop through a file browser
and ingest those tables into BigQuery-managed storage.
So that big icon that you see there in the middle is a core
component of the BigQuery service.
It's the query engine that will process your queries, and it's also the data management
piece behind the scenes that handles, stores, and optimizes all of your data. So
things like caching your data, storing it in a column format and compressing those
columns (which we're going to talk a little bit more about in the advanced course on
the architecture of BigQuery), and expanding the data and making sure that it's
replicated: all these things that a traditional database administrator
would handle for you, the BigQuery team here at Google manages for you
behind the scenes.
Why am I making such a big deal about managed storage? Because you might think,
"Hey, all right. Cool. It's managed storage. I don't have to worry about that. When
is my data ever not going to be in managed storage?" Right? And the answer is, it
could quite possibly never even hit managed storage…
… if you connect directly to the external data source. This is the mind-blowing
concept, right? You can write a SQL query, and that SQL query can be passed through
to the underlying data source.
It could be a Google Drive spreadsheet that someone is maintaining, and that data is
not ingested and permanently stored inside of BigQuery. That's an extreme case,
because naturally you can see the caveats of relying on a collaborative spreadsheet
as your system of record for a lot of your data. But this is a common occurrence
for things like one-time extract, transform, load (ETL) jobs, where you have a CSV that's
stored in Cloud Storage. Instead of ingesting that raw data and
storing it in two places, Cloud Storage and
BigQuery, you query it directly, perform some preprocessing steps, clean it all up, and
then at the end of that query, store the results as a permanent table
inside of BigQuery. So, that's one of the common use cases that I can think of for
establishing this pointer, or external connection.
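To make that one-time ETL pattern a bit more concrete, here is a minimal sketch that builds the two SQL statements involved: one that defines an external table pointing at CSV files in Cloud Storage, and one that queries it, cleans it up, and stores the result as a permanent table. All table names, the bucket path, and the order_id column are hypothetical, and the cleanup logic is just a placeholder for whatever preprocessing your data needs. The statements are built as strings only, nothing is run against BigQuery.

```python
def build_external_table_ddl(external_table, uris):
    """DDL that points BigQuery at CSV files without copying them into managed storage."""
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{external_table}`\n"
        f"OPTIONS (\n"
        f"  format = 'CSV',\n"
        f"  uris = [{uri_list}],\n"
        f"  skip_leading_rows = 1\n"
        f")"
    )


def build_cleanup_query(external_table, clean_table):
    """Query the external table, clean the rows, and store the result as a permanent table."""
    return (
        f"CREATE OR REPLACE TABLE `{clean_table}` AS\n"
        f"SELECT DISTINCT *\n"
        f"FROM `{external_table}`\n"
        f"WHERE order_id IS NOT NULL  -- order_id is a hypothetical column"
    )


# Hypothetical names, used only for illustration.
external_ddl = build_external_table_ddl(
    "my_dataset.sales_external",
    ["gs://my-example-bucket/sales/*.csv"],
)
cleanup_sql = build_cleanup_query(
    "my_dataset.sales_external",
    "my_dataset.sales_clean",
)
print(external_ddl)
print(cleanup_sql)
```

Only the cleaned table ends up in BigQuery-managed storage; the raw CSV stays in Cloud Storage.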
Now, as you can see from that big arrow going over BigQuery-managed storage, you're just using the
query engine. You get none of the performance advantages of the
managed storage piece, and there are a lot of other drawbacks.
Limitations:
We've largely discussed batch loading a CSV, or massive CSVs, into BigQuery, but know
that there is also a streaming option available through the API, where you can
ingest individual records one at a time into BigQuery-managed storage
and then run queries on those as well. The streaming API is well documented, and
you can use it if you have a streaming or near real-time data need for
your application.
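As a small illustration of what streaming individual records looks like, the sketch below wraps a couple of records in the request body shape used by the BigQuery tabledata.insertAll REST method. The records and field names are invented for the example, and the snippet only builds and prints the body; it does not call the API.

```python
import json
import uuid


def build_insert_all_body(records):
    """Wrap records in the request body shape used by the tabledata.insertAll method.

    Each row gets an insertId, which BigQuery uses for best-effort de-duplication.
    """
    return {
        "rows": [
            {"insertId": str(uuid.uuid4()), "json": record}
            for record in records
        ]
    }


# Hypothetical records, used only for illustration.
records = [
    {"order_id": 1001, "amount": 25.50},
    {"order_id": 1002, "amount": 13.75},
]
body = build_insert_all_body(records)
print(json.dumps(body, indent=2))
```

In practice you would more likely let a client library build and send this for you, for example the Python client's insert_rows_json method, rather than assembling the body by hand.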
Lab Intro
Ingesting New Datasets into BigQuery
Now it's time for us to ingest and query brand new data sources in BigQuery.
In this next lab, you'll practice loading data into BigQuery from external sources like
Google Cloud Storage. You'll also learn how to set up an external data connection,
but beware the caveats we discussed earlier.