Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
About this ebook
Google Cloud Platform for Data Engineering is designed to take the beginner on a journey to becoming a competent and certified GCP data engineer. The book is therefore split into three parts. The first part covers the fundamental concepts of data engineering and data analysis from a platform- and technology-neutral perspective; reading Part 1 will bring a beginner up to speed with the generic concepts, terms and technologies we use in data engineering. The second part is a high-level but comprehensive introduction to all the concepts, components, tools and services available to us within the Google Cloud Platform; completing this section will give the beginner to GCP and data engineering a solid foundation in the architecture and capabilities of the GCP. Part 3, however, is where we delve into the moderate to advanced techniques that data engineers need to know and be able to carry out. By this time the raw beginner who started the journey at the beginning of Part 1 will be a knowledgeable, albeit inexperienced, data engineer. By the conclusion of Part 3, they will have gained the advanced knowledge of data engineering techniques and practices on the GCP needed to pass not only the certification exam but also most interviews and practical tests with confidence.
In short, Part 3 provides the prospective data engineer with detailed knowledge of setting up and configuring Dataproc – GCP's version of the Spark/Hadoop ecosystem for big data. They will also learn how to build and test streaming and batch data pipelines using Pub/Sub, Dataflow and BigQuery, and how to integrate the ML and AI Platform components and APIs. They will become accomplished at connecting data analysis and visualisation tools such as Datalab, Data Studio and AI Platform Notebooks, amongst others. They will also know how to build and train a TensorFlow DNN using its APIs and Keras and optimise it to run on large public data sets. Finally, they will know how to provision and use Kubeflow and Kubeflow Pipelines within Google Kubernetes Engine to run container workloads, as well as how to take advantage of serverless technologies such as Cloud Run and Cloud Functions to build transparent and seamless data processing platforms. The best part of the book, though, is its compartmental design, which means that anyone from a beginner to an intermediate can join the book at whatever point they feel comfortable.
Google Cloud Platform for Data Engineering
– From Beginner to Data Engineer using Google Cloud Platform
––––––––
Copyright © Alasdair Gilchrist 2019
Table of Contents
Google Cloud Platform for Data Engineering
– From Beginner to Data Engineer using Google Cloud Platform
Google Cloud Platform for Data Engineering
Chapter 1: An Introduction to Data Engineering
Chapter 2 - Defining Data Types
Chapter 3 – Deriving Knowledge from Information
Chapter 5 – Data Modelling
Chapter 6 – Alternative OLAP Data Schemas
Chapter 7 - Designing a Data Warehouse
Chapter 8 – Advanced Data Analysis & Business Intelligence
Chapter 9 - Introduction to Data Mining Algorithms
Chapter 10 – On-premise vs. Cloud Technologies
Chapter 11 – An Introduction to Machine Learning
Chapter 12 – Working with Error
Chapter 13 – Planning the ML Process
Part II – Google Cloud Platform Fundamentals
Chapter 14 - An Introduction to the Google Cloud Platform
Chapter 15 – Introduction to Cloud Security
Chapter 16 - Interacting with Google Cloud Platform
Chapter 17 - Compute Engine and Virtual Machines
Chapter 18 – Cloud Data Storage
Chapter 19 - Containers and Kubernetes Engine
Chapter 20 - App Engine
Chapter 21 – Serverless Compute with Cloud Functions and Cloud Run
Chapter 22 – Using GCP Cloud Tools
Chapter 23 - Cloud Big Data Solutions
Chapter 24 - Machine Learning
Part III – Data Engineering on GCP
Chapter 25 – Data Lifecycle from a GCP Perspective
Ingest
Store
Process and Analyse
Access and Query data
Explore and Visualize
Chapter 26 - Working with Cloud DataProc
Hadoop Ecosystem in GCP
Cloud Dataflow and Apache Spark
Chapter 27 - Stream Analytics and Real-Time Insights
Streaming - Processing and Storage
Cloud Pub/Sub
Chapter 28 - Working with Cloud Dataflow SDK (Apache Beam)
Chapter 29 - Working with BigQuery
BigQuery – GCP’s Data Warehouse
Chapter 30 - Working with Dataprep
Chapter 31 - Working with Datalab
Chapter 32 – Integrating BigQuery BI Engine with Data Studio
Chapter 33 - Orchestrating Data Workflows with Cloud Composer
Chapter 34 - Working with Cloud AI Platform
Training a TensorFlow model with Kubeflow
(Optional) Test the code in a Jupyter notebook
Chapter 35 – Cloud Migration
Google Cloud Platform for Data Engineering
Part 1 – An introduction to Data Engineering
Chapter 1: An Introduction to Data Engineering
A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. A data engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A data engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models. (Google)
In recent years data engineering has emerged as a distinct but related role that works in concert with data analysts and data scientists. Typically, the differentiator is that data scientists focus on finding new insights in a data set, while data engineers are primarily concerned with the technologies and with preparing the data: cleaning, structuring, modelling, scaling and securing it, amongst other tasks.
As a result data engineers primarily focus on the following areas:
Clean and wrangle data
However, it is not all about playing with technology and connectors as there is a lot of time spent cleaning and wrangling data to prepare it for input into the analytical systems. Hence, data engineers must make sure that the data the organization is using is clean, reliable, and prepped specifically for each job. Consequently, a large part of the data engineer’s job is to parse, clean and wrangle the data. This important task is about taking a raw dataset and refining it into something useful. The objective is to restructure and format the data into a state that is fit for analysis and can have queries run against it.
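As an illustration only – a minimal data-cleaning sketch in Python using pandas; the file and column names are hypothetical, and a real wrangling job would involve many more steps:
import pandas as pd
# Load a hypothetical raw export of sales transactions
raw = pd.read_csv("raw_sales.csv")
# Basic wrangling: drop exact duplicates, normalise column names,
# parse dates, and discard rows missing the fields needed for analysis
clean = (
    raw.drop_duplicates()
       .rename(columns=str.lower)
       .assign(order_date=lambda df: pd.to_datetime(df["order_date"], errors="coerce"))
       .dropna(subset=["order_date", "amount"])
)
# Persist in a columnar format that analytical engines query efficiently
# (to_parquet requires the pyarrow or fastparquet package)
clean.to_parquet("clean_sales.parquet", index=False)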
Build and maintain data pipelines
It will be the responsibility of the data engineer to plan and construct the necessary data pipelines that encompass the journey and processes that data undergoes within a company. Creating a data pipeline is rarely easy, but at big data scale it can be especially challenging as it requires integrating data I/O from many different big data technologies. Moreover, a data engineer needs to understand and select the right tools or technologies for the job. In short, the data engineer is the subject matter expert (SME) when it comes to technologies and frameworks, so they will be expected to have in-depth knowledge of how to combine often diverse technologies in order to create the data pipeline solutions that enable a company’s business and analytical processes.
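To make the idea concrete – a minimal, hedged sketch of a batch pipeline written with the Apache Beam Python SDK (the same programming model that Cloud Dataflow runs, covered in Part III); the file names are hypothetical and the pipeline runs locally on the default DirectRunner:
import apache_beam as beam
with beam.Pipeline() as pipeline:  # DirectRunner by default; Dataflow via pipeline options
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("orders.csv")        # hypothetical input file
        | "Parse" >> beam.Map(lambda line: line.split(","))   # split CSV rows into fields
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
        | "Reformat" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("orders_clean")      # sharded text output
    )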
What does a data engineer need to know?
According to Google, and as this book is ultimately about data engineering on the Google Cloud Platform – so who better to ask – the required body of knowledge expected of a certified data engineer is as follows:
1. Designing data processing systems
1.1 Selecting the appropriate storage technologies. Considerations include:
Mapping storage systems to business requirements
Data modelling
Trade-offs involving latency, throughput, transactions
Distributed systems
Schema design
1.2 Designing data pipelines. Considerations include:
Data publishing and visualization (e.g., BigQuery)
Batch and streaming data (e.g., Cloud Dataflow, Cloud Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Cloud Pub/Sub, Apache Kafka)
Online (interactive) vs. batch predictions
Job automation and orchestration (e.g., Cloud Composer)
1.3 Designing a data processing solution. Considerations include:
Choice of infrastructure
System availability and fault tolerance
Use of distributed systems
Capacity planning
Hybrid cloud and edge computing
Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
At-least-once, in-order, and exactly-once event processing
1.4 Migrating data warehousing and data processing. Considerations include:
Awareness of current state and how to migrate a design to a future state
Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
Validating a migration
2. Building and operationalizing data processing systems
2.1 Building and operationalizing storage systems. Considerations include:
Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Cloud Datastore, Cloud Memorystore)
Storage costs and performance
Lifecycle management of data
2.2 Building and operationalizing pipelines. Considerations include:
Data cleansing
Batch and streaming
Transformation
Data acquisition and import
Integrating with new data sources
2.3 Building and operationalizing processing infrastructure. Considerations include:
Provisioning resources
Monitoring pipelines
Adjusting pipelines
Testing and quality control
3. Operationalizing machine learning models
3.1 Leveraging pre-built ML models as a service. Considerations include:
ML APIs (e.g., APIs such as Vision API, Speech API)
Customizing ML APIs (e.g., customising AutoML Vision, Auto ML text, or others)
Conversational experiences (e.g., Dialogflow)
3.2 Deploying an ML pipeline. Considerations include:
Ingesting appropriate data
Retraining of machine learning models (Cloud Machine Learning Engine, BigQuery ML, Kubeflow, Spark ML)
Continuous evaluation
3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
Distributed vs. single machine
Use of edge compute
Hardware accelerators (e.g., GPU, TPU)
3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
Impact of dependencies of machine learning models
Common sources of error (e.g., assumptions about data)
4. Ensuring solution quality
4.1 Designing for security and compliance. Considerations include:
Identity and access management (e.g., Cloud IAM)
Data security (encryption, key management)
Ensuring privacy (e.g., Data Loss Prevention API)
Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
4.2 Ensuring scalability and efficiency. Considerations include:
Building and running test suites
Pipeline monitoring (e.g., Stackdriver)
Assessing, troubleshooting, and improving data representations and data processing infrastructure
Resizing and autoscaling resources
4.3 Ensuring reliability and fidelity. Considerations include:
Performing data preparation and quality control (e.g., Cloud Dataprep)
Verification and monitoring
Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
Choosing between ACID, idempotent, eventually consistent requirements
4.4 Ensuring flexibility and portability. Considerations include:
Mapping to current and future business requirements
Designing for data and application portability (e.g., multi-cloud, data residency requirements)
Data staging, cataloguing, and discovery
However, with the explosion in interest and adoption of big data analytics over the last decade or so a data engineer’s required body of knowledge is rapidly expanding. Currently a data engineer will be expected to have a good general knowledge of the different big data technologies. But these technologies fall under numerous areas of speciality, such as file formats, ingestion engines, stream and batch processing pipelines, NoSQL data storage, container and cluster management, transaction and analytical databases, serverless web frameworks, data visualizations, and machine learning pipelines, to name just a few.
A holistic understanding of data is a prerequisite. But what is really desirable is for data engineers to understand the business objectives – the purpose of the analytics – and how the entire big data operation works to deliver on that goal, and then to look for ways to make it better. That means thinking and acting like an engineer one moment and like a traditional product manager the next.
Data engineering is not just a critical skill for advanced data analytics or machine learning; every data scientist should know enough about data engineering to be competent at evaluating how data projects align with the business goals and the capabilities of their company.
Furthermore, the topic of generic data engineering skills is also a crucial element in the certification exam. Therefore, in this section of the book we will provide a detailed introduction to the concepts and principles behind data engineering from a vendor agnostic perspective. If you are a beginner, you will certainly need to know this, as Google assumes you have at least one year’s practical experience, so if you are pursuing a career in the discipline or are looking to take the certification exam we recommend you read through Part 1 to get familiar with the concepts and terms you will need to know later on.
The topics we will cover in this first part of the book deal with the generic and platform-agnostic principles of data engineering. If you already have a good background in the following topics:
Types of Data
Data Modelling
Types of OLTP and OLAP systems
Data Warehousing
ETL and ELT
Machine Learning models, concepts and algorithms
Big Data ecosystems (Hadoop, Spark, etc.)
You may want to skip this section and go straight to Part II, Google Cloud Platform Fundamentals.
Chapter 2 - Defining Data Types
Data is dumb, it’s not about the data it’s about the information (Stupid).
Data in itself is meaningless without either context or processing, upon which it becomes information. That is the common explanation of data’s value and why we need to process it so that it is transformed into information. An example could be the stream of data contained within computer logs, which to the untrained eye is meaningless:
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
It is only when that log snippet is placed in context as being the output from an Apache web server log that we can actually gain any understanding from it. Then we can extract information such as the address of the client, the identity of the user making the request and the resource they requested. Hence we have transformed raw data, the log, into information by applying context. And this relationship between data and information is the foundation of what is called the Data/Information/Knowledge/Wisdom or DIKW pyramid.
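As a brief illustration – a short Python sketch that applies that context by parsing the log line above into named fields (the regular expression is our own, not part of any Apache tooling):
import re
# Fields of the Common Log Format: client, identity, user, timestamp, request, status, size
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)
line = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()                       # raw data becomes named fields
    print(record["client"], record["request"], record["status"])
    # 127.0.0.1 GET /apache_pb.gif HTTP/1.0 200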
The DIKW pyramid refers loosely to a class of models for representing purported structural and/or functional relationships between data, information, knowledge, and wisdom.
The principle of the Data, Information, Knowledge, Wisdom, hierarchy was introduced by Russell Ackoff in his address to the International Society for General Systems Research in 1989.
The DIKW sequence, as it became known, reinforces the premise that information is a refinement of data – more precisely, that information is the value we extract from data. The DIKW sequence has been generally accepted and taught in most computer science courses. However, some have a problem with the sequence, specifically the middle steps between information and knowledge. The general consensus, though, is that at least the first step is correct: from data we derive information, and so that part is accepted.
In order to be able to understand the complex domain of Data Analytics or Business Intelligence and all it entails we need to start by having a good understanding of the basic elements – data.
Data can be defined as discrete, raw or unorganized pieces of information, such as facts or figures. On its own, data is ambiguous, unorganized and unprocessed. When a system or process arranges, sorts or uses data to calculate something, it is processing the data. Raw data, the term for unprocessed data, is the raw material used to make information; it is the input that an information system organizes and processes. Raw data is fed into a computer program as variables or strings; it can be the binary representation on a computer disk, or the digital input from electronic sensors that collect analogue signals from their environment.
Data can be qualitative or quantitative. Quantitative data – which is measured and objective – can be represented as a number and so can be statistically analysed, typically on interval or ratio scales. Qualitative data – subjective and opinion-based – is not numeric, so it is represented on nominal or ordinal scales, such as likes and dislikes.
Data can be represented in many formats, as distinct pieces of information formatted in a specific way: characters, symbols, strings, numbers, audio (Morse code) and visuals (frames in a movie). Data can also carry many attributes, such as unverified, unformatted or unparsed. It may also have attributes regarding its status: verified, unreliable, uncertified or validated.
Quantitative Data: Continuous Data and Discrete Data
There are two types of quantitative data, which is also referred to as numeric data: continuous and discrete. As a general rule, we consider counts to be discrete and measurements to be continuous.
For example, discrete data is a count that can't be made more precise. Typically it involves integers and exact figures. For instance, the number of children in your family is a set of discrete data. After all, there are no half children; a child either is counted or is not.
Continuous data, on the other hand, could be granular and reduced to finer and finer levels of grain. For example, you can measure the time taken to commute to work in the morning.
Continuous data is therefore valuable in many different kinds of hypothesis tests when comparing figures. Some analyses use continuous and discrete quantitative data at the same time, as together they may reveal performance measures such as time over distance. For instance, we could perform a regression analysis to see if speed (continuous data) is correlated with the number of meters run (discrete data).
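As a toy illustration with made-up numbers – a short Python sketch of such a regression between a continuous measure (average speed) and a discrete one (meters run):
import numpy as np
meters_run = np.array([100, 200, 400, 800, 1500])   # discrete counts
avg_speed = np.array([9.5, 9.1, 8.2, 7.0, 6.1])     # continuous measurements (m/s)
# Ordinary least-squares fit: speed = slope * meters + intercept
slope, intercept = np.polyfit(meters_run, avg_speed, 1)
correlation = np.corrcoef(meters_run, avg_speed)[0, 1]
print(f"slope={slope:.4f} m/s per extra meter, r={correlation:.2f}")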
Qualitative Data: Binomial Data, Nominal Data, and Ordinal Data
We commonly use quantities and qualities to classify or categorize something. It is easy to categorize by quantity, but categorizing qualitative data is not so easy. There are three main kinds of qualitative data.
There is the case where results are displayed as binary (binomial) data, where the output is one of two mutually exclusive categories: right/wrong, true/false, or accept/reject.
There is also the situation where we are collecting unordered or nominal data, and we assign individual items to named categories that do not have an implicit or natural value or rank. For example, if I went through a list of results and recorded the category each one belonged to, that would be nominal data.
However, we can also have ordered or ordinal data, in which items are categorized in a way that does have some kind of natural order, such as Short, Medium, or Tall.
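As a brief illustration – a sketch using pandas categoricals to represent nominal and ordinal data (the column values are purely illustrative):
import pandas as pd
# Nominal: named categories with no natural order
colours = pd.Series(["red", "blue", "red", "green"], dtype="category")
# Ordinal: categories with a natural order
heights = pd.Categorical(
    ["Short", "Tall", "Medium", "Short"],
    categories=["Short", "Medium", "Tall"],
    ordered=True,
)
print(colours.cat.categories)   # the unordered set of category labels
print(heights.max())            # 'Tall' – ordering is meaningful here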
Importantly, there are also three types of Data structure which we should know about:
Structured data: This is relational data that can be stored in a database such as a SQL database, in tables with rows and columns. It has a relational key and can be easily mapped into pre-designed fields. Unfortunately, most data is not structured; the structured data collected by organizations represents only about 5% to 10% of all business data.
There is of course another type of data called semi-structured data: information that does not reside in a relational database because it lacks a fixed schema, but which nonetheless has structure and organizational properties that make it easier to analyse. Some examples of semi-structured data are XML and JSON documents, which do not fit an exact relational data type; NoSQL databases are considered semi-structured data stores.
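As a brief illustration – the same hypothetical customer record expressed first as a fixed, structured row and then as a semi-structured JSON document:
import json
# Structured: fixed columns, every row has the same shape
row = ("C-1001", "Ada Lovelace", "UK")
# Semi-structured: self-describing keys, with optional and nested fields allowed
document = {
    "id": "C-1001",
    "name": "Ada Lovelace",
    "country": "UK",
    "orders": [{"sku": "GCP-101", "qty": 2}],   # nested data a flat row cannot hold
}
print(json.dumps(document, indent=2))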
But what is surprising is that semi-structured data again represents only a small minority of business data (5% to 10%), so the last data type is the most prevalent one: unstructured data.
Unstructured data represents around 80% of all the data collected in today’s businesses. It is so prevalent that it includes video, voice, email, multimedia content, music, social media chat and photos, amongst many other formats. Note that while these sorts of files may have an internal structure, they are still considered unstructured, because the data they contain doesn’t fit neatly into a database schema.
Unstructured data is everywhere; it is not just ubiquitous, it is in fact the way most individuals and organizations conduct their work and social communications. After all, social media chat, voice, text and video are how most people live and interact with others – by exchanging unstructured data. However, to complicate things, as with structured data, unstructured data can be further classified as either machine-generated or human-generated.
Here are some examples of machine-generated unstructured data:
Computer logs: these contain data that yields little understanding unless you know the context.
IoT sensors: these will provide streams of binary data from a wide variety of machine or environmental sensors
Satellite images: These include weather data or satellite surveillance imagery.
Scientific data: This includes seismic imagery, atmospheric data and other readings from which, without context, it is difficult to extract any meaning.
Photographs and video: This type of information includes security, surveillance, and traffic video.
Radar or sonar data: This includes the visual, audio, meteorological, oceanographic and seismic profiles that technologies such as autonomous vehicles rely on.
The following list shows a few examples of human-generated unstructured data:
Chats, e-mails, documents, and even verbal conversation.
Social media data: This source of data is generated from the social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
Mobile data: This includes shadow IT where users store data, text messages and location information in the cloud.
Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
The unstructured data group is a vast source of raw data; it is growing quickly and has become far more pervasive than traditional documented, text-based files. Unstructured data can be a problem with regard to security, as it is easily leaked beyond the business boundaries, but that is not the real issue. Social media and unstructured data such as YouTube videos, chat texts and social media comments are typically easy to access, and they allow data miners to determine the posters’ attitudes and sentiments. Indeed, analysing social media comments for sentiment yields a valuable form of information that can help in business decision making.
Another important distinction we must make when evaluating the business is what type of data we are storing and processing. There are typically a few categories of data that have very distinct characteristics, so we must be able to identify them and cater for their specific requirements. In most business use cases where data is analysed, or planned to be analysed, for business intelligence purposes, it will fall into one of three types:
Transactional data: this type of data is derived from application web servers – on-premises, web, mobile or SaaS applications – producing database records for transactions. These transaction-focused systems are write-optimised in order to support a high throughput of customer transactions, whether that be sales on an e-commerce web server or process transactions on an industrial robotic controller. Their purpose is to record transactions by creating and storing data. A preferred collection and processing method for transactional data is in-memory caching and processing, with the data stored in SQL or NoSQL databases.
Data files: this category relates to log files, search results and historical transactional reports. The data comes in relatively large packets and is slow moving, so it can be managed with traditional disk reads/writes and database storage.
Messaging & events: this type of data, known as data streams, consists of very small packets in very high volumes and at high velocity, which require real-time handling. It can come from IoT sensors or the internet but is typically collected and managed using publish/subscribe queues and protocols. Data streams may have specific storage requirements if they relate to industrial applications that require real-time handling, such as stream-enabled tables in NoSQL databases or stream-optimised databases designed for stream management, storage and processing.
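As a foretaste of the tooling covered later in the book – a hedged sketch of publishing one such small event to a Cloud Pub/Sub topic using the google-cloud-pubsub Python client (the project and topic names are hypothetical, and authentication is assumed to be configured in the environment):
import json
from google.cloud import pubsub_v1
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-events")  # hypothetical IDs
event = {"sensor_id": "pump-7", "rpm": 1480, "ts": "2019-10-10T13:55:36Z"}
# Pub/Sub messages are raw bytes; serialise the event as JSON
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("Published message ID:", future.result())  # blocks until the publish completes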
In addition, data has other important attributes that we need to consider. It can have velocity, where it is fast moving and may be short lived, i.e. it has very high entropy and loses its value very quickly – for example, the revolutions-per-second reading obtained from a machine sensor. On the other hand, some operational data is slow moving but of potentially long-term value, such as monthly sales figures. We can categorise these types of data as hot or cold respectively, and how we store them will depend on our design trade-offs. Generally speaking, hot data is streaming data, which is processed in real time, in memory or in cache, for fast and efficient processing. Examples of streaming data are industrial sensors, other IoT alarms, and the publish/subscribe messages used to control industrial equipment and operational processes. Hot data’s value is measured in milliseconds and requires immediate processing.
Cold data, on the other hand, has high durability and very low entropy and can be stored for long periods. Examples of cold data would be file data, particularly historical data such as reports and previous years’ transaction data, which is stored and used for reference. Another characteristic of cold data is that it may require only infrequent access, and this is important to consider when designing appropriate storage.
To summarise the respective operational qualities: hot data is fast moving, high-entropy streaming data whose value is measured in milliseconds and which is processed in memory or in cache; cold data is durable, low-entropy data that is accessed infrequently and stored for the long term; and some data, such as transactional data, falls somewhere along the spectrum in a warm area with a mix of these qualities.
As we have seen, there are several types of data which we are required to manage and handle proficiently, and that requires an understanding of several factors. Firstly, we should consider the data structure: does it have a fixed schema, which makes it suitable for a standard relational SQL database, or is it JSON (schema-free) or key/value style unstructured or semi-structured data, in which case NoSQL or in-memory storage and processing will be the appropriate method? We will discuss this further later on, but for now all we need to understand is that we must make the choice that matches our requirements. We therefore need to plan how often we will need to access the data and with what latency, and this will depend on the data’s characteristics, i.e. whether it is hot, warm or cold. As a general rule we should store the data in the same manner we wish to access it; if we only require infrequent access, store it in cold or warm storage. Lastly, and very importantly, we should ensure that we handle and manage the data in the most cost-effective manner that meets our operational requirements – there can be significant unnecessary costs if you choose an inefficient or inappropriate method. We will go into all of these design choices much later, when we consider cloud deployments and the plethora of tools and choices we have to meet our design requirements in an efficient and cost-effective manner.
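As an illustrative sketch only – using the google-cloud-storage Python client to attach a lifecycle rule that demotes objects to the cheaper Coldline class once they turn cold (the bucket name is hypothetical, and credentials are assumed to be configured in the environment):
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket("my-archive-bucket")   # hypothetical bucket
# Objects older than 90 days are rarely accessed, so demote them to Coldline
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()                                    # apply the updated lifecycle configuration
for rule in bucket.lifecycle_rules:               # confirm the rules now on the bucket
    print(rule)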
Chapter 3 – Deriving Knowledge from Information
The European Committee for Standardization’s official Guide to Good Practice in Knowledge Management says: ‘Knowledge is the combination of data and information, to which is added expert opinion, skills and experience, to result in a valuable asset which can be used to aid decision making.’
In the introductory chapter to this book we saw that information can be distilled from raw data, but we did not examine in any depth how we manage this feat. The answer, of course, is through the application of the techniques that are the subject of this introduction – data analytics.
Data Analytics
In this section we will investigate the data analytics methodologies and technologies that are feasible for SMEs in the pursuit of business intelligence. Most small and medium enterprise (SME) businesses still run on spreadsheets, and that isn’t an issue, as they perform more than adequately for the consumers of strategic, tactical and even operational information. Indeed, so successful are spreadsheets that you will find weaning executives, managers and decision-makers off their favourite analytical tool is easier said than done.
Spreadsheets provide a way for managers and executives to analyse data and importantly get that data away from the control of IT. Spreadsheets allow managers to do their own data preparation and analysis. It also provides a means of self-sufficiency and the ways that spreadsheets can accomplish this feat are through:
Financial Modelling: Spreadsheets are great for the kind of assumptions and testing needed to put together month-by-month forecasts of financial performance.
Hypothesis testing: Spreadsheets are fast and easy to use so are perfect for on the spot calculations or hypothesis checking on a new data set.
One-Time Analysis: Spreadsheets are great for one-time modelling as you can quickly load source data, run an analysis, and draw conclusions quickly.
There are several types of data analytics that are commonly used in business. Descriptive analysis is the first type of data analysis that is usually conducted, as it describes the main aspects of the data being analysed. For example, it may describe how well a certain model of mobile phone is performing by comparing its sales figures with the norm. This allows comparisons to be made among different models of phone and helps in decision making, as it aids in predicting what stock holdings are required per model.
There is another common type of data analysis which is called exploratory analysis and here the goal is to look for previously unknown relationships. This type of analysis is a way to find new connections and to provide future recommendations and is commonly used in supermarket basket analysis to find interesting correlations between the products bought by a customer on a specific visit.
Predictive analysis as the name suggests predicts future happenings by looking at current and past facts. This sounds very grand but can also be as simple as trending analysis which graphs performance against time so that a researcher can see straightaway if there is a recognisable and predictable pattern or not.
There is also inferential analysis where a small sample is used to infer a result from a much larger sample, this method is commonly used in the analysis of voters in exit polls. Causal analysis is used to find out what happens to one variable when you change some other variable. For example, how are sales affected if a product is placed in a different location, adjacent to another product or on a higher/lower shelf?
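As a small illustration with made-up figures – a Python sketch showing descriptive analysis (comparing each phone model’s sales with the overall norm) and a simple trend line, the most basic form of predictive analysis:
import numpy as np
import pandas as pd
sales = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "model_a": [120, 132, 128, 141, 150, 158],
    "model_b": [90, 88, 85, 80, 78, 75],
})
# Descriptive: how does each model compare with the overall average?
means = sales[["model_a", "model_b"]].mean()
print(means - means.mean())
# Predictive (trend): fit a straight line to model A's sales over time
slope, intercept = np.polyfit(sales["month"], sales["model_a"], 1)
print(f"Model A is growing by roughly {slope:.1f} units per month")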
Evaluating Information
Gaining information through data analytics may be a better basis for decision making than nothing – or just gut feeling – but at this level it isn’t really knowledge, as some amount of knowledge was required to understand the problem in the first place. Furthermore, knowledge was required to work out what data would be needed to provide the information to prove or disprove a hypothesis, and then knowledge had to be referenced in order to comprehend and validate the information. Hence, although knowledge may sit above data and information in the DIKW pyramid, it may not be a perfect fit; it may be better considered an actionable component derived from the original data and information. For now, however, the focus is on data and, most importantly, how we derive information from the raw data that we collect – sometimes not just from the by-products of our operations but from the social interactions with our customers.
There has recently been an epiphany regarding the worth of data and its value to an organisation. Data is something most companies, even SMEs, have in potentially vast quantities; typically they had no use for it, so they previously stored only what was necessary for historical reporting and ditched the rest due to the high costs associated with storage. Over the last decade the cost of storage has plummeted, and with the advent of big data and cloud storage it became possible to store and analyse vast quantities of all sorts of data. Moreover, cheap cloud storage supplemented with new analytical techniques and tools, which promise to reveal insights mined from this data through predictive and prescriptive analytics, has changed executive thinking. Now there is potential value, even for the small business, in the collection, storage and analysis of data based on predictions and trends. Consequently, business and technology leaders have begun building data warehouses, and in some cases data lakes, in order to harvest and hoard this precious commodity of raw data.
An interesting aside, though, is that although just over 52% of SME companies that have actively pursued a BI or data analytics programme say they derived benefits, few seem so far to have generated quantifiable success. This may be down to the fact that only around 1% of all the data they harvest is actually analysed.
For several years the procedures and techniques of data mining and the manipulation of big data were perceived to be the domain of big enterprises with big budgets. However, open source tools allied with cloud computing services have brought data analytics within the reach of even the most financially constrained SME. For an SME to succeed in such an endeavour, though, requires that they understand what data is, how it is analysed, for what purpose and, very importantly, how it is managed. This is because, for all the rhetoric, data analytics is only as good as the data you feed it. With data analytics the computer maxim of garbage in, garbage out is a certainty, and if we are looking for quality information then we need to ensure quality data at the source. In addition, we also need to know the right questions to ask of our data, and that is where most companies stumble – they often just don’t know what they want.
The information that SME businesses require comes from many sources and has many characteristics. Some of these may be desirable – essential, comprehensive and accurate – but others may not be; for example, the derived information may be incomplete, unverified, cosmetic or extraneous. In order to evaluate the differing characteristics of information when considering its value, we need a process to put that information through in order to characterize its attributes as valid information. Today we have vast amounts of new information sourced from data collected from many diverse sources, such as social media, online news sites and forums, arriving in many formats, such as video, text, memes or photographs, and these are not so easily classified and verified, especially when we are doing text or sentiment analysis.
To complicate matters further, the marketplace is changing due to the advent of social media and ubiquitous mobile communication, which make customers far better informed. This proliferation of information comes about through active collaboration, as customers provide product reviews and exchange views on goods and services with potential customers on social media sites. As a result, customers are making more informed decisions based upon data that they feel they can trust, which ultimately makes retaining customers more difficult. Indeed, many retailers are now actively targeting customer retention through improved goods and services and by taking a more proactive position on social media marketing.
Therefore when we are evaluating the credibility, veracity and quality of the information we must also consider its provenance or source especially if it is coming from the internet. There are several key areas to consider:
Authority of the source or the publisher (this is especially important when taking data from social media sources)
Objectivity of the author (again social media sources can be difficult to verify and objectivity or bias hard to quantify)
Quality of the information
Currency of the information
Relevancy of the information
These were the generally accepted steps taken to verify and authenticate information in the pre-internet, pre-social-media days of the printed press, journals and libraries. However, these steps are just as important today, when we have to authenticate information from the internet. For instance, when we evaluate authority we have to ask several questions of the source, and this does not just apply to social media authors, as data can come from many diverse locations – sensors, servers, third-party distributors/aggregators and the Internet of Things – and any of these can be fraudulent:
Who is the source?
What are their credentials?
What is the source's reputation?
Who is the publisher, if any?
Is the source associated with a reputable institution or organization?
To evaluate objectivity we ask the following questions, especially of data obtained through third-party brokers:
Does the author of the data, information or algorithm exhibit a bias?
Does the information appear valid and true?
To evaluate quality:
Is the information well structured?
Is the information source legitimate?
Does the information have the required features and attributes?
In addition, we may take into account the characteristics of the information itself; for example, is the information:
Accurate
Comprehensive
Unbiased
Timely
Reliable
Verifiable
Current
Valuable
Relevant
These are all characteristics we should check for when dealing with information, especially when it is sourced from the internet. Information coming from the internet should particularly be checked for being timely and current, and hence relevant. One of the issues with search engine algorithms is that they typically rank the most popular relevant entries highest, but these may not be the most timely and current. Hence you may end up supporting your hypothesis with data that is ten years old.
Bias and completeness should also be considered, as information on social media sites is often supplied in a one-sided manner in order to support the publisher’s agenda. This phenomenon isn’t particularly new – traditional printed media were at this game for centuries – but it has become more pronounced and prevalent where social media on the internet is concerned. The concept of an echo chamber best describes it: like-minded individuals congregate and exchange their similar views, which is not a bad thing in itself, but it is when it excludes any opposing point of view. Of course this isn’t a healthy environment, and it is a Petri dish for cultures of false news and the propagation of deliberate falsehoods to support an agenda. This appears to have exacerbated the all-too-human condition best expressed in the Simon and Garfunkel song The Boxer: ‘All lies and jest, still a man hears what he wants to hear and disregards the rest.’
We must be careful how we handle information, as can be seen from the proliferation of, and perceived damage caused by, ‘fake news’; information can have many characteristics.
Within a business or organization, information may come from several sources and be categorized by the role that it plays. Hence, within an SME we may find that information falls into five main categories according to its role:
Planning - A business needs to know what resources it has (e.g. cash, people, inventory, property, customers). It needs information about the markets in which it operates and the actions of competitors.
Recording – Financial transactions, Sales orders, stock invoices and inventory all need to be recorded.
Controlling – Information is required to apply controls and to see if plans are performing better or worse than expected.
Measuring – Performance in a business needs to be measured to ascertain whether sales and profits are meeting targets and operational costs are controlled.
Decision making – Within the decision-making group we find subsets of information that further separate it into three classes – operational, tactical and strategic – depending on the information’s role and purpose.
(1) Strategic information: this is highly summarized information used to help plan the objectives of the business as a whole and to measure how well those objectives are being achieved. Examples of strategic information include:
Profitability of each part of the business
Size, growth and competitive structure of the markets in which a business operates
Investments made by the business and the returns (e.g. profits, cash inflows) from those investments
(2) Tactical Information: this is used to decide how the resources of the business should be employed. Examples include:
Information about business productivity (e.g. units produced per employee; staff turnover)
Profit and cash flow forecasts in the short term
Pricing information from the market
(3) Operational information: this information is used to make sure that specific operational tasks are executed as planned/intended (i.e. things are done properly). For example, a production manager will want information about the extent and results of quality control checks carried out in the manufacturing process.
Cognitive bias and its impact on data analytics
Cognitive bias is defined as a limitation in a person’s objective thinking that comes about due to their favouring of information that matches their personal experience and preferences.
The problem is that while data analytics technology can produce results, it is still up to the individuals to interpret those results. Furthermore they may unwittingly even skew the entire process by selecting what data should be analysed, which can cause digital tools used in predictive analytics and prescriptive analytics to generate false results.
Cognitive Bias is not the only bias we should be aware of as there are several more that can have a telling effect on data analysis and how the results are interpreted:
Clustering illusion - the tendency for individuals to want to see a pattern in what is actually a random sequence of numbers or events.
Confirmation bias - the tendency for individuals to value new information that supports existing ideas.
Framing effect - the tendency for individuals to arrive at different conclusions when reviewing the same information depending upon how the information is presented.
Group think - the tendency for individuals to place high value on consensus.
Analysts should be aware of the potential pitfalls of deploying and using predictive modelling without examining the provenance of the data selected for analysis for cognitive bias. For example, over the last decade pollsters and election forecasters around the world have deployed predictive analysis models with shockingly poor results. This was due chiefly to an over-reliance on weak polling data and flawed predictive models, which resulted in unpredicted outcomes.
In this chapter we have learned that knowledge, information and data differ mainly in abstraction, with data being the least abstract and knowledge the most. Information is