AI Primer
● Data Science: the field that deals with both structured and unstructured data and comprises everything related to data cleansing, preparation, and analysis. It combines statistics, mathematics, programming, problem-solving, capturing data in ingenious ways, and the ability to look at things differently. In simple terms, it is the umbrella of techniques used to extract insights and information from data.
● Big Data: Big Data refers to humongous volumes of data that cannot be processed
effectively with the traditional applications that exist. The processing of Big Data begins
with the raw data that isn’t aggregated and is most often impossible to store in the memory
of a single computer. Gartner defines Big Data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
● Data Analytics: the science of examining raw data to draw conclusions from that information. Data Analytics involves applying an algorithmic or mechanical process to derive insights, for example running through several data sets to look for meaningful correlations between them. It is used in several industries to allow organisations and companies to make better decisions, as well as to verify or disprove existing theories or models. The focus of Data Analytics lies in inference, which is the process of deriving conclusions that are solely based on what the researcher already knows.
Data Structure
Structured vs Unstructured Data
Machine-generated structured data:
● Sensor data: radio frequency ID tags, smart meters, medical devices, and Global Positioning System data are all machine-generated structured data. Supply chain management and inventory control are what get companies interested in this.
● Web log data: when systems and mechanisms such as servers, applications and networks operate, they soak up different types of data about that operation, producing enormous piles of data of diverse kinds. Based on this data, you can manage service-level agreements or predict security breaches.
● Point-of-sale data: when digital transactions take place over the counter of a shopping mall, the machine captures a lot of data. This is machine-generated structured data related to the barcode and other relevant details of the product.
● Financial data: computer programs are now used with financial data far more often, and processes are automated with their help. Take the case of stock trading: it carries structured data such as the company symbol and dollar value. Part of this data is machine generated and some of it is human generated.
Machine-generated unstructured data:
● Satellite images: weather data, or the satellite surveillance imagery that government agencies procure, is machine-generated unstructured data. Google Earth and similar mechanisms aptly illustrate the point.
● Scientific data: all scientific data, including seismic imagery, atmospheric data, high-energy physics and so forth, is machine-generated unstructured data.
● Photographs and video: when machines capture images and video for the purposes of security, surveillance and traffic, the data produced is machine-generated unstructured data.
● Radar or sonar data: this includes vehicular, meteorological, and oceanographic seismic profiles.
Human-generated structured data:
● Input data: when a human user enters input such as name, age, income, non-free-form survey responses etc. into a computer, it is human-generated structured data. Companies can find this type of data quite useful in studying consumer behaviour.
● Clickstream data: this is the type of data generated when a user clicks a link on a website. Businesses like this type of data because it allows them to study customer behaviour and purchase patterns.
● Gaming-related data: when a human user makes a move in a game on a virtual platform, it produces a piece of information. How users navigate a gaming portfolio is a source of a lot of interesting data.
Human-generated unstructured data:
● Text internal to your company: this is the type of data that is restricted to a given company, such as documents, logs, survey results, emails etc. Such enterprise information forms a big part of the unstructured text information in the world.
● Social media data: this kind of data is generated when human users interact with social media platforms such as Facebook, Twitter, Flickr, YouTube, LinkedIn etc.
● Mobile data: this type of data includes information such as text messages and location information.
Characteristics
● Robustness: structured data is robust; unstructured data is not.
● Query performance: structured data allows complex joins through structured queries; for unstructured data, only textual queries are possible.
Storage Techniques
Structured Data Storage Technique
Block storage / block level storage:
● This type of data storage is used in the context of storage-area network (SAN) environments. In such environments, data is stored in volumes, which are also referred to as blocks.
● An arbitrary identifier is assigned to every block. It allows the block to be stored and
retrieved but there would be no metadata providing further context.
● Virtual machine file system volumes and structured database storage are the use cases
of block storage.
● When it comes to block storage, raw storage volumes are created on the device. With the
aid of a server-based system, the volumes are connected and each one is treated as an
individual hard drive.
Unstructured Data Storage Technique
Object storage / object-based storage:
● This particular technique is basically a way of storing, organising and accessing data on disk. The difference, however, is that it does so in a more scalable and cost-effective manner.
● This kind of storage system makes it possible to retain huge volumes of unstructured data.
When it comes to storing photos on Facebook, songs on Spotify, or files in collaboration
services such as Dropbox, object storage comes into play.
● Each object incorporates data, a lot of metadata and a singularly unique identifier. This
kind of storage can be done at different levels such as device level, system level and
interface level.
● Since objects are robust, this kind of storage works well for long-term storage of data
archives, analytics data and service provider storage with SLAs linked with data delivery.
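To make the contrast concrete, here is a purely illustrative Python sketch (in-memory dictionaries standing in for real storage systems; all class and method names are invented) of the two access models: blocks addressed by arbitrary identifiers with no metadata, versus whole objects carrying a unique identifier and descriptive metadata.

```python
import uuid

# Illustrative sketch only: real block and object stores are distributed
# systems, not in-memory dictionaries.

class BlockStore:
    """Block storage: fixed-size blocks addressed by an arbitrary id, no metadata."""
    BLOCK_SIZE = 4096

    def __init__(self):
        self.blocks = {}                      # block_id -> raw bytes

    def write(self, data: bytes) -> list:
        ids = []
        for i in range(0, len(data), self.BLOCK_SIZE):
            block_id = uuid.uuid4().hex       # arbitrary identifier, no further context
            self.blocks[block_id] = data[i:i + self.BLOCK_SIZE]
            ids.append(block_id)
        return ids                            # the caller must remember the ordering

    def read(self, ids: list) -> bytes:
        return b"".join(self.blocks[i] for i in ids)


class ObjectStore:
    """Object storage: whole objects with a unique id and descriptive metadata."""

    def __init__(self):
        self.objects = {}                     # object_id -> (data, metadata)

    def put(self, data: bytes, metadata: dict) -> str:
        object_id = uuid.uuid4().hex          # singularly unique identifier
        self.objects[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str):
        return self.objects[object_id]


photo_store = ObjectStore()
oid = photo_store.put(b"...jpeg bytes...", {"user": "alice", "taken": "2020-07-01"})
data, meta = photo_store.get(oid)             # metadata travels with the object
```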
● Definition: structured data refers to any data that resides in a fixed field within a record or file; this includes data contained in relational databases and spreadsheets. Unstructured data (or unstructured information) is information that either does not have a predefined data model or is not organised in a pre-defined manner.
● Growth: structured data accounts for about 20% of the total existing data; experts estimate that 80% of the data in any organisation is unstructured.
Small Data vs Big Data
● Definition: small data is data that is ‘small’ enough for human comprehension, in a volume and format that makes it accessible, informative and actionable. Big Data refers to data sets that are so large or complex that traditional data processing applications cannot deal with them.
● Volume: small data is in most cases in the range of tens or hundreds of GB, in some cases a few TB (1 TB = 1000 GB); Big Data is more than a few terabytes (TB).
● Velocity (rate at which data appears): small data has a controlled and steady data flow, and data accumulation is slow; Big Data can arrive at very fast speeds, and enormous data can accumulate within very short periods of time.
● Variety: small data is structured data in tabular format with a fixed schema, plus semi-structured data in JSON or XML format; Big Data covers high-variety data sets which include tabular data, text files, images, video, audio, XML, JSON, logs, sensor data etc.
● Veracity (quality of data): small data contains less noise as it is collected in a controlled manner; for Big Data the quality of data is usually not guaranteed, and rigorous data validation is required before processing.
● Value: small data supports business intelligence, analysis and reporting; Big Data supports complex data mining for prediction, recommendation, pattern finding etc.
● Time variance: for small data, historical data is equally valid as it represents solid business interactions; with Big Data, in some cases data gets old quickly (e.g. fraud detection).
● Data location: small data lives in databases within the enterprise, on local servers etc.; Big Data is mostly in distributed storage on the Cloud or in external file systems.
This, in turn, is the main disadvantage of centralised systems: they are fragile, since any problem that affects the central server can generate chaos throughout the system. Distributed systems, by contrast, are more stable, because the entirety of the system's information is stored in a large number of nodes that operate on equal terms with one another.
This same feature is what gives distributed networks a higher level of security: to carry out a malicious attack, one would have to attack a large number of nodes at the same time, because the information is distributed among the nodes of the network. If a legitimate change is made, it will be reflected in the rest of the nodes of the system, which will accept and verify the new information; but if an illegitimate change is made, the rest of the nodes will be able to detect it and will not validate this information. This consensus between nodes protects the network from deliberate attacks or accidental changes of information.
In addition, distributed systems have an advantage over centralised systems in terms of network speed. Since the information is not stored in a central location, a bottleneck is less likely, that is, a situation in which the number of people attempting to access a server is larger than it can support, causing waiting times and slowing down the system.
Also, centralized systems tend to present scalability problems since the capacity of the server is
limited and can not support infinite traffic. Distributed systems have greater scalability, due to the
large number of nodes that support the network.
Finally, in a distributed network the removal of any of the nodes would not disconnect any other node from the network. All the nodes are connected to each other without necessarily having to pass through one or several local centres. In this type of network the centre/periphery division disappears, and with it the power to filter the information that flows through it, which makes it a practical and efficient system.
Comparative Summary
● Security
○ Centralised: if someone has access to the server with the information, any data can be added, modified and deleted.
○ Distributed: all data is distributed between the nodes of the network. If something is added, edited or deleted on any computer, it will be reflected in all the computers in the network. If a legitimate change is made, the new information is disseminated among the other users throughout the network; otherwise, the data is overwritten to match the other nodes. The system is therefore self-sufficient and self-regulating, and the databases are protected against deliberate attacks or accidental changes of information.
● Availability
○ Centralised: if there are several requests, the server can break down and no longer respond.
○ Distributed: can withstand significant pressure on the network. All the nodes in the network have the data, so requests are distributed among the nodes. The pressure therefore does not fall on one computer but on the entire network, and the total availability of the network is much greater than in the centralised one.
● Accessibility
○ Centralised: if the central storage has problems, you will not be able to obtain your information until the problems are solved. In addition, different users have different needs, but the processes are standardised and can be inconvenient for customers.
○ Distributed: given that the number of computers in the distributed network is large, DDoS attacks are possible only if their capacity is much greater than that of the network, which makes them very expensive. Compared with a centralised model, the response time is very similar in this case. Distributed networks can therefore be considered secure.
● Data transfer rates
○ Centralised: if the nodes are located in different countries or continents, the connection with the server can become a problem.
○ Distributed: the client can choose the node and work with all the required information.
● Scalability
○ Centralised: centralised networks are difficult to scale because the capacity of the server is limited and the traffic cannot be infinite. In a centralised model, all clients are connected to the server and only the server stores all the data, so all requests to receive, change, add or delete data go through the main computer. But server resources are finite, so it can work effectively only for a specific number of participants; if the number of clients is greater, the server load may exceed the limit during peak times.
○ Distributed: distributed models do not have this problem, since the load is shared among several computers.
Symbolic AI
● Symbolic AI reached its peak popularity during the “Expert Systems” boom of the 1980s.
● Expert systems are a logical and knowledge-based approach.
● Their power came from the expert knowledge they contained, but it also limited the
further development of expert systems.
● The knowledge acquisition problem, and the difficulty of growing and updating the knowledge base, are the major challenges of expert systems.
● At that time, a new type of AI approach, going beyond rule-based technologies, became necessary.
Machine Learning
● Machine learning, reorganised as a subfield of AI, started to flourish in the 1990s.
● Different from Symbolic AI, machine learning does not require humans to know the
existing rules.
● It arises from the question: “could a computer go beyond “what we know how to order it
to perform” (Symbolic AI), and learn on its own how to perform a specified task?”
● Based on this, with machine learning, humans input data as well as the expected answers derived from that data, and the machine "learns" by itself and outputs the rules (a toy sketch follows at the end of this section).
● These learned rules can then be applied to new data to produce new answers.
● Figure 3 illustrates the simple structure of machine learning.
● Starting from the 1990s, the field changed its goal from achieving artificial intelligence to tackling solvable problems of a practical nature.
● It shifted focus away from the symbolic approaches it had inherited from AI, and towards
methods and models borrowed from statistics and probability theory (Langley 2011).
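To illustrate "data plus expected answers in, rules out" with a toy sketch (the temperature data, labels and threshold rule below are invented purely for illustration):

```python
# Symbolic-AI style: a human supplies the rule explicitly.
def is_hot_rule(temperature_c: float) -> bool:
    return temperature_c >= 25.0              # hand-written threshold

# Machine-learning style: supply data and the expected answers,
# and let the program derive the rule (here, a simple threshold).
samples = [14.0, 18.0, 21.0, 27.0, 30.0, 33.0]      # input data
answers = [False, False, False, True, True, True]    # expected answers ("hot?")

def learn_threshold(xs, ys):
    """Learn the midpoint between the warmest 'not hot' and coolest 'hot' sample."""
    hot = [x for x, y in zip(xs, ys) if y]
    not_hot = [x for x, y in zip(xs, ys) if not y]
    return (max(not_hot) + min(hot)) / 2

threshold = learn_threshold(samples, answers)        # the learned "rule"
print(threshold)                                     # 24.0

# The learned rule can then be applied to new data to produce new answers.
print(26.0 >= threshold)                             # True
```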
Deep Learning
● AI has gone through a series of ups and downs, often referred to as “AI summers and winters”, as interest in AI has alternately grown and diminished.
● This is illustrated in Figure 4. In this evolution roadmap, we can see AI is a general field,
which covers machine learning.
● Deep learning is a hot branch of machine learning and the symbol of the current AI boom, which began about eight years ago.
● Although AI research started in the 1950s, its effectiveness and progress have been most
significant over the last decade, driven by three mutually reinforcing factors:
○ The availability of big data: various data sources, including businesses, e-
commerce, social media, science, wearable devices, government etc.
○ Dramatic improvement of machine learning algorithms: the sheer amount of available data accelerates algorithm innovation
○ More powerful computing ability and cloud-based services: make it possible to
realise and implement the advanced AI algorithms, like deep neural networks
● Significant progress in algorithms, hardware, and big data technology, combined with the financial incentive to find new products, has also contributed to the AI technology renaissance.
● Today, AI has transformed from “let the machine know what we know” to “let the machine learn what we may not know” to “let the machine automatically learn how to learn”.
● Researchers are working on much wider applications of AI that will revolutionise the ways in which people work, communicate, study and enjoy themselves.
● Products and services incorporating such innovation will become part of people’s day-to-
day lives in the near future.
● Activation
○ Activation functions are mathematical equations that determine the output of a
neural network.
○ The function is attached to each neuron in the network, and determines whether it
should be activated (“fired”) or not, based on whether each neuron’s input is
relevant for the model’s prediction.
● Feed-forward and backpropagation learning
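A minimal NumPy sketch (with made-up input and weight values) of two common activation functions and a single feed-forward step through one neuron:

```python
import numpy as np

# Two common activation functions.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # squashes any input into (0, 1)

def relu(z):
    return np.maximum(0.0, z)                  # passes positives, zeroes out negatives

# One feed-forward step for a single neuron: weighted sum of inputs plus bias,
# then the activation decides how strongly the neuron "fires".
x = np.array([0.5, -1.2, 3.0])                 # inputs (illustrative values)
w = np.array([0.4, 0.1, -0.6])                 # connection weights
b = 0.05                                       # bias

z = np.dot(w, x) + b                           # pre-activation value
print(sigmoid(z), relu(z))                     # post-activation outputs
```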
Primer Knowledge of AI
● To further understand how current AI works, we will introduce the primer knowledge of
deep learning in this section.
● As machine learning is the basis of deep learning, a general introduction to some basic machine learning knowledge is given first.
Machine Learning
● Machine learning involves the creation of algorithms which can modify/adjust themselves without human intervention to produce the desired output, by learning from the input data they are fed.
● Through this learning process, the machine can categorise similar people or things, discover or identify hidden or unknown patterns and relationships, and detect anomalous behaviours in the given data, which allows it to predict or estimate possible outcomes or actions for future data.
● Therefore, to do machine learning, we usually follow five steps: data collection, data preparation, modelling, understanding the results, and delivering the results (as shown in Figure 6).
Supervised Learning
● As its name suggests, a supervised learning algorithm is trained/taught using given examples.
● The examples are labelled, meaning the desired output for each input is known.
● For example, a credit card application can be labelled either as approved or rejected.
● The algorithm received a set of inputs (the applicants’ information) along with the
corresponding outputs (whether the application was approved or not) to foster learning.
● The model building or the algorithm learning is a process to minimise the error between
the estimated output and the correct output.
● Learning stops when the algorithm achieves an acceptable level of performance, such as
the error is smaller than the pre-defined minimum error.
● The trained algorithm is then applied to unlabelled data to predict the possible output value, such as whether a new credit card application should be approved or not (a sketch follows at the end of this section).
● This is helpful for what we are familiar with as Know Your Customer (KYC) in the banking business.
● There are multiple supervised learning algorithms: Bayesian statistics, regression
analysis, decision trees, random forests, support vector machines (SVM), ensemble
models and so on.
● Practical applications include risk assessment, fraud detection, image, speech and text
recognition etc.
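As a hedged sketch of the credit-card example above, the snippet below trains a scikit-learn decision tree on a tiny invented applicant dataset and then predicts the outcome for a new, unlabelled application; the feature values, labels and model choice are illustrative assumptions only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up applicant data: [annual income (k), years employed, existing debt (k)]
X = np.array([
    [25,  1, 12],
    [48,  5,  4],
    [33,  2, 20],
    [75, 10,  3],
    [19,  0, 15],
    [62,  7,  8],
])
# Labels supplied with the data: 1 = approved, 0 = rejected.
y = np.array([0, 1, 0, 1, 0, 1])

# Training minimises the error between the estimated and the correct outputs.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Apply the trained model to a new, unlabelled application.
new_applicant = np.array([[40, 3, 6]])
print(model.predict(new_applicant))            # predicted label (0 = reject, 1 = approve)
```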
Unsupervised Learning
● Different from supervised learning, in unsupervised learning the algorithm is not trained/taught on the “right answer”. The algorithm tries to explore the given data and detect or mine the hidden patterns and relationships within it. In this case, there is no answer key. Learning is based on the similarity/distance among the given data points.
● Take bank customer understanding as an example: unsupervised learning can be used to identify several groups of bank customers, where the customers in a specific group share similar demographic information or the same bank product selections (a sketch follows at the end of this section). The learned homogeneous groups can help the bank figure out the hidden relationship between customers' demographics and their bank product selections.
● This would provide useful insights on customer targeting when the bank would like to
promote a product to new customers. Also, unsupervised learning works well with
transactional data in that it can be used to identify a group of individuals with similar
purchase behaviour who can then be treated as a single homogeneous unit during
marketing promotions.
● Association rule mining, clustering (such as K-means), nearest-neighbour mapping, self-organising maps, and dimensionality reduction (such as principal component analysis) are all common and popular unsupervised learning algorithms.
● Practical applications cover market basket analysis, customer segmentation, anomaly
detection and so on.
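The sketch below illustrates the customer-segmentation example with K-means in scikit-learn; the customer records and the choice of three clusters are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up bank customer data: [age, annual income (k), products held]
customers = np.array([
    [23, 28, 1],
    [25, 31, 1],
    [41, 72, 3],
    [45, 80, 4],
    [63, 40, 2],
    [67, 38, 2],
])

# No "right answers" are provided; the algorithm groups customers
# purely by similarity (distance) between their feature vectors.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print(segments)                 # e.g. [0 0 1 1 2 2]: three homogeneous groups (numbering is arbitrary)
print(kmeans.cluster_centers_)  # the "typical" customer of each segment
```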
Semi-Supervised Learning
● Semi-supervised learning is used to address similar problems as supervised learning.
● However, in semi-supervised learning, the machine is provided both labelled and
unlabelled data.
● A small amount of labelled data is combined with a large amount of unlabelled data.
● When the cost associated with labelling is too high to allow for a fully labelled training
process, semi-supervised learning is normally utilised.
● Using the labelled data, semi-supervised learning algorithms first label a large amount of the unlabelled data (a sketch of this pseudo-labelling idea follows at the end of this section).
● A new model is then trained using this newly labelled data set.
● For example, an online news portal wants to do web pages classification or labelling.
● Let’s say the requirement is to classify web pages into different categories (i.e. Sports,
Politics, Business, Entertainment, etc.).
● In this case, it is prohibitively expensive to go through hundreds of millions of web pages
and manually label them.
● Therefore the intent of semi-supervised learning is to take as much advantage of the
unlabelled data as possible, to improve the trained model.
● Image classification and text classification are good practical applications of semi-
supervised learning.
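One simple way to realise this idea is self-training (pseudo-labelling). The sketch below uses synthetic data and a scikit-learn logistic regression as illustrative stand-ins; the 0.9 confidence threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A small labelled set and a much larger unlabelled set (synthetic stand-ins
# for, say, manually categorised web pages vs. the uncategorised rest).
X_labelled = rng.normal(size=(20, 5))
y_labelled = (X_labelled[:, 0] > 0).astype(int)
X_unlabelled = rng.normal(size=(2000, 5))

# Step 1: train an initial model on the small labelled set.
model = LogisticRegression()
model.fit(X_labelled, y_labelled)

# Step 2: use it to pseudo-label the unlabelled data, keeping only
# the confident predictions.
probs = model.predict_proba(X_unlabelled)
confident = probs.max(axis=1) > 0.9
pseudo_labels = probs.argmax(axis=1)[confident]

# Step 3: train a new model on the labelled plus confidently pseudo-labelled data.
X_combined = np.vstack([X_labelled, X_unlabelled[confident]])
y_combined = np.concatenate([y_labelled, pseudo_labels])
final_model = LogisticRegression().fit(X_combined, y_combined)

print(final_model.score(X_labelled, y_labelled))   # sanity check on the original labelled set
```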
Reinforcement Learning
● The intent of reinforcement learning is to find the best actions that lead to maximum reward or drive the optimal outcome.
● The machine is provided with a set of allowed actions, rules, and potential end states. In
other words, the rules of the game are defined. By applying the rules, exploring different
actions and observing resulting reactions the machine learns to exploit the rules to create
the desired outcome.
● The machine thus determines what series of actions, in what circumstances, will lead to an optimal or optimised result.
● Reinforcement learning is the equivalent of teaching someone to play a game. The rules
and objectives are clearly defined.
● However, the outcome of any single game depends on the judgement of the player, who must adjust their approach in response to the environment and to the skill and actions of a given opponent. Reinforcement learning is often utilised in gaming and robotics (a minimal Q-learning sketch follows at the end of this section).
● https://wiki.pathmind.com/deep-reinforcement-learning
● Deep reinforcement learning combines artificial neural networks with a reinforcement
learning architecture, enabling software-defined agents to learn the best possible actions in virtual environments in order to attain their goals.
● While neural networks are responsible for recent AI breakthroughs in problems like computer vision, machine translation and time series prediction, they can also be combined with reinforcement learning algorithms to create something astounding like DeepMind's AlphaGo, an algorithm that beat the world champions of the board game Go.
● Google DeepMind’s Deep Q-learning playing Atari Breakout
○ https://www.youtube.com/watch?v=V1eYniJ0Rnk
● Marl/O - Machine Learning for Video Games
○ https://www.youtube.com/watch?v=qv6UVOQ0F44
● Reinforcement Learning - Ep 30 (Deep Learning SIMPLIFIED)
○ https://www.youtube.com/watch?v=e3Jy2vShroE
● Simulation and Automated Deep Learning
○ https://www.youtube.com/watch?v=EHP47tM6ctc
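As a minimal sketch of the idea (tabular Q-learning, not deep reinforcement learning), the snippet below uses an invented five-state "line world" where the agent is rewarded for reaching the right-most state; the states, actions, rewards and hyperparameters are all illustrative assumptions.

```python
import numpy as np

# Toy "line world": states 0..4; reaching state 4 gives reward +1 and ends the episode.
# Allowed actions: 0 = move left, 1 = move right.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.2          # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

Q = np.zeros((N_STATES, N_ACTIONS))            # learned value of each (state, action) pair

def step(state, action):
    """Apply the rules of the game: move, stay on the board, reward at the goal."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

for episode in range(300):
    state, done = 0, False
    while not done:
        # Explore occasionally; otherwise exploit the best action found so far
        # (tiny noise breaks ties while the table is still empty).
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state] + rng.normal(scale=1e-6, size=N_ACTIONS)))
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate towards reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))                    # learned policy: action 1 (move right) before the goal
```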
Deep Learning
● Data is to machine learning what life is to human learning. The output of a machine
learning algorithm is entirely dependent on the input data it is exposed to.
● Therefore, to train a good machine learning model, experts need to do good data
preparation beforehand. To some extent, machine learning performance depends on the
quality of the input data.
● Deep learning follows a similar workflow to machine learning, but its main advantage is that it does not necessarily need structured data as input. Imitating the way the human brain works to solve problems, by passing queries through various hierarchies of concepts and related questions to find an answer, deep learning uses artificial neural networks to hierarchically define specific features via multiple layers (as shown in Figure 5).
● Deep learning weakens the dependence of machine learning on feature engineering,
which makes it general and easier to apply to more fields. The following section illustrates
the primer knowledge about how deep learning works.
● We know that deep learning maps input to output via a sequence of simple data transformations (layers) in an artificial neural network.
● Take face recognition as an example: as shown in Figure 7, data (a face image) is presented to the network via the input layer, which connects to one or more hidden layers. The hidden layers further connect to an output layer.
● Each hidden layer represents one level of face image features (greyscale, eye shape, facial contours, etc.). Every node on each layer is connected to the nodes on the neighbouring layer with a weight value.
● The actual processing of deep learning is done by adjusting the weights of each connection to realise the input-output mapping (a minimal sketch follows below).
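The following NumPy sketch shows this on a toy problem (XOR, standing in for the face-image example): a feed-forward pass through one hidden layer, followed by backpropagation that adjusts every connection weight to reduce the error, realising the input-output mapping. The architecture, learning rate and number of epochs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task standing in for "image in, label out": learn XOR of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer; every node is connected to the next layer by a weight.
W1 = rng.normal(scale=1.0, size=(2, 8))        # input -> hidden weights
b1 = np.zeros((1, 8))
W2 = rng.normal(scale=1.0, size=(8, 1))        # hidden -> output weights
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for epoch in range(10000):
    # Feed-forward: data flows input layer -> hidden layer -> output layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backpropagation: push the output error back through the layers and
    # adjust every connection weight a little in the error-reducing direction.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(3).ravel())                    # typically close to [0, 1, 1, 0] after training
```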