1 - Data Mining and Analysis
AI fundamentals
Deep Learning: A subset of machine learning that uses artificial neural networks
with multiple layers of nodes and connections. Classical ML can extract simpler
patterns from data, while DL excels at handling large volumes of unstructured data
such as images or natural language.
Large Language Model (LLM): A type of foundation model trained on large amounts of
text data. The first L stands for large scale (millions or billions of parameters). The
second L stands for language: these models are designed to understand and interact in
human languages because they are trained on massive text datasets. They are used in NLP
tasks such as understanding context, answering questions, generating text, and translation.
Vision model: It can "see", interpret, and generate images.
Scientific models: Used in fields such as biology, where there are models for predicting
how proteins fold into 3D shapes.
Audio model: Generates human-sounding speech or composes music, such as the next fake
Drake hit song.
Generative AI: Models and algorithms specifically crafted to generate new content.
Foundation models provide the underlying structure and understanding, while generative AI
is about harnessing that knowledge to produce something new. It is a broad field of AI
that uses algorithms to create new content such as text, images, videos, audio, code, and
simulations.
Example?
AI and Augmented AI
AI: The use of computers or machines to mimic the problem-solving and
decision-making capabilities of the human mind. It can perform tasks and
make decisions that normally require human intelligence, such as reasoning, natural
communication, and problem solving. In this view, AI replaces the need for a human.
Reinforcement learning in AI
Data Analytics and Data Science
Introduction to Data Mining and Analysis
Data Mining
• Data mining is the process of discovering insightful, interesting, and
novel patterns, as well as deriving descriptive, understandable, and
predictive models from large-scale data.
Data Matrix
• Data can often be represented or abstracted as an n × d data matrix,
with n rows and d columns, where rows correspond to entities in the
dataset, and columns represent attributes or properties of interest.
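As a minimal sketch of this representation, the plain-Python snippet below stores a small data matrix as a list of rows and reads off one row and one column. The attribute names and values are made up for illustration and do not come from the slides.

```python
# Hypothetical toy data: n = 4 entities (rows), d = 3 attributes (columns).
attributes = ["height_cm", "weight_kg", "age_years"]
D = [
    [170.0, 65.0, 30.0],
    [158.0, 52.0, 24.0],
    [181.0, 80.0, 41.0],
    [165.0, 58.0, 35.0],
]

n = len(D)      # number of rows (entities)
d = len(D[0])   # number of columns (attributes)

# A row is one entity described by all d attributes.
second_row = D[1]

# A column collects one attribute across all n entities.
j = attributes.index("weight_kg")
weight_column = [row[j] for row in D]

print(n, d)            # 4 3
print(second_row)      # [158.0, 52.0, 24.0]
print(weight_column)   # [65.0, 52.0, 80.0, 58.0]
```

A list of lists keeps the example dependency-free; an array library would be the more usual choice for real data.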
Data Matrix
• Rows: Also called instances, examples, records, transactions, objects,
points, feature-vectors, etc. Each row is given as a d-tuple
$x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$.
Attribute Classification
Discrete Attribute
Has a finite or countably infinite set of values
Examples: Zip codes, click counts, set of words in a collection
of documents (often represented as integer values)
Binary attribute is a special case of discrete attribute
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight
Continuous attributes are typically represented as floating-point
variables
Attributes
Attributes may be classified into two main types:
• Numeric Attributes: real-valued or integer-valued domain
    • Interval-scaled: only differences are meaningful, e.g., temperature
    • Ratio-scaled: differences and ratios are meaningful, e.g., age
• Categorical Attributes: set-valued domain composed of a set of symbols
    • Nominal: only equality is meaningful, e.g., domain(Sex) = {M, F}
    • Ordinal: both equality (are two values the same?) and inequality (is one
      value less than another?) are meaningful, e.g., domain(Education) =
      {High School, BS, MS, PhD}
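The four attribute types can be illustrated with a short Python sketch. The integer codes given to the Education values and the sample temperatures and ages are assumptions chosen for illustration; only the ordering of the Education values comes from the slide.

```python
from enum import IntEnum


# Ordinal attribute: the integer codes (an assumption) only encode the order.
class Education(IntEnum):
    HIGH_SCHOOL = 1
    BS = 2
    MS = 3
    PHD = 4


ms_above_bs = Education.MS > Education.BS   # ordering is meaningful

# Nominal attribute: only an equality test makes sense.
same_sex = "M" == "F"


def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0


# Interval-scaled: the ratio of two temperatures changes with the unit,
# so "twice as hot" is not a meaningful statement.
ratio_celsius = 20.0 / 10.0
ratio_fahrenheit = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)

# Ratio-scaled: the ratio of two ages is the same in years and in months.
ratio_years = 40.0 / 20.0
ratio_months = (40.0 * 12) / (20.0 * 12)

print(ms_above_bs, same_sex)             # True False
print(ratio_celsius, ratio_fahrenheit)   # 2.0 in Celsius, about 1.36 in Fahrenheit
print(ratio_years, ratio_months)         # 2.0 2.0
```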
Iris Dataset Extract
Data: Algebraic and Geometric View
• For a numeric data matrix D, each row is a d-dimensional data point (i.e., a
vector with d attributes): $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T \in \mathbb{R}^d$
Scatterplot: 2D Iris dataset, sepal length versus sepal width.
• The mean of the data matrix D is the average of all the points:
$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$, where $x_i$ is the i-th row of D.
Numeric Data Matrix
• The centered data matrix Z is obtained by subtracting the mean $\mu$ of D
from all the points: $z_i = x_i - \mu$ for $i = 1, \ldots, n$.
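The mean and the centering step can be sketched in plain Python as follows. The toy matrix is hypothetical, and the names `mean`, `Z`, and `centered_mean` are chosen here for illustration.

```python
# Hypothetical toy data matrix: n = 3 points, d = 2 attributes.
D = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 9.0],
]
n = len(D)
d = len(D[0])

# Mean: average each attribute (column) over all n points.
mean = [sum(row[j] for row in D) / n for j in range(d)]

# Centered data matrix: subtract the mean from every point.
Z = [[row[j] - mean[j] for j in range(d)] for row in D]

# After centering, every column of Z averages to zero.
centered_mean = [sum(row[j] for row in Z) / n for j in range(d)]

print(mean)            # [3.0, 5.0]
print(Z)               # [[-2.0, -3.0], [0.0, -1.0], [2.0, 4.0]]
print(centered_mean)   # [0.0, 0.0]
```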
Norm, Distance and Angle
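A minimal sketch of these three quantities, assuming the usual Euclidean definitions: the norm is the square root of a vector's dot product with itself, the distance is the norm of the difference of two vectors, and the angle follows from the cosine, which is the dot product divided by the product of the norms. The helper names and the two example vectors are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm (length) of a vector."""
    return math.sqrt(dot(a, a))


def distance(a, b):
    """Euclidean distance: the norm of the difference a - b."""
    return norm([x - y for x, y in zip(a, b)])


def cosine(a, b):
    """Cosine of the angle between two nonzero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [4.0, 3.0]

print(norm(a))   # 5.0
print(distance(a, b))
print(cosine(a, b))
print(math.degrees(math.acos(cosine(a, b))))
```

For these vectors the distance is the square root of 2 and the cosine is 24/25, up to floating-point rounding.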
Orthogonal Projection
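A minimal sketch, assuming the standard definition: the orthogonal projection of a vector b onto a nonzero vector a is p = ((a · b) / (a · a)) a, and the residual r = b - p is orthogonal to a. The function name `project` and the example vectors are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(b, a):
    """Orthogonal projection of b onto the direction of a (a must be nonzero)."""
    scale = dot(a, b) / dot(a, a)
    return [scale * x for x in a]


a = [2.0, 0.0]
b = [3.0, 4.0]

p = project(b, a)                       # component of b along a
r = [bi - pi for bi, pi in zip(b, p)]   # component of b perpendicular to a

print(p)           # [3.0, 0.0]
print(r)           # [0.0, 4.0]
print(dot(a, r))   # 0.0, so r is orthogonal to a
```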
Data: Probabilistic View
• The probabilistic view of the data assumes that each numeric
attribute X is a random variable, defined as a function that assigns a
real number to each outcome of an experiment.
Example
• Consider the sepal length attribute (X1) for the Iris dataset.
• All n = 150 values of this attribute lie in the range [4.3, 7.9], with
centimeters as the unit of measurement.
• Let us assume that these constitute the set of all possible outcomes O.
Example (cont.)
• On the other hand, if we want to distinguish between Iris flowers
with short and long sepal lengths, with long being, say, a length of 7 cm
or more, we can define a discrete random variable A as follows:
$A(v) = 0$ if $v < 7$, and $A(v) = 1$ if $v \geq 7$.
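The short/long distinction described above can be sketched as an indicator function in Python. The sample values are hypothetical numbers inside the range [4.3, 7.9], not the actual Iris measurements.

```python
def A(v):
    """Discrete random variable: 1 if sepal length v (in cm) is long, else 0."""
    return 1 if v >= 7.0 else 0


# Hypothetical sepal lengths in cm, chosen for illustration only.
sepal_lengths = [4.3, 5.1, 5.8, 6.4, 7.0, 7.9]

labels = [A(v) for v in sepal_lengths]
print(labels)   # [0, 0, 0, 0, 1, 1]
```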
Probability Mass Function
• If X is discrete, the probability mass function of X is defined as
$f(x) = P(X = x)$ for all $x \in \mathbb{R}$.
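When only a sample is available, one common way to estimate a probability mass function is by relative frequency. The sketch below takes that approach; it is an empirical estimate under that assumption, not the definition itself. The function name `empirical_pmf` and the observations are hypothetical.

```python
from collections import Counter


def empirical_pmf(values):
    """Estimate P(X = x) as the fraction of observations equal to x."""
    counts = Counter(values)
    n = len(values)
    return {x: count / n for x, count in counts.items()}


# Hypothetical observations of a discrete (0/1) random variable.
observations = [0, 0, 0, 1, 0, 1, 0, 0]

f = empirical_pmf(observations)
print(f[0], f[1])        # 0.75 0.25
print(sum(f.values()))   # 1.0
```

The estimated probabilities are non-negative and sum to 1, as any probability mass function must.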
Next…
Data Exploration