
1 - Data Mining and Analysis

The document provides an overview of data science and artificial intelligence, detailing key concepts such as machine learning, deep learning, and data mining. It discusses the roles of various experts in the field and highlights the differences between AI and augmented intelligence, as well as the importance of reinforcement learning. Additionally, it covers data analytics, data matrices, and the probabilistic view of data, emphasizing the significance of understanding data attributes and their classifications.

Data Science

Dr. Teena Sharma


DAI 101
Ph.D., University of Quebec at Chicoutimi, Canada
© IIT Roorkee India
Expert Talks: Foreign and India
1. Prof. Rajasen Gupta (Professor, McGill University, Montreal, Canada)
2. Prof. Abdellah Chehri (Royal Military College, Kingston, Canada)
3. Prof. Issouf Fofana (University of Quebec at Chicoutimi, Quebec, Canada)
4. Dr. Benoit Duglas (Thales, Canada)
5. Dr. Roshan Jain (Startup in AI, Waterloo, Ontario, Canada)
6. Mr. Abhishek (Manager, software developer, Accenture, Quebec, Canada)
7. Ms. Dhavni Sharma (International Air Transport Association (IATA), Montreal, Canada)
8. Mrs. Aakansha Chawla, MBA (Business analyst, IATA, Montreal, Canada)
9. Prof. Hitesh Upreti (Professor, Shiv Nadar University, Greater Noida)

AI fundamentals

Artificial Intelligence: Simulation of human intelligence in
machines, enabling them to perform tasks that typically require
human thinking. An early example is the chatbot ELIZA (developed
in the mid-1960s), which could mimic human-like conversation to an
extent. AI is a very broad term encompassing several techniques.

AI’s ability to learn and adapt has the potential to transform
entire industries, create new innovations, and ultimately
benefit society as a whole.

Cont.

Machine Learning: A subfield of AI focusing on
developing algorithms that allow computers to learn from
data and make decisions based on it, rather than being
explicitly programmed to perform a specific task.
These algorithms use statistical techniques to learn
patterns in data and make predictions or decisions
without human intervention. It is again a broad term,
covering both traditional statistical methods and complex
neural networks. Categories: supervised learning (SL),
unsupervised learning (UL), and reinforcement learning (RL).
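
To make "learning from data rather than being explicitly programmed" concrete, here is a minimal sketch using ordinary least squares, one of the traditional statistical methods. The function name `fit_line` and the training data are hypothetical, chosen only for illustration; the rule y = 2x + 1 is never written into the program, it is recovered from the examples.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept from example pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that happens to follow y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Use the learned parameters to predict an unseen input
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```

The same pattern (fit parameters on examples, then predict on new inputs) underlies supervised learning in general, although real models have far more parameters.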

Cont.
Deep Learning: A subset of ML based on artificial neural networks with multiple
layers of nodes and connections. Classical ML can extract simpler patterns in data,
while DL excels at handling vast amounts of unstructured data such as images or
natural language.

Foundation models: The term was popularized in 2021 by researchers at the Stanford
Institute for Human-Centered AI. Foundation models provide more generalized and
scalable AI solutions. They are large-scale neural networks pretrained on vast amounts
of data, and they serve as a base for a multitude of applications. Instead of training
a model from scratch for each specific task, you can take a pretrained foundation model
and fine-tune it for a particular application, saving both resources and time. They can
perform tasks ranging from language translation to content generation to image
recognition, and they can handle many types of input: image, audio, and text.

Cont.
Large Language Model (LLM): A type of foundation model trained on large amounts of
text data. The first L stands for large scale (millions to billions of parameters). The
second L stands for language: these models are designed to understand and interact in
human languages because they are trained on massive text datasets. They are used in NLP
tasks such as understanding context, answering questions, generating text, and translation.

Vision models: can "see", interpret, and generate images.
Scientific models: used, for example, in biology, where there are models for predicting
how proteins fold into 3D shapes.
Audio models: generate human-sounding speech or compose music, such as the next fake
Drake hit song.

Generative AI: Models and algorithms specifically crafted to generate new content.
Foundation models provide the underlying structure and understanding; generative AI is
about harnessing that knowledge to produce something new. It is a broad field of AI that
uses algorithms to create new content such as text, images, videos, audio, code, and
simulations.
Example?
AI and Augmented Intelligence

AI: the ability of computers or machines to mimic the problem-solving and
decision-making capabilities of the human mind. It can perform tasks and
make decisions that normally require human intelligence, such as reasoning, natural
communication, and problem solving. In this view, AI replaces the need for humans.

Augmented intelligence: machines and humans work together, enhancing
each other’s efforts when completing tasks. It augments human abilities, as in a
screen reader for the blind, voice navigation, an in-car collision avoidance system, or
a blind spot detection system. These systems complement our own capabilities.
So, AI or augmented intelligence?

Reinforcement learning in AI

Reinforcement learning in AI is when machines learn to make better decisions by
trying things out and getting feedback. For example, it can be used to teach a robot
how to navigate a room. When the robot performs an action, such as stopping,
turning around, or moving forward, it receives a reward or penalty based on
how well it did. The robot uses this feedback to learn and improve its
decision-making abilities, and over time it gets better at navigating the room.

Use cases: Robotics, gaming, autonomous vehicles, and recommendation systems
use reinforcement learning to improve performance. The ability to learn from
mistakes and get better over time makes reinforcement learning a critical tool in AI.

Reinforcement learning

Reinforcement learning (RL) is a machine learning (ML) technique that
trains software to make decisions that achieve the best possible results.
It mimics the trial-and-error learning process that humans use to
achieve their goals: through a feedback system, the agent learns from
its environment and optimizes its behavior.

During training, the model perceives and interprets its environment, takes
actions, and learns through trial and error. Examples include a feature in a
video game, a robot in an industrial setting, and recommendation
systems.
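
The reward-and-penalty loop described above can be sketched with tabular Q-learning, one common RL algorithm (the slides do not name a specific algorithm, so this choice is an assumption). Everything here is hypothetical and chosen for illustration: a five-cell corridor as the "room", a reward of +1 for reaching the last cell, a penalty of -0.01 for every other move, and the learning-rate, discount, and exploration values.

```python
import random

N_STATES = 5                    # corridor cells 0..4; cell 4 is the goal
ACTIONS = ["left", "right"]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration


def step(state, action):
    """Environment: move one cell and return (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    if nxt == N_STATES - 1:
        return nxt, 1.0, True       # reward for reaching the goal
    return nxt, -0.01, False        # small penalty for every other move


def train(episodes=500, seed=0):
    """Learn action values Q(state, action) by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the best known action, sometimes explore
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Q-learning update: nudge the estimate toward reward + discounted future value
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = nxt
    return q


q_table = train()
# Greedy policy for every non-goal cell
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

In this setup the agent is never told that moving right is correct; the preference for "right" in every cell emerges from the rewards and penalties it collects.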

Reinforcement Learning

Data Analytics and Data Science

Data analytics involves examining data to extract meaningful insights,
while data science encompasses a wider scope, including data
collection, cleaning, analysis, and machine learning modeling for
predictive insights and decision-making.

Data analytics focuses more on analyzing past (historical) data, that is,
explaining the past, in order to predict or forecast future outcomes and
support decision-making. E.g., Amazon product sales or temperature
prediction.

Introduction to Data Mining and Analysis

Data Mining
• Data mining is the process of discovering insightful, interesting, and
novel patterns, as well as deriving descriptive, understandable, and
predictive models from large-scale data.

• At the heart of data mining is data itself.

• We begin this course by looking at basic properties of data modeled
as a data matrix.

Data Matrix
• Data can often be represented or abstracted as an n × d data matrix,
with n rows and d columns, where rows correspond to entities in the
dataset, and columns represent attributes or properties of interest.

Data Matrix
• Rows: Also called instances, examples, records, transactions, objects,
points, feature-vectors, etc. Each row is given as a d-tuple.

• Columns: Also called attributes, properties, features, dimensions,
variables, fields, etc. Each column is given as an n-tuple.
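
A minimal sketch of the row and column views of a data matrix, using a plain list of lists. The attribute names and values are hypothetical and only serve to show that a row is a d-tuple and a column is an n-tuple.

```python
# Hypothetical data matrix: n = 4 entities (rows), d = 3 attributes (columns)
attribute_names = ["height_cm", "weight_kg", "age_years"]
D = [
    [170.0, 65.0, 30.0],
    [160.0, 55.0, 25.0],
    [180.0, 80.0, 41.0],
    [175.0, 72.0, 36.0],
]

n, d = len(D), len(D[0])

# A row is one entity described by all d attributes (a d-tuple)
row_2 = tuple(D[1])

# A column is one attribute observed on all n entities (an n-tuple)
column_1 = tuple(row[0] for row in D)

print(n, d)
print(row_2)
print(column_1)
```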

Attribute Classification
Discrete Attribute
Has a finite or countably infinite set of values
Examples: Zip codes, click counts, set of words in a collection
of documents (often represented as integer values)
Binary attribute is a special case of discrete attribute

Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight
Continuous attributes are typically represented as floating-point
variables

Attributes
Attributes may be classified into two main types:
• Numeric Attributes: real-valued or integer-valued domain
  • Interval-scaled: only differences are meaningful, e.g., temperature (measured in °C or °F)
  • Ratio-scaled: differences and ratios are meaningful, e.g., age
• Categorical Attributes: set-valued domain composed of a set of symbols
  • Nominal: only equality is meaningful, e.g., domain(Sex) = {M, F}
  • Ordinal: both equality (are two values the same?) and inequality (is one
    value less than another?) are meaningful, e.g., domain(Education) =
    {High School, BS, MS, PhD}
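
The following sketch illustrates which operations are meaningful for each attribute type. The variable names and sample values are illustrative assumptions; the Education ordering is taken from the example domain above.

```python
# Nominal: only equality checks are meaningful
same_sex = ("M" == "F")

# Ordinal: order is meaningful, so map the symbols to ranks before comparing
EDUCATION_ORDER = ["High School", "BS", "MS", "PhD"]
rank = {level: i for i, level in enumerate(EDUCATION_ORDER)}
ms_below_phd = rank["MS"] < rank["PhD"]

# Interval-scaled: differences are meaningful, ratios are not
c1, c2 = 10.0, 20.0                         # temperatures in degrees Celsius
diff_c = c2 - c1                            # a 10-degree difference is meaningful
ratio_c = c2 / c1                           # 2.0, but 20 C is not "twice as hot" as 10 C
ratio_k = (c2 + 273.15) / (c1 + 273.15)     # same temperatures on the Kelvin scale

# Ratio-scaled: both differences and ratios are meaningful
age_ratio = 40 / 20                         # a 40-year-old is twice as old as a 20-year-old

print(same_sex, ms_below_phd, diff_c, ratio_c, round(ratio_k, 4), age_ratio)
```

The Celsius ratio changes when the same temperatures are expressed in Kelvin, which is why ratios on an interval scale carry no meaning, while the age ratio does not depend on such a shift.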

Iris Dataset Extract

Data: Algebraic and Geometric View
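• For a numeric data matrix D, each row is a d-dimensional data point (i.e., a
vector with d attributes):

$x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T \in \mathbb{R}^d$

whereas each column is an n-dimensional attribute vector (i.e., a vector
with n data points):

$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T \in \mathbb{R}^n$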

Data: Algebraic and Geometric View

Scatterplot: 2D Iris dataset, sepal length versus sepal width.

What about more than two attributes?

Numeric Data Matrix
• If all attributes are numeric, then the data matrix D is an n × d matrix,
or equivalently a set of n row vectors $x_i^T \in \mathbb{R}^d$ or a set of d column
vectors $X_j \in \mathbb{R}^n$.

• The mean of the data matrix D is the average of all the points:

$\mathrm{mean}(D) = \mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Numeric Data Matrix
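• The centered data matrix is obtained by subtracting the mean
from all the points:

$Z = D - \mathbf{1} \cdot \mu^T$, i.e., $z_i = x_i - \mu$ for $i = 1, \ldots, n$,

where $\mu$ is the mean of D and $\mathbf{1} \in \mathbb{R}^n$ is the vector of all ones.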
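
A minimal sketch of computing the mean point and the centered data matrix with plain lists. The helper names `mean_point` and `center` and the 3 × 2 matrix are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
def mean_point(D):
    """Column-wise average of all the points (rows) in D."""
    n, d = len(D), len(D[0])
    return [sum(row[j] for row in D) / n for j in range(d)]


def center(D):
    """Subtract the mean point from every row of D."""
    mu = mean_point(D)
    return [[x - m for x, m in zip(row, mu)] for row in D]


# Hypothetical numeric data matrix with n = 3 points and d = 2 attributes
D = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]

mu = mean_point(D)
Z = center(D)

print(mu)
print(Z)
print(mean_point(Z))   # the centered data has mean zero in every attribute
```

After centering, the mean of the new matrix is the zero vector, which is a quick sanity check for any implementation.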

Norm, Distance and Angle

Norm, Distance and Angle
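
A minimal sketch of these three quantities, assuming the standard Euclidean definitions: the norm is the square root of a vector's dot product with itself, the distance between two points is the norm of their difference, and the cosine of the angle is the dot product divided by the product of the norms. The function names and sample vectors are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm (length) of a vector."""
    return math.sqrt(dot(a, a))


def distance(a, b):
    """Euclidean distance: the norm of the difference vector."""
    return norm([x - y for x, y in zip(a, b)])


def cos_angle(a, b):
    """Cosine of the angle between two nonzero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [4.0, 3.0]

print(norm(a))                                   # length of a
print(distance(a, b))                            # distance between a and b
print(math.degrees(math.acos(cos_angle(a, b))))  # angle in degrees
```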

Orthogonal Projection
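
A minimal sketch of orthogonal projection, assuming the standard formula: the projection of b onto the direction of a is p = (a·b / a·a) a, and the residual r = b − p is orthogonal to a. The function names and sample vectors are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(b, a):
    """Orthogonal projection of b onto the direction of the nonzero vector a."""
    scale = dot(a, b) / dot(a, a)
    return [scale * x for x in a]


a = [1.0, 1.0]
b = [3.0, 1.0]

p = project(b, a)                       # component of b along a
r = [x - y for x, y in zip(b, p)]       # residual, perpendicular to a

print(p)
print(r)
print(dot(r, a))                        # zero confirms orthogonality
```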

DATA: PROBABILISTIC VIEW
• The probabilistic view of the data assumes that each numeric
attribute X is a random variable, defined as a function that assigns a
real number to each outcome of an experiment.

• Formally, X is a function X : O → R, where O, the domain of X, is the
set of all possible outcomes of the experiment, also called the sample
space, and R, the range of X, is the set of real numbers.

• If the outcomes are numeric, and represent the observed values of
the random variable, then X : O → O is simply the identity function:
X(v) = v for all v ∈ O.
DATA: PROBABILISTIC VIEW
• The distinction between the outcomes and the value of the random
variable is important, as we may want to treat the observed values
differently depending on the context

• A random variable X is called a discrete random variable if it takes on
only a finite or countably infinite number of values in its range,
whereas X is called a continuous random variable if it can take on any
value in its range.

Example
• Consider the sepal length attribute (X1) for the Iris dataset.
• All n = 150 values of this attribute lie in the range [4.3,7.9], with
centimeters as the unit of measurement.
• Let us assume that these constitute the set of all possible outcomes
O.

• By default, we can consider the attribute X1 to be a continuous
random variable, given as the identity function X1(v) = v, because the
outcomes (sepal length values) are all numeric.

Example (cont.)
• On the other hand, if we want to distinguish between Iris flowers
with short and long sepal lengths, with long being, say, a length of 7 cm
or more, we can define a discrete random variable A as follows:

$A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \ge 7 \end{cases}$

• In this case the domain of A is [4.3,7.9], and its range is {0,1}.
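
The short/long distinction can be sketched as a function from outcomes to {0, 1}. The 7 cm threshold comes from the text; the sample lengths are hypothetical values inside [4.3, 7.9], not actual Iris measurements.

```python
def A(v):
    """Discrete random variable: 1 for a long sepal (7 cm or more), else 0."""
    return 1 if v >= 7.0 else 0


# Hypothetical sepal lengths in centimeters, all within the domain [4.3, 7.9]
sample_lengths = [4.3, 5.1, 6.4, 7.0, 7.9]

labels = [A(v) for v in sample_lengths]
print(labels)
```

The same numeric outcomes can therefore be viewed either through the identity function (a continuous variable) or through A (a discrete variable), depending on the question being asked.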

Probability Mass Function
• If X is discrete, the probability mass function of X is defined as

$f(x) = P(X = x)$ for all $x \in \mathbb{R}$

• Intuitively, for a discrete variable X, the probability is concentrated or
massed at only discrete values in the range of X, and is zero for all
other values.
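
A minimal sketch of estimating a probability mass function from data by relative frequencies. The helper name `empirical_pmf` and the list of observations are hypothetical; the estimate approximates the true mass function only as far as the sample is representative.

```python
from collections import Counter


def empirical_pmf(values):
    """Estimate P(X = x) as the fraction of observations equal to x."""
    counts = Counter(values)
    n = len(values)
    return {x: c / n for x, c in counts.items()}


# Hypothetical observations of a discrete 0/1 random variable
observed = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

f = empirical_pmf(observed)

print(f)
print(f.get(2, 0.0))      # a value never observed gets zero mass
print(sum(f.values()))    # the masses add up to one
```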

Next…

Data Exploration
