Machine Learning Mid Note

Machine learning basic notes

19 January

Machine Learning: where there is an issue of approximation, that is machine learning


Machine learning algorithms are divided into 3 categories:
1. Supervised learning
a. Classification
b. Regression
2. Unsupervised learning
a. Clustering
b. Association
3. Reinforcement learning

Query Data: after training on some dataset, being able to recognize examples shown from outside that dataset

Supervised learning: where the information about which class an instance (given its attributes) belongs to is available; labeled data is provided
Classification: where there are multiple classes and each instance belongs to one of them;
the process by which we determine which class a new incoming instance belongs to is the classification
problem
Regression: here there is no classification; how one variable changes with respect to another variable is regression.
Or: estimating the unknown value of one variable from the known value of another variable

Unsupervised learning: grouping based on attributes & similarity (Clustering),


labeled data is not available

Classification Algorithms:
Decision tree, Naive Bayes classifier, Neural network, SVM, KNN
Multivariate: when a decision depends on many variables

Regression: Single variable regression & Multivariable regression


In 1877, Sir Francis Galton first used the term “regression”
to measure the relationship between the heights of fathers & their sons

If the deviation is 0, it means all the points lie on a single straight line


The deviation problem is solved with the least squares method (by squaring the errors)
a & b are tuned so that the total deviation (S) is minimal; this is parameter tuning

Squared Error = (observed value Yi − calculated value ŷi)^2
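A minimal least-squares sketch (the data points are assumptions for illustration; the closed-form slope/intercept formulas are the standard solution that minimizes S):

```python
# Fit y = a + b*x by least squares: choose a & b so that
# S = sum of (observed Yi - calculated yi)^2 is minimal.
xs = [1, 2, 3, 4, 5]            # assumed example data
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form parameter tuning: b = cov(x, y) / var(x), a = mean_y - b * mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

S = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
print(f"a = {a:.3f}, b = {b:.3f}, S = {S:.3f}")
```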

23 January
Data Mining Concepts and techniques
Data mining is done by applying machine learning techniques

Data mining - Finding out useful information from data


- Extraction of interesting patterns or knowledge from huge amounts of data
(implicit, previously unknown, potentially useful)

Searching is not data mining; simple search & query processing is not data mining.
If it is already known where something is, looking it up is not data mining

Knowledge discovery (KDD) process


- Database → data cleaning → data integration → data warehouse → task-relevant data → data mining → pattern evaluation →
knowledge

Data mining and business intelligence


Why not traditional data analysis
- huge amount of data
- high dimensionality of data (many attributes)
- high complexity of data
- new & sophisticated applications

30 January Part-1
Data mining functionalities
- multidimensional concept description
- frequent pattern, association, correlation vs causality
- classification & prediction
- cluster analysis
- outlier analysis
- trend & evolution analysis
- pattern directed or statistical analysis

Top 10 algorithms selected at ICDM’06 -

Major issues in data mining -


- mining methodology
- user interaction
- applications & social impacts

06 February
Chapter 2 - Getting to know your data
term frequency vector - how many times each word occurs in a document
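A tiny sketch of a term-frequency vector (the document string is an assumption):

```python
from collections import Counter

# Term-frequency vector: count how many times each word occurs in a document.
doc = "data mining is mining knowledge from data"   # assumed toy document
tf = Counter(doc.split())
print(tf)   # Counter({'data': 2, 'mining': 2, 'is': 1, 'knowledge': 1, 'from': 1})
```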
Types of data sets
- records
- graph and network
- ordered
- spatial, image and multimedia
important characteristics of structured data
- dimensionality
- sparsity - many attribute values are zero or missing; only presence counts
- resolution
- distribution
data objects are also called samples, examples, instances, data points, objects, tuples
rows -> data objects, columns -> attributes

Attribute types -
1) Nominal - categories, states, or names of things
e.g., marital status, ID number, zip code, occupation
2) Binary
a) symmetric binary: both outcomes equally important, e.g., gender
b) asymmetric binary: outcomes not equally important, e.g., positive vs. negative test results
3) Ordinal - values have a meaningful order (ranking)
size = {small, medium, large}
Numeric attribute types -
a) interval-scaled - no true zero point (e.g., temperature in °C)
b) ratio-scaled - inherent zero point (e.g., height, weight)

Discrete attributes - finite or countably infinite set of values; whole values
e.g., zip codes, profession
Continuous attributes - real numbers as attribute values; fractional values
e.g., temperature, height, weight

From book - page 83


weighted average
median
mode
midrange - (smallest + largest) / 2
IQR - interquartile range (difference between Q3 & Q1, the third and first quartiles)

Five-number summary - 5 things:


lowest / Q1 / median (Q2) / Q3 / highest

Outlier rule: values more than 1.5 × IQR above Q3 (or more than 1.5 × IQR below Q1) are outliers


Q1 = 20, Q3 = 60, IQR = 40, upper fence = 60 + 1.5 × 40 = 120
anything crossing the 120 limit is an outlier
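A minimal sketch of the fence rule with the quartiles above (the data list is an assumption):

```python
# IQR outlier fences: values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR are outliers.
q1, q3 = 20, 60
iqr = q3 - q1                       # 40
upper = q3 + 1.5 * iqr              # 60 + 60 = 120
lower = q1 - 1.5 * iqr              # 20 - 60 = -40

data = [5, 20, 45, 60, 100, 130]    # assumed example values
print([v for v in data if v > upper or v < lower])   # [130]
```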

13 February
Measuring data similarity and dissimilarity
Cluster - a collection of data objects such that the objects within a cluster are similar to one another and
dissimilar to the objects in other clusters.

Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of a relational
table, or n-by-p matrix (n objects ×p attributes)

Dissimilarity matrix (or object-by-object structure): This structure stores a collection


of proximities that are available for all pairs of n objects. It is often represented by an
n-by-n table

(figure: example of a data matrix and its dissimilarity matrix)

Measures of similarity can often be expressed as a function of measures of dissimilarity; for nominal data, sim ( i, j )
= 1 - d( i, j )

dissimilarity between 2 objects: d(i, j) = (p − m) / p

similarity between 2 objects: sim(i, j) = m / p = 1 − d(i, j)

(m = number of matching attributes, p = total number of attributes; table 2.2, page 69)


Dissimilarity for NOMINAL attributes: use d(i, j) = (p − m) / p as above
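A one-function sketch of the nominal formula (the attribute vectors are assumptions):

```python
# Nominal dissimilarity: d(i, j) = (p - m) / p,
# p = total attributes, m = number of matching attributes.
def d_nominal(i, j):
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

print(d_nominal(["red", "small", "round"], ["red", "large", "round"]))  # 0.333...
```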

Dissimilarity for SYMMETRIC binary attributes:

d(i, j) = (r + s) / (q + r + s + t)

here, from the 2×2 contingency table, q = attributes where both objects are 1, t = where both are 0, and r & s = the (1, 0) and (0, 1) mismatches, i.e. dissimilar

so, d(i, j) = (total dissimilar count / total attribute count)

Dissimilarity for ASYMMETRIC binary attributes:

d(i, j) = (r + s) / (q + r + s)

For asymmetric binary attributes, the two states are not equally important, e.g.
the positive (1) and negative (0) outcomes of a disease test.
The agreement of two 1s (a positive match) is considered more significant than that of two 0s (a negative
match); the number of negative matches, t, is
considered unimportant and is thus ignored

===========================================

=> Jaccard coefficient (similarity for asymmetric binary data): sim(i, j) = q / (q + r + s) = 1 − d(i, j)
===========================================
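A sketch covering the three binary measures above (the 0/1 vectors are assumptions):

```python
# q = both 1, r = i=1 & j=0, s = i=0 & j=1, t = both 0.
def contingency(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

i = [1, 0, 1, 1, 0, 0]   # assumed example vectors
j = [1, 1, 0, 1, 0, 0]
q, r, s, t = contingency(i, j)

print((r + s) / (q + r + s + t))   # symmetric:  mismatches over all attributes
print((r + s) / (q + r + s))       # asymmetric: negative matches t ignored
print(q / (q + r + s))             # Jaccard coefficient = 1 - asymmetric d
```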
Dissimilarity of NUMERIC data: Minkowski Distance

Euclidean distance: d(i, j) = sqrt( (x_i1 − x_j1)^2 + (x_i2 − x_j2)^2 + … + (x_ip − x_jp)^2 )

Manhattan (or city block) distance: d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|

Euclidean & Manhattan distance satisfy following mathematical properties:


Non-negativity: d( i, j) ≥ 0: Distance is a non-negative number.
Identity of indiscernibles: d( i, i) = 0: The distance of an object to itself is 0.
Symmetry: d( i, j) = d( j, i): Distance is a symmetric function.
Triangle inequality: d( i, j) ≤ d( i, k) + d( k, j): Going directly from object i to object j is never farther than going via any other object k.

Minkowski distance is a generalization of the Euclidean and Manhattan distances: d(i, j) = ( |x_i1 − x_j1|^h + … + |x_ip − x_jp|^h )^(1/h), where h = 1 gives Manhattan, h = 2 gives Euclidean, and h → ∞ gives the supremum distance.

Supremum distance (L_max / Chebyshev): the largest difference over all attributes, d(i, j) = max_k |x_ik − x_jk|


x1 = (1, 2) and x2 = (3, 5): supremum distance = max(|3 − 1|, |5 − 2|) = 3
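A minimal sketch reproducing the distances for the example points above:

```python
# Minkowski distance of order h; h=1 is Manhattan, h=2 is Euclidean,
# and h -> infinity approaches the supremum (max) distance.
def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))                      # Manhattan: 2 + 3 = 5
print(minkowski(x1, x2, 2))                      # Euclidean: sqrt(4 + 9) ≈ 3.606
print(max(abs(a - b) for a, b in zip(x1, x2)))   # supremum: max(2, 3) = 3
```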

Dissimilarity for ORDINAL attributes: replace each value by its rank r ∈ {1, …, M}, normalize to z = (r − 1) / (M − 1), then treat as numeric

Excellent, good, fair = ranks (3, 2, 1) = z-values (1, 0.5, 0)
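A one-liner sketch of the rank normalization (the rank mapping follows the example above):

```python
# Map rank r in {1..M} to z = (r - 1) / (M - 1), then treat as numeric.
ranks = {"fair": 1, "good": 2, "excellent": 3}
M = len(ranks)
print({k: (r - 1) / (M - 1) for k, r in ranks.items()})
# {'fair': 0.0, 'good': 0.5, 'excellent': 1.0}
```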

13 February
Dissimilarity for MIXED attributes: combine the per-attribute dissimilarities with a weighted average,
d(i, j) = Σ_f δ_ij^(f) · d_ij^(f) / Σ_f δ_ij^(f),
where the indicator δ_ij^(f) is 0 if the attribute value is missing (and 1 otherwise)
20 February (From Previous Semester Record)
Data preprocessing
data cleaning
data integration
data reduction
data transformation & discretization

Missing value handling (a pandas sketch follows this list)


- ignore the tuple
- Fill in the missing value manually
- Use a global constant to fill in the missing value (“Unknown” or -1)
- Use a measure of central tendency (mean or median)
- Using mean or median for all samples belonging to same class
- Use most probable value to fill in the missing value
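A minimal pandas sketch of two strategies from the list above (column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],        # assumed class label column
    "income": [30.0, None, 50.0, None, 70.0],   # assumed numeric attribute
})

# Fill with a measure of central tendency (median) of the whole column:
df["by_median"] = df["income"].fillna(df["income"].median())

# Fill with the mean of samples belonging to the same class:
df["by_class_mean"] = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```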

Noise - random error or variance in a variable


Handling noisy data
- Binning (partition the sorted data; smooth by bin means or bin boundaries; see the sketch after this list)
- Regression
- Clustering (removing outliers)
- Combine computer & human inspection
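A short sketch of smoothing by bin means with equal-depth bins (the values follow the common textbook price example; treat them as assumptions):

```python
# Equal-depth (equal-frequency) binning, then smoothing by bin means.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # assumed sorted prices
n_bins = 3
size = len(data) // n_bins

for b in range(n_bins):
    bin_vals = data[b * size:(b + 1) * size]
    mean = sum(bin_vals) / len(bin_vals)
    print(bin_vals, "->", [round(mean, 1)] * len(bin_vals))
```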

Data reduction - obtain a reduced representation of the data set that is much smaller in volume, yet closely
maintains the integrity of the original data.
- dimensionality reduction (wavelet transforms, PCA, supervised & nonlinear techniques)
- numerosity reduction
- data compression

23 February

Support, Confidence: a -> d (60%, 100%)


Support 60%: out of 5 total transactions, a and d were bought together 3 times;
a, d appear together in 3 transactions
Confidence 100%: whenever a was bought, d was also bought; buying a without d never happened

(Support, Confidence): d -> a (60%, 75%), since d appears in 4 transactions and a was bought in 3 of those (3/4 = 75%)
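A small sketch reproducing the numbers above (the 5 transactions are assumptions consistent with the note):

```python
# Support and confidence for rules a -> d and d -> a over 5 transactions.
transactions = [{"a", "d"}, {"a", "d"}, {"a", "d"}, {"d"}, {"b"}]

n = len(transactions)
both = sum({"a", "d"} <= t for t in transactions)   # a & d together: 3
has_a = sum("a" in t for t in transactions)         # 3
has_d = sum("d" in t for t in transactions)         # 4

print("a -> d:", both / n, both / has_a)   # support 0.6, confidence 1.0
print("d -> a:", both / n, both / has_d)   # support 0.6, confidence 0.75
```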

Frequent itemset mining algorithm


Apriori algorithm: finding frequent itemsets by confined candidate generation
Apriori property: All nonempty subsets of a frequent itemset must also be frequent
1. The join step
2. The prune step
(both steps appear in the sketch below)
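A compact sketch of Apriori's level-wise join & prune (a simplified illustration under assumed toy transactions, not the book's exact pseudocode):

```python
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent, k = set(level), 2

    while level:
        # Join step: combine (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
        k += 1
    return frequent

print(apriori([{"a", "d"}, {"a", "d"}, {"a", "d"}, {"d"}, {"b"}], 0.6))
# {frozenset({'a'}), frozenset({'d'}), frozenset({'a', 'd'})}
```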

(Pdf page 250 & 254)

You might also like