LP Vi Manual

Uploaded by

Jahan Chaware
PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE

ACADEMIC YEAR: 2022-23


DEPARTMENT OF COMPUTER ENGINEERING
CLASS: B.E. SEMESTER: II
SUBJECT: LP VI
ASSIGNMENT NO. 1
TITLE: Perform Tokenisation & Stemming

PROBLEM STATEMENT / DEFINITION: Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet,
MWE) using the NLTK library. Use the Porter stemmer and the Snowball stemmer for
stemming. Use any technique for lemmatization.
Input / Dataset: use any sample sentence

OBJECTIVE:
● Understand the concept of tokenization and stemming along with its
importance in natural language processing.
● Gain hands-on experience with the NLTK library to perform
tokenization and stemming using multiple techniques.

OUTCOME:
● Understand the importance of breaking down natural language text
into smaller meaningful units for analysis and processing.
● Be able to identify the appropriate tokenization and stemming
techniques based on the nature of the text being processed.

S/W PACKAGES AND HARDWARE APPARATUS USED: Operating system: open-source Linux or its derivative;
Python (3.5+), NLTK library

REFERENCES: Steven Bird, Ewan Klein, Edward Loper, "Natural Language Processing
with Python – Analyzing Text with the Natural Language Toolkit", O'Reilly
Publication

INSTRUCTIONS FOR WRITING JOURNAL:
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Prerequisites:

Concepts related Theory:

● Tokenization

Tokenization is the process of splitting text into individual words or tokens. It is a fundamental
task in natural language processing (NLP) and is necessary for several NLP tasks such as text
classification, sentiment analysis, and information retrieval. NLTK (Natural Language Toolkit) is
a popular Python library for NLP that provides several functions for tokenization.

P:F-LTL-UG/03/R1

● Types of Tokenization Techniques:


1. Whitespace-based Tokenization: This is the simplest form of tokenization,
where text is split based on whitespace characters like spaces, tabs, and newlines.
It assumes that each whitespace-separated substring is a token.

eg: "This is a sample text." Output: ['This', 'is', 'a', 'sample', 'text.']

2. Punctuation-based Tokenization: This method splits the text into tokens based
on punctuation marks such as periods, commas, and quotation marks. This
method can handle more complex text but may include some unwanted tokens.

eg: "This is a sample text. It contains, punctuation marks!"

Output: ['This', 'is', 'a', 'sample', 'text', '.', 'It', 'contains', ',', 'punctuation', 'marks',
'!']

3. Treebank Tokenization: This technique uses a set of rules to split the text into
tokens. It follows the conventions of the Penn Treebank Project and is widely used for parsing and
machine-learning applications. The Penn Treebank is a large corpus of written
and spoken English that has been annotated with part-of-speech tags and
syntactic trees. The Treebank tokenizer in NLTK applies the tokenization conventions of
this corpus using regular expressions. It assumes that the text has already been split into
sentences: only a period at the very end of the input becomes its own token, while periods
inside the input (for example in "Mr." or at an earlier sentence boundary) stay attached to
the preceding word.

eg: "The quick brown fox jumped over the lazy dog. Mr. Smith went to
Washington, D.C. yesterday."

Output: ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.', 'Mr.',
'Smith', 'went', 'to', 'Washington', ',', 'D.C.', 'yesterday', '.']

4. Tweet Tokenization: This method is specifically designed for handling tweets,
which may contain hashtags, mentions, and emojis. It uses regular expressions to
split the text into tokens.

eg: "This is a sample tweet with #hashtags and @mentions"

Output: ['This', 'is', 'a', 'sample', 'tweet', 'with', '#hashtags', 'and', '@mentions']

5. Multi-word Expression (MWE) Tokenization: This technique is used to
identify multi-word expressions such as "New York" and "United
States" in an already tokenized text and merge each of them into a single token.

eg: "I live in New York and work for the United States government."

Output: ['I', 'live', 'in', 'New_York', 'and', 'work', 'for', 'the', 'United_States',
'government', '.']
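
The five techniques above can be tried with NLTK roughly as follows. This is a minimal sketch, assuming NLTK is installed; the class names come from `nltk.tokenize`, while the variable names are only illustrative. `MWETokenizer` works on a list of tokens, so the sentence is split on whitespace first.

```python
from nltk.tokenize import (
    MWETokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

text = "This is a sample text. It contains, punctuation marks!"

# 1. Whitespace: split on spaces, tabs and newlines only.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)

# 2. Punctuation-based: runs of word characters and runs of punctuation.
punct_tokens = WordPunctTokenizer().tokenize(text)

# 3. Treebank: Penn Treebank conventions, expects one sentence at a time.
treebank_tokens = TreebankWordTokenizer().tokenize(text)

# 4. Tweet: keeps hashtags and mentions together.
tweet_tokens = TweetTokenizer().tokenize(
    "This is a sample tweet with #hashtags and @mentions"
)

# 5. MWE: merges the listed multi-word expressions in a token list.
mwe_tokenizer = MWETokenizer([("New", "York"), ("United", "States")], separator="_")
mwe_tokens = mwe_tokenizer.tokenize(
    "I live in New York and work for the United States government".split()
)

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("punctuation", punct_tokens),
    ("treebank", treebank_tokens),
    ("tweet", tweet_tokens),
    ("mwe", mwe_tokens),
]:
    print(name, tokens)
```

Comparing the printed lists side by side shows where each tokenizer places punctuation, which is usually the deciding factor when choosing one for a task.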

● Stemming

Stemming is the process of reducing words to a base or root form by stripping affixes. The
resulting stem is not always a dictionary word. Stemming is used to improve text
processing efficiency and reduce the complexity of text analysis. Two popular stemming
algorithms are the Porter stemmer and the Snowball stemmer.

● Porter Stemmer Algorithm: The Porter stemmer is a rule-based algorithm that removes
common suffixes from English words in a fixed sequence of steps. It is widely used in
information retrieval and other text mining applications.

eg: word = "running" Output: "run"

● Snowball Stemmer Algorithm: The Snowball stemmer is a revised version of the
Porter stemmer that handles more cases consistently. Its English variant is also
known as the Porter2 stemmer, and Snowball stemmers exist for several other languages.
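
Both stemmers are available in `nltk.stem`. The sketch below assumes NLTK is installed and uses an illustrative word list that is not part of the assignment; it prints the two stems next to each other so the differences can be inspected.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the English Snowball stemmer (Porter2)

# Illustrative words; replace with tokens from any sample sentence.
words = ["running", "generously", "fairly", "studies"]

for word in words:
    print(f"{word:12s} porter={porter.stem(word):10s} snowball={snowball.stem(word)}")
```

Stems are meant to group related word forms together, so they should be judged by consistency rather than by whether they look like real words.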

● Lemmatization vs. Stemming:

Lemmatization is often preferred over stemming because it provides a more accurate base form of
words. While stemming only removes suffixes from the words, lemmatization considers the
meaning and context of the word and produces a base form that is a real word in the language.

Conclusion:

By completing the tasks outlined in this problem statement, learners will have gained a solid
understanding of tokenization, stemming, and lemmatization in natural language processing, as well as
practical experience using the NLTK library to implement these techniques.

Review Questions:

ASSIGNMENT NO. 2
TITLE: Create embeddings using Word2Vec.

PROBLEM STATEMENT / DEFINITION: Perform bag-of-words approach (count occurrence, normalized count
occurrence) and TF-IDF on data. Create embeddings using Word2Vec.
Dataset to be used:
https://www.kaggle.com/datasets/CooperUnion/cardataset

OBJECTIVE:
● Learn about different ways of representing text using the bag-of-words
approach, such as count occurrence and normalized count
occurrence.
● Understand the concept of TF-IDF (Term Frequency-Inverse
Document Frequency) and its importance in natural language
processing.
● Learn about the Word2Vec algorithm for generating word
embeddings.

OUTCOME:
● Understand the importance of converting unstructured natural
language text into structured data for analysis and processing.
● Be able to represent text using different types of approaches and
understand the advantages and disadvantages of each.

S/W PACKAGES AND HARDWARE APPARATUS USED: Operating system: open-source Linux or its derivative;
Python (v3.6+), NLTK library, Gensim library

REFERENCES: Steven Bird, Ewan Klein, Edward Loper, "Natural Language Processing
with Python – Analyzing Text with the Natural Language Toolkit", O'Reilly
Publication

INSTRUCTIONS FOR WRITING JOURNAL:
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Concepts related Theory:

● Bag-of-Words Approach:

The bag-of-words approach is a method of representing text data in a structured format that can
be easily analyzed and processed. It involves breaking down a piece of text into individual words
(or tokens) and counting the number of occurrences of each word. This approach is useful for
tasks such as sentiment analysis, topic modeling, and document classification.

● Count Occurrence:

In the count occurrence method of the bag-of-words approach, the frequency of each
word is simply counted and used as a feature in the analysis. This method is simple to
implement and can be useful for certain types of analysis, such as identifying the most
common words in a corpus.

eg: "The quick brown fox jumped over the lazy dog."

Output: {"the": 2, "quick": 1, "brown": 1, "fox": 1, "jumped": 1, "over": 1, "lazy": 1,
"dog": 1}

● Normalized Count Occurrence:

In the normalized count occurrence method of the bag-of-words approach, the frequency
of each word is divided by the total number of words in the document, resulting in a
frequency distribution that represents the relative frequency of each word. Because the
values no longer depend on document length, this method is useful for comparing word
usage across documents of different sizes.

eg: "The quick brown fox jumped over the lazy dog."

Output: {"the": 0.222, "quick": 0.111, "brown": 0.111, "fox": 0.111, "jumped": 0.111,
"over": 0.111, "lazy": 0.111, "dog": 0.111}

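Both counting schemes can be reproduced with the standard library. The sketch below assumes lowercasing and a simple regular-expression tokenizer, which are choices made here for illustration; the function names are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased word occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def normalized_counts(counts):
    """Divide each count by the total number of words in the document."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


counts = bag_of_words("The quick brown fox jumped over the lazy dog.")
frequencies = normalized_counts(counts)

print(dict(counts))
print({word: round(freq, 3) for word, freq in frequencies.items()})
```

The sentence has nine words, so "the" gets 2/9 ≈ 0.222 and every other word gets 1/9 ≈ 0.111, matching the example above.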
● TF-IDF:

TF-IDF (Term Frequency-Inverse Document Frequency) is a method of weighting the importance
of each term in a document based on its frequency in the document and its frequency in the
corpus. This method is useful for reducing the impact of common words in a document and
highlighting words that are more specific to the document.

TF-IDF = (Term Frequency) x (Inverse Document Frequency)

Where Term Frequency is the frequency of the term in the document, and Inverse Document
Frequency is a measure of how rare the term is in the corpus.

eg: If the word "apple" appears 10 times in a document of 100 words, and appears in 50 out of a
corpus of 1000 documents, the TF-IDF score for "apple" in that document would be:

TF-IDF = (10/100) x log10(1000/50) ≈ 0.1 x 1.301 ≈ 0.13 (using the base-10 logarithm)
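
The arithmetic of this example can be checked with a few lines of Python. The sketch assumes the base-10 logarithm, which is the base that yields the value 1.3 for log(1000/50); the function and parameter names are illustrative.

```python
import math


def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """TF-IDF with raw term frequency and a base-10 inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf


score = tf_idf(term_count=10, doc_length=100, num_docs=1000, docs_with_term=50)
print(round(score, 4))
```

Libraries often use a different logarithm base or add smoothing terms, so their scores can differ from this hand calculation while preserving the same ranking idea.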

● Word Embeddings:

Word embeddings are a way of representing words in a continuous vector space, where words
that have similar meanings or contexts are clustered together. This approach is useful for
capturing the semantic relationships between words and for tasks such as sentiment analysis and
text classification.

eg: The word "king" might be represented as the vector [0.2, 0.8, -0.5, 0.1], while the word
"queen" might be represented as the vector [0.3, 0.7, -0.3, 0.2]. Words that have similar meanings
or contexts are clustered together in this vector space, allowing for semantic relationships
between words to be captured.
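
Closeness in the vector space is commonly measured with cosine similarity. The sketch below applies it to the two illustrative vectors from the example; the vectors are made up for the illustration and do not come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king = [0.2, 0.8, -0.5, 0.1]
queen = [0.3, 0.7, -0.3, 0.2]

similarity = cosine_similarity(king, queen)
print(round(similarity, 3))
```

A value close to 1 means the vectors point in nearly the same direction, which is how "similar meaning" is usually read off an embedding space.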

● Word2Vec:

Word2Vec is an algorithm for generating word embeddings that is based on a neural network
architecture. The algorithm learns to predict the context in which each word appears in a corpus,
resulting in a vector representation of each word that captures its semantic meaning.

eg: Given a corpus of text, the Word2Vec algorithm might learn that the word "king" often
appears in the context of the words "queen", "prince", and "royal", and that the word "queen"
often appears in the context of the words "king", "princess", and "royal". These relationships are
captured in the resulting word embeddings, allowing for the semantic meaning of words to be
analyzed and compared.

Conclusion: This assignment provides an opportunity to gain hands-on experience with several essential
NLP concepts, including tokenization, stemming, and vectorization. Through this exercise, students can
learn how to preprocess text data and represent it in a way that is suitable for analysis and modelling.
They can also gain a deeper understanding of the strengths and limitations of different text representation
techniques.

ASSIGNMENT NO. 3
TITLE: Preprocessing and Text Representation Techniques for Natural Language
Processing.

PROBLEM STATEMENT / DEFINITION: Perform text cleaning, perform lemmatization (any method), remove stop
words (any method), and label encoding. Create representations using TF-IDF.
Save outputs.
Dataset:
https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle

OBJECTIVE:
● Understand the importance of text preprocessing for NLP tasks.
● Learn how to perform text cleaning, lemmatization, stop words
removal, and label encoding using Python.
● Gain an understanding of the TF-IDF technique for text
representation.

OUTCOME:
● Gain hands-on experience in cleaning and preprocessing textual
data, including identifying and removing stop words and performing
lemmatization.
● Understand the concept of label encoding and learn how to perform
it on textual data to prepare it for machine learning models.
● Learn how to use the TF-IDF technique to convert textual data into
a numerical representation that is suitable for machine learning
models.

S/W PACKAGES AND HARDWARE APPARATUS USED: Operating system: open-source Linux or its derivative;
Python (v3.6+), NLTK library, Scikit-learn library

REFERENCES: Jurafsky, David, and James H. Martin, "Speech and Language Processing:
An Introduction to Natural Language Processing, Computational
Linguistics and Speech Recognition", Pearson Publication.

INSTRUCTIONS FOR WRITING JOURNAL:
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Concepts related Theory:

● Text Cleaning:

Text cleaning is the process of removing unwanted elements such as noise, stop words,
punctuations, special characters, and numbers from the raw text. The goal of text cleaning is to
convert raw text into a clean and standardized format that is suitable for analysis. Text cleaning
can be done using various techniques such as regular expressions, stemming, lemmatization, and
stop-word removal.

● Lemmatization:

Lemmatization is the process of converting words to their base form (known as a lemma) using
morphological analysis. It reduces words to their core meaning, making them easier to analyze
and compare.

eg: "The cats play with the mice" Output: The cat play with the mouse

● Stop Words Removal:

Stop words are words that occur frequently in a language and do not carry significant meaning,
such as "the", "and", and "a". Removing stop words from text data can help to reduce noise and
improve the accuracy of NLP tasks.

eg: “The quick brown fox jumps over the lazy dog.” Output: quick brown fox jumps lazy dog
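
The cleaning, lemmatization, and stop-word steps can be sketched without any external data. The stop-word set and the lemma dictionary below are tiny hypothetical stand-ins chosen for the two example sentences; in practice they would be replaced by a library resource such as a full stop-word list and a dictionary-based lemmatizer.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "and", "a", "over", "with"}
LEMMAS = {"cats": "cat", "mice": "mouse"}


def clean_text(text):
    """Lowercase, replace anything that is not a letter or space, squeeze spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def lemmatize(tokens):
    """Map each token to its lemma when the toy dictionary knows it."""
    return [LEMMAS.get(token, token) for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


lemmas = lemmatize(clean_text("The cats play with the mice").split())
content = remove_stop_words(
    clean_text("The quick brown fox jumps over the lazy dog.").split()
)

print(lemmas)   # ['the', 'cat', 'play', 'with', 'the', 'mouse']
print(content)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Running cleaning before the other two steps keeps the lookups simple, because every token is already lowercase and free of punctuation.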

● Label Encoding:

Label encoding is the process of converting categorical data (such as text data) into numerical
data that can be used in machine learning models. Each unique value in the categorical data is
assigned a unique numerical value.

eg: Given a dataset of customer reviews for a restaurant, where each review has a label indicating
the sentiment of the review:

Before label encoding:

Review 1: "The food was amazing! I loved it." (positive)

Review 2: "The service was terrible. I wouldn't recommend this restaurant." (negative)

Review 3: "The ambience was average. Nothing special." (neutral)

After label encoding:

Review 1: "The food was amazing! I loved it." (1)

Review 2: "The service was terrible. I wouldn't recommend this restaurant." (0)

Review 3: "The ambience was average. Nothing special." (2)
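
The integers produced by label encoding are arbitrary identifiers, so different encoders may assign them differently. The sketch below numbers the labels in sorted order, which is also how scikit-learn's `LabelEncoder` orders its classes; it therefore yields negative = 0, neutral = 1, positive = 2 rather than the numbering used in the illustration above. The function name is hypothetical.

```python
def fit_label_encoder(labels):
    """Assign each distinct label an integer, in sorted label order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


labels = ["positive", "negative", "neutral"]
mapping = fit_label_encoder(labels)
encoded = [mapping[label] for label in labels]

print(mapping)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)  # [2, 0, 1]
```

Keeping the mapping around makes it possible to decode model predictions back into the original label names.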

● TF-IDF (Term Frequency-Inverse Document Frequency):

TF-IDF is a technique used to represent text data as a numerical vector, where each element of
the vector represents the importance of a term in the document. The first step is to create a
document-term matrix, where each row represents a document and each column represents a
unique term in the corpus. We can then calculate the term frequency (TF) and inverse document
frequency (IDF) for each term in each document, and multiply them together to get the TF-IDF
score.

TF-IDF = (Term Frequency) x (Inverse Document Frequency)
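
The document-term matrix described above can be built by hand for a toy corpus. This is a sketch under simple assumptions: whitespace tokenization, term frequency as count divided by document length, and an unsmoothed natural-log IDF. The three documents are invented for the illustration, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF with normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return the sorted vocabulary and one TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    num_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([
            (counts[term] / len(doc)) * math.log(num_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, rows


docs = ["the cat sat", "the dog barked", "the cat purred"]
vocab, matrix = tfidf_matrix(docs)

print(vocab)
for row in matrix:
    print([round(value, 3) for value in row])
```

The word "the" occurs in every document, so its IDF is log(3/3) = 0 and its column is all zeros, while a word unique to one document gets the highest weight in that row.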

Conclusion:

In this assignment, we perform fundamental text preprocessing techniques in NLP, including text
cleaning, lemmatization, stop word removal, and label encoding. By creating representations using TF-
IDF, we gain valuable insights and feature vectors for further analysis, leading to improved text
processing and classification.

ASSIGNMENT NO. 4
TITLE: Transformer Implementation

PROBLEM STATEMENT / DEFINITION: Create a Transformer from scratch using the PyTorch library

OBJECTIVE:
● To understand the working of Transformer architecture in detail.
● To learn how to implement the Transformer model using the PyTorch
library.
● To gain experience in building deep learning models from scratch.

OUTCOME:
● Ability to comprehend the various components of the Transformer
model and their functions.
● Proficiency in implementing the Transformer model using the
PyTorch library for various NLP tasks such as text classification,
language modelling, and machine translation.
● Enhanced ability to create deep learning models from scratch,
enabling the student to build customized models for specific use
cases.

S/W PACKAGES AND HARDWARE APPARATUS USED: Operating system: open-source Linux or its derivative;
Python (v3.6+), PyTorch library (v1.8+);
GPU (optional but recommended) for faster training of the model.

REFERENCES: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural
information processing systems 30 (2017).

INSTRUCTIONS FOR WRITING JOURNAL:
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Concepts related Theory:

● Introduction to Transformers:

Transformers are a type of neural network architecture based on the attention mechanism.
The architecture was introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017 and has since
become the state-of-the-art method for various natural language processing tasks. Transformers
differ from traditional recurrent neural networks (RNNs) in that they are able to process the input
sequence in parallel, rather than sequentially. This makes them faster to train and more efficient on
longer sequences.

● Transformer Architecture:

Transformers typically consist of an encoder and a decoder, each of which contains multiple
layers of self-attention and feedforward neural networks. During training, the model is optimized
to minimize an objective function, such as the cross-entropy loss, between the predicted output
and the true output.

1. Encoder:

The encoder takes an input sequence and outputs a sequence of hidden representations.
Each layer of the encoder consists of two sub-layers: a self-attention layer and a
feedforward neural network. The self-attention layer calculates the importance of each
token in the input sequence with respect to every other token in the sequence, allowing
the model to focus on the most relevant parts of the input sequence. The feedforward
neural network takes the output of the self-attention layer and applies a non-linear
transformation to it.

2. Decoder:

The decoder takes the output sequence of the encoder and generates a target sequence one
token at a time. Each layer of the decoder consists of three sub-layers: a masked
self-attention layer over the tokens generated so far, an encoder-decoder attention
(cross-attention) layer that takes the output of the encoder as input, and a feedforward
neural network. The masking prevents a position from attending to later positions in the
target sequence. The encoder-decoder attention layer allows the decoder to focus on the
most relevant parts of the input sequence when generating the output sequence.

3. Self-attention mechanism:

The self-attention mechanism is a critical component of the Transformer architecture. It
enables the model to selectively focus on different parts of the input sequence, allowing it
to capture long-term dependencies more effectively.

Self-attention is a mechanism that computes a weighted sum of the input sequence
elements to obtain a representation of each element in the sequence. The weights are
computed based on the similarity between the element and all other elements in the
sequence. This similarity is calculated using a score function that takes as input the
embeddings of the two elements being compared. In the Transformer, the score is a scaled
dot product between a query vector and a key vector, both obtained from the embeddings by
learned linear projections, and the scores are normalized with a softmax.

In the Transformer architecture, the self-attention mechanism is used to compute a new
representation for each input element, which is then used as input to the feedforward
network. The new representation of each element is computed from all elements of the
sequence at once, allowing the model to capture long-term dependencies between elements
in the sequence.

The relevance of the self-attention mechanism in the Transformer architecture is due to
its ability to capture dependencies between all elements in the input sequence
simultaneously. This is in contrast to traditional recurrent neural networks (RNNs), which
process the input sequence sequentially and are limited in their ability to capture long-term
dependencies. By allowing the model to selectively focus on different parts of the
input sequence, the self-attention mechanism enables the Transformer to outperform
traditional RNNs on a range of natural language processing tasks.
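
The weighted-sum computation can be sketched as scaled dot-product self-attention for a single sequence. NumPy is used here only to keep the sketch light on dependencies; the assignment itself asks for PyTorch, where the same steps map onto `torch.matmul` and `torch.softmax`. The projection matrices are random placeholders rather than trained weights, and all names and sizes are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sequence of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # similarity of every pair of positions
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # weighted sum of the values


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```

Row i of `weights` shows how strongly position i attends to every position in the sequence, which is the quantity usually visualized as an attention map.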

A few more related terms:

● Feedforward Networks:

In the Transformer architecture, a feedforward neural network is applied to each position
independently and identically. This means that the same weights are used for each position in the
input sequence. The feedforward network consists of two linear transformations with a
non-linear activation function such as the rectified linear unit (ReLU) between them. This helps the model learn
complex representations of the input data.

● Layer Normalization:

Layer normalization is a technique used in deep learning to improve the performance and stability
of neural networks. In the context of the Transformer architecture, layer normalization is applied
to the output of each sub-layer after it has been added to the sub-layer's input through a
residual connection. This holds for both the self-attention and feedforward sub-layers. The
technique normalizes the activations of the neurons across the feature dimension, which helps to
reduce the internal covariate shift and improves the training process.
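
A position-wise feedforward sub-layer with a residual connection and layer normalization can be sketched as follows. NumPy stands in for PyTorch to keep the example small, the weights are random placeholders, and the learnable gain and bias of layer normalization are omitted for brevity; all of these are simplifying assumptions.

```python
import numpy as np


def feed_forward(x, w1, b1, w2, b2):
    """Two linear maps with a ReLU in between, applied to every position alike."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2


def layer_norm(x, eps=1e-5):
    """Normalize each position across its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def ffn_sublayer(x, w1, b1, w2, b2):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + feed_forward(x, w1, b1, w2, b2))


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = ffn_sublayer(x, w1, b1, w2, b2)
print(out.shape)  # (5, 8)
```

After normalization every position has roughly zero mean and unit variance across its features, regardless of the scale of the feedforward output.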

● Multi-head Attention:

Multi-head attention is a mechanism used in the Transformer architecture to allow the model to
jointly attend to information from different representation subspaces at different positions. The
input is first transformed into multiple subspaces, and self-attention is then applied to each of
these subspaces independently. This allows the model to capture different patterns of the input
data, leading to better performance on tasks that require modelling long-range dependencies.
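
The split into subspaces can be sketched by reshaping the projected queries, keys, and values into heads, attending within each head, and concatenating the results. As before, NumPy and random weights are assumptions made for a compact illustration, and the names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention for one sequence of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # one score matrix per head
    heads = softmax(scores) @ v                          # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                  # final output projection


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mha_out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(mha_out.shape)  # (5, 8)
```

Because each head works on `d_model / num_heads` features, the total cost stays close to that of a single full-width attention while letting the heads learn different patterns.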

● PyTorch:

PyTorch is an open-source machine learning library developed by Facebook's AI research group.
It is widely used for deep learning tasks, including natural language processing. PyTorch provides
a flexible and intuitive programming model that allows researchers and developers to quickly
prototype and experiment with deep learning models. PyTorch supports dynamic computational
graphs, which enable efficient computation of gradients during training, making it an ideal choice
for training complex models such as transformers.

PyTorch provides a wide range of built-in functions and modules for implementing deep learning
models, including various layers, activation functions, and optimization algorithms. It also
provides a simple and intuitive API for data loading and preprocessing, making it easy to work
with large datasets in natural language processing. The PyTorch ecosystem also includes several
high-level libraries, such as Transformers and TorchText, which provide pre-trained models and
utilities for common NLP tasks, such as text classification, sequence labeling, and language
modeling.

● Applications of Transformers:

Transformers have been used in various natural language processing tasks, such as machine
translation, text classification, and text generation. They have also been used in computer vision
tasks, such as object detection and image captioning. Transformers have been shown to be highly
effective in handling long-range dependencies and have become the state-of-the-art method in
many tasks.

Conclusion:

In this assignment, we gained a better understanding of the Transformer architecture and its various
components by implementing a Transformer from scratch using the PyTorch library. We learned how to
implement the self-attention mechanism, multi-head attention, feedforward networks, and layer
normalization in PyTorch. We also learned how to preprocess the data, train the model, and evaluate its
performance.

Review Questions:

ASSIGNMENT NO. 5
TITLE: Exploring Morphology with Add Delete Tables

PROBLEM STATEMENT / DEFINITION: Morphology is the study of the way words are built up from smaller
meaning-bearing units.
Study and understand the concepts of morphology by the use of add delete
table.

OBJECTIVE:
● To understand the basic concepts of morphology.
● To learn about the different types of morphemes.
● To use add delete table for identifying and analyzing morphemes in
words.

OUTCOME:
● Improved understanding of the structure and formation of words.
● Ability to identify different types of morphemes in a given word.
● Proficiency in using add delete table for morphological analysis of
words.

S/W PACKAGES AND HARDWARE APPARATUS USED: Microsoft Word, Python, NLTK library

REFERENCES: Manning, Christopher D., and Hinrich Schütze, "Foundations of Statistical
Natural Language Processing", Cambridge, MA: MIT Press

INSTRUCTIONS FOR WRITING JOURNAL:
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Conclusion/Analysis

Concepts related Theory:

● Morphology and its importance in NLP:

Morphology is the study of the way words are built up from smaller meaning-bearing units called
morphemes. The study of morphology is crucial for natural language processing as it helps in
understanding the structure and meaning of words. By understanding morphology, we can break
down complex words into smaller meaningful units, which can then be used to analyze and
understand the structure of sentences.

● Types of morphemes:

Morphemes can be classified into two types: free morphemes and bound morphemes. Free
morphemes are words that can stand alone and have meaning by themselves, such as "book" or
"run". On the other hand, bound morphemes are units that cannot stand alone and must be
attached to a free morpheme to create meaning. Examples of bound morphemes include prefixes
like "un-" in "unhappy" or suffixes like "-ed" in "played".

● Morphological Processes:

Morphological processes refer to the ways in which morphemes (the smallest meaning-bearing
units) are combined to form words. These processes can be categorized into two main types:
inflectional and derivational.

1. Inflectional processes involve the modification of a word to reflect grammatical
distinctions such as tense, number, gender, or case. Inflectional morphemes do not change
the basic meaning of the word, but rather modify its form to indicate a particular
grammatical function. For example, in English, the inflectional morpheme "-s" can be
added to the end of a verb to indicate third person singular present tense, as in the word
"talks".
2. Derivational processes, on the other hand, involve the creation of new words by adding
affixes (prefixes or suffixes) to existing words. Unlike inflectional processes, derivational
processes change the meaning of the word. For example, in English, the prefix "un-" can
be added to an adjective to form a new word with the opposite meaning, as in the word
"unhappy".

Morphological processes can also include other modifications to words, such as reduplication
(repeating part or all of a word to create a new word), compounding (combining two or more
words to create a new word), and suppletion (using a completely different word form to express a
particular grammatical function).

Understanding the various morphological processes is essential in many natural language
processing tasks, such as morphological analysis, part-of-speech tagging, and machine
translation. By recognizing the morphological structure of words, NLP models can better
understand the meaning and context of the text they are processing.

● Morphological Analysis:

Morphological analysis is the process of breaking down a word into its component parts, such as
stems, prefixes, and suffixes, in order to understand its meaning and grammatical structure. This
process involves identifying and analyzing the morphemes that make up a word. There are
various techniques for performing morphological analysis, including the use of 1) add/delete
tables and 2) finite-state transducers.

1. Add-Delete table:

An add-delete table is a table used in morphology to track how words are formed by adding and
deleting morphemes. Each row in the table represents a step in the process of word formation,
with the left column representing the starting word and the right column representing the resulting
word after the addition or deletion of a morpheme. For example, the add-delete table for the word
"unhappiness" might include rows such as "happy" to "happiness" (delete "y", add "iness") and
"happiness" to "unhappiness" (add the prefix "un-").

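One way to make such a table executable is to store, for each row, the string to delete from the end of the root and the string to add in its place. This column layout is an assumption made for the sketch, since the text above does not fix the exact columns, and the rows and function name are illustrative.

```python
def apply_add_delete(root, delete, add):
    """Remove `delete` from the end of `root`, then append `add`."""
    if delete and not root.endswith(delete):
        raise ValueError(f"{root!r} does not end with {delete!r}")
    stem = root[: len(root) - len(delete)] if delete else root
    return stem + add


# Each row: (root, delete, add, grammatical feature)
ADD_DELETE_TABLE = [
    ("play", "", "ed", "past tense"),
    ("happy", "y", "iness", "noun"),
    ("city", "y", "ies", "plural"),
]

forms = [apply_add_delete(root, delete, add) for root, delete, add, _ in ADD_DELETE_TABLE]
for (root, delete, add, feature), form in zip(ADD_DELETE_TABLE, forms):
    print(f"{root:6s} -{delete or '∅':2s} +{add:6s} -> {form:10s} ({feature})")
```

Reading the rows in the other direction (delete the added string, restore the deleted one) gives a simple analysis procedure for the same word forms.
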
2. Finite-state Transducers:

Finite-state transducers are computational models that can be used to perform
morphological analysis and generation. A transducer is implemented as a finite-state
automaton, which consists of a set of states and transitions between them, and whose
transitions map input symbols to output symbols. Each state
represents a specific morphological form, and each transition corresponds to a specific
morphological rule. The transducer also incorporates a lexicon of known words and
their corresponding morphological forms, which allows it to recognize and generate
words that are not covered by the regular morphological rules.

For example, consider the English word "dogs". A finite-state transducer can be used to
analyze the morphological structure of this word by recognizing that "dog" is the root
word and "s" is the inflectional suffix indicating plural. The transducer can then generate
the corresponding output form "dog+s" or "dogs".

These models can be trained on a corpus of words to learn the rules for morphological
analysis and generation, and can be used for various applications such as spell-checking,
language translation, and information retrieval.
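
A toy transducer for the "dogs" example can be written as a table of transitions, each mapping an input character to an output string. The lexicon, the suffix list, and the state naming are hypothetical choices for this sketch; it applies every suffix to every stem, so it overgenerates (it also accepts "doged") and is not a realistic analyzer.

```python
def build_transducer(stems, suffixes):
    """Return transitions {(state, char): (next_state, output)} and accepting states."""
    transitions, accepting = {}, set()
    for stem in stems:
        state = ""
        for ch in stem:                       # stem characters are copied to the output
            transitions[(state, ch)] = (state + ch, ch)
            state += ch
        accepting.add(state)                  # the bare stem is a valid word
        for suffix in suffixes:
            current = state
            for i, ch in enumerate(suffix):   # a "+" marks the morpheme boundary
                nxt = f"{stem}+{suffix[:i + 1]}"
                transitions[(current, ch)] = (nxt, ("+" if i == 0 else "") + ch)
                current = nxt
            accepting.add(current)
    return transitions, accepting


def analyze(word, transitions, accepting):
    """Run the transducer over `word`; return the analysis or None if rejected."""
    state, output = "", []
    for ch in word:
        if (state, ch) not in transitions:
            return None
        state, out = transitions[(state, ch)]
        output.append(out)
    return "".join(output) if state in accepting else None


transitions, accepting = build_transducer(stems=["dog", "walk"], suffixes=["s", "ed"])
print(analyze("dogs", transitions, accepting))    # dog+s
print(analyze("walked", transitions, accepting))  # walk+ed
print(analyze("cats", transitions, accepting))    # None
```

Swapping the input and output side of every transition would turn the same table into a generator that maps "dog+s" back to "dogs".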

Conclusion:

In conclusion, the assignment on morphology provided an in-depth understanding of how words are built
up from smaller meaning-bearing units. By studying and applying concepts such as add delete tables and
morphological analysis, one can gain a better understanding of language and how it works. This
knowledge can be applied to various fields such as natural language processing, linguistics, and
computational linguistics.

Review Questions:

TITLE: Resampling techniques, histogram and equalized histogram

PROBLEM STATEMENT / DEFINITION: Consider any image with size 1024*1024. Modify the image to the
sizes 512*512, 256*256, 128*128, 64*64 and 32*32 using
subsampling technique. Create the original image from all the
above subsampled images using resampling technique. Read any
image. Display the histogram, Equalized histogram, and image with
equalized histogram

OBJECTIVE:
● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques
● Understand resampling techniques, histogram, Equalized
histogram concept.

S/W PACKAGES AND HARDWARE APPARATUS USED:
1. Operating System: 64-bit open-source Linux or its derivative
2. Programming Languages: Python, OpenCV

REFERENCES:
● Gonzalez & Woods, “Digital Image Processing”, Pearson
Education, 3rd Edition, 2008
● S Sridhar, “Digital Image Processing”, Oxford University Press,
2nd Edition.
● Jain Anil K., “Fundamentals of Digital Image Processing”,
Prentice Hall India, 4th Edition.

STEPS: Refer to theory, algorithm, test input, test output

INSTRUCTIONS FOR WRITING JOURNAL:
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning outcome
6. Related Mathematics
7. Concepts related Theory
8. Algorithm
9. Test cases
10. Conclusion and applications (the verification and testing of
outcomes)

Aim: To understand resampling techniques, histogram and equalized histogram

Problem Statement / Definition:


Consider any image with size 1024*1024. Modify the image to the sizes 512*512, 256*256,
128*128, 64*64 and 32*32 using subsampling technique. Create the original image from all
the above subsampled images using resampling technique. Read any image. Display the
histogram, Equalized histogram, and image with equalized histogram

Prerequisites

Image processing concepts

Learning Objectives

● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques.
● Understand the concepts of resampling techniques, histogram, and equalized histogram.

Learning Outcome:

Students will be able to apply knowledge of mathematics for image understanding and
analysis.

Theory:

Digital images play an important role both in everyday applications and in research and
technology. Digital image processing refers to the manipulation of an image by means of a
processor. The elements of an image processing system include image acquisition, image
storage, image processing, and display.

Digital image resampling occupies a small but important place in the fields of image
processing and computer graphics. In the most basic terms, resampling is the process of
geometrically transforming digital images.

Resampling finds uses in many fields. It is usually a step in a larger process, seldom an
end in itself and is most often viewed as a means to an end. In computer graphics
resampling allows texture to be applied to surfaces in computer generated imagery,
without the need to explicitly model the texture. In medical and remotely sensed imagery
it allows an image to be registered with some standard co-ordinate system, be it another
image, a map, or some other reference. This is primarily done to prepare for further
processing.

Resampling: the geometric transformation of discrete images. In image resampling we are
given a discrete image and a transformation. The aim is to produce a second discrete image
which looks as if it was formed by applying the transformation to the original discrete
image. What is meant by ‘looks as if’ is that the continuous image generated from the second
discrete image, to all intents and purposes, appears to be identical to a transformed version
of the continuous image generated from the first discrete image. What makes resampling
difficult is that, in practice, it is impossible to generate the second image by directly
applying the mathematical transformation to the first (except for a tiny set of special case
transformations).

Scaling, or simply resizing, is the process of increasing or decreasing the size of an image
in terms of width and height.

When resizing an image, it’s important to keep in mind the aspect ratio — which is the
ratio of an image’s width to its height. Ignoring the aspect ratio can lead to resized images
that look compressed and distorted.

In an image processing context, the histogram of an image normally refers to a histogram
of the pixel intensity values. This histogram is a graph showing the number of pixels in
an image at each different intensity value found in that image.

● The histogram of an image provides a global description of the appearance of the image.
● A histogram conveys a large amount of information about the image.
● The histogram of an image represents the relative frequency of occurrence of the
various gray levels in the image.

Histogram equalization is a method in image processing of contrast adjustment using the
image’s histogram.

This method usually increases the global contrast of many images, especially when the
usable data of the image is represented by close contrast values. Through this adjustment,
the intensities can be better distributed on the histogram. This allows for areas of lower
local contrast to gain a higher contrast. Histogram equalization accomplishes this by
effectively spreading out the most frequent intensity values. The method is useful in
images with backgrounds and foregrounds that are both bright or both dark.
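To make the idea of "spreading out the most frequent intensity values" concrete, here is a minimal NumPy sketch of histogram equalization through the cumulative histogram. The function name equalize and the synthetic test image are assumptions made for illustration; library routines such as cv2.equalizeHist may differ in rounding details.

```python
import numpy as np

def equalize(gray):
    """Histogram-equalize an 8-bit grayscale image (2-D uint8 array)."""
    hist = np.bincount(gray.ravel(), minlength=256)   # pixels per gray level
    cdf = hist.cumsum()                               # cumulative histogram
    cdf_min = cdf[cdf > 0][0]                         # first non-zero CDF value
    if cdf[-1] == cdf_min:                            # constant image: nothing to spread
        return gray.copy()
    # Map each gray level through the normalised cumulative histogram
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# Low-contrast synthetic image: values confined to 100..131
img = (np.arange(64 * 64).reshape(64, 64) % 32 + 100).astype(np.uint8)
out = equalize(img)
print(img.min(), img.max(), "->", out.min(), out.max())
```

The narrow input range is mapped onto the full 0 to 255 range, which is the contrast gain described above.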

Functions:

The cv2.resize function resizes an image. OpenCV provides several interpolation methods
for resizing.

Choice of Interpolation Method for Resizing:

cv2.INTER_AREA: Resampling using pixel area relation. This is preferred when we need to shrink an image.
cv2.INTER_CUBIC: Bicubic interpolation. This is slower than INTER_LINEAR but usually gives better
quality when enlarging an image.
cv2.INTER_LINEAR: Bilinear interpolation. This is primarily used when zooming is required. This is
the default interpolation technique in OpenCV.
Syntax: cv2.resize(src, dsize[, dst[, fx[, fy[, interpolation]]]])
Parameters:
src: Input image array
dsize: Size of the output image as (width, height); if None, it is computed from fx and fy
dst: Output array [optional]
fx: Scale factor along the horizontal axis [optional]
fy: Scale factor along the vertical axis [optional]
interpolation: One of the above interpolation methods [optional]

cv2.cvtColor() method is used to convert an image from one color space to another. There are
more than 150 color-space conversion methods available in OpenCV.

Syntax: cv2.cvtColor(src, code[, dst[, dstCn]])

Parameters:
src: It is the image whose color space is to be changed.
code: It is the color space conversion code.
dst: It is the output image of the same size and depth as src image. It is an optional
parameter.
dstCn: It is the number of channels in the destination image. If the parameter is 0 then the
number of the channels is derived automatically from src and code. It is an optional
parameter.

Return Value: It returns an image.

The cv2.calcHist() function calculates image histograms. It can be applied to a grayscale image
or to each of the constituent color channels (blue, green, and red) of a color image.

Syntax: cv2.calcHist(images, channels, mask, histSize, ranges[, hist[, accumulate]])


Parameters:

images: list of images as numpy arrays. All images must be of the same dtype and same
size.
channels: list of the channels used to calculate the histograms.
mask: optional mask (8 bit array) of the same size as the input image.
histSize: histogram sizes in each dimension
ranges: Array of the dims arrays of the histogram bin boundaries in each dimension
hist: Output histogram
accumulate: accumulation flag, enables to compute a single histogram from several sets
of arrays.
Return: It returns an array of histogram points of dtype float32.

Algorithm:

a) Subsampling algorithm:
1. Import cv2.
2. Load the image and display it using cv2_imshow(image) (provided by google.colab.patches
in Google Colab; outside Colab, use cv2.imshow).
3. Get the height and width using height, width = image.shape[:2].
4. Subsample the original image (1024x1024) to 512x512 using
cv2.resize(image, (new_width, new_height), interpolation=cv2.INTER_LINEAR)
with new_width = 512 and new_height = 512.
5. Subsample the original image (1024x1024) to 256x256, 128x128, 64x64 and 32x32
using the same function.
6. Display all resized images.
7. Resample each of the 512x512, 256x256, 128x128, 64x64 and 32x32 images back to the
original size (1024x1024) using the following steps:
1. Set the new dimensions of the image (new_height = 1024, new_width = 1024).
2. Resize the image to the original size using the resize function:
new_original_image1 = cv2.resize(resized_image1, (new_width, new_height),
interpolation=cv2.INTER_LINEAR)
3. Display the image using cv2_imshow(new_original_image1).

b) Display the histogram, Equalized histogram, and image with equalized histogram.
1. Load the image.
2. Convert it to grayscale by using function
cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
3. Calculate the histogram of the grayscale image by using
cv2.calcHist([gray], [0], None, [256], [0, 256])
4. Plot the histogram using matplotlib. The histogram will have 256 bins,
corresponding to the possible gray levels in the image.
5. Displaying Equalized Histogram for the image:
i. Equalize the histogram of the grayscale image using the cv2.equalizeHist()
function, and calculate the histogram of the equalized image.
ii. Plot the histogram using matplotlib. The histogram will have 256 bins,
corresponding to the possible gray levels in the image.
6. Display the image corresponding to the equalized histogram
7. Display the images to show difference between the original grayscale image and
the image corresponding to the equalized histogram.

Conclusion:
Students will be able to implement resampling techniques, the histogram, and the equalized histogram.

TITLE Resampling techniques

PROBLEM STATEMENT / DEFINITION
Consider any image of size 1024*1024. Modify the image to the sizes 512*512, 256*256,
128*128, 64*64 and 32*32 using a subsampling technique. Recreate the original-size image
from each of the subsampled images using a resampling technique.

OBJECTIVE
● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques.
● Understand resampling techniques.

S/W PACKAGES AND HARDWARE APPARATUS USED
1. Operating System: 64-bit open source Linux or its derivative
2. Programming Languages: Python, OpenCV

REFERENCES
● Gonzalez & Woods, “Digital Image Processing”, Pearson Education, 3rd Edition, 2008.
● S Sridhar, “Digital Image Processing”, Oxford University Press, 2nd Edition.
● Jain Anil K., “Fundamentals of Digital Image Processing”, Prentice Hall India, 4th Edition.

STEPS Refer to theory, algorithm, test input, test output

INSTRUCTIONS FOR WRITING JOURNAL
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning outcome
6. Related Mathematics
7. Concepts related Theory
8. Algorithm
9. Conclusion and applications (the verification and testing of outcomes)

Aim: To understand resampling techniques

Problem Statement / Definition:


Consider any image with size 1024*1024. Modify the image to the sizes 512*512, 256*256,
128*128, 64*64 and 32*32 using subsampling technique. Create the original image from all the
above subsampled images using resampling technique

Prerequisites

Image processing concepts

Learning Objectives

● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques.
● Understand the concepts of resampling techniques, histogram, and equalized histogram.

Learning Outcome:

Students will be able to apply knowledge of mathematics for image understanding and
analysis.

Theory:

Digital images play an important role both in everyday applications and in research and
technology. Digital image processing refers to the manipulation of an image by means of a
processor. The elements of an image processing system include image acquisition, image
storage, image processing, and display.

Digital image resampling occupies a small but important place in the fields of image
processing and computer graphics. In the most basic terms, resampling is the process of
geometrically transforming digital images.

Resampling finds uses in many fields. It is usually a step in a larger process, seldom an
end in itself and is most often viewed as a means to an end. In computer graphics
resampling allows texture to be applied to surfaces in computer generated imagery,
without the need to explicitly model the texture. In medical and remotely sensed imagery
it allows an image to be registered with some standard co-ordinate system, be it another
image, a map, or some other reference. This is primarily done to prepare for further
processing.

Resampling: the geometric transformation of discrete images. In image resampling we are
given a discrete image and a transformation. The aim is to produce a second discrete image
which looks as if it was formed by applying the transformation to the original discrete
image. What is meant by ‘looks as if’ is that the continuous image generated from the second
discrete image, to all intents and purposes, appears to be identical to a transformed version
of the continuous image generated from the first discrete image. What makes resampling
difficult is that, in practice, it is impossible to generate the second image by directly
applying the mathematical transformation to the first (except for a tiny set of special case
transformations).

Scaling, or simply resizing, is the process of increasing or decreasing the size of an image
in terms of width and height.

When resizing an image, it’s important to keep in mind the aspect ratio — which is the
ratio of an image’s width to its height. Ignoring the aspect ratio can lead to resized images
that look compressed and distorted.

In an image processing context, the histogram of an image normally refers to a histogram
of the pixel intensity values. This histogram is a graph showing the number of pixels in
an image at each different intensity value found in that image.

● The histogram of an image provides a global description of the appearance of the image.
● A histogram conveys a large amount of information about the image.
● The histogram of an image represents the relative frequency of occurrence of the
various gray levels in the image.

Histogram equalization is a method in image processing of contrast adjustment using the
image’s histogram.

This method usually increases the global contrast of many images, especially when the
usable data of the image is represented by close contrast values. Through this adjustment,
the intensities can be better distributed on the histogram. This allows for areas of lower
local contrast to gain a higher contrast. Histogram equalization accomplishes this by
effectively spreading out the most frequent intensity values. The method is useful in
images with backgrounds and foregrounds that are both bright or both dark.

Functions:

The cv2.resize function resizes an image. OpenCV provides several interpolation methods
for resizing.

Choice of Interpolation Method for Resizing:

cv2.INTER_AREA: Resampling using pixel area relation. This is preferred when we need to shrink an image.
cv2.INTER_CUBIC: Bicubic interpolation. This is slower than INTER_LINEAR but usually gives better
quality when enlarging an image.
cv2.INTER_LINEAR: Bilinear interpolation. This is primarily used when zooming is required. This is
the default interpolation technique in OpenCV.
Syntax: cv2.resize(src, dsize[, dst[, fx[, fy[, interpolation]]]])
Parameters:
src: Input image array
dsize: Size of the output image as (width, height); if None, it is computed from fx and fy
dst: Output array [optional]
fx: Scale factor along the horizontal axis [optional]
fy: Scale factor along the vertical axis [optional]
interpolation: One of the above interpolation methods [optional]

cv2.cvtColor() method is used to convert an image from one color space to another. There are
more than 150 color-space conversion methods available in OpenCV.

Syntax: cv2.cvtColor(src, code[, dst[, dstCn]])

Parameters:
src: It is the image whose color space is to be changed.
code: It is the color space conversion code.
dst: It is the output image of the same size and depth as src image. It is an optional
parameter.
dstCn: It is the number of channels in the destination image. If the parameter is 0 then the
number of the channels is derived automatically from src and code. It is an optional
parameter.

Return Value: It returns an image.

The cv2.calcHist() function calculates image histograms. It can be applied to a grayscale image
or to each of the constituent color channels (blue, green, and red) of a color image.

Syntax: cv2.calcHist(images, channels, mask, histSize, ranges[, hist[, accumulate]])


Parameters:

images: list of images as numpy arrays. All images must be of the same dtype and same
size.
channels: list of the channels used to calculate the histograms.
mask: optional mask (8 bit array) of the same size as the input image.
histSize: histogram sizes in each dimension
ranges: Array of the dims arrays of the histogram bin boundaries in each dimension
hist: Output histogram
accumulate: accumulation flag, enables to compute a single histogram from several sets
of arrays.
Return: It returns an array of histogram points of dtype float32.

Algorithm:

a) Subsampling algorithm:
1. Import cv2.
2. Load the image and display it using cv2_imshow(image) (provided by google.colab.patches
in Google Colab; outside Colab, use cv2.imshow).
3. Get the height and width using height, width = image.shape[:2].
4. Subsample the original image (1024x1024) to 512x512 using
cv2.resize(image, (new_width, new_height), interpolation=cv2.INTER_LINEAR)
with new_width = 512 and new_height = 512.
5. Subsample the original image (1024x1024) to 256x256, 128x128, 64x64 and 32x32
using the same function.
6. Display all resized images.
7. Resample each of the 512x512, 256x256, 128x128, 64x64 and 32x32 images back to the
original size (1024x1024) using the following steps:
1. Set the new dimensions of the image (new_height = 1024, new_width = 1024).
2. Resize the image to the original size using the resize function:
new_original_image1 = cv2.resize(resized_image1, (new_width, new_height),
interpolation=cv2.INTER_LINEAR)
3. Display the image using cv2_imshow(new_original_image1).
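The algorithm above relies on cv2.resize. As a complementary view of what subsampling and resampling mean at the pixel level, the following NumPy-only sketch keeps every k-th pixel and then rebuilds the full size by nearest-neighbour repetition. The helper names subsample and upsample_nearest and the synthetic image are assumptions made for illustration, and this is a different (cruder) technique than the interpolation used by cv2.resize.

```python
import numpy as np

def subsample(img, factor):
    """Keep every `factor`-th pixel along both axes (no smoothing)."""
    return img[::factor, ::factor]

def upsample_nearest(img, factor):
    """Repeat each pixel `factor` times along both axes."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

# Synthetic 1024x1024 test pattern standing in for a loaded image
image = (np.arange(1024 * 1024).reshape(1024, 1024) % 256).astype(np.uint8)

for size in (512, 256, 128, 64, 32):
    factor = 1024 // size
    small = subsample(image, factor)
    restored = upsample_nearest(small, factor)
    print(small.shape, "->", restored.shape)
```

The restored image has the original dimensions, but the discarded pixels are not recovered, so smaller intermediate sizes give blockier results.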

b) Display the histogram, Equalized histogram, and image with equalized histogram.
1. Load the image.
2. Convert it to grayscale by using function
cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
3. Calculate the histogram of the grayscale image by using
cv2.calcHist([gray], [0], None, [256], [0, 256])
4. Plot the histogram using matplotlib. The histogram will have 256 bins,
corresponding to the possible gray levels in the image.
5. Displaying Equalized Histogram for the image:
i. Equalize the histogram of the grayscale image using the cv2.equalizeHist()
function, and calculate the histogram of the equalized image.
ii. Plot the histogram using matplotlib. The histogram will have 256 bins,
corresponding to the possible gray levels in the image.
6. Display the image corresponding to the equalized histogram
7. Display the images to show difference between the original grayscale image and
the image corresponding to the equalized histogram.

Conclusion:
Students will be able to implement resampling techniques, the histogram, and the equalized histogram.

TITLE Contrast stretching, intensity level slicing

PROBLEM STATEMENT / DEFINITION
Read any image. Display the outputs of contrast stretching and intensity level slicing.

OBJECTIVE
● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques.
● Understand contrast stretching, intensity level slicing, and their uses.

S/W PACKAGES AND HARDWARE APPARATUS USED
1. Operating System: 64-bit open source Linux or its derivative
2. Programming Languages: Python, OpenCV

REFERENCES
● Gonzalez & Woods, “Digital Image Processing”, Pearson Education, 3rd Edition, 2008.
● S Sridhar, “Digital Image Processing”, Oxford University Press, 2nd Edition.
● Jain Anil K., “Fundamentals of Digital Image Processing”, Prentice Hall India, 4th Edition.

STEPS Refer to theory, algorithm, test input, test output

INSTRUCTIONS FOR WRITING JOURNAL
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning outcome
6. Related Mathematics
7. Concepts related Theory
8. Algorithm
9. Conclusion and applications (the verification and testing of outcomes)

Aim: To understand and implement Contrast stretching, intensity level slicing

Problem Statement / Definition:

Read any image. Display the outputs of contrast stretching, intensity level slicing

Prerequisites

Image processing concepts

Learning Objectives

● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques.
● Understand contrast stretching, intensity level slicing, and their uses.

Learning Outcome:

Students will be able to apply knowledge of mathematics for image understanding and
analysis.

Theory:

Enhancement techniques are employed to increase the contrast of an image. Generally, an
image can be enhanced by spreading out the range of scene illumination. This procedure is
called contrast stretching. Contrast stretching (often called normalization) is a simple
image enhancement technique that attempts to improve the contrast in an image by
‘stretching’ the range of intensity values it contains to span a desired range of values,
typically the full range of pixel values that the image type allows. Contrast stretching
changes the distribution and range of the digital numbers assigned to each pixel in an
image. This is normally done to accentuate image details that may be difficult for the
human viewer to observe.

In other words, contrast stretching improves the contrast by stretching the intensity
values of an image to fill the entire dynamic range. The transformation function used here
is linear and monotonically increasing.

Example: If the minimum intensity value (r_min) present in the image is 100, it is
stretched down to the minimum possible intensity value, 0. Likewise, if the maximum
intensity value (r_max) is less than the maximum possible intensity value, 255, it is
stretched up to 255. (0 and 255 are taken as the standard minimum and maximum intensity
values for 8-bit images.)

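General Formula for Contrast Stretching:

$$ s = (r - r_{\min}) \cdot \frac{I_{\max} - I_{\min}}{r_{\max} - r_{\min}} + I_{\min} $$

For I_min = 0 and I_max = 255 (for a standard 8-bit grayscale image), this reduces to

$$ s = 255 \cdot \frac{r - r_{\min}}{r_{\max} - r_{\min}} $$

where,
s = output (stretched) pixel intensity value
r = current pixel intensity value
r_min = minimum intensity value present in the whole image
r_max = maximum intensity value present in the whole image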
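A small worked example, assuming the usual linear min-max mapping s = (r - r_min) * (I_max - I_min) / (r_max - r_min) + I_min. The function name contrast_stretch and the 2x2 test image are assumptions made for illustration.

```python
import numpy as np

def contrast_stretch(img, i_min=0, i_max=255):
    """Linearly map [r_min, r_max] of an 8-bit image onto [i_min, i_max]."""
    r_min, r_max = int(img.min()), int(img.max())
    if r_max == r_min:            # constant image: avoid division by zero
        return img.copy()
    scaled = (img.astype(np.float64) - r_min) * (i_max - i_min) / (r_max - r_min) + i_min
    return np.round(scaled).astype(np.uint8)

# Intensities occupy only 100..200 before stretching
img = np.array([[100, 125], [150, 200]], dtype=np.uint8)
print(contrast_stretch(img))
```

Here r_min = 100 maps to 0 and r_max = 200 maps to 255, with the values in between spread proportionally.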

Intensity Level Slicing:

Intensity level slicing means highlighting a specific range of intensities in an image. In
other words, we segment certain gray level regions from the rest of the image.

Suppose the region of interest in an image always takes values between, say, 80 and 150.
Intensity level slicing highlights this range, so that instead of looking at the whole
image one can focus on the highlighted region of interest.
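The two common variants of intensity level slicing (keeping or discarding the background) can be sketched in NumPy as follows. The function name slice_levels, the keep_background flag, and the tiny test image are assumptions made for illustration; the range 80 to 150 comes from the example above.

```python
import numpy as np

def slice_levels(gray, low, high, keep_background=True):
    """Highlight pixels whose intensity lies in [low, high]."""
    in_range = (gray >= low) & (gray <= high)
    if keep_background:
        out = gray.copy()
        out[in_range] = 255          # brighten the range, leave the rest unchanged
    else:
        out = np.where(in_range, 255, 0).astype(np.uint8)  # binary result
    return out

gray = np.array([[10, 80], [120, 200]], dtype=np.uint8)
print(slice_levels(gray, 80, 150))
print(slice_levels(gray, 80, 150, keep_background=False))
```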

Algorithm:

a) Contrast stretching:
● Import cv2 and numpy (as np).
● Load the image and display it using cv2_imshow(image) (provided by google.colab.patches
in Google Colab; outside Colab, use cv2.imshow).
● Split the image into its B, G, and R channels using channels = cv2.split(image)
(OpenCV loads color images in BGR order).
● Initialize the list of stretch factors: stretch_factors = []
● For each channel in channels:
    ● Calculate the minimum and maximum values in the channel:
      min_val, max_val, _, _ = cv2.minMaxLoc(channel)
    ● Calculate the stretch factors (this assumes max_val > min_val; a constant channel
      would cause a division by zero):
      stretch_min = 0
      stretch_max = 255
      stretch_factor_1 = (stretch_max - stretch_min) / (max_val - min_val)
      stretch_factor_2 = stretch_min - stretch_factor_1 * min_val
    ● Store the stretch factors for this channel:
      stretch_factors.append((stretch_factor_1, stretch_factor_2))
● Initialize the stretched image:
  stretched_image = np.zeros_like(image)
● Stretch the contrast of each channel:
  for i, channel in enumerate(channels):
      stretched_channel = cv2.convertScaleAbs(channel, alpha=stretch_factors[i][0],
                                              beta=stretch_factors[i][1])
      stretched_image[:,:,i] = stretched_channel
● Display the original image and the contrast-stretched image to show the difference.

Intensity Level Slicing:

● Load the image.
● Convert it to grayscale using
  gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
● Set the lower and upper intensity levels (lower_level and upper_level).
● Create a mask for the specified intensity range using
  mask = cv2.inRange(gray, lower_level, upper_level)
● Apply the mask to the original image to create the intensity-sliced version of the
  image using result = cv2.bitwise_and(image, image, mask=mask)
● Display the original image and the intensity-sliced image to show the difference.

Conclusion:
Students will be able to implement contrast stretching and intensity level slicing.

TITLE Histogram and Equalized histogram

PROBLEM STATEMENT / DEFINITION
Read any image. Display the histogram, the equalized histogram, and the image with
equalized histogram.

OBJECTIVE
● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques.
● Understand resampling techniques.
● Understand histogram and equalized histogram in image processing.

S/W PACKAGES AND HARDWARE APPARATUS USED
1. Operating System: 64-bit open source Linux or its derivative
2. Programming Languages: Python, OpenCV

REFERENCES
● Gonzalez & Woods, “Digital Image Processing”, Pearson Education, 3rd Edition, 2008.
● S Sridhar, “Digital Image Processing”, Oxford University Press, 2nd Edition.
● Jain Anil K., “Fundamentals of Digital Image Processing”, Prentice Hall India, 4th Edition.

STEPS Refer to theory, algorithm, test input, test output

INSTRUCTIONS FOR WRITING JOURNAL
1. Date
2. Assignment no.
3. Problem definition
4. Learning objective
5. Learning outcome
6. Related Mathematics
7. Concepts related Theory
8. Algorithm
9. Conclusion and applications (the verification and testing of outcomes)

Aim: To understand histogram and equalized histogram

Problem Statement / Definition:


Read any image. Display the histogram, Equalized histogram, and image with equalized
histogram

Prerequisites

Image processing concepts

Learning Objectives

● To become familiar with digital image fundamentals.
● To get exposed to simple image enhancement techniques.
● Understand the concepts of resampling techniques, histogram, and equalized histogram.

Learning Outcome:

Students will be able to apply knowledge of mathematics for image understanding and
analysis.

Theory:

Digital images play an important role both in everyday applications and in research and
technology. Digital image processing refers to the manipulation of an image by means of a
processor. The elements of an image processing system include image acquisition, image
storage, image processing, and display.

Digital image resampling occupies a small but important place in the fields of image
processing and computer graphics. In the most basic terms, resampling is the process of
geometrically transforming digital images.

Resampling finds uses in many fields. It is usually a step in a larger process, seldom an
end in itself and is most often viewed as a means to an end. In computer graphics
resampling allows texture to be applied to surfaces in computer generated imagery,
without the need to explicitly model the texture. In medical and remotely sensed imagery
it allows an image to be registered with some standard co-ordinate system, be it another
image, a map, or some other reference. This is primarily done to prepare for further
processing.

Resampling: the geometric transformation of discrete images. In image resampling we are
given a discrete image and a transformation. The aim is to produce a second discrete image
which looks as if it was formed by applying the transformation to the original discrete
image. What is meant by ‘looks as if’ is that the continuous image generated from the second
discrete image, to all intents and purposes, appears to be identical to a transformed version
of the continuous image generated from the first discrete image. What makes resampling
difficult is that, in practice, it is impossible to generate the second image by directly
applying the mathematical transformation to the first (except for a tiny set of special case
transformations).

Scaling, or simply resizing, is the process of increasing or decreasing the size of an image
in terms of width and height.

When resizing an image, it’s important to keep in mind the aspect ratio — which is the
ratio of an image’s width to its height. Ignoring the aspect ratio can lead to resized images
that look compressed and distorted.

In an image processing context, the histogram of an image normally refers to a histogram
of the pixel intensity values. This histogram is a graph showing the number of pixels in
an image at each different intensity value found in that image.

● The histogram of an image provides a global description of the appearance of the image.
● A histogram conveys a large amount of information about the image.
● The histogram of an image represents the relative frequency of occurrence of the
various gray levels in the image.
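The "relative frequency of occurrence" can be made concrete by dividing each bin count by the total number of pixels. The sketch below uses NumPy's histogram function on a tiny hand-made image; the variable names are assumptions made for illustration.

```python
import numpy as np

# 2x3 image containing gray levels 0, 1 and 2 only
gray = np.array([[0, 0, 1],
                 [1, 1, 2]], dtype=np.uint8)

# One bin per gray level of an 8-bit image
counts, _ = np.histogram(gray, bins=256, range=(0, 256))

# Relative frequency of gray level k: p(k) = n_k / N
rel_freq = counts / gray.size

print(counts[:3], rel_freq[:3])
```

The relative frequencies sum to 1, so the normalized histogram can be read as an estimate of the probability of each gray level.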

Histogram equalization is a method in image processing of contrast adjustment using the
image’s histogram.

This method usually increases the global contrast of many images, especially when the
usable data of the image is represented by close contrast values. Through this adjustment,
the intensities can be better distributed on the histogram. This allows for areas of lower
local contrast to gain a higher contrast. Histogram equalization accomplishes this by
effectively spreading out the most frequent intensity values. The method is useful in
images with backgrounds and foregrounds that are both bright or both dark.

Functions:

The cv2.resize function resizes an image. OpenCV provides several interpolation methods
for resizing.

Choice of Interpolation Method for Resizing:

cv2.INTER_AREA: Resampling using pixel area relation. This is preferred when we need to shrink an image.
cv2.INTER_CUBIC: Bicubic interpolation. This is slower than INTER_LINEAR but usually gives better
quality when enlarging an image.
cv2.INTER_LINEAR: Bilinear interpolation. This is primarily used when zooming is required. This is
the default interpolation technique in OpenCV.
Syntax: cv2.resize(src, dsize[, dst[, fx[, fy[, interpolation]]]])
Parameters:
src: Input image array
dsize: Size of the output image as (width, height); if None, it is computed from fx and fy
dst: Output array [optional]
fx: Scale factor along the horizontal axis [optional]
fy: Scale factor along the vertical axis [optional]
interpolation: One of the above interpolation methods [optional]

cv2.cvtColor() method is used to convert an image from one color space to another. There are
more than 150 color-space conversion methods available in OpenCV.

Syntax: cv2.cvtColor(src, code[, dst[, dstCn]])

Parameters:
src: It is the image whose color space is to be changed.
code: It is the color space conversion code.
dst: It is the output image of the same size and depth as src image. It is an optional
parameter.
dstCn: It is the number of channels in the destination image. If the parameter is 0 then the
number of the channels is derived automatically from src and code. It is an optional
parameter.

Return Value: It returns an image.

The cv2.calcHist() function calculates image histograms. It can be applied to a grayscale image
or to each of the constituent color channels (blue, green, and red) of a color image.

Syntax: cv2.calcHist(images, channels, mask, histSize, ranges[, hist[, accumulate]])


Parameters:

images: list of images as numpy arrays. All images must be of the same dtype and same
size.
channels: list of the channels used to calculate the histograms.
mask: optional mask (8 bit array) of the same size as the input image.
histSize: histogram sizes in each dimension
ranges: Array of the dims arrays of the histogram bin boundaries in each dimension
hist: Output histogram
accumulate: accumulation flag, enables to compute a single histogram from several sets
of arrays.
Return: It returns an array of histogram points of dtype float32.

Algorithm:

a) Subsampling algorithm:
1. Import cv2
2. Load the image and display it by using the cv2_imshow(image) function
3. Get and display the height and width by using height, width = image.shape[:2]
4. Sub-sample the original image (1024x1024) to size 512x512 by using the
cv2.resize(image, (new_width, new_height), interpolation=cv2.INTER_LINEAR)
function (new_width=512 and new_height=512)
5. Sub-sample the original image (1024x1024) to size 256x256, 128x128, 32x32 etc.
using the same function.
6. Display all resized images
7. Re-sample the images of size 512x512, 256x256, 128x128 and 32x32 back to the original
size (1024x1024) by using the following steps
1. Set the new dimensions of the image (new_height = 1024, new_width =
1024)
2. Resize the sub-sampled image to the original size by using the resize function as:
new_original_image1 = cv2.resize(resized_image1, (new_width, new_height),
interpolation=cv2.INTER_LINEAR)
3. Display the image by using cv2_imshow(new_original_image1)
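
The sub-sampling and re-sampling steps above can be tried without an image file. The following is a minimal pure-Python sketch, not OpenCV's implementation: it uses nearest-neighbour selection instead of the cv2.INTER_LINEAR interpolation named in the steps, and the helper name resize_nearest and the 4x4 test image are assumptions made for illustration.

```python
def resize_nearest(image, new_width, new_height):
    """Resize a 2-D list of pixel values using nearest-neighbour sampling."""
    old_height, old_width = len(image), len(image[0])
    resized = []
    for y in range(new_height):
        src_y = y * old_height // new_height  # source row for this output row
        row = []
        for x in range(new_width):
            src_x = x * old_width // new_width  # source column for this output column
            row.append(image[src_y][src_x])
        resized.append(row)
    return resized


# Hypothetical 4x4 "image" standing in for the 1024x1024 original.
original = [[row * 4 + col for col in range(4)] for row in range(4)]
small = resize_nearest(original, 2, 2)      # sub-sampling
restored = resize_nearest(small, 4, 4)      # re-sampling to the original size

print(small)
print(restored)
```

The restored image has the original dimensions but not the original content, which is the loss of detail the assignment asks students to observe.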

b) Display the histogram, Equalized histogram, and image with equalized histogram.
1. Load the image.
2. Convert it to grayscale by using function
cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
3. Calculate the histogram of the grayscale image by using
cv2.calcHist([gray], [0], None, [256], [0, 256])
4. Plot the histogram using matplotlib. The histogram will have 256 bins,
corresponding to the possible gray levels in the image.
5. Displaying Equalized Histogram for the image:
i. Equalize the histogram of the grayscale image using the cv2.equalizeHist()
function, and calculate the histogram of the equalized image.
ii. Plot the histogram using matplotlib. The histogram will have 256 bins,
corresponding to the possible gray levels in the image.
6. Display the image corresponding to the equalized histogram
7. Display the images to show difference between the original grayscale image and
the image corresponding to the equalized histogram.
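
To make the histogram and equalization steps concrete, here is a small pure-Python sketch of the underlying computation. It follows the textbook formula (map each gray level through the normalized cumulative histogram) and may differ in rounding details from cv2.equalizeHist(); the function names and the 3x3 sample image are assumptions for illustration.

```python
def histogram(gray, bins=256):
    """Count how many pixels take each gray level (what a 256-bin histogram holds)."""
    counts = [0] * bins
    for row in gray:
        for value in row:
            counts[value] += 1
    return counts


def equalize(gray):
    """Textbook histogram equalization of an 8-bit grayscale image."""
    hist = histogram(gray)
    total = sum(hist)
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest level present
    if total == cdf_min:                     # constant image: nothing to spread
        return [row[:] for row in gray]
    lut = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [[lut[v] for v in row] for row in gray]


gray = [[52, 55, 61], [59, 79, 61], [76, 61, 55]]  # hypothetical low-contrast patch
equalized = equalize(gray)
print(equalized)
```

The input levels occupy only 52 to 79, while the equalized levels span the full 0 to 255 range, which is the contrast stretch visible when the two images are displayed side by side.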

Conclusion:
Students will be able to implement resampling techniques.

PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE
ACADEMIC YEAR: 2022-23
DEPARTMENT of COMPUTER ENGINEERING DEPARTMENT
CLASS: B.E. SEMESTER: II
SUBJECT: LP VI
ASSIGNMENT NO. BI I
TITLE Import the legacy data from different sources such as (Excel, SQL
Server, Oracle etc.) and load in the target system. (You can download
sample databases such as Adventure works, Northwind, foodmart
etc.)
PROBLEM STATEMENT Import the data warehouse data in Microsoft Excel and create the
/DEFINITION Pivot table and Pivot Chart

OBJECTIVE To introduce the concepts and components of Business Intelligence (BI).
To import data from different sources and load it into the target system.
OUTCOME Students will be able to import data from different sources and load
into target system.
S/W PACKAGES AND MS Excel,
HARDWARE APPARATUS Operating System recommended: - 64-bit Open source Linux or its
USED derivative. Programming Languages: C++/JAVA/PYTHON/R. Programming
tools recommended: Front End: Java/Perl/PHP/Python/Ruby/.net, Backend:
MongoDB/MYSQL/Oracle, Database Connectivity: ODBC/JDBC,
Additional Tools: Octave, Matlab, WEKA, powerBI,

REFERENCES Text Books:
1. Grossmann W, Rinderle-Ma, "Fundamentals of Business Intelligence",
Springer, 2015
2. R. Sharda, D. Delen, & E. Turban, "Business Intelligence and Analytics:
Systems for Decision Support", 10th Edition, Pearson/Prentice Hall, 2015
Reference Books:
1. Paulraj Ponniah, "Data Warehousing Fundamentals", John Wiley.
2. "Introduction to Business Intelligence and Data Warehousing", IBM, PHI
3. Carlo Vercellis, "Business Intelligence: Data Mining and Optimization for Decision
Making", Wiley, 2019

STEPS

1. Date
INSTRUCTIONS FOR 2. Assignment no.
WRITING JOURNAL 3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Prerequisites:
1. Basics of dataset extensions.

2. Concept of data import.

1. What is Legacy Data?

Legacy data, according to BusinessDictionary, is "information maintained in an old or
out-of-date format or computer system that is consequently challenging to access or
handle."

2. Sources of Legacy Data

Where does legacy data come from? Virtually everywhere. Figure 1 indicates that there are
many sources from which you may obtain legacy data. These include existing databases, which
are often relational, although non-relational databases (hierarchical, network, object, XML,
object/relational, and NoSQL databases) are also encountered. Files, such as XML documents or
"flat files" such as configuration files and comma-delimited text files, are also common sources
of legacy data. Software, including legacy applications that have been wrapped (perhaps
via CORBA) and legacy services such as web services or CICS transactions, can also
provide access to existing information. The point is that there is often far more
to gaining access to legacy data than simply writing an SQL query against an existing
relational database.
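
As an illustration of reading one of the sources listed above, a comma-delimited flat file, the sketch below parses a small in-memory file with Python's standard csv module. The column names and the two sample rows are hypothetical, and a real legacy file would usually need extra handling for encodings and malformed rows.

```python
import csv
import io

# Hypothetical legacy flat file; in practice this would be open("products.csv").
legacy_file = io.StringIO(
    "ProductID,ProductName,UnitPrice\n"
    "1,Chai,18.00\n"
    "2,Chang,19.00\n"
)

products = []
for record in csv.DictReader(legacy_file):
    # Flat files carry only text, so types are restored explicitly.
    products.append({
        "id": int(record["ProductID"]),
        "name": record["ProductName"],
        "price": float(record["UnitPrice"]),
    })

print(products)
```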

Steps to import Legacy Data

Importing Excel Data

1) Launch Power BI Desktop.

2) From the Home ribbon, select Get Data. Excel is one of the Most Common data
connections, so you can select it directly from the Get Data menu.

3) If you select the Get Data button directly, you can also select File > Excel and then select
Connect.

4) In the Open File dialog box, select the Products.xlsx file.

5) In the Navigator pane, select the Products table and then select Edit.

Importing Data from OData Feed

In this task, you'll bring in order data. This step represents connecting to a sales system. You import
data into Power BI Desktop from the sample Northwind OData feed at the following URL, which you
can copy (and then paste) in the steps below:
http://services.odata.org/V3/Northwind/Northwind.svc/

Connect to an OData feed:

1) From the Home ribbon tab in Query Editor, select Get Data.

2) Browse to the OData Feed data source.

3) In the OData Feed dialog box, paste the URL for the Northwind OData feed.

4) Select OK.

5) In the Navigator pane, select the Orders table, and then select Edit.

Note - You can click a table name, without selecting the checkbox, to see a preview.

Conclusion:

By completing the tasks outlined in this problem statement, learners will be able to use Power BI to
import legacy data from an Excel file and from the Northwind OData feed.

PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE
ACADEMIC YEAR: 2022-23
DEPARTMENT of COMPUTER ENGINEERING DEPARTMENT
CLASS: B.E. SEMESTER: II
SUBJECT: LP VI
ASSIGNMENT NO. BI II
TITLE Create Database in SQL server using ETL
PROBLEM STATEMENT Perform the Extraction Transformation and Loading (ETL) process to
/DEFINITION construct the database in SQL Server.
OBJECTIVE To learn about Extraction Transformation and Loading Process
OUTCOME Students will be able to learn about Extraction Transformation and
Loading Process on SQL server
S/W PACKAGES AND MS Excel,
HARDWARE APPARATUS Operating System recommended: - 64-bit Open source Linux or its
USED derivative. Programming Languages: C++/JAVA/PYTHON/R. Programming
tools recommended: Front End: Java/Perl/PHP/Python/Ruby/.net, Backend:
MongoDB/MYSQL/Oracle, Database Connectivity: ODBC/JDBC,
Additional Tools: Octave, Matlab, WEKA, powerBI,

REFERENCES Text Books:
1. Grossmann W, Rinderle-Ma, "Fundamentals of Business Intelligence",
Springer, 2015
2. R. Sharda, D. Delen, & E. Turban, "Business Intelligence and Analytics:
Systems for Decision Support", 10th Edition, Pearson/Prentice Hall, 2015
Reference Books:
1. Paulraj Ponniah, "Data Warehousing Fundamentals", John Wiley.
2. "Introduction to Business Intelligence and Data Warehousing", IBM, PHI
3. Carlo Vercellis, "Business Intelligence: Data Mining and Optimization for Decision
Making", Wiley, 2019

STEPS

1. Date
INSTRUCTIONS FOR 2. Assignment no.
WRITING JOURNAL 3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Concepts related Theory:

Step 1 : Data Extraction :

Data extraction is the first step of ETL. There are two types of data extraction:
1. Full Extraction: All the data from the source systems or operational systems is extracted to
the staging area. (Initial Load)

2. Partial Extraction: Sometimes the source system notifies us that specific data has been
updated, and only that data is extracted. This is called a Delta load.

Source System Performance: The Extraction strategies should not affect source system
performance.
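
The difference between the two extraction types can be sketched in a few lines of Python. This is only an illustration under assumptions: the source rows, the "modified" column, and the idea of selecting the delta by comparing against the date of the last load are hypothetical choices, not something prescribed by the assignment.

```python
from datetime import date

# Hypothetical rows in the source (operational) system.
source_rows = [
    {"id": 1, "amount": 120.0, "modified": date(2023, 1, 5)},
    {"id": 2, "amount": 80.0, "modified": date(2023, 2, 10)},
    {"id": 3, "amount": 45.5, "modified": date(2023, 3, 1)},
]


def full_extract(rows):
    """Initial load: copy every source row to the staging area."""
    return list(rows)


def delta_extract(rows, last_load):
    """Delta load: copy only rows changed after the previous load."""
    return [row for row in rows if row["modified"] > last_load]


staging_initial = full_extract(source_rows)
staging_delta = delta_extract(source_rows, last_load=date(2023, 2, 1))

print(len(staging_initial), [row["id"] for row in staging_delta])
```

Reading only the changed rows is also one way to respect the source system performance note above, since the delta query touches far less data than a full extraction.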

Step 2 : Data Transformation :

Data transformation is the second step. After extraction, the data must be transformed to match
the target system. The main points of data transformation are:

● Data extracted from the source system is in raw format. It must be transformed
before it is loaded into the target server.
● Data has to be cleaned, mapped and transformed.
● The important steps of data transformation are:

1. Selection: Select the data to load into the target.

2. Matching: Match the data with the target system.

3. Data Transforming: Change the data to fit the target table structures.

Real-life examples of data transformation:

● Standardizing data: Data is fetched from multiple sources, so it needs to be standardized
as per the target system.
● Character set conversion: The character sets need to be transformed as per the target systems.
(First name and last name example)
● Calculated and derived values: The source system holds a first value and a second value, and
the target needs a value calculated from them.
● Data conversion into different formats: If the source system stores dates in DDMMYY format
and the target stores them in DDMONYYYY format, the conversion needs to be
done in the transformation phase.
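
The transformation examples above can be sketched as a single Python function applied to each staged row. The field names, the sample row, and the choice of a sum as the "calculated value" are assumptions for illustration; the date conversion follows the DDMMYY to DDMONYYYY example from the text.

```python
from datetime import datetime

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def convert_date(ddmmyy):
    """Convert a DDMMYY string (e.g. 050323) to DDMONYYYY (e.g. 05MAR2023)."""
    parsed = datetime.strptime(ddmmyy, "%d%m%y")
    return f"{parsed.day:02d}{MONTHS[parsed.month - 1]}{parsed.year}"


def transform(row):
    """Standardize names, derive a calculated value, and convert the date format."""
    first = row["first_name"].strip().title()
    last = row["last_name"].strip().title()
    return {
        "full_name": f"{first} {last}",                  # standardized text
        "total": row["first_val"] + row["second_val"],   # derived value (assumed: sum)
        "order_date": convert_date(row["order_date"]),   # format conversion
    }


raw = {"first_name": " asha ", "last_name": "PATIL",
       "first_val": 10, "second_val": 5, "order_date": "050323"}
clean = transform(raw)
print(clean)
```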

Step 3 : Data Loading

● The data loading phase loads the prepared data from the staging tables into the main tables.
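
The loading phase can be tried out with Python's built-in sqlite3 module. This is a sketch only: SQLite stands in for SQL Server here, and the table names staging_sales and sales and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, used here instead of SQL Server
conn.execute("CREATE TABLE staging_sales (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?)",
    [(1, 120.0), (2, 80.0), (3, 45.5)],
)

# Load: move the prepared rows from the staging table to the main table
# inside one transaction, so a failure leaves the main table unchanged.
with conn:
    conn.execute("INSERT INTO sales (id, amount) SELECT id, amount FROM staging_sales")
    conn.execute("DELETE FROM staging_sales")

loaded = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
remaining = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
print(loaded, remaining)
```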

ETL process in SQL Server:

Following are the steps to open BIDS\SSDT.

Step 1 − Open either BIDS\SSDT based on the version from the Microsoft SQL Server
programs group. The following screen appears.
Step 2 − The above screen shows SSDT has opened. Go to file at the top left corner in the
above image and click New. Select project and the following screen opens.
Step 3 − Select Integration Services under Business Intelligence on the top left corner in the
above screen to get the following screen.

Step 4 − In the above screen, select either Integration Services Project or Integration Services
Import Project Wizard based on your requirement to develop\create the package.

Modes

There are two modes − Native Mode (SQL Server Mode) and SharePoint Mode.

Models

There are two models − Tabular Model (for team and personal analysis) and Multidimensional
Model (for corporate analysis).
BIDS (Business Intelligence Development Studio, up to SQL Server 2008 R2) and SSDT (SQL Server
Data Tools, from SQL Server 2012) are the environments used to work with SSAS.

Step 1 − Open either BIDS\SSDT based on the version from the Microsoft SQL Server
programs group. The following screen will appear.

Step 2 − The above screen shows SSDT has opened. Go to file on the top left corner in the
above image and click New. Select project and the following screen opens.
Step 3 − Select Analysis Services in the above screen under Business Intelligence as seen on
the top left corner. The following screen pops up.
Step 4 − In the above screen, select any one option from the listed five options based on your
requirement to work with Analysis services.

Conclusion: Hence, using Extraction, Transformation and Loading, students will be able to store data in
the database easily. Students will also be able to combine data from multiple systems into a single database,
data store, data warehouse, or data lake.

PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE
ACADEMIC YEAR: 2022-23
DEPARTMENT of COMPUTER ENGINEERING DEPARTMENT
CLASS: B.E. SEMESTER: II
SUBJECT: LP VI
ASSIGNMENT NO. BI III
TITLE Create Cube based on ROLAP, MOLAP and HOLAP model
PROBLEM STATEMENT Perform the Extraction Transformation and Loading (ETL) process to
/DEFINITION construct the database in the Sql server.
OBJECTIVE To learn about Extraction Transformation and Loading Process
OUTCOME Students will be able to learn about Extraction Transformation and
Loading Process on SQL server
S/W PACKAGES AND MS Excel,
HARDWARE APPARATUS Operating System recommended: - 64-bit Open source Linux or its
USED derivative. Programming Languages: C++/JAVA/PYTHON/R. Programming
tools recommended: Front End: Java/Perl/PHP/Python/Ruby/.net, Backend:
MongoDB/MYSQL/Oracle, Database Connectivity: ODBC/JDBC,
Additional Tools: Octave, Matlab, WEKA, powerBI,

REFERENCES Text Books:
1. Grossmann W, Rinderle-Ma, "Fundamentals of Business Intelligence",
Springer, 2015
2. R. Sharda, D. Delen, & E. Turban, "Business Intelligence and Analytics:
Systems for Decision Support", 10th Edition, Pearson/Prentice Hall, 2015
Reference Books:
1. Paulraj Ponniah, "Data Warehousing Fundamentals", John Wiley.
2. "Introduction to Business Intelligence and Data Warehousing", IBM, PHI
3. Carlo Vercellis, "Business Intelligence: Data Mining and Optimization for Decision
Making", Wiley, 2019

STEPS

1. Date
INSTRUCTIONS FOR 2. Assignment no.
WRITING JOURNAL 3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Concepts related Theory:

Step 1 : Data Extraction :

Data extraction is the first step of ETL. There are two types of data extraction:
1. Full Extraction: All the data from the source systems or operational systems is extracted to
the staging area. (Initial Load)

2. Partial Extraction: Sometimes the source system notifies us that specific data has been
updated, and only that data is extracted. This is called a Delta load.

Source System Performance: The Extraction strategies should not affect source system
performance.

Step 2 : Data Transformation :

Data transformation is the second step. After extraction, the data must be transformed to match
the target system. The main points of data transformation are:

● Data extracted from the source system is in raw format. It must be transformed
before it is loaded into the target server.
● Data has to be cleaned, mapped and transformed.
● The important steps of data transformation are:

1. Selection: Select the data to load into the target.

2. Matching: Match the data with the target system.

3. Data Transforming: Change the data to fit the target table structures.

Real-life examples of data transformation:

● Standardizing data: Data is fetched from multiple sources, so it needs to be standardized
as per the target system.
● Character set conversion: The character sets need to be transformed as per the target systems.
(First name and last name example)
● Calculated and derived values: The source system holds a first value and a second value, and
the target needs a value calculated from them.
● Data conversion into different formats: If the source system stores dates in DDMMYY format
and the target stores them in DDMONYYYY format, the conversion needs to be
done in the transformation phase.

Step 3 : Data Loading

● The data loading phase loads the prepared data from the staging tables into the main tables.

ETL process in SQL Server:

Following are the steps to open BIDS\SSDT.

Step 1 − Open either BIDS\SSDT based on the version from the Microsoft SQL Server
programs group. The following screen appears.
Step 2 − The above screen shows SSDT has opened. Go to file at the top left corner in the
above image and click New. Select project and the following screen opens.
Step 3 − Select Integration Services under Business Intelligence on the top left corner in the
above screen to get the following screen.

Step 4 − In the above screen, select either Integration Services Project or Integration Services
Import Project Wizard based on your requirement to develop\create the package.

Modes

There are two modes − Native Mode (SQL Server Mode) and SharePoint Mode.

Models

There are two models − Tabular Model (for team and personal analysis) and Multidimensional
Model (for corporate analysis).
BIDS (Business Intelligence Development Studio, up to SQL Server 2008 R2) and SSDT (SQL Server
Data Tools, from SQL Server 2012) are the environments used to work with SSAS.

Step 1 − Open either BIDS\SSDT based on the version from the Microsoft SQL Server
programs group. The following screen will appear.

Step 2 − The above screen shows SSDT has opened. Go to file on the top left corner in the
above image and click New. Select project and the following screen opens.
Step 3 − Select Analysis Services in the above screen under Business Intelligence as seen on
the top left corner. The following screen pops up.
Step 4 − In the above screen, select any one option from the listed five options based on your
requirement to work with Analysis services.

Conclusion: Hence, using Extraction, Transformation and Loading, students will be able to store data in
the database easily. Students will also be able to combine data from multiple systems into a single database,
data store, data warehouse, or data lake.

PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE
ACADEMIC YEAR: 2022-23
DEPARTMENT of COMPUTER ENGINEERING DEPARTMENT
CLASS: B.E. SEMESTER: II
SUBJECT: LP VI
ASSIGNMENT NO. BI 4
TITLE Import the data warehouse data in Microsoft Excel and create the
Pivot table and Pivot Chart

PROBLEM STATEMENT Import the data warehouse data in Microsoft Excel and create the
/DEFINITION Pivot table and Pivot Chart

OBJECTIVE Students will be able to import data warehouse data into Excel and create pivot
tables and charts.
OUTCOME Students will be able to import data warehouse data into Excel and create pivot
tables and charts.
S/W PACKAGES AND MS Excel,
HARDWARE APPARATUS Operating System recommended: - 64-bit Open source Linux or its
USED derivative. Programming Languages: C++/JAVA/PYTHON/R. Programming
tools recommended: Front End: Java/Perl/PHP/Python/Ruby/.net,
Backend: MongoDB/MYSQL/Oracle, Database Connectivity: ODBC/JDBC,
Additional Tools: Octave, Matlab, WEKA, powerBI,

REFERENCES Text Books:
1. Grossmann W, Rinderle-Ma, "Fundamentals of Business Intelligence",
Springer, 2015
2. R. Sharda, D. Delen, & E. Turban, "Business Intelligence and Analytics:
Systems for Decision Support", 10th Edition, Pearson/Prentice Hall, 2015
Reference Books:
1. Paulraj Ponniah, "Data Warehousing Fundamentals", John Wiley.
2. "Introduction to Business Intelligence and Data Warehousing", IBM, PHI
3. Carlo Vercellis, "Business Intelligence: Data Mining and Optimization for Decision
Making", Wiley, 2019

STEPS 1. Select cell from the data you want to generate pivot chart from

2. Select Tables and columns you want to add to Chart.


3. A new sheet will be created with Pivot Table and Chart.
4. Select columns from PivotChart Fields to populate table and
chart.
1. Date
INSTRUCTIONS FOR 2. Assignment no.
WRITING JOURNAL 3. Problem definition
4. Learning objective

5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Prerequisites:

Concepts related Theory:

A pivot table is a table of grouped values that aggregates the individual items of a more extensive
table within one or more discrete categories. This summary might include sums, averages, or
other statistics, which the pivot table groups together using a chosen aggregation function
applied to the grouped values.
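
The grouping and aggregation that a pivot table performs can be sketched in plain Python. The sales records, the choice of region as rows and category as columns, and the use of a sum as the aggregation function are all hypothetical; Excel does the same kind of grouping when fields are dragged into the Rows, Columns and Values areas.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, category, amount).
sales = [
    ("East", "Beverages", 100),
    ("East", "Produce", 40),
    ("West", "Beverages", 70),
    ("East", "Beverages", 30),
    ("West", "Produce", 60),
]


def pivot_sum(rows):
    """Group rows by region (rows) and category (columns), summing the amounts."""
    table = defaultdict(lambda: defaultdict(int))
    for region, category, amount in rows:
        table[region][category] += amount
    return {region: dict(columns) for region, columns in table.items()}


pivot = pivot_sum(sales)
for region, columns in pivot.items():
    print(region, columns)
```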

A Data Model is used for building a model where data from various sources can be combined by
creating relationships among the data sources. A Data Model integrates the tables, enabling
extensive analysis using PivotTables, Power Pivot, and Power View.

A Data Model is created automatically when you import two or more tables simultaneously from
a database. The existing database relationships between those tables are used to create the Data
Model in Excel.

Following are the steps to create a pivot table and chart in Excel.

1. Select a cell in the data you want to generate the pivot chart from.

2. Select Insert->Pivot Chart->Pivot Chart & Pivot Table.

3. Select the tables and columns you want to add to the chart.

4. A new sheet will be created with the Pivot Table and Chart.

5. Select columns from PivotChart Fields to populate the table and chart.

Conclusion: Students will be able to create pivot charts and summarize large amounts of data in an Excel sheet.

PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE
ACADEMIC YEAR: 2022-23
DEPARTMENT of COMPUTER ENGINEERING DEPARTMENT
CLASS: B.E. SEMESTER: II
SUBJECT: LP VI
ASSIGNMENT NO. BI 5
TITLE Classification and clustering on business data

PROBLEM STATEMENT Perform data classification using a classification algorithm, or
/DEFINITION perform data clustering using a clustering algorithm.

OBJECTIVE Students will be able to perform classification and clustering
as per business needs.
OUTCOME Students will be able to implement classification and clustering on
given datasets.
S/W PACKAGES AND MS Excel,
HARDWARE APPARATUS Operating System recommended: - 64-bit Open source Linux or its
USED derivative Programming Languages:
C++/JAVA/PYTHON/R. Programming tools recommended: Front
End: Java/Perl/PHP/Python/Ruby/.net, Backend:
MongoDB/MYSQL/Oracle, Database Connectivity: ODBC/JDBC,
Additional Tools: Octave, Matlab, WEKA, powerBI,

REFERENCES Text Books:
1. Grossmann W, Rinderle-Ma, "Fundamentals of Business Intelligence",
Springer, 2015
2. R. Sharda, D. Delen, & E. Turban, "Business Intelligence and
Analytics: Systems for Decision Support", 10th Edition,
Pearson/Prentice Hall, 2015
Reference Books:
1. Paulraj Ponniah, "Data Warehousing Fundamentals", John Wiley.
2. "Introduction to Business Intelligence and Data Warehousing", IBM,
PHI
3. Carlo Vercellis, "Business Intelligence: Data Mining and Optimization for Decision
Making", Wiley, 2019

STEPS

1. Date
INSTRUCTIONS FOR 2. Assignment no.
WRITING JOURNAL 3. Problem definition

4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Prerequisites:

Concepts related Theory:

Machine learning algorithms can be broadly classified into two categories: supervised and
unsupervised learning. There are other categories as well, such as semi-supervised learning and
reinforcement learning, but most algorithms are classified as supervised or unsupervised.
The difference between the two lies in the presence of a target variable. In supervised learning,
each observation comes with a target variable to predict. In unsupervised learning, there is no
target variable: the dataset only has input variables which describe the data.

K-Means clustering is one of the most popular unsupervised learning algorithms. It is used when we
have unlabelled data, that is, data without defined categories or groups. The algorithm follows
a simple way to partition a given data set into a certain number of clusters, fixed
a priori. The K-Means algorithm works iteratively to assign each data point to one of K groups based
on the features that are provided. Data points are clustered based on feature similarity.

K-Means clustering can be represented diagrammatically as follows:-

K-Means Clustering intuition


K-Means clustering is used to find intrinsic groups within the unlabelled dataset and draw
inferences from them. It is a centroid-based clustering method.

Centroid - A centroid is the point at the centre of a cluster, computed as the mean of the data
points in that cluster; it need not be one of the data points. In centroid-based clustering, clusters
are represented by a centroid. It is an iterative algorithm in which the notion of similarity is
derived from how close a data point is to the centroid of the cluster. K-Means clustering works as
follows: the algorithm uses an iterative procedure to deliver a final result.
The algorithm requires the number of clusters K and the data set as input. The data set is a collection
of features for each data point. The algorithm starts with initial estimates for the K centroids. The
algorithm then iterates between two steps:

● Data assignment step

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, based on the squared Euclidean distance. So, if c_i is a centroid in the set of centroids C,
then each data point x is assigned to the cluster whose centroid minimizes the squared Euclidean
distance, $\arg\min_{c_i \in C} \mathrm{dist}(c_i, x)^2$.

● Centroid update step

In this step, the centroids are recomputed and updated. This is done by taking the mean of all
data points assigned to that centroid's cluster.

The algorithm then iterates between step 1 and step 2 until a stopping criterion is met. Typical
stopping criteria are that no data points change clusters, that the sum of the distances is minimized, or
that some maximum number of iterations is reached. This algorithm is guaranteed to converge to a result.
The result may be a local optimum, meaning that running the algorithm more than once
with randomized starting centroids may give a better outcome.
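
The two alternating steps described above can be written directly in pure Python. This is a minimal sketch under assumptions: the function names, the six sample points, and the fixed starting centroids are chosen for illustration (fixed rather than random so that the run is repeatable), and a library implementation would normally be used in practice.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, max_iterations=100):
    """Alternate the assignment and update steps until assignments stop changing."""
    assignments = None
    for _ in range(max_iterations):
        # Data assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:  # stopping criterion: no point moved
            break
        assignments = new_assignments
        # Centroid update step: each centroid becomes the mean of its members.
        updated = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[k])  # keep the centroid of an empty cluster
        centroids = updated
    return centroids, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```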

The K-Means intuition can be represented with the help of following diagram:-

K-Means intuition

● Choosing the value of K
The K-Means algorithm finds the clusters and data labels for a pre-defined value of K.
To find the number of clusters in the data, we need to run the K-Means clustering
algorithm for different values of K and compare the results. So, the performance of the
K-Means algorithm depends upon the value of K. We should choose the value of K that
gives the best performance. There are different techniques available to find the optimal value of
K. The most common technique is the elbow method, which is described below.

The elbow method

The elbow method is used to determine the optimal number of clusters in K-means clustering.
The elbow method plots the value of the cost function produced by different values of K. The
below diagram shows how the elbow method works: -

The elbow method
We can see that if K increases, average distortion will decrease. Then each cluster will have
fewer constituent instances, and the instances will be closer to their respective centroids.
However, the improvements in average distortion will decline as K increases. The value of K at
which improvement in distortion declines the most is called the elbow, at which we should stop
dividing the data into further clusters.
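
The numbers behind an elbow plot can be produced with a short pure-Python sketch. Everything here is an assumption made for illustration: the one-dimensional sample data with three visible groups, the deterministic choice of starting centroids, and the use of the total within-cluster sum of squared distances as the distortion measure.

```python
def k_means_1d(values, k, max_iterations=100):
    """K-Means on 1-D data with evenly spread starting centroids (an assumed choice)."""
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, clusters


def distortion(values, k):
    """Total squared distance of every point to its cluster centroid."""
    centroids, clusters = k_means_1d(values, k)
    return sum((v - centroids[j]) ** 2 for j, c in enumerate(clusters) for v in c)


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # three visible groups
distortions = [distortion(data, k) for k in range(1, 6)]
for k, d in enumerate(distortions, start=1):
    print(k, round(d, 2))
```

Plotting these values against K gives the curve discussed above: the distortion falls steeply up to K = 3 and then flattens, so K = 3 would be read off as the elbow for this data.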

Applications of clustering

● K-Means clustering is one of the most common unsupervised machine learning algorithms. It is
widely used for many applications, which include:
1. Image segmentation
2. Customer segmentation
3. Species clustering
4. Anomaly detection
5. Clustering languages

Conclusion
After completing this assignment, students will be able to implement the popular
unsupervised clustering technique called K-Means clustering and apply the elbow method to
find the optimal number of clusters for the data.

PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE
ACADEMIC YEAR: 2022-23
DEPARTMENT of COMPUTER ENGINEERING DEPARTMENT
CLASS: B.E. SEMESTER: II
SUBJECT: LP VI
ASSIGNMENT NO. PR 1
TITLE Extraction of features using structural and feature space methods for
Indian Fruits.

PROBLEM STATEMENT Perform extraction of features using structural and feature space
/DEFINITION methods for Indian Fruits.

OBJECTIVE ● To learn about structural and feature space methods
for extraction of features.
● To perform the above processing techniques on the
given dataset.

OUTCOME ● Students will be able to learn feature extraction techniques
such as structural and feature space methods.
● Students will be able to perform feature extraction on
the given dataset.

S/W PACKAGES AND MS Excel,


HARDWARE APPARATUS Operating System recommended: - 64-bit Open source Linux or its
USED derivative. Programming Languages: C++/JAVA/PYTHON/R. Programming
tools recommended: Front End: Java/Perl/PHP/Python/Ruby/.net,
Backend: MongoDB/MYSQL/Oracle, Database Connectivity: ODBC/JDBC,
Additional Tools: Octave, Matlab, WEKA, powerBI,

REFERENCES Text Books:
1. R. O. Duda, P. E. Hart, D. G. Stork, "Pattern Classification", 2nd Edition,
Wiley-Interscience, John Wiley & Sons, 2001
2. S. Theodoridis and K. Koutroumbas, "Pattern Recognition", 4th Edition,
Elsevier, Academic Press, ISBN: 978-1-59749-272-0
3. B.D. Ripley, "Pattern Recognition and Neural Networks", Cambridge
University Press. ISBN 0 521 460867
Reference Books:
1. Devi V.S.; Murty, M.N. (2011) Pattern Recognition: An Introduction,
Universities Press, Hyderabad.
2. David G. Stork and Elad Yom-Tov, "Computer Manual in MATLAB to
accompany Pattern Classification", Wiley Inter-science, 2004, ISBN-10:
0471429775
3. Malay K. Pakhira, "Digital Image Processing and Pattern Recognition",
PHI, ISBN-978-81-203-4091-6
4. eMedia at NPTEL: http://nptel.ac.in/courses/106108057/33
STEPS
1. Data acquisition
2. Pre processing
3. Structural Feature Extraction
4. Feature space feature extraction
5. Combining Features
6. Building Model

1. Date
INSTRUCTIONS FOR 2. Assignment no.
WRITING JOURNAL 3. Problem definition
4. Learning objective
5. Learning Outcome
6. Concepts related Theory
7. Algorithm
8. Test cases
10. Conclusion/Analysis

Prerequisites:

Concepts related Theory:

Feature Extraction:

Feature extraction is a crucial step in pattern recognition and machine learning. Structural and
feature space methods are two common approaches for feature extraction from images or data
sets.

Structural Methods :

Structural methods focus on the shape and geometry of the object or pattern of interest. These
methods involve analyzing the contour, boundary, or skeleton of the object to extract shape-
based features such as area, perimeter, curvature, or aspect ratio. For instance, in the case of face
recognition, structural features can be obtained by locating and extracting facial landmarks such
as eyes, nose, and mouth.

Structural methods involve analyzing the shape and geometry of the fruit. One approach could be
to use edge detection techniques to find the contours of the fruit, and then extract features such as
the perimeter, area, and compactness of the fruit. Another approach could be to use region-based
segmentation to identify the fruit and then extract features such as the shape descriptor,
eccentricity, and aspect ratio.
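
To show what such shape features look like numerically, the sketch below computes them from a binary mask in pure Python. It is an illustration under assumptions: the mask is taken to be the already segmented fruit with at least one foreground pixel, the perimeter is approximated by counting pixel edges that border the background, and circularity is defined as 4*pi*area / perimeter^2, which is one common convention among several.

```python
import math


def structural_features(mask):
    """Area, perimeter, aspect ratio and circularity of the foreground in a 0/1 mask."""
    height, width = len(mask), len(mask[0])
    area = perimeter = 0
    rows, cols = [], []
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            area += 1
            rows.append(y)
            cols.append(x)
            # Count each pixel edge that touches background or the image border.
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not mask[ny][nx]:
                    perimeter += 1
    box_height = max(rows) - min(rows) + 1
    box_width = max(cols) - min(cols) + 1
    return {
        "area": area,
        "perimeter": perimeter,
        "aspect_ratio": box_width / box_height,
        "circularity": 4 * math.pi * area / perimeter ** 2,
    }


# Hypothetical segmented object: a 4x2 rectangle of foreground pixels.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
features = structural_features(mask)
print(features)
```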

Feature Space Methods :

Feature space methods focus on the color, texture, or other intrinsic characteristics of the object
or pattern. These methods involve analyzing the pixel or frequency domain of the object to
extract features such as color histograms, texture descriptors, or Fourier coefficients. For
example, in the case of plant classification, feature space methods can be used to extract features
such as leaf texture, color, or shape from plant images.

Feature space methods involve analyzing the color and texture of the fruit. One approach could
be to use color-based segmentation to identify the fruit and then extract features such as color
histograms and color moments. Another approach could be to use texture-based segmentation to
identify the fruit and then extract features such as gray-level co-occurrence matrix (GLCM)
features and Gabor wavelet features.
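
As a small worked example of a texture descriptor, the following pure-Python sketch builds a gray-level co-occurrence matrix for horizontally adjacent pixels and derives two statistics from it. The 4x4 image with four gray levels, the function names, and the choice of contrast and angular second moment as the reported features are assumptions for illustration; image libraries offer more complete GLCM implementations.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs separated by (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                matrix[image[y][x]][image[ny][nx]] += 1
    total = sum(map(sum, matrix))
    return [[count / total for count in row] for row in matrix]


def contrast(p):
    """Weights each co-occurrence by the squared gray-level difference."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


def angular_second_moment(p):
    """Sum of squared probabilities; larger for more uniform textures."""
    return sum(value * value for row in p for value in row)


# Hypothetical image quantized to 4 gray levels (0..3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, levels=4)
print(contrast(p), angular_second_moment(p))
```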

Both structural and feature space methods can be used in combination to obtain a comprehensive
set of features that capture different aspects of the object or pattern. These features can then be
used for various applications such as object recognition, classification, segmentation, or tracking.
For instance, in the case of medical image analysis, combining structural and feature space
methods can enhance the accuracy of tumor detection or lesion segmentation.

Algorithm:

1. Data acquisition: Collect images of Indian fruits from the dataset.
2. Pre-processing: Apply pre-processing techniques such as noise reduction, image
enhancement, and normalization to improve the quality and consistency of the images.
3. Structural feature extraction:
a. Detect and segment the fruit from the background using edge detection or
region-based segmentation techniques.
b. Extract structural features such as area, perimeter, compactness, circularity, or
aspect ratio from the segmented fruit.
4. Feature space feature extraction:
a. Extract color-based features such as color histograms, color moments, or
chromaticity features from the segmented fruit.
b. Extract texture-based features such as GLCM features, Gabor wavelet features, or
local binary patterns from the segmented fruit.
5. Combine features: Concatenate the extracted structural and feature space features to form
a comprehensive feature vector for each fruit image.
6. Classification:
a. Train a classification algorithm such as support vector machines, decision trees,
or neural networks using the feature vectors and corresponding fruit labels.
b. Test the trained classifier on a separate test set and evaluate its performance using
metrics such as accuracy, precision, recall, and F1-score.
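
The evaluation metrics named in step 6b can be computed by hand as in the sketch below. The fruit labels and predictions are made up for illustration, and precision, recall and F1-score are reported for a single class treated as positive; a multi-class report would repeat this per class.

```python
def evaluate(y_true, y_pred, positive):
    """Accuracy over all classes, plus precision/recall/F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical test-set labels and classifier predictions.
y_true = ["mango", "mango", "banana", "guava", "mango", "banana"]
y_pred = ["mango", "banana", "banana", "guava", "mango", "mango"]
report = evaluate(y_true, y_pred, positive="mango")
print(report)
```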

output

Conclusion: After completing this assignment, learners will be able to implement
extraction of features using structural and feature space methods for Indian Fruits.
