DEPARTMENT
of
COMPUTER ENGINEERING
LAB MANUAL
Laboratory Practice II
Enhancing the potential of aspiring students and faculty for higher education
and lifelong learning.
PEO 2:
Possess knowledge and skills in the field of Computer Engineering for analyzing, designing and implementing novel software products in a dynamic environment, for a successful career and for pursuing higher studies.
PEO 3:
PEO 4:
Program Outcomes
a. An ability to apply knowledge of computing, mathematics, science and engineering
fundamentals appropriate to Computer Engineering.
b. An ability to define the problems and provide solutions by designing and conducting
experiments, interpreting and analyzing data.
e. An ability to use modern engineering tools and technologies necessary for engineering
practice.
g. An ability to understand the environmental issues and provide the sustainable system.
f. An ability to analyze the local and global impact of computing on individuals, organizations and society.
Th: Theory TW: Term Work Pr: Practical Or: Oral Pre: Presentation
1 For an organization of your choice, choose a set of business processes. Design star/snowflake schemas for analyzing these processes. Create a fact constellation schema by combining them. Extract data from different data sources, apply suitable transformations and load into destination tables using an ETL tool. For Example: Business Organization: Sales, Order, and Marketing Process.
2 Consider a suitable dataset. For clustering of data instances in different groups, apply different clustering techniques. Visualize the clusters using a suitable tool.
3 Apply the Apriori algorithm to find frequently occurring items from the given data and generate strong association rules using support and confidence thresholds. For Example: Market Basket Analysis
4 Consider a suitable text dataset. Remove stop words, apply stemming and feature selection techniques to represent documents as vectors. Classify documents and evaluate precision and recall.
D. COURSE OUTCOMES:
Apply recent automation tools to perform various types of software testing.
E. PREREQUISITES:
Database, Software Engineering, Software Modelling and Design
Assignment Number    Assignment Name    References
Text:
Elective I
1. Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”,
Elsevier Publishers, ISBN:9780123814791, 9780123814807.
2. Parag Kulkarni, “Reinforcement and Systemic Machine Learning for Decision Making”,
Wiley-IEEE Press, ISBN: 978-0-470-91999-6
Elective II
1. M G Limaye, “Software Testing Principles, Techniques and Tools”, Tata McGraw Hill, ISBN-13:
9780070139909, ISBN-10: 0070139903
References:
Elective I
1. Matthew A. Russell, "Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn,
Google+, GitHub, and More" , Shroff Publishers, 2nd Edition, ISBN: 9780596006068
Elective II
1. Naresh Chauhan, “Software Testing Principles and Practices”, OXFORD, ISBN-10:
0198061846, ISBN-13: 9780198061847
2. Stephen Kan, “Metrics and Models in Software Quality Engineering”, Pearson, ISBN-10:
0133988082; ISBN-13: 978-0133988086
ASSIGNMENT 1
Aim: For an organization of your choice, choose a set of business processes. Design star/snowflake schemas for analyzing these processes. Create a fact constellation schema by combining them. Extract data from different data sources, apply suitable transformations and load into destination tables using an ETL tool. For Example: Business Organization: Sales, Order, and Marketing Process.
Theory: To overcome performance issues for large queries in the data warehouse, we use
dimensional models. The dimensional modeling approach provides a way to improve query
performance for summary reports without affecting data integrity.
Star Schema : A dimensional model is also commonly called a star schema. This type of model
is very popular in data warehousing because it can provide much better query performance,
especially on very large queries, than an E/R model. However, it also has the major benefit of
being easier to understand. It consists, typically, of a large table of facts (known as a fact table),
with a number of other tables surrounding it that contain descriptive data, called dimensions.
When it is drawn, it resembles the shape of a star, hence the name.
- Fact table
- Dimension table
- The fact table contains numerical values of what you measure. For example, a fact value of 20 might mean that 20 widgets have been sold.
- Each fact table contains the keys to associated dimension tables. These are called foreign keys in the fact table.
- The dimension tables contain descriptive information about the numerical values in the fact table. That is, they contain the attributes of the facts. For example, the dimension tables for a marketing analysis application might include attributes such as time period, marketing region, and product type.
- Since the data in a dimension table is denormalized, it typically has a large number of columns.
- The dimension tables typically contain significantly fewer rows of data than the fact table.
- The attributes in a dimension table are typically used as row and column headings in a report or query results display. For example, the textual descriptions on a report come from dimension attributes.
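As a concrete illustration of these points, the following is a minimal sketch that creates a tiny star schema in SQLite from Python; the table and column names (sales_fact, time_dim, product_dim) are assumptions chosen only for this example.

# Star schema sketch: one fact table with foreign keys to two dimension tables.
# Table and column names are assumed for illustration; uses the built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE sales_fact  (
        time_id    INTEGER REFERENCES time_dim(time_id),       -- foreign key to a dimension
        product_id INTEGER REFERENCES product_dim(product_id), -- foreign key to a dimension
        units_sold INTEGER,                                     -- numerical measure
        amount     REAL                                         -- numerical measure
    );
""")
conn.close()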
Snowflake model: Further normalization and expansion of the dimension tables in a star schema
result in the implementation of a snowflake design. A dimension is said to be snowflaked when
the low-cardinality columns in the dimension have been removed to separate normalized tables
that then link back into the original dimension table.
In this example, we expanded (snowflaked) the Product dimension by removing the low-cardinality elements pertaining to Family and putting them in a separate Family dimension table. The Family table is linked to the Product dimension table by a key (Family_id) present in both tables. From the Product dimension table, the Family attributes, such as the Family Intro_date, are obtained through this link. The keys of the hierarchy are also included in the Family table. In a similar fashion, the Market dimension is snowflaked.
ETL Tools:
ETL is short for extract, transform, load: three database functions that are combined into one tool to pull data out of one database and place it into another database. Extract is the process of reading data from a source database. Transform converts the extracted data into the form required by the target, using rules or lookup tables or by combining the data with other data. Load writes the resulting data into the destination database or data warehouse.
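As an illustration of these three steps, here is a minimal ETL sketch in Python using pandas and the built-in sqlite3 module; the source file name sales.csv, its column names, and the destination table sales_fact are assumptions made for the example.

# Minimal ETL sketch: extract from a CSV source, transform, load into SQLite.
import sqlite3
import pandas as pd

# Extract: read raw rows from the source file (assumed columns: order_id, order_date, region, amount).
raw = pd.read_csv("sales.csv")

# Transform: clean and reshape the data to match the destination table.
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["region"] = raw["region"].str.strip().str.upper()
raw = raw.dropna(subset=["order_id", "amount"])

# Load: write the transformed rows into the destination table.
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales_fact", conn, if_exists="append", index=False)
conn.close()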
Talend provides multiple solutions for data integration, both open source and commercial editions. Talend
offers an Eclipse-based interface, drag-and-drop design flow, and broad connectivity with more than 400
pre-configured application connectors to bridge between databases, mainframes, file systems, web
services, packaged enterprise applications, data warehouses, OLAP applications, Software-as-a-Service,
Cloud-based applications, and more.
Scriptella
Scriptella is an open source ETL (Extract-Transform-Load) and script execution tool written in Java. Its
primary focus is simplicity. You don't have to study yet another complex XML-based language; use SQL (or another scripting language suitable for the data source) to perform the required transformations. Scriptella is
licensed under the Apache License, Version 2.0
KETL
KETL is a premier, open source ETL tool. The data integration platform is built with a portable, Java-based architecture and an open, XML-based configuration and job language. KETL's features successfully compete with major commercial products available today. Highlights include:
- Proven scalability across multiple servers and CPUs and any volume of data
- No additional need for third-party scheduling, dependency, and notification tools
Jaspersoft ETL
Jaspersoft ETL is easy to deploy and outperforms many proprietary ETL software systems. It is used to extract data from your transactional system to create a consolidated data warehouse or data mart for reporting and analysis.
GeoKettle
GeoKettle is a powerful, metadata-driven spatial ETL tool dedicated to the integration of different spatial data sources for building and updating geospatial data warehouses. GeoKettle enables the Extraction of data from data sources, the Transformation of data in order to correct errors and perform data cleansing, and the Loading of the transformed data into a target geospatial data warehouse.
Conclusion: Thus we have designed the sales business process using a star schema, combined the sales and inventory processes into a fact constellation schema, and learned to use an ETL tool.
ASSIGNMENT 2
Aim: Consider a suitable dataset. For clustering of data instances in different groups, apply different clustering techniques. Visualize the clusters using a suitable tool.
Theory:
Clustering: It can be considered the most important unsupervised learning problem; so, as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way".
A cluster is therefore a collection of objects which are "similar" to one another and are "dissimilar" to the objects belonging to other clusters.
K-Means clustering
K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize the total intra-cluster variance, i.e., the squared error function
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
where C_j is the j-th cluster and \mu_j is its centroid (mean).
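The following is a minimal sketch of this procedure in Python, assuming the scikit-learn and matplotlib libraries are available; the synthetic dataset and the choice of k = 3 are assumptions made purely for illustration.

# Minimal K-Means sketch on a synthetic dataset (assumed for illustration).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate 300 two-dimensional points grouped around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means with k = 3; each point is assigned to the cluster with the nearest mean.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Visualize the clusters and their centroids.
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker="x", c="red")
plt.title("K-Means clustering (k = 3)")
plt.show()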
ASSIGNMENT 3
Aim: Apply the Apriori algorithm to find frequently occurring items from the given data and generate strong association rules using support and confidence thresholds. For Example: Market Basket Analysis
General Process: Association rule generation is usually split up into two separate steps:
1. First, a minimum support threshold is applied to find all frequent itemsets in the database.
2. Second, these frequent itemsets and the minimum confidence constraint are used to form rules.
While the second step is straightforward, the first step needs more attention. Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets (item combinations). The set of possible itemsets is the power set over I and has size 2^n - 1 (excluding the empty set, which is not a valid itemset). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that for a frequent itemset all its subsets are also frequent, and thus for an infrequent itemset all its supersets must also be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori) can find all frequent itemsets.
Example:
Suppose you have records of a large number of transactions at a shopping center, as follows.
Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it is bought at least 60% of the time (i.e., Minimum Support = 3). So here an item should appear in at least 3 transactions.
Original table:
T1 {M, O, N, K, E, Y}
T2 {D, O, N, K, E, Y}
T3 {M, A, K, E}
T4 {M, U, C, K, Y}
T5 {C, O, O, K, I, E}
Step 1: Count the number of transactions in which each item occurs. Note that 'O' (Onion) is bought 4 times in total, but it occurs in just 3 transactions.
Candidate Item Set C1 (item : number of transactions, i.e., support):
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1
Step 2: We said an item is frequently bought if it is bought at least 3 times. So in this step we remove all the items that are bought fewer than 3 times from the above table, and we are left with the single items that are bought frequently (the frequent item set L1 below). Now let's say we want to find pairs of items that are bought frequently; we continue from this table.
Frequent Item Set L1 (item : number of transactions, i.e., support):
M 3
O 3
K 5
E 4
Y 3
Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then we continue with the second item, like OK, OE, OY. We do not form OM because we already formed MO when making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion and a Mango together. After making all the pairs we get:
Item pairs: MO, MK, ME, MY, OK, OE, OY, KE, KY, EY
Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only in T1 {M, O, N, K, E, Y}.
Item pair : number of transactions (support):
MO 1
MK 3
ME 2
MY 2
OK 3
OE 3
OY 2
KE 4
KY 3
EY 2
Step 5: Golden rule to the rescue. Remove all the item pairs bought together in fewer than three transactions and we are left with:
MK 3
OK 3
OE 3
KE 4
KY 3
Now let's say we want to find sets of three items that are bought together. We use the above table (the table in Step 5) and make sets of 3 items.
Step 6: To make the sets of three items we need one more rule (it is termed self-join). It simply means that, from the item pairs in the above table, we find two pairs with the same first letter, so we get:
OK and OE, which give the candidate OKE
KE and KY, which give the candidate KEY
Then we find how many times O, K, E are bought together in the original table, and the same for K, E, Y, and we get the following table:
OKE 3
KEY 2
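The walkthrough above can be reproduced programmatically. Below is a minimal sketch in Python (standard library only) that runs a level-wise Apriori-style search on the same five transactions with Minimum Support = 3; the function and variable names are chosen only for illustration.

# Minimal Apriori sketch for the transactions above (minimum support = 3).
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "O", "K", "I", "E"},  # the duplicate O counts once per transaction
]
MIN_SUPPORT = 3

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = sorted(set().union(*transactions))
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]
print("L1:", [sorted(f) for f in frequent])

# Higher levels: join frequent (k-1)-itemsets into k-item candidates, keep those meeting support.
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= MIN_SUPPORT]
    if frequent:
        print("L%d:" % k, [sorted(f) for f in frequent])
    k += 1

From the resulting frequent itemsets, strong rules can then be filtered by confidence; for example, the rule {O, K} -> {E} has confidence support(OKE) / support(OK) = 3/3 = 1.0.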
CONCLUSION
Thus we have successfully implemented the Apriori approach to data mining in order to organize the data items on a shelf, using the above table of items purchased in a mall.
ASSIGNMENT 4
Aim: Consider a suitable text dataset. Remove stop words, apply stemming and feature selection techniques to represent documents as vectors. Classify documents and evaluate precision and recall.
Theory:
'Preprocessing' is the most important subtask of text classification. The main objective of preprocessing is to obtain the key features or key terms from online news text documents and to enhance the relevancy between word and document and the relevancy between word and category. The goal behind preprocessing is to represent each document as a feature vector, that is, to separate the text into individual words.
Preprocessing steps:
Step 1: Data Collection
Step 2: Stop word removal
Step 3: Stemming
Step 4: Indexing
Step 5: Term weighting
Step 6: Feature Selection
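As a small illustration of steps 2 and 3, the sketch below removes stop words and applies Porter stemming with the NLTK library; it assumes nltk is installed and the stopwords corpus has been downloaded (nltk.download('stopwords')), and the sample sentence is invented for the example.

# Stop-word removal and stemming sketch (assumes nltk is installed and stopwords downloaded).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

text = "Some tigers are living in the zoo and playing with the visitors."  # sample document
tokens = re.findall(r"[a-z]+", text.lower())          # simple tokenization
tokens = [t for t in tokens if t not in stop_words]   # Step 2: stop word removal
stems = [stemmer.stem(t) for t in tokens]             # Step 3: stemming
print(stems)   # e.g. ['tiger', 'live', 'zoo', 'play', 'visitor']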
To classify these documents, we would start by taking all of the words in the three documents in our training set and creating a table or vector from these words:
(some, tigers, live, in, the, zoo, green, is, a, color, go, to, new, york, city), plus a class label.
Then for each of the training documents, we would create a vector by assigning a 1 if the word exists in the training document and a 0 if it does not, tagging the document with the appropriate class as follows.
When an untagged document arrives for classification and it contains the words "Orange is a color", we would create a word vector for it by marking the words which exist in our classification vector.
If we compare this vector for the document of unknown class, to the vectors representing our
three document classes, we can see that it most closely resembles the vector for class 2
documents, as shown below:
It is then possible to label the new document as a class 2 document with an adequate degree of
confidence.
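A minimal end-to-end sketch of this idea in Python, assuming scikit-learn is available; the tiny training and test documents and their class labels below are invented purely for illustration.

# Document classification sketch: binary word vectors, Naive Bayes, precision and recall.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score

train_docs = ["some tigers live in the zoo",
              "green is a color",
              "go to new york city"]
train_labels = [1, 2, 3]          # class of each training document (invented)

test_docs = ["orange is a color", "tigers live in the wild"]
test_labels = [2, 1]

# Represent documents as binary word vectors (stop words removed by the vectorizer).
vec = CountVectorizer(binary=True, stop_words="english")
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

clf = MultinomialNB().fit(X_train, train_labels)
pred = clf.predict(X_test)

print("predicted classes:", pred)
print("precision:", precision_score(test_labels, pred, average="macro", zero_division=0))
print("recall:   ", recall_score(test_labels, pred, average="macro", zero_division=0))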
CONCLUSION
Thus we have successfully classified a text document by removing stop words, applying
stemming and feature selection techniques to represent documents as vectors.
Theory:
Nowadays there is huge amount of data being collected and stored in databases everywhere
across the globe. The tendency is to keep increasing year after year. It is not hard to find
databases with Terabytes of data in enterprises and research facilities. That is over
1,099,511,627,776 bytes of data. There is invaluable information and knowledge “hidden” in
such databases, and without automatic methods for extracting this information it is practically impossible to mine it. Throughout the years many algorithms were created to extract what are called nuggets of knowledge from large sets of data. There are several different methodologies to approach this problem: classification, association rules, clustering, etc.
Classification consists of predicting a certain outcome based on a given input. In order to predict
the outcome, the algorithm processes a training set containing a set of attributes and the
respective outcome, usually called goal or prediction attribute. The algorithm tries to discover
relationships between the attributes that would make it possible to predict the outcome. Next the
algorithm is given a data set not seen before, called prediction set, which contains the same set of
attributes, except for the prediction attribute – not yet known. The algorithm analyses the input
and produces a prediction. The prediction accuracy defines how “good” the algorithm is. For
example, in a medical database the training set would have relevant patient information recorded
previously, where the prediction attribute is whether or not the patient had a heart problem. Table 1 below illustrates the training and prediction sets of such a database.
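A minimal sketch of this train-then-predict workflow in Python with scikit-learn follows; the toy patient attributes and heart-problem labels are fabricated purely to illustrate the mechanics and carry no medical meaning.

# Training/prediction sketch with a decision tree classifier (scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier

# Training set: [age, resting_heart_rate, smoker (0/1)] -> prediction attribute heart_problem (0/1).
X_train = [[63, 95, 1], [45, 70, 0], [58, 88, 1], [30, 62, 0], [51, 80, 0], [67, 99, 1]]
y_train = [1, 0, 1, 0, 0, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Prediction set: same attributes, prediction attribute not yet known.
X_new = [[60, 92, 1], [35, 66, 0]]
print(clf.predict(X_new))   # predicted heart_problem values for the unseen records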
Decision Tree: The ID3 algorithm was originally developed by J. Ross Quinlan at the
University of Sydney, and he first presented it in the 1975 book “Machine Learning”. The ID3
algorithm induces classification models, or decision trees, from data. It is a supervised learning
algorithm that is trained by examples for different classes. After being trained, the algorithm
should be able to predict the class of a new item. ID3 identifies attributes that differentiate one
class from another. All attributes must be known in advance, and must also be either continuous
or selected from a set of known values. For instance, temperature (continuous), and country of
citizenship (set of known values) are valid attributes. To determine which attributes are the most
important, ID3 uses the statistical property of entropy. Entropy measures the amount of
information in an attribute. This is how the decision tree, which will be used in testing future
cases, is built.
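To make the entropy idea concrete, here is a short Python sketch that computes the entropy of a class distribution and the information gain of splitting on an attribute; the weather-style toy records are assumptions made only for illustration.

# Entropy and information gain sketch (toy data invented for illustration).
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p * log2(p)) over the class proportions in 'labels'."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain(S, A) = H(S) - sum(|S_v|/|S| * H(S_v)) over the values v of attribute A."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: attribute 0 = outlook, attribute 1 = windy; labels = play (yes/no).
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
labels = ["no", "no", "yes", "no", "yes"]
print(information_gain(rows, labels, 0))   # ID3 picks the attribute with the highest gain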
One of the limitations of ID3 is that it is very sensitive to attributes with a large number of values (e.g., social security numbers). Splitting on such an attribute leaves almost no remaining entropy, so it appears maximally informative even though it does not help in performing any type of prediction on new data. The C4.5 algorithm overcomes this problem by using another statistical property known as the gain ratio, which normalizes the information gain of an attribute by its split information (the entropy of the attribute's own value distribution). Therefore, the C4.5 algorithm extends the ID3 algorithm through the use of the gain ratio to reduce the bias toward attributes with many values, such as social security numbers.
LABORATORY WORK
I. ORAL QUESTIONS
J. PROGRESSIVE