DMBI Practical 1 To 17 Rahul Final


Practical: 1
Aim: List and explain various data mining tools and data warehouse tools.

Study: Data Mining Tools

1) Orange Data Mining:


Orange is an open-source machine learning and data mining software suite. It supports visualization and is a component-based application written in the Python programming language.

Because it is component-based, the components of Orange are called "widgets." These widgets range from data preprocessing and visualization to the assessment of algorithms and predictive modeling.

Widgets deliver significant functionalities such as:

• Displaying a data table and allowing feature selection
• Reading data
• Training predictors and comparing learning algorithms
• Visualizing data elements, etc.


2) SAS Data Mining:


SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for analytics and data management. SAS can mine data, transform it, manage information from various sources, and perform statistical analysis. It offers a graphical UI for non-technical users.

SAS data miner allows users to analyze big data and provide accurate insight for timely decision-
making purposes. SAS has a distributed memory processing architecture that is highly scalable.
It is suitable for data mining, optimization, and text mining purposes.

3) DataMelt Data Mining:

DataMelt is a computation and visualization environment that offers an interactive framework for data analysis and visualization. It is primarily designed for students, engineers, and scientists. It is also known as DMelt.

DMelt is a multi-platform utility written in Java. It can run on any operating system that is compatible with the JVM (Java Virtual Machine). It consists of science and mathematics libraries.

• Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
• Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms, curve fitting, etc.

DMelt can be used for the analysis of large volumes of data, data mining, and statistical analysis. It is extensively used in the natural sciences, financial markets, and engineering.


4) Rattle:

Rattle is a GUI-based data mining tool. It uses the R statistical programming language. Rattle exposes the statistical power of R by offering significant data mining features. While Rattle has a comprehensive and well-developed user interface, it also has an integrated log tab that reproduces the R code for any GUI operation.

The data set produced by Rattle can be viewed and edited. Rattle also offers the facility to review the generated code, use it for many purposes, and extend the code without any restriction.

5) Rapid Miner:

Rapid Miner is one of the most popular predictive analytics systems, created by the company of the same name. It is written in the Java programming language. It offers an integrated environment for text mining, deep learning, machine learning, and predictive analytics.

The tool can be used for a wide range of applications, including business and commercial applications, research, education, training, application development, and machine learning.


Rapid Miner provides its server on-premises as well as in public or private cloud infrastructure. It is based on a client/server model. Rapid Miner comes with template-based frameworks that enable fast delivery with fewer errors than are commonly expected in manual code writing.

6) Weka:

Weka is an open-source machine learning software package with a vast collection of algorithms for data mining. It was developed by the University of Waikato in New Zealand, and it is written in Java.

It supports different data mining tasks, like preprocessing, classification, regression, clustering,
and visualization, in a graphical interface that makes it easy to use. For each of these tasks, Weka
provides built-in machine learning algorithms which allow you to quickly test your ideas and
deploy models without writing any code. To take full advantage of this, you need to have a
sound knowledge of the different algorithms available so you can choose the right one for your
particular use case.


Study: Data Warehouse Tools


1) Amazon Redshift:

Amazon Redshift is a cloud-based, fully managed, petabyte-scale data warehouse service from Amazon. It can start with just a few hundred gigabytes of data and scale to petabytes or more, enabling the use of data to gain new insights for businesses and customers.

It is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Amazon Redshift offers quick querying capabilities over structured data using SQL-based clients and business intelligence (BI) tools over standard ODBC and JDBC connections.

Amazon Redshift is built around industry-standard SQL, with additional functionality to manage massive datasets and support high-performance analysis and reporting of that data. It makes it quick and easy to work with data in open formats, integrates simply with the AWS ecosystem, and can query and export data to and from the data lake.

2) Microsoft Azure:

Azure is a cloud computing platform that was launched by Microsoft in 2010. Microsoft Azure
is a cloud computing service provider for building, testing, deploying, and managing
applications and services through Microsoft-managed data centers.

Azure is a public cloud computing platform that offers Infrastructure as a Service (IaaS),
Platform as a Service (PaaS), and Software as a Service (SaaS). The Azure cloud platform
provides more than 200 products and cloud services such as Data Analytics, Virtual Computing,
Storage, Virtual Network, Internet Traffic Manager, Web Sites, Media Services, Mobile


Services, Integration, etc. Azure facilitates simple portability and a genuinely compatible platform between on-premises infrastructure and the public cloud.

Azure provides a range of cross-connections including virtual private networks (VPNs), caches,
content delivery networks (CDNs), and ExpressRoute connections to improve usability and
performance. Microsoft Azure provides a secure base across physical infrastructure and
operational security.

3) Google BigQuery:

BigQuery is a serverless data warehouse that allows scalable analysis over petabytes of data. It is a Platform as a Service that supports querying with ANSI SQL. It additionally has built-in machine learning capabilities. BigQuery was announced in 2010 and made generally available in 2011.

Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets. BigQuery is designed for analyzing data on the order of billions of rows using a SQL-like syntax. BigQuery can run advanced analytical SQL-based queries over big sets of data. BigQuery is not developed to substitute for relational databases or for simple CRUD operations and queries.

It is oriented toward running analytical queries. It is a hybrid system that stores information in columns but takes on additional NoSQL features, such as extra data types and nested fields. BigQuery can be a better option than Redshift for intermittent workloads, since Redshift is billed by the hour whereas BigQuery charges per query.


4) Snowflake:

Snowflake is a cloud-based data warehousing service built on top of Amazon Web Services or Microsoft Azure cloud infrastructure. The Snowflake design allows storage and compute to scale independently, so customers can use and pay for storage and computation separately.
In Snowflake, data processing is simplified: users can perform data blending, analysis, and transformations against varied forms of data structures with one language, SQL. Snowflake offers dynamic, scalable computing power with charges based strictly on usage. With Snowflake, computation and storage are fully separate, and the storage cost is the same as storing the data on Amazon S3.

5) Amazon DynamoDB:

Amazon DynamoDB is a fully managed, proprietary NoSQL database service that supports key-value and document data structures and is offered by Amazon.com as part of the Amazon Web Services portfolio. DynamoDB exposes a data model similar to that of the original Dynamo system but has a completely different underlying implementation.

A partition key value is used in DynamoDB as input to an internal hash function. The output of the hash function determines the partition in which the item is going to be stored. All items with the same partition key value are stored together, in sorted order by sort key value.

It offers customers high availability, reliability, and incremental scalability, with no limits on dataset size or request throughput for a given table.


6) PostgreSQL:

PostgreSQL is an extremely stable database management system, backed by more than twenty years of community development that has contributed to its high levels of resilience, integrity, and correctness.

PostgreSQL is employed as the primary data store or data warehouse for many web, mobile, geospatial, and analytics applications. By comparison, SQL Server is a database management system that is mainly used for e-commerce and for providing various data warehousing solutions.

PostgreSQL is an advanced implementation of SQL that provides support for features such as foreign keys, subqueries, triggers, and user-defined types and functions.


Practical: 2
Aim: Study of WEKA Tool & its step-by-step Installation.

● What is WEKA?

⮚ WEKA, formally called the Waikato Environment for Knowledge Analysis, is a computer program that was developed at the University of Waikato in New Zealand for the purpose of identifying information from raw data gathered from agricultural domains.

⮚ WEKA supports many different standard data mining tasks such as data preprocessing,
classification, clustering, regression, visualization, and feature selection. The basic premise of
the application is to utilize a computer application that can be trained to perform machine
learning capabilities and derive useful information in the form of trends and patterns.

⮚ WEKA is an open-source application that is freely available under the GNU General Public License. Originally written in C, the WEKA application has been completely rewritten in Java and is compatible with almost every computing platform.

⮚ It is user-friendly, with a graphical interface that allows for quick setup and operation. WEKA operates on the assumption that the user data is available as a flat file or relation; this means that each data object is described by a fixed number of attributes that are usually of a specific type, normally alphanumeric or numeric values.

⮚ The WEKA application gives novice users a tool to identify hidden information in database and file systems, with simple-to-use options and visual interfaces.

● How to Install?

⮚ The program information can be found by conducting a search on the Web for WEKA Data
Mining or going directly to the site at www.cs.waikato.ac.nz/~ml/WEKA.

⮚ The site has a very large amount of useful information on the program’s benefits and background.
New users might find some benefit from investigating the user manual for the program.


⮚ The main WEKA site has links to this information as well as past experiments for new users to
refine the potential uses that might be of particular interest to them.

⮚ When prepared to download the software it is best to select the latest application from the
selection offered on the site.

⮚ The application is offered as a self-installing package, and installation is a simple procedure that places the complete program on the end user's machine, ready to use once extracted.

● Opening the program:

⮚ Once the program has been loaded on the user's machine, it is opened by navigating to the program's start option, which will depend on the user's operating system. Figure 1 is an example of the initial opening screen on a computer with Windows XP.

⮚ Figure 1 Chooser screen.

⮚ There are four options available on this initial screen.

♦ Simple CLI- provides users without a graphic interface option the ability to execute
commands from a terminal window.

♦ Explorer- the graphical interface used to conduct experimentation on raw data

♦ Experimenter- this option allows users to conduct different experimental variations on data
sets and perform statistical manipulation

♦ Knowledge Flow-basically the same functionality as Explorer with drag and drop
functionality. The advantage of this option is that it supports incremental learning from
previous results

⮚ While the options available can be useful for different applications, the remaining focus of this guide will be on the Explorer option.

⮚ After selecting the Explorer option, the program starts and provides the user with a separate graphical interface.


⮚ Figure 2 shows the opening screen with the available options. At first there is only the option to
select the Preprocess tab in the top left corner. This is due to the necessity to present the data set
to the application so it can be manipulated. After the data has been preprocessed the other tabs
become active for use.

⮚ There are six tabs:

1. Preprocess - used to choose the data file to be used by the application.

2. Classify - used to test and train different learning schemes on the preprocessed data file under experimentation.

3. Cluster - used to apply different tools that identify clusters within the data file.

4. Associate - used to apply different rules to the data file that identify associations within the data.

5. Select attributes - used to apply different rules to reveal changes based on the inclusion or exclusion of selected attributes from the experiment.

6. Visualize - used to see what the various manipulations produced on the data set in a 2D format, as scatter plot and bar graph output.


● Preprocessing:

⮚ In order to experiment with the application the data set needs to be presented to WEKA in a
format that the program understands. There are rules for the type of data that WEKA will accept.
There are three options for presenting data into the program.

♦ Open File- allows for the user to select files residing on the local machine or recorded
medium
♦ Open URL- provides a mechanism to locate a file or data source from a different location
specified by the user
♦ Open Database- allows the user to retrieve files or data from a database source provided by
the user

⮚ There are restrictions on the type of data that can be accepted into the program. Originally the software was designed to import only ARFF files; newer versions allow other file types such as CSV, C4.5, and serialized instance formats.


⮚ The extensions for these files include .csv, .arff, .names, .bsi and .data. Figure 3 shows an
example of selection of the file weather.arff.

⮚ Once the initial data has been selected and loaded the user can select options for refining the
experimental data. The options in the preprocess window include selection of optional filters to
apply and the user can select or remove different attributes of the data set as necessary to identify
specific information.

● Classify:

⮚ The user has the option of applying many different algorithms to the data set that would in theory
produce a representation of the information used to make observation easier. It is difficult to
identify which of the options would provide the best output for the experiment.

⮚ The best approach is to independently apply a mixture of the available choices and see what
yields something close to the desired results. The Classify tab is where the user selects the
classifier choices. Figure 4 shows some of the categories.


⮚ Again, there are several options to be selected inside the Classify tab. The Test options panel gives the user the choice of four different test mode scenarios on the data set:

1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split

⮚ There is the option of applying any or all of the modes to produce results that can be compared
by the user. Additionally, inside the test options toolbox there is a dropdown menu so the user
can select various items to apply that depending on the choice can provide output options such
as saving the results to file or specifying the random seed value to be applied for the
classification.

⮚ The classifiers in WEKA have been developed to train on the data set and produce output that is classified based on the characteristics of the last attribute in the data set. For a specific attribute to be used, the option must be selected by the user in the options menu before testing is performed. Finally, the results are calculated and shown in the text box on the lower right.


They can be saved to a file and retrieved later for comparison, or viewed within the window after changes and different results have been derived.

● Cluster:

⮚ The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences
within the data set and produce information for the user to analyze.

⮚ There are a few options within the Cluster window that are similar to those described in the Classify tab: use training set, supplied test set, and percentage split. The fourth option is classes to clusters evaluation, which compares how well the clusters match a pre-assigned class within the data.

⮚ While in cluster mode users have the option of ignoring some of the attributes from the data set.
This can be useful if there are specific attributes causing the results to be out of range or for
large data sets. Figure 5 shows the Cluster window and some of its options.


● Visualization:

⮚ The last tab in the window is the visualization tab. Within the program calculations and
comparisons have occurred on the data set.

⮚ Selections of attributes and methods of manipulation have been chosen. The final piece of the
puzzle is looking at the information that has been derived throughout the process. The user can
now actually see the fruit of their efforts in a two-dimensional representation of the information.
⮚ The first screen that the user sees when they select the visualization option is a matrix of plots
representing the different attributes within the data set plotted against the other attributes. If
necessary, there is a scroll bar to view all the produced plots.

⮚ The user can select a specific plot from the matrix to view its contents for analysis. A grid pattern of the plots allows the user to arrange the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another, providing flexibility. Figure 9 shows the plot matrix view.

⮚ The scatter plot matrix gives the user a visual representation of the manipulated data sets for
selection and analysis. The choices are the attributes across the top and the same from top to
bottom giving the user easy access to pick the area of interest.

⮚ Clicking on a plot brings up a separate window of the selected scatter plot. The user can then
look at a visualization of the data of the attributes selected and select areas of the scatter plot
with a selection window or by clicking on the points within the plot to identify the point’s
specific information. Figure 10 shows the scatter plot for two attributes and the points derived
from the data set.

⮚ There are a few options for viewing the plot that could be helpful to the user. It is formatted like an X/Y graph, yet it can show any of the attribute classes that appear on the main scatter plot matrix. This is handy when the scale of an attribute cannot be ascertained on one axis versus the other.

⮚ Within the plot the points can be adjusted by utilizing a feature called jitter. This option moves
the individual points so that in the event of close data points users can reveal hidden multiple
occurrences within the initial plot. Figure 11 shows an example of this point selection and the
results the user sees.


⮚ There are a few options to manipulate the view for the identification of subsets or to separate
the data points on the plot.

♦ Polyline: can be used to segment different values for additional visualization clarity on the
plot. This is useful when there are many data points represented on the graph.


♦ Rectangle: this tool is helpful to select instances within the graph for copying or
clarification.
♦ Polygon: Users can connect points to segregate information and isolate points for
reference.

⮚ This user guide is meant to assist users in their efforts to become familiar with some of the
features within the Explorer portion of the WEKA data mining software application and is used
for informational purposes only.


⮚ It is a summary of the user information found on the program's main web site. For a more comprehensive and in-depth version, users can visit the main site http://www.cs.waikato.ac.nz/~ml/WEKA for examples and FAQs about the program.

• Step-by-step Installation:

Step 1: Visit this website using any web browser. Click on Free Download.

Step 2: Click on Start Download. Downloading of the executable file will start shortly. Now
check for the executable file in downloads in your system and run it.


Step 3: It will prompt confirmation to make changes to your system. Click on Yes. A setup
screen will appear, click on Next.

Step 4: The next screen shows the License Agreement; click on I Agree. The next screen is for choosing components; all components are already selected, so don't change anything, just click on the Install button.


Step 5: The next screen is for the installation location, so choose a drive that has sufficient space for the installation; it needs about 301 MB. The next screen is for choosing the Start menu folder, so don't change anything, just click on the Install button.

Step 6: After this, the installation process will start and will hardly take a minute to complete. Click on the Next button after the installation process is complete.

Step 7: Click on Finish to finish the installation process. Weka is successfully installed on the
system and an icon is created on the desktop.


Step 8: Run the software and see the interface.


Practical: 3
Aim: Perform analysis, preprocessing, and visualization on the following available datasets:

1) Weather

(A) How many instances (examples) are contained in the dataset?

Ans: 14
Screenshot:

(B) How many attributes are used to represent the instances?

Ans: 5
Screenshot:

(C) Which attribute is the class label?

Ans: play – Yes or No
Screenshot:


(D) What is the data type (e.g., numeric, nominal, etc.) of the attributes in the
dataset?
Ans:

Screenshot:


(E) Visualize the different attributes.


2) IRIS-dataset

(A) How many instances (examples) are contained in the dataset?

Ans: 150
Screenshot:

(B) How many attributes are used to represent the instances?

Ans: 5
Screenshot:

(C) Which attribute is the class label?

Ans: class – Iris-setosa, Iris-versicolor, Iris-virginica
Screenshot:


(D) What is the data type (e.g., numeric, nominal, etc.) of the attributes in the
dataset?
Ans:

Screenshot:


(E) Visualize the different attributes.


Practical: 4
Aim: Create, analyze, preprocess, and visualize three ARFF databases (Student, Subject, Faculty) using the Weka tool.

Answer the following questions after creating the above databases:

1. Find which faculty teaches which subject.

2. Find the count for different subjects chosen by same or different students.


3. State attributes & their data types of all three tables.

(A) Student Table:


(B) Subject Table:



(C) Faculty Table:



4. Visualizing all three ARFF files (Student, Faculty, Subject).

(A) Subject Table

ARFF File Code:

@relation "Subject"
@attribute Sub_id numeric
@attribute Subject{DMBI,ITU,AI}
@attribute Learn_by{Rupali,Aditi,Dhruhi,Rutva,Mansi}
@attribute Teach_by{DV,RK,AS}

@data
1,DMBI,Rupali,DV
2,DMBI,Rupali,DV
3,DMBI,Rupali,DV
4,ITU,Aditi,RK
5,ITU,Aditi,RK
6,ITU,Aditi,RK
7,AI,Dhruhi,AS
8,AI,Dhruhi,AS
9,AI,Rutva,AS 10,AI,Mansi,AS


(B) Student Table

ARFF File Code:

@relation "Student"
@attribute Enroll_No numeric
@attribute Name{Rupali,Aditi,Dhruhi,Rutva,Mansi}
@attribute Subject_Name{DMBI,ITU,AI,ADC,CPDP,MPMC}
@attribute Subject_Code{3160714,3161009,3161608,3161607,3160002,3160914}

@data
11,Rupali,DMBI,3160714
12,Rupali,ITU,3161009
13,Rupali,AI,3161608
14,Aditi,ADC,3161607
15,Aditi,CPDP,3160002
16,Dhruhi,MPMC,3160914
17,Rutva,ITU,3161009


18,Rutva,ADC,3161607
19,Mansi,CPDP,3160002
20,Dhruhi,MPMC,3160914

(C) Faculty Table

ARFF File Code:

@relation "Faculty"
@attribute No numeric
@attribute Faculty_Name{DV,RK,AS,AKV,PR,MSG}
@attribute Subject_Name{DMBI,ITU,AI,ADC,CPDP,MPMC}
@attribute Subject_Code{3160714,3161009,3161608,3161607,3160002,3160914}

@data
001,DV,DMBI,3160714
002,RK,ITU,3161009
003,AS,AI,3161608
004,AKV,ADC,3161607


005,PR,CPDP,3160002
006,MSG,MPMC,3160914
007,RK,ITU,3161009
008,AKV,ADC,3161607
009,PR,CPDP,3160002
0010,MSG,MPMC,3160914


Practical: 5
Aim: Calculate mean, mode, and median using Python.

Mean:
The mean is the average value of all the values in a dataset. To calculate the mean value of a
dataset, we first need to find the sum of all the values and then divide the result by the number of
elements.

Code:
l1 = [1, 2, 3, 4, 5]
sum = 0
a = range(0, 5)
for i in a:
    sum = sum + l1[i]
i = i + 1
mean = sum / i
print("Sum is: ", sum)
print("Mean is: ", mean)

Output:

Median:

The median is the middle value among all the values in sorted order. There are two ways to calculate
the median value as:


1) For an even number of values:

Median = [(n/2)th term + ((n/2) + 1)th term] / 2

2) For an odd number of values:

Median = ((n+1)/2)th term

Code:
list1 = [5, 6, 4, 5, 7, 9]
list1.sort()
median = 0
if (len(list1) % 2 == 0):
    m1 = list1[len(list1) // 2]
    m2 = list1[len(list1) // 2 - 1]
    median = (m1 + m2) / 2
else:
    median = list1[len(list1) // 2]
print(median)

Output:

Mode:

The mode is the value that occurs most frequently in a dataset.

from statistics import mode


mode([1,2,3,4,4])

Output:
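As a quick cross-check, the mean and median computed manually above can also be obtained from Python's built-in statistics module (a minimal sketch using the same sample lists as the code above):

import statistics

print(statistics.mean([1, 2, 3, 4, 5]))       # 3
print(statistics.median([5, 6, 4, 5, 7, 9]))  # 5.5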


Practical: 6
Aim: Calculate variance and standard deviation in Python.

Variance:
Variance is a measure of how widely a collection of data is spread out. If all the data values are identical, the variance is zero. All non-zero variances are positive. A small variance indicates that the data points are close to the mean and to each other, whereas data points that are highly spread out from the mean and from one another indicate a high variance.

Standard Deviation:
Standard deviation is a measure that shows how much variation from the mean exists. The standard deviation indicates a "typical" deviation from the mean. It is a popular measure of variability because it is expressed in the original units of measure of the data set.

Code:
l = [9, 108, 27, 36, 54, 27, 63]
mean = (9 + 108 + 27 + 36 + 54 + 27 + 63) / 7
print("The mean is: ", mean)

variance = sum([((i - mean) ** 2) for i in l]) / len(l)
print("The variance is: ", variance)

stan_dev = variance ** 0.5
print("The standard deviation is: ", stan_dev)

Output:
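The same values can be cross-checked with the statistics module (illustrative only; pvariance and pstdev compute the population variance and population standard deviation, which is exactly what the formula above obtains by dividing by len(l)):

import statistics

l = [9, 108, 27, 36, 54, 27, 63]
print(statistics.pvariance(l))   # population variance
print(statistics.pstdev(l))      # population standard deviation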

Practical: 7
Aim: Perform data cleaning using the pandas library in Python.

Data Set:
import pandas as pd
import numpy as np

students = [('Rupali', 22, 'AHMEDABAD', 'ADANI'),
            ('Aditi', 25, np.NAN, np.NAN),
            ('Jinal', np.NAN, 'NVS', 'KVS'),
            ('Jinal', np.NAN, 'NVS', 'KVS'),
            (np.NAN, 20, 'AHMEDABAD', 'KSV'),
            ('Mansi', 25, np.NAN, 'AU'),
            ('Priya', 30, 'Baroda', np.NAN),
            (np.NAN, 35, 'Surat', np.NAN),
            ('Dhruhi', np.NAN, 'Una', np.NAN),
            ('Rutva', 30, 'Mumbai', 'IIT'),
            ('Rutva', 30, 'Mumbai', 'IIT'),
            (np.NAN, 15, np.NAN, 'AU'),
            (np.NAN, np.NAN, np.NAN, np.NAN),
            (np.NAN, 20, np.NAN, np.NAN)]

df = pd.DataFrame(students, columns = ['Name', 'Age', 'Place', 'Institute'],
                  index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
df.info()   # printing information
print(df)   # printing the dataframe


1) Count missing values in rows, columns, dataframe:


# count total NaN in each row
for i in range(len(df.index)):
    print("Total NaN in each row ", i+1, ":", df.iloc[i].isnull().sum())


# returns the no. of times a null value occurs in each column
df.isnull().sum()

# returns the no. of times a null value occurs in the entire dataset
df.isnull().sum().sum()

2) Remove Duplicates:

# deleting the duplicate rows
df = df.drop_duplicates()
print(df)


3) Fill (0):

# To fill the NaN (null) values with 0
df.fillna(0)


4) Delete rows with all null values:

# Add a new row of all NaN values
df1 = df.copy()
df1.loc[len(df1.index)] = [np.NaN, np.NaN, np.NaN, np.NaN]
print(df1)

# delete rows in which all the values are NaN
df1 = df1.dropna(how = 'all')
print(df1)


5) Replace NaN with -99:

# Replace NaN with -99
df.replace(to_replace = np.nan, value = -99)

Data Set:
# creating a new dataset
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [10, np.nan, 30, 40, 50],
        'C': [np.nan, 22, 33, 44, 55],
        'D': [11, np.nan, 12, 30, 45],
        'E': [55, 60, 77, 88, 99]}
df = pd.DataFrame(data)
print(df)


6) Pad:

# fill missing/NaN values with the previous ones (forward fill)
df_pad = df.fillna(method = 'pad')
print("Pad: ")
print(df_pad)

7) Bfill:

# fill missing values with the next value
df2 = df.fillna(method = 'bfill')
print("Filling the missing values with the next ones: ")
print(df2)




Data Set:
data1 = {'Clothes': ['pant', 't-shirt', 'kurta', 'pant', None, 'kurta', 't-shirt', 'pant']}
df3 = pd.DataFrame(data1)
print(df3)

8) Mode:

# Prints the category which occurs the most number of times
m = df3['Clothes'].mode()[0]
print(m)
df3['Clothes'].fillna(value = m, inplace = True)

9) Mean:

data = {'A': [1, 2, np.nan, 4, 5],
        'B': [10, np.nan, 30, 40, 50],
        'C': [np.nan, 22, 33, 44, 55],
        'D': [11, np.nan, 12, 30, 45],
        'E': [55, 60, 77, 88, 99]}
df5 = pd.DataFrame(data)
df5['B'] = df5['B'].fillna(df5['B'].mean())
print(df5)


Practical: 8
Aim: Implement the chi-square test in Python using the scipy.stats module.
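The chi-square statistic is computed over every cell of the contingency table as chi2 = sum((observed - expected)^2 / expected), where the expected counts are derived from the row and column totals; chi2_contingency returns this statistic along with the p-value, the degrees of freedom, and the table of expected frequencies.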

Code:
# job satisfaction contingency table
import scipy.stats as stats
from scipy.stats import chi2_contingency

data = [[25,40,30,50],[15,20,25,30],[10,15,20,25]]
print(data)

Output:

Visualization Code:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (12,8))
sns.heatmap(data, annot = True, cmap = "YlGnBu")
plt.show()


Code:
# Running a Chi-Square Test of Independence in Python
import scipy.stats as stats

stat, p, dof, expected = chi2_contingency(data)
print('Printing stat value', stat)
print('Printing p value', p)
print('dof = %d' % dof)
print(expected)


Output:

Code:
# Assessing our results
if p < 0.05:
    print('Reject null hypothesis')
else:
    print('Fail to reject the null hypothesis')

Output:

Visualization Code:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (12,8))
sns.heatmap(expected, annot = True, cmap = "YlGnBu")
plt.show()


Practical: 9
Aim: Implement min-max normalization in Python (use a suitable dataset or a DataFrame).
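Min-max normalization rescales every value of a column into the range [0, 1] using x' = (x - min) / (max - min); the code below applies this formula column by column.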

Code:
import pandas as pd
import numpy as np

numbers = [(12,23,34,56),
           (93,33,44,55),
           (43,87,89,52),
           (93,56,82,61)]
df = pd.DataFrame(numbers, columns=['A','B','C','D'])
print(df)

Output:

Code:
import matplotlib.pyplot as plt

df.plot(kind = 'bar')


Output:

Code:
# Using min-max feature scaling
# copy the data
df_min_max = df.copy()

# apply the normalization technique column by column
for column in df_min_max.columns:
    df_min_max[column] = (df_min_max[column] - df_min_max[column].min()) / (df_min_max[column].max() - df_min_max[column].min())

# view the normalized data
print(df_min_max)


Output:

Code:
import matplotlib.pyplot as plt

df_min_max.plot(kind = 'bar')

Output:


Practical: 10
Aim: Implement z-score normalization in Python (use a suitable dataset or a DataFrame).
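Z-score normalization rescales each value by how many standard deviations it lies from the column mean: z = (x - mean) / std; the code below applies this formula column by column.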

Code:
import pandas as pd
import numpy as np

numbers = [(12,23,34,56),
           (93,33,44,55),
           (43,87,89,52),
           (93,56,82,61)]
df = pd.DataFrame(numbers, columns=['A','B','C','D'])
print(df)

Output:

Code:
df_z_score = df.copy()
for column in df_z_score.columns:
    df_z_score[column] = (df_z_score[column] - df_z_score[column].mean()) / df_z_score[column].std()


display(df_z_score)
Output:

Code:
import matplotlib.pyplot as plt

df_z_score.plot(kind = 'bar')

Output:


Practical: 11
Aim: Implement simple linear regression in Python.
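Simple linear regression fits the line y = b0 + b1*x by least squares, where b1 = sum((x[i] - mean_x) * (y[i] - mean_y)) / sum((x[i] - mean_x) ** 2) and b0 = mean_y - b1 * mean_x; the code below computes these coefficients directly from the data.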

Code:
import pandas as pd
import numpy as np

numbers = [(12,23),
           (93,33),
           (43,87),
           (93,56)]
df = pd.DataFrame(numbers, columns=['A','B'])
print(df)

Output:

Code:
x=df['A'].values # Independent variable

Output:


Code:
y=df['B'].values

Output:

Code:
# Mean of X and Y
mean_x = np.mean(x)
mean_y = np.mean(y)

# Total number of values
n = len(x)

# Using the formula to calculate b1 and b0
numer = 0
denom = 0
for i in range(n):
    numer += (x[i] - mean_x) * (y[i] - mean_y)
    denom += (x[i] - mean_x) ** 2
b1 = numer / denom
b0 = mean_y - (b1 * mean_x)

# print coefficients
print(b1, b0)

Output:


Code:
# plotting values and the regression line
import matplotlib.pyplot as plt
import numpy as np

max_x = np.max(x) + 100
min_x = np.min(x) - 100

# calculating line values (kept in separate variables so x and y stay intact)
line_x = np.linspace(min_x, max_x, 1000)
line_y = b0 + b1 * line_x

# Plotting the regression line
plt.plot(line_x, line_y, color='orange', label='Regression Line')

# Plotting the scatter points of the original data
plt.scatter(x, y, c='green', label='Scatter Plot')

plt.xlabel('A')
plt.ylabel('B')
plt.legend()
plt.show()

Output:


Code:
ss_t = 0
ss_r = 0
for i in range(n):
    y_pred = b0 + b1 * x[i]
    ss_t += (y[i] - mean_y) ** 2
    ss_r += (y[i] - y_pred) ** 2
r2 = 1 - (ss_r / ss_t)
print(r2)

Output:


Practical: 12
Aim: Study of APRIORI algorithm in detail.
Apriori Algorithm
In data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits).
Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing).

Overview:
The whole point of the algorithm (and of data mining in general) is to extract useful information from large amounts of data. For example, the knowledge that a customer who purchases a keyboard also tends to buy a mouse at the same time is expressed by the association rule Keyboard -> Mouse, whose strength is measured by the support and confidence below:

Support: The percentage of task-relevant data transactions for which the pattern is true.

Support (Keyboard -> Mouse) = (number of transactions containing both Keyboard and Mouse) / (total number of transactions)

Confidence: The measure of certainty or trustworthiness associated with each discovered pattern.

Confidence (Keyboard -> Mouse) = (number of transactions containing both Keyboard and Mouse) / (number of transactions containing Keyboard)
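For instance (with hypothetical numbers): if 1,000 transactions are examined, 200 of them contain a keyboard and 150 contain both a keyboard and a mouse, then support(Keyboard -> Mouse) = 150 / 1000 = 15% and confidence(Keyboard -> Mouse) = 150 / 200 = 75%.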

The algorithm aims to find the rules which satisfy both a minimum support threshold and a
minimum confidence threshold (Strong Rules).
Item: article in the basket.

Item set: a group of items purchased together in a single transaction.


How Apriori Works

1. Find all frequent itemsets:

   Get frequent items: items whose occurrence in the database is greater than or equal to the min_support threshold.

   Get frequent itemsets: generate candidates from the frequent items, then prune the results to keep only the frequent itemsets (a small Python sketch of this step is given after this list).

2. Generate strong association rules from the frequent itemsets:

   Rules which satisfy the min_support and min_confidence thresholds.
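A minimal Python sketch of step 1 (frequent itemset generation) follows. The transaction list is hypothetical, but it is chosen so that, with a 50% minimum support, the frequent itemsets come out the same as those listed in the worked example later in this practical; it illustrates the idea and is not a full Apriori implementation.

transactions = [
    {'A', 'C', 'D'},
    {'B', 'C', 'E'},
    {'A', 'B', 'C', 'E'},
    {'B', 'E'},
    {'A', 'B', 'C', 'E'},
]
min_support = 0.5
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / n

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Lk: join frequent (k-1)-itemsets into k-item candidates, then prune by support
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent[:-1]:
    for itemset in level:
        print(sorted(itemset), support(itemset))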

High Level Design


Low Level Design


Example :


A database has five transactions. Let min_sup = 50% and min_conf = 80%.

Solution:
Step 1: Find all Frequent Itemsets


Frequent Itemsets:
{A} {B} {C} {E} {A C} {B C} {B E} {C E} {B C E}

Step 2: Generate strong association rules from the frequent itemsets


Practical: 14
Aim: Study of Decision tree induction algorithm in detail.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buys_computer that indicates whether a customer
at a company is likely to buy a computer or not. Each internal node represents a test on an
attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

● It does not require any domain knowledge.


● It is easy to comprehend.
● The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm


● A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.


Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:
Data partition D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting subset.

Output: A decision tree.

Method:
create a node N;

if the tuples in D are all of the same class C, then
    return N as a leaf node labeled with class C;

if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting

apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove the splitting attribute

for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition

    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for

return N;

Tree Pruning:
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers.
The pruned trees are smaller and less complex.
Tree Pruning Approaches
Here is the Tree Pruning Approaches listed below −

● Pre-pruning − The tree is pruned by halting its construction early.

● Post-pruning - This approach removes a sub-tree from a fully grown tree.

Cost Complexity:
The cost complexity is measured by the following two parameters −

● Number of leaves in the tree, and
● Error rate of the tree.
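As an illustration only (not the textbook pseudocode above), the same top-down induction can be tried with scikit-learn's DecisionTreeClassifier, assuming scikit-learn is installed; the tiny encoded buys_computer-style dataset here is made up for demonstration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical tuples: [age_group, income, student] encoded as integers
X = [[0, 2, 0], [0, 2, 1], [1, 2, 0], [2, 1, 0], [2, 0, 1], [1, 0, 1], [0, 1, 0]]
y = [0, 0, 1, 1, 1, 1, 0]   # 1 = buys a computer, 0 = does not

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X, y)

# Print the learned tree and classify a new, unseen tuple
print(export_text(clf, feature_names=["age_group", "income", "student"]))
print(clf.predict([[2, 1, 1]]))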


Practical: 16
Aim: Make a case study and design star schema of a Data Warehouse for an
organization by identifying facts and dimensions.
Datawarehouse:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision-making process.”

In other words
“A data warehouse is a collection of data designed to support management decision making. Data
warehouses contain a wide variety of data that present a coherent picture of business conditions
at a single point in time.”

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there will
be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

Data Warehouse Architecture


Different data warehousing systems have different structures. Some may have an ODS (operational
data store), while some may have multiple data marts. Some may have a small number of data
sources, while some may have dozens of data sources. In view of this, it is far more reasonable to
present the different layers of a data warehouse architecture rather than discussing the specifics of
any one system.

In general, all data warehouse systems have the following layers:


● Data Source Layer


● Data Extraction Layer
● Staging Area
● ETL Layer
● Data Storage Layer
● Data Logic Layer
● Data Presentation Layer
● Metadata Layer
● System Operations Layer
The picture below shows the relationships among the different components of the data warehouse
architecture:

Each component is discussed individually below:

Data Source Layer

This represents the different data sources that feed data into the data warehouse. The data source
can be of any format -- plain text file, relational database, other types of database, Excel file, etc.,
can all act as a data source.

Many different types of data can be a data source:


● Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems
data.
● Web server logs with user browsing data.
● Internal market research data.
● Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

Data Extraction Layer

Data gets pulled from the data source into the data warehouse system. There is likely some minimal
data cleansing, but there is unlikely any major data transformation.

Staging Area

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart.
Having one common area makes it easier for subsequent data processing / integration.

ETL Layer

This is where data gains its "intelligence", as logic is applied to transform the data from a
transactional nature to an analytical nature. This layer is also where data cleansing happens. The
ETL design phase is often the most time-consuming phase in a data warehousing project, and an
ETL tool is often used in this layer.

Data Storage Layer

This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of
entities can be found here: data warehouse, data mart, and operational data store (ODS). In any
given system, you may have just one of the three, two of the three, or all three types.

Data Logic Layer

This is where business rules are stored. Business rules stored here do not affect the underlying data
transformation rules, but do affect what the report looks like.

Data Presentation Layer


This refers to the information that reaches the users. This can be in the form of a tabular or graphical report in a browser, an emailed report that gets automatically generated and sent every day, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool is used in this layer.

Metadata Layer

This is where information about the data stored in the data warehouse system is stored. A logical
data model would be an example of something that's in the metadata layer. A metadata tool is often
used to manage metadata.

System Operations Layer

This layer includes information on how the data warehouse system operates, such as ETL job
status, system performance, and user access history.

Data Warehouse Concepts


Several concepts are of particular importance to data warehousing. They are discussed in detail in
this section.

Dimensional Data Model: Dimensional data model is commonly used in data warehousing
systems. This section describes this modeling technique, and the two common schema types, star
schema and snowflake schema.

Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section explains the problem and describes the three ways of handling it, with examples.

Conceptual Data Model: What is a conceptual data model, its features, and an example of this
type of data model.

Logical Data Model: What is a logical data model, its features, and an example of this type of data
model.

Physical Data Model: What is a physical data model, its features, and an example of this type of
data model.

Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model.
This section compares the three different types of data models.


Data Integrity: What is data integrity and how it is enforced in data warehousing.

What is OLAP: Definition of OLAP.

Factless Fact Table: A fact table without any fact may sound silly, but there are real life instances
when a factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: when to use it and why it is useful.

Conformed Dimension: Discusses the concept of a conformed dimension: What is it and why is
it important.

Dimensional Data Model:


Dimensional data model is most often used in data warehousing systems. This is different from the
3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the
same data would then be stored differently in a dimensional model than in a 3rd normal form
model.

To understand dimensional data modeling, let's define some of the terms commonly used in this
type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.

Hierarchy: The specification of levels that represents relationship between different attributes
within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter
→ Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount
would be such a measure. This measure is stored in the fact table with the appropriate granularity.
For example, it can be sales amount by store by day. In this case, the fact table would contain three
columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For
example, the lookup table for the Quarter attribute would include a list of all of the quarters
available in the data warehouse. Each row (each quarter) may have several fields, one for the
unique ID that identifies the quarter, and one or more additional fields that specifies how that


particular quarter is represented on a report (for example, first quarter of 2001 may be represented
as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more
lookup tables, but fact tables do not have direct relationships to one another. Dimensions and
hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup
tables.
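
To make these terms concrete, here is a minimal sketch in Python using pandas (pandas and all
table and column names here are illustrative assumptions, not part of this practical's SQL Server
setup) showing a fact table at the "sales amount by store by day" grain together with two lookup tables:

import pandas as pd

# Lookup table for the Store dimension: one row per store, plus descriptive attributes.
store_dim = pd.DataFrame({
    "store_id": [1, 2],
    "store_name": ["Downtown", "Airport"],
    "city": ["Ahmedabad", "Mumbai"],
})

# Lookup table for the Time dimension: one row per day, with a report label for the quarter.
time_dim = pd.DataFrame({
    "date_id": [20010101, 20010102],
    "quarter_label": ["Q1 2001", "Q1 2001"],
})

# Fact table holding the measure of interest: a date column, a store column, and a sales amount column.
sales_fact = pd.DataFrame({
    "date_id": [20010101, 20010101, 20010102],
    "store_id": [1, 2, 1],
    "sales_amount": [1200.0, 800.0, 950.0],
})
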
In designing data models for data warehouses / data marts, the most commonly used schema types
are Star Schema and Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and business needs.
Personally, I am partial to snowflakes, when there is a business case to analyze the information at
that particular level.

SCHEMA

✔ Star Schema

In the star schema design, a single object (the fact table) sits in the middle and is radially
connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is
represented as a single table. The primary key in each dimension table is related to a foreign key
in the fact table.

All measures in the fact table are related to all the dimensions that the fact table is related to. In other
words, they all have the same level of granularity.

A star schema can be simple or complex. A simple star consists of one fact table; a complex star
can have more than one fact table.

Let's look at an example: Assume our data warehouse keeps store sales data, and the different
dimensions are time, store, product, and customer. In this case, the figure on the left represents our
star schema. The lines between two tables indicate that there is a primary key / foreign key
relationship between the two tables. Note that different dimensions are not related to one another.
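
As a rough, hedged sketch of how such a star schema is queried (pandas assumed; the table and
column names below are invented for illustration and are not the FoodMart tables used later), the
fact table is joined to each dimension lookup table through its foreign key and the measure is then
aggregated:

import pandas as pd

# Dimension lookup tables (simplified to one descriptive attribute each).
store_dim = pd.DataFrame({"store_id": [1, 2], "region": ["East", "West"]})
time_dim = pd.DataFrame({"date_id": [20010101, 20010102], "year": [2001, 2001]})

# Fact table at the store-by-day grain.
sales_fact = pd.DataFrame({
    "date_id": [20010101, 20010101, 20010102],
    "store_id": [1, 2, 1],
    "sales_amount": [1200.0, 800.0, 950.0],
})

# Resolve the foreign keys against the dimension tables, then aggregate the measure.
report = (
    sales_fact
    .merge(store_dim, on="store_id")
    .merge(time_dim, on="date_id")
    .groupby(["year", "region"])["sales_amount"]
    .sum()
    .reset_index()
)
print(report)

Because every fact row carries the same grain, any combination of dimension attributes can be
aggregated this way.
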
✔ Snowflake Schema

The snowflake schema is an extension of the star schema, where each point of the star explodes
into more points. In a star schema, each dimension is represented by a single dimensional table,
whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables,
each representing a level in the dimensional hierarchy.

For example, suppose the Time Dimension has two different hierarchies:

1. Year → Month → Day
2. Week → Day

We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for
month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is
then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating
the above relationships in the Time Dimension is shown to the right.

The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables. The main disadvantage
of the snowflake schema is the additional maintenance effort needed due to the increased number
of lookup tables.
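
A minimal sketch of this snowflaked Time dimension (pandas assumed, with invented column names,
not this practical's SQL Server tables) shows the separate Year, Month, Week, and Day lookup tables
and the chain of joins needed to reassemble the full dimension:

import pandas as pd

# Normalized lookup tables, one per level of the Time hierarchies.
year_lkp = pd.DataFrame({"year_id": [1], "year": [2001]})
month_lkp = pd.DataFrame({"month_id": [1, 2], "month": ["Jan", "Feb"], "year_id": [1, 1]})
week_lkp = pd.DataFrame({"week_id": [1, 5], "week": ["W1", "W5"]})
day_lkp = pd.DataFrame({
    "day_id": [20010101, 20010201],
    "month_id": [1, 2],
    "week_id": [1, 5],
})

# Day rolls up to Month, Month rolls up to Year, and Week connects only to Day.
time_dim = (
    day_lkp
    .merge(month_lkp, on="month_id")
    .merge(year_lkp, on="year_id")
    .merge(week_lkp, on="week_id")
)
print(time_dim)
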

Design star schema of a Data Warehouse using Microsoft SQL Server 2008

Creating a New Analysis Services Project:

1. Select Microsoft SQL Server 2008 > SQL Server Business Intelligence Development
Studio from the Programs menu to launch Business Intelligence Development Studio.
2. Select File > New > Project.
3. In the New Project dialog box, select the Business Intelligence Projects project type.
4. Select the Analysis Services Project template.
5. Name the new project Interactive Training Program and select a convenient location to save
it.
6. Click OK to create the new project.

1.1 Create Analysis Service Project

Figure 1.2 shows the Solution Explorer window of the new project, ready to be populated
with objects.

1.2 Solution Explorer window

Defining a Data Source

To define a data source, follow these steps:

1. Right-click on the Data Sources folder in Solution Explorer and select New Data Source.
2. Read the first page of the Data Source Wizard and click Next.
3. You can base a data source on a new or an existing connection. Because you don't have any
existing connections, click New.

1.3 Data Source wizard

4. In the Connection Manager dialog box, select the server containing your analysis services
sample database from the Server Name combo box.
5. Fill in your authentication information.
6. Select the Native OLE DB\Microsoft Jet 4.0 OLE DB Provider (this is the default
provider).
7. Select the FoodMart database. Figure 1.3 shows the filled-in Connection Manager dialog
box.

1.4 Connection Manager

8. Click OK to close the Connection Manager dialog box (Test Connection should succeed).


9. Click Next.
10. Select Use the credentials of the current user and click Next.

1.5 Data source wizard

13. Accept the default data source name and click Finish.

Defining a Data Source View

A data source view is a persistent set of tables from a data source that supply the data. BIDS
also includes a wizard for creating data source views, which you can invoke by right-clicking on
the Data Source Views folder in Solution Explorer.

To create a new data source view, follow these steps:

1. Right-click on the Data Source Views folder in Solution Explorer and select New Data Source
View.
2. Read the first page of the Data Source View Wizard and click Next.

3. Select the Foodmart data source and click Next. Note that you could also launch the Data
Source Wizard from here by clicking New Data Source.

4. Select the Customer (dbo) table, Product table, sales_fact_1997 table, and time_by_day table
in the Available Objects list and click the > button to move them to the Included Objects list. The
sales_fact_1997 table will be the fact table in the new cube.

1.6 Selecting Wizard for Data Source View

5. Click Next.

1.7 Provide name for data source view

6. Click Finish.

BIDS will automatically display the schema of the new data source view, as shown in Figure 1.8.

Star Schema of Data Source View

Practical: 17
Aim: Make a case study for data cube creation process. Also list and explain
various OLAP operations like drill down, slice, pivot and roll up.

What Is OLAP?

OLAP (online analytical processing) is computer processing that enables a user to easily
and selectively extract and view data from different points of view. OLAP data is stored in a
multidimensional database. Whereas a relational database can be thought of as two-dimensional,
a multidimensional database considers each data attribute (such as product, geographic sales
region, and time period) as a separate "dimension."

OLAP software can locate the intersection of dimensions (all products sold in the Eastern
region above a certain price during a certain time period) and display them. Attributes such as time
periods can be broken down into sub-attributes.

OLAP CUBE:
A cube can be considered a generalization of a three-dimensional spreadsheet. For example,
a company might wish to summarize financial data by product, by time-period, and by city to
compare actual and budget expenses. Product, time, city and scenario (actual versus budget) are the data's dimensions.

Cube is shorthand for a multidimensional dataset, given that data can have an arbitrary
number of dimensions. The term hypercube is sometimes used, especially for data with more than
three dimensions.

OLAP data is typically stored in a star schema or snowflake schema in a relational data
warehouse or in a special-purpose data management system. Measures are derived from the records
in the fact table and dimensions are derived from the dimension tables.

Conceiving data as a cube with hierarchical dimensions leads to conceptually straightforward
operations to facilitate analysis. Aligning the data content with a familiar visualization enhances
analyst learning and productivity. The user-initiated process of navigating by calling for page
displays interactively, through the specification of slices via rotations and drill down/up is
sometimes called "slice and dice". Common operations include slice and dice, drill down, roll up,
and pivot.
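
As an illustrative, hedged sketch of these operations (pandas assumed; the small sales table below
is made up for demonstration and is not the cube built later in this practical), slice, dice, and
pivot can be mimicked on a flattened cube as follows:

import pandas as pd

# A flattened three-dimensional cube: Time, Region, and Product, with a sales measure.
cube = pd.DataFrame({
    "year":    [2001, 2001, 2001, 2002, 2002, 2002],
    "region":  ["East", "West", "East", "East", "West", "West"],
    "product": ["Milk", "Milk", "Bread", "Milk", "Bread", "Bread"],
    "sales":   [100, 80, 60, 120, 70, 90],
})

# Slice: fix one dimension at a single value (here, year = 2001).
slice_2001 = cube[cube["year"] == 2001]

# Dice: pick a sub-cube by restricting several dimensions at once.
dice_east_milk = cube[(cube["region"] == "East") & (cube["product"] == "Milk")]

# Pivot: rotate the view so regions become columns and products become rows.
pivoted = cube.pivot_table(index="product", columns="region", values="sales", aggfunc="sum")

print(slice_2001)
print(dice_east_milk)
print(pivoted)
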

Perform the following Steps for cube generation:

1. Select Creation Method.

2. Specify Source Information.

3. Select Dimension Attribute.

4. Completing the Wizard.

5. The following hierarchy is created for all attributes.

6. The following attribute relationships are generated.

7. Select the measure table.

8. Select measures.

9. Completing the wizard.

10. Process the cube to complete its generation.

11. Cube source view of the college management system.

3-D Data Cube representation of the data for Property according to time (year), student_name, and Grades.

⮚ Roll-up: A roll-up involves summarizing the data along a dimension. The summarization rule
might be computing totals along a hierarchy or applying a set of formulas.
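
For instance, a minimal roll-up sketch (pandas assumed; the monthly sales table is invented for
illustration) totals the measure from the Month level up to the Year level:

import pandas as pd

# Detailed data at the Month level of the Time hierarchy.
monthly = pd.DataFrame({
    "year":  [2001, 2001, 2001, 2002, 2002],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb"],
    "sales": [100, 120, 90, 130, 110],
})

# Roll-up: summarize along the Time dimension from Month up to Year.
yearly = monthly.groupby("year")["sales"].sum()
print(yearly)
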

⮚ Drill Down: A Drill Down allows the user to navigate among levels of data ranging from the
most summarized (up) to the most detailed (down).
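
Continuing the same invented example (pandas assumed, purely illustrative), drilling down moves
from the summarized yearly view back to the more detailed year-and-month view:

import pandas as pd

monthly = pd.DataFrame({
    "year":  [2001, 2001, 2001, 2002, 2002],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb"],
    "sales": [100, 120, 90, 130, 110],
})

# Summarized (rolled-up) view at the Year level.
yearly = monthly.groupby("year")["sales"].sum()

# Drill-down: navigate to the more detailed Year -> Month level.
per_month = monthly.groupby(["year", "month"])["sales"].sum()

print(yearly)
print(per_month)
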
