DMBI Practical 1 To 17 Rahul Final
Practical: 1
Aim: List & explain various Data Mining tools and Data Warehouse tools.
SEM- 6 ICT-CLASS-B
ENROLL: 211310132117 DMBI
SAS Data Miner allows users to analyze big data and provides accurate insights for timely decision-
making purposes. SAS has a distributed memory processing architecture that is highly scalable.
It is suitable for data mining, optimization, and text mining purposes.
DataMelt is a computation and visualization environment which offers an interactive structure for
data analysis and visualization. It is primarily designed for students, engineers, and scientists. It is
also known as DMelt.
DMelt is a multi-platform utility written in Java. It can run on any operating system which is
compatible with JVM (Java Virtual Machine). It consists of science and mathematics libraries.
• Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
• Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms, curve fitting, etc.
DMelt can be used for the analysis of the large volume of data, data mining, and statistical analysis.
It is extensively used in natural sciences, financial markets, and engineering.
4) Rattle:
Rattle is a GUI-based data mining tool. It uses the R statistical programming language. Rattle
exposes the statistical power of R by offering significant data mining features. While Rattle has a
comprehensive and well-developed user interface, it also has an integrated log code tab that
produces R code duplicating any GUI operation.
The data set produced by Rattle can be viewed and edited. Rattle also gives the user the facility to
review the code, use it for many purposes, and extend the code without any restriction.
5) Rapid Miner:
RapidMiner is one of the most popular predictive analytics systems, created by the company of
the same name. It is written in the Java programming language. It offers an integrated
environment for text mining, deep learning, machine learning, and predictive analytics.
The tool can be used for a wide range of applications, including business applications,
commercial applications, research, education, training, application development, and machine
learning.
RapidMiner provides its server on-premises as well as in public or private cloud infrastructure. It
has a client/server model as its base. RapidMiner comes with template-based frameworks that
enable fast delivery with few errors (which are commonly expected in the manual coding
process).
6) Weka:
Weka is an open-source machine learning software with a vast collection of algorithms for data
mining. It was developed by the University of Waikato, in New Zealand, and it's written in
Java.
It supports different data mining tasks, like preprocessing, classification, regression, clustering,
and visualization, in a graphical interface that makes it easy to use. For each of these tasks, Weka
provides built-in machine learning algorithms which allow you to quickly test your ideas and
deploy models without writing any code. To take full advantage of this, you need to have a
sound knowledge of the different algorithms available so you can choose the right one for your
particular use case.
Data Warehouse Tools:
1) Amazon Redshift:
Amazon Redshift is built around industry-standard SQL, with additional functionality to manage
very large datasets and support high-performance analysis and reporting of these data. It makes it
quick and easy to work with data in open formats, and it integrates with and connects to the
AWS ecosystem. It can also query and export data to and from the data lake.
2) Microsoft Azure:
Azure is a cloud computing platform that was launched by Microsoft in 2010. Microsoft Azure
is a cloud computing service provider for building, testing, deploying, and managing
applications and services through Microsoft-managed data centers.
Azure is a public cloud computing platform that offers Infrastructure as a Service (IaaS),
Platform as a Service (PaaS), and Software as a Service (SaaS). The Azure cloud platform
provides more than 200 products and cloud services such as Data Analytics, Virtual Computing,
Storage, Virtual Network, Internet Traffic Manager, Web Sites, Media Services, Mobile
Services, Integration, etc. Azure facilitates easy portability and a genuinely compatible
platform between on-premises environments and the public cloud.
Azure provides a range of cross-connections including virtual private networks (VPNs), caches,
content delivery networks (CDNs), and ExpressRoute connections to improve usability and
performance. Microsoft Azure provides a secure base across physical infrastructure and
operational security.
3) Google BigQuery:
BigQuery is a serverless data warehouse that allows scalable analysis over petabytes of data.
It’s a Platform as a Service that supports querying with the help of ANSI SQL. It additionally
has inbuilt machine learning capabilities. BigQuery was announced in 2010 and made generally
available in 2011.
Google BigQuery is a cloud-based big data analytics web service for processing very large
read-only data sets. BigQuery is designed for analyzing data that are in billions of rows using an
SQL-like syntax. BigQuery can run advanced analytical SQL-based queries over big sets of
data. BigQuery is not designed to substitute for relational databases or for simple CRUD
operations and queries.
It is oriented toward running analytical queries. It is a hybrid system that enables the storage of
information in columns; however, it takes on additional NoSQL features, like the data type
and the nested feature. BigQuery can be a better option than Redshift for intermittent workloads,
since with Redshift you must pay by the hour even when the cluster is idle.
4) Snowflake:
Snowflake is a cloud-based data warehousing service built on top of the Amazon Web
Services or Microsoft Azure cloud infrastructure. The Snowflake design allows storage and
compute to scale independently, so customers can use and pay for storage and
computation separately.
In Snowflake, data processing is simplified: users can do data blending, analysis, and
transformations against varied forms of data structures with one language, SQL. Snowflake
offers dynamic, scalable computing power with charges based strictly on usage. With
Snowflake, computation and storage are fully separate, and the storage cost is about the same as
storing the data on Amazon S3.
5) Amazon DynamoDB:
Amazon DynamoDB is a fully managed proprietary NoSQL database service that
supports key-value and document data structures and is offered by Amazon as part of
the Amazon Web Services portfolio. DynamoDB exposes a data model similar to that of Dynamo,
from which it derives its name, but has a completely different underlying implementation.
A partition key value is used in DynamoDB as input to an internal hash function. The output
from the hash function determines the partition within which the item is going to be kept. All
items with identical partition key values are stored together, in sorted order by sort key value.
It offers customers high availability, reliability, and incremental scalability, with no limits
on dataset size or request throughput for a given table.
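The partition-key mechanism described above can be illustrated with a small Python sketch. This is only a conceptual model: the hash function, partition count, and keys below are assumptions for demonstration and do not reflect DynamoDB's actual internal implementation.
import hashlib

NUM_PARTITIONS = 4  # illustrative partition count, not DynamoDB's real value

def partition_for(partition_key):
    # hash the partition key and map the digest onto one of the partitions
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# items with the same partition key always map to the same partition
for key in ["customer#101", "customer#102", "customer#101"]:
    print(key, "-> partition", partition_for(key))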
6) PostgreSQL:
PostgreSQL is employed as the primary data store or data warehouse for many web,
mobile, geospatial, and analytics applications. SQL Server is a database management system
that is especially used for e-commerce and for providing different data warehousing solutions.
PostgreSQL is an advanced open-source relational database that provides support for SQL features
like foreign keys, subqueries, and triggers, as well as user-defined types and functions.
Practical: 2
Aim: Study of WEKA Tool & its step-by-step Installation.
● What is WEKA?
⮚ WEKA, formally the Waikato Environment for Knowledge Analysis, is a computer program
that was developed at the University of Waikato in New Zealand, originally for the purpose of
identifying information from raw data gathered from agricultural domains.
⮚ WEKA supports many different standard data mining tasks such as data preprocessing,
classification, clustering, regression, visualization, and feature selection. The basic premise of
the application is to utilize a computer application that can be trained to perform machine
learning capabilities and derive useful information in the form of trends and patterns.
⮚ WEKA is an open-source application that is freely available under the GNU General Public
License. Originally written in C, the WEKA application has been completely rewritten
in Java and is compatible with almost every computing platform.
⮚ It is user-friendly, with a graphical interface that allows for quick setup and operation. WEKA
operates on the assumption that the user data is available as a flat file or relation; this means that
each data object is described by a fixed number of attributes that usually are of a specific type,
normally alphanumeric or numeric values.
⮚ The WEKA application gives novice users a tool to identify hidden information in databases
and file systems, with simple-to-use options and visual interfaces.
● How to Install?
⮚ The program information can be found by conducting a search on the Web for WEKA Data
Mining or going directly to the site at www.cs.waikato.ac.nz/~ml/WEKA.
⮚ The site has a very large amount of useful information on the program’s benefits and background.
New users might find some benefit from investigating the user manual for the program.
⮚ The main WEKA site has links to this information as well as past experiments for new users to
refine the potential uses that might be of particular interest to them.
⮚ When prepared to download the software it is best to select the latest application from the
selection offered on the site.
⮚ The format for downloading the application is offered in a self-installation package and is a
simple procedure that provides the complete program on the end users machine that is ready to
use when extracted.
⮚ Once the program has been loaded on the user’s machine it is opened by navigating to the
programs start option and that will depend on the user’s operating system. Figure 1 is an example
of the initial opening screen on a computer with Windows XP.
♦ Simple CLI- provides users without a graphic interface option the ability to execute
commands from a terminal window.
♦ Experimenter- this option allows users to conduct different experimental variations on data
sets and perform statistical manipulation
♦ Knowledge Flow-basically the same functionality as Explorer with drag and drop
functionality. The advantage of this option is that it supports incremental learning from
previous results
⮚ While the options available can be useful for different applications, the remaining focus of this
user guide will be on the Explorer option.
⮚ After selecting the Explorer option, the program starts and provides the user with a separate
graphical interface.
⮚ Figure 2 shows the opening screen with the available options. At first there is only the option to
select the Preprocess tab in the top left corner. This is due to the necessity to present the data set
to the application so it can be manipulated. After the data has been preprocessed the other tabs
become active for use.
⮚ There are six tabs: 1. Preprocess- used to choose the data file to be
used by the application.
2. Classify- used to test and train different learning schemes on the preprocessed data file under
experimentation.
3. Cluster- used to apply different tools that identify clusters within the data file.
4. Association- used to apply different rules to the data file that identify association within the
data.
5. Select attributes-used to apply different rules to reveal changes based on selected attributes
inclusion or exclusion from the experiment.
6. Visualize- used to see what the various manipulation produced on the data set in a 2D format, in
scatter plot and bar graph output.
● Preprocessing:
⮚ In order to experiment with the application the data set needs to be presented to WEKA in a
format that the program understands. There are rules for the type of data that WEKA will accept.
There are three options for presenting data into the program.
♦ Open File- allows for the user to select files residing on the local machine or recorded
medium
♦ Open URL- provides a mechanism to locate a file or data source from a different location
specified by the user
♦ Open Database- allows the user to retrieve files or data from a database source provided by
the user
⮚ There are restrictions on the type of data that can be accepted into the program. Originally the
software was designed to import only ARFF files; newer versions allow different file types such
as CSV, C4.5, and serialized instance formats.
⮚ The extensions for these files include .csv, .arff, .names, .bsi and .data. Figure 3 shows an
example of selection of the file weather.arff.
⮚ Once the initial data has been selected and loaded the user can select options for refining the
experimental data. The options in the preprocess window include selection of optional filters to
apply and the user can select or remove different attributes of the data set as necessary to identify
specific information.
● Classify:
⮚ The user has the option of applying many different algorithms to the data set that would in theory
produce a representation of the information used to make observation easier. It is difficult to
identify which of the options would provide the best output for the experiment.
⮚ The best approach is to independently apply a mixture of the available choices and see what
yields something close to the desired results. The Classify tab is where the user selects the
classifier choices. Figure 4 shows some of the categories.
⮚ Again, there are several options to be selected inside the Classify tab. The test options give the
user the choice of four different test mode scenarios on the data set: use training set, supplied
test set, cross-validation, and percentage split.
⮚ There is the option of applying any or all of the modes to produce results that can be compared
by the user. Additionally, inside the test options toolbox there is a dropdown menu so the user
can select various items to apply that depending on the choice can provide output options such
as saving the results to file or specifying the random seed value to be applied for the
classification.
⮚ The classifiers in WEKA have been developed to train the data set to produce output that has
been classified based on the characteristics of the last attribute in the data set. For a specific
attribute to be used, the option must be selected by the user in the options menu before testing is
performed. Once the results have been calculated, they are shown in the text box on the lower
right.
They can be saved to a file and retrieved later for comparison, or viewed within the window
after changes have been made and different results have been derived.
● Cluster:
⮚ The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences
within the data set and produce information for the user to analyze.
⮚ There are a few options within the cluster window that are similar to those described in the
classifier tab. They are use training set, supplied test set, and percentage split. The fourth option
is classes-to-clusters evaluation, which compares how well the generated clusters match a
pre-assigned class within the data.
⮚ While in cluster mode users have the option of ignoring some of the attributes from the data set.
This can be useful if there are specific attributes causing the results to be out of range or for
large data sets. Figure 5 shows the Cluster window and some of its options.
● Visualization:
⮚ The last tab in the window is the visualization tab. Within the program calculations and
comparisons have occurred on the data set.
⮚ Selections of attributes and methods of manipulation have been chosen. The final piece of the
puzzle is looking at the information that has been derived throughout the process. The user can
now actually see the fruit of their efforts in a two-dimensional representation of the information.
⮚ The first screen that the user sees when they select the visualization option is a matrix of plots
representing the different attributes within the data set plotted against the other attributes. If
necessary, there is a scroll bar to view all the produced plots.
⮚ The user can select a specific plot from the matrix to view its contents for analysis. A grid
pattern of the plots allows the user to select the attribute positioning to their liking and for better
understanding. Once a specific plot has been selected the user can change the attributes from
one view to another providing flexibility. Figure 9 shows the plot matrix view.
⮚ The scatter plot matrix gives the user a visual representation of the manipulated data sets for
selection and analysis. The choices are the attributes across the top and the same from top to
bottom giving the user easy access to pick the area of interest.
⮚ Clicking on a plot brings up a separate window of the selected scatter plot. The user can then
look at a visualization of the data of the attributes selected and select areas of the scatter plot
with a selection window or by clicking on the points within the plot to identify the point’s
specific information. Figure 10 shows the scatter plot for two attributes and the points derived
from the data set.
⮚ There are a few options to view the plot that could be helpful to the user. It is formatted similar
to an X/Y graph, yet it can show any of the attribute classes that appear on the main scatter plot
matrix. This is handy when the scale of the attribute is unable to be ascertained in one axis over
the other.
⮚ Within the plot the points can be adjusted by utilizing a feature called jitter. This option moves
the individual points so that in the event of close data points users can reveal hidden multiple
occurrences within the initial plot. Figure 11 shows an example of this point selection and the
results the user sees.
⮚ There are a few options to manipulate the view for the identification of subsets or to separate
the data points on the plot.
♦ Polyline: can be used to segment different values for additional visualization clarity on the
plot. This is useful when there are many data points represented on the graph.
♦ Rectangle: this tool is helpful to select instances within the graph for copying or
clarification.
♦ Polygon: Users can connect points to segregate information and isolate points for
reference.
⮚ This user guide is meant to assist users in their efforts to become familiar with some of the
features within the Explorer portion of the WEKA data mining software application and is used
for informational purposes only.
⮚ It is a summary of the user information found on the program’s main web site. For a more
comprehensive and in-depth version users can visit the main site
http://www.cs.waikato.ac.nz/~ml/WEKA for examples and FAQs about the program.
• Step-by-step Installation:
Step 1: Visit this website using any web browser. Click on Free Download.
Step 2: Click on Start Download. Downloading of the executable file will start shortly. Now
check for the executable file in downloads in your system and run it.
Step 3: It will prompt for confirmation to make changes to your system. Click on Yes. A setup
screen will appear; click on Next.
Step 4: The next screen shows the License Agreement; click on I Agree. The next screen is for
choosing components; all components are already marked, so don't change anything and just click
on the Install button.
Step 5: The next screen asks for the installation location, so choose a drive that has sufficient
space for the installation; it needs about 301 MB. The next screen is for choosing the Start menu
folder, so don't change anything, just click on the Install button.
Step 6: After this, the installation process will start and will take hardly a minute to complete.
Click on the Next button after the installation process is complete.
Step 7: Click on Finish to finish the installation process. Weka is successfully installed on the
system and an icon is created on the desktop.
Practical: 3
Aim: Perform analysis, preprocessing, and visualization on the following
available datasets:
1) Weather
Ans: 14
Screenshot:
Ans: 5
Screenshot:
(D) What is the data type (e.g., numeric, nominal, etc.) of the attributes in the
dataset?
Ans:
Screenshot:
2) IRIS-dataset
Ans: 150
Screenshot:
Ans: 5
Screenshot:
(D) What is the data type (e.g., numeric, nominal, etc.) of the attributes in the
dataset?
Ans: The four measurement attributes (sepallength, sepalwidth, petallength, petalwidth) are numeric, and the class attribute is nominal.
Screenshot:
Practical: 4
Aim: Create, analyze, preprocess, and visualize three types of ARFF databases
(Student, Subject, Faculty) using the Weka tool.
2. Find the count for different subjects chosen by same or different students.
@relation "Subject"
@attribute Sub_id numeric
@attribute Subject{DMBI,ITU,AI}
@attribute Learn_by{Rupali,Aditi,Dhruhi,Rutva,Mansi}
@attribute Teach_by{DV,RK,AS}
@data
1,DMBI,Rupali,DV
2,DMBI,Rupali,DV
3,DMBI,Rupali,DV
4,ITU,Aditi,RK
5,ITU,Aditi,RK
6,ITU,Aditi,RK
7,AI,Dhruhi,AS
8,AI,Dhruhi,AS
9,AI,Rutva,AS
10,AI,Mansi,AS
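The count of different subjects chosen by the students (task 2 above) can also be checked outside Weka. The sketch below is illustrative: it assumes the relation is saved as Subject.arff with valid ARFF syntax and uses scipy's ARFF reader together with pandas.
import pandas as pd
from scipy.io import arff

# load the ARFF relation created above (file name assumed)
data, meta = arff.loadarff('Subject.arff')
df = pd.DataFrame(data)

# nominal ARFF values are read as bytes, so decode them to strings
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode('utf-8')

# count how many times each subject was chosen
print(df['Subject'].value_counts())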
@relation "Student"
@attribute Enroll_No numeric
@attribute Name{Rupali,Aditi,Dhruhi,Rutva,Mansi}
@attribute Subject_Name{DMBI,ITU,AI,ADC,CPDP,MPMC}
@attribute Subject_Code{3160714,3161009,3161608,3161607,3160002,3160914}
@data
11,Rupali,DMBI,3160714
12,Rupali,ITU,3161009
13,Rupali,AI,3161608
14,Aditi,ADC,3161607
15,Aditi,CPDP,3160002
16,Dhruhi,MPMC,3160914
17,Rutva,ITU,3161009
18,Rutva,ADC,3161607
19,Mansi,CPDP,3160002
20,Dhruhi,MPMC,3160914
@relation "Faculty"
@attribute No numeric
@attribute Faculty_Name{DV,RK,AS,AKV,PR,MSG}
@attribute Subject_Name{DMBI,ITU,AI,ADC,CPDP,MPMC}
@attribute Subject_Code{3160714,3161009,3161608,3161607,3160002,3160914}
@data
001,DV,DMBI,3160714
002,RK,ITU,3161009
003,AS,AI,3161608
004,AKV,ADC,3161607
005,PR,CPDP,3160002
006,MSG,MPMC,3160914
007,RK,ITU,3161009
008,AKV,ADC,3161607
009,PR,CPDP,3160002
0010,MSG,MPMC,3160914
Practical: 5
Aim: Calculate Mean, Mode and Median using Python.
Mean:
The mean is the average value of all the values in a dataset. To calculate the mean value of a
dataset, we first need to find the sum of all the values and then divide the result by the number of
elements.
Code:
l1 = [1, 2, 3, 4, 5]
sum = 0
a = range(0, 5)
for i in a:
    sum = sum + l1[i]
i = i + 1
mean = sum / i
print("Sum is: ", sum)
print("Mean is: ", mean)
Output:
Median:
The median is the middle value among all the values in sorted order. There are two cases when
calculating the median: if the number of values is even, it is the average of the two middle values;
if the number is odd, it is the single middle value.
Code:
list1 = [5, 6, 4, 5, 7, 9]
list1.sort()
median = 0
if (len(list1) % 2 == 0):
    m1 = list1[len(list1) // 2]
    m2 = list1[len(list1) // 2 - 1]
    median = (m1 + m2) / 2
else:
    median = list1[len(list1) // 2]
print(median)
Output:
Mode:
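The mode is the value that appears most frequently in the dataset. The mode code itself is shown only as a screenshot, so the following is a minimal sketch; the sample list and the use of the standard statistics module are assumptions.
from statistics import mode

l2 = [1, 2, 2, 3, 2, 5, 4]
# mode() returns the most frequently occurring value in the list
print("Mode is: ", mode(l2))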
Output:
Practical: 6
Aim: Calculate Variance & Standard Deviation in Python.
Variance:
Variance is a measure of how notably a collection of data is spread out. If all the data values are
identical, the variance is zero. All non-zero variances are positive. A small
variance indicates that the data points are close to the mean, and to each other, whereas data
points that are highly spread out from the mean and from one another indicate a high
variance.
Standard Deviation:
Standard Deviation is a measure which shows how much variation from the mean exists. The
standard deviation indicates a “typical” deviation from the mean. It is a popular measure of
variability because it is expressed in the original units of measure of the data set.
Code:
l = [9, 108, 27, 36, 54, 27, 63]
mean = (9 + 108 + 27 + 36 + 54 + 27 + 63) / 7
print("The mean is: ", mean)

variance = sum([((i - mean) ** 2) for i in l]) / len(l)
print("The variance is: ", variance)

stan_dev = variance ** 0.5
print("The standard deviation is: ", stan_dev)
Output:
Practical: 7
Aim: Perform Data Cleaning using the pandas library in Python.
Data Set:
import pandas as pd
import numpy as np

students = [('Rupali', 22, 'AHMEDABAD', 'ADANI'),
            ('Aditi', 25, np.nan, np.nan),
            ('Jinal', np.nan, 'NVS', 'KVS'),
            ('Jinal', np.nan, 'NVS', 'KVS'),
            (np.nan, 20, 'AHMEDABAD', 'KSV'),
            ('Mansi', 25, np.nan, 'AU'),
            ('Priya', 30, 'Baroda', np.nan),
            (np.nan, 35, 'Surat', np.nan),
            ('Dhruhi', np.nan, 'Una', np.nan),
            ('Rutva', 30, 'Mumbai', 'IIT'),
            ('Rutva', 30, 'Mumbai', 'IIT'),
            (np.nan, 15, np.nan, 'AU'),
            (np.nan, np.nan, np.nan, np.nan),
            (np.nan, 20, np.nan, np.nan)]

# build the DataFrame used below (column names are illustrative; the originals are in the screenshot)
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'University'])
print(df)
# returns the number of times a null value occurs in each column
df.isnull().sum()

# returns the number of times a null value occurs in the entire dataset
df.isnull().sum().sum()
2) Remove Duplicates:
3) Fill (0):
4)
5)
Data Set:
#creating new dataset
6)
Pad:
# fill missing/NaN values with the previous ones
df_pad = df.fillna(method='pad')
print("Pad: ")
print(df_pad)
7) Bfill:
# fill missing values with the next value
df2 = df.fillna(method='bfill')
print("Filling the missing values with the next ones: ")
print(df2)
Data Set:
data1 = {'Clothes': ['pant', 't-shirt', 'kurta', 'pant', None, 'kurta', 't-shirt', 'pant']}
df3 = pd.DataFrame(data1)
print(df3)
8) Mode:
9) Mean:
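The mode and mean fills shown in the screenshots can be reproduced along the following lines. This is a sketch only: it reuses the df3 (Clothes) and df (students) DataFrames defined above, and assumes the missing categorical value is replaced by the column mode and missing numeric values by the column means.
# fill the missing categorical value with the most frequent value (mode)
df3_filled = df3.fillna(df3['Clothes'].mode()[0])
print(df3_filled)

# fill missing numeric values with the mean of each numeric column
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)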
Practical: 8
Aim: Implement the Chi-Square Test in Python using the scipy.stats module.
Code:
# job satisfaction contingency table
import scipy.stats as stats
from scipy.stats import chi2_contingency

data = [[25, 40, 30, 50], [15, 20, 25, 30], [10, 15, 20, 25]]
print(data)
Output:
Code:
# Running a Chi-Square Test of Independence in Python
import scipy.stats as stats

# run the test on the contingency table defined above
stat, p, dof, expected = chi2_contingency(data)
print('Printing stat value', stat)
print('Printing p value', p)
print('dof = %d' % dof)
print(expected)
Output:
Code:
# Assessing our results
if p < 0.05:
    print('Reject null hypothesis')
else:
    print('Fail to reject the null hypothesis')
Output:
Practical: 9
Aim: Implement Min-max normalization in Python (use a suitable dataset or a
DataFrame).
Code:
import pandas as pd
import numpy as np

numbers = [(12, 23, 34, 56),
           (93, 33, 44, 55),
           (43, 87, 89, 52),
           (93, 56, 82, 61)]
df = pd.DataFrame(numbers, columns=['A', 'B', 'C', 'D'])
print(df)
Output:
Code:
import matplotlib.pyplot as plt

df.plot(kind='bar')
Output:
Code:
# Using min-max feature scaling
# copy the data
df_min_max = df.copy()
# scale every column to [0, 1]: (x - min) / (max - min)
for column in df_min_max.columns:
    df_min_max[column] = (df_min_max[column] - df_min_max[column].min()) / (df_min_max[column].max() - df_min_max[column].min())
print(df_min_max)
Output:
Code:
import matplotlib.pyplot as plt

df_min_max.plot(kind='bar')
Output:
Practical: 10
Aim: Implement z-score normalization in Python (use a suitable dataset or a
DataFrame).
Code:
import pandas as pd
import numpy as np

numbers = [(12, 23, 34, 56),
           (93, 33, 44, 55),
           (43, 87, 89, 52),
           (93, 56, 82, 61)]
df = pd.DataFrame(numbers, columns=['A', 'B', 'C', 'D'])
print(df)
Output:
Code:
df_z_score = df.copy()
for column in df_z_score.columns:
    df_z_score[column] = (df_z_score[column] - df_z_score[column].mean()) / df_z_score[column].std()
display(df_z_score)
Output:
Code:
import matplotlib.pyplot as plt

df_z_score.plot(kind='bar')
Output:
Practical: 11
Aim: Implement simple Linear Regression in Python.
Code:
import pandas as pd
import numpy as np

numbers = [(12, 23),
           (93, 33),
           (43, 87),
           (93, 56)]
df = pd.DataFrame(numbers, columns=['A', 'B'])
print(df)
Output:
Code:
x=df['A'].values # Independent variable
Output:
Code:
y=df['B'].values
Output:
Code:
# Mean X & Y
mean_x = np.mean(x)
mean_y = np.mean(y)
n = len(x)
# least-squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - mean_x) * (y - mean_y)) / np.sum((x - mean_x) ** 2)
b0 = mean_y - b1 * mean_x
print(b1, b0)
Output:
Code:
# plotting values and regression line
import matplotlib.pyplot as plt

max_x = np.max(x) + 10   # padding of 10 assumed; the original value is not visible
min_x = np.min(x) - 10
# use separate names so the original x, y data stay intact for later use
x_line = np.linspace(min_x, max_x, 1000)
y_line = b0 + b1 * x_line
plt.plot(x_line, y_line, color='red', label='Regression Line')
plt.scatter(x, y, label='Data Points')
plt.legend()
plt.show()
Output:
Code:
ss_t = 0   # total sum of squares
ss_r = 0   # residual sum of squares
for i in range(n):
    y_pred = b0 + b1 * x[i]
    ss_t += (y[i] - mean_y) ** 2
    ss_r += (y[i] - y_pred) ** 2
r2 = 1 - (ss_r / ss_t)
print("R^2:", r2)
Output:
Practical: 12
Aim: Study of APRIORI algorithm in detail.
Apriori Algorithm
In data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to
operate on databases containing transactions (for example, collections of items bought by
customers, or details of a website frequentation).
Other algorithms are designed for finding association rules in data having no transactions (Winepi
and Minepi), or having no timestamps (DNA sequencing).
Overview:
The whole point of the algorithm (and data mining, in general) is to extract useful information
from large amounts of data. For example, the information that a customer who purchases a
keyboard also tends to buy a mouse at the same time is acquired from the association rule below:
Support: The percentage of task-relevant data transactions for which the pattern is true.
Confidence: The measure of certainty or trustworthiness associated with each discovered pattern.
The algorithm aims to find the rules which satisfy both a minimum support threshold and a
minimum confidence threshold (Strong Rules).
Item: article in the basket.
Example :
A database has five transactions. Let the min sup = 50% and min conf = 80%.
Solution:
Step 1: Find all Frequent Itemsets
Frequent Itemsets:
{A} {B} {C} {E} {A C} {B C} {B E} {C E} {B C E}
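The support and confidence calculations behind these itemsets can be reproduced with a short Python sketch. The level-wise search below relies on the Apriori (downward-closure) property to stop early, but it enumerates candidates by brute force rather than with the full candidate-join step. The five transactions are the classic textbook ones and reproduce the frequent itemsets listed above; they are an assumption, since the actual transaction table is only in the screenshot.
from itertools import combinations

# illustrative transaction database (assumed)
transactions = [
    {'A', 'C', 'D'},
    {'B', 'C', 'E'},
    {'A', 'B', 'C', 'E'},
    {'B', 'E'},
    {'A', 'B', 'C', 'E'},
]
min_support = 0.5

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent = []
for k in range(1, len(items) + 1):
    level = [set(c) for c in combinations(items, k) if support(set(c)) >= min_support]
    if not level:
        break  # by the Apriori property, no larger itemset can be frequent
    frequent.extend(level)

for itemset in frequent:
    print(sorted(itemset), 'support =', support(itemset))

# confidence of the rule {B, C} => {E}
print('conf({B,C} => {E}) =', support({'B', 'C', 'E'}) / support({'B', 'C'}))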
Practical: 14
Aim: Study of Decision tree induction algorithm in detail.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buys_computer that indicates whether a customer
at a company is likely to buy a computer or not. Each internal node represents a test on an
attribute. Each leaf node represents a class.
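A tree of this kind can also be induced programmatically. The sketch below uses scikit-learn on a tiny buys_computer-style table; the toy data values, the one-hot encoding step, and the choice of entropy (information gain) as the attribute-selection measure are illustrative assumptions rather than part of this practical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# toy buys_computer-style training data (values assumed for illustration)
data = pd.DataFrame({
    'age':      ['youth', 'youth', 'middle', 'senior', 'senior', 'middle'],
    'student':  ['no', 'yes', 'no', 'no', 'yes', 'yes'],
    'income':   ['high', 'high', 'high', 'medium', 'low', 'low'],
    'buys_computer': ['no', 'yes', 'yes', 'yes', 'yes', 'yes'],
})

# one-hot encode the categorical attributes so the classifier can use them
X = pd.get_dummies(data.drop(columns='buys_computer'))
y = data['buys_computer']

# induce the tree using information gain (entropy) as the splitting criterion
tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))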
Input:
    Data partition, D, which is a set of training tuples and their associated class labels.
    attribute_list, the set of candidate attributes.
    Attribute_selection_method, a procedure to determine the splitting criterion that best
    partitions the data tuples into individual classes. This criterion includes a
    splitting_attribute and either a splitting point or a splitting subset.
Output: A decision tree.
Method:
    create a node N;
Tree Pruning:
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers.
The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree: pre-pruning, which halts tree construction early, and
post-pruning, which removes subtrees from a fully grown tree.
Cost Complexity:
The cost complexity is measured by the following two parameters: the number of leaves in the
tree, and the error rate of the tree.
Practical: 16
Aim: Make a case study and design star schema of a Data Warehouse for an
organization by identifying facts and dimensions.
Datawarehouse:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision-making process.”
In other words
“A data warehouse is a collection of data designed to support management decision making. Data
warehouses contain a wide variety of data that present a coherent picture of business conditions
at a single point in time.”
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there will
be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from
3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a
transaction system, where often only the most recent data is kept. For example, a transaction
system may hold only the most recent address of a customer, whereas a data warehouse can hold all
addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Data Source Layer
This represents the different data sources that feed data into the data warehouse. The data source
can be of any format: a plain text file, a relational database, another type of database, an Excel file,
etc., can all act as a data source.
● Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems
data.
● Web server logs with user browsing data.
● Internal market research data.
● Third-party data, such as census data, demographics data, or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely some minimal
data cleansing, but it is unlikely that any major data transformation is done here.
Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart.
Having one common area makes it easier for subsequent data processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from a
transactional nature to an analytical nature. This layer is also where data cleansing happens. The
ETL design phase is often the most time-consuming phase in a data warehousing project, and an
ETL tool is often used in this layer.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality, three types of
entities can be found here: data warehouse, data mart, and operational data store (ODS). In any
given system, you may have just one of the three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the underlying data
transformation rules, but they do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. This can be in the form of a tabular or graphical
report in a browser, an emailed report that gets automatically generated and sent every day, or an
alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool
is used in this layer.
Metadata Layer
This is where information about the data stored in the data warehouse system is stored. A logical
data model would be an example of something that's in the metadata layer. A metadata tool is often
used to manage metadata.
System Operations Layer
This layer includes information on how the data warehouse system operates, such as ETL job
status, system performance, and user access history.
Dimensional Data Model: Dimensional data model is commonly used in data warehousing
systems. This section describes this modeling technique, and the two common schema types, star
schema and snowflake schema.
Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This
section explains the problem, and describes the three ways of handling this problem with examples.
Conceptual Data Model: What is a conceptual data model, its features, and an example of this
type of data model.
Logical Data Model: What is a logical data model, its features, and an example of this type of data
model.
Physical Data Model: What is a physical data model, its features, and an example of this type of
data model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model.
This section compares the three different types of data models.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
Factless Fact Table: A fact table without any fact may sound silly, but there are real life instances
when a factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: When to use it and why is it useful.
Conformed Dimension: Discusses the concept of a conformed dimension: What is it and why is
it important.
To understand dimensional data modeling, let's define some of the terms commonly used in this
type of modeling:
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes
within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter
→ Month → Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount
would be such a measure. This measure is stored in the fact table with the appropriate granularity.
For example, it can be sales amount by store by day. In this case, the fact table would contain three
columns: A date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For
example, the lookup table for the Quarter attribute would include a list of all of the quarters
available in the data warehouse. Each row (each quarter) may have several fields, one for the
unique ID that identifies the quarter, and one or more additional fields that specifies how that
particular quarter is represented on a report (for example, first quarter of 2001 may be represented
as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more
lookup tables, but fact tables do not have direct relationships to one another. Dimensions and
hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup
tables.
In designing data models for data warehouses / data marts, the most commonly used schema types
are Star Schema and Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and business needs.
Personally, I am partial to snowflakes, when there is a business case to analyze the information at
that particular level.
SCHEMA
✔ Star Schema
In the star schema design, a single object (the fact table) sits in the middle and is radially
connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is
represented as a single table. The primary key in each dimension table is related to a foreign key
in the fact table.
All measures in the fact table are related to all the dimensions that fact table is related to. In other
words, they all have the same level of granularity.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star
can have more than one fact table.
Let's look at an example: Assume our data warehouse keeps store sales data, and the different
dimensions are time, store, product, and customer. In this case, the figure on the left represents our
star schema. The lines between two tables indicate that there is a primary key / foreign key
relationship between the two tables. Note that different dimensions are not related to one another.
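To make the example concrete, the fact and dimension tables of such a star schema can be sketched in Python with pandas. The table layouts, keys, and values below are illustrative assumptions; in a real warehouse these would be relational tables with primary key / foreign key constraints.
import pandas as pd

# dimension (lookup) tables, each with its own primary key
dim_time = pd.DataFrame({'date_id': [1, 2], 'date': ['2024-01-01', '2024-01-02']})
dim_store = pd.DataFrame({'store_id': [10, 11], 'store_name': ['Central', 'West']})
dim_product = pd.DataFrame({'product_id': [100, 101], 'product_name': ['Keyboard', 'Mouse']})

# fact table: one row per store, product, and day, holding the sales_amount measure
fact_sales = pd.DataFrame({
    'date_id':      [1, 1, 2],
    'store_id':     [10, 11, 10],
    'product_id':   [100, 101, 101],
    'sales_amount': [1200.0, 350.0, 700.0],
})

# each foreign key in the fact table joins to the primary key of a dimension table
report = (fact_sales
          .merge(dim_time, on='date_id')
          .merge(dim_store, on='store_id')
          .merge(dim_product, on='product_id'))
print(report[['date', 'store_name', 'product_name', 'sales_amount']])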
✔ Snowflake Schema
The snowflake schema is an extension of the star schema, where each point of the star explodes
into more points. In a star schema, each dimension is represented by a single dimensional table,
whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables,
each representing a level in the dimensional hierarchy.
For example, consider a Student Dimension whose address field has 3 different
hierarchies:
1. Address → City
2. Address → Country
3. Address → State
We will then have separate lookup tables in the snowflake schema: a lookup table for City, a
lookup table for State, and a lookup table for Country, each normalized out of the Address level
of the Student Dimension. A sample snowflake schema illustrating
the above relationships in the Student Dimension is shown to the right.
The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables. The main disadvantage
of the snowflake schema is the additional maintenance effort needed due to the increased number
of lookup tables.
Design star schema of a Data Warehouse using Microsoft SQL Server 2008
1. Select Microsoft SQL Server 2008 > SQL Server Business Intelligence Development
Studio from the Programs menu to launch Business Intelligence Development Studio.
2. Select File > New > Project.
3. In the New Project dialog box, select the Business Intelligence Projects project type.
4. Select the Analysis Services Project template.
5. Name the new project Interactive Training Program and select a convenient location to save
it.
6. Click OK to create the new project.
Figure 1.2 shows the Solution Explorer window of the new project, ready to be populated
with objects.
1. Right-click on the Data Sources folder in Solution Explorer and select New Data Source.
2. Read the first page of the Data Source Wizard and click Next.
3. You can base a data source on a new or an existing connection. Because you don't have any
existing connections, click New.
4. In the Connection Manager dialog box, select the server containing your analysis services
sample database from the Server Name combo box.
5. Fill in your authentication information.
6. Select the Native OLE DB\Microsoft Jet 4.0 OLE DB Provider (this is the default
provider).
7. Select the FoodMart database. Figure 1.3 shows the filled-in Connection Manager dialog
box.
Figure 1.5: Data source wizard
13. Accept the default data source name and click Finish.
A data source view is a persistent set of tables from a data source that supply the data. BIDS
also includes a wizard for creating data source views, which you can invoke by right-clicking on
the Data Source Views folder in Solution Explorer.
1. Right-click on the Data Source Views folder in Solution Explorer and select New Data Source
View.
2. Read the first page of the Data Source View Wizard and click Next.
3. Select the Foodmart data source and click Next. Note that you could also launch the Data
Source Wizard from here by clicking New Data Source.
4. Select the Customer (dbo) table, Product table, sales_fact_1997 table, and time_by_day table
in the Available Objects list and click the > button to move them to the Included Objects list.
The sales_fact_1997 table will be the fact table in the new cube.
5. Click Next.
6. Click Finish.
BIDS will automatically display the schema of the new data source view, as shown in Figure 1.8.
Practical: 17
Aim: Make a case study for data cube creation process. Also list and explain
various OLAP operations like drill down, slice, pivot and roll up.
What Is OLAP?
OLAP (online analytical processing) is computer processing that enables a user to easily
and selectively extract and view data from different points of view. OLAP data is stored in a
multidimensional database. Whereas a relational database can be thought of as two-dimensional,
a multidimensional database considers each data attribute (such as product, geographic sales
region, and time period) as a separate "dimension."
OLAP software can locate the intersection of dimensions (all products sold in the Eastern
region above a certain price during a certain time period) and display them. Attributes such as time
periods can be broken down into sub attributes.
OLAP CUBE:
A cube can be considered a generalization of a three-dimensional spreadsheet. For example,
a company might wish to summarize financial data by product, by time-period, and by city to
compare actual and budget expenses. Product, time, city, and scenario (actual versus budget) are the data's dimensions.
Cube is a shortcut for multidimensional dataset, given that data can have an arbitrary
number of dimensions. The term hypercube is sometimes used, especially for data with more than
three dimensions.
OLAP data is typically stored in a star schema or snowflake schema in a relational data
warehouse or in a special-purpose data management system. Measures are derived from the records
in the fact table and dimensions are derived from the dimension tables.
8. Select measures.
⮚ Roll-up: A roll-up involves summarizing the data along a dimension. The summarization rule
might be computing totals along a hierarchy or applying a set of formulas.
⮚ Drill Down: A Drill Down allows the user to navigate among levels of data ranging from the
most summarized (up) to the most detailed (down).
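Roll-up and drill-down can be mimicked on a small table with pandas group-by operations. The sales figures and the year → quarter → city hierarchy below are illustrative assumptions, not data from the cube built above.
import pandas as pd

# illustrative sales data at the quarter / city grain
sales = pd.DataFrame({
    'year':    [2023, 2023, 2023, 2023],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2'],
    'city':    ['Ahmedabad', 'Surat', 'Ahmedabad', 'Surat'],
    'amount':  [100, 80, 120, 90],
})

# roll-up: summarize along the time hierarchy, from quarter up to year
print(sales.groupby('year')['amount'].sum())

# drill-down: move back to the more detailed quarter / city level
print(sales.groupby(['year', 'quarter', 'city'])['amount'].sum())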