1) Explain in Detail Drill Up & Drill Down Operations
Drill Up
Drill-up lets you move to higher, more summarized levels of the data.
It is usually presented together with its opposite, drill-down. Drill-up aggregates the data in the
cube either by ascending a concept hierarchy for a dimension or by dimension reduction, so that
measures are returned at a less detailed granularity. To see the broader perspective defined by the
concept hierarchy, the values of the lower level are grouped and combined. Because fewer specifics
are kept, one or more dimensions may be removed from the data cube when this operation is run.
Some sources treat drill-up and roll-up as synonyms.
Examples of drill-up (roll-up) hierarchies:
• Do you have any data that is based on a specific date? Then the levels of grouping could
look something like this, from broadest to most detailed: Year, Quarter, Month, Day, and Hour.
• Do you have any geographic information? The levels might then look like this: Country,
Province/State, City, and Postal Code.
Drill down
Drilling down means moving into lower levels of hierarchical data without the need to
change the graph.
Drill-down is the opposite of drill-up. It is carried out either by descending a concept
hierarchy for a dimension or by adding a new dimension. It lets a user derive highly detailed
data from a less detailed cube. Consequently, when the operation is run, one or more
dimensions may be added to the data cube to provide more detailed information.
Example of drill-down:
Do you have data on a daily basis? To see the most recent results, it makes sense to default
the dataset to show the bottom level. Drilling down then lets the user easily see new results
while exploring further trends.
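The two operations can be illustrated with a minimal Python sketch. The sales records, the `LEVELS` list and the `aggregate` helper are all hypothetical and only serve to show how moving up or down a time hierarchy changes the granularity of the same measure.

```python
from collections import defaultdict

# Hypothetical daily sales facts: (year, quarter, month, day, amount)
SALES = [
    (2023, "Q1", "Jan", 5, 100),
    (2023, "Q1", "Jan", 20, 150),
    (2023, "Q1", "Feb", 11, 200),
    (2023, "Q2", "Apr", 2, 50),
    (2024, "Q1", "Jan", 9, 300),
]

LEVELS = ["year", "quarter", "month", "day"]  # broadest -> most detailed


def aggregate(rows, level):
    """Sum the amount at the chosen level of the time hierarchy."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for row in rows:
        key = row[:depth]      # e.g. (2023, "Q1") for level "quarter"
        totals[key] += row[-1]
    return dict(totals)


by_month = aggregate(SALES, "month")      # detailed view
by_quarter = aggregate(SALES, "quarter")  # drill up one level
by_year = aggregate(SALES, "year")        # drill up again

print(by_year)
print(by_quarter)
print(by_month)
```

Reading the three results from `by_year` down to `by_month` corresponds to drilling down; reading them in the opposite direction corresponds to drilling up.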
Filtering Data
Selecting filters for the reports allows you to see only the information you need. You can, for
example, select to view time logged on a specific project or all planned work for your team.
The Filter by box at the top of the report shows which filters are applied.
To filter data in a report:
1. Click the Filter by box to display a list of filter options.
2. Select the data you want to include in the report.
• Use the search box to search for projects, teams, accounts, etc. To add a filter, select its
check-box. To remove a filter, clear the check-box or click the x beside its name in
the Filter by box.
• If you select to filter by issues, you can also choose to include sub-tasks.
• Click Back to return to the list of filters.
Grouping Data
Grouping data in your reports helps you to structure your information in a meaningful way.
The groups are displayed in the report according to Jira hierarchy. In order to avoid duplicating
data in its reports, Tempo assigns time record data to the groups that they have been added to
most recently. This logic applies to the following:
• Teams
• Roles
• Components
• Fix versions
This means that, if a time record is associated with an employee who is a member of multiple
teams, Tempo reports will place the time record under the team that the employee joined most
recently. Likewise, if a time record has multiple components, the one added most recently will
be reflected in the reports.
To group data in a report:
1. Click the Group by box to display a list of possible choices. Select the groups you want
to add.
2. To remove a level of grouping, click Group by, and then click x to the right of the group
level.
Sorting Data
Sorting by alphabetical or numerical order allows you to organize and display your report's
data differently. You can sort a report by the data in a particular column by clicking that
column’s heading. This then sorts data according to that column’s ascending or descending
order: the text is sorted from A to Z, numerical data is sorted from highest to lowest, and
time/date data is sorted from earliest to latest.
• Up and down arrows next to a column name indicate that data is being sorted by that
column.
• To reverse the sort order, click the column heading a second time.
• In a report with multiple grouping levels, data is grouped by the top-level group. For
example, the report above shows the highest number of planned hours at the top, sorted by
user.
3) What are Charts? List the different charts used. Discuss the pie chart in detail
Charts are an essential part of working with data, as they are a way to condense large amounts
of data into an easy to understand format. Visualizations of data can bring out insights to
someone looking at the data for the first time, as well as convey findings to others who won’t
see the raw data. There are countless chart types out there, each with different use cases. Often,
the most difficult part of creating a data visualization is figuring out which chart type is best for
the task at hand.
Pie Charts
When it comes to statistical types of graphs and charts, the pie chart (or the circle chart) has a
crucial place and meaning. It displays data and statistics in an easy-to-understand ‘pie-slice’
format and illustrates numerical proportion.
Each pie slice is proportional to the size of a particular category relative to the group as a
whole. In other words, the pie chart breaks a group down into smaller pieces. It shows part-whole
relationships.
To make a pie chart, you need a list of categorical variables and numerical variables.
Pie Chart Uses:
▪ When you want to create and represent the composition of something.
▪ It is very useful for displaying nominal or ordinal categories of data.
▪ To show percentage or proportional data.
▪ When comparing areas of growth within a business such as profit.
▪ Pie charts work best for displaying data for 3 to 7 categories.
Example:
The pie chart below represents the proportion of types of transportation used by 1000 students to
go to their school.
Types Of Extensions:
• DOC/DOCX: A Microsoft Word document. DOC was the original extension used for Word
documents, but Microsoft changed the format when Word 2007 debuted. Word documents are
now based on the XML format, hence the addition of the “X” at the end of the extension.
• XLS/XLSX: A Microsoft Excel spreadsheet.
• PNG: Portable Network Graphics, a lossless image file format.
• HTM/HTML: The HyperText Markup Language format for creating web pages online.
• PDF: The Portable Document Format originated by Adobe, and is used to maintain formatting
in distributed documents.
• EXE: An executable format used for programs you can run.
.CSV file
A CSV is a comma-separated values file, which allows data to be saved in a tabular format. CSVs
look like a garden-variety spreadsheet but with a .csv extension.
CSV files can be used with almost any spreadsheet program, such as Microsoft Excel or Google
Sheets. They differ from other spreadsheet file types because you can only have a single
sheet in a file, and they cannot save cell, column, or row formatting. Also, you cannot save
formulas in this format.
How do I save CSV files?
Saving CSV files is relatively easy; you just need to know where to change the file type.
Under the "File name" section in the "Save As" tab, you can select "Save as type" and change it to
"CSV (Comma delimited) (*.csv)". Once that option is selected, you are on your way to quicker and
easier data organization. This should be the same for both Apple and Microsoft operating systems.
CSV File Format
Usually the first line in a CSV file contains the table column labels. Each of the subsequent lines
represents a row of the table. Commas separate each cell in the row, which is where the name comes
from.
Here is an example of a CSV file. The example has three columns, labeled 'name', 'id', and
'favorite food'. It has five rows including the header row.
name, id, favorite food
quincy, 1, hot dogs
beau, 2, cereal
abbey, 3, pizza
mrugesh, 4, ice cream
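As a short sketch, the sample file above can be read with Python's standard `csv` module. The variable names are illustrative, and the text is held in a string here rather than in a real file; `skipinitialspace=True` is used because each comma in the sample is followed by a space.

```python
import csv
import io

CSV_TEXT = """name, id, favorite food
quincy, 1, hot dogs
beau, 2, cereal
abbey, 3, pizza
mrugesh, 4, ice cream
"""

# The first line supplies the column labels; every later line becomes one row.
reader = csv.DictReader(io.StringIO(CSV_TEXT), skipinitialspace=True)
rows = list(reader)

for row in rows:
    print(row["name"], row["id"], row["favorite food"])
```

Note that every value is read as text, so the `id` column would need an explicit `int(...)` conversion before any numeric work.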
5) How Data Grouping & Sorting is useful in reporting. Justify with
suitable example.
Grouping Data:
After designing the basic layout, you may decide that grouping the records by certain fields or
other criteria would make the report easier to read. Grouping allows you to separate groups of
records visually and display introductory and summary data for each group. The group break is
based on a grouping expression. This expression is usually based on one or more recordset fields
but it can be as complex as you like.
You can group the data in your reports using the C1ReportDesigner application or using code.
Sorting Data:
You can sort data in reports the following two ways:
• Sort the data source object itself (for example, using a SQL statement with an ORDER BY
clause).
• Add groups to the report and specify how each group should be sorted using the
group's GroupBy and Sort properties.
Group sorting is done using the DataView.Sort property, which takes a list of column names
only (not expressions on column names). So if your grouping expression is DatePart("yyyy",
dateColumn), the control will actually sort on the dates in the dateColumn field, not on the years
of those dates as most would expect.
To sort based on the years, add a calculated column to the data table (by changing the SQL
statement), and then group/sort on the calculated column instead. See the Sort property for an
XML discussion of this, including a sample.
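The idea of grouping and sorting on a calculated column can be sketched outside the reporting tool. The following Python example is not C1Report code; the order records are hypothetical, and it only shows the general technique of deriving the year first and then sorting and grouping on that derived value.

```python
from datetime import date
from itertools import groupby

# Hypothetical order records: (order_date, amount)
ORDERS = [
    (date(2023, 11, 3), 120),
    (date(2022, 5, 17), 80),
    (date(2023, 1, 9), 40),
    (date(2022, 12, 30), 60),
]

# Add the calculated column (the year) to every record.
with_year = [(d.year, d, amount) for d, amount in ORDERS]

# Sort on the calculated column, then group on it and total each group.
with_year.sort(key=lambda rec: rec[0])
report = {
    year: sum(amount for _, _, amount in group)
    for year, group in groupby(with_year, key=lambda rec: rec[0])
}

for year, total in report.items():
    print(year, total)
```

Each group here carries a summary value (the yearly total), which is the kind of per-group summary a grouped report displays.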
A financial report provides a template for creating financial reports such as profit/loss
statements, customer profitability records and balance sheets.
7) What is Data Reduction? Explain Types of Data Reduction in detail
Data reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.
When facing a large dataset it is also appropriate to reduce its size, in order to make learning
algorithms more efficient, without sacrificing the quality of the results obtained.
There are three main criteria to determine whether a data reduction technique should be used:
▪ Efficiency. The application of learning algorithms to a dataset smaller than the original one
usually means a shorter computation time.
▪ Accuracy. Data reduction techniques should not significantly compromise the accuracy of the
model generated.
▪ Simplicity. It is important that the models generated be easily translated into simple rules that
can be understood by experts in the application domain.
There are several different data reduction techniques that can be used in data mining,
including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data
by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset
that are most relevant to the task at hand.
Bivariate Analysis
Bivariate analysis is used to find out if there is a relationship between two different variables.
Something as simple as creating a scatterplot by plotting one variable against another on a
Cartesian plane (think X and Y axis) can sometimes give you a picture of what the data is trying
to tell you. If the data seems to fit a line or curve then there is a relationship or correlation between
the two variables. For example, one might choose to plot caloric intake versus weight.
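Besides plotting, the strength of a linear relationship between two variables can be summarized with the Pearson correlation coefficient. The sketch below uses made-up caloric intake and weight figures, and the `pearson_r` helper is written out by hand purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical observations for five people
calories = [1800, 2000, 2200, 2500, 3000]
weight_kg = [60, 66, 67, 75, 83]

r = pearson_r(calories, weight_kg)
print(round(r, 3))
```

A value of r close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or no linear relationship. Correlation alone does not show that one variable causes the other.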
Multivariate Analysis
Multivariate analysis is the analysis of three or more variables. There are many ways to perform
multivariate analysis depending on your goals. Some of these methods include:
▪ Additive Tree
▪ Canonical Correlation Analysis
▪ Cluster Analysis
▪ Correspondence Analysis / Multiple Correspondence Analysis
▪ Factor Analysis
▪ Generalized Procrustean Analysis
▪ MANOVA
▪ Multidimensional Scaling
▪ Multiple Regression Analysis
▪ Partial Least Square Regression
▪ Principal Component Analysis / Regression / PARAFAC
▪ Redundancy Analysis
Feature selection
The purpose of feature selection, also called feature reduction, is to eliminate from the dataset
a subset of variables which are not deemed relevant for the purpose of the data mining activities.
Feature selection methods can be classified into three main categories: filter methods, wrapper
methods and embedded methods.
▪ Filter methods. Filter methods select the relevant attributes before moving on to the
subsequent learning phase, and are therefore independent of the specific algorithm being used.
▪ Wrapper methods. Wrapper methods evaluate candidate subsets of attributes by running the
learning algorithm on each subset and comparing the accuracy of the resulting models.
▪ Embedded methods. For the embedded methods, the attribute selection process lies inside the
learning algorithm, so that the selection of the optimal set of attributes is directly made during
the phase of model generation.
In particular, three distinct myopic search schemes can be followed: forward, backward and
forward–backward search.
▪ Forward. The forward search scheme, also referred to as bottom-up search, starts from an
empty set of attributes and adds one attribute at each step.
▪ Backward. The backward search scheme, also referred to as top-down search, starts from the
full set of attributes and removes one attribute at each step.
▪ Forward–backward. The forward–backward method represents a trade-off between forward
and backward search: at each step an attribute may be added or removed, and the search stops
when no change improves the selection.
                     Male    Female
Chocolate Candy       42       77
Fruit Candy           58       23
This is a contingency table comparing the variable ‘Gender’ with the variable ‘Candy
Preference’. Across the top of the table are the two gender options for this
particular study: ‘male students’ and ‘female students’. Down the left side are the two candy
preference options: ‘chocolate’ and ‘fruit’. The data in the center of the table gives the
reported candy preferences of the students polled during the study; each gender column
sums to 100.
In this contingency table, columns represent computer types and rows represent genders. Cell
values are frequencies for each combination of gender and computer type. Totals are in the
margins. Notice the grand total in the bottom-right margin.
Marginal Distribution
These distributions represent the frequency distribution of one categorical variable without
regard for other variables. Unsurprisingly, you can find these distributions in the margins of a
contingency table.
The following marginal distribution examples correspond to the blue highlights.
For example, the marginal distribution of gender without considering computer type is the
following:
Males: 106
Females: 117
Alternatively, the marginal distribution of computer types is the following:
PC: 96
Mac: 127
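Marginal distributions are simply the row and column totals of a contingency table. As a small sketch, the following Python code computes them for the candy preference table given earlier (its cell values are listed in full); the dictionary layout and variable names are illustrative choices.

```python
# Candy preference table: rows are preferences, columns are genders.
table = {
    "Chocolate Candy": {"Male": 42, "Female": 77},
    "Fruit Candy": {"Male": 58, "Female": 23},
}

# Marginal distribution of candy preference (row totals).
row_totals = {pref: sum(cells.values()) for pref, cells in table.items()}

# Marginal distribution of gender (column totals).
col_totals = {}
for cells in table.values():
    for gender, count in cells.items():
        col_totals[gender] = col_totals.get(gender, 0) + count

grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row totals and the column totals each add up to the same grand total, which is the value that sits in the bottom-right margin of a contingency table.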
Discretization techniques
▪ Subjective subdivision. Subjective subdivision is the most popular and intuitive method.
Classes are defined based on the experience and judgment of experts in the application domain.
▪ Subdivision into classes. Subdivision can be based on classes of equal size or equal width.
▪ Hierarchical discretization. Discretization is based on hierarchical relationships between
concepts and may be applied to categorical attributes, as with the hierarchical relationships
between provinces and regions.
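The difference between equal-width and equal-size subdivision can be sketched as follows. Both helper functions and the list of ages are hypothetical; the equal-width version assumes the values are not all identical, since the interval width would otherwise be zero.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_size_bins(values, k):
    """Assign each value to one of k bins holding (almost) equally many values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


ages = [18, 22, 25, 31, 40, 58]  # hypothetical continuous attribute

width_bins = equal_width_bins(ages, 3)
size_bins = equal_size_bins(ages, 3)
print(width_bins)
print(size_bins)
```

With skewed data such as these ages, equal-width bins can end up very unevenly filled, whereas equal-size bins always hold about the same number of observations.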
12) Write Short note on Logistic Regression
Logistic regression is used for predicting the categorical dependent variable using a given set of
independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but
instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0
and 1. Logistic regression is very similar to linear regression except in how they are
used. Linear regression is used for solving regression problems, whereas logistic regression is
used for solving classification problems.
Logistic Regression Equation:
We use logistic function or sigmoid function to calculate probability in logistic regression. The
logistic function is a simple S-shaped curve used to convert data into a value between 0 and 1.
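The standard logistic (sigmoid) function is 1 / (1 + e^(-z)), where z is a weighted sum of the independent variables. The sketch below shows how it turns that sum into a probability; the weights, the bias and the 0.5 decision threshold are hypothetical values chosen for illustration, not parameters of a trained model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability of the positive class for one observation x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


weights, bias = [0.8, -0.4], -0.2  # hypothetical model parameters
p = predict_proba([2.0, 1.0], weights, bias)

# Turn the probability into a class label with a 0.5 threshold.
label = 1 if p >= 0.5 else 0
print(round(p, 4), label)
```

Here z = -0.2 + 0.8 * 2.0 - 0.4 * 1.0 = 1.0, so the predicted probability is sigmoid(1.0), which is above the threshold and gives class 1.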
• In health care, logistic regression can be used to predict if a tumor is likely to be benign or
malignant.
• In the financial industry, logistic regression can be used to predict if a transaction is fraudulent
or not.
• In marketing, logistic regression can be used to predict if a targeted audience will respond or
not.
13) Discuss the classification evaluation model using Confusion matrix, Recall,
Precision and Accuracy
Confusion matrix
It is a matrix of size 2×2 for binary classification with actual values on one axis and predicted on
another.
Terms in the confusion matrix: true positive, true negative, false negative, and false
positive with an example.
EXAMPLE
A machine learning model is trained to predict tumor in patients. The test dataset consists of 100
people.
In the detection of spam mail, it is acceptable if some spam mail remains undetected (false negative),
but it is costly if a critical mail is missed because it was classified as spam (false positive). In this
situation, false positives should be as low as possible. Here, precision is more vital than
recall.
Accuracy:
Accuracy represents the number of correctly classified data instances over the total number of
data instances.
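The three measures can be computed directly from the four cells of a confusion matrix. The counts below are hypothetical values for a test set of 100 people; they are not results from the tumor example and only illustrate the arithmetic.

```python
# Hypothetical confusion-matrix counts for a test set of 100 people
tp = 30  # true positives:  positive cases predicted as positive
tn = 55  # true negatives:  negative cases predicted as negative
fp = 5   # false positives: negative cases predicted as positive
fn = 10  # false negatives: positive cases predicted as negative

total = tp + tn + fp + fn

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
recall = tp / (tp + fn)        # share of actual positives that were found

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.2f}")
```

Precision is lowered by false positives and recall by false negatives, which is why the more important measure depends on which kind of error is more costly in the application.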
14) How does Clustering differ from Classification? Discuss the K-Means Partitioning
method with a suitable example.
Classification:
• It uses algorithms to categorize new data as per the observations of the training set.
• In classification, there are labels for the training data.
• Its objective is to find which class a new object belongs to from the set of predefined classes.
Clustering:
• It uses statistical concepts in which the data set is divided into subsets with the same features.
• In clustering, there are no labels for the training data.
• Its objective is to group a set of objects to find whether there is any relationship between them.
Hence each cluster contains data points with some commonalities, and it is separated from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
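The K-means procedure can also be sketched in code: assign every point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop changing. The points are hypothetical, and taking the first k points as the starting centroids is a simplifying assumption made so that the run is repeatable; practical implementations usually pick the starting centroids at random.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, max_iter=100):
    centroids = [points[i] for i in range(k)]  # assumption: first k points
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two visible groups
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

On this data the algorithm settles after two passes, with the three points near (1, 1) in one cluster and the three points near (8, 8) in the other.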
15) What are association rules? How to evaluate them using Support and
Confidence? Explain with Example
Association Rule
Association rule mining finds interesting associations and relationships among large sets of
data items. An association rule shows how frequently an itemset occurs in a transaction.
A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items. It allows retailers to identify relationships between the items that
people frequently buy together.
It is employed in market basket analysis, web usage mining, continuous production, etc. Market
basket analysis is a technique used by many big retailers to discover the associations
between items. We can understand it by taking the example of a supermarket, where
products that are frequently purchased together are placed together.
For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these
products are stored on the same shelf or nearby. Consider the below diagram:
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
How does Association Rule Learning work?
Association rule learning works on the concept of an if-then statement, such as if A then B.
Here the "if" element is called the antecedent, and the "then" statement is called the consequent.
Relationships in which we find an association between two items are said to have
single cardinality. It is all about creating rules, and if the number of items increases,
then cardinality also increases accordingly. So, to measure the associations between thousands
of data items, there are several metrics. These metrics are given below:
Support
Support is the frequency of an itemset, or how frequently it appears in the dataset. It is defined as
the fraction of the transactions T that contain the itemset X. For an itemset X and a set of
transactions T, it can be written as:
Support(X) = (number of transactions containing X) / (total number of transactions in T)
Confidence
Confidence indicates how often the rule has been found to be true, that is, how often the items X and
Y occur together in the dataset when the occurrence of X is already given. It is the ratio of the
number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X → Y) = (number of transactions containing X and Y) / (number of transactions containing X)
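Both measures are easy to compute by counting transactions. In the sketch below the five shopping baskets are hypothetical, and the helper names `count`, `support` and `confidence` are illustrative; the rule evaluated is "if bread then butter".

```python
# Hypothetical supermarket transactions, one set of items per basket
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing x, the fraction that also contain y."""
    return count(x | y, transactions) / count(x, transactions)


print(support({"bread"}, transactions))
print(support({"bread", "butter"}, transactions))
print(confidence({"bread"}, {"butter"}, transactions))
```

Bread appears in 4 of the 5 baskets and bread together with butter in 3 of them, so the rule has a support of 3/5 and a confidence of 3/4.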
Weka
Weka contains a collection of visualization tools and algorithms for data analysis and predictive
modelling, together with graphical user interfaces for easy access to these functions. The original
non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modelling algorithms
implemented in other programming languages, plus data preprocessing utilities in C and a
makefile-based system for running machine learning experiments.
Weka supports several standard data mining tasks, specifically data preprocessing, clustering,
classification, regression, visualization, and feature selection. Input to Weka is expected to be
formatted according to the Attribute-Relation File Format (ARFF), with a filename ending in the
.arff extension.
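An ARFF file is plain text: a header names the relation and declares each attribute with its type, and a data section lists one comma-separated row per instance. As a rough sketch, the hypothetical `to_arff` helper below builds such text for a tiny made-up weather dataset; it is not part of Weka and handles only numeric and nominal attributes without quoting or missing values.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) pairs and data rows.

    A type is either a string such as "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):  # nominal attribute
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical dataset
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
rows = [("sunny", 85, "no"), ("overcast", 83, "yes"), ("rainy", 70, "yes")]

arff_text = to_arff("weather", attributes, rows)
print(arff_text)
```

Saving `arff_text` to a file with the .arff extension gives an input of the general shape Weka expects.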
Features of Weka
1. Preprocess
The preprocessing of data is a crucial task in data mining. Because most of the data is raw, there
are chances that it may contain empty or duplicate values, have garbage values, outliers, extra
columns, or have a different naming convention. All these things degrade the results.
To make data cleaner and more complete, WEKA provides a comprehensive set of
options under the filter category. Here, the tool provides both supervised and unsupervised types
of operations.
2. Classify
Classification is one of the essential functions in machine learning, where we assign classes or
categories to items. The classic examples of classification are: declaring a brain tumour as
"malignant" or "benign" or assigning an email to a "spam" or "not_spam" class.
After selecting the desired classifier, we select test options for the training set. Some of the
options are:
o Use training set: the classifier will be tested on the same training set.
o A supplied test set: evaluates the classifier based on a separate test set.
o Cross-validation Folds: assessment of the classifier based on cross-validation using the
number of provided folds.
o Percentage split: the classifier will be judged on a specific percentage of data.
3. Cluster
In clustering, a dataset is arranged in different groups/clusters based on some similarities. In this
case, the items within the same cluster are similar to each other but different from those in other
clusters. Examples of clustering include identifying customers with similar behaviours and
organizing regions according to homogeneous land use.
4. Associate
Association rules highlight all the associations and correlations between items of a dataset. In
short, it is an if-then statement that depicts the probability of relationships between data items.
A classic example of association refers to a connection between the sale of milk and bread.
The tool provides Apriori, FilteredAssociator, and FPGrowth algorithms for association
rules mining in this category.
5. Select Attributes
Every dataset contains a lot of attributes, but several of them may not be significantly valuable.
Therefore, removing the unnecessary and keeping the relevant details are very important for
building a good model.
Many attribute evaluators are available, and the search methods include BestFirst, GreedyStepwise, and Ranker.
6. Visualize
In the visualize tab, different plot matrices and graphs are available to show the trends and errors
identified by the model.
Accuracy and clarity of information: We can find the status and evolution of the whole
business at a glance, without needing to consult different sources and cross-reference data with the
help of spreadsheets. It also shortens the time it takes to train employees.
Information updates: Business Intelligence tools are updated automatically whenever new
data exists. This minimizes reporting and update times.
More agile and responsive: If something happens in operations that will affect
the business, alerts are received immediately, which helps us accelerate decision making.
Fewer bottlenecks: With appropriate permissions, users can easily access the information,
so that it is not necessary to ask every area manager for the data needed to generate
status reports.
Broader context of information: Through the creation of evolutionary reports, it is easy to
understand the data we manage. Moreover, these kinds of tools offer visual information like
shipment tracking maps.
18) Comment “How might you implement business intelligence findings within
an organization?”
3. Understanding Customers
Understanding customers is an important aspect of using Business Intelligence (BI) in finance.
This refers to the process of using BI techniques and tools to gain insights into customer
behavior and preferences to make informed decisions about financial matters such as pricing,
marketing, and product development.
4. Financial Reporting
Financial reporting refers to the process of using Business Intelligence (BI) techniques and tools
to generate reports that provide insight into financial performance and trends. This is an
important aspect of using BI in finance, as it can provide finance teams and other stakeholders
such as management and investors with greater visibility into financial performance and trends.
5. Compliance
Compliance refers to the process of ensuring that an organization adheres to laws, regulations,
standards, and guidelines that are relevant to its operations. In finance, compliance can refer to
ensuring that an organization’s financial practices and reporting comply with relevant laws,
regulations, and standards. Business Intelligence (BI) can be used to support compliance by
providing organizations with the necessary tools to monitor and report on financial performance.
BI in CRM is crucial to business success. Competitors have already embraced metrics and KPIs
(key performance indicators) for customer relationship management to provide objective measures
and understand what tasks and activities support goals, and where the business needs to refocus.
CRM is a business strategy that identifies the customer as a very important business asset. It
recognizes that customers are unique and need to be treated as uniquely as possible. It
recognizes customer satisfaction as important only as a determinant of managing churn and
future revenue potential. Information is critical in such a strategy and can only come about
through the analysis of data. Smart use and analysis of data is what constitutes business
intelligence. So, business intelligence is a broad category which includes many uses of data, one
of which is in pursuit of a CRM strategy.