ETL Interview 1

The document explains the differences between various SQL join types and operations. Inner join returns rows when there is a match between both tables, while outer join returns all rows from one or both tables including non-matching rows. UNION combines result sets and removes duplicates, MINUS returns rows in the first table that are not in the second, and INTERSECT returns matching rows between tables. Self join joins a table to itself to relate hierarchical data in a flat structure.

What is the difference between inner and outer join? Explain with example.

Inner Join

Inner join is the most common type of Join which is used to combine the rows from two tables and create
a result set containing only such records that are present in both the tables based on the joining condition
(predicate).

Inner join returns rows when there is at least one match in both tables

If none of the records match between the two tables, then INNER JOIN will return an empty set. Below is an
example of INNER JOIN and the resulting set.

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE


FROM DEPT dept, EMPLOYEE emp
WHERE emp.dept_id = dept.id

Department Employee

HR Inno

HR Privy

Engineering Robo

Engineering Hash

Engineering Anno

Engineering Darl

Marketing Pete
Marketing Meme

Sales Tomiti

Sales Bhuti

Outer Join

Outer Join can be full outer or single outer

Outer Join, on the other hand, will return matching rows from both tables as well as any unmatched rows
from one or both the tables (based on whether it is single outer or full outer join respectively).

Notice in our record set that there is no employee in the department 5 (Logistics). Because of this if we
perform inner join, then Department 5 does not appear in the above result. However in the below query
we perform an outer join (dept left outer join emp), and we can see this department.

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE


FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+)

Department Employee

HR Inno

HR Privy

Engineering Robo
Engineering Hash

Engineering Anno

Engineering Darl

Marketing Pete

Marketing Meme

Sales Tomiti

Sales Bhuti

Logistics

The (+) sign on the emp side of the predicate indicates that emp is the outer table here. The above SQL
can be alternatively written as below (will yield the same result as above):

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE


FROM DEPT dept LEFT OUTER JOIN EMPLOYEE emp
ON dept.id = emp.dept_id

What is the difference between JOIN and UNION?

SQL JOIN allows us to "look up" records in another table based on the given conditions between two tables.
For example, if we have the department ID of each employee, then we can use the department ID of the
employee table to join with the department ID of the department table and look up department names.
UNION allows us to add two similar data sets to create a resulting data set that contains all the data
from the source data sets. UNION does not require any joining condition. For example, if you have two
employee tables with the same structure, you can UNION them to create one result set that contains all
the employees from both tables.

SELECT * FROM EMP1


UNION
SELECT * FROM EMP2;

What is the difference between UNION and UNION ALL?

UNION and UNION ALL both combine two structurally similar data sets, but UNION returns only the
unique records from the resulting data set, whereas UNION ALL returns all the rows, even if one or
more rows are duplicates of each other.

In the following example, I am choosing exactly the same employee from the emp table and performing
UNION and UNION ALL. Check the difference in the result.

SELECT * FROM EMPLOYEE WHERE ID = 5


UNION ALL
SELECT * FROM EMPLOYEE WHERE ID = 5

ID MGR_ID DEPT_ID NAME

5 2 2 Anno

5 2 2 Anno

SELECT * FROM EMPLOYEE WHERE ID = 5


UNION
SELECT * FROM EMPLOYEE WHERE ID = 5

ID MGR_ID DEPT_ID NAME

5 2 2 Anno

What is the difference between WHERE clause and HAVING clause?


WHERE and HAVING both filter out records based on one or more conditions. The difference is that the
WHERE clause can only be applied to static, non-aggregated columns (it filters rows before aggregation),
whereas we need to use HAVING for conditions on aggregated columns (it filters groups after aggregation).

To understand this, consider this example. 


Suppose we want to see only those departments where department ID is greater than 3. There is no
aggregation operation and the condition needs to be applied on a static field. We will use WHERE clause
here:

SELECT * FROM DEPT WHERE ID > 3

ID NAME

4 Sales

5 Logistics

Next, suppose we want to see only those Departments where Average salary is greater than 80. Here the
condition is associated with a non-static aggregated information which is “average of salary”. We will
need to use HAVING clause here:

SELECT dept.name DEPARTMENT, avg(emp.sal) AVG_SAL


FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+)
GROUP BY dept.name
HAVING AVG(emp.sal) > 80

DEPARTMENT AVG_SAL

Engineering 90

As you see above, there is only one department (Engineering) where average salary of employees is
greater than 80.
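The HAVING example can be verified end to end. The sketch below uses Python's sqlite3 module and ANSI join syntax (the Oracle (+) notation is not portable); the department IDs for HR, Engineering and Marketing (1, 2, 3) are assumptions, while Sales (4) and Logistics (5) match the DEPT rows shown earlier.

```python
import sqlite3

# Sample data from this article; HR/Engineering/Marketing ids (1-3) are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE employee (dept_id INTEGER, name TEXT, sal INTEGER)")
conn.executemany("INSERT INTO dept VALUES (?, ?)",
                 [(1, "HR"), (2, "Engineering"), (3, "Marketing"),
                  (4, "Sales"), (5, "Logistics")])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, "Inno", 50), (1, "Privy", 50),
    (2, "Robo", 100), (2, "Hash", 100), (2, "Anno", 80), (2, "Darl", 80),
    (3, "Pete", 70), (3, "Meme", 60),
    (4, "Tomiti", 70), (4, "Bhuti", 60),
])

# HAVING filters on the aggregate; WHERE could not reference AVG(sal).
rows = conn.execute("""
    SELECT d.name, AVG(e.sal)
    FROM dept d JOIN employee e ON d.id = e.dept_id
    GROUP BY d.name
    HAVING AVG(e.sal) > 80
""").fetchall()
print(rows)  # [('Engineering', 90.0)]
```

Only Engineering survives the HAVING filter: (100 + 100 + 80 + 80) / 4 = 90.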

What is the difference among UNION, MINUS and INTERSECT?

UNION combines the results from 2 tables and eliminates duplicate records from the result set.
The MINUS operator, when used between 2 tables, gives us all the rows from the first table except the
rows which are present in the second table.

INTERSECT operator returns us only the matching or common rows between 2 result sets.

To understand these operators, let’s see some examples. We will use two different queries to extract data
from our emp table and then we will perform UNION, MINUS and INTERSECT operations on these two
sets of data.

UNION

SELECT * FROM EMPLOYEE WHERE ID = 5


UNION
SELECT * FROM EMPLOYEE WHERE ID = 6

ID MGR_ID DEPT_ID NAME

5 2 2 Anno

6 2 2 Darl

MINUS

SELECT * FROM EMPLOYEE


MINUS
SELECT * FROM EMPLOYEE WHERE ID > 2

ID MGR_ID DEPT_ID NAME

1 (null) 2 Hash

2 1 2 Robo

INTERSECT

SELECT * FROM EMPLOYEE WHERE ID IN (2, 3, 5)


INTERSECT
SELECT * FROM EMPLOYEE WHERE ID IN (1, 2, 4, 5)

ID MGR_ID DEPT_ID NAME

5 2 2 Anno

2 1 2 Robo

What is Self Join and why is it required?

Self Join is the act of joining one table with itself.

Self Join is often very useful to convert a hierarchical structure into a flat structure

In our employee table example above, we have kept the manager ID of each employee in the same row as
that of the employee. This is an example of how a hierarchy (in this case the employee-manager hierarchy) is
stored in an RDBMS table. Now, suppose we need to print the name of each employee's manager right
beside the employee; we can use a self join. See the example below:

SELECT e.name EMPLOYEE, m.name MANAGER


FROM EMPLOYEE e, EMPLOYEE m
WHERE e.mgr_id = m.id (+)

EMPLOYEE MANAGER

Pete Hash

Darl Hash

Inno Hash
Robo Hash

Tomiti Robo

Anno Robo

Privy Robo

Meme Pete

Bhuti Tomiti

Hash

The only reason we have performed a left outer join here (instead of an INNER JOIN) is that we have one
employee in this table without a manager (employee ID = 1). If we performed an inner join, this employee
would not show up.

How can we transpose a table using SQL (changing rows to columns or vice versa)?

The usual way to do it in SQL is to use a CASE expression (or the DECODE function in Oracle).
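As a runnable sketch (using Python's sqlite3 and a hypothetical SALES table, not one from the examples above), the CASE technique turns one row per quarter into one column per quarter:

```python
import sqlite3

# Hypothetical SALES table: one row per (year, quarter) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (yr INTEGER, qtr TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2020, "Q1", 10), (2020, "Q2", 20), (2020, "Q3", 30), (2020, "Q4", 40)],
)

# CASE-based transpose: quarter values (rows) become columns.
rows = conn.execute("""
    SELECT yr,
           SUM(CASE WHEN qtr = 'Q1' THEN amount END) AS q1,
           SUM(CASE WHEN qtr = 'Q2' THEN amount END) AS q2,
           SUM(CASE WHEN qtr = 'Q3' THEN amount END) AS q3,
           SUM(CASE WHEN qtr = 'Q4' THEN amount END) AS q4
    FROM sales
    GROUP BY yr
""").fetchall()
print(rows)  # [(2020, 10, 20, 30, 40)]
```

Each CASE expression picks out the amount for one quarter only (NULL otherwise), and SUM with GROUP BY collapses the four rows into a single row per year.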

How to generate row number in SQL Without ROWNUM

Generating a row number – that is a running sequence of numbers for each row is not easy using plain
SQL. In fact, the method I am going to show below is not very generic either. This method only works if
there is at least one unique column in the table. This method will also work if there is no single unique
column, but collection of columns that is unique. Anyway, here is the query:

SELECT name, sal, (SELECT COUNT(*) FROM EMPLOYEE i WHERE o.name >=
i.name) row_num
FROM EMPLOYEE o
order by row_num

NAME SAL ROW_NUM

Anno 80 1

Bhuti 60 2

Darl 80 3

Hash 100 4

Inno 50 5

Meme 60 6

Pete 70 7

Privy 50 8

Robo 100 9

Tomiti 70 10
The column that is used in the row number generation logic is called the "sort key". Here the sort key is
the "name" column. For this technique to work, the sort key needs to be unique. We have chosen the
column "name" because it happens to be unique in our Employee table. If it were not unique but some
other collection of columns was, then we could have used those columns as our sort key (by
concatenating them to form a single sort key).

Also notice how the rows are sorted in the result set. We have done an explicit sort on the row_num
column, which gives us all the row numbers in sorted order. But notice that the name column is also
sorted (which is probably why this column is referred to as the sort key). If you want to change the
order of the sorting from ascending to descending, you will need to change the ">=" sign to "<=" in the
query.

As I said before, this method is not very generic. This is why many databases implement other methods
to achieve this. For example, Oracle provides a ROWNUM pseudo column in every result set, which we
can select explicitly to get sequence numbers.
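The correlated-subquery technique can be tried out with the sample data above; here is a sketch using Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, sal INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [
    ("Anno", 80), ("Bhuti", 60), ("Darl", 80), ("Hash", 100), ("Inno", 50),
    ("Meme", 60), ("Pete", 70), ("Privy", 50), ("Robo", 100), ("Tomiti", 70),
])

# Row number = count of names sorting at or before this row's name;
# this works because "name" (the sort key) is unique.
rows = conn.execute("""
    SELECT name, sal,
           (SELECT COUNT(*) FROM employee i WHERE o.name >= i.name) AS row_num
    FROM employee o
    ORDER BY row_num
""").fetchall()
print(rows[0], rows[-1])  # ('Anno', 80, 1) ('Tomiti', 70, 10)
```

The output numbers the ten employees 1 through 10 in alphabetical order of name, matching the result table above.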

How to select first 5 records from a table?

This question, often asked in many interviews, does not make much sense to me. The problem is how
you define which record is first and which is second. Which record is retrieved first from the database
is not deterministic; it depends on many uncontrollable factors, such as how the database works at the
moment of execution. So the question should really be "how to select any 5 records from the table?"
But whatever it is, here is the solution:

In Oracle,

SELECT *
FROM EMP
WHERE ROWNUM <= 5;

In SQL Server,

SELECT TOP 5 * FROM EMP;

Generic solution,

I believe a generic solution can be devised for this problem if and only if there exists at least one distinct
column in the table. For example, in our EMP table, ID is distinct. We can use that distinct column in the
way shown below to come up with a generic solution that does not require database-specific functions
such as ROWNUM or TOP.

SELECT name
FROM EMPLOYEE o
WHERE (SELECT count(*) FROM EMPLOYEE i WHERE i.name < o.name) < 5

name
Anno

Bhuti

Darl

Hash

Inno

I have taken the "name" column in the above example since "name" happens to be unique in this table. I
could have used the ID column as well.

In this example, if the chosen column were not distinct, we would have got more than 5 records returned
in our output.
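Under the same assumption (the "name" column is unique), the generic solution can be checked against the sample names from our Employee table; a sketch with Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?)", [
    ("Anno",), ("Bhuti",), ("Darl",), ("Hash",), ("Inno",),
    ("Meme",), ("Pete",), ("Privy",), ("Robo",), ("Tomiti",),
])

# A row qualifies when fewer than 5 names sort strictly before it,
# i.e. the query returns the 5 alphabetically-first names.
names = {r[0] for r in conn.execute("""
    SELECT name FROM employee o
    WHERE (SELECT COUNT(*) FROM employee i WHERE i.name < o.name) < 5
""")}
print(sorted(names))  # ['Anno', 'Bhuti', 'Darl', 'Hash', 'Inno']
```

Note that "any 5 records" here means the 5 smallest values of the chosen sort column, since the correlated subquery counts names sorting before each row.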


What is the difference between ROWNUM pseudo column and ROW_NUMBER() function?

ROWNUM is a pseudo column available in Oracle. It is assigned to the rows of a result set before
ORDER BY is evaluated, so ORDER BY ROWNUM does not work.

ROW_NUMBER() is an analytic function which is used in conjunction with the OVER() clause, wherein
we can specify ORDER BY and also PARTITION BY columns.

Suppose you want to generate row numbers in the order of employee salaries, for example; ROWNUM
will not work here, but you may use ROW_NUMBER() OVER() as shown below:

SELECT name, sal, row_number() over(order by sal desc) rownum_by_sal


FROM EMPLOYEE o

name Sal ROWNUM_BY_SAL


Hash 100 1

Robo 100 2

Anno 80 3

Darl 80 4

Tomiti 70 5

Pete 70 6

Bhuti 60 7

Meme 60 8

Inno 50 9

Privy 50 10

What are the differences among ROW_NUMBER, RANK and DENSE_RANK?

ROW_NUMBER assigns contiguous, unique numbers from 1.. N to a result set.


RANK does not assign unique numbers—nor does it assign contiguous numbers. If two records tie for
second place, no record will be assigned the 3rd rank as no one came in third, according to RANK. See
below:

SELECT name, sal, rank() over(order by sal desc) rank_by_sal


FROM EMPLOYEE o

name Sal RANK_BY_SAL

Hash 100 1

Robo 100 1

Anno 80 3

Darl 80 3

Tomiti 70 5

Pete 70 5

Bhuti 60 7

Meme 60 7

Inno 50 9
Privy 50 9

DENSE_RANK, like RANK, does not assign unique numbers, but it does assign contiguous numbers.
Even though two records tied for second place, there is a third-place record. See below:

SELECT name, sal, dense_rank() over(order by sal desc)


dense_rank_by_sal
FROM EMPLOYEE o

name Sal DENSE_RANK_BY_SAL

Hash 100 1

Robo 100 1

Anno 80 2

Darl 80 2

Tomiti 70 3

Pete 70 3

Bhuti 60 4

Meme 60 4
Inno 50 5

Privy 50 5

How to print/display the first line of a file?

There are many ways to do this. However the easiest way to display the first line of a
file is using the [head] command.
$> head -1 file.txt

No prize for guessing that if you specify [head -2] then it will print the first 2 lines of
the file.

Another way is to use the [sed] command. [sed] is a very powerful text editor
which can be used for various text manipulation purposes like this.
$> sed '2,$ d' file.txt
You may be wondering how the above command works. The 'd' parameter tells [sed] to
delete all the records from the display output from line no. 2 to the last line of the file
(the last line is represented by the $ symbol). Of course, it does not actually delete
those lines from the file; it just does not display them on the standard output. So you
only see the remaining line, which is the first line.

How to print/display the last line of a file?

The easiest way is to use the [tail] command.


$> tail -1 file.txt
If you want to do it using the [sed] command, here is what you should write:
$> sed -n '$ p' file.txt

From our previous answer, we already know that '$' stands for the last line of the file.
So '$ p' basically prints (p for print) the last line on the standard output. The '-n' switch
puts [sed] in silent mode so that [sed] does not print anything else in the output.

How to display n-th line of a file?

The easiest way to do it is by using [sed], I guess. Based on what we already
know about [sed] from our previous examples, we can quickly deduce this command:
$> sed -n '<n> p' file.txt

You need to replace <n> with the actual line number. So if you want to print the 4th
line, the command will be
$> sed -n '4 p' file.txt

Of course you can do it by using [head] and [tail] command as well like below:
$> head -<n> file.txt | tail -1

You need to replace <n> with the actual line number. So if you want to print the 4th
line, the command will be
$> head -4 file.txt | tail -1

How to remove the first line / header from a file?

We already know how [sed] can be used to delete a certain line from the output, by
using the 'd' switch. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt

But the issue with the above command is that it just prints all the lines except the first
line of the file on the standard output. It does not really change the file in place. So if
you want to delete the first line from the file itself, you have two options.

Either you can redirect the output to some other file and then rename it back
to the original file like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or, you can use the inbuilt [sed] switch '-i', which changes the file in place. See below:
$> sed -i '1 d' file.txt

How to remove the last line/ trailer from a file in Unix script?

Always remember that the [sed] address '$' refers to the last line. So using this knowledge
we can deduce the below command:
$> sed -i '$ d' file.txt

How to remove certain lines from a file in Unix?

If you want to remove line <m> to line <n> from a given file, you can accomplish the
task in a similar way as shown above. Here is an example:
$> sed -i '5,7 d' file.txt

The above command will delete line 5 to line 7 from the file file.txt.

How to remove the last n-th line from a file?

This is a bit tricky. Suppose your file contains 100 lines and you want to remove the last
5 lines. If you know how many lines are there in the file, then you can simply
use the above method and remove all the lines from 96 to 100 like below:
$> sed -i '96,100 d' file.txt # alternative to command [head -95 file.txt]
But you will not always know the number of lines present in the file (the file may be
generated dynamically, etc.). In that case there are many different ways to solve the
problem. Some of them are quite complex and fancy. But let's first do it in a way that
is easy to understand and remember. Here is how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt
As you can see, there are two commands. The first one (before the semicolon)
calculates the total number of lines present in the file and stores it in a variable called
"tt". The second command (after the semicolon) uses the variable and works in
exactly the way shown in the previous example.

How to check the length of any line in a file?

We already know how to print one line from a file:
$> sed -n '<n> p' file.txt
where <n> is to be replaced by the actual line number that you want to print. Once
you know that, it is easy to print the length of this line by using the [wc] command
with the '-c' switch.
$> sed -n '35 p' file.txt | wc -c
The above command will print the length of the 35th line of file.txt (note that [wc -c]
counts the trailing newline character as well).

How to get the nth word of a line in Unix?

Assuming the words in the line are separated by spaces, we can use the [cut] command.
[cut] is a very powerful and useful command, and it's really easy to use. All you have to
do to get the n-th word from the line is issue the following command:
cut -f<n> -d' '
The '-d' switch tells [cut] what the delimiter (or separator) is, which is a
space ' ' in this case. If the separator were a comma, we would have written -d',' instead.
So, suppose I want to find the 4th word in the string "A quick brown fox jumped
over the lazy cat"; we will do something like this:
$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '
And it will print "fox".

How to reverse a string in unix?

Pretty easy. Use the [rev] command.


$> echo "unix" | rev
xinu

How to get the last word from a line in Unix file?

We will make use of two commands that we learnt above to solve this. The commands
are [rev] and [cut]. Here we go.

Let's imagine the line is: “C for Cat”. We need “Cat”. First we reverse the line. We get
“taC rof C”. Then we cut the first word, we get 'taC'. And then we reverse it again.
$>echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat

How to get the n-th field from a Unix command output?

We know we can do it by [cut]. For example, the below command extracts the first field
from the output of the [wc -c] command:
$> wc -c file.txt | cut -d' ' -f1
109
But I want to introduce one more command to do this: [awk]. [awk] is a very powerful
command for text pattern scanning and processing. Here we will see how we may use
[awk] to extract the first field (or first column) from the output of another command.
As above, suppose I want to print the first column of the [wc -c] output. Here is how it
goes:
$> wc -c file.txt | awk ' ''{print $1}'
109

The basic syntax of [awk] is like this:


awk 'pattern space''{action space}'
The pattern space can be left blank or omitted, like below:
$>wc -c file.txt | awk '{print $1}'
109
In the action space, we have asked [awk] to take the action of printing the first column
($1). More on [awk] later.

How to replace the n-th line in a file with a new line in Unix?

This can be done in two steps. The first step is to remove the n-th line. And the second
step is to insert a new line in n-th line position. Here we go.

Step 1: remove the n-th line


$>sed -i'' '10 d' file.txt # d stands for delete

Step 2: insert a new line at n-th line position


$>sed -i'' '10 i This is the new line' file.txt # i stands for insert

How to show the non-printable characters in a file?

Open the file in VI editor. Go to VI command mode by pressing [Escape] and then [:].
Then type [set list]. This will show you all the non-printable characters, e.g. Ctrl-M
characters (^M) etc., in the file.

How to zip a file in Linux?

Use the inbuilt [zip] command in Linux. For example, to zip file.txt into file.zip:
$> zip file.zip file.txt

How to unzip a file in Linux?

Use inbuilt [unzip] command in Linux.


$> unzip -j file.zip

How to test if a zip file is corrupted in Linux?

Use the '-t' switch with the inbuilt [unzip] command:

$> unzip -t file.zip

How to check if a file is zipped in Unix?

In order to know the file type of a particular file use the [file] command like below:
$> file file.txt
file.txt: ASCII text
If you want to know the technical MIME type of the file, use “-i” switch.
$>file -i file.txt
file.txt: text/plain; charset=us-ascii
If the file is zipped, the following will be the result:
$> file -i file.zip
file.zip: application/x-zip

How to connect to Oracle database from within shell script?

You will be using the same [sqlplus] command to connect to database that you use
normally even outside the shell script. To understand this, let's take an example. In
this example, we will connect to database, fire a query and get the output printed from
the unix shell. Ok? Here we go –
$>res=`sqlplus -s username/password@database_name <<EOF
SET HEAD OFF;
select count(*) from dual;
EXIT;
EOF`
$> echo $res
1

If you connect to the database in this way, the advantage is that you will be able to pass
Unix-side shell variable values to the database. See the below example (note the quotes
around $1, since last_name is a string column):
$> res=`sqlplus -s username/password@database_name <<EOF
SET HEAD OFF;
select count(*) from student_table t where t.last_name='$1';
EXIT;
EOF`
$> echo $res
12

How to execute a database stored procedure from Shell script?


$> SqlReturnMsg=`sqlplus -s username/password@database<<EOF
BEGIN
Proc_Your_Procedure(… your-input-parameters …);
END;
/
EXIT;
EOF`
$> echo $SqlReturnMsg

How to check the command line arguments in a UNIX command in Shell Script?

In a bash shell, you can access the command line arguments using the $0, $1, $2, …
variables, where $0 holds the command name, $1 the first argument passed to the
command, $2 the second argument, and so on. For example, if you run
[./myscript.sh foo bar], then inside the script $0 is ./myscript.sh, $1 is foo and $2 is bar.

How to fail a shell script programmatically?

Just put an [exit] command in the shell script with a return value other than 0. This is
because the exit code of a successful Unix program is zero. So, if you write
exit -1
inside your program, then your program will throw an error and exit immediately.

How to list down file/folder lists alphabetically?

Normally the [ls -lt] command lists files/folders sorted by modified time. If you
want to list them alphabetically, then you should simply specify: [ls -l]

How to check if the last command was successful in Unix?

To check the status of last executed command in UNIX, you can check the value of an
inbuilt bash variable [$?]. See the below example:
$> echo $?

How to check if a file is present in a particular directory in Unix?

We can do this in many ways. Based on what we have learnt so far, we can
make use of the [ls] command and [$?] to do this. See below:
$> ls -l file.txt; echo $?
If the file exists, the [ls] command will succeed, hence [echo $?] will print 0. If
the file does not exist, the [ls] command will fail, hence [echo $?] will print a
non-zero value.

How to check all the running processes in Unix?


The standard command to see this is [ps]. But [ps] only shows you a snapshot of the
processes at that instant. If you need to monitor the processes for a certain period of
time and refresh the results at each interval, consider using the [top] command.
$> ps -ef
If you wish to see the % of memory usage and CPU usage, then consider the below
switches:
$> ps aux
If you wish to use this command inside some shell script, or if you want to customize
the output of the [ps] command, you may use the '-o' switch like below. By using '-o',
you can specify the columns that you want [ps] to print out.
$> ps -e -o stime,user,pid,args,%mem,%cpu

How to tell if my process is running in Unix?

You can list down all the running processes using [ps] command. Then you can
“grep” your user name or process name to see if the process is running. See below:
$>ps -e -o stime,user,pid,args,%mem,%cpu | grep "opera"
14:53 opera 29904 sleep 60 0.0 0.0
14:54 opera 31536 ps -e -o stime,user,pid,arg 0.0 0.0
14:54 opera 31538 grep opera 0.0 0.0

How to get the CPU and Memory details in Linux server?

In Linux-based systems, you can easily access the CPU and memory details from
the /proc/cpuinfo and /proc/meminfo files, like this:
$>cat /proc/meminfo
$>cat /proc/cpuinfo

What is data warehouse?

A data warehouse is an electronic store of an organization's historical data for the purpose of
reporting, analysis and data mining or knowledge discovery.

Other than that, a data warehouse can also be used for the purpose of data integration, master data
management etc.
According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and
time-variant.

Non-volatile means that data, once loaded in the warehouse, will not get deleted later. Time-variant
means the data is stored in a way that lets us analyze how it changes with respect to time.

What are the benefits of a data warehouse?

A data warehouse helps to integrate data (see data integration) and store it historically so that we
can analyze different aspects of the business, including performance analysis, trends, predictions etc.,
over a given time frame and use the results of our analysis to improve the efficiency of business
processes.

Why Data Warehouse is used?

Data warehouses have long been built, and still are, to facilitate reporting on the key performance
indicators (KPIs) of an organization's business processes. Data warehouses also help to integrate data
from different sources and provide a single point of truth for business measures.

A data warehouse can be further used for data mining, which helps with trend prediction, forecasting,
pattern recognition etc.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis
system on that data.

OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized.
On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT
operations.

Explanatory Note:
In a department store, when we pay at the check-out counter, the sales person at the
counter keys all the data into a "point of sale" machine. That data is transaction data, and the
related system is an OLTP system.

On the other hand, the manager of the store might want to view a report on out-of-stock materials, so
that he can place purchase orders for them. Such a report comes from the OLAP system.

What is data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to
different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department
may have a separate data mart. These data marts can be built on top of the data warehouse.

What is ER model?

ER model, or entity-relationship model, is a particular methodology of data modeling wherein the goal of
modeling is to normalize the data by reducing redundancy. This is different from dimensional modeling,
where the main goal is to improve the data retrieval mechanism.

What is dimensional modeling?

A dimensional model consists of dimension and fact tables. Fact tables store transactional
measurements and the foreign keys from the dimension tables that qualify the data. The goal of the
dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data
retrieval.

Ralph Kimball is one of the strongest proponents of this very popular data modeling technique, which is
often used in many enterprise-level data warehouses.

What is snow-flake schema?

This is another logical arrangement of tables in dimensional modeling, where a centralized fact table
references a number of dimension tables; however, those dimension tables are further normalized
into multiple related tables.

Consider a fact table that stores the sales quantity for each product and customer at a certain time. Sales
quantity will be the measure here, and keys from the customer, product and time dimension tables will
flow into the fact table. Additionally, all the products can be further grouped under different product
families stored in a separate table, so that the primary key of the product family table also goes into the
product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is
further snow-flaked into product family.


Note: Snow-flaking increases the degree of normalization in the design.
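The product/product-family example can be sketched as DDL. The table and column names below are illustrative (not from any particular warehouse), using SQLite via Python so the sketch is self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snow-flaked dimension: PRODUCT references PRODUCT_FAMILY.
CREATE TABLE product_family (family_id INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    family_id INTEGER REFERENCES product_family(family_id)
);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
-- Central fact table keyed by the dimension tables.
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product(product_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    sale_date TEXT,
    sales_qty INTEGER
);
""")
# Reporting on family-level sales now needs one extra join
# (fact -> product -> product_family) compared to a star schema.
```

The extra hop from product to product_family is exactly what the note above refers to: more normalization, at the cost of an additional join at query time.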

What are the different types of dimension?

In a data warehouse model, dimension can be of following types,

Conformed Dimension

Junk Dimension
Degenerated Dimension

Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimensions as

Unchanging or static dimension (UCD)

Slowly changing dimension (SCD)

Rapidly changing Dimension (RCD)


What is a 'Conformed Dimension'?

A conformed dimension is a dimension that is shared across multiple subject areas. Consider a
'Customer' dimension. Both the marketing and sales departments may use the same customer dimension
table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas.
These dimensions are conformed dimensions.

Theoretically, two dimensions which are either identical or strict mathematical subsets of one another
are said to be conformed.

What is degenerated dimension?

A degenerated dimension is a dimension that is derived from the fact table and does not have its own
dimension table.

A dimension key such as transaction number, receipt number, invoice number etc. does not have any
more associated attributes and hence cannot be designed as a dimension table.
What is junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that they
can be removed from other tables and "junked" into an abstract dimension table.

These junk dimension attributes might not be related. The only purpose of this table is to store all the
combinations of the dimensional attributes which you could not fit into the different dimension tables
otherwise. Junk dimensions are often used to implement rapidly changing dimensions in the data
warehouse.

What is a role-playing dimension?

Dimensions are often reused for multiple applications within the same database with different
contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale" as well as "Date of
Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.

What is SCD?

SCD stands for slowly changing dimension, i.e. a dimension where the data changes slowly. SCDs can
be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Types 1, 2 and 3 are the
most common.

What is rapidly changing dimension?

This is a dimension where data changes rapidly. Read this article to know how to implement RCD.

Describe different types of slowly changing Dimension (SCD)

Type 0:
A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the
attributes of the dimension do not change in the actual business situation. It just means that, even if the
attribute values change, the table is never updated and continues to hold the original data.

Type 1:

A Type 1 dimension is one where history is not maintained and the table always shows the most recent
data. This effectively means that such a dimension table is updated in place whenever there is a
change, and because of this update we lose the previous values.
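A minimal sketch of the Type 1 overwrite, using an in-memory SQLite table (the table and column names are illustrative, not from any specific warehouse):

```python
import sqlite3

# Illustrative Type 1 dimension: one row per customer, no history columns.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE dim_customer (cust_key INTEGER PRIMARY KEY, customer TEXT, grp TEXT)")
con.execute("INSERT INTO dim_customer VALUES (1, 'C1', 'G1')")

# Customer C1 moves to group G2: the row is updated in place and G1 is lost.
con.execute("UPDATE dim_customer SET grp = 'G2' WHERE customer = 'C1'")

print(con.execute("SELECT * FROM dim_customer").fetchall())
```

Only the latest value survives; there is no way to ask what group C1 belonged to last year.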

Type 2:

A Type 2 dimension table tracks historical changes by creating separate rows in the table with
different surrogate keys. Suppose a customer C1 is first under group G1 and is later moved to group G2.
There will then be two separate records in the dimension table, like below:

Key  Customer  Group  Start Date    End Date
1    C1        G1     1st Jan 2000  31st Dec 2005
2    C1        G2     1st Jan 2006  NULL

Note that separate surrogate keys are generated for the two records. The NULL end date in the second row
denotes that it is the current record. Also note that, instead of start and end dates, one could
keep a version number column (1, 2 … etc.) to denote the different versions of the record.
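The expire-and-insert pattern described above can be sketched against an in-memory SQLite table (the schema is illustrative, and ISO dates are used in place of the "1st Jan 2000" formatting of the example):

```python
import sqlite3

# Illustrative Type 2 dimension with surrogate key plus validity dates.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    cust_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer TEXT, grp TEXT, start_date TEXT, end_date TEXT)""")
con.execute("INSERT INTO dim_customer (customer, grp, start_date, end_date) "
            "VALUES ('C1', 'G1', '2000-01-01', NULL)")

# C1 moves from G1 to G2: first close the current (end_date IS NULL) row ...
con.execute("UPDATE dim_customer SET end_date = '2005-12-31' "
            "WHERE customer = 'C1' AND end_date IS NULL")
# ... then open a new row; AUTOINCREMENT generates the new surrogate key.
con.execute("INSERT INTO dim_customer (customer, grp, start_date, end_date) "
            "VALUES ('C1', 'G2', '2006-01-01', NULL)")

for row in con.execute("SELECT * FROM dim_customer ORDER BY cust_key"):
    print(row)
```

Both versions of C1 now coexist, so facts recorded before 2006 can still join to the G1 row.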

Type 3:

A Type 3 dimension stores the history in a separate column instead of separate rows. So unlike a Type 2
dimension, which grows vertically, a Type 3 dimension grows horizontally. See the example
below:
Key  Customer  Previous Group  Current Group
1    C1        G1              G2

This approach is only good when you do not need to store many consecutive changes and when the date of
each change does not need to be stored.

Type 6:

A Type 6 dimension is a hybrid of Types 1, 2 and 3 (1+2+3). It behaves very much like Type 2, except
that one extra column is added to flag which record is the current one.

Key  Customer  Group  Start Date    End Date       Current Flag
1    C1        G1     1st Jan 2000  31st Dec 2005  N
2    C1        G2     1st Jan 2006  NULL           Y

What is a mini dimension?

Mini dimensions can be used to handle rapidly changing dimension scenarios. If a dimension has a huge
number of rapidly changing attributes, it is better to separate those attributes into a different table, called
a mini dimension. This is done because, if the main dimension table is designed as SCD Type 2, the table
will soon grow too large and create performance issues. Segregating the rapidly changing attributes
into their own table keeps the main dimension table small and performant.

What is a fact-less-fact?

A fact table that does not contain any measures is called a fact-less fact. Such a table contains only keys
from different dimension tables. It is often used to resolve a many-to-many cardinality issue.

Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may
have many students. To model this situation dimensionally, one might introduce a fact-less fact
table joining the teacher and student keys. Such a fact table can then answer queries like:

Who are the students taught by a specific teacher?

Which teacher teaches the maximum number of students?

Which student has the highest number of teachers?
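A sketch of such a fact-less fact and the second query above, using an in-memory SQLite database with invented key values:

```python
import sqlite3

# Fact-less fact: only dimension keys, no measures.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_teaching (teacher_key INTEGER, student_key INTEGER)")
con.executemany("INSERT INTO fact_teaching VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 10), (2, 11), (2, 12)])

# Which teacher teaches the maximum number of students?
# The "measure" is simply a count of the key combinations.
top = con.execute("""
    SELECT teacher_key, COUNT(DISTINCT student_key) AS n
    FROM fact_teaching
    GROUP BY teacher_key
    ORDER BY n DESC
    LIMIT 1""").fetchone()
print(top)
```

Even with no stored measure, counting rows of the key table answers all three positive questions.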


What is a coverage fact?

A fact-less fact table can only answer 'positive' queries; it cannot answer a negative
query. Consider again the illustration in the example above. A fact-less fact containing the keys of teachers
and students cannot answer queries like the following:

Which teacher did not teach any student?

Which student was not taught by any teacher?

Why not? Because a fact-less fact table only stores the positive scenarios (such as a student being taught by a
teacher). If there is a student who is not being taught by any teacher, then that student's key does not
appear in the table, thereby reducing the coverage of the table.

A coverage fact table attempts to answer this, often by adding an extra flag column: flag = 0 indicates a
negative condition and flag = 1 indicates a positive condition. To understand this better, consider a
class with 100 students and 5 teachers. The coverage fact table will then ideally store 100 × 5 = 500
records (all combinations), and if a certain teacher is not teaching a certain student, the flag for the
corresponding record will be 0.
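A minimal sketch of the coverage fact with its flag column (key values invented), showing the negative query a fact-less fact cannot answer:

```python
import sqlite3

# Coverage fact: every teacher/student combination gets a row, with
# flag = 1 when teaching actually happens and flag = 0 otherwise.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE coverage_fact (teacher_key INTEGER, student_key INTEGER, flag INTEGER)")
con.executemany("INSERT INTO coverage_fact VALUES (?, ?, ?)",
                [(1, 10, 1), (1, 11, 0), (2, 10, 0), (2, 11, 1)])

# Negative query: which students were NOT taught by teacher 1?
rows = con.execute("""
    SELECT student_key FROM coverage_fact
    WHERE teacher_key = 1 AND flag = 0""").fetchall()
print(rows)
```

Because the non-teaching combinations are stored explicitly, the flag filter finds them directly.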

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are stored (or captured)
against a specific time, and they vary over time. It may happen that the business is not able to capture
all of its measures at every point in time. The unavailable measurements can then either be left empty
(NULL) or be filled up with the last available measurement. The first case is an example of an incident
fact, and the second is an example of a snapshot fact.

What is aggregation and what is the benefit of aggregation?

A data warehouse usually captures data at the same degree of detail as is available in the source. This
"degree of detail" is termed granularity. But not all reporting requirements on that data warehouse need
the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops
across Europe. All the shops record detail-level transactions about the products they sell, and those
data are captured in a data warehouse.

Each shop manager can access the data warehouse and see which products were sold, by whom,
and in what quantity on any given date. Thus the data warehouse provides the shop managers with the
detail-level data that can be used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which sales girl in
London sold the highest number of chopsticks or which shop is the best seller of 'brown bread'. All he is
interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe the
year-on-year sales growth in Eastern Europe. Such data is aggregated in nature, because the sales of
goods in Eastern Europe is derived by summing up the individual sales data from each shop in that region.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
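The idea can be sketched with a tiny invented sales fact: the same table serves the shop manager's detail-level queries, while a GROUP BY rolls it up to the regional granularity the CEO needs.

```python
import sqlite3

# Illustrative detail-level sales fact (shop names, regions and amounts invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (shop TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("London-1", "West", 100.0), ("London-2", "West", 150.0),
                 ("Warsaw-1", "East", 80.0), ("Prague-1", "East", 120.0)])

# Aggregate to regional granularity: individual shop detail disappears,
# only the summed measure per region survives.
agg = con.execute("""
    SELECT region, SUM(amount) FROM sales
    GROUP BY region ORDER BY region""").fetchall()
print(agg)
```

In practice such aggregates are often pre-computed and stored as separate summary tables so that the CEO's reports do not have to scan the detail rows each time.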

What is slicing-dicing?

Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a value (e.g.
Brown Bread) and measures (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of
aggregation.

Slicing and dicing operations are part of pivoting.
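A toy illustration (with invented data) of slicing a small cube on one dimension value and then dicing the slice by re-aggregating it along another dimension:

```python
# A tiny in-memory "cube" of (product, region, sales) rows; values are invented.
cube = [
    {"product": "Brown Bread", "region": "East", "sales": 120},
    {"product": "Brown Bread", "region": "West", "sales": 200},
    {"product": "Chopsticks",  "region": "East", "sales": 50},
]

# Slice: fix one dimension value (product = 'Brown Bread').
slice_ = [r for r in cube if r["product"] == "Brown Bread"]

# Dice: view the slice along another dimension, aggregating sales per region.
dice = {}
for r in slice_:
    dice[r["region"]] = dice.get(r["region"], 0) + r["sales"]

print(slice_)
print(dice)
```

OLAP tools perform the same filtering and re-aggregation, just over far larger cubes and interactively.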

What is drill-through?

Drill-through is the process of going from summary data down to the detail-level data behind it.

Consider the above example on retail shops. If the CEO finds out that sales in Eastern Europe have declined
this year compared to last year, he might want to know the root cause of the decrease. For this, he
may start drilling through his report to more detailed levels, and eventually find that even though
individual shop sales have actually increased, the overall sales figure has decreased because a certain
shop in Turkey has stopped operating. The detail-level data, which the CEO was not much
interested in earlier, has this time helped him pinpoint the root cause of the declined sales. The
method he followed to obtain the details from the aggregated data is called drill-through.

BI testing – Part 1
posted in Testing career, Testing process

Imagine the tasks of the sales head of a big company selling different goods across the country. There can be multiple
software applications maintaining data about the products, sales numbers, sales schemes, product stock,
customers, distributors etc. across different regions. In order to effectively plan and implement sales strategies, he
needs thorough insight into this huge chunk of data. He needs to view the data in different forms, like
comparative analyses, graphs, charts, summaries, trends and so on, to get the necessary intelligence and take
informed decisions.
Data warehousing and Business intelligence solutions (DW-BI) address this need to EXTRACT data from multiple
data sources, TRANSFORM it into the required structure and LOAD into the target database for creating different
types of reports.

This entire ETL process (extract, transform, load) and the resulting reports need to be tested thoroughly to ensure that end
users easily get the right data in the right format as and when required, for effective control over their business
processes. This is called business intelligence testing, or BI testing.

The first part of BI testing is checking the correctness of the ETL process.

Data can be extracted from multiple data sources, like different RDBMSs or files. In the extract phase, it is necessary
to test how the data sources are accessed, which data needs to be extracted, the extraction schedule and logic, the extraction
rules and the temporary storage (staging). Once data is extracted, it needs to be processed as per the business need
before being loaded into the target system. Here, testing should check the outcome of the transformation process:
combination or splitting of data fields, format changes, data unit conversions, selection of records based on
particular rules, aggregation of data (summing, counting etc.) and so on. The load phase loads the data into a target
database, generally a data warehouse, and testing also needs to check the load results.
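One common load check, reconciling row counts and a measure total between the extracted source data and the loaded target, can be sketched as follows (the data and column names are invented for illustration):

```python
# Hypothetical extracted source rows and loaded target rows.
source_rows = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.5}]
target_rows = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.5}]

def reconcile(source, target):
    """Return True when row counts and the 'amount' totals match."""
    counts_match = len(source) == len(target)
    # Compare totals with a small tolerance to avoid float-comparison noise.
    sums_match = abs(sum(r["amount"] for r in source)
                     - sum(r["amount"] for r in target)) < 1e-9
    return counts_match and sums_match

print(reconcile(source_rows, target_rows))
```

Real ETL test suites extend the same idea with per-column checksums, duplicate checks and referential-integrity checks against the dimension keys.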

This data is then presented to different users in the form of reports, using business intelligence tools.

BI testing part 2 – Reports testing


posted in Testing career, Testing process

In the last post (BI testing – Part 1) we discussed the ETL process. In this post, let us discuss testing
the BI reports.

A report is consolidated information presented in a user-friendly format. Reports enable users to gain insight into
the business data and metrics, and to take decisions based on the intelligence gained from this data.

Typically, BI reporting tools can create a huge number of reports presenting complex information in
multiple user-friendly formats. Broadly, reports are of two types: canned reports, which are predefined
and created automatically by the system at scheduled intervals, and ad-hoc (on-demand) reports, which are created by users as
per their needs. Reports can be static, providing the required information in one go, or dynamic, where the
user can interactively drill down, filter information, change the data representation etc.

The key elements in report testing are checking the report generation process, the accuracy of the data content, the
correctness of data consolidations and calculations (sums, averages, section-wise breaks etc.), the formatting and
layout of the contents, the conversion of reports to multiple formats, the interactive features, and usability in
terms of the ability to quickly derive intelligence from the data.

It is important for a tester to have knowledge of SQL, insight into the BI tool’s reporting features and a correct
understanding of customer expectations in order to effectively test the reporting functionality.

BI testing is continuously growing, and so are the opportunities for testers in this specialized, niche testing
area.
