My Today's Paper of CS 614 With Solution


Current Papers CS 614 Spring 2014 Solved By Humda
MCQs and Subjective Questions. MIT 4th Sem, 25 Feb 2015
1. It is called a _____________ violation, if we have null values for attributes where
NOT NULL constraint exists
Load
Transform
Constraint page 161
Extraction
1. UAT stands for
User acceptance testing page 193
1. Implementing a DWH requires ____________ integrated activities.
Tightly page 289
Loosely
Slackly
Lethargically
1. The application development quality assurance activities cannot be completed
until the data is _____________
Stabilized page 308
Identified
Finalized
Computerized
1. Normalization is a process of efficiently organizing data in a data base by
_________ a relational table into smaller table by projection.
Composing
Decomposing page 41
Joining/merging
Combining
1. Dirty data class of anomalies include
i. Lexical errors
ii. Integrity constraints violation
Mc130400536 Humda Mit 4th Sem

iii. Business rule contradiction
iv. Irregularities
v. Duplication
i and iii and iv
i and ii and v
i and ii
Syntactically Dirty data: lexical errors, irregularities
Semantically dirty data: integrity constraint violation, business rule
contradiction, duplication
Coverage anomalies: missing attributes, missing records
1. Quantity sold is stored as a _____ fact.
Additive
Non-additive
Association
Non-association
1. In Kimball's approach, the product selection phase falls in the
Lifecycle Technology Track page 290
Lifecycle Data Track
Lifecycle Analytic Applications Track
None of the given
1. Giving the least time to ____ can prove suicidal for a DWH project
OLAP
De-normalization
ETL page 313
None of the given
1. Multan division is the cotton hub
1. Which is not an issue of Click stream data.
Identifying the Visitor Origin
Identifying the Session
Identifying the Visitor
(Another option was given, which is not an issue of click stream data.)
1. Which statement about HTTP is true?
Is stateless page 364
Non world wide web protocol
Used to maintain session
Message routing protocol


1. SMP stands for Symmetric Multi-Processing


1. K-clustering is equal to sequence of n
K much greater than n
K much smaller than n
K is equal to square of n
None of the given
1. The ith bit is set to 1 if the ith row of the base table has the value for the indexed
column. The statement refers to
Inverted
Bitmap page 233
Dense
Sparse index
1. __________ is a systematic sampling process that provides field specific
information on pest pressure and crop injury.
Pest scouting page 333
Soil survey
Seed survey
Water survey
1. In the context of a web data warehouse, which is NOT one of the ways to identify a session?
Using asynchronous session tracking protocol
Using Time-contiguous Log Entries
Using Transient Cookies
Using HTTP's secure sockets layer (SSL)
Using session ID Ping-pong
Using Persistent Cookies
Some mcqs from my midterm paper. 2 underlined MCQs are also included in
my final paper
1. The telecommunication data warehouse is dominated by the sheer volume of data
generated at the call level _________ area.
Subject page 35
Object
Aggregate
Details
1. 4NF has an additional requirement, which is
Data is in 3NF and no null key dependency
Data is in 2NF and no multi-valued dependency
Data is in 3NF and no multi-valued dependency page 48
Data is in 3NF and no foreign key table

1. 3NF removes even more data redundancy than 2NF, but it is at the cost of
Simplicity and performance page 48
Complexity
No of table
Relations
1. In full extraction, data is extracted completely from the source. There is no need to keep
track of changes to the _________
Data source page 133
DWH
Data mart
Data destination
1. Which is not a characteristic of a DWH?
Ad-hoc access
Complete repository
Historical data
Volatile page 27
1. Experience showed that for a single pass of magnetic tape that scanned 100% of
the records, only ________ of the records were actually required.
5% page 12
30%
50%
80%
1. HOLAP provides a combination of relational database access and cube data
structures within a single framework. The goal is to get the best of both MOLAP
and ROLAP:
scalability and high performance page 78
1. ____________ are created out of the data warehouse to service the needs of
different departments such as marketing, sales etc.
MIS
OLAPs
Data mart page 31
None of the given


Current Papers Spring 2014 Solved By Humda For Final Term Exams

1. Name two unsupervised learning techniques. page no. 270
Answer:
One-way clustering
Two-way clustering

1. Bitmap index: there was a question on run-length encoding in which an input
was given and the output had to be found. Page no. 234
Answer: If we apply run-length encoding on the input 11001100, the output
will be
12#02#12#02
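The answers in these notes write each run as the bit value followed by the run length, with runs separated by "#". A minimal Python sketch of that convention follows; the function name run_length_encode is illustrative and not taken from the handout.

```python
from itertools import groupby

def run_length_encode(bits: str) -> str:
    """Encode a bit string as '<bit><run length>' groups joined by '#'."""
    runs = []
    for bit, group in groupby(bits):
        # groupby yields one group per maximal run of identical characters
        runs.append(f"{bit}{len(list(group))}")
    return "#".join(runs)

print(run_length_encode("11001100"))  # 12#02#12#02
```

Note that this notation becomes ambiguous once a run reaches ten or more bits, because the bit value and a multi-digit count are written side by side.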

1. B-tree vs. hash indexes: the following query was given
SELECT * FROM R WHERE A = 5 page no. 228
We had to state which technique (dense index, sparse index, B-tree index or
bitmap index) would be used, and explain it.
1. We had to identify whether the following statement is correct or incorrect and
give the reason:
Bayesian modeling is an example of unsupervised learning. page no. 270
Answer: Incorrect. Bayesian modeling is an example of supervised learning.


Forward Proxy (2)
Answer: Ch#40 Page no: 369
The type of proxy we are referring to in this discussion is called a forward proxy.
It is outside of our control because it belongs to a networking company or an ISP.
Drawbacks of waterfall model for DWH (3)
First and foremost, the project is likely to occur over an extended period of time,
during which the users may not have had an opportunity to review what will be
delivered.
Second, in today's demanding competitive environment there is a need to produce
results in a much shorter timeframe.
In which scenario we can use waterfall (2)
The model is a linear sequence of activities like requirements definition, system
design, detailed design, integration and testing, and finally operations and
maintenance. The model is used when the system requirements and objectives are
known and clearly specified.
How is the gender guide used?
If gender is missing for a very large number of records, it becomes impossible
for us to manually check each individual's name and identify the
gender. In such cases we can formulate a mechanism to correct gender. We can
either use a standard gender guide or create a new table, Gender_guide.
Gender_guide contains only two columns: name and gender. Populate the
Gender_guide table with a query that selects all distinct first names from the student
table, then manually enter the gender for each name.
This table serves as a guide by telling us the likely gender for a
particular name. For example, if we have a hundred students in our database with
first name equal to Muhammad, then our Gender_guide table will have
just one entry, Muhammad, and we manually set the gender as Male against
it. Now, to fill missing genders in the exception table, we just do an
inner join on the Error table and the Gender_guide table.
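The inner join described above can be sketched with Python's built-in sqlite3 module. The table and column names (error_table, gender_guide, student_id, first_name) and the sample rows are assumptions made for illustration; the handout's actual schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: the real exception table may have more columns.
cur.execute("CREATE TABLE error_table (student_id INTEGER, first_name TEXT, gender TEXT)")
cur.execute("CREATE TABLE gender_guide (name TEXT PRIMARY KEY, gender TEXT)")

cur.executemany(
    "INSERT INTO error_table VALUES (?, ?, ?)",
    [(1, "Muhammad", None), (2, "Ayesha", None), (3, "Muhammad", None), (4, "Kiran", None)],
)
# One manually labelled row per distinct first name.
cur.executemany(
    "INSERT INTO gender_guide VALUES (?, ?)",
    [("Muhammad", "M"), ("Ayesha", "F")],
)

# Inner join: only rows whose first name appears in the guide receive a gender.
filled = cur.execute(
    """
    SELECT e.student_id, e.first_name, g.gender
    FROM error_table AS e
    INNER JOIN gender_guide AS g ON g.name = e.first_name
    WHERE e.gender IS NULL
    ORDER BY e.student_id
    """
).fetchall()
print(filled)  # [(1, 'Muhammad', 'M'), (2, 'Ayesha', 'F'), (3, 'Muhammad', 'M')]
```

Names that are missing from the guide (Kiran in this sample) drop out of the inner join, so they still need manual attention.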
Run-length encoding: inputs were given and the outputs had to be stated.
Run-length encoding is used in bitmap indexing.
Output 1 may be
15#02#18# (meaning 1 occurs 5 times, 0 occurs 2 times and 1 occurs 8 times:
111110011111111)
Output 2 may be
11#01#11#
Output 3 may be
112#012#
Steps of the Kimball approach for the data lifecycle.
Kimball Process: a four-step approach (Business process --> Grain --> Dimensions --> Facts). He defines a business process as a major operational process in the
organization that is supported by some kind of legacy system (or systems). (Read
"Business Dimensional Lifecycle") see page 290
Drawback of traditional web search. Ch: 39 page 351
1. Limited to keyword based matching.
2. Cannot distinguish between the contexts in which a link is used.
3. Coupling of files has to be done manually.
Describe two ways of identifying a session on the World Wide Web.
Identifying the Session
Web-centric data warehouse applications require every visitor session (visit) to
have its own unique identity.
The basic protocol for the World Wide Web, HTTP, is stateless, so session identity
must be established in some other way.
There are several ways to do this
Using Time-contiguous Log Entries
Using Transient Cookies
Using HTTP's secure sockets layer (SSL)
Using session ID Ping-pong
Using Persistent Cookies
MCQs
Execution will be terminated abnormally.... (Quiz 4 file- 2 MCQs)
Kimballs approach ......driven (quiz 4 file-5 mcqs)
Pipeline per increase through..... (Quiz 4 file- 1 mcq)
Selectivity of query in olap... (Queries must be executed in a small number of
seconds.)
star schema simplify ...
Majority of data ...fail if (Majority of projects fail due to the complexity of the
development process.)
ER is .......design (constituted to optimize OLTP performance)
Survival of fittest is.....algorithm (Genetic Algorithms: These are based on the
principle of survival of the fittest. In these techniques, a model is formed to solve
problems having multiple options and many values. Briefly, these techniques are
used to select the optimal solution out of a number of possible solutions.
However, they are not very robust, as they cannot perform well in the presence of noise.)
Shipyard in Kobe developed....... (In 1972 the Mitsubishi Shipyards in Kobe developed
a technique in which customer wants were linked to product specifications via
a matrix format. The technique is known today as The House of Quality and is one of
many techniques of Quality Function Deployment, which can briefly be defined
as a system for translating customer requirements into appropriate company
requirements. The purpose of the technique is to reduce two types of risk. First,
the risk that the product specification does not comply with the wants of the
predetermined target group of customers. Secondly, the risk that the final product
does not comply with the product specification.)
Q: 1 Briefly explain any two types of precedence constraints that we can use
in DTS.
Answer: page 395
Precedence constraints sequentially link tasks in a package. In DTS, you can
use three types of precedence constraints, which can be accessed either through
DTS Designer or programmatically:
Unconditional: If you want Task 2 to wait until Task 1 completes, regardless of
the outcome, link Task 1 to Task 2 with an unconditional precedence constraint.
On Success: If you want Task 2 to wait until Task 1 has successfully completed,
link Task 1 to Task 2 with an On Success precedence constraint.
On Failure: If you want Task 2 to begin execution only if Task 1 fails to execute
successfully, link Task 1 to Task 2 with an On Failure precedence constraint. If
you want to run an alternative branch of the workflow when an error is
encountered, use this constraint.
Q:2 Time complexity of K-means algorithm is O(tkn) what does t,k,and n
represents here?
Page 281
Answer: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations.
Normally, k, t ≪ n.
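To see where the t, k and n factors come from, here is a small sketch of k-means on one-dimensional points that counts distance evaluations. It runs for a fixed number of iterations rather than testing for convergence, and the sample points and starting centroids are made up for illustration.

```python
def kmeans_1d(points, centroids, iterations):
    """Plain k-means on 1-D points.

    Returns the final centroids and the number of distance evaluations.
    """
    centroids = list(centroids)
    distance_evals = 0
    for _ in range(iterations):              # t iterations
        clusters = [[] for _ in centroids]
        for p in points:                     # n objects
            distances = []
            for c in centroids:              # k clusters
                distances.append(abs(p - c))
                distance_evals += 1
            nearest = distances.index(min(distances))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(members) / len(members) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, distance_evals

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # n = 6
final, evals = kmeans_1d(points, centroids=[0.0, 5.0], iterations=4)  # k = 2, t = 4
print(final, evals)  # [2.0, 11.0] 48
```

The count of 48 is exactly t * k * n = 4 * 2 * 6, which is the work the O(tkn) bound describes.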
Q: 3 What problems will you face if low priority is given to cube
construction?
Answer: page 313
Low priority for OLAP Cube Construction: Make sure your OLAP cube-building
or pre-calculation process is optimized and given the right priority. It is common
for the data warehouse to be on the bottom of the nightly batch loads, and after
loading the DWH, usually there isn't much time left for the OLAP cube to be
refreshed. As a result, it is worthwhile to experiment with the OLAP cube
generation paths to ensure optimal performance.
Q: 4 List down any two parallel software architectures.
Answer: Shared Memory, Shared Disk and Shared Nothing
Q: 5 What is unsupervised learning in Data mining?
Answer: page 27
In unsupervised learning you don't know the number of clusters and
have no idea about their attributes either. In other words, you are not guiding
the DM process in any way: no guidance and no input.
Unsupervised learning is closer to the exploratory spirit of Data Mining as
stressed in the definitions given above. In unsupervised learning situations all
variables are treated in the same way; there is no distinction between explanatory
and dependent variables. However, in contrast to the name undirected data mining,
there is still some target to achieve. This target might be as general as data
reduction or more specific like clustering. For unsupervised learning, typically
either the target variable is unknown or it has only been recorded for too small a number of
cases.
Q: 6 Which scripting languages are used to perform complex transformations
in DTS packages?
Answer: Microsoft SQL Server provides graphical tools to build DTS packages.
These tools provide good support for transformations. Complex transformations
are achieved through VB Script or Java Script that is loaded in DTS package.
Package can also be programmed by using DTS object model instead of using
graphical tools but DTS programming is rather complicated.
Q: 7 "Dense index consists of a number of bit vectors." Justify it.
Answer: Dense index: every key in the data file is represented in the index file.
Bitmap index: each record is (Value, Bit Vector). The bit vector has one bit for every record in
the file; the ith bit of the bit vector is set iff record i has Value in the given column. Bit
vectors are typically compressed and are converted to sets of rids during query evaluation.
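The (Value, Bit Vector) layout can be illustrated with a short sketch. The column contents and the helper name build_bitmap_index are hypothetical, and real systems store compressed bit vectors rather than Python lists.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Return {value: bit vector}, where bit i is 1 iff row i holds that value."""
    index = defaultdict(lambda: [0] * len(column))
    for i, value in enumerate(column):
        index[value][i] = 1
    return dict(index)

gender = ["M", "F", "F", "M", "F"]   # hypothetical indexed column
bitmaps = build_bitmap_index(gender)
print(bitmaps["F"])  # [0, 1, 1, 0, 1]

# Converting a bit vector back to row ids (rids) at query time.
rids = [i for i, bit in enumerate(bitmaps["F"]) if bit]
print(rids)  # [1, 2, 4]
```

Every row sets exactly one bit across all the vectors of the column, which is why each key value in the file is represented in the index.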
Q: 8 It is essential to have a subject-matter expert as part of the data modeling team.
What will be the implication if such an expert is not present in the organization?
Answer: It is essential to have a subject-matter expert as part of the data modeling
team. This person can be an outside consultant or can be someone in-house with
extensive industry experience. Without this person, it becomes difficult to get a
definitive answer on many of the questions, and the entire project gets dragged
out, as the end users may not always be available.


Suppose there is a large enterprise which uses the same server for the
development and production environments. What problems can arise if it
uses single server for both purposes? 5m
To save capital, often data warehousing teams will decide to use only a single
database and a single server for the different environments i.e. development and
production. Environment separation is achieved by either a directory structure or
setting up distinct instances of the database.
This is awkward for the following reasons:
Sometimes the server needs to be rebooted for the
development environment. Having a separate development environment will
prevent the production environment from being affected by this.
There may be interference while having different database environments on a
single server. For example, multiple long queries running in the
development environment could affect the performance of the production environment, as both
run on the same server.
Write down any two drawbacks if Date is stored in text format rather than
using proper date format like dd-MMM-yy etc. 5m
In context of Web data warehousing, consider the web page dimension, list
at least five possible attributes of this dimension. 5m
Page key
Page source
Page function
Page template
Item type
Graphic type
Animation type
Sound type
Page file name
There are different data mining techniques e.g. clustering, description
etc. Each of the following statement corresponds to some data mining
technique. For each statement name the technique the statement corresponds
to. 5m
a) Assigning customers to predefined customer segments (i.e. good vs.
bad): Classification
b) Assigning credit applicants to predefined classes (i.e. low, medium, or high
risk): Classification
c) Guessing how much customers will spend during next 6 months: Prediction
d) Building a model and assigning a value from 0 to 1 to each member of the set.
Then classifying the members into categories based on a threshold
value: Estimation
e) Guessing how many students will score more than 65% grades in
midterm: Prediction
Specify at least one implication if you don't provide proper documentation
as part of data warehouse development. 3m
Usually by this time most, if not all, of the developers will have left the project, so
it is essential that proper documentation is left for those who are handling
production maintenance. There is nothing more frustrating than staring at
something another person did, yet unable to figure it out due to the lack of proper
documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is
another phase of the data warehouse planned, start on that as soon as possible.
In context of nested loop join, mention two guidelines for selecting a table as
inner table. 3m
For a Nested-Loop join, inner and outer tables are determined as follows (page 242):
The outer table is usually the one that has:
The smallest number of qualifying rows, and/or
The largest numbers of I/Os required to locate the rows.
The inner table usually has:
The largest number of qualifying rows, and/or
The smallest number of reads required to locate rows
We can identify the session in the World Wide Web by using time-contiguous
log entries; however, there are some limitations of this technique. Briefly
explain any two limitations. 3m
Answer: In many cases, the individual hits comprising a session can be consolidated
by collating time-contiguous log entries from the same host (Internet Protocol, or IP, address). If the
log contains a number of entries with the same host ID in a short period of time
(for example, one hour), one can reasonably assume that the entries are for the
same session.
Limitations: This method breaks down for visitors from large ISPs because
different visitors may reuse dynamically assigned IP addresses over a brief time
period.
Different IP addresses may be used within the same session for the same visitor.
This approach also presents problems when dealing with browsers that are
behind some firewalls.
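A rough sketch of consolidating time-contiguous log entries is given below. It treats the "short period of time" as a maximum gap between consecutive hits from the same host, which is one possible reading of the description; the function name sessionize, the one-hour default and the sample hits are all assumptions.

```python
def sessionize(log_entries, gap_seconds=3600):
    """Group (host, timestamp) hits into sessions per host.

    A new session starts when the gap since the host's previous hit
    exceeds gap_seconds (one hour by default).
    """
    sessions = {}  # host -> list of sessions, each a list of timestamps
    for host, ts in sorted(log_entries, key=lambda e: (e[0], e[1])):
        host_sessions = sessions.setdefault(host, [])
        if host_sessions and gap_seconds >= ts - host_sessions[-1][-1]:
            host_sessions[-1].append(ts)   # continue the current session
        else:
            host_sessions.append([ts])     # start a new session
    return sessions

hits = [("10.0.0.1", 0), ("10.0.0.1", 120), ("10.0.0.2", 60), ("10.0.0.1", 9000)]
print(sessionize(hits))
# {'10.0.0.1': [[0, 120], [9000]], '10.0.0.2': [[60]]}
```

The sketch keys sessions purely on the host address, so it inherits the limitations listed above: two visitors sharing a reused IP address are merged, and one visitor whose address changes is split.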
Identify the given statement as correct or incorrect and justify your answer
in either case.
"The problem of Referential Integrity always occurs in traditional OLTP
system as well as in DWH". 3m
Answer: Incorrect. While doing total quality measurement, you measure RI every week (or
month) and hopefully the number of orphan records will go down, as you
fine-tune the processes to get rid of the RI problems. Remember, the RI
problem is peculiar to a DWH; it will not happen in a traditional OLTP system.
There are two primary techniques for gathering requirements i.e. interviews
or facilitated sessions. Which technique is preferred by Ralph Kimball? 2m
Both have their advantages and disadvantages. Interviews encourage lots of
individual participation. They are also easier to schedule. Facilitated sessions may
reduce the elapsed time to gather requirements, although they require more time
commitment from each participant. Kimball prefers using a hybrid approach with
interviews to gather the gory details and then facilitation to bring the group to
consensus.
List down any two Parallel Software Architectures? 2m
Brief Intro to Parallel Processing:
Parallel Hardware Architectures
Symmetric Multi-Processing (SMP)
Distributed Memory or Massively Parallel Processing (MPP)
Non-uniform Memory Access (NUMA)
Parallel Software Architectures
Shared Memory
Shared Disk
Shared Nothing
Types of parallelism
Data Parallelism
Spatial Parallelism
List down any four Static Attributes recorded by the scouts in Agriculture
Data Warehouse Case Study. 2m

Static attributes     Dynamic attributes
Farmer name           Date of visit
Farmer address        Pest population
Field acreage         CLCV
Variety sown          Predator population
Sowing date           Pesticide spray dates
Sowing method         Pesticides used

List down any four issues of Click stream Data. 2m
Issues of Click stream Data: (Page#341)
Click stream data has many issues:
Identifying the Visitor Origin
Identifying the Session
Identifying the Visitor
Proxy Servers
Browser Caches
Subjective:
1. What is Web Data Warehouse? (2 marks)
Answer: Page no: 350 Chapter: 39
Web warehousing can be used to mine the huge web content for
information of interest. It's like searching for the golden needle in the haystack.
A second reason for Web warehousing is to analyze the huge web traffic. This can be
of interest to Web site owners, for e-commerce, for e-advertisement and so on.
Last but not least, Web warehousing is used to archive the huge web content
because of its dynamic nature.
3. Write first two phases of Kimball's Approach of business dimensional
lifecycle. (2 marks)
Answer: Kimball also proposes a four-step approach where he starts by choosing a
business process, takes the grain of the process, and chooses dimensions and
facts. He defines a business process as a major operational process in the
organization that is supported by some kind of legacy system (or systems).
4. There are four categories of data quality improvement. Write any two. (2
marks)
Ans. The four categories of Data Quality Improvement
Process
System
Policy & Procedure
Data Design


1. Data profiling is a process which involves gathering of information. What are
the purposes that it must fulfill? (3 marks)
Answer: Data profiling is a process which involves gathering information
about a column through execution of certain queries, with the intention of identifying
erroneous records. In this process we identify the following:
Total number of values in a column
Number of distinct values in a column
Domain of a column
Values out of domain of a column
Validation of business rules
We run different SQL queries to get the answers to the above questions. During this
process we can identify the erroneous records. Whenever we come across an
erroneous record, we just copy it into the error or exception table and set the dirty
bit of the record in the actual student table. Then we correct the exception table.
After this profiling process we transform the records and load them into a
new table, Student_Info.
Ref: Handout Page No. 354
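The kind of profiling queries meant here can be sketched with Python's built-in sqlite3 module. The student table layout, the column names st_id and gender, the sample rows and the assumed valid domain ('M', 'F') are all illustrative, not taken from the handout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (st_id INTEGER, gender TEXT)")  # hypothetical schema
cur.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [(1, "M"), (2, "F"), (3, "m"), (4, None), (5, "X"), (6, "F")],
)

# Total number of (non-NULL) values in the column.
total_values = cur.execute("SELECT COUNT(gender) FROM student").fetchone()[0]

# Number of distinct values in the column.
distinct_values = cur.execute("SELECT COUNT(DISTINCT gender) FROM student").fetchone()[0]

# Observed domain of the column.
domain = [row[0] for row in cur.execute(
    "SELECT DISTINCT gender FROM student WHERE gender IS NOT NULL ORDER BY gender")]

# Values outside the assumed valid domain ('M', 'F'), including missing ones.
out_of_domain = cur.execute(
    "SELECT st_id, gender FROM student "
    "WHERE gender IS NULL OR gender NOT IN ('M', 'F') ORDER BY st_id").fetchall()

print(total_values, distinct_values, domain, out_of_domain)
```

The rows returned by the last query are the candidates to copy into the exception table.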
7. Apply Run length encoding on the given code and write output. (3 marks)
Case-I:

1111111110000111
Answer: 19#04#13

Case-II:

00001111000000

Answer: 04#14#06
8. Identify the given statement as correct or incorrect and justify your
answer in either case. (3 marks)
"One-way clustering is used to get local view and Two-way clustering is used to
get global view."
Answer: Incorrect
One-way clustering gives global view and bi-clustering gives local view
9. A pilot project strategy is highly recommended in data warehouse. What
are the reasons for its recommendation? (5 marks)
Answer: A pilot project strategy is highly recommended in data warehouse
construction, as a full blown data warehouse construction requires significant
capital investment, effort and resources. Therefore, the same must be attempted
only after a thorough analysis, and a valid proof of concept. A small scale project
in this regard serves many purposes such as (i) show users the value of DSS
information, (ii) establish blueprint processes for the later full-blown project, (iii)
identify problem areas and (iv) reveal true data demographics. Hence doing a
pilot project on a small scale seems to be the best strategy.
10. Data acquisition and cleansing. (5 marks)
The pest scouting sheets are larger than A4 size (8.5 x 11), hence the right end
was cropped when scanned on a flat-bed A4 size scanner.
The right part of the scouting sheet is also the most troublesome, because of
pesticide names for a single record typed on multiple lines i.e. for multiple
farmers.
As a first step, OCR (Optical Character Reader) based image to text
transformation of the pest scouting sheets was attempted. But it did not work even
for relatively clean sheets with very high scanning resolutions.
Subsequently DEOs (Data Entry Operators) were employed to digitize the
scouting sheets by typing.
Data cleansing and standardization is probably the largest part in an ETL exercise.
For Agri-DWH major issues of data cleansing had arisen due to data processing
and handling at four levels by different groups of people i.e.
(i) Hand recordings by the scouts at the field level
(ii) Typing hand recordings into data sheets at the DPWQCP office
(iii) Photocopying of the scouting sheets by DPWQCP personnel and finally
(iv) Data entry or digitization by hired data entry operators.

12. A table was given containing Name, Items, Time and Gender, along with the
following statement. (5 marks)
IF
Items/Time >= 6
Then
Gender= F
else
Gender = M
a) Find the accuracy % of given data.
b) If Name: Ali, Items: 2, time: 14 then find the gender of Ali.
Answer: page 278
The model in our case is a rule that if the per-item minutes for any customer is
greater than or equal to 6 then the customer is female, else male.
The above rule is based on the common notion that females spend more time
during shopping than male customers. Exceptions can be there and are treated as
outliers.
For the first record the ratio is greater than 6, meaning that our model will
assign it to the female class, but that may be an exception or noise. The second
and the third records are as per the rule. Thus, the accuracy of our model is 2/3, i.e.
66%. In other words we can say the confidence level of our classification model is
66%. The accuracy may change as we add more data. Now unseen data is brought
into the picture. Suppose there is a record with name Firdous, time spent 15
minutes and 1 item purchased. We predict the gender by using our classification
model and as per our model the customer is assigned F (15/1 = 15, which is
greater than 6).
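The rule and the accuracy calculation can be sketched as below, following the answer's reading of the rule as minutes per item (time divided by items). The three training records are hypothetical, since the table from the exam is not reproduced in these notes; they are chosen only so that one of three is misclassified, as in the answer.

```python
def predict_gender(time_minutes, items, threshold=6):
    """Rule from the notes: minutes per item >= threshold gives 'F', else 'M'."""
    return "F" if time_minutes / items >= threshold else "M"

# Hypothetical records: (name, items, time in minutes, actual gender).
records = [
    ("A", 1, 7, "M"),    # ratio 7.0, predicted F, actually M (misclassified)
    ("B", 2, 14, "F"),   # ratio 7.0, predicted F
    ("C", 5, 10, "M"),   # ratio 2.0, predicted M
]
correct = sum(predict_gender(t, i) == g for _, i, t, g in records)
print(f"{correct}/{len(records)} correct")  # 2/3 correct

print(predict_gender(15, 1))  # Firdous: 15 / 1 = 15, so F
print(predict_gender(14, 2))  # Ali: 14 / 2 = 7, so F under this reading of the rule
```

If the rule were instead applied literally as Items/Time, Ali's ratio would be 2/14, so part (b) depends on which ratio the examiner intended.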

Subjective:
1. Write 4 partitioning types of shared nothing in Parallel Software
Architecture?
Answer:
Shared nothing RDBMS architecture requires a static partitioning of each table in
the database.
How do you perform the partitioning?
Hash partitioning
Key range partitioning
List partitioning
Round-Robin
Combinations (Range-Hash & Range-List)
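As a rough illustration of how each scheme maps a row to a node, here is a small sketch. The function names, the integer keys, the range bounds and the city lists are all made up for the example; a real RDBMS uses its own hash function and partition catalog.

```python
def hash_partition(key: int, nodes: int) -> int:
    """Hash partitioning: node chosen by a hash of the key (here simply key mod nodes)."""
    return key % nodes

def range_partition(key: int, upper_bounds: list) -> int:
    """Key-range partitioning: first node whose upper bound covers the key."""
    for node, bound in enumerate(upper_bounds):
        if bound >= key:
            return node
    return len(upper_bounds)  # keys above every bound go to a final overflow node

def list_partition(value: str, value_lists: dict) -> int:
    """List partitioning: each node owns an explicit list of values."""
    for node, values in value_lists.items():
        if value in values:
            return node
    raise KeyError(value)

def round_robin_partition(row_number: int, nodes: int) -> int:
    """Round-robin: rows are dealt to nodes in turn, ignoring their content."""
    return row_number % nodes

print(hash_partition(10, 3))                      # 1
print(range_partition(25, [10, 20, 30]))          # 2
print(list_partition("Multan", {0: ["Lahore"], 1: ["Multan", "Bahawalpur"]}))  # 1
print(round_robin_partition(4, 3))                # 1
```

A combination such as Range-Hash would apply range_partition first and then hash_partition within the chosen range.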

2. What is Web data warehouse? (Answer in current solution file)
3. Variants of nested-loop?
Answer:
Nested-Loop Join: Variants
1. Naive nested-loop join
2. Index nested-loop join
3. Temporary index nested-loop join
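The difference between the variants can be sketched on small in-memory tables. The customers and orders rows are hypothetical, and a Python dict stands in for a database index; building that dict just before the join corresponds to the temporary index variant, while reusing one that already exists corresponds to the index nested-loop variant.

```python
def naive_nested_loop_join(outer, inner, key):
    """Compare every outer row with every inner row."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]

def build_index(rows, key):
    """Build a lookup from key value to matching rows (stands in for an index)."""
    index = {}
    for row in rows:
        index.setdefault(row[key], []).append(row)
    return index

def index_nested_loop_join(outer, inner_index, key):
    """For each outer row, probe the inner table's index instead of scanning it."""
    return [(o, i) for o in outer for i in inner_index.get(o[key], [])]

customers = [{"cust_id": 1, "name": "Ali"}, {"cust_id": 2, "name": "Sara"}]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 1},
]

naive = naive_nested_loop_join(customers, orders, "cust_id")
indexed = index_nested_loop_join(customers, build_index(orders, "cust_id"), "cust_id")
print(len(naive), naive == indexed)  # 3 True
```

The naive version performs len(outer) * len(inner) comparisons, whereas the indexed version does one probe per outer row.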
4. Is there any strategy to standardize a column?
Answer: page 480
There are no fixed strategies to standardize the columns.
5. Dynamic attributes of agri data warehouse (answer in current solution file)



6. Write 2 limitations of persistent cookies
Answer: Limitations
It's possible that the visitor will have his or her browser set to refuse cookies or
may clean out his or her cookie file manually, so there is no absolute guarantee
that even a persistent cookie will survive.
Although any given cookie can be read only by the Web site that caused it to be
created, certain groups of Web sites can agree to store a common ID tag that
would let these sites combine their separate notions of a visitor session into a
super session.
7. As the number of processors increases, the speedup should also increase.
Thus theoretically there should be a linear speedup; however, this is not the
case in reality. List at least 2 barriers to linear speedup.
Answer:
Amdahl's Law
Startup
Interference
Skew
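Amdahl's Law, the first barrier listed, can be made concrete with a short calculation. The function below uses the standard form of the law, speedup = 1 / (f + (1 - f) / N), where f is the fraction of work that must run sequentially and N is the number of processors; the sample values are illustrative.

```python
def amdahl_speedup(sequential_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's Law for a given sequential fraction."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / processors)

# With 10% sequential work, 10 processors give well under a 10x speedup.
print(round(amdahl_speedup(0.1, 10), 2))    # 5.26
# Even with very many processors the speedup stays below 1 / 0.1 = 10.
print(round(amdahl_speedup(0.1, 1000), 2))  # 9.91
# Only a fully parallel workload achieves linear speedup.
print(amdahl_speedup(0.0, 8))               # 8.0
```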

8. In context of nested loop join, mention two guidelines for outer table.
(answer in current solution file)
9. Before sitting down with the business community to gather
information, it is suggested to set yourself up for a productive session. Write three
activities of the requirements preplanning phase.
Answer:
Requirements preplanning: This phase consists of activities like choosing the
forum, identifying and preparing the requirements team and finally selecting,
scheduling and preparing the business representatives.

Best of Luck