Data Warehousing Interview Questions and Answers
1. What is Data Warehousing?
A data warehouse is an Integrated, Subject-Oriented, Non-Volatile, Time-Variant database that provides support for decision making.
Integrated: The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization — multiple sources, diverse sources, diverse formats.
Subject Oriented: Data is arranged and optimized to answer questions from diverse functional areas. Data is organized and summarized by topic, e.g. Sales, Marketing, Finance, Distribution.
Time Variant: The data warehouse represents the flow of data through time and can contain projected data from statistical models. Data is periodically loaded, and then time-dependent data is recomputed.
Non Volatile: Once data is entered it is never removed. The warehouse represents the entire company's history; near-term history is continuously added, so it is always growing and must support terabyte-scale databases and multiprocessor hardware.
2. What is a Data Mart?
A subset of the data warehouse, or a small data store. According to business requirements, the data warehouse is divided into data marts. Ex. Territory, Purchase, Sales.
3. Types of Data Marts: a) Dependent Data Mart b) Independent Data Mart.
Dependent Data Mart: Follows the top-down approach; data marts are derived from the data warehouse, so data is captured from an existing enterprise data warehouse.
Independent Data Mart: Data marts are created without a data warehouse; they do not depend on one, and data is captured directly from transaction processing systems.
4. Types of Dimensional Models: a) Star Schema b) Snowflake Schema c) Multi-star Model.
5. What is a Star Schema? A fact table surrounded by dimension tables; in other words, one fact table joined to several dimension tables.
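The star schema described above can be sketched with a few SQL statements. This is a minimal illustration using SQLite; the table and column names (dim_product, dim_store, fact_sales) are invented for the example, not taken from any particular warehouse.

```python
# Minimal star schema: one fact table (fact_sales) surrounded by
# dimension tables (dim_product, dim_store).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables hold descriptive attributes.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT)")
# The fact table holds numeric measures plus a foreign key to each dimension.
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "Widget"), (2, "Gadget")])
cur.executemany("INSERT INTO dim_store VALUES (?, ?)",
                [(10, "Pune"), (20, "Mumbai")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 10, 5, 50.0), (1, 20, 3, 30.0), (2, 10, 2, 40.0)])

# A typical star-schema query: join the fact table to a dimension and aggregate.
rows = cur.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name ORDER BY p.name""").fetchall()
print(rows)  # [('Gadget', 40.0), ('Widget', 80.0)]
```

Note that the fact table itself contains no descriptive text, only keys and measures — that separation is what makes the shape a "star".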
6. Snowflake Schema: Further normalization of the dimension tables in a star schema results in a snowflake schema.
7. Top-Down Approach: Data marts are derived from the data warehouse.
8. Bottom-Up Approach: The data warehouse is created from data marts; that is, the warehouse is built by merging data marts. Note: The top-down approach is usually preferred, because in the bottom-up approach the number of ETL processes increases, so there is more chance of errors.
9. Star Schema Grouping: To get a conformed dimension we group two or more star schemas.
10. Conformed Dimension: A dimension table used in more than one star schema is called a conformed dimension.
11. Types of Tables: Dimension Table and Fact Table.
Dimension Table: Contains attributes and levels. It holds descriptive information for the numerical values in the fact table and carries the key attributes of the facts. Ex. Time dimension: Year > Quarter > Month > Week > Day.
Fact Table: Contains numerical values (measures) and dimension IDs, i.e. the keys that associate it with the dimension tables. Ex. Sales: Order ID, Product ID, Customer ID, Store ID, Quantity, Amount.
12. Types of Dimensions: Conformed Dimension, Degenerate Dimension, Junk Dimension.
13. What is the use of star schema grouping? Star schema groups can facilitate multiple-fact reporting by indicating how to use regular dimensions that are common to many measure dimensions. Multiple-fact reporting is also known as cross-functional reporting.
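A conformed dimension can be shown concretely: here one date dimension is shared by fact tables from two different star schemas, so results line up on a common axis (the cross-functional reporting mentioned above). Table names (dim_date, fact_sales, fact_shipments) and the data are illustrative.

```python
# One conformed dimension (dim_date) shared by two fact tables.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER)")
cur.execute("CREATE TABLE fact_sales (date_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE fact_shipments (date_id INTEGER, units INTEGER)")

cur.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, 2023), (2, 2024)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
cur.executemany("INSERT INTO fact_shipments VALUES (?, ?)", [(1, 10), (2, 20)])

# Both fact tables aggregate over the same conformed date dimension,
# so sales and shipments can be compared year by year in one report.
report = cur.execute("""
    SELECT d.year,
           (SELECT SUM(amount) FROM fact_sales s WHERE s.date_id = d.date_id),
           (SELECT SUM(units)  FROM fact_shipments h WHERE h.date_id = d.date_id)
    FROM dim_date d ORDER BY d.year""").fetchall()
print(report)  # [(2023, 100.0, 10), (2024, 250.0, 20)]
```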
Mail id: [email protected] [email protected] Skype: [email protected]
14. Junk Dimension / Garbage Dimension: A junk (garbage) dimension consists of low-cardinality columns such as flags, indicators, codes and statuses. Attributes in a junk dimension are not related to any hierarchy.
15. Conformed Dimension: A conformed dimension means the same thing to each fact table to which it can join. A more precise definition: two dimensions are conformed if they share one, more than one, or all attributes drawn from the same domain.
16. Degenerate Dimension: A dimension that does not have a primary key of its own; the primary keys of other dimensions act as its keys. For example, a Sales record has Product ID, Customer ID and Territory ID; each of these is a primary key in its respective dimension table, but Sales itself has no primary key — those IDs serve as its keys.
17. Types of Measures or Facts:
Additive Facts: Facts that can be added across every dimension in the fact table; the most common type. These facts are summed across several dimensions.
Semi-Additive Facts: Facts that can be added across some dimensions but not all, e.g. head count and quantity on hand.
Non-Additive Facts: Facts that cannot be added across any dimension; they cannot be logically added between records.
18. Surrogate Keys: Surrogate keys are keys maintained within the data warehouse instead of the natural keys taken from the source systems. The same natural key may be used by different instances of the same entity, and the major problem arises when consolidating information from various source systems. We cannot rely on the natural primary keys of a source system as dimension primary keys because there is no guarantee that they will be unique for each instance.
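The surrogate-key problem described in item 18 can be sketched as follows: two source systems reuse the same natural key for different customers, so the warehouse assigns its own generated key and keeps the natural key only as an attribute. Source-system names and data are invented for the example.

```python
# Surrogate-key assignment: the warehouse generates its own key
# (customer_sk) instead of trusting natural keys from source systems.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    source      TEXT,   -- which source system the row came from
    natural_key TEXT,   -- key as known in that source system
    name        TEXT)""")

# Both sources use natural key 'C1' for *different* customers -- the very
# collision that makes natural keys unusable as dimension primary keys.
incoming = [("crm", "C1", "Asha"), ("erp", "C1", "Ravi"), ("crm", "C2", "Meera")]
cur.executemany(
    "INSERT INTO dim_customer (source, natural_key, name) VALUES (?, ?, ?)",
    incoming)

rows = cur.execute(
    "SELECT customer_sk, source, natural_key FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(rows)  # [(1, 'crm', 'C1'), (2, 'erp', 'C1'), (3, 'crm', 'C2')]
```

Each row gets a unique warehouse key even though the natural keys collide, so fact tables can safely reference `customer_sk`.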
19. Granularity: The level of summarization of a data element. Granularity may be defined as the level of detail made available in the dimensional model; it refers to the level of detail or summarization of the unit of data in the data warehouse. More detail means a lower level of granularity; less detail means a higher level. It is therefore better to maintain data at the lower level, so that users can drill down from higher levels to lower levels.
20. Slowly Changing Dimension: A slowly changing dimension is a dimension whose attribute or attributes for a record (row) change slowly over time.
Type 1 (current data only): The Type 1 approach overwrites the existing dimension row with the new data, so no history is preserved; only current data is maintained. This may be the best approach when the attribute change is simple, such as a spelling correction: if the old value was wrong, it may not be critical that history is lost.
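The Type 1 behaviour above is just an in-place UPDATE. A minimal sketch (table and names invented for the example):

```python
# SCD Type 1: a spelling correction overwrites the dimension row
# in place -- no history row is kept.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO dim_customer VALUES (1, 'Jhon Smith')")  # misspelled

# Type 1: overwrite in place; the old (wrong) value is gone forever.
cur.execute("UPDATE dim_customer SET name = 'John Smith' WHERE customer_sk = 1")

rows = cur.execute("SELECT * FROM dim_customer").fetchall()
print(rows)  # [(1, 'John Smith')] -- still one row, no history preserved
```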
Type 2 (current data + full history): The Type 2 approach adds a new dimension row for the changed attribute, and therefore preserves history. Because it adds a new row for every attribute change, it can significantly increase database size. The current row is identified by a version number, a new/old flag, or an effective date range.
Type 3 (current data + one previous value): The Type 3 approach keeps the prior value in an extra column and is used only when there is a limited need to preserve and accurately describe history.
21. Types of OLAP:
ROLAP (Relational OLAP): Multi-dimensional analysis using a multidimensional view of relational data; a relational database is the underlying data structure. Multi-dimensional analysis means analyzing data along several dimensions, e.g. analyzing revenue by product, store and date.
MOLAP (Multi-dimensional OLAP): OLAP that uses a multidimensional database as its data structure.
HOLAP (Hybrid OLAP): Combines the two approaches.
22. Factless Fact Table: A fact table with only foreign keys and no facts (measures) is called a factless fact table. The foreign keys can be used for counting purposes.
23. Hierarchy: A hierarchy defines relationships between the attributes of a dimension, identifying the different levels that exist within it. Data is maintained at levels, and moving from one level to another is what the hierarchy describes. Members of a dimension can be arranged into one or more hierarchies, and each hierarchy can have multiple levels. The relationship from one level to the next is 1 to n; for example, a year has n quarters.
24. OLTP (ONLINE TRANSACTION PROCESSING): OLTP supports ER modeling. OLTP applications perform INSERT, UPDATE and DELETE operations against the database and typically keep about one year of data. Data is normalized. OLTP is the traditional term for the transactions that carry out day-to-day business functions, as in ERP and CRM systems. OLTP solves critical business problems but is not designed for analysis and queries.
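The Type 2 approach above can be sketched as "expire the current row, then insert a new current row". This uses an is_current flag plus effective dates; all names and dates are illustrative.

```python
# SCD Type 2: each attribute change adds a new dimension row,
# preserving the full history.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
    natural_key TEXT, city TEXT,
    eff_from TEXT, eff_to TEXT, is_current INTEGER)""")
cur.execute("INSERT INTO dim_customer (natural_key, city, eff_from, eff_to, is_current) "
            "VALUES ('C1', 'Pune', '2023-01-01', '9999-12-31', 1)")

def change_city(natural_key, new_city, change_date):
    # Expire the current row by closing its effective date range...
    cur.execute("""UPDATE dim_customer SET eff_to = ?, is_current = 0
                   WHERE natural_key = ? AND is_current = 1""",
                (change_date, natural_key))
    # ...and insert a new current row carrying the changed attribute.
    cur.execute("""INSERT INTO dim_customer
                   (natural_key, city, eff_from, eff_to, is_current)
                   VALUES (?, ?, ?, '9999-12-31', 1)""",
                (natural_key, new_city, change_date))

change_city("C1", "Mumbai", "2024-06-01")

history = cur.execute(
    "SELECT city, eff_from, eff_to, is_current FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(history)
# [('Pune', '2023-01-01', '2024-06-01', 0),
#  ('Mumbai', '2024-06-01', '9999-12-31', 1)]
```

Facts recorded while the customer lived in Pune keep pointing at the old surrogate key, which is how history stays accurate.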
In an OLTP database there is detailed, current data.
25. OLAP (ONLINE ANALYTICAL PROCESSING): OLAP supports dimensional modeling. OLAP applications typically perform only INSERT operations (data is loaded, not updated in place) and can keep any number of years of data. Data is denormalized. For analyzing data, OLAP provides four types of operation: roll-up (aggregating to a higher level), drill-down (moving to a lower level of detail), slice (cutting the cube along one dimension) and dice (selecting a sub-cube). We can analyze the data from more than one dimensional point of view. OLAP gives fast response times for ad hoc queries; it was designed to serve analysis and queries efficiently, in contrast to real-time OLTP. OLAP uses a multi-dimensional model, with the primary purpose of running
complex analytical and ad hoc queries. In an OLAP database there is aggregated, historical data, stored in a multidimensional schema (usually a star schema).
26. Difference between OLTP and OLAP:
- OLTP systems are the original source of data; OLAP data comes from the various OLTP databases.
- OLTP runs fundamental business tasks; OLAP provides support for decision making.
- OLTP sees short, fast inserts and updates initiated by end users; OLAP data is refreshed by long-running batch jobs.
- OLTP processing speed is typically very fast; OLAP speed depends on the amount of data.
- OLTP is highly normalized with many tables; OLAP is denormalized with fewer tables.
- OLTP reports run on low volumes of data and return fewer records; OLAP reports run on huge volumes of data.
- OLTP reveals a snapshot of ongoing business processes; OLAP gives multi-dimensional views of various kinds of business activity.
- OLTP handles simple queries returning relatively few records; OLAP handles aggregation queries.
- An OLTP database contains detailed, current data; an OLAP database contains historical data and aggregation structures.
27. ODS (Operational Data Store): Many business processes require very fast response times but do not have access to the data warehouse. When sub-second response time is required and integrated data must be accessed, the structure to go to is the ODS (Operational Data Store), which supports high-performance processing on integrated data.
28. Alternative Hierarchy: A dimension can have more than one hierarchy over its attributes, e.g. a calendar-year hierarchy and a fiscal-year hierarchy on the date dimension.
29. Normalization: Organizing data to minimize redundancy, typically by splitting it into many related tables (characteristic of OLTP).
30. Denormalization: Deliberately adding redundancy by combining tables to reduce joins and speed up reads (characteristic of OLAP and dimensional models).
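The roll-up and drill-down operations described under OLAP (item 25) correspond to aggregating the same facts at different levels of granularity (item 19). A minimal sketch over an invented year/quarter sales table:

```python
# Roll-up vs drill-down as GROUP BY at different granularity levels.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales (year INTEGER, quarter TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (2024, "Q1", 10.0), (2024, "Q2", 20.0),
    (2025, "Q1", 30.0), (2025, "Q2", 40.0)])

# Drill-down: the detailed view at the lower (quarter) level of granularity.
by_quarter = cur.execute(
    "SELECT year, quarter, SUM(amount) FROM sales GROUP BY year, quarter "
    "ORDER BY year, quarter").fetchall()

# Roll-up: the same facts aggregated to the higher (year) level.
by_year = cur.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year").fetchall()

print(by_year)  # [(2024, 30.0), (2025, 70.0)]
```

Because the facts are stored at the lowest level (quarter), both views can be computed; had the data been stored only per year, drilling down to quarters would be impossible — which is why item 19 recommends keeping the lower level of granularity.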