Unit II
7) Explain the different ways of implementing a basic audit in the logging phase.
Every time a process is executed, this logging allows you to record everything that occurs to a central file, as illustrated in the sketch at the end of this answer.
1. Process Tracking:
i. Normally you build a controlled, systematic, and independent examination of the process for hardware logging.
ii. There are numerous server-based software tools that monitor the temperature sensors, voltage, fan
speeds, and load and clock speeds of a computer system.
2. Data Provenance:
i. Keep records for every data entity in the data lake by tracking it through all the transformations
in the system.
ii. This ensures that you can reproduce the data in the future, if needed, and supplies a detailed
history of the data's source in the system.
3. Data Lineage:
i. This involves keeping records of every change that happens to the individual data values in the
data lake.
ii. This enables you to know what the exact value of any data record was in the past.
iii. It is normally achieved by a valid-from and valid-to audit entry for each data set in the data
science environment.
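As a rough illustration of these mechanisms, the following Python sketch logs process executions to a hypothetical central file and records an example valid-from/valid-to lineage entry; the file name, entity, and fields are assumptions for the example, not a prescribed format:

import logging
from datetime import datetime, timezone

# Process tracking: log every process execution to one central file.
logging.basicConfig(
    filename="audit.log",  # hypothetical central audit file
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("Process 'load_customers' started")

# Data lineage: a valid-from/valid-to audit entry for one data value,
# with a source field supplying the data provenance.
lineage_entry = {
    "entity": "customer.address",  # which data value changed
    "value": "12 Main Street",
    "valid_from": datetime(2023, 1, 1, tzinfo=timezone.utc).isoformat(),
    "valid_to": None,  # None means this is still the current value
    "source": "crm_extract.csv",  # provenance: where the value came from
}
logging.info("Lineage entry recorded: %s", lineage_entry)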
8) List and explain the data structures of the functional layer of the ecosystem.
i. The functional layer of the data science ecosystem is the largest and most essential layer for
programming and modeling.
ii. It consists of several structures as follows:
1. Data schemas and data formats: Functional data schemas and data formats deploy onto the data
lake's raw data to perform the required schema-on-query via the functional layer (see the sketch after this list).
2. Data models: These form the basis for future processing to enhance the processing capabilities of
the data lake, by storing already processed data sources for future use by other processes against
the data lake.
3. Processing algorithms: The functional processing is performed via a series of well-designed
algorithms across the processing chain.
4. Provisioning of infrastructure: The functional infrastructure provision enables the framework to add
processing capability to the ecosystem, using technology such as Apache Mesos, which enables the
dynamic provisioning of processing work cells.
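As a rough illustration of schema-on-query, the following sketch applies a functional schema to raw data-lake text only at read time; the file path and column types are assumptions for the example:

import pandas as pd

# The functional schema to deploy onto the raw data at query time.
functional_schema = {"customer_id": "int64", "balance": "float64"}

# Hypothetical raw file in the data lake; read everything as text first.
raw = pd.read_csv("datalake/raw/customers.csv", dtype=str)

# Deploy the schema only for this query (schema-on-query).
typed = raw.astype(functional_schema)
print(typed.dtypes)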
DATA SCIENCE UNIT II PART II
1) Explain the Retrieve Superstep.
1. The Retrieve superstep is a practical method for importing a data lake consisting of various external
data sources completely into the processing ecosystem.
2. The Retrieve superstep is the first contact between your data science and the source systems.
3. The successful retrieval of the data is a major stepping-stone to ensuring that you are performing good data
science.
4. Data lineage delivers the audit trail of the data elements at the lowest granular level, to ensure full data
governance.
5. Data governance supports metadata management for system guidelines, processing strategies, policies
formulation, and implementation of processing.
6. Data quality and master data management helps to enrich the data lineage with more business values, if you
provide complete data source metadata.
7. The Retrieve superstep supports the edge of the ecosystem, where your data science makes direct contact
with the outside data world.
8. It supplies a current set of data structures that you can use to handle the deluge of data you will need
to process to uncover critical business knowledge.
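A minimal sketch of a Retrieve step, assuming a hypothetical external CSV source; the paths depend on your own ecosystem:

import pandas as pd

# First contact with the source system: retrieve the external data.
source = pd.read_csv("external/sales.csv")  # hypothetical source file

# Land the data unchanged in the data lake's retrieve zone, so the
# audit trail starts at the lowest granular level.
source.to_csv("datalake/retrieve/sales.csv", index=False)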
3) State and explain the four critical steps to avoid data swamp.
Following are the four critical steps to avoid a data swamp:
1. Start with Concrete Business Questions:
i. Simply dumping a horde of data into a data lake, with no tangible purpose in mind, will result in a big
business risk.
ii. The data lake must be enabled to collect the data required to answer your business questions.
iii. It is suggested to perform a comprehensive analysis of the entire set of data, stating full data lineage,
before allowing it into the data lake.
2. Data Quality:
i. More data points do not mean that data quality is less relevant.
ii. Data quality issues, if not dealt with correctly, can cause the invalidation of a complete data set.
3. Audit and Version Management:
i. You must always report the following (see the audit-record sketch after this answer):
a. Who used the process?
b. When was it used?
c. Which version of the code was used?
4. Data Governance:
i. The role of data governance, data access, and data security does not go away with the volume of
data in the data lake.
ii. It simply compounds into a worse problem if not managed.
iii. Data governance can be implemented in the following ways:
a. Data Source Catalog
b. Business Glossary
c. Analytical Model Usage
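The reporting in step 3 can be captured with a record like the following sketch; the fields mirror the three audit questions, and the use of git for the code version is an assumption for the example:

import getpass
import subprocess
from datetime import datetime, timezone

def audit_record():
    """Report who used the process, when, and which code version."""
    return {
        "who": getpass.getuser(),
        "when": datetime.now(timezone.utc).isoformat(),
        # Assumes the code lives in a git repository; otherwise store
        # your own version tag here.
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]
        ).decode().strip(),
    }

print(audit_record())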
3. MySQL
i. MySQL is widely used by many companies for storing data.
ii. This opens that data to your data science with the change of a simple connection string.
iii. There are two options: a direct connection to the database, or a connection via the DSN service
(see the connection sketch at the end of this list).
4. Oracle
i. Oracle is a common database storage option in bigger companies.
ii. It enables you to load data from this data source with ease.
5. Microsoft Excel
i. Excel is common in the data sharing ecosystem, and it enables you to load files using this format
with ease.
6. Apache Spark
i. Apache Spark is now becoming the next standard for distributed data processing.
ii. The universal acceptance and support of the processing ecosystem is starting to turn mastery of this
technology into a must-have skill.
7. Apache Cassandra
i. Cassandra is a distributed NoSQL database built to handle large volumes of data across many
servers without a single point of failure.
8. Apache Hive
i. Hive is a data warehouse system on top of Hadoop that enables SQL-like queries against data in
the lake.
9. Apache Hadoop
i. Hadoop is a framework for the distributed storage and processing of large data sets across
clusters of computers.
10. PyDoop
i. PyDoop is a Python interface to Hadoop that enables you to access HDFS and write MapReduce
jobs from Python.
11. Amazon Web Services
i. Amazon Web Services is a cloud platform that supplies on-demand storage and processing
capacity to the ecosystem.
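For the sources above, here is a hedged sketch of two common loads, MySQL via a connection string and Excel via pandas; the host, credentials, table, and file names are placeholders, and the MySQL example assumes the SQLAlchemy and PyMySQL packages:

import pandas as pd
from sqlalchemy import create_engine

# MySQL: direct connect via a simple connection string (placeholders).
engine = create_engine("mysql+pymysql://user:password@localhost/sales_db")
mysql_df = pd.read_sql("SELECT * FROM customers", engine)
# For the DSN option you would instead connect through an ODBC DSN
# (for example, with the pyodbc package).

# Microsoft Excel: pandas loads .xlsx files directly.
excel_df = pd.read_excel("datalake/raw/budget.xlsx", sheet_name=0)

print(mysql_df.shape, excel_df.shape)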