
DATA SCIENCE UNIT II PART- I

1) Explain in detail the function of the operational management layer.


i. The operational management layer is the core store for the data science ecosystem’s complete
processing capability. The layer stores every processing schedule and workflow for the all-inclusive
ecosystem.
ii. This area enables you to see a single view of the entire ecosystem and reports the status of the processing.
iii. The operational management layer is where the following are managed:
1. Processing-Stream Definition and Management:
 The processing-stream definitions are the building block of the data science ecosystem.
 Definition management describes the workflow of the scripts through the system, ensuring
that the correct execution order is managed, as per the data scientists’ workflow design.
2. Parameters:
 The parameters for the processing are stored in this section, to ensure a single location for all the system parameters (a small sketch of such a store follows this answer).
3. Scheduling:
 The scheduling plan is stored in this section, to enable central control and visibility of the
complete scheduling plan for the system.
4. Monitoring:
 The central monitoring process is in this section to ensure that there is a single view of the
complete system.
 Always ensure that you monitor your data science from a single point.
 Having various data science processes running on the same ecosystem without central
monitoring is not advised.
5. Communication:
 All communication from the system is handled in this one section, to ensure that the system
can communicate any activities that are happening.
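Below is a minimal Python sketch (not from the textbook) of what a central parameter and schedule store for this layer could look like; the file name, keys, and values are illustrative assumptions only.

import json

# Central store for system parameters and the scheduling plan.
# File name, keys, and values are illustrative assumptions only.
operational_store = {
    "parameters": {
        "datalake_path": "/data/lake",
        "batch_size": 10000,
    },
    "schedule": [
        {"stream": "retrieve_country_codes", "run_at": "02:00", "order": 1},
        {"stream": "assess_country_codes", "run_at": "03:00", "order": 2},
    ],
}

# Persist to a single location so every process reads the same definitions.
with open("operational_management.json", "w") as f:
    json.dump(operational_store, f, indent=2)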
2) Give an overview of the Drum-Buffer-Rope method.
1. The scheduling plan is stored centrally in the operational management layer, to enable control and visibility of the complete scheduling plan for the system.
2. For this scheduling, the Drum-Buffer-Rope methodology is used. The principle is simple.

(Figure: Original Drum-Buffer-Rope use)

3. Similar to a troop of people marching, the Drum-Buffer-Rope methodology is a standard practice to identify the slowest process and then use this process to pace the complete pipeline, tying the rest of the pipeline to this process to control the ecosystem's speed.
4. Place the "drum" at the slow part of the pipeline, to set the processing pace, and attach the "rope" to the beginning and end of the pipeline, ensuring that no processing is done that is not attached to this drum.
5. This ensures that processes complete more efficiently, as nothing enters or leaves the process pipe without being recorded by the drum's beat.
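The following is a minimal Python sketch of the Drum-Buffer-Rope idea (illustrative only, not code from the textbook): the slowest step is the drum, a bounded queue is the buffer, and blocking on the full queue acts as the rope that stops work entering faster than the drum can beat.

import queue
import threading
import time

buffer = queue.Queue(maxsize=3)          # the "buffer": sized to protect the drum

def feeder(items):
    for item in items:
        buffer.put(item)                 # the "rope": blocks when the buffer is full
        print(f"fed {item}")

def drum():
    while True:
        item = buffer.get()
        time.sleep(1.0)                  # the slowest process sets the pace
        print(f"drum processed {item}")
        buffer.task_done()

threading.Thread(target=drum, daemon=True).start()
feeder(range(10))
buffer.join()                            # wait until the drum has beaten for every item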

3) Give an overview of the function of the audit, balance, and control layer.


1. Audit:
i. An audit is a systematic and independent examination of the ecosystem.
ii. The audit sublayer records the processes that are running at any specific point within the
environment. This information is used by data scientists and engineers to understand and plan
future improvements to the processing.
iii. The use of the built-in audit capability of the data science technology stack's components supplies you with a rapid and effective base for your auditing.
iv. In the data science ecosystem, the audit consists of a series of observers that record preapproved
processing indicators regarding the ecosystem.
2. Balance:
i. The balance sublayer ensures that the ecosystem is balanced across the accessible processing
capability or has the capability to top up capability during periods of extreme processing.
ii. The processing on-demand capability of a cloud ecosystem is highly desirable for this purpose.
iii. Deploying a deep reinforcement learning algorithm against the cause-and-effect analysis system can handle any balance requirements dynamically.
3. Control:
i. The control sublayer controls the execution of the current active data science.
ii. The control elements are a combination of the control element within the Data Science Technology
Stack’s individual tools plus a custom interface to control the overarching work.
iii. The control sublayer also ensures that when processing experiences an error, it can try a recovery, as per your requirements, or schedule a clean-up utility to undo the error (a small sketch of this follows the list).
iv. The cause-and-effect analysis system is the core data source for the distributed control system in
the ecosystem.
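A minimal Python sketch of the retry-and-clean-up behaviour of the control sublayer described in point 3.iii above (the function names and retry policy are illustrative assumptions, not from the textbook):

import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("control")

def run_with_recovery(step, retries=2, cleanup=None):
    """Run one data science step; retry on error and fall back to a clean-up utility."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as exc:                      # unhandled error in the step
            log.error("step %s failed (attempt %d): %s", step.__name__, attempt, exc)
            time.sleep(1)                             # simple back-off before the retry
    if cleanup is not None:
        log.warning("retries exhausted; running clean-up utility %s", cleanup.__name__)
        cleanup()                                     # undo the partial work of the failed step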

4) Explain the fundamental data science process steps.


Following are the five fundamental data science process steps that are the core of my approach to
practical data science.
1. Start with a What-if Question:
i. Decide what you want to know, even if it is only the subset of the data lake you want to use for your data science; that is a good start.
ii. For example, consider a small car dealership. Suppose I have been informed that Bob was looking at cars last weekend. Therefore, I ask: “What if I know what car my customer Bob will buy next?”
2. Take a Guess at a Potential Pattern:
i. Use your experience or insights to guess a pattern you want to discover, to uncover additional
insights from the data you already have.
ii. For example, I guess Bob will buy a car every three years, and as he currently owns a three-year-old Audi, he will likely buy another Audi. I have no proof; it is just a guess, or so-called gut feeling, that I could prove via my data science techniques.
3. Gather Observations and Use Them to Produce a Hypothesis:
i. So, I start collecting car-buying patterns on Bob and formulate a hypothesis about his future
behavior.
ii. For those of you who have not heard of a hypothesis, it is a proposed explanation, prepared on
the basis of limited evidence, as a starting point for further investigation.
4. Use Real-World Evidence to Verify the Hypothesis:
i. Now, we verify our hypothesis with real-world evidence.
ii. On our CCTV, I can see that Bob is looking only at Audis and returned to view a yellow Audi R8 five times over the last two weeks. So, our hypothesis is verified: Bob wants to buy my yellow Audi R8.
5. Collaborate Promptly and Regularly with Customers and Subject Matter Experts As You Gain Insights:
i. These five steps work, but I will acknowledge that they serve only as my guide while
prototyping.
ii. Once you start working with massive volumes, velocities, and variance in data, you will need a
more structured framework to handle the data science.

5) List the supersteps for processing the data lake.


The processing algorithms and data models are spread across six supersteps for processing the data
lake.
1. Retrieve
 This superstep contains all the processing chains for retrieving data from the raw data lake
into a more structured format.
2. Assess
 This superstep contains all the processing chains for quality assurance and additional data
enhancements.
3. Process
 This superstep contains all the processing chains for building the data vault.
4. Transform
 This superstep contains all the processing chains for building the data warehouse from the
core data vault.
5. Organize
 This superstep contains all the processing chains for building the data marts from the core
data warehouse.
6. Report
 This superstep contains all the processing chains for building virtualization and reporting of
the actionable knowledge.
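A minimal Python sketch showing how the six supersteps might be chained into one processing run (the function bodies are illustrative placeholders, not the book's actual code):

# Illustrative pipeline: each superstep is a function that accepts the output
# of the previous one. Real implementations would be far richer; this only
# shows the ordering Retrieve -> Assess -> Process -> Transform -> Organize -> Report.

def retrieve():   return {"raw": ["lake record 1", "lake record 2"]}
def assess(d):    return {**d, "quality_checked": True}
def process(d):   return {**d, "data_vault": "built"}
def transform(d): return {**d, "data_warehouse": "built"}
def organize(d):  return {**d, "data_marts": "built"}
def report(d):    print("Actionable knowledge:", d)

report(organize(transform(process(assess(retrieve())))))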
6) Explain the different ways of implementing the built-in logging in the audit phase.
i. For built-in logging, it is recommended that you do not change the internal or built-in logging process of any of the data science tools, as this will make any future upgrades complex and costly.
ii. Normally, I build a controlled, systematic, and independent examination of all the built-in logging vaults.
iii. I deploy five independent watchers for the built-in logging:
1. Debug Watcher:
 This is the maximum verbose logging level.
 If I discover any debug logs in my ecosystem, I normally raise an alarm, as this means that
the tool is using precise processing cycles to perform low-level debugging.
2. Information Watcher:
 The information level is normally utilized to output information that is beneficial to the
running and management of a system.
 I pipe these logs to the central Audit, Balance, and Control data store, using the ecosystem
as I would any other data source.
3. Warning Watcher:
 Warning is often used for handled “exceptions” or other important log events. Usually this
means that the tool handled the issue and took corrective action for recovery.
 I pipe these logs to the central Audit, Balance, and Control data store, using the ecosystem
as I would any other data source.
4. Error Watcher:
 Error is used to log all unhandled exceptions in the tool.
 This is not a good state for the overall processing to be in, as it means that a specific step in
the planned processing did not complete as expected.
 Now, the ecosystem must handle the issue and take corrective action for recovery.
 I pipe these logs to the central Audit, Balance, and Control data store, using the ecosystem
as I would any other data source.
5. Fatal Watcher:
 Fatal is reserved for special exceptions/conditions for which it is imperative that you quickly
identify these events.
 This is not a good state for the overall processing to be in, as it means a specific step in the
planned processing has not completed as expected.
 The ecosystem must now handle the issue and take corrective action for recovery.
 Once again, I pipe these logs to the central Audit, Balance, and Control data store, using the
ecosystem as I would any other data source.
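A minimal Python sketch of piping the watched log levels into a central Audit, Balance, and Control data store (the store file name, table, and logger names are illustrative assumptions, not from the textbook):

import logging
import sqlite3

# Central Audit, Balance, and Control store (file name is an assumption).
abc_store = sqlite3.connect("abc_audit_store.db")
abc_store.execute(
    "CREATE TABLE IF NOT EXISTS log_audit (level TEXT, logger TEXT, message TEXT)"
)

class AuditStoreHandler(logging.Handler):
    """Watcher that pipes WARNING, ERROR, and FATAL/CRITICAL events to the audit store."""
    def emit(self, record):
        abc_store.execute(
            "INSERT INTO log_audit VALUES (?, ?, ?)",
            (record.levelname, record.name, record.getMessage()),
        )
        abc_store.commit()

tool_logger = logging.getLogger("data_science_tool")
tool_logger.setLevel(logging.DEBUG)
tool_logger.addHandler(AuditStoreHandler(level=logging.WARNING))  # watch warnings and above

tool_logger.warning("handled exception, corrective action taken")
tool_logger.error("unhandled exception in processing step")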

7) Explain the different ways of implementing a basic audit in the logging phase.
Every time a process is executed, this logging allows you to log everything that occurs to a central file.
1. Process Tracking:
i. I normally build a controlled systematic and independent examination of the process for the
hardware logging.
ii. There are numerous server-based software tools that monitor temperature sensors, voltage, fan speeds, and the load and clock speeds of a computer system.
2. Data Provenance:
i. Keep records for every data entity in the data lake, by tracking it through all the transformations in the system.
ii. This ensures that you can reproduce the data, if needed, in the future and supplies a detailed
history of the data’s source in the system.
3. Data Lineage:
i. This involves keeping records of every change that happens to the individual data values in the
data lake.
ii. This enables you to know what the exact value of any data record was in the past.
iii. It is normally achieved by a valid-from and valid-to audit entry for each data set in the data
science environment.
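A minimal Python sketch of the valid-from and valid-to audit entries described under data lineage (the record layout and example values are illustrative assumptions):

from datetime import datetime

# Each change to a data value closes the previous lineage entry and opens a new one.
lineage = []

def record_change(entity, value, now=None):
    now = now or datetime.now()
    for entry in lineage:
        if entry["entity"] == entity and entry["valid_to"] is None:
            entry["valid_to"] = now            # close the old version
    lineage.append({"entity": entity, "value": value,
                    "valid_from": now, "valid_to": None})

record_change("customer_42.country", "GB")
record_change("customer_42.country", "DE")     # the GB entry is now closed

for entry in lineage:
    print(entry)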
8) List and explain the data structures of the functional layer of the ecosystem.
i. The functional layer of the data science ecosystem is the largest and most essential layer for
programming and modeling.
ii. It consists of several structures as follows:
1. Data schemas and data formats: Functional data schemas and data formats deploy onto the data lake's raw data, to perform the required schema-on-query via the functional layer (a small schema-on-query sketch follows this list).
2. Data models: These form the basis for future processing to enhance the processing capabilities of
the data lake, by storing already processed data sources for future use by other processes against
the data lake.
3. Processing algorithms: The functional processing is performed via a series of well-designed
algorithms across the processing chain.
4. Provisioning of infrastructure: The functional infrastructure provision enables the framework to add processing capability to the ecosystem, using technology such as Apache Mesos, which enables the dynamic provisioning of processing work cells.
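A minimal Python sketch of schema-on-query, as referred to in point 1 above (the file name, columns, and dtypes are illustrative assumptions; pandas is assumed to be available):

import pandas as pd

# Schema-on-query sketch: the raw CSV in the data lake is untyped text;
# the schema (column names and dtypes) is applied only at query time.
schema = {"customer_id": "int64", "country_code": "string", "purchase_value": "float64"}

df = pd.read_csv("datalake/raw_sales.csv",
                 usecols=list(schema),
                 dtype=schema)            # schema applied on read, not on load into the lake

print(df.dtypes)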
DATA SCIENCE UNIT II PART- II
1) Explain the Retrieve Superstep.
1. The Retrieve superstep is a practical method for importing a data lake consisting of various external data sources completely into the processing ecosystem.
2. The Retrieve superstep is the first contact between your data science and the source systems.
3. The successful retrieval of the data is a major stepping-stone to ensuring that you are performing good data
science.
4. Data lineage delivers the audit trail of the data elements at the lowest granular level, to ensure full data
governance.
5. Data governance supports metadata management for system guidelines, processing strategies, policies
formulation, and implementation of processing.
6. Data quality and master data management helps to enrich the data lineage with more business values, if you
provide complete data source metadata.
7. The Retrieve superstep supports the edge of the ecosystem, where your data science makes direct contact
with the outside data world.
8. It supplies a current set of data structures that you can use to handle the deluge of data you will need to process to uncover critical business knowledge.

2) Explain Data Lake and Data Swamp.


1. Data Lakes:
i. The contents of the data lake stream in from a source to fill the lake, and various users of the lake
can come to examine, dive in, or take samples.
ii. A significant categorization of research and commercial activity has made the term famous and even notorious.
iii. The term is so all-encompassing that you could describe any data that is on the edge of the ecosystem and your business as part of your data lake.
iv. The data lake is the complete data world your company interacts with during its business life span.
v. In simple terms, if you generate data or consume data to perform your business tasks, that data forms part of your data lake.
vi. As a lake needs rivers and streams to feed it, the data lake will consume an unavoidable deluge of
data sources from upstream and deliver it to downstream partners.
2. Data Swamps:
i. Data swamps are simply data lakes that are not managed. They are not to be feared. They need to
be tamed.
ii. Following are the four critical steps to avoid a data swamp (explained in detail in the next question):
 Start with Concrete Business Questions
 Data Quality
 Audit and Version Management
 Data Governance

3) State and explain the four critical steps to avoid a data swamp.
Following are the four critical steps to avoid a data swamp:
1. Start with Concrete Business Questions:
i. Simply dumping a horde of data into a data lake, with no tangible purpose in mind, will result in a big business risk.
ii. The data lake must be enabled to collect the data required to answer your business questions.
iii. It is suggested to perform a comprehensive analysis of the entire set of data, stating full data lineage, before allowing it into the data lake.
2. Data Quality:
i. More data points do not mean that data quality is less relevant.
ii. Data quality can cause the invalidation of a complete data set, if not dealt with correctly.
3. Audit and Version Management:
i. You must always report the following:
 Who used the process?
 When was it used?
 Which version of code was used?
4. Data Governance:
i. The role of data governance, data access, and data security does not go away with the volume of
data in the data lake.
ii. It simply collects together into a worse problem, if not managed.
iii. Data Governance can be implemented by the following ways:
a. Data Source Catalog
b. Business Glossary
c. Analytical Model Usage

4) Explain the general rules for a data source catalog.


1. Unique data catalog number: I normally use YYYYMMDD/NNNNNN/NNN. E.g. 20171230/000000001/001 for
data first registered into the metadata registers on December 30, 2017, as data source 1 of data type 1. This
is a critical requirement.
2. Short description (keep it under 100 characters): Country codes and country names (Country Codes—ISO
3166)
3. Long description (keep it as complete as possible): Country codes and country names used by VKHC as
standard for country entries
4. Contact information for external data source: ISO 3166-1:2013 code lists
5. Expected frequency: Irregular (i.e., no fixed frequency, also known as ad hoc). Other options are near-real-
time, every 5 seconds, every minute, hourly, daily, weekly, monthly, or yearly.
6. Internal business purpose: Validate country codes and names.
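A minimal Python sketch of one data source catalog entry following the rules above (the dictionary layout itself is an illustrative assumption, not a prescribed format):

# One data source catalog entry, following the general rules above.
catalog_entry = {
    "catalog_number": "20171230/000000001/001",   # YYYYMMDD/NNNNNN/NNN
    "short_description": "Country codes and country names (ISO 3166)",
    "long_description": "Country codes and country names used by VKHC as standard for country entries",
    "external_contact": "ISO 3166-1:2013 code lists",
    "expected_frequency": "irregular",            # ad hoc; could be hourly, daily, monthly, ...
    "internal_business_purpose": "Validate country codes and names",
}

print(catalog_entry["catalog_number"])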

5) Explain the following shipping terms.


1. Seller
i. The person/company sending the products on the shipping manifest is the seller.
ii. In our case, there will be warehouses, shops, and customers. Note that this is not a location but a
legal entity sending the products.
2. Carrier
i. The person/company that physically carries the products on the shipping manifest is the carrier.
ii. Note that this is not a location but a legal entity transporting the products.
3. Port
i. A port is any point from which you have to exit or enter a country. Normally, these are shipping
ports or airports but can also include border crossings via road.
ii. Note that there are two ports in the complete process. This is important. There is a port of exit and a
port of entry
4. Ship.
i. Ship is the general term for the physical transport method used for the goods.
ii. This can refer to a cargo ship, airplane, truck, or even person, but it must be identified by a unique
allocation number
5. Terminal
i. A terminal is the physical point at which the goods are handed off for the next phase of the physical
shipping.
6. Named Place
i. This is the location where the ownership is legally changed from seller to buyer. This is a specific
location in the overall process.
ii. Remember this point, as it causes many legal disputes in the logistics industry.
7. Buyer
i. The person/company receiving the products on the shipping manifest is the buyer.
ii. In our case, there will be warehouses, shops, and customers. Note that this is not a location but a
legal entity receiving the products.

6) Explain the following shipping terms with examples.


1. EXW—Ex Works (Named Place of Delivery)
i. In this term, the seller makes the goods available at its premises or at another named place.
ii. This term places the maximum obligation on the buyer and minimum obligations on the seller.

2. FCA—Free Carrier (Named Place of Delivery)


i. Under this term, the seller delivers the goods, cleared for export, at a named place.
ii. For example, if I buy an item at an overseas duty-free shop and then pick it up at the duty-free desk before taking it home, the shop has shipped it FCA (Free Carrier) to the duty-free desk.

3. CPT—Carriage Paid To (Named Place of Destination)


i. The seller, under this term, pays for the carriage of the goods up to the named place of destination.
ii. E.g.: If I were to buy Practical Data Science at an overseas bookshop and then pick it up at the export desk before taking it home, with the shop shipping it CPT (Carriage Paid To) the duty desk for free, then the moment I pay at the register the ownership is transferred to me; but if anything happens to the book between the shop and the duty desk of the shop, I will have to pay.

4. CIP—Carriage and Insurance Paid To (Named Place of Destination)


i. This term is generally similar to the preceding CPT, with the exception that the seller is required to
obtain insurance for the goods while in transit.
ii. E.g.: CIP New York means the seller pays freight and insurance charges to New York.

5. DAT—Delivered at Terminal (Named Terminal at Port or Place of Destination)


i. This Incoterm requires that the seller deliver the goods, unloaded, at the named terminal. The seller
covers all the costs of transport and assumes all risks until arrival at the destination port or terminal.
ii. The terminal can be a port, airport, or inland freight interchange, but it must be a facility with the
capability to receive the shipment.
iii. Costs after unloading at the terminal, for example import duty, taxes, customs, and on-carriage costs, are for the buyer's account.
6. DAP—Delivered at Place (Named Place of Destination)
i. It means that the seller delivers when the goods are placed at the disposal of the buyer on the arriving means of transport, ready for unloading at the named place of destination.
ii. E.g.: A buyer in London enters into a DAP deal with a seller from New York to purchase a consignment of goods.
7. DDP—Delivered Duty Paid (Named Place of Destination)
i. In this term, the seller is responsible for delivering the goods to the named place in the country of
the buyer and pays all costs in bringing the goods to the destination, including import duties and
taxes
ii. The seller is not responsible for unloading.
iii. E.g.: A buyer in New York enters into a DDP deal with a seller from London to purchase a consignment of goods.

7) List and explain the different data stores used in data science.


While performing data retrieval, you may have to work with one of the following data stores (a short connection sketch follows the list):
1. SQLite
i. This requires a package named sqlite3.

2. Microsoft SQL Server


i. Microsoft SQL server is common in companies, and this connector supports your connection to the
database.
ii. There are two options:-
 Via the ODBC interface
 Via the direct connection

3. MySQL
i. MySQL is widely used by lots of companies for storing data.
ii. This opens that data to your data science with the change of a simple connection string.
iii. There are two options:-
 For direct connect to the database
 For connection via the DSN service

4. Oracle
i. Oracle is a common database storage option in bigger companies.
ii. It enables you to load data from this data source with ease.

5. Microsoft Excel
i. Excel is common in the data sharing ecosystem, and it enables you to load files using this format
with ease.
6. Apache Spark
i. Apache Spark is now becoming the next standard for distributed data processing.
ii. The universal acceptance and support of the processing ecosystem is starting to turn mastery of this
technology into a must-have skill.
7. Apache Cassandra
8. Apache Hive
9. Apache Hadoop
10. PyDoop
11. Amazon Web Services
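A minimal Python sketch of connecting to two of these data stores (the file name, table, server, and credentials are placeholder assumptions; the pyodbc package would be needed for the SQL Server route):

import sqlite3
import pandas as pd

# SQLite: the sqlite3 package ships with Python; the file and table names are placeholders.
with sqlite3.connect("datalake/vermeulen.db") as conn:
    df = pd.read_sql_query("SELECT * FROM country_codes", conn)
    print(df.head())

# Microsoft SQL Server via the ODBC interface (pyodbc); the server, database,
# and credentials below are placeholder assumptions.
# import pyodbc
# conn = pyodbc.connect(
#     "DRIVER={ODBC Driver 17 for SQL Server};"
#     "SERVER=myserver;DATABASE=datalake;UID=user;PWD=password"
# )
# df = pd.read_sql_query("SELECT TOP 10 * FROM country_codes", conn)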

8) Why is it necessary to train the data science team?


1. To prevent a data swamp, it is essential that you train your team; also, data science is a team effort.
2. People, process, and technology are the three cornerstones to ensure that data is curated and protected.
3. You are responsible for your people; share the knowledge you acquire from this book. The processes I teach you, you need to teach them. Alone, you cannot achieve success.
4. Technology requires that you invest time to understand it fully. We are only at the dawn of major developments in the field of data engineering and data science.
5. Remember: A big part of this process is to ensure that business users and data scientists understand the need to start small, have concrete questions in mind, and realize that there is work to do with all data to achieve success.
