Data-Science MUMBAI
The Rapid Information Factory (RIF) System is a technique and tool set used for processing data during development. The Rapid Information Factory is a massively parallel data processing platform capable of processing data sets of theoretically unlimited size.
The Rapid Information Factory (RIF) platform supports five high-level layers:
• Functional Layer.
The Retrieve super step supports the interaction between external data sources and the factory.
The Assess super step supports the data quality clean-up in the factory.
The Transform super step converts the data vault, via sun modeling, into dimensional modeling to form a data warehouse.
The Organize super step sub-divides the data warehouse into data marts.
o Maintenance Utilities.
o Data Utilities.
o Processing Utilities.
• Business Layer.
❖ The data science ecosystem consists of a series of tools that are used to build your solution. These tools and techniques evolve quickly, with new capabilities and developments appearing almost every day.
❖ There are two basic data processing approaches used in practical data science, as described below:
o A traditional Relational Database Management System requires a schema before loading the data. A schema essentially denotes the organization of the data; it is like a blueprint describing how the database should be constructed.
o A schema is a single structure that represents the logical view of the entire database. It represents how the data is organized and how the pieces of data relate to one another.
o It is the responsibility of the database designer, together with the programmer, to design the database so that its logic and structure are easy to understand.
o A Relational Database Management System is designed and used to store data.
o To retrieve data from a relational database system, you run queries written in Structured Query Language (SQL) to perform these tasks.
o It stores dense data: all the data is kept in the data store, and schema-on-write is the widely used methodology for storing such dense data.
o Schema-on-write schemas are built for a specific purpose, which makes them harder to change and maintain as the data in the database evolves.
o When a lot of raw data is available for processing, some of it is lost during loading, which weakens future analysis.
o If important data is never stored in the database, you cannot process it in later data analysis.
o A schema-on-read ecosystem does not need a schema up front; you can load the data into the data store without one.
o This type of system stores only minimal structure with the data values, and the important schema is applied during the query phase.
o These ecosystems are well suited to both experimentation and exploration, because structure is imposed on the data only when it is retrieved.
o Schema-on-read generates fresh, new data products faster; it increases the speed of data generation and reduces the cycle time from data availability to actionable information.
o Both ecosystems, schema-on-read and schema-on-write, are useful and essential for data scientists and data engineers to understand, for data preparation, modeling, development, and deployment of data into production.
o When you apply schema-on-read to structured, unstructured, and semi-structured data, queries can run slowly, because there is no pre-built schema to support fast retrieval of the data into the data warehouse.
o Schema-on-read follows an agile way of working and has the capability to operate like a NoSQL database in that environment.
o Sometimes schema-on-read throws errors at query time, because three types of data are stored (structured, unstructured, and semi-structured) and there are no strict rules guaranteeing fast, reliable retrieval as there are in a structured database.
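As a rough illustration of the contrast described above, here is a minimal Python sketch, assuming the standard-library sqlite3 module and pandas are available; the table, field, and value names are invented.

import sqlite3
import pandas as pd

# Schema-on-write: the table structure must exist before any data is loaded.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
con.execute("INSERT INTO customer VALUES (1, 'Arjun', 'Mumbai')")

# Schema-on-read: raw text is loaded as-is; structure is imposed only at query time.
raw = pd.DataFrame({"line": ["1|Arjun|Mumbai", "2|Priya|Pune"]})
parsed = raw["line"].str.split("|", expand=True)   # apply structure now
parsed.columns = ["id", "name", "city"]            # name the fields at read time
print(parsed)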
❖ A data lake is a storage repository for large amounts of raw data: structured, semi-structured, and unstructured data.
❖ It is the place where you can store all three types of data (structured, semi-structured, and unstructured) with no fixed limit on the amount or the storage format.
❖ If we compare schema-on-write with a data lake, schema-on-write stores the data in a data warehouse with a predefined schema, whereas a data lake imposes far less structure when storing the data.
❖ A data lake stores data with minimal structure because it follows the schema-on-read architecture.
❖ A data lake allows us to transform the raw data (structured, semi-structured, and unstructured) into a structured format so that SQL queries can be performed for analysis.
❖ Most of the time, data lakes are deployed using distributed object storage databases that enable schema-on-read, so that business analytics and data mining tools and algorithms can be applied to the data.
❖ Retrieval of data is fast because no schema is applied on write; data must remain accessible without failure or unnecessary complexity.
❖ It is a low-cost and effective way to store large amounts of data in a centralized store for further organizational analysis and deployment.
❖ Data vault is a database modeling method designed to store long-term historical data; large amounts of data can be controlled by using the data vault.
o In a data vault, data comes from different sources, and the model is designed so that data can be loaded in parallel, allowing large implementations without failure or major redesign.
o The data vault is part of the process of transforming the schema-on-read data lake into a schema-on-write data warehouse.
o Data vaults are designed around schema-on-read query requests against the data lake, because schema-on-read increases the speed at which new data becomes available for analysis and implementation.
o A data vault stores a single version of the data and does not distinguish between good data and bad data.
o Data lakes and data vaults are built using three main components or structures of data: hubs, links, and satellites.
1. Hub :
❖ A hub contains a unique business key with a low propensity to change, together with metadata describing the origin of the business key.
❖ A hub contains a set of unique business keys that never change over time.
o There are different types of hubs, such as the person hub, time hub, object hub, event hub, and location hub. The time hub contains IDNumber, IDTimeNumber, ZoneBaseKey, DateTimeKey, and DateTimeValue, and it connects through links such as Time-Person, Time-Object, Time-Event, Time-Location, and Time-Link.
o The person hub contains IDPersonNumber, FirstName, SecondName, LastName, Gender, TimeZone, BirthDateKey, and BirthDate, and it connects through links such as Person-Time, Person-Object, Person-Location, Person-Event, and Person-Link.
o The object hub contains IDObjectNumber, ObjectBaseKey, ObjectNumber, and ObjectValue, and it connects through links such as Object-Time, Object-Link, Object-Event, Object-Location, and Object-Person.
o The event hub contains IDEventNumber, EventType, and EventDescription, and it connects through links such as Event-Person, Event-Location, Event-Object, and Event-Time.
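As a rough illustration only (not the formal data vault loading process), the person hub fields listed above could be sketched as a small pandas table; the row values are invented.

import pandas as pd

# A toy person hub: one row per unique business key (values are made up).
person_hub = pd.DataFrame(
    [
        {"IDPersonNumber": 1, "FirstName": "Asha", "SecondName": "R", "LastName": "Patil",
         "Gender": "F", "TimeZone": "Asia/Kolkata", "BirthDateKey": "19900101", "BirthDate": "1990-01-01"},
        {"IDPersonNumber": 2, "FirstName": "Ravi", "SecondName": "K", "LastName": "Shah",
         "Gender": "M", "TimeZone": "Asia/Kolkata", "BirthDateKey": "19851105", "BirthDate": "1985-11-05"},
    ]
)
print(person_hub)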
2. Links:
❖ Links play a very important role in transactions and in associating business keys. Tables relate to each other through relationship types such as one-to-one, one-to-many, many-to-one, and many-to-many.
❖ A link represents and connects only the elements of a business relationship; when one node or link relates to another, data can transfer smoothly between them.
3. Satellites:
❖ Hubs and links form the structure of the model, but on their own they store no chronological or descriptive data; without satellites the model would not provide information such as the mean, median, mode, maximum, minimum, or sum of the data.
❖ Satellites are the structures that store the detailed information about the related data or business characteristic keys, and they hold the large volumes of data in the data vault.
❖ The combination of these three components (hubs, links, and satellites) helps data analysts, data scientists, and data engineers store the business structure and the types of information or data inside it.
❖ The overall process transforms the data from the data lake into the data vault, and then transfers the data vault into the data warehouse.
❖ Most data scientists, data analysts, and data engineers use the following data science processing tools to process and transfer the data vault into the data warehouse.
1. Apache Spark:
➢ Apache Spark is an open source cluster computing framework. Open source means it is freely available on the internet: simply search for Apache Spark, and you can download the source code and use it as you wish.
➢ Apache Spark was developed at the AMPLab of the University of California, Berkeley, and the code was later donated to the Apache Software Foundation, which keeps improving it to make it more effective, reliable, and portable so that it runs on all platforms.
➢ Apache Spark provides an interface for programmers and developers to interact directly with the system and makes data processing parallel and approachable for data scientists and data engineers.
➢ Apache Spark has the capability to process all types and varieties of data from repositories including the Hadoop Distributed File System and NoSQL databases, as well as data already inside Spark.
➢ Companies such as IBM are hiring data scientists and data engineers who know the Apache Spark project, so that innovation can happen more easily and new features and changes keep arriving.
➢ Apache Spark can process data very fast because it holds the data in memory; it is an in-memory data processing engine.
2. Spark Core:
➢ Spark Core is the base and foundation of the overall project; it provides the most important functionality, such as distributed task dispatching, scheduling, and basic input and output.
➢ Using Spark Core, you can express more complex queries that help you work in complex environments.
➢ The distributed nature of the Spark ecosystem enables you to take the same processing from a small cluster to hundreds or thousands of nodes without making any changes.
➢ Apache Spark can use Hadoop in two ways: one for storage and the other for processing.
➢ Spark is not a modified version of the Hadoop Distributed File System, and it does not strictly depend on Hadoop, because Spark has its own features and tools for data processing and uses Hadoop mainly for storage.
➢ Apache Spark has many features that make it compatible and reliable. Speed is one of the most important: applications are able to run directly on Hadoop, and Spark can be up to 100 times faster when the data is held in memory.
➢ Spark Core supports many languages and has built-in functions and APIs in Java, Scala, and Python, which means you can write applications using Java, Scala, Python, or R.
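A minimal PySpark sketch of the ideas above, assuming the pyspark package is installed and a local session is acceptable; the file name people.csv is a placeholder.

from pyspark.sql import SparkSession

# Start a local Spark session (Spark Core handles scheduling and I/O underneath).
spark = SparkSession.builder.appName("rif-example").master("local[*]").getOrCreate()

# Read a CSV file into a DataFrame and run a simple distributed computation.
df = spark.read.csv("people.csv", header=True, inferSchema=True)
print(df.count())   # number of rows, computed in parallel across partitions
df.show(5)          # preview the first five rows

spark.stop()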
3. Spark SQL:
➢ Spark SQL is a component on top of Spark Core that presents a data abstraction called DataFrames.
➢ Spark SQL provides a fast, clustered data abstraction, so that data manipulation can be done with fast computation.
➢ Apache Spark SQL bridges the gap between relational databases and procedural processing. This matters when we want to load data from the traditional world into a data lake ecosystem.
➢ Spark SQL is Apache Spark's module for working with structured and semi-structured data, and it originated to overcome the limitations of Apache Hive.
➢ Hive depends on Hadoop's MapReduce engine for execution and processing of data and allows only batch-oriented operation.
➢ Hive lags in performance because it uses MapReduce jobs for executing ad hoc queries, and Hive does not allow you to resume a job if it fails in the middle.
➢ Spark performs better than Hive in many situations in which Hive suffers latency in terms of hours and long CPU reservation times.
➢ You can integrate Spark SQL and query structured and semi-structured data inside Apache Spark.
➢ Spark SQL follows the RDD model, and it also supports large jobs and mid-query fault tolerance.
➢ You can easily connect Spark SQL through JDBC and ODBC for better connectivity to business tools.
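A short Spark SQL sketch, assuming pyspark is installed; the data is created inline, so no external files are needed, and the view name is invented.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").master("local[*]").getOrCreate()

# Build a DataFrame (the data abstraction mentioned above) from in-memory rows.
sales = spark.createDataFrame(
    [("Mumbai", 120), ("Pune", 80), ("Mumbai", 95)],
    ["city", "amount"],
)

# Register it as a temporary view and query it with ordinary SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT city, SUM(amount) AS total FROM sales GROUP BY city").show()

spark.stop()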
4. Spark Streaming:
➢ Apache Spark Streaming enables powerful interactive and analytical applications on live streaming data. In streaming, data is not fixed; it arrives continuously from different sources.
➢ Spark Streaming divides the incoming input data into small units (micro-batches) for further data analytics and processing at the next level.
➢ There are multiple levels of processing involved. Live streaming data is received and divided into small batches, and these batches are then processed by the Spark engine to produce the final stream of results.
➢ Processing data with plain Hadoop has very high latency, meaning results are not produced in a timely manner, so it is not suitable for real-time processing requirements.
➢ Apache Storm can reprocess data if an operation fails, but this style of failure handling can lead to data loss or repeated processing of records.
➢ In most scenarios, Hadoop is used for batch processing, while Apache Spark is used for live streaming of data.
➢ Apache Spark Streaming helps fix these types of issues and provides a reliable, portable, scalable, and efficient system that integrates well with the rest of the stack.
➢ Its usage can be seen at Facebook, in LinkedIn connections, in Google Maps, and in internet routers, which use these kinds of tools for better response and analysis.
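A hedged sketch of the micro-batch idea using Spark Structured Streaming, assuming a text source on localhost port 9999 (for example, started with `nc -lk 9999`); it keeps a running word count over the incoming stream.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-example").master("local[*]").getOrCreate()

# Read a continuous stream of lines from a socket (each micro-batch is a small DataFrame).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split the lines into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console after every micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()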
5. GraphX:
➢ A graph is an abstract data type that is used to implement the directed and undirected graph concepts from graph theory in mathematics.
➢ In graph theory, each piece of data is associated with other data through edges, which may carry values such as numeric weights.
➢ Every edge and node (vertex) can have user-defined properties and values associated with it.
➢ Speed is one of GraphX's most important points: it is comparable with the fastest graph systems, while retaining fault tolerance and providing ease of use.
➢ It comes with a growing library of graph algorithms, giving more flexibility and reliability.
6. Apache Mesos:
➢ Apache Mesos is an open source cluster manager developed at the University of California, Berkeley.
➢ It provides the resource isolation and sharing required across distributed applications.
➢ The Mesos software provides resource sharing in a fine-grained manner, which allows cluster utilization to be improved.
➢ Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it runs especially well with Kafka, Cassandra, Spark, and Akka.
➢ It can handle workloads in distributed environments by using dynamic sharing and isolation.
➢ The data available in the existing system is grouped together with the machines or nodes of the cluster into a single pool, so that the load can be optimized.
7. Akka:
➢ Akka is an actor-based, message-driven runtime for building concurrent, elastic, and resilient processes.
➢ An actor can be controlled and limited to perform only its intended task. Akka is an open source library or toolkit.
➢ Akka is used to create distributed and fault-tolerant systems, and the library can be integrated into the Java Virtual Machine (JVM) to support JVM languages.
➢ An actor is an entity that communicates with other actors by passing messages, and each actor has its own state and behavior.
➢ Just as in object-oriented programming everything is an object, in Akka everything is an actor: it is an actor-based, message-driven system.
➢ In other words, an actor is an object that encapsulates its state and behavior.
8. Apache Cassandra:
➢ Cassandra can be used as a real-time operational data store for online transactional applications.
➢ Cassandra is designed to have peer-to-peer, continuously running nodes instead of master or named nodes, to ensure that there is no single point of failure.
➢ A NoSQL database is a database that provides a mechanism to store and retrieve data that is modeled differently from the tabular relations of a relational database.
➢ A NoSQL database uses different data structures compared to a relational database, and it typically supports a very simple query language.
➢ A NoSQL database has no fixed schema and does not necessarily provide full data transactions.
9. Apache Kafka:
➢ Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
➢ Apache Kafka is a highly scalable, reliable, fast, and distributed system. Kafka is suitable for both offline and online message consumption.
➢ Kafka messages are stored on disk and replicated within the cluster to prevent data loss.
➢ Kafka is distributed, partitioned, replicated, and fault tolerant, which makes it very reliable.
➢ The Kafka messaging system scales easily without downtime, which makes it very scalable. Kafka has high throughput for both publishing and subscribing to messages, and it can store data up to terabytes.
➢ Kafka is a unique platform for handling real-time data feeds, and it can deliver large amounts of data to diverse consumers.
➢ Kafka persists all data to disk, which essentially means that all writes go to the page cache of the operating system (RAM). This makes it very efficient to transfer data from the page cache to a network socket.
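A minimal sketch using the third-party kafka-python package, assuming a broker is already running on localhost:9092; the topic name 'events' is invented.

from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages to the 'events' topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", value=f"message {i}".encode("utf-8"))
producer.flush()

# Read the messages back; the brokers persist them on disk and replicate them.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.topic, record.offset, record.value.decode("utf-8"))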
1. Elastic Search:
➢ Elasticsearch is an open source, distributed search and analytics engine.
➢ Scalability means that it can scale as needed; reliability means that it can be trusted; and it offers stress-free management.
➢ It combines the power of search with the power of analytics, so that developers, programmers, data engineers, and data scientists can work smoothly with structured, unstructured, and time-series data.
➢ Elasticsearch is open source, meaning anyone can download and work with it; it is developed in Java, and many large organizations use this search engine for their needs.
➢ It enables the user to explore very large amounts of data at very high speed.
➢ It can be used as a replacement for document and data stores such as MongoDB.
➢ Elasticsearch is one of the most popular search engines, used by organizations such as Google, Stack Overflow, GitHub, and many more.
➢ Elasticsearch is an open source search engine and is available under the Apache License, Version 2.0.
2. R:
➢ R is a programming language used for statistical computing and graphics.
➢ The R language is used by data engineers, data scientists, statisticians, and data miners for developing software and performing data analytics.
➢ There are some core requirements before learning the R language: you should understand its library and package concepts and know how to work with them.
➢ Related R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2, reshape, and others.
➢ The R language is freely available under the GNU General Public License, and it supports many platforms, such as Windows, Linux/UNIX, and macOS.
➢ The R language has built-in capabilities to be integrated with procedural code written in C, C++, Java, .NET, and Python.
➢ The R language has strong capabilities for data handling and data storage.
3. Scala:
➢ Many data science projects and frameworks are built using the Scala programming language because it has so many capabilities and so much potential.
➢ Scala integrates the features of object-oriented and functional programming, and it interoperates with Java and other JVM languages.
➢ The types and behavior of objects are described by classes, and a class can be extended by another class, inheriting its properties.
➢ Scala supports higher-order functions: a function can be passed to, and called by, another function in your code.
➢ Once a Scala program is compiled, it is converted into bytecode (a machine-understandable form) that runs on the Java Virtual Machine.
➢ This means that Scala and Java programs can be compiled and executed on the same JVM, so you can easily move from Java to Scala and vice versa.
➢ Scala enables you to import and use Java classes, objects, and their behavior and functions, because Scala and Java both run on the Java Virtual Machine, and you can create your own classes and objects as well.
4. Python:
➢ Python is a programming language, and it can be used on a server to create web applications.
➢ Python can be used for web development, mathematics, and software development, and it can connect to databases to create and modify data.
➢ Python can handle large amounts of data and is capable of performing complex tasks on that data.
➢ Python is reliable, portable, and flexible, and it works on different platforms such as Windows, macOS, and Linux.
➢ Compared to other programming languages, Python is easy to learn; it can perform both simple and complex tasks, and it reduces the number of lines of code, helping programmers and developers work in an easy, friendly manner.
➢ Python supports object-oriented, functional, and structured programming styles.
➢ Python supports dynamic data types and dynamic type checking.
➢ Python is interpreted, and its philosophy emphasizes statements that reduce the number of lines of code.
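A tiny example of these points (dynamic typing and concise data handling) in plain Python; the numbers are invented.

# Dynamic typing: the same name can hold different types at different times.
value = 42
value = "forty-two"

# Concise data handling: filter and summarize a list in a few lines.
sales = [120, 80, 95, 210, 60]
large = [s for s in sales if s > 90]
print(len(large), sum(large) / len(large))   # count and average of the large sales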
1. Vermeulen PLC:
➢ Vermeulen PLC is a data processing company that processes all the data within the group's companies.
➢ This is the company for which we hire most of our data engineers and data scientists to work.
➢ The company supplies data science tools; networks, servers, and communication systems; internal and external web sites; decision science; and process automation.
2. Krennwallner AG:
➢ This is an advertising and media company that prepares the advertising and media information required for its customers.
➢ Using surveys, it specifies and chooses content for billboards and works out how many times customers visited which channel.
3. Hillman Ltd:
➢ This is a logistics and supply chain company; it supplies goods, and the data about them, around the world for business purposes.
4. Clark Ltd:
➢ This is a financial company that processes all the financial data required for financial purposes, including support money, venture capital planning, and putting money into the share market.
Scala:
➢ Most data science projects and frameworks are built using the Scala programming language because it has so many capabilities and so much potential.
➢ Scala integrates the features of object-oriented and functional programming, and it interoperates with Java and other JVM languages.
➢ The types and behavior of objects are described by classes, and a class can be extended by another class, inheriting its properties.
➢ Scala supports higher-order functions: a function can be passed to, and called by, another function in your code.
Apache Spark:
o Apache Spark is an open source cluster computing framework. Open source means it is freely available on the internet: simply search for Apache Spark, download the source code, and use it as you wish.
o Apache Spark was developed at the AMPLab of the University of California, Berkeley, and the code was later donated to the Apache Software Foundation, which keeps improving it over time to make it more effective, reliable, and portable across platforms.
▪ Apache Spark has the capability to process all types and varieties of data from repositories including the Hadoop Distributed File System and NoSQL databases, as well as data already inside Spark.
▪ Companies such as IBM are hiring data scientists and data engineers who know the Apache Spark project, so that innovation can happen more easily and new features and changes keep arriving.
Apache Mesos:
➢ Apache Mesos is an open source cluster manager developed at the University of California, Berkeley.
➢ It provides the resource isolation and sharing required across distributed applications.
➢ The Mesos software provides resource sharing in a fine-grained manner, which allows cluster utilization to be improved.
➢ Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it runs especially well with Kafka, Cassandra, Spark, and Akka.
Akka:
➢ Akka is an actor-based, message-driven runtime for building concurrent, elastic, and resilient processes.
➢ An actor can be controlled and limited to perform only its intended task. Akka is an open source library or toolkit.
➢ Akka is used to create distributed and fault-tolerant systems, and the library can be integrated into the Java Virtual Machine (JVM) to support JVM languages.
➢ Akka is written in Scala and integrates with the Scala programming language; it helps developers deal with explicit locking and thread management.
Apache Cassandra:
➢ Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.
➢ Cassandra can be used as a real-time operational data store for online transactional applications.
➢ Cassandra is designed to have peer-to-peer, continuously running nodes instead of master or named nodes, to ensure that there is no single point of failure.
➢ A NoSQL database is a database that provides a mechanism to store and retrieve data that is modeled differently from the tabular relations of a relational database.
Kafka:
➢ Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
➢ Apache Kafka is a highly scalable, reliable, fast, and distributed system. Kafka is suitable for both offline and online message consumption.
➢ Kafka messages are stored on disk and replicated within the cluster to prevent data loss.
➢ Kafka is distributed, partitioned, replicated, and fault tolerant, which makes it very reliable.
➢ The Kafka messaging system scales easily without downtime, which makes it very scalable. Kafka has high throughput for both publishing and subscribing to messages, and it can store data up to terabytes.
Python:
o Python is a programming language, and it can be used on a server to create web applications.
o Python can be used for web development, mathematics, and software development, and it can connect to databases to create and modify data.
o Python can handle large amounts of data and is capable of performing complex tasks on that data.
o Python is reliable, portable, and flexible, and it works on different platforms such as Windows, macOS, and Linux.
o Python can be installed on all major operating systems (Windows, Linux, and macOS), and it works on all these platforms, which is useful for learning and understanding.
o You can gain much more knowledge by installing it and working on all three platforms for data science and data engineering.
o To install the data science packages required for Python on Ubuntu, run the appropriate install command at the prompt.
o To install the data science packages required for Python on other Linux distributions, run the equivalent command for that distribution (a hedged example follows below).
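The exact commands are not reproduced in these notes; a typical equivalent, assuming Python 3 and pip are already available, would be:

# On Ubuntu/Debian, install pip first, then the common data science packages:
sudo apt-get install python3-pip
pip3 install pandas numpy scipy matplotlib scikit-learn

# On most other Linux distributions, the pip step is the same:
pip3 install pandas numpy scipy matplotlib scikit-learn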
➢ https://www.python.org/downloads/
➢ Python Libraries:
➢ A Python library is a collection of functions and methods that allows you to perform many actions without writing your own code.
➢ Pandas:
➢ Pandas stands for "panel data," and it is the core Python library for data manipulation and data analysis.
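A small pandas sketch, assuming pandas is installed; the data is created inline.

import pandas as pd

# Build a small DataFrame and run a couple of typical manipulations.
df = pd.DataFrame({"city": ["Mumbai", "Pune", "Mumbai"], "amount": [120, 80, 95]})
print(df.head())                             # inspect the first rows
print(df.groupby("city")["amount"].sum())    # aggregate amounts per city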
Matplotlib:
➢ Matplotlib is used for data visualization and is one of the most important Python packages.
➢ Matplotlib is used to display and visualize 2D data, and it is written in Python.
➢ It can be used in Python scripts, Jupyter notebooks, and web application servers.
➢ Matplotlib can be installed on Ubuntu, other Linux distributions, and Windows by running the appropriate package installation command for the platform (a hedged example follows below).
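The platform-specific commands are not reproduced here; on all three platforms the usual route, assuming pip is available, is `pip install matplotlib`. A minimal plotting sketch:

import matplotlib.pyplot as plt

# Plot a simple 2D line chart and save it to a file.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y, marker="o")
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("A simple Matplotlib plot")
plt.savefig("squares.png")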
NumPy:
➢ NumPy is the fundamental package of the Python language for numerical computation.
➢ NumPy is used together with the SciPy and Matplotlib packages of Python, and it is freely available on the internet.
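A short NumPy sketch, assuming NumPy is installed; the values are invented.

import numpy as np

# Create an array and compute some basic numerical summaries.
values = np.array([10.0, 7.0, 55.0, 15.0, 7.0])
print(values.mean(), values.std(), values.min(), values.max())

# Vectorized arithmetic: no explicit Python loop is needed.
print(values * 2 + 1)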
SymPy:
➢ SymPy is a Python library used for symbolic mathematics, and it can work with complex algebraic formulas.
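A small SymPy sketch, assuming SymPy is installed.

from sympy import symbols, solve, expand

x = symbols("x")
print(expand((x + 2) ** 2))   # x**2 + 4*x + 4
print(solve(x**2 - 4, x))     # [-2, 2]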
R:
➢ R is a programming language used for statistical computing and graphics.
➢ The R language is used by data engineers, data scientists, statisticians, and data miners for developing software and performing data analytics.
➢ There are some core requirements before learning the R language: you should understand its library and package concepts and know how to work with them.
➢ Related R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2, reshape, and others.
Unit Structure
3.0 Objectives
3.1 Introduction
3.2 Operational Management Layer
3.0 Objectives
• The objective is to explain in detail the core operations of the three management layers, i.e., the Operational Management Layer; the Audit, Balance, and Control Layer; and the Functional Layer.
3.1 Introduction
• The Three Management Layers are a very important part of the framework.
• They watch the overall operations in the data science ecosystem and make
sure that things are happening as per plan.
• Overall Communication
• Operations management handles all communication from the system; it makes sure that any activities that are happening are communicated to the system.
• To make sure that all our data science processes are tracked, we may use a complex communication process.
• Overall Alerting
3.3.1 Audit
• An audit refers to an examination of the ecosystem that is systematic and
independent
• This sublayer records which processes are running at any given specific point
within the ecosystem.
• Data scientists and engineers use this information collected to better
understand and plan future improvements to the processing to be done.
• The audit sublayer of the data science ecosystem consists of a series of observers that record prespecified processing indicators related to the ecosystem.
• The following are good indicators for audit purposes:
• Built-in Logging
• Debug Watcher
• Information Watcher
• Warning Watcher
• Error Watcher
• Fatal Watcher
• Basic Logging
• Process Tracking
• Data Provenance
• Data Lineage
• Information Watcher
• The information watcher logs information that is beneficial to the running
and management of a system.
• It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
• Warning Watcher
• Warning is usually used for exceptions that are handled or other
important log events.
• Usually this means that the issue was handled by the tool, which also took corrective action to recover.
• It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
• Error Watcher
• The processing algorithms and data models are spread across six super
steps for processing the data lake.
1. Retrieve: This super step contains all the processing chains for
retrieving data from the raw data lake into a more structured format.
2. Assess: This super step contains all the processing chains for quality
assurance and additional data enhancements.
3. Process: This super step contains all the processing chains for
building the data vault.
4. Transform: This super step contains all the processing chains for
building the data warehouse from the core data vault.
5. Organize: This super step contains all the processing chains for
building the data marts from the core data warehouse.
6. Report: This super step contains all the processing chains for building
virtualization and reporting of the actionable knowledge.
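A rough sketch of how the six super steps could be wired together as plain Python functions; this only illustrates the flow described above and is not the book's actual implementation.

# Each function stands in for the processing chains of one super step.
def retrieve(raw_sources):
    return [r.strip() for r in raw_sources]        # raw data lake -> structured records

def assess(records):
    return [r for r in records if r]               # drop records that fail a quality check

def process(records):
    return {"hub": sorted(set(records))}           # build a (toy) data vault

def transform(vault):
    return {"dimension": vault["hub"]}             # data vault -> data warehouse

def organize(warehouse):
    return {"mart_a": warehouse["dimension"][:1]}  # data warehouse -> data marts

def report(marts):
    print("report:", marts)                        # virtualize and report the knowledge

report(organize(transform(process(assess(retrieve([" alpha ", "", "beta "]))))))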
3.8 References
Andreas François Vermeulen, “Practical Data Science - A Guide to Building
the Technology Stack for Turning Data Lakes into Business Assets”
Unit Structure
4.0 Objectives
4.1 Introduction
4.8 References
4.0 Objectives
• The objective of this chapter is to explain in detail the core operations in the
Retrieve Super step.
• This chapter explains important guidelines which if followed will prevent the
data lake turning into a data swamp.
4.1 Introduction
o The Retrieve super step is a practical method for importing a data lake consisting
of different external data sources completely into the processing ecosystem.
o The Retrieve super step is the first contact between your data science and the
source systems.
o The successful retrieval of the data is a major stepping-stone to ensuring that you
are performing good data science.
o Data lineage delivers the audit trail of the data elements at the lowest granular
level, to ensure full data governance.
o Data quality and master data management help to enrich the data lineage with more business value, if you provide complete data source metadata.
o The Retrieve super step supports the edge of the ecosystem, where your data
science makes direct contact with the outside data world. I will recommend a
current set of data structures that you can use to handle the deluge of data you will
need to process to uncover critical business knowledge.
• A company’s data lake covers all data that your business is authorized to process,
to attain an improved profitability of your business’s core accomplishments.
• The data lake is the complete data world your company interacts with during its
business life span.
• In simple terms, if you generate data or consume data to perform your business
tasks, that data is in your company’s data lake.
• Data swamps are simply data lakes that are not managed.
• Simply dumping a horde of data into a data lake, with no tangible purpose in
mind, will result in a big business risk.
• The data lake must be enabled to collect the data required to answer your
business questions.
• More data points do not mean that data quality is less relevant.
• Data quality can cause the invalidation of a complete data set, if not dealt with
correctly.
• Metadata that link ingested data-to-data sources are a must-have for any data
lake.
o Expected frequency
• The business glossary maps the data-source fields and classifies them into
respective lines of business.
• The business glossary records the data sources ready for the retrieve
processing to load the data.
o External data source field name: States the field as found in the raw
data source
o External data source field type: Records the full set of the field’s data
types when loading the data lake
o Internal data source field name: Records every internal data field
name to use once loaded from the data lake
o Internal data source field type: Records the full set of the field’s types
to use internally once loaded
• The following data analytical models should be executed on every data set in
the data lake by default.
o This is used to validate and verify the data field’s names in the retrieve
processing in an easy manner.
o Example
library(data.table)
o This ensures that the system can handle different files from different
paths and keep track of all data entries in an effective manner.
o Determine the best data type for each column, to assist you in
completing the business glossary, to ensure that you record the correct
import processing rules.
sapply(INPUT_DATA_with_ID, typeof)
library(data.table)
country_histogram = data.table(Country = unique(INPUT_DATA_with_ID[is.na(INPUT_DATA_with_ID['Country']) == 0, ]$Country))
• Minimum Value
• Maximum Value
• Mean
• Median
o Determine the value that splits the data set into two parts in a specific
column.
• Mode
INPUT_DATA_COUNTRY_FREQ =
data.table(with(INPUT_DATA_with_ID, table(Country)))
• Range
o For numeric values, you determine the range of the values by taking the
maximum value and subtracting the minimum value.
• Quartiles
o These are the base values that divide a data set in quarters. This is done
by sorting the data column first and then splitting it in groups of four equal
parts.
sapply(latitude_histogram_with_id[,'Latitude'], quantile, na.rm=TRUE)
• Standard Deviation
• Skewness
• Data Pattern
o Replace all alphabetic characters with an uppercase A, all numbers with an uppercase N, all spaces with a lowercase b, and all other unknown characters with a lowercase u.
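A hedged Python sketch of the data-pattern rule described above; the function name and sample value are invented.

import re

def data_pattern(value: str) -> str:
    """Replace letters with 'A', digits with 'N', spaces with 'b', everything else with 'u'."""
    pattern = re.sub(r"[A-Za-z]", "A", value)
    pattern = re.sub(r"[0-9]", "N", pattern)
    pattern = re.sub(r" ", "b", pattern)
    pattern = re.sub(r"[^ANb]", "u", pattern)
    return pattern

print(data_pattern("KA-01 AB 1234"))   # AAuNNbAAbNNNN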
• To prevent a data swamp, it is essential that you also train your team. Data science is a team effort.
• People, process, and technology are the three cornerstones to ensure that
data is curated and protected.
• You are responsible for your people; share the knowledge you acquire
from this book. The process I teach you, you need to teach them. Alone,
you cannot achieve success.
• Technology requires that you invest time to understand it fully. We are only
at the dawn of major developments in the field of data engineering and
data science.
• Remember: A big part of this process is to ensure that business users and
data scientists understand the need to start small, have concrete
questions in mind, and realize that there is work to do with all data to
achieve success.
In this section we discuss two things: shipping terms and Incoterms 2010.
• These determine the rules of the shipment, the conditions under which it is
made. Normally, these are stated on the shipping manifest.
o Port - A port is any point from which you have to exit or enter a country. Normally, these are shipping ports or airports, but they can also include border crossings via road. Note that there are two ports in the complete process; this is important. There is a port of exit and a port of entry.
o Ship - Ship is the general term for the physical transport method used for the goods. This can refer to a cargo ship, airplane, truck, or even a person, but it must be identified by a unique allocation number.
o Terminal - A terminal is the physical point at which the goods are handed off for the next phase of the physical shipping.
• This option specifies which party has the obligation to pay if something happens to the product being shipped (i.e., if the product is damaged or destroyed en route before it reaches the buyer).
• EXW—Ex Works
o Here the seller makes the product or goods available at his premises or at another named place. The term EXW puts the minimum obligations on the seller of the product/item and the maximum obligations on the buyer.
o Here is the data science version: if I were to buy an item at a local store and take it home, and the shop has shipped it EXW—Ex Works, the moment I pay at the register, the ownership is transferred to me. If anything happens to the item afterward, I would have to pay to replace it.
• FCA—Free Carrier
o Under this term, the seller is expected to deliver the product or goods, cleared for export, at a named place.
• CPT—Carriage Paid To
o Under this term, the seller is expected to pay for the carriage of the product or goods up to the named place of destination.
o The moment the product or goods are delivered to the first carrier, they are considered delivered, and the risk transfers to the buyer.
o All the costs, including origin costs, export clearance, and freight costs for carriage to the named place of destination, have to be paid by the seller. The named destination could be the final destination, such as the buyer's facility, or a port in the destination country; this has to be agreed upon by both seller and buyer in advance.
o The data science version: if I were to buy an item at an overseas store and then pick it up at the export desk before taking it home, and the shop shipped it CPT—Carriage Paid To—the duty desk for free, the moment I pay at the register, the ownership is transferred to me, but if anything happens to the item between the shop and the duty desk of the shop, the shop has to pay.
o It is only once I have picked up the item at the desk that I have to pay if anything happens. So, the moment I take the item, the transaction becomes EXW, and I must pay any required export and import duties on arrival in my home country.
• CIP—Carriage and Insurance Paid To
o Under this term, the seller also has to arrange insurance for the goods while shipping them.
o The data science version: if I were to buy an item at an overseas store and then pick it up at the export desk before taking it home, and the shop has shipped it CIP—Carriage and Insurance Paid To—the duty desk for free, the moment I pay at the register, the ownership is transferred to me, but if anything happens to the item between the shop and the duty desk, the shop pays, since the shop arranged the insurance.
o It is only once I have picked it up at the desk that I have to pay if anything happens. So, the moment I take the item, it becomes EXW, and I have to pay any export and import duties on arrival in my home country. Note that the insurance only covers the portion of the transaction between the shop and the duty desk.
• DAT—Delivered at a Terminal
o According to this term, the seller has to deliver and unload the goods at a named terminal. The seller assumes all risks until delivery at the destination and has to pay all incurred costs of transport, including export fees, carriage, unloading from the main carrier at the destination port, and destination port charges.
o The terminal can be a port, airport, or inland freight interchange, but it must be a facility with the capability to receive the shipment. If the seller is not able to organize unloading, it should consider shipping under DAP terms instead. All charges after unloading (for example, import duty, taxes, customs, and on-carriage costs) are to be borne by the buyer.
o The data science version: if I were to buy an item at an overseas store and then pick it up at a local store before taking it home, and the overseas shop shipped it DAT—Delivered at Terminal (local shop)—the moment I pay at the register, the ownership is transferred to me.
o However, if anything happens to the item between the payment and the pickup, the local shop pays. It is only once the item is picked up at the local shop that I have to pay if anything happens. So, the moment I take it, the transaction becomes EXW, and I have to pay any import duties on arrival in my home country.
• DAP—Delivered at Place
o The packaging cost at origin has to be paid by the seller, and all the legal formalities in the exporting country are carried out by the seller at his own expense.
o Once the goods are delivered in the destination country, the buyer has to pay for the customs clearance.
o Here is the data science version: if I were to buy 100 pieces of a particular item from an overseas web site and then pick up the copies at a local store before taking them home, and the shop shipped the copies DAP—Delivered at Place (local shop)—the moment I paid at the register, the ownership would be transferred to me. However, if anything happened to the items between the payment and the pickup, the web site owner pays. Once the 100 pieces are picked up at the local shop, I have to pay to unpack them at the store. So, the moment I take the copies, the transaction becomes EXW, and I will have to pay any costs after I take the copies.
• DDP—Delivered Duty Paid
o Here the seller is responsible for delivering the products or goods to an agreed destination place in the country of the buyer. The seller has to pay all expenses, such as packing at origin, delivering the goods to the destination, import duties and taxes, and clearing customs.
o The seller is not responsible for unloading. The term DDP places the minimum obligations on the buyer and the maximum obligations on the seller. Neither the risk nor the responsibility is transferred to the buyer until delivery of the goods is completed at the named place of destination.
o Here is the data science version: if I were to buy 100 of an item at an overseas web site and then pick them up at a local store before taking them home, and the shop shipped DDP—Delivered Duty Paid (my home)—the seller would remain responsible for all costs and risks, including import duties, until the goods were delivered to my home.
• While performing data retrieval, you may have to work with one of the following data stores.
• SQLite
engine = create_engine('sqlite:///:memory:')
• Microsoft SQL Server
engine =
create_engine('mssql+pymssql://scott:tiger@hostname:port/folder')
• Oracle
engine =
create_engine('oracle://andre:[email protected]:1521/vermeulen')
• MySQL
engine =
create_engine('mysql+mysqldb://scott:tiger@localhost/vermeulen')
• Apache Cassandra
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('vermeulen')
• Apache Hadoop
o The pydoop package includes a Python MapReduce and HDFS API for
Hadoop.
• Pydoop
• Microsoft Excel
• Apache Spark
o Apache Spark is now becoming the next standard for distributed data processing. The universal acceptance and support of the processing engine make it a must-know tool for data engineers and data scientists.
• Apache Hive
o Access to Hive opens its highly distributed ecosystem for use by data
scientists
• Luigi
• Amazon S3 Storage
• Amazon Redshift
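A hedged sketch of pulling data from any such engine into pandas, assuming SQLAlchemy and pandas are installed; an in-memory SQLite engine is used here so the sketch runs without a server, and the table name is invented.

import pandas as pd
from sqlalchemy import create_engine, text

# An in-memory SQLite engine stands in for any of the engines shown above.
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE retrieve_log (id INTEGER, source TEXT)"))
    conn.execute(text("INSERT INTO retrieve_log VALUES (1, 'csv'), (2, 'api')"))

# Retrieve the data into a pandas DataFrame for the rest of the Retrieve super step.
df = pd.read_sql_query("SELECT * FROM retrieve_log", engine)
print(df)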
4. State and explain the four critical steps to avoid data swamps.
i. Seller,
ii. Carrier,
iii. Port,
iv. Ship,
v. Terminal,
vi. Buyer.
i. Ex Works
v. Delivered at Terminal
8. List and explain the different data stores used in data science.
4.8 References
Books:
Websites:
• https://www.aitworldwide.com/incoterms
• Incoterm: https://www.ntrpco.com/what-is-incoterms-part2/
5.0 Objectives
This chapter makes you understand the following concepts:
• Dealing with errors in data
• Principles of data analysis
• Different ways to correct errors in data
5.2 Errors
Errors are the norm, not the exception, when working with data. By now, you’ve probably
heard the statistic that 88% of spreadsheets contain errors. Since we cannot safely assume
that any of the data we work with is error-free, our mission should be to find and tackle
errors in the most efficient way possible.
5.3.1 Completeness:
Completeness is defined as expected comprehensiveness. Data can be complete even if
optional data is missing. As long as the data meets the expectations then the data is
considered complete.
For example, a customer’s first name and last name are mandatory but middle name is
optional; so a record can be considered complete even if a middle name is not available.
Questions you can ask yourself: Is all the requisite information available? Do any data values
have missing elements? Or are they in an unusable state?
5.3.2 Consistency
Consistency means data across all systems reflects the same information and is in sync across the enterprise.
Examples:
A business unit status is closed but there are sales for that business unit.
Employee status is terminated but pay status is active.
5.3.3 Timeliness
Timeliness refers to whether information is available when it is expected and needed.
Timeliness of data is very important. This is reflected in:
• Companies that are required to publish their quarterly results within a given frame of
time
• Customer service providing up-to-date information to the customers
• Credit system checking in real-time on the credit card account activity
The timeliness depends on user expectations. Online availability of data could be required for a room-allocation system in hospitality, but nightly data could be perfectly acceptable for a billing system.
5.3.4 Conformity
Conformity means the data follows a set of standard data definitions, such as data type, size, and format. For example, a customer's date of birth is in the format "mm/dd/yyyy".
Questions you can ask yourself: Do data values comply with the specified formats? If so, do
all the data values comply with those formats?
Maintaining conformance to specific formats is important.
5.3.5 Accuracy
Accuracy is the degree to which data correctly reflects the real-world object or event being described. Examples:
• The sales figures of the business unit are the real values.
• The address of an employee in the employee database is the real address.
Questions you can ask yourself: Do data objects accurately represent the “real world” values
they are expected to model? Are there incorrect spellings of product or person names,
addresses, and even untimely or not current data?
These issues can impact operational and advanced analytics applications.
5.3.6 Integrity
Integrity means validity of data across the relationships and ensures that all data in a
database can be traced and connected to other data.
For example, in a customer database, there should be valid customers, addresses, and relationships between them. If there is address relationship data without a customer, then that data is not valid and is considered an orphaned record.
Ask yourself: Is there any data missing important relationship linkages?
The inability to link related records together may actually introduce duplication across your
systems.
5.4.1.1. Drop the Columns Where All Elements Are Missing Values
Importing data
Step 1: Importing necessary libraries
import os
import pandas as pd
Pandas provides various data structures and operations for manipulating numerical data
and time series. However, there can be cases where some data might be missing. In Pandas
missing data is represented by two values:
• None: None is a Python singleton object that is often used for missing data in Python
code.
• NaN: NaN (an acronym for Not a Number), is a special floating-point value recognized by
all systems that use the standard IEEE floating-point representation
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. In order to drop null values from a dataframe, we use the dropna() function; this function drops rows/columns of the dataset with null values in different ways.
Syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
• axis: axis takes int or string value for rows/columns. Input can be 0 or 1 for Integer and
‘index’ or ‘columns’ for String.
• how: how takes string value of two kinds only (‘any’ or ‘all’). ‘any’ drops the row/column
if ANY value is Null and ‘all’ drops only if ALL values are null.
• thresh: thresh takes an integer value specifying the minimum number of non-NA values required to keep a row/column.
• subset: an array-like of labels that limits the dropping process to the given rows/columns.
• inplace: a boolean which, if True, makes the changes in the data frame itself.
Here, column C has all NaN values. Let's drop this column. For this, use the following code.
Code:
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2.0, np.nan, 0],
                   [3.0, 4.0, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=['A', 'B', 'C', 'D'])

df.dropna(axis=1, how='all')  # this deletes the columns in which all values are null
Here, axis=1 means columns, and how='all' means drop only the columns in which all values are NaN.
Output:
A B D
0 NaN 2.0 0
1 3.0 4.0 1
2 NaN NaN 5
5.4.1.2. Drop the Columns Where Any of the Elements Is Missing Values
Here, columns A, B, and C each contain at least one NaN value. Let's drop these columns. For this, use the following code.
Code:
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2.0, np.nan, 0],
                   [3.0, 4.0, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=['A', 'B', 'C', 'D'])

df.dropna(axis=1, how='any')  # this deletes every column that contains at least one null value
Here, axis=1 means columns, and how='any' means drop a column if any of its values is NaN.
Output:
D
0 0
1 1
2 5
5.4.1.3. Keep Only the Rows That Contain a Maximum of Two Missing Values
Let’s consider the same dataframe again:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Here, row 2 has more than two NaN values, so this row will be dropped. For this, use the following code.
Code:
# importing pandas and numpy
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2.0, np.nan, 0],
                   [3.0, 4.0, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=['A', 'B', 'C', 'D'])

df.dropna(thresh=2)
# this keeps only the rows that have at least two non-NaN values
Here, thresh=2 means a row needs at least two non-NaN values to be kept; with four columns, that allows at most two NaN values per row.
Output:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
5.4.1.4. Fill All Missing Values with the Mean, Median, Mode, Minimum
Another approach to handling missing values is to impute or estimate them. Missing-value imputation has a long history in statistics and has been thoroughly researched. In essence, a simple guess for a missing value is the mean, median, or mode (the most frequently appearing value) of that variable.
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8, 12],
[15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df
df.fillna(df.mean())
Output:
Apple Orange Banana Pear
Basket 1 10.0 8.25 30.0 40.0
Basket 2 7.0 14.00 21.0 28.0
Basket 3 55.0 8.25 8.0 12.0
Basket 4 15.0 14.00 13.8 8.0
Basket 5 7.0 1.00 1.0 18.0
Basket 6 18.8 4.00 9.0 2.0
Here, the mean of the Apple column = (10+7+55+15+7)/5 = 18.8, so the NaN in Apple is replaced by 18.8. Similarly, the Orange NaNs are replaced with 8.25, the Banana NaN with 13.8, and the Pear NaN with 18.0.
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8, 12],
[15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df
df.fillna(df.median())
Output:
Apple Orange Banana Pear
Basket 1 10 9.0 30 40
Basket 2 7 14 21 28
Basket 3 55 9.0 8 12
Basket 4 15 14 9.0 8
Basket 5 7 1 1 12.0
Basket 6 10.0 4 9 2
Here, the median of Apple Column = (7, 7, 10, 15, 55) = 10. So, Nan value is replaced by 10.
Similarly, in Orange Column Nan’s are replaced with 9, in Banana’s column Nan replaced
with 9 and in Pear’s column it is replaced with 12.
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 8, 28], [55, np.nan, 8, 12],
[15, 14, np.nan, 12], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df

df.fillna(df.mode().iloc[0])  # df.mode() returns a DataFrame, so take its first row to fill each column
Output:
Apple Orange Banana Pear
Basket 1 10 14 30 40
Basket 2 7 14 8 28
Basket 3 55 14 8 12
Basket 4 15 14 8 12
Basket 5 7 1 1 12
Basket 6 7.0 4 9 2
Here, the mode of Apple Column = (10, 7, 55, 15, 7) = 7. So, Nan value is replaced by 7.
Similarly, in Orange Column Nan’s are replaced with 14, in Banana’s column Nan replaced
with 8 and in Pear’s column it is replaced with 12.
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8, 12],
[15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df
df.fillna(df.min())
Output:
Apple Orange Banana Pear
Basket 1 10 1 30 40
Basket 2 7 14 21 28
Basket 3 55 1 8 12
Basket 4 15 14 1 8
Basket 5 7 1 1 2
Basket 6 7 4 9 2
Here, the minimum of Apple Column = (10, 7, 55, 15, 7) = 7. So, Nan value is replaced by 7.
Similarly, in Orange Column Nan’s are replaced with 1, in Banana’s column Nan replaced
with 1 and in Pear’s column it is replaced with 2.
Unit Structure
6.0 Objectives
6.1 Engineering a Practical Assess Superstep
6.2 References
6.3 Exercise Questions
6.0 Objectives
This chapter will make you understand the practical concepts of:
• Assess superstep
• Python NetworkX Library used to draw network routing graphs
• Python Schedule library to schedule various jobs
To use the NetworkX library, first install it on your machine by using the following command at your command prompt.
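Assuming pip is available, the usual install command is:

pip install networkx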
NetworkX is a Python package for the creation, manipulation, and study of the structure,
dynamics, and functions of complex networks.
NetworkX provides:
• tools for the study of the structure and dynamics of social, biological, and infrastructure
networks;
• a standard programming interface and graph implementation that is suitable for many
applications;
• a rapid development environment for collaborative, multidisciplinary projects;
• an interface to existing numerical algorithms and code written in C, C++, and FORTRAN;
and
• the ability to painlessly work with large nonstandard data sets.
With NetworkX you can load and store networks in standard and nonstandard data formats,
generate many types of random and classic networks, analyze network structure, build
network models, design new network algorithms, draw networks, and much more.
Graph Theory
In graph theory, a graph consists of a finite set of vertices (V) connected by a set of edges (E), where each edge is a two-element subset of V. Each edge (e) connecting two destinations, or nodes, is called a link. Consider the graph of bike paths below: the sets {K,L}, {F,G}, {J,H}, {H,L}, {A,B}, and {C,E} are examples of edges.
The total number of edges incident to a node is the degree of that node. In the graph above, M has a degree of 2 ({M,H} and {M,L}) while B has a degree of 1 ({A,B}). Degree can be described formally as deg(v) = |{e ∈ E : v ∈ e}|, i.e., the number of edges incident to v.
For example:
# ### Creating a graph
# Create an empty graph with no nodes and no edges.
import networkx as nx
G = nx.Graph()
G.add_node(1)
G.add_nodes_from([2, 3])
H = nx.path_graph(10)
G.add_nodes_from(H)
G.add_node(H)
# The graph `G` now contains `H` as a node. This flexibility is very powerful as it allows
# graphs of graphs, graphs of files, graphs of functions and much more. It is worth thinking
# about how to structure # your application so that the nodes are useful entities. Of course
# you can always use a unique identifier # in `G` and have a separate dictionary keyed by
# identifier to the node information if you prefer.
# # Edges
# `G` can also be grown by adding one edge at a time,
G.add_edge(1, 2)
e = (2, 3)
G.add_edge(*e)  # unpack edge tuple
G.add_edges_from(H.edges)
G.clear()
# After clearing, we add new nodes/edges, and NetworkX quietly ignores any that are already present.
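The additions below, following the standard NetworkX tutorial example, repopulate `G`:
G.add_edges_from([(1, 2), (1, 3)])
G.add_node(1)
G.add_edge(1, 2)
G.add_node("spam")        # adds node "spam"
G.add_nodes_from("spam")  # adds 4 nodes: 's', 'p', 'a', 'm'
G.add_edge(3, 'm')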
# At this stage the graph `G` consists of 8 nodes and 3 edges, as can be seen by:
G.number_of_nodes()
G.number_of_edges()
list(G.nodes)
list(G.edges)
list(G.adj[1]) # or list(G.neighbors(1))
G.degree[1] # the number of edges incident to 1
# One can specify to report the edges and degree from a subset of all nodes using an
# nbunch. An *nbunch* is any of: `None` (meaning all nodes), a node, or an iterable
# container of nodes that is not itself a node in the graph.
G.edges([2, 'm'])
G.degree([2, 3])
G.remove_node(2)
G.remove_nodes_from("spam")
list(G.nodes)
G.remove_edge(1, 3)
G.add_edge(1, 2)
H = nx.DiGraph(G) # create a DiGraph using the connections from G
list(H.edges())
edgelist = [(0, 1), (1, 2), (2, 3)]
H = nx.Graph(edgelist)
G.add_edge(1, 3)
G[1][3]['color'] = "blue"
G.edges[1, 2]['color'] = "red"
G.edges[1, 2]
FG = nx.Graph()
FG.add_weighted_edges_from([(1, 2, 0.125), (1, 3, 0.75), (2, 4, 1.2), (3, 4, 0.375)])
for n, nbrs in FG.adj.items():
    for nbr, eattr in nbrs.items():
        wt = eattr['weight']
        if wt < 0.5: print(f"({n}, {nbr}, {wt:.3})")
G = nx.Graph(day="Friday")
G.graph
G.graph['day'] = "Monday"
G.graph
# # Node attributes
# Add node attributes using `add_node()`, `add_nodes_from()`, or `G.nodes`
G.add_node(1, time='5pm')
G.add_nodes_from([3], time='2pm')
G.nodes[1]
G.nodes[1]['room'] = 714
G.nodes.data()
# Note that adding a node to `G.nodes` does not add it to the graph, use
# `G.add_node()` to add new nodes. Similarly for edges.
# # Edge Attributes
# Add/change edge attributes using `add_edge()`, `add_edges_from()`,
# or subscript notation.
G.add_edge(1, 2, weight=4.7 )
G.add_edges_from([(3, 4), (4, 5)], color='red')
G.add_edges_from([(1, 2, {'color': 'blue'}), (2, 3, {'weight': 8})])
G[1][2]['weight'] = 4.7
G.edges[3, 4]['weight'] = 4.2
# The special attribute `weight` should be numeric as it is used by
# algorithms requiring weighted edges.
# Directed graphs
# The `DiGraph` class provides additional methods and properties specific
# to directed edges, e.g.,
# `DiGraph.out_edges`, `DiGraph.in_degree`,
# `DiGraph.predecessors()`, `DiGraph.successors()` etc.
# To allow algorithms to work with both classes easily, the directed version of
# `neighbors()` is equivalent to `successors()` while `degree` reports
# the sum of `in_degree` and `out_degree` even though that may feel
# inconsistent at times.
DG = nx.DiGraph()
DG.add_weighted_edges_from([(1, 2, 0.5), (3, 1, 0.75)])
DG.out_degree(1, weight='weight')
DG.degree(1, weight='weight')
list(DG.successors(1))
list(DG.neighbors(1))
# Some algorithms work only for directed graphs and others are not well
# defined for directed graphs. Indeed the tendency to lump directed
# and undirected graphs together is dangerous. If you want to treat
# a directed graph as undirected for some measurement you should probably
# convert it using `Graph.to_undirected()` or with `nx.Graph(G)`.
# # Multigraphs
# NetworkX provides classes for graphs which allow multiple edges
# between any pair of nodes. The `MultiGraph` and
# `MultiDiGraph`
# classes allow you to add the same edge twice, possibly with different
# edge data. This can be powerful for some applications, but many
# algorithms are not well defined on such graphs.
# Where results are well defined,
# e.g., `MultiGraph.degree()` we provide the function. Otherwise you
# should convert to a standard graph in a way that makes the measurement well defined.
MG = nx.MultiGraph()
MG.add_weighted_edges_from([(1, 2, 0.5), (1, 2, 0.75), (2, 3, 0.5)])
dict(MG.degree(weight='weight'))
GG = nx.Graph()
for n, nbrs in MG.adjacency():
    for nbr, edict in nbrs.items():
        minvalue = min([d['weight'] for d in edict.values()])
        GG.add_edge(n, nbr, weight=minvalue)
nx.shortest_path(GG, 1, 3)
K_5 = nx.complete_graph(5)
K_3_5 = nx.complete_bipartite_graph(3, 5)
barbell = nx.barbell_graph(10, 10)
lollipop = nx.lollipop_graph(10, 20)
er = nx.erdos_renyi_graph(100, 0.15)
ws = nx.watts_strogatz_graph(30, 3, 0.1)
ba = nx.barabasi_albert_graph(100, 5)
red = nx.random_lobster(100, 0.9, 0.9)
nx.write_gml(red, "path.to.file")
mygraph = nx.read_gml("path.to.file")
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3)])
G.add_node("spam") # adds node "spam"
list(nx.connected_components(G))
sorted(d for n, d in G.degree())
nx.clustering(G)
# Some functions with large output iterate over (node, value) 2-tuples.
# These are easily stored in a [dict](https://docs.python.org/3/library/stdtypes.html#dict)
# structure if you desire.
sp = dict(nx.all_pairs_shortest_path(G))
sp[3]
# To test whether the import of `networkx.drawing` was successful, draw `G` using one of the
# functions below. Matplotlib must be imported first:
import matplotlib.pyplot as plt
G = nx.petersen_graph()
plt.subplot(121)
nx.draw(G, with_labels=True, font_weight='bold')
plt.subplot(122)
nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True, font_weight='bold')
# Note that you may need to issue a Matplotlib `plt.show()` command when drawing to an interactive display:
plt.show()
options = {
'node_color': 'black',
'node_size': 100,
'width': 3,
}
plt.subplot(221)
nx.draw_random(G, **options)
plt.subplot(222)
nx.draw_circular(G, **options)
plt.subplot(223)
nx.draw_spectral(G, **options)
plt.subplot(224)
nx.draw_shell(G, nlist=[range(5,10), range(5)], **options)
G = nx.dodecahedral_graph()
shells = [[2, 3, 4, 5, 6], [8, 1, 0, 19, 18, 17, 16, 15, 14, 7], [9, 10, 11, 12, 13]]
nx.draw_shell(G, nlist=shells, **options)
plt.show()
nx.draw(G)
plt.savefig("path.png")
The Schedule library is used to schedule a task at a particular time every day or on a particular
day of the week. You can also set the time, in 24-hour format, at which a task should run.
Basically, the Schedule library matches your system time against the scheduled time you set.
Once the scheduled time and the system time match, the job function (the command or function
that was scheduled) is called.
Installation
$ pip install schedule
schedule.Scheduler class
• schedule.every(interval=1) : Calls every on the default scheduler instance. Schedule a
new periodic job.
• schedule.run_pending() : Calls run_pending on the default scheduler instance. Run all
jobs that are scheduled to run.
• schedule.run_all(delay_seconds=0) : Calls run_all on the default scheduler instance. Runs
all jobs, regardless of whether they are scheduled to run or not.
• schedule.idle_seconds() : Calls idle_seconds on the default scheduler instance. Returns the
number of seconds until the next run.
• schedule.next_run() : Calls next_run on the default scheduler instance. Returns the datetime
when the next job should run.
• schedule.cancel_job(job) : Calls cancel_job on the default scheduler instance. Delete a
scheduled job.
• schedule.Job(interval, scheduler=None) class
A periodic job as used by Scheduler.
Parameters:
• interval: A quantity of a certain time unit
• scheduler: The Scheduler instance that this job will register itself with once it has been
fully configured in Job.do().
For example
# Schedule Library imported
import schedule
import time
# Functions setup
def placement():
    print("Get ready for Placement at various companies")

def good_luck():
    print("Good Luck for Test")

def work():
    print("Study and work hard")

def bedtime():
    print("It is bed time go rest")

def datascience():
    print("Data science with python is fun")

# Task scheduling
# After every 10mins datascience() is called.
schedule.every(10).minutes.do(datascience)
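A sketch of how the remaining functions might be scheduled and how the scheduler can be kept alive (the times shown are illustrative assumptions, not part of the original example):
# Every day at the given times the other jobs are called.
schedule.every().day.at("09:00").do(placement)
schedule.every().monday.at("10:30").do(good_luck)
schedule.every().day.at("18:00").do(work)
schedule.every().day.at("22:30").do(bedtime)
# Loop so that pending jobs are checked and executed continuously.
while True:
    schedule.run_pending()
    time.sleep(1)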
6.2 References:
• Python for Data Science For Dummies, by Luca Massaron and John Paul Mueller, ISBN-13: 978-8126524938, Wiley
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, by William McKinney, ISBN-13: 978-9352136414, Shroff/O'Reilly
• Data Science From Scratch: First Principles with Python, Second Edition, by Joel Grus, ISBN-13: 978-9352138326, Shroff/O'Reilly
• Data Science from Scratch, by Joel Grus, ISBN-13: 978-1491901427, O'Reilly
• Data Science Strategy For Dummies, by Ulrika Jagare, ISBN-13: 978-8126533367, Wiley
• Pandas for Everyone: Python Data Analysis, by Daniel Y. Chen, ISBN-13: 978-9352869169, Pearson Education
• Practical Data Science with R (MANNING), by Nina Zumel and John Mount, ISBN-13: 978-9351194378, Dreamtech Press
6.3 Exercise Questions
Q.1 Write a Python program to create the network routing diagram from the given data.
Q.3 Write a Python program to pick the content for Bill Boards from the given data.
Q.4 Write a Python program to generate visitors data from the given csv file.
UNIT IV
CHAPTER 7: PROCESS SUPERSTEP
Structure:
7.1 Objectives
7.2 Introduction
7.3 Data Vault
7.3.1 Hubs
7.3.2 Links
7.3.3 Satellites
7.3.4 Reference Satellites
7.4 Time-Person-Object-Location-Event Data Vault
7.5 Time Section
7.5.1 Time Hub
7.5.2 Time Links
7.5.3 Time Satellites
7.6 Person Section
7.6.1 Person Hub
7.6.2 Person Links
7.6.3 Person Satellites
7.7 Object Section
7.7.1 Object Hub
7.7.2 Object Links
7.7.3 Object Satellites
7.8 Location Section
7.8.1 Location Hub
7.8.2 Location Links
7.8.3 Location Satellites
7.9 Event Section
7.9.1 Event Hub
7.9.2 Event Links
7.9.3 Event Satellites
7.10 Engineering a Practical Process Superstep
7.11 Event
7.11.1 Explicit Event
7.11.2 Implicit Event
7.12 5-Whys Technique
7.12.1 Benefits of the 5 Whys
7.12.2 When Are the 5 Whys Most Useful?
7.12.3 How to Complete the 5 Whys
7.13 Fishbone Diagrams
7.14 Monte Carlo Simulation
7.15 Causal Loop Diagrams
7.16 Pareto Chart
7.17 Correlation Analysis
7.18 Forecasting
7.19 Data Science
7.1 Objectives
7.2 Introduction
The Process superstep converts the assess results of the retrieve versions of the data sources
into a highly structured data vault. These data vaults form the basic data structure for the rest
of the data science steps.
The Process superstep is the amalgamation procedure that pipes your data sources into five
primary classifications of data.
7.3 Data Vault
7.3.1 Hubs
A data vault hub stores business keys. These keys do not change over time. A hub also
contains a surrogate key for each hub entry and metadata information for the business key.
7.3.2 Links
Data vault links are the join relationships between business keys.
7.3.3 Satellites
Data vault satellites store the chronological, descriptive, and characteristic information for a
specific section of business data. Hubs and links give us the model structure but no
chronological characteristics. Satellites consist of characteristics and metadata linking them to
their specific hub.
7.5.1 Time Hub
This hub acts as a connector between time zones.
Following are the fields of the time hub.
7.5.2 Time Links
Time-Person Link
o This link connects date-time values from time hub to person hub.
o Dates such as birthdays, anniversaries, book access date, etc.
Time-Object Link
o This link connects date-time values from time hub to object hub.
o Dates such as when you buy or sell a car, house, or book, etc.
Time-Location Link
o This link connects date-time values from time hub to location hub.
o Dates such as when you moved or accessed a book from a post code, etc.
Time-Event Link
o This link connects date-time values from time hub to event hub.
o Dates such as when you changed vehicles, etc.
A time satellite can be used to move from one time zone to another very easily. This feature
will be used during the Transform superstep.
7.6.2 Person Links
Following are the person links that can be stored as separate links.
Person-Time Link
o This link contains relationship between person hub and time hub.
Person-Object Link
o This link contains relationship between person hub and object hub.
Person-Location Link
o This link contains relationship between person hub and location hub.
Person-Event Link
o This link contains relationship between person hub and event hub.
7.7.2 Object Links
Object Links connect object hub to other hubs.
Following are the object links that can be stored as separate links.
Object-Time Link
o This link contains relationship between Object hub and time hub.
Object-Person Link
o This link contains relationship between Object hub and Person hub.
Object-Location Link
o This link contains relationship between Object hub and Location hub.
Object-Event Link
o This link contains relationship between Object hub and event hub.
7.8.2 Location Links
Location Links connect location hub to other hubs.
Figure 7-6. Location Link
Following are the location links that can be stored as separate links.
Location-Time Link
o This link contains relationship between location hub and time hub.
Location-Person Link
o This link contains relationship between location hub and person hub.
Location-Object Link
o This link contains relationship between location hub and object hub.
Location-Event Link
o This link contains relationship between location hub and event hub.
7.9.2 Event Links
Event-Time Link
o This link contains relationship between event hub and time hub.
Event-Person Link
o This link contains relationship between event hub and person hub.
Event-Object Link
o This link contains relationship between event hub and object hub.
Event-Location Link
o This link contains relationship between event hub and location hub.
Time
Time is one of the most important characteristics of data; it is used to record when events
occurred. ISO 8601-2004 defines an international standard for interchange formats for dates
and times.
The following entities are part of the ISO 8601-2004 standard:
Year, month, day, hour, minute, second, and fraction of a second.
The date/time is recorded from largest (year) to smallest (fraction of a second). These values
must have a pre-approved fixed number of digits that are padded with leading zeros.
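For example, Python's datetime module produces this largest-to-smallest, zero-padded layout directly (a small illustration):
from datetime import datetime
print(datetime(2020, 1, 12, 9, 5, 3, 250000).isoformat())
# 2020-01-12T09:05:03.250000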
Year
The standard uses four digits to represent the year. The values range from 0000 to 9999.
AD/BC years require conversion:
Year Conversion
AD/BC year      ISO 8601 year
N AD            Year N
3 AD            Year 3
1 AD            Year 1
1 BC            Year 0
2 BC            Year -1
2020 AD         +2020
2020 BC         -2019 (year -1 offset for BC)
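The conversion rule can be expressed as a small helper function (illustrative only, not part of any standard library):
def ad_bc_to_iso_year(year, era):
    """Convert an AD/BC year to the ISO 8601 astronomical year number."""
    # AD years map directly; BC years shift by one because 1 BC is year 0000.
    return year if era == 'AD' else 1 - year

print(ad_bc_to_iso_year(2020, 'AD'))   # 2020
print(ad_bc_to_iso_year(1, 'BC'))      # 0
print(ad_bc_to_iso_year(2020, 'BC'))   # -2019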
Month
The standard uses two digits to represent the month. The values range from 01 to 12.
For example, 12 January 2020 becomes 2020-01-12.
The date/time program can be updated to extract the month value. (Here, now_utc is assumed
to be the current UTC time, obtained via the pytz library as shown.)
from datetime import datetime
from pytz import timezone
now_utc = datetime.now(timezone('UTC'))
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Month:',str(now_utc.strftime("%m")))
print('Month Name:',str(now_utc.strftime("%B")))
Output:
Day
The standard uses two digits to represent the day. The values range from 01 to 31.
For example, 22 January 2020 becomes 2020-01-22 or +2020-01-22.
Hour
The standard uses two digits to represent the hour. The values range from 00 to 24.
The valid format is hhmmss or hh:mm:ss; the shortened format hhmm or hh:mm is also accepted.
The value 00:00:00 denotes the beginning of the calendar day; 24:00:00 is used only to indicate
the end of the calendar day.
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Hour:',str(now_utc.strftime("%H")))
Output:
Minute
The standard uses two digits to represent the minute. The values range from 00 to 59.
The valid format is hhmmss or hh:mm:ss.
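By analogy with the month and hour examples, the minute can be printed with the %M directive (assuming the now_utc value created earlier):
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Minute:',str(now_utc.strftime("%M")))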
Output:
Second
The standard uses two digits to represent the second. The values range from 00 to 59.
The valid format is hhmmss or hh:mm:ss.
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Second:',str(now_utc.strftime("%S")))
Output:
7.11 Event
This structure records any specific event or action that is discovered in the data sources. An
event is any action that occurs within the data sources. Events are recorded using three main
data entities: Event Type, Event Group, and Event Code. The details of each event are recorded
as a set of details against the event code. There are two main types of events: explicit and implicit events.
7.12 5-Whys Technique
Problem Statement: Customers are unhappy because they are being shipped products
that don’t meet their specifications.
1. Why are customers being shipped bad products?
• Because manufacturing built the products to a specification that is different from what
the customer and the salesperson agreed to.
2. Why did manufacturing build the products to a different specification than that of sales?
• Because the salesperson accelerates work on the shop floor by calling the head of
manufacturing directly to begin work. An error occurred when the specifications were
being communicated or written down.
3. Why does the salesperson call the head of manufacturing directly to start work instead of
following the procedure established by the company?
• Because the “start work” form requires the sales director’s approval before work can
begin and slows the manufacturing process (or stops it when the director is out of the
office).
4. Why does the form contain an approval for the sales director?
• Because the sales director must be continually updated on sales for discussions with
the CEO, as my retailer customer was a top ten key account.
In this case, only four whys were required to determine that a non-value-added signature
authority helped to cause a process breakdown in the quality assurance for a key account! The
rest was just criminal.
The external buyer at the wholesaler knew this process was regularly bypassed and started
buying the bad tins to act as an unofficial backfill for the failing quality-assurance process in
manufacturing, to make up the shortfalls in sales demand. The wholesaler simply relabelled the
product and did not change how it was manufactured. The reason? Big savings lead to big
bonuses. A key client's orders had to be filled. Sales are important!
Example: The challenge is to keep the “Number of Employees Available to Work and
Productivity” as high as possible.
The following diagram shows how many customer complaints were received in each of five
categories.
7.17 Correlation Analysis
import pandas as pd
a = [ [1, 2, 4], [5, 7, 9], [8, 3, 13], [4, 3, 19], [5, 6, 12], [5, 6, 11],[5, 6, 7], [4, 3, 6]]
df = pd.DataFrame(data=a)
cr=df.corr()
print(cr)
7.18 Forecasting
Forecasting is the ability to project a possible future, by looking at historical data. The data
vault enables these types of investigations, owing to the complete history it collects as it
processes the source’s systems data. You will perform many forecasting projects during your
career as a data scientist and supply answers to such questions as the following:
• What should we buy?
• What should we sell?
• Where will our next business come from?
People want to know what you calculate to determine what is about to happen.
7.19 Data Science
Data science works best when approved techniques and algorithms are followed.
After performing various experiments on the data, the results must be verified, and they must
be supported by evidence.
Data sciences that work follow these steps:
Step 1: It begins with a question. All questions must relate to the customer's business, in such
a way that the answer provides an insight into that business.
Step 2: Design a model, select a prototype for the data, and start a virtual simulation. Some
statistics and mathematical solutions can be added to start a data science model.
Step 3: Formulate a hypothesis based on the collected observations. Using the model, process
the observations and prove whether the hypothesis is true or false.
Step 4: Compare the result with real-world observations and provide these results to the
real-life business.
Step 5: Communicate the progress and intermediate results with customers and subject experts,
and involve them in the whole process to ensure that they are part of the journey of discovery.
Model Questions:
1. Explain the process superstep.
2. Explain the concept of a data vault.
3. What are the different typical reference satellites? Explain.
4. Explain the TPOLE design principle.
5. Explain the Time section of TPOLE.
6. Explain the Person section of TPOLE.
7. Explain the Object section of TPOLE.
8. Explain the Location section of TPOLE.
9. Explain the Event section of TPOLE.
10. Explain the different date and time formats. What is leap year? Explain.
11. What is an event? Explain explicit and implicit events.
12. How to Complete the 5 Whys?
13. What is a fishbone diagram? Explain with example.
14. Explain the significance of Monte Carlo Simulation and Causal Loop Diagram.
15. What are pareto charts? What information can be obtained from pareto charts?
16. Explain the use of correlation and forecasting in data science.
17. State and explain the five steps of data science.
UNIT IV
CHAPTER 8: TRANSFORM SUPERSTEP
Structure:
8.1 Objectives
8.2 Introduction
8.3 Dimension Consolidation
8.4 Sun Model
8.4.1 Person-to-Time Sun Model
8.4.2 Person-to-Object Sun Model
8.4.3 Person-to-Location Sun Model
8.4.4 Person-to-Event Sun Model
8.4.5 Sun Model to Transform Step
8.5 Transforming with Data Science
8.6 Common Feature Extraction Techniques
8.6.1 Binning
8.6.2 Averaging
8.7 Hypothesis Testing
8.7.1 T-Test
8.7.2 Chi-Square Test
8.8 Overfitting & Underfitting
8.8.1 Polynomial Features
8.8.2 Common Data-Fitting Issue
8.9 Precision-Recall
8.9.1 Precision-Recall Curve
8.9.2 Sensitivity & Specificity
8.9.3 F1-Measure
8.9.4 Receiver Operating Characteristic (ROC) Analysis Curves
8.10 Cross-Validation Test
8.11 Univariate Analysis
8.12 Bivariate Analysis
8.13 Multivariate Analysis
8.14 Linear Regression
8.14.1 Simple Linear Regression
8.14.2 RANSAC Linear Regression
8.14.3 Hough Transform
8.15 Logistic Regression
8.15.1 Simple Logistic Regression
8.15.2 Multinomial Logistic Regression
8.15.3 Ordinal Logistic Regression
8.16 Clustering Techniques
8.16.1 Hierarchical Clustering
8.16.2 Partitional Clustering
8.17 ANOVA
8.18 Decision Trees
8.2 Introduction
The Transform superstep allows us to take data from the data vault and answer the questions
raised by the investigation.
It applies standard data science techniques and methods to attain insight and knowledge about
the data, which can then be transformed into actionable decisions. These results can be
explained to non-data scientists.
The Transform Superstep uses the data vault from the process step as its source data.
The sun model is constructed to show all the characteristics from the two data vault hub
categories you are planning to extract. It explains how you will create two dimensions and a
fact via the Transform step. You will create two dimensions (Person and Time) with one fact
(PersonBornAtTime), as shown in the figure below.
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
################################################################
if sys.platform == 'linux' or sys.platform == 'darwin':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
################################################################
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataWarehousetDir=Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName=sDataWarehousetDir + '/datawarehouse.db'
conn2 = sq.connect(sDatabaseName)  # conn2 is used below to store tables in the data warehouse database
print('\n#################################')
print('Time Dimension')
BirthZone = 'Atlantic/Reykjavik'
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
################################################################
IDTimeNumber=str(uuid.uuid4())
TimeLine=[('TimeID', [IDTimeNumber]),
('UTCDate', [BirthDateZoneStr]),
('LocalTime', [BirthDateLocal]),
('TimeZone', [BirthZone])]
TimeFrame = pd.DataFrame(dict(TimeLine))  # DataFrame.from_items was removed in newer pandas versions
################################################################
DimTime=TimeFrame
DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)
sTable = 'Dim-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")
print('\n#################################')
print('Dimension Person')
print('\n#################################')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
###############################################################
IDPersonNumber=str(uuid.uuid4())
PersonLine=[('PersonID', [IDPersonNumber]),
('FirstName', [FirstName]),
('LastName', [LastName]),
('Zone', ['UTC']),
('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame(dict(PersonLine))  # DataFrame.from_items was removed in newer pandas versions
################################################################
DimPerson=PersonFrame
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
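The listing stops after building the Person dimension index. By analogy with the Time dimension above, it would be stored with something like the following (the 'Dim-Person' table name is an assumption that mirrors the 'Dim-Time' naming):
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")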
8.6.1 Binning
The binning technique is used to reduce the complexity of data sets and to enable the data
scientist to evaluate the data with an organized grouping technique.
Binning is a good way for you to turn continuous data into a data set that has specific features
that you can evaluate for patterns. For example, if you have data about a group of people, you
might want to arrange their ages into a smaller number of age intervals (for example, grouping
every five years together).
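For instance, pandas can group ages into five-year bins with pd.cut (an illustrative sketch; the ages below are made up):
import pandas as pd
ages = pd.Series([23, 27, 31, 34, 42, 44, 51])
age_bins = pd.cut(ages, bins=range(20, 60, 5))
print(age_bins.value_counts().sort_index())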
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
print(bin_means)
#The second is to use the histogram function.
bin_means2 = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])
print(bin_means2)
8.6.2 Averaging
The use of averaging enables you to reduce the number of records you require to report any
activity that demands an indicative, rather than a precise, total.
Example:
Create a model that enables you to calculate the average position for ten sample points. First,
set up the ecosystem.
import numpy as np
import pandas as pd
#Create two series to model the latitude and longitude ranges.
LatitudeData = pd.Series(np.array(range(-90,91,1)))
LongitudeData = pd.Series(np.array(range(-180,181,1)))
#Select 10 samples for each range:
LatitudeSet=LatitudeData.sample(10)
LongitudeSet=LongitudeData.sample(10)
#Calculate the average of each data set
LatitudeAverage = np.average(LatitudeSet)
LongitudeAverage = np.average(LongitudeSet)
#See the results
print('Latitude average :', LatitudeAverage)
print('Longitude average:', LongitudeAverage)
8.7.1 T-Test
The t-test is one of many tests used for the purpose of hypothesis testing in statistics. A t-test
is a popular statistical test to make inferences about single means or inferences about two means
or variances, to check if the two groups’ means are statistically different from each other, where
n(sample size) < 30 and standard deviation is unknown.
The One Sample t Test determines whether the sample mean is statistically different from a
known or hypothesised population mean. The One Sample t Test is a parametric test.
H0: Mean age of given sample is 30.
H1: Mean age of given sample is not 30
#pip3 install scipy
#pip3 install numpy
from scipy.stats import ttest_1samp
import numpy as np
ages = np.genfromtxt('ages.csv')
print(ages)
ages_mean = np.mean(ages)
print("Mean age:",ages_mean)
print("Test 1: m=30")
tset, pval = ttest_1samp(ages, 30)
print('p-values - ',pval)
if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we fail to reject null hypothesis")
8.8 Overfitting & Underfitting
Overfitting occurs when the model or the algorithm fits the training data too well. When a
model is trained on so much detail, it starts learning from the noise and inaccurate entries in
the data set. The problem is that the model will then fail to categorize new data correctly,
because it has captured too much detail and noise.
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def f(x):
    """ function to approximate by polynomial interpolation"""
    return x * np.sin(x)
# generate points used to plot
x_plot = np.linspace(0, 10, 100)
# generate points and keep a subset of them
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)
# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]
colors = ['teal', 'yellowgreen', 'gold']
lw = 2
plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw, label="Ground Truth")
plt.scatter(x, y, color='navy', s=30, marker='o', label="training points")
for count, degree in enumerate([3, 4, 5]):
    # Fit a ridge-regularized polynomial model of the given degree and plot its prediction
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw, label="degree %d" % degree)
plt.legend(loc='lower left')
plt.show()
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
    return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    # Show the degree and the cross-validation MSE in the panel title
    plt.title("Degree {}\nMSE = {:.2e}".format(degrees[i], -scores.mean()))
plt.show()
8.9 Precision-Recall
Precision-recall is a useful measure for successfully predicting when classes are extremely
imbalanced. In information retrieval,
• Precision is a measure of result relevancy.
• Recall is a measure of how many truly relevant results are returned.
A system with high recall but low precision returns many results, but most of its predicted
labels are incorrect when compared to the training labels. A system with high precision but low
recall is just the opposite, returning very few results, but most of its predicted labels are correct
when compared to the training labels. An ideal system with high precision and high recall will
return many results, with all results labelled correctly.
Precision (P) is defined as the number of true positives (Tp) over the number of true
positives (Tp) plus the number of false positives (Fp).
Recall (R) is defined as the number of true positives (Tp) over the number of true positives
(Tp) plus the number of false negatives (Fn).
The true negative rate (TNR) is the rate that indicates the recall of the negative items.
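Written as formulas, with Tp, Fp, Fn, and Tn denoting true positives, false positives, false negatives, and true negatives:
Precision: P = Tp / (Tp + Fp)
Recall: R = Tp / (Tp + Fn)
True negative rate: TNR = Tn / (Tn + Fp)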
8.9.3 F1-Measure
The F1-score combines precision and recall as their harmonic mean: F1 = 2 * (P * R) / (P + R).
The following sklearn functions are useful when calculating these measures:
• sklearn.metrics.average_precision_score
• sklearn.metrics.recall_score
• sklearn.metrics.precision_score
• sklearn.metrics.f1_score
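A minimal sketch of how these functions are used on a small, made-up set of true and predicted labels:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]   # predicted labels (hypothetical)
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1-score :', f1_score(y_true, y_pred))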
8.9.4 Receiver Operating Characteristic (ROC) Analysis Curves
You will find the ROC analysis curves useful for evaluating whether your classification or
feature engineering is good enough to determine the value of the insights you are finding. This
helps with repeatable results against a real-world data set. So, if you suggest that your
customers should take a specific action as a result of your findings, ROC analysis curves will
support your advice and insights but also relay the quality of the insights at given parameters.
8.10 Cross-Validation Test
Example:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
import matplotlib.pyplot as plt
digits = datasets.load_digits()
X = digits.data
y = digits.target
Let’s pick three different kernels and compare how they will perform.
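The listing omits the sweep itself. A minimal sketch for a single kernel, assuming an SVC classifier and a logarithmic grid of C values (the same loop can be repeated for 'poly' and 'rbf'), which produces the C_s, scores, and scores_std used by the plotting code below:
kernel = 'linear'
svc = svm.SVC(kernel=kernel)
C_s = np.logspace(-10, 0, 10)
scores = list()
scores_std = list()
for C in C_s:
    # Cross-validate the classifier for each value of the regularization parameter C
    svc.C = C
    this_scores = cross_val_score(svc, X, y, cv=5)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))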
Title="Kernel:>" + kernel
fig=plt.figure(1, figsize=(8, 6))
plt.clf()
fig.suptitle(Title, fontsize=20)
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('Cross-Validation Score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()
8.11 Univariate Analysis
Suppose that the heights of seven students in a class are recorded (as in the figure above); there
is only one variable, height, and it does not deal with any cause or relationship. The
description of patterns found in this type of data can be made by drawing conclusions using
central tendency measures (mean, median and mode), dispersion or spread of data (range,
minimum, maximum, quartiles, variance and standard deviation) and by using frequency
distribution tables, histograms, pie charts, frequency polygon and bar charts.
8.12 Bivariate Analysis
Suppose the temperature and ice cream sales are the two variables of a bivariate data set (as in
the figure above). Here, the relationship is visible from the table: temperature and sales are
directly proportional to each other and thus related because as the temperature increases, the
sales also increase. Thus, bivariate data analysis involves comparisons, relationships, causes
and explanations. These variables are often plotted on X and Y axis on the graph for better
understanding of data and one of these variables is independent while the other is dependent.
8.13 Multivariate Analysis
It is similar to bivariate analysis but contains more than one dependent variable. The way to
perform analysis on this data depends on the goals to be achieved. Some of the techniques are regression
analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
8.14 Linear Regression
Linear regression is often used in business, government, and other scenarios. Some common
practical applications of linear regression in the real world include the following:
• Real estate: A simple linear regression analysis can be used to model residential home prices
as a function of the home's living area. Such a model helps set or evaluate the list price of a
home on the market. The model could be further improved by including other input variables
such as number of bathrooms, number of bedrooms, lot size, school district rankings, crime
statistics, and property taxes
• Demand forecasting: Businesses and governments can use linear regression models to
predict demand for goods and services. For example, restaurant chains can appropriately plan
the quantity of food to prepare, based on expected customer demand.
Before attempting to fit a linear model to observed data, a modeler should first determine
whether or not there is a relationship between the variables of interest. This does not necessarily
imply that one variable causes the other (for example, higher SAT scores do not cause higher
college grades), but that there is some significant association between the two variables. A
scatterplot can be a helpful tool in determining the strength of the relationship between two
variables. If there appears to be no association between the proposed explanatory and
dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends),
then fitting a linear regression model to the data probably will not provide a useful model. A
valuable numerical measure of association between two variables is the correlation coefficient,
which is a value between -1 and 1 indicating the strength of the association of the observed
data for the two variables.
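As a small sketch of the real-estate use case above (the living areas and prices below are made up purely for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical living areas (sq. ft.) and sale prices
area  = np.array([1100, 1400, 1600, 1850, 2100, 2500]).reshape(-1, 1)
price = np.array([199000, 245000, 279000, 310000, 345000, 399000])
model = LinearRegression().fit(area, price)
print('Slope     :', model.coef_[0])
print('Intercept :', model.intercept_)
print('R-squared :', model.score(area, price))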
8.14.2 RANSAC Linear Regression
The process used to determine inliers and outliers is described below.
1. The algorithm randomly selects a subset of samples to be inliers in the model.
2. All data is used to fit the model, and samples that fall within a certain tolerance are relabelled
as inliers.
3. The model is refitted with the new inliers.
4. The error of the fitted model versus the inliers is calculated.
5. Terminate, or go back to step 1 if a given criterion of iterations or performance is not met.
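A compact sketch of this procedure using scikit-learn's RANSACRegressor on synthetic data with a few injected outliers (the data and parameters are illustrative only):
import numpy as np
from sklearn.linear_model import RANSACRegressor
rng = np.random.RandomState(0)
X = np.arange(100).reshape(-1, 1).astype(float)
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 2.0, 100)
y[::10] += 200                      # inject outliers into every tenth sample
ransac = RANSACRegressor()          # uses a linear model as its base estimator by default
ransac.fit(X, y)
print('Estimated slope:', ransac.estimator_.coef_[0])
print('Inliers found  :', ransac.inlier_mask_.sum())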
8.14.3 Hough Transform
With the help of the Hough transform, this regression improves on the resolution of the
RANSAC technique, which is extremely useful in robotics and robot vision, where a robot
requires the regression of the changes between two data frames or data sets in order to move
through an environment.
For example, a logistic regression model can be built to determine if a person will or will not
purchase a new automobile in the next 12 months. The training set could include input variables
for a person's age, income, and gender as well as the age of an existing automobile. The training
set would also include the outcome variable on whether the person purchased a new automobile
over a 12-month period. The logistic regression model provides the likelihood or probability
of a person making a purchase in the next 12 months.
The logistic regression model is applied to a variety of situations in both the public and the
private sector. Some common ways that the logistic regression model is used include the
following:
• Medical: Develop a model to determine the likelihood of a patient's successful response to a
specific medical treatment or procedure. Input variables could include age, weight, blood
pressure, and cholesterol levels.
• Finance: Using a loan applicant's credit history and the details on the loan, determine the
probability that an applicant will default on the loan. Based on the prediction, the loan can be
approved or denied, or the terms can be modified.
• Marketing: Determine a wireless customer's probability of switching carriers (known as
churning) based on age, number of family members on the plan, months remaining on the
existing contract, and similar factors.
In linear regression modelling, the outcome variable is a continuous variable. When the
outcome variable is categorical in nature, logistic regression can be used to predict the
likelihood of an outcome based on the input variables. Although logistic regression can be
applied to an outcome variable that represents multiple values, we will examine the case in
which the outcome variable represents two values, such as true/false, pass/fail, or yes/no.
Simple logistic regression is analogous to linear regression, except that the dependent variable
is nominal, not a measurement. One goal is to see whether the probability of getting a particular
value of the nominal variable is associated with the measurement variable; the other goal is to
predict the probability of getting a particular value of the nominal variable, given the
measurement variable.
Logistic regression is based on the logistic function f(y), given by f(y) = e^y / (1 + e^y) = 1 / (1 + e^(-y)), which takes values between 0 and 1.
For example, you could use ordinal regression to predict the belief that "tax is too high" (your
ordinal dependent variable, measured on a 4-point Likert item from "Strongly Disagree" to
"Strongly Agree"), based on two independent variables: "age" and "income". Alternately, you
could use ordinal regression to determine whether a number of independent variables, such as
"age", "gender", "level of physical activity" (amongst others), predict the ordinal dependent
variable, "obesity", where obesity is measured using three ordered categories: "normal",
"overweight" and "obese".
8.16 Clustering Techniques
Clustering is often used as a lead-in to classification. Once the clusters are identified, labels
can be applied to each cluster to classify each group based on its characteristics. Some specific
applications of clustering are image processing, medical applications, and customer segmentation.
• Image Processing: Video is one example of the growing volumes of unstructured data being
collected. Within each frame of a video, k-means analysis can be used to identify objects in the
video. For each frame, the task is to determine which pixels are most similar to each other. The
attributes of each pixel can include brightness, color, and location, the x and y coordinates in
the frame. With security video images, for example, successive frames are examined to
identify any changes to the clusters; such changes may, for instance, indicate unauthorized
access to a facility.
8.16.1 Hierarchical Clustering
For example: All files and folders on our hard disk are organized in a hierarchy.
The algorithm groups similar objects into groups called clusters. The endpoint is a set of
clusters or groups, where each cluster is distinct from each other cluster, and the objects within
each cluster are broadly similar to each other.
8.16.2 Partitional Clustering
Many partitional clustering algorithms try to minimize an objective function. For example, in
K-means and K-medoids the function (also referred to as the distortion function) is:
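The distortion function referred to above is the sum of the squared distances between each point and the centre of its assigned cluster: Distortion = sum over clusters j of [ sum over points x in cluster Cj of dist(x, cj)^2 ]. A minimal K-means sketch on a handful of made-up points (scikit-learn reports exactly this quantity as inertia_):
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])  # toy points
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print('Cluster labels :', km.labels_)
print('Distortion     :', km.inertia_)   # sum of squared distances to the cluster centres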
8.17 ANOVA
The ANOVA test is the initial step in analysing factors that affect a given data set. Once the
test is finished, an analyst performs additional testing on the methodical factors that measurably
contribute to the data set's inconsistency. The analyst utilizes the ANOVA test results in an f-
test to generate additional data that aligns with the proposed regression models. The ANOVA
test allows a comparison of more than two groups at the same time to determine whether a
relationship exists between them.
Example:
A BOGOF (buy-one-get-one-free) campaign is executed on 5 groups of 100 customers each.
Each group is different in terms of its demographic attributes. We would like to determine
whether these five respond differently to the campaign. This would help us optimize the right
campaign for the right demographic group, increase the response rate, and reduce the cost of
the campaign.
There are two types of ANOVA: one-way (or unidirectional) and two-way. One-way or two-
way refers to the number of independent variables in your analysis of variance test. A one-way
ANOVA evaluates the impact of a sole factor on a sole response variable. It determines whether
all the samples are the same. The one-way ANOVA is used to determine whether there are any
statistically significant differences between the means of three or more independent (unrelated)
groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one
independent variable affecting a dependent variable. With a two-way ANOVA, there are two
independents. For example, a two-way ANOVA allows a company to compare worker
productivity based on two independent variables, such as salary and skill set. It is utilized to
observe the interaction between the two factors and tests the effect of two factors at the same
time.
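A minimal one-way ANOVA sketch for the five-group campaign example, using scipy.stats.f_oneway on made-up response values (the group means and spreads are assumptions for illustration):
import numpy as np
from scipy.stats import f_oneway
rng = np.random.RandomState(42)
# Hypothetical response values for five demographic groups of 100 customers each
groups = [rng.normal(loc=m, scale=5, size=100) for m in (20, 22, 19, 25, 21)]
f_stat, p_value = f_oneway(*groups)
print('F =', f_stat, ' p =', p_value)
if p_value < 0.05:
    print('At least one group responds differently to the campaign')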
8.18 Decision Trees
The input values of a decision tree can be categorical or continuous. A decision tree employs a
structure of test points (called nodes) and branches, which represent the decision being made.
A node without further branches is called a leaf node. The leaf nodes return class labels and, in
some implementations, they return the probability scores. A decision tree can be converted into
a set of decision rules. In the following example rule, income and mortgage_amount are input
variables, and the response is the output variable default with a probability score.
Example:
The above figure shows an example of using a decision tree to predict whether customers will
buy a product. The term branch refers to the outcome of a decision and is visualized as a line
connecting two nodes. If a decision is numerical, the "greater than" branch is usually placed on
the right, and the "less than" branch is placed on the left. Depending on the nature of the
variable, one of the branches may need to include an "equal to" component.
Internal nodes are the decision or test points. Each internal node refers to an input variable or
an attribute. The top internal node is called the root. The decision tree in the above figure is a
binary tree in that each internal node has no more than two branches. The branching of a node
is referred to as a split.
The depth of a node is the minimum number of steps required to reach the node from the root.
In above figure for example, nodes Income and Age have a depth of one, and the four nodes
on the bottom of the tree have a depth of two. Leaf nodes are at the end of the last branches on
the tree. They represent class labels—the outcome of all the prior decisions. The path from the
root to a leaf node contains a series of decisions made at various internal nodes.
The decision tree in the above figure shows that females with income less than or equal to
$45,000 and males 40 years old or younger are classified as people who would purchase the
product. In traversing this tree, age does not matter for females, and income does not matter
for males.
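As a sketch of how such a tree can be built in code (the Gender/Income/Age values below are made up, gender is encoded as 0 = female and 1 = male, and the exact splits the algorithm chooses depend on the data supplied):
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
# Columns: gender (0 = female, 1 = male), income (in $1000), age
X = np.array([[0, 30, 50], [0, 40, 30], [0, 60, 30], [0, 80, 55],
              [1, 30, 30], [1, 80, 35], [1, 40, 50], [1, 90, 60]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # 1 = buys the product (hypothetical)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=['Gender', 'Income', 'Age']))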
Model Questions:
1. Explain the transform superstep.
2. Explain the Sun model for TPOLE.
3. Explain Person-to-Time Sun Model.
4. Explain Person-to-Object Sun Model.
5. Why does data have missing values? Why do missing values need treatment? What
methods treat missing values?
6. What is feature engineering? What are the common feature extraction techniques?
7. What is Binning? Explain with example.
8. Explain averaging and Latent Dirichlet Allocation with respect to the transform step
of data science.
9. Explain hypothesis testing, t-test and chi-square test with respect to data science.
10. Explain over fitting and underfitting. Discuss the common fitting issues.
11. Explain precision recall, precision recall curve, sensitivity, specificity and F1
measure.
12. Explain Univariate Analysis.
13. Explain Bivariate Analysis.
14. What is Linear Regression? Give some common application of linear regression in the
real world.
15. What is Simple Linear Regression? Explain.
16. Write a note on RANSAC Linear Regression.
17. Write a note on Logistic Regression.
18. Write a note on Simple Logistic Regression.
19. Write a note on Multinomial Logistic Regression.
20. Write a note on Ordinal Logistic Regression.
21. Explain clustering techniques.
22. Explain Receiver Operating Characteristic (ROC) Analysis Curves and cross
validation test.
23. Write a note on ANOVA.
24. Write a note on Decision Trees.