8th Semester Report CRB ZIGRAM
Faculty Internship Guide: Mr. Yogesh Bharadwaj Industry Guide: Mr. Ankit Gadi
Designation: Assistant Professor Designation: Data Scientist
Department of Computer Science Engineering ZIGRAM
I hereby declare that the project work presented in this report, titled “Cannabis Related
Businesses”, is an authentic record of my own work carried out at “ZIGRAM” in fulfilment of
the six-month project semester requirement for the award of the degree of B.Tech. in “Computer
Science & Engineering”, JECRC University, under the guidance of “Mr. Yogesh Bhardwaj” and
“Mr. Ankit Gadi”, during the final semester of 2020.
Signature of student
Certified that the above statement made by the student is correct to the best of our knowledge
Signature
Abhishek Bali
CEO, ZIGRAM
ii
Acknowledgement
I would like to acknowledge my debt to each and every person associated with the development
of this project. The project demanded huge commitment from everyone involved, and I am
thankful for the patience with which my guides stood by me through all my queries and problems
till the end. The project is the result of the efforts of all the people associated with it, directly or
indirectly, who helped me complete it successfully. Without their support and encouragement the
effort would have been futile; it was their constant morale-building, diligence and hard work that
enabled the project to materialize and contributed to its success. I would like to express my
appreciation to all the people who have contributed.
iv
December 15, 2019
Zigram Data Technologies Private Limited
Gurgaon, India
2. Date of Joining: Your internship will begin effective from January 6, 2020 or any other date
mutually agreed upon.
3. Location: Your current place of work is Gurgaon. Your place of work may be changed to
any other location within India or overseas as directed by the Company from time to time.
4. Period of Internship: The Internship period being offered is for a period of 6 months from the
date of your joining the company, extendable by 1 month in case of issues of performance or
deliverables.
5. Remuneration: The internship will carry a stipend of INR 10,000 per month. The company
will not be responsible for any taxation, accounting or other associated terms / policies and you
will not be eligible for any taxable and non-taxable allowances and benefits, and other
payments, payables or bonuses.
6. Project & Deliverables: You will be given an appropriate project title, scope and deliverables.
These will be tracked on a regular basis and deliverables assessed / defined monthly. In addition,
there will be a mid-term assessment of your work to assess performance, output and appropriate
steps of development. Deliverables, action items and requirements may change or be considered
dynamic or be prioritized based on new or emergent requirements or concerns.
7. Reporting: Your reporting within the organization may be towards two different leaders i.e.
Admin & Project Supervisor. They will be formally introduced to you on the day of joining.
8. Obligations of Tax: Any amount Payable by the Company to you towards compensation,
allowances and/or other payments shall be subject to deduction of withholding taxes and/or
any other taxes under applicable law. All requirements under the applicable tax laws in India
and outside India, including tax compliance and filing of tax returns, assessments etc. of your
personal income shall be fulfilled by you.
10. Whole time and attention: During your internship with the Company, you shall devote your
best efforts to promote the company's business and may not, without prior written consent of
the company, engage or be interested (directly or indirectly) in any other business or
employment.
11. Leave: In addition to public holidays reserved by the company, you shall also be entitled to
leaves as per the company leave policy and prorated from your date of joining. Further there
are other leave entitlements, details of which will be made available at the time of joining.
These are governed by the company's personnel policy, as applicable and can be modified from
time to time.
12. Termination: You may terminate this internship by giving 45 days of notice in writing or by
paying the equivalent of the TFC amount in lieu thereof. The company reserves the right to
terminate your services without any compensation or notice thereof, if you are found to be in
moral breach of your responsibilities, or following a code of conduct, not in line with the
company's guidelines and values, or if any information provided by you during the course of
your interview or otherwise, is at any time, found to be either wrong or not disclosed, or if you
are in contravention of the terms of this letter.
13. Disclosure: You shall disclose to the company your business interests, whether or not they are
similar to or in conflict with the businesses or activities of the company, and all circumstances
in respect thereof and whether there is or might be a conflict of interest between the company
and you.
14. Company Policies: You will be covered by the company’s policies, as applicable at your
level. The company reserves the right to amend the policies from time to time.
15. Return of Company property: You shall promptly upon request by the company and in any
event upon the termination of your internship deliver to the company all list of clients or
customers, correspondence and all other documents, papers and records in whatever form,
codes and technology related items, including but not limited to electronically held data
containing or referring to any trade secrets or confidential information concerning the business
of the company which may have been prepared by you or come into your position, custody or
control in the course of your internship (including any prior employment with the company).
You shall not keep copies of these items.
16. Compliance Rules: You shall be subject to compliance rules as determined by the company
from time to time or as may be imposed by any regulatory body. It is your responsibility to
ensure that you are aware of the compliance rules in force from time to time and that you adhere
to them. From time to time the company may require that you sign undertakings that you will
abide by the then existing rules and regulations.
In the event of your background verification report being unsatisfactory to the company, the company
reserves the right to revoke your offer of internship or terminate your internship in the event of you
having commenced your internship with the company prior to receiving your verification report.
Letter of Experience
15 June 2020
To whom it may concern
This letter is to certify that Abhishek Kumar Agarwal has successfully completed the internship program with
Zigram Data Technologies Private Limited. His internship tenure was from January 06, 2020 to June 08,
2020. Abhishek was actively and diligently involved in the projects and tasks assigned to him.
During the span, we found him to be a punctual and hardworking person. He was able to demonstrate his
knowledge by practically leveraging it in various aspects of the business.
Sincerely,
Abhishek Bali
CEO, ZIGRAM
Declaration……………………………………………………………………………………………..……..i
Acknowledgement………………………………………………………………………………………....ii
Offer Letter………………………………………………………………………………………………..iii
Joining Report……………………………………………………………………………………………vii
Completion Certificate……………………………………………………………………….................viii
List of Figures………………………………………………………………………………………….....xi
Abstract..……………………………………………………………………………………………...…xiii
Company Profile…………………………………………………………………………………………...1
Introduction To CRB……………………………………………………………………………………...4
Projects Assigned…………………………………………………………………………………..............8
1. Methodology……..…………………………………………………………………………....………...9
1.1 DataFrame…………..…………………………………………………………………………………9
1.2 Microsoft SQL Server……………………..…………………………………………………………10
1.3 Power BI………………………………………………………………………………………............14
1.4 Amazon s3 Bucket..…………………………………………………………………………………...15
1.5 Google Cloud Platform……………...………………………………………………………………..17
1.6 Amazon Linux….………………………………………….………………………………………….22
1.7 Google Sheets…………………………………………………………………………………………24
1.8 Pandas……………………….......……………...…………………………………………………….26
2. Project 1 – CRB Error Analysis...…………………………………………………………………….28
Executive summary………………………………………………………………………………............28
Solution …………………………………………………………………………………………………...29
Methodology adopted for solution……………………………………………………………………….30
Output sample ………………………………………………………………………………………........32
3. Project 2 – CRB Completion Rates.......……………………………………………………………....34
Executive Summary ……………………………………………………………………………………...34
Solution …………………………………………………………………………………………………...35
Methodology adopted for solution……………………………………………………………………….36
Fields for changes in Completion rate tab………………………………………………………………43
Order of Script Run………………………………………………………………………………………48
Output samples…………………………………………………………………………………………...49
x
4. Project -3 Work Done Analysis……………………………………………………………………….59
Executive Summary ……………………………………………………………………………………...59
Solution …………………………………………………………………………………………………...60
Methodology adopted for solution ………………………………………………………………………61
Output Sample……………………………………………………………………………………………67
5. Project -4 License Changes……………………………………………………………………………68
Executive Summary ……………………………………………………………………………………...68
Solution …………………………………………………………………………………………………...69
Methodology adopted for solution………………………………………………………………………70
Proposed Output …………………………………………………………………………………………72
Delivered Output…………………………………………………………………………………………72
6. Other Tasks Completed………………………………....…………………………………………….74
7. Conclusion………....…………………………………………………………………………………...75
8. Bibliography and References………………………………………………………………………….76
xi
List of Figures
xii
Figure 3.24 License Data Fields(Quarterly Filter)(Individual filter)............................................. 58
Figure 4.1 CRB Work Analysis- Supervisor Input Page.................................................................61
Figure 4.2 CRB Work Analysis- Analyst Input Page……............................................................ 61
Figure 4.3 CRB Work Allotted Output Email................................................................................ 62
Figure 4.4 CRB Work Done Output Table.................................................................................... 63
Figure 4.5 CRB Work Done Output Email Head........................................................................... 64
Figure 4.6 CRB Work Done Output Email Tail............................................................................. 64
Figure 4.7 CRB Weekly Work Done Output Activity Wise…...................................................... 65
Figure 4.8 CRB Weekly Work Done Output Researcher Activity Wise……................................ 66
Figure 4.9 CRB Weekly Work Done Output Email………………………………..…................. 67
Figure 5.1 CRB License Changes Tab Proposed Output………………..……..…........................ 72
Figure 5.2 CRB License Changes Tab Delivered Output snippet 1…………………………....... 72
Figure 5.3 CRB License Changes Tab Delivered Output snippet 2…………………………..…..73
xiii
Abstract
Cannabis has a huge market in countries where it has been legalised, but it is also open to abuse,
and a large number of people and businesses are connected to it in one way or another. The CRB
project maintains a huge database of people and businesses holding cannabis licenses, captured
in minute detail. The details are routinely updated and new records are continuously added to
keep the database up to date. The database is presented through a dashboard built on Power BI.
This dashboard is designed for the project’s clients, who use it to follow the progress of the database.
The dashboard presently has 7 tabs. Another tab is required to showcase the changes in the licenses
to the clients; it should express all the relevant information and insights effectively. This new tab
is named ‘License Changes’.
Further, a new tab is required to show the completion rates of all the important fields of the database,
which help in the in-depth analysis of each entity. The Completion Rates tab would also give
a comparative analysis of the work done by the CRB team across timelines such as weekly, monthly
and quarterly.
There is also a requirement for error analysis on the existing dashboard page, identifying the
common errors that currently exist in the database along with their solutions, so that the person
rectifying the errors in the database only has to replace the erroneous value with the right one.
The clients also need a thorough analysis of the work done by the CRB team. This needs to
be delivered in a structured data format which is readily available for analysis.
During the internship tenure I solved all these issues; the solutions are deployed on an
Amazon Linux instance as fully automated scripts scheduled to run automatically.
xiv
Company Profile
ZIGRAM is a high impact organization which operates in the Data Asset space. The team is
made up of professionals from varied domains like data science, technology, sales, financial
services, research and business consulting. The aim is to deliver value to clients by Building and
Managing Data Assets across use cases - thereby boosting revenues and reducing the cost of
doing business, in a data driven world.
ZIGRAM
Z – Zeno - Founder of the Stoic school of philosophy
I - Issac Asimov – Finest writer & creator of Science Fiction
G - Sir Francis Galton - One of the greatest polymaths of our time
R - S. Ramanujan - Famous mathematician with almost no formal training
A – Augustus - Founder of the Roman Empire and Pax Romana
M - John McCarthy - Computer scientist known as the 'Father of AI'
What is a Data Asset?
A Data Asset is a structured, comprehensive and validated database of information which has
been built for a specific use case and in response to a problem.
Build For Purpose - Purpose-built Data Assets are designed to meet specific business
requirements as defined by the customer & use cases, oriented to solve specific problems.
Comprehensive - Data Assets which include all the data points which are necessary,
relevant to the specific use case.
Validated - Data Assets which have been created from valid sources which are relevant,
up to date and can be audited.
Structured - Data Assets which are constructed and designed according to a defined plan
in order to address a specific use case.
ZIGRAM’s Work
Data Applications
Applications created by using multiple technologies including automation, analytics, machine
learning and AI to help build Data Assets –Faster, Cheaper and Better
Data Products
Data Asset products built in-house, either in partnership with other players or wholly managed,
with subscription, application or API based access - Solutions For Specific Use-cases
Data Services
Deploying experienced resources, subject matter experts and specialists to execute projects and
operations across the Data Asset lifecycle –From Conceptualization To Delivery
1
ZIGRAM’s Expertise
Operations
Conducting Research, Developing Processes in order to build a validated, structured and
dependable Data Asset.
Core Data:
Profile Development
Online Research
Data Projects
Data Asset framework
Remediation
Enhancement
Enrichment
Maintenance
Data Science
Using Scientific Methods, Processes, Algorithms and Systems to extract knowledge and
insights from structured and unstructured data.
Insights & Efficiency:
Analytics
Machine Learning
NLP
Deep Learning
Statistical Learning
Hypothesis Testing
Data Wrangling
Predictive Modelling
Technology
Solutions for Data Management, Development and Products or services that are based
on Data generated by both humans and machines.
Delivery & Development
Automation
Application Development
Cloud Technology
APIs
2
Data Architecture
Extraction/Mining
Scrapers & Crawlers
External Services
Representation
Use of Tools to effectively represent Data in order to build actionable insights and
visually communicate a quantitative message.
Reporting & Showcasing
Dashboards
Visual Analytics
Relationship Maps
Reporting
Infographics
Key Members
Mr. Abhishek Bali – CEO
Mr. Rahul Pagarre – CFO
Mr. Ankit Gadi – Data Scientist (Data Science Team Lead)
Mr. Ritesh Mohan – Engineering Team Lead
Ms. Jyoti Chakrabortty – Operations Team Lead
Mr. Siddharth Sabu - Manager
3
1. Introduction To CRB
The bill, the Marijuana Businesses Access to Banking Act of 2015, provides a safe harbor for
depository institutions providing financial services to a marijuana-related legitimate business,
to the extent that it prohibits a federal banking regulator from:
(1) Terminating or limiting the deposit or share insurance of a depository institution solely
because it provides financial services to a marijuana-related legitimate business; or
(2) Prohibiting, penalizing, or otherwise discouraging a depository institution from offering
such services.
4
Figure - Marijuana Life Cycle Flow Diagram
5
Banking MRBs
Banks or FIs who maintain account relationships with MRBs should enhance policies,
procedures and monitoring controls to:
Identify marijuana-related relationships at account opening
Evaluate and document the potential risks posed by marijuana dispensaries
FIs need to revise their AML programs to address MRBs, including SAR policies and
procedures addressing a three-tiered marijuana-specific reporting approach
Ensure MRB relationships are appropriately considered within the bank’s suspicious
activity monitoring and other applicable reporting systems
Periodically scrub customer base names and addresses against a listing of approved
marijuana dispensaries’ names, owners and addresses to identify any potential unknown
MRB accounts
Update the AML/BSA training program
In assessing the risk of providing services to a marijuana-related business, an FI should
conduct customer due diligence that includes:
(i) Verifying with the appropriate state authorities whether the business is duly licensed and
registered
(ii) Reviewing the license application (and related documentation) submitted by the business
for obtaining a state license to operate its marijuana-related business
(iii) Requesting from state licensing and enforcement authorities available information about
the business and related parties
(iv) Developing an understanding of the normal and expected activity for the business,
including the types of products to be sold and the type of customers to be served (e.g.,
medical versus recreational customers)
(v) Ongoing monitoring of publicly available sources for adverse information about the
business and related parties
(vi) Ongoing monitoring for suspicious activity, including for any of the red flags described
in this guidance
(vii) Refreshing information obtained as part of customer due diligence on a periodic basis
and commensurate with the risk.
6
What does MRB Monitor do?
MRB Monitor helps financial institutions ("FIs") mitigate regulatory, reputational and
financial risk related to the marijuana industry with its industry-leading data and subject
matter expertise
As the only data vendor solely focused on the marijuana industry, MRB Monitor has the
largest and most comprehensive database of marijuana-related businesses ("MRBs") and
beneficial owners (“BOs”) for use in relationship screening and due diligence
MRB Monitor is also helping to define the terminology and risk framework utilized by
FIs when developing marijuana-related policies and procedures
7
Projects Assigned:
1. Project 1 - The MRB database has very specific standards for the input data format and
its update process. However, the input is entered by humans, so mistakes are bound to
happen. The work done by anyone is always reviewed by a checker, but even then some
mistakes go unnoticed. Identify the most common mistakes in the database. Create a
script in Python using pandas DataFrames so that the most common mistakes are found
by applying specific checks to the database. Use the CRB extract as input to the DataFrames.
Also log process errors.
2. Project 2 - The CRB dashboard is built using Power BI. Use the S3 bucket, the SQL Server
database and the CRB extracts as input and create a Completion Rates tab for the CRB
dashboard, which will be used by clients to see insights related to the completion of fields.
Reflect the completion rates starting from 11th July 2019. Also log process errors.
3. Project 3 - The CRB team does a lot of work which needs to be quantified. Prepare a
Google Sheet which can be used to analyse the work done by the CRB team. Create
automated scripts to transfer the data to SQL Server every day, trigger the scripts, empty
the sheets every day and generate daily and weekly reports. Also log process errors.
4. Project 4 - The CRB dashboard is built using Power BI. Use the S3 bucket logins as input
and create a License Changes tab for the CRB dashboard, which will be used by clients to
see insights related to the changes being made in the MRB database. Use Python, SQL
Server and Power BI to achieve the target. Reflect the changes in the dashboard from
1st October 2019. Also log process errors.
8
Methodology
1.1 DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the
most commonly used pandas object. Like Series, DataFrame accepts many different kinds of
input: a dict of 1D ndarrays, lists, dicts or Series; a 2-D NumPy ndarray; a structured or record
ndarray; a Series; or another DataFrame.
If axis labels are not passed, they will be constructed from the input data based on common sense
rules.
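As a simple illustration of the inputs described above, the following sketch constructs a DataFrame
from a dict of Series and from a dict of lists; the column names and values are made up for
illustration and are not taken from the CRB data.

import pandas as pd

# A dict of Series: keys become column labels and the union of the Series
# indexes becomes the row labels.
data = {
    "licenses": pd.Series([120, 85, 40], index=["CA", "CO", "WA"]),
    "entities": pd.Series([300, 150], index=["CA", "CO"]),
}
df = pd.DataFrame(data)   # row labels are built from the Series indexes
print(df)                 # the missing "WA"/"entities" combination appears as NaN

# The same idea with a dict of lists; row labels default to a RangeIndex 0..n-1.
df2 = pd.DataFrame({"state": ["CA", "CO", "WA"], "licenses": [120, 85, 40]})
print(df2.dtypes)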
9
1.2 Microsoft SQL Server
Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different
audiences and for workloads ranging from small single-machine applications to large Internet-
facing applications with many concurrent users.
Support tools include SQL Server Profiler, BI tools, SQL Server Management Studio, and
Database Tuning Advisor.
Provides advanced customization choices for datatype mappings and for deleting and
renaming objects.
Displays error and warning messages about the migration in an advanced window.
A single, integrated environment for SQL Server Database Engine management and
authoring.
10
From SQL Server 2016 onward, the product is supported on x64 processors only.
The current version is Microsoft SQL Server 2019, released November 4, 2019.
Editions
Microsoft makes SQL Server available in multiple editions, with different feature sets and
targeting different users. These editions are:
Mainstream editions
1. Enterprise
SQL Server Enterprise Edition includes both the core database engine and add-on services, with
a range of tools for creating and managing a SQL Server cluster. It can manage databases as
large as 524 petabytes and address 12 terabytes of memory and supports 640 logical processors
(CPU cores).
2. Standard
SQL Server Standard edition includes the core database engine, along with the stand-alone
services. It differs from Enterprise edition in that it supports fewer active instances (number of
nodes in a cluster) and does not include some high-availability functions such as hot-add
memory (allowing memory to be added while the server is still running), and parallel indexes.
3. Web
4. Business Intelligence
Introduced in SQL Server 2012 and focusing on Self Service and Corporate Business
Intelligence. It includes the Standard Edition capabilities and Business Intelligence tools:
PowerPivot, Power View, the BI Semantic Model, Master Data Services, Data Quality Services
and xVelocity in-memory analytics.
5. Workgroup
11
SQL Server Workgroup Edition includes the core database functionality but does not include the
additional services. Note that this edition has been retired in SQL Server 2012.
6. Express
SQL Server Express Edition is a scaled down, free edition of SQL Server, which includes the
core database engine. While there are no limitations on the number of databases or users
supported, it is limited to using one processor, 1 GB memory and 10 GB database files (4 GB
database files prior to SQL Server Express 2008 R2). It is intended as a replacement for MSDE.
Two additional editions provide a superset of features not in the original Express Edition. The
first is SQL Server Express with Tools, which includes SQL Server Management Studio Basic.
SQL Server Express with Advanced Services adds full-text search capability and reporting
services.
Visual Studio
Microsoft Visual Studio includes native support for data programming with Microsoft SQL
Server. It can be used to write and debug code to be executed by SQL CLR. It also includes a
data designer that can be used to graphically create, view or edit database schemas. Queries can
be created either visually or using code. SSMS 2008 onwards provides IntelliSense for SQL
queries as well.
SQL Server Management Studio is a GUI tool included with SQL Server 2005 and later for
configuring, managing, and administering all components within Microsoft SQL Server. The
tool includes both script editors and graphical tools that work with objects and features of the
server. SQL Server Management Studio replaces Enterprise Manager as the primary
management interface for Microsoft SQL Server since SQL Server 2005. A version of SQL
Server Management Studio is also available for SQL Server Express Edition, for which it is
known as SQL Server Management Studio Express (SSMSE).
A central feature of SQL Server Management Studio is the Object Explorer, which allows the
user to browse, select, and act upon any of the objects within the server. It can be used to visually
observe and analyze query plans and optimize the database performance, among others. SQL
Server Management Studio can also be used to create a new database, alter any existing database
12
schema by adding or modifying tables and indexes, or analyze performance. It includes the query
windows which provide a GUI based interface to write and execute queries.
SQL Server Operations Studio (Preview) is a cross platform query editor available as an optional
download. The tool allows users to write queries; export query results; commit SQL scripts to
Git repositories and perform basic server diagnostics. SQL Server Operations Studio supports
Windows, Mac and Linux systems.
It was released to General Availability in September 2018, at which point it was also renamed
to Azure Data Studio. The functionality remains the same as before.
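In the projects described later, tables such as t_completion_rate and u_completion_rate are read
from and written to SQL Server from Python. The sketch below shows how such a connection
might look using pandas with SQLAlchemy; the server name, database, credentials and driver
string are placeholders, not the actual project configuration.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- server, database and credentials are illustrative.
engine = create_engine(
    "mssql+pyodbc://user:password@my-sql-server/crb_db"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Read an existing table into a DataFrame.
rates = pd.read_sql("SELECT TOP 100 * FROM t_completion_rate", engine)

# Append rows from a DataFrame to another table.
rates.to_sql("u_completion_rate", engine, if_exists="append", index=False)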
13
1.3 Power BI
Key Components
Power BI Desktop
The Windows-desktop-based application for PCs and desktops, primarily for designing
and publishing reports to the Service.
Power BI Service
The SaaS (software as a service) based online service, formerly known as Power BI for
Office 365 and now referred to as PowerBI.com or simply Power BI.
Power BI Mobile Apps
The Power BI Mobile apps for Android and iOS devices, as well as for Windows phones
and tablets.
Power BI Gateway
Gateways used to sync external data in and out of Power BI. In Enterprise mode, can
also be used by Flows and PowerApps in Office 365.
Power BI Embedded
The Power BI REST API can be used to build dashboards and reports into custom
applications that serve both Power BI users and non-Power BI users.
Power BI Report Server
An On-Premises Power BI Reporting solution for companies that won't or can't store data
in the cloud-based Power BI Service.
Power BI Visuals Marketplace
A marketplace of custom visuals and R-powered visuals.
14
1.4 Amazon S3 Bucket
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services
(AWS) that provides object storage through a web service interface. Amazon S3 uses the same
scalable storage infrastructure that Amazon.com uses to run its global e-commerce network.
Amazon S3 can be employed to store any type of object which allows for uses like storage for
Internet applications, backup and recovery, disaster recovery, data archives, data lakes for
analytics, and hybrid cloud storage. AWS launched Amazon S3 in the United States on March
14, 2006, then in Europe in November 2007.
Although Amazon Web Services (AWS) does not publicly provide the details of S3's technical
design, Amazon S3 manages data with an object storage architecture which aims to provide
scalability, high availability, and low latency with 99.999999999% durability and between
99.95% to 99.99% availability (though there is no service-level agreement for durability).
The basic storage units of Amazon S3 are objects which are organized into buckets. Each object
is identified by a unique, user-assigned key. Buckets can be managed using either the console
provided by Amazon S3, programmatically using the AWS SDK, or with the Amazon S3 REST
application programming interface (API). Objects can be managed using the AWS SDK or with
the Amazon S3 REST API and can be up to five terabytes in size with two kilobytes of metadata.
Additionally, objects can be downloaded using the HTTP GET interface.
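The CRB extracts used in the projects below are downloaded from an S3 bucket, and zipped
backups are uploaded back to it. A minimal sketch using boto3 is shown here; the bucket and key
names are placeholders, not the project's actual bucket layout.

import boto3

s3 = boto3.client("s3")   # credentials are picked up from the environment / AWS config

# Download the latest CRB extract (placeholder bucket and key names).
s3.download_file("crb-extracts-bucket", "extracts/crb_extract_latest.xlsx",
                 "crb_extract_latest.xlsx")

# Upload a compressed grouped-sum backup for safekeeping (placeholder key).
s3.upload_file("grouped_sum_2020-06-05.zip", "crb-extracts-bucket",
               "backups/grouped_sum_2020-06-05.zip")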
Notable users
Photo hosting service SmugMug has used Amazon S3 since April 2006. They
experienced a number of initial outages and slowdowns, but after one year they described
it as being "considerably more reliable than our own internal storage" and claimed to
have saved almost $1 million in storage costs.
Netflix uses Amazon S3 as their system of record. Netflix implemented a tool, S3mper,
to address the Amazon S3 limitations of eventual consistency. S3mper stores the file
system metadata: filenames, directory structure, and permissions in Amazon
DynamoDB.
reddit is hosted on Amazon S3.
Bitcasa, and Tahoe-LAFS-on-S3, among others, use Amazon S3 for online backup and
synchronization services. In 2016, Dropbox stopped using Amazon S3 services and
developed its own cloud server.
15
Mojang hosts Minecraft game updates and player skins on Amazon S3.
Tumblr, Formspring, and Pinterest host images on Amazon S3.
Swiftype's CEO has mentioned that the company uses Amazon S3.
Amazon S3 was used by some enterprises as a long term archiving solution until Amazon
Glacier was released in August 2012.
The API has become a popular method to store objects. As a result, many applications
have been built to natively support the Amazon S3 API which includes applications that
write data to Amazon S3 and Amazon S3-compatible object stores
16
1.5 Google Cloud Platform
Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that
runs on the same infrastructure that Google uses internally for its end-user products, such
as Google Search, Gmail and YouTube. Alongside a set of management tools, it provides a
series of modular cloud services including computing, data storage, data analytics and machine
learning. Registration requires a credit card or bank account details.
Google Cloud Platform provides infrastructure as a service, platform as a service, and serverless
computing environments.
In April 2008, Google announced App Engine, a platform for developing and hosting web
applications in Google-managed data centers, which was the first cloud computing service from
the company. The service became generally available in November 2011. Since the
announcement of the App Engine, Google added multiple cloud services to the platform.
Google Cloud Platform is a part of Google Cloud, which includes the Google Cloud Platform
public cloud infrastructure, as well as G Suite, enterprise versions of Android and Chrome OS,
and application programming interfaces (APIs) for machine learning and enterprise mapping
services.
Products
Compute
App Engine - Platform as a Service to deploy Java, PHP, Node.js, Python, C#, .Net, Ruby
and Go applications.
Compute Engine - Infrastructure as a Service to run Microsoft Windows and Linux
virtual machines.
Kubernetes Engine (GKE) or GKE on-prem offered as part of Anthos platform -
Containers as a Service based on Kubernetes.
Cloud Functions - Functions as a Service to run event-driven code written in Node.js,
Python or Go.
Cloud Run - Compute execution environment based on Knative. Offered as Cloud Run
(fully managed) or as Cloud Run for Anthos.
17
Storage & Databases
Cloud Storage - Object storage with integrated edge caching to store unstructured data.
Cloud SQL - Database as a Service based on MySQL and PostgreSQL.
Cloud Bigtable - Managed NoSQL database service.
Cloud Spanner - Horizontally scalable, strongly consistent, relational database service.
Cloud Datastore - NoSQL database for web and mobile applications.
Persistent Disk - Block storage for Compute Engine virtual machines.
Cloud MemoryStore - Managed in-memory data store based on Redis.
Local SSD: High performance, transient, local block storage.
Filestore: High performance file storage for Google Cloud users.
Networking
VPC - Virtual Private Cloud for managing the software defined network of cloud
resources.
Cloud Load Balancing - Software-defined, managed service for load balancing the
traffic.
Cloud Armor - Web application firewall to protect workloads from DDoS attacks.
Cloud CDN - Content Delivery Network based on Google's globally distributed edge
points of presence.
Cloud Interconnect - Service to connect a data center with Google Cloud Platform
Cloud DNS - Managed, authoritative DNS service running on the same infrastructure as
Google.
Network Service Tiers - Option to choose Premium vs Standard network tier for higher-
performing network.
18
Big Data
Cloud AI
Cloud AutoML - Service to train and deploy custom machine learning models. As of
September 2018, the service is in Beta.
Cloud TPU - Accelerators used by Google to train machine learning models.
Cloud Machine Learning Engine - Managed service for training and building machine
learning models based on mainstream frameworks.
Cloud Job Discovery - Service based on Google's search and machine learning
capabilities for the recruiting ecosystem.
Dialogflow Enterprise - Development environment based on Google's machine learning
for building conversational interfaces.
Cloud Natural Language - Text analysis service based on Google Deep Learning models.
Cloud Speech-to-Text - Speech to text conversion service based on machine learning.
Cloud Text-to-Speech - Text to speech conversion service based on machine learning.
Cloud Translation API - Service to dynamically translate between thousands of available
language pairs
Cloud Vision API - Image analysis service based on machine learning
Cloud Video Intelligence - Video analysis service based on machine learning
19
Management Tools
Cloud Identity - Single sign-on (SSO) service based on SAML 2.0 and OpenID.
Cloud IAM - Identity & Access Management (IAM) service for defining policies based
on role-based access control.
Cloud Identity-Aware Proxy - Service to control access to cloud applications running on
Google Cloud Platform without using a VPN.
Cloud Data Loss Prevention API - Service to automatically discover, classify, and redact
sensitive data.
Security Key Enforcement - Two-step verification service based on a security key.
Cloud Key Management Service - Cloud-hosted key management service integrated
with IAM and audit logging.
Cloud Resource Manager - Service to manage resources by project, folder, and
organization based on the hierarchy.
Cloud Security Command Center - Security and data risk platform for data and services
running in Google Cloud Platform.
Cloud Security Scanner - Automated vulnerability scanning service for applications
deployed in App Engine.
Access Transparency - Near real-time audit logs providing visibility to Google Cloud
Platform administrators.
VPC Service Controls - Service to manage security perimeters for sensitive data in
Google Cloud Platform services.
20
IoT
Cloud IoT Core - Secure device connection and management service for Internet of
Things.
Edge TPU - Purpose-built ASIC designed to run inference at the edge. As of September
2018, this product is in private beta.
Cloud IoT Edge - Brings AI to the edge computing layer.
API Platform
Maps Platform - APIs for maps, routes, and places based on Google Maps.
Apigee API Platform - Lifecycle management platform to design, secure, deploy,
monitor, and scale APIs.
API Monetization - Tool for API providers to create revenue models, reports, payment
gateways, and developer portal integrations.
Developer Portal - Self-service platform for developers to publish and manage APIs.
API Analytics - Service to analyze API-driven programs through monitoring, measuring,
and managing APIs.
Apigee Sense - Enables API security by identifying and alerting administrators to
suspicious API behaviors.
Cloud Endpoints - An NGINX-based proxy to deploy and manage APIs.
21
1.6 Amazon Linux
An Amazon Machine Image (AMI) is a special type of virtual appliance that is used to create a
virtual machine within the Amazon Elastic Compute Cloud ("EC2"). It serves as the basic unit
of deployment for services delivered using EC2. Like all virtual appliances, the main component
of an AMI is a read-only filesystem image that includes an operating system (e.g., Linux, Unix,
or Windows) and any additional software required to deliver a service or a portion of it.
An AMI includes the following:
A template for the root volume for the instance (for example, an operating system, an
application server, and applications)
Launch permissions that control which AWS accounts can use the AMI to launch
instances
A block device mapping that specifies the volumes to attach to the instance when it's
launched
The AMI filesystem is compressed, encrypted, signed, split into a series of 10 MB chunks and
uploaded into Amazon S3 for storage. An XML manifest file stores information about the AMI,
including name, version, architecture, default kernel id, decryption key and digests for all of the
filesystem chunks.
An AMI does not include a kernel image, only a pointer to the default kernel id, which can be
chosen from an approved list of safe kernels maintained by Amazon and its partners (e.g., Red
Hat, Canonical, Microsoft). Users may choose kernels other than the default when booting an
AMI.
When it launched in August 2006, the EC2 service offered Linux and later Sun Microsystems'
OpenSolaris and Solaris Express Community Edition. In October 2008, EC2 added the Windows
Server 2003 and Windows Server 2008 operating systems to the list of available operating
systems. As of December 2010, it has also been reported to run FreeBSD; in March 2011,
NetBSD AMIs became available. In November 2012, Windows Server 2012 support was added.
Amazon has its own Linux distribution that is largely binary compatible with Red Hat Enterprise
Linux, and therefore CentOS. This offering has been in production since September 2011, and
in development since 2010. The final release of the original Amazon Linux is version 2018.03
22
and uses version 4.14 of the Linux kernel. Amazon Linux 2 was announced in June 2018, and is
updated on a regular basis
Types of images
23
1.7 Google Sheets
Google Sheets is a spreadsheet program included as part of a free, web-based software office
suite offered by Google within its Google Drive service. The service also includes Google Docs
and Google Slides, a word processor and presentation program respectively. Google Sheets is
available as a web application, mobile app for Android, iOS, Windows, BlackBerry, and as a
desktop application on Google's ChromeOS. The app is compatible with Microsoft Excel file
formats. The app allows users to create and edit files online while collaborating with other users
in real-time. Edits are tracked by user with a revision history presenting changes. An editor's
position is highlighted with an editor-specific color and cursor and a permissions system
regulates what users can do. Updates have introduced features using machine learning, including
"Explore", offering answers based on natural language questions in a spreadsheet.
Google Sheets is available as a web application supported on Google Chrome, Mozilla Firefox,
Internet Explorer, Microsoft Edge, and Apple Safari web browsers. Users can access all
spreadsheets, among other files, collectively through the Google Drive website. In June 2014,
Google rolled out a dedicated website homepage for Sheets that contain only files created with
Sheets. In 2014, Google launched a dedicated mobile app for Sheets on the Android and iOS
mobile operating systems. In 2015, the mobile website for Sheets was updated with a "simpler,
more uniform" interface, and while users can read spreadsheets through the mobile websites,
users trying to edit will be redirected towards the mobile app to eliminate editing on the mobile
web.
Google Sheets serves as a collaborative tool for cooperative editing of spreadsheets in real-time.
Documents can be shared, opened, and edited by multiple users simultaneously and users are
able to see character-by-character changes as other collaborators make edits. Changes are
automatically saved to Google's servers, and a revision history is automatically kept so past edits
may be viewed and reverted to. An editor's current position is represented with an editor-specific
color/cursor, so if another editor happens to be viewing that part of the document they can see
edits as they occur. A sidebar chat functionality allows collaborators to discuss edits. The
revision history allows users to see the additions made to a document, with each author
distinguished by color. Only adjacent revisions can be compared, and users cannot control how
frequently revisions are saved. Files can be exported to a user's local computer in a variety of
24
formats such as PDF and Office Open XML. Sheets supports tagging for archival and
organizational purposes.
Other Functionalities
A simple find and replace tool is available. The service includes a web clipboard tool that allows
users to copy and paste content between Google Sheets and Docs, Slides, and Drawings. The
web clipboard can also be used for copying and pasting content between different computers.
Copied items are stored on Google's servers for up to 30 days. Google offers an extension for
the Google Chrome web browser called Office editing for Docs, Sheets and Slides that enables
users to view and edit Microsoft Excel documents on Google Chrome, via the Sheets app. The
extension can be used for opening Excel files stored on the computer using Chrome, as well as
for opening files encountered on the web (in the form of email attachments, web search results,
etc.) without having to download them. The extension is installed on Chrome OS by default. As
of June 2019, this extension is no longer required since the functionality exists natively.
Google Cloud Connect was a plug-in for Microsoft Office 2003, 2007 and 2010 that could
automatically store and synchronize any Excel document to Google Sheets (before the
introduction of Drive). The online copy was automatically updated each time the Microsoft
Excel document was saved. Microsoft Excel documents could be edited offline and synchronized
later when online. Google Cloud Connect maintained previous Microsoft Excel document
versions and allowed multiple users to collaborate by working on the same document at the same
time. However, Google Cloud Connect has been discontinued as of April 30, 2013, as, according
to Google, Google Drive achieves all of the above tasks, "with better results".
While Microsoft Excel retains the 1900 leap year bug, Google Sheets 'fixes' it by shifting
all dates prior to March 1, 1900, so entering "0" and formatting it as a date returns
December 30, 1899. Excel, on the other hand, treats "0" as December 31, 1899, which is
formatted to read January 0, 1900.
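Project 3, described later, reads such a sheet every day before loading the data into SQL Server
and then empties it. The sketch below shows how a Google Sheet might be pulled into a pandas
DataFrame using the gspread library; the credentials file, spreadsheet and worksheet names are
placeholders, not the project's actual names.

import gspread
import pandas as pd

# Authenticate with a service-account JSON key (placeholder file name).
gc = gspread.service_account(filename="service_account.json")

# Open the work-analysis spreadsheet and worksheet (placeholder names).
sheet = gc.open("CRB Work Analysis").worksheet("Analyst Input")

# Pull all rows as dictionaries and load them into a DataFrame.
df = pd.DataFrame(sheet.get_all_records())
print(df.head())

# Clear the worksheet once the data has been transferred, as done daily in Project 3.
sheet.clear()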
25
1.8 Pandas
In computer programming, pandas is a software library written for the Python programming
language for data manipulation and analysis. In particular, it offers data structures and operations
for manipulating numerical tables and time series. It is free software released under the three-
clause BSD license. The name is derived from the term "panel data", an econometrics term for
data sets that include observations over multiple time periods for the same individuals.
Library features
Dataframes
Pandas is mainly used for data analysis. It allows importing data from various file formats
such as comma-separated values, JSON, SQL and Microsoft Excel, and supports various data
manipulation operations such as merging, reshaping and selecting, as well as data cleaning and
data wrangling.
26
History
Developer Wes McKinney started working on pandas in 2008 while at AQR Capital
Management out of the need for a high performance, flexible tool to perform quantitative
analysis on financial data. Before leaving AQR he was able to convince management to allow
him to open source the library.
Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor
to the library.
27
2. Project 1 – CRB Error Analysis
Executive Summary
The CRB database holds a huge amount of data which is added, updated and deleted manually.
Since the data is supplied manually, there are many mistakes which cannot be identified easily.
The common mistakes therefore need to be pointed out along with possible solutions, so that the
person rectifying the errors knows the exact position of each error and its solution. This will help
in saving time and increase productivity.
The script currently checks 16 different types of errors and has been tested and deployed
on an Amazon Linux server. The script is also fully automated, so the user does not even have
to run it or provide any input. It is scheduled to run at 12 PM every Monday and e-mail the
results to the concerned authorities.
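The weekly e-mail step mentioned above can be sketched as follows; the SMTP host, sender,
recipients, credentials and attachment name are placeholder values, not the actual configuration
used by the script.

import smtplib
from email.message import EmailMessage
from pathlib import Path

msg = EmailMessage()
msg["Subject"] = "CRB Error Analysis - weekly results"
msg["From"] = "reports@example.com"            # placeholder sender
msg["To"] = "crb-team@example.com"             # placeholder recipients
msg.set_content("Please find attached this week's CRB error analysis output.")

# Attach the generated error workbook (placeholder file name).
report = Path("business_errors.xlsx")
msg.add_attachment(
    report.read_bytes(),
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename=report.name,
)

with smtplib.SMTP("smtp.example.com", 587) as server:    # placeholder SMTP host
    server.starttls()
    server.login("reports@example.com", "app-password")  # placeholder credentials
    server.send_message(msg)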
28
Solution
The solution lies in the use of pandas DataFrames, which provide multiple functionalities for
manipulating and handling data and make it straightforward to implement the checks in
Python.
The CRB extract has 20 tables with massive amounts of data embedded in one file. The data
is so large that, even on an i7 processor, reading and manipulating it takes about 12 hours,
because all 20 tables are embedded in one Excel file distributed across multiple sheets. The
data will also keep growing in the future, which has to be kept in mind. The student also
needed to work out the error patterns and the final shape of the error table to be created.
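The first step of the methodology that follows is to split the single 20-sheet extract into separate
files. A minimal sketch of how that split might be done with pandas is shown below; the file name
is a placeholder.

import pandas as pd

# Read every sheet of the CRB extract at once (placeholder file name);
# sheet_name=None returns a dict of {sheet_name: DataFrame}.
sheets = pd.read_excel("crb_extract_latest.xlsx", sheet_name=None)

# Save each of the ~20 tables to its own file, named after the sheet, so that
# later steps work on one small file at a time instead of one huge workbook.
for name, frame in sheets.items():
    frame.to_excel(f"{name}.xlsx", index=False)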
29
Methodology adopted for Solution
Separating all 20 tables into their respective Excel files to remove the n^2 complexity of
manipulations done on the data. This is done in Python by reading the extract (as sketched
above) and saving each sheet as a separate file named after the sheet. Now we can move
towards the solution.
The first step of the final solution is to take the date of the CRB extract as input from the
user, so that the logs record the date on which the CRB extract file was generated. Since the
extract is downloaded from the S3 bucket, the extract downloaded is always the latest one;
therefore the backup date entered is the latest date of the extract. The date of creation of the
errors file is also recorded.
All 20 Excel files generated from the CRB extract are then imported into pandas
DataFrames. With the split files, all 20 DataFrames load in about 4 minutes, down from the
original 12 hours.
Now independent functions are defined for identifying errors.
The major errors checked in these functions include
1. Checking if spaces are present at the ends of a string. Also if there are
more than 1 space present at the same place anywhere in the required
columns for checking.
2. Checking if the zip code is present in the right format, given that data from
multiple countries is present in the database.
3. Mapping of states to their respective countries and creating error flags
otherwise.
4. If locality is present in the database then state should be present. If state
is present then country should be present. The vice-versa is not true
always.
5. Checking of single space at specific places
6. Checking the right format of PO Box number.
7. Checking if salutations are present or not in the database
8. Checking the right syntax of phone number
9. Checking right syntax of date format
10. Checking of the right business identification formats
An Excel sheet is then defined containing the list of all the checks on the different columns;
no column is checked outside of this list. This helps a person who does not know the code
understand which checks have been done on the extract.
30
The created functions are then applied, depending on the type of check, to different
columns of the different DataFrames and the results saved, using apply(), lambda functions
or both, while also taking care of the naming convention required in the results (a simplified
sketch of this check-and-unpivot pattern follows this list). The problem of empty columns
was also handled so that it does not create problems in the future.
The output data format is then prepared, deciding which columns are needed and their
placement. The melt function comes in handy to unpivot columns and convert the tables to
the required format.
The problem of empty tables is handled in order to avoid errors while appending files.
The files are appended in such a way that the naming convention holds true and errors do
not get mixed up due to shape issues.
All business error files are concatenated into one DataFrame and all individual error files
into another.
The solution to the space issue is provided right in the error files, so it is easy to update the
database with the right set of values.
The results are sorted by unique keys so that all errors related to one row are clustered
together, for the ease of the person who will remedy the errors in the database.
Using ExcelWriter, the errors are written to the respective Excel files.
The final files are emailed to the concerned authorities for rectification of the errors.
The script has been scheduled on an Amazon Linux server which automatically runs it
every Monday and sends the results directly to the authorities by email. Thus, the script is
fully automated.
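As an illustration of the check-and-unpivot pattern referred to above, the sketch below flags
values containing leading/trailing or repeated spaces in a few columns, melts the result into a
long error table and attaches the corrected value as the suggested fix. The column names, sample
rows and error labels are simplified placeholders, not the exact 16 checks used in the project.

import pandas as pd

def space_error(value):
    # Return an error label if the value has leading/trailing or repeated spaces.
    if not isinstance(value, str):
        return None
    if value != value.strip():
        return "leading/trailing space"
    if "  " in value:
        return "multiple consecutive spaces"
    return None

# Placeholder slice of the business_addresses table.
addresses = pd.DataFrame({
    "unique_id": ["B1", "B2", "B3"],
    "locality": ["Denver ", "Portland", "San  Francisco"],
    "region": ["CO", " OR", "CA"],
})

checked = ["locality", "region"]
flags = addresses[checked].applymap(space_error)   # one flag column per checked column
flags["unique_id"] = addresses["unique_id"]

# Unpivot to one row per (entity, column, error) and keep only rows with an error.
errors = flags.melt(id_vars="unique_id", var_name="attribute", value_name="error")
errors = errors.dropna(subset=["error"]).sort_values("unique_id")

# For space issues the suggested fix is simply the re-spaced original value.
errors["suggested_value"] = errors.apply(
    lambda r: " ".join(
        str(addresses.loc[addresses.unique_id == r.unique_id, r.attribute].iloc[0]).split()
    ),
    axis=1,
)
print(errors)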
31
Output Sample
32
Figure 2.3 Email Output CRB Error Analysis
33
3. Project 2 – CRB Completion Rates
Executive Summary
The aim of this project is to showcase the completion rates of the fields of the database to the
clients. The dashboard tab created should be so communicative that the client gets a concrete
idea about the completion rates of the fields. The dashboard will also contain historic data for
the client's reference, meaning that old data is not lost but retained. The dashboard tab will be
incrementally updated every week.
The dashboard tab will also help the client assess the work done by the team in increasing the
percentage of completion of the fields.
The final delivered dashboard tab runs on 8 scripts which overall execute 48 scripts, all
completely automated, tested and deployed on an Amazon Linux server. All scripts are
scheduled to run on Saturday of every week so that the dashboard can be refreshed on Monday
morning.
In addition, there are multiple checks built into the scripts which act as a failsafe: if anything
goes wrong, the scripts exit. There are also options to roll back to previous data whenever a
script fails, as all previous data is uploaded to the S3 bucket in compressed .parquet format.
34
Solution
The problem is large, and there was no well-defined final solution at the beginning. To start
with, the student had to survey the team leaders of the project for weeks, trying to arrive at
an agreed final output. The company uses the concept of a storyboard, in which the proposed
look of the dashboard tab is presented as a PowerPoint presentation along with the logic to
be used for the manipulations that lead to the solution.
The end goal is a dashboard tab that is simple to understand and very effective in
communicating numbers to the clients of the project. The tab showcases the completion rates
achieved by ZIGRAM from 11th July 2019.
35
Methodology adopted for solution
Separating all 20 tables into their respective Excel files to remove the n^2 complexity of
manipulations done on the data; this is done in Python by reading the extract and saving
each sheet as a separate file named after the sheet (the same split used in Project 1)
Get the extracts of the latest Friday of the latest week and of the previous week
36
Completion snapshot extract making process
6 CRB extracts are downloaded from the S3 bucket for the completion snapshot
These are then split into sheet-specific Excel files and saved in well-defined folders
The backup and master dates are saved in 3 dictionaries in pickle format, to be used later by
the runner scripts
Once the extracts are downloaded, each extract is picked up one by one and all its sheets
are saved as separate Excel files named after the sheets
The CRB base data script is then run; it shifts the current data to backup tables, replaces the
current CRB database with the new tables and adds a row to the table
dump_cycle_index_temp
Figure 3.2 CRB Completion Snapshot Extract Maker Algorithm Flow Diagram
37
Completion Rates Algorithm
The dates pickle is loaded and checked to confirm that it contains the latest Friday's date as
the master date; if not, the script exits
Extract Excel files are picked up one by one; trackers are created and saved as pickles
The Entity_Mapping_Data table is updated with the new entities present in the mastersheets
using an SQL query
The trackers created are concatenated, and constraints are set on the updated_at column so
that the entity-merging issue is taken care of
In the merging issue, when entities are deleted and re-added under another entity, the
updated_at in the trackers takes the created_at of the already present entity, which lies
outside the range of the backup and master dates
To solve this, if updated_at is behind the backup date it is scaled up to the backup date, and
if it is ahead of the master date it is scaled back to the master date. Example - if the backup
date is “2020-06-06” and updated_at in the tracker is “2018-06-07”, then updated_at is
scaled to “2020-06-06”
Flags are now created for different sets of features, e.g. 'added but to null', 'changed but to
null', etc.
The previous grouped-sum max dates are taken as input and appended to the current
concatenated trackers
Using filters.xlsx, completion rates pertaining to the filters state, country, tier and
licensing_authority_id are generated for the dates between the backup date and the master
date
The grouped sum has to be rectified, as it now contains some fields with more than one
row each. To solve this, the grouped sum is sorted in ascending order including updated_at;
duplicates are then removed without including updated_at, and only the last values are kept
(a simplified sketch of this fix follows this list)
38
The grouped sum is saved in .parquet format to be used for the next run, and the completion
rates are appended to the u_completion_rate table and replaced in the t_completion_rate table
The final mapped grouped sum, appended with the date of the script run, is zipped and sent
to the S3 bucket for safekeeping and as a failsafe for future purposes
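A simplified sketch of the two fixes described above, clipping updated_at into the backup-master
window and de-duplicating the grouped sum while keeping the last values, is shown below; the
column names and sample rows are illustrative, not the actual table layout.

import pandas as pd

backup_date = pd.Timestamp("2020-06-06")
master_date = pd.Timestamp("2020-06-12")

tracker = pd.DataFrame({
    "unique_id": ["E1", "E2"],
    "attribute": ["box_number", "country"],
    "updated_at": pd.to_datetime(["2018-06-07", "2020-06-10"]),
})

# Scale updated_at into the [backup_date, master_date] window to handle merged entities.
tracker["updated_at"] = tracker["updated_at"].clip(lower=backup_date, upper=master_date)

grouped_sum = pd.DataFrame({
    "unique_id": ["E1", "E1", "E2"],
    "attribute": ["box_number", "box_number", "country"],
    "filled": [1, 1, 1],
    "updated_at": pd.to_datetime(["2020-06-06", "2020-06-08", "2020-06-10"]),
})

# Sort in ascending order including updated_at, then drop duplicates without it,
# keeping only the last row, so each field contributes a single most recent value.
grouped_sum = grouped_sum.sort_values(list(grouped_sum.columns))
keys = [c for c in grouped_sum.columns if c != "updated_at"]
grouped_sum = grouped_sum.drop_duplicates(subset=keys, keep="last")
print(grouped_sum)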
39
Figure 3.4 Generalized Tracker Creation Process
40
Completion snapshot
Completion Snapshot determines the completion rates which were affected during a
specific timeline like weekly, monthly or quarterly
A dates pickle is introduced for the weekly, monthly and quarterly processes; it contains the
backup and master dates for which each process will be run, as a dictionary with numbered
dates as values and folder names as keys
The dates pickle is calculated from the latest Thursday, the penultimate Thursday, last
month's last Thursday and the latest quarter's Thursday (a sketch of this date calculation
follows this list)
Examples – dates_weekly={1: [30, 5, 2020], 2: [5, 6, 2020]}, dates_monthly={1: [30, 5,
2020], 2: [5, 6, 2020]}, dates_quarterly={1: [4, 4, 2020], 2: [5, 6, 2020]}
Extract files are picked up one by one; trackers are created for each process and saved as
pickles in loops
A total of 120 trackers are created, which takes about 1 hour 30 minutes to complete
In the tracker creation process, the tracker between the backup and master date files is
created first. Thereafter, a tracker between an empty extract and the backup date file is
created, for only those entities which were present in the trackers between the backup date
and the master date
A new loop is introduced to calculate completion rates for each process. The reason
behind introducing new loop is to help in debugging code for future purposes
The merging issue arises here as well: when an entity is deleted and added into another entity, the updated_at in the trackers takes the created_at of the already present entity, which is out of range of the backup and master dates.
To solve this, if updated_at is behind the backup date it is scaled up to the backup date, and if it is ahead of the master date it is scaled back to the master date. Example: if the backup date is "2020-06-06" and updated_at in the tracker is "2018-06-07", then updated_at is scaled to "2020-06-06".
Flags are now created for different sets of features, for example "added but to null", "changed but to null", etc.
Using filters.xlsx, completion rates for the filters state, country, tier and licensing_authority_id are generated for the dates between the backup date and the master date.
State and country are mapped to the completion rates and the timeline (i.e. weekly, monthly or quarterly) is added.
The completion rates for the completion snapshot are replaced in the original table for each timeline, one by one.
The completion snapshot does not require any previous grouped sum, as all the required data is created within the script itself on every run.
Fields for changes in Completion rate tab
1. Business_addresses
Box_number
Country
Locality
Postal_code
Premise
Region
Street
Sub_locality
Sub_premise
Sub_region
2. Business_aliases
Created_at
Alias_name
Alias_name_type
Quality
3. Business_emails
Address
4. Business_licenses
License_identifier
Med_rec_classification
Status_date
Uniform_license_status_description
Uniform_license_type_description
5. Business_mastersheet
Business_license
Cik
Created_at
Fein
Name
Name_type
Stock_exchange
Stock_symbol
Tier
url
6. Business_phones
Classification
Number
7. Business_registrations
Classification
Number
State
8. Business_source_docs
Attachment_content_type
Attachment_file_name
Attachment_file_size
Attachment_updated_at
Description
Note
url
9. Business_subordinates
Role
Subordinate_link
Subordinate_name
Subordinate_type
Title
10. Business_superiors
Title
Superior_link
Superior_name
Superior_type
Role
11. Business_web_sites
Classification
url
12. Individual_addresses
box_number
country
locality
postal_code
premise
region
street
sub_locality
sub_region
sub_premise
13. Individual_alias
created_at
education_suffix
first_name
generation_suffix
last_name
middle_name
prefix
quality
14. Individual_emails
address
15. Individual_licenses
License_identifier
Med_rec_classification
Status_date
Uniform_license_status_description
Uniform_license_type_description
16. Individual_master
Created_at
Date_of_birth
Education_suffix
First_name
Generation_suffix
Last_name
Middle_name
Prefix
url
17. Individual_phones
Classification
Number
18. Individual_source_docs
Attachment_content_type
Attachment_file_name
Attachment_file_size
Attachment_updated_at
Description
Note
url
19. Individual_subordinates
Role
Subordinate_link
Subordinate_name
Subordinate_type
Title
20. Individual_web_sites
Classification
url
Order of Script Run
1_CRB_Extract_Maker.py (45 min) (Scheduled at 12 PM on Saturday)
2_completion_snapshot_Extract_Maker.py (1 hour 30 min) (Scheduled at 1 PM on Saturday)
3_CRB_Base_Data_Update_Script.py (10 min) (Scheduled at 3 PM on Saturday)
4_completion_rate_Runner.py (1 hour 30 min) (Scheduled at 3:45 PM on Saturday)
5_completion_snapshot_Runner.py (4 hours 20 min) (Scheduled at 6:30 PM on Saturday)
6_t_completion_rate.ipynb (5 min)
7_excel_to_sql.py (5 min)
8_CRB_tables.py (5 min)
Output Samples
1. Tracker output
unique_id entity licensing_authority_id from to attribute data-field flag updated_at backup_date master_date
2453_addresses_2018-03-08T23:04:18.275Z 2453 # 715 business_addresses box_number deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
3206_addresses_2018-03-08T23:02:23.088Z 3206 PO Box 285 business_addresses box_number deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
83964_addresses_2020-03-12T17:55:21.764Z 83964 Park Park City business_addresses locality changed 2020-06-05T21:44:09.723Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
100547_addresses_2020-03-05T10:21:32.773Z 100547 Murray business_addresses locality deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
109749_addresses_2020-05-29T04:44:24.017Z 109749 Fairmont business_addresses locality deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
109749_addresses_2020-05-29T04:44:24.920Z 109749 Weirton business_addresses locality deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
2. Completion Rate
3. CRB Dashboard
1. Main Page
3. Completion rates Page (Business Filter)
5. License Datafields (Individual Filter)
4. Completion Snapshot
1. Completion rates (Weekly Filter) (Business filter)
2. Completion rates (Weekly Filter) (Individual filter)
4. Completion rates (Monthly Filter) (Individual filter)
6. Completion rates (Quarterly Filter) (Individual filter)
8. License Data Fields (Weekly Filter) (Individual filter)
10. License Data Fields (Monthly Filter) (Individual filter)
12. License Data Fields (Quarterly Filter) (Individual filter)
4. Project 3 - Work Done Analysis
Executive Summary
CRB Work Analysis is a ZIGRAM initiative to obtain a comprehensive analysis of the work being done by the CRB team on a day-to-day basis. The aim of this project is to assess the productivity of the team on the basis of tasks assigned, tasks completed and other parameters. It also helps assess the supervisors of the team through an automated process.
The work done is now also shared with the client to showcase the team's output, and it has helped ZIGRAM increase revenue from the clients.
The project has helped standardize the team's work-reporting process and has also helped the operations leaders reward the good performers and support the weaker ones just by looking at the Work Done reports, saving their time and effort as well.
Solution
The problem is large and there was no well-defined final solution at the beginning; the mentors had no ready solution either, since the idea was new.
The project is built on Google Cloud Platform and uses the Google Drive and Google Sheets APIs.
The cloud project access currently permits 100 requests per 100 seconds. As of now, a maximum of 60 requests is used in every run of the scripts.
Methodology adopted for solution
Input
1. Supervisor Page
Work Allotted Script
All team supervisors fill their respective Google Sheets with the work allotted to their specific analysts.
At 10:15 AM their sheets are compiled and a string is generated based on their inputs of the work allotted to each team member, as sketched below.
After standardization, this string is e-mailed to each supervisor and to each ops leader.
The string contains the comma-separated work allotted to each team member, with the state and assigned counts in brackets.
A second string contains the names of team members who are on leave.
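A minimal sketch of the compilation step using the gspread client for the Google Sheets API; the service-account file, sheet names and column headings are assumptions made for illustration:

```python
import gspread

# Authenticate with a service-account key file (placeholder path).
gc = gspread.service_account(filename="service_account.json")

allotted_parts, on_leave = [], []
for sheet_name in ["Supervisor_A_Allotment", "Supervisor_B_Allotment"]:  # assumed sheet names
    ws = gc.open(sheet_name).sheet1
    for row in ws.get_all_records():  # one dict per filled row, keyed by the header row
        if str(row.get("Status", "")).strip().lower() == "leave":
            on_leave.append(row["Team Member"])
        else:
            allotted_parts.append(
                f'{row["Team Member"]} - {row["Task"]} ({row["State"]}, {row["Count"]})'
            )

work_allotted_string = ", ".join(allotted_parts)  # e-mailed to supervisors and ops leaders
leave_string = ", ".join(on_leave)                # second string: members on leave
```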
Output
Work Done Script
All team supervisors will have filled their respective Google Sheets with the work allotted to their specific analysts.
All team analysts and supervisors fill their respective Google Sheets with the work done by them during the day.
At 10 PM their sheets are compiled and a table is generated based on their inputs of the work allotted to and work done by each team member.
The Work Allotted and Work Done data are outer joined, and blank spaces are marked with '-'.
If a team member is marked absent by the supervisor, he/she is marked "Not Available" in the table.
If the subtask is not filled in by the analyst, the comments are considered as the subtask; if those are also unavailable, the task is considered as the subtask.
Consequently, if no subtask is available, the row is removed from consideration.
Filling the date is not required; it is created in the code itself.
If the name is not filled in the analyst sheets, it is substituted by the code in the right place.
The work done table is shown as an HTML table, and an Excel file is also sent as a supplement (a join-and-fallback sketch follows below).
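A sketch of the join and fallback logic in pandas; the compiled Work Allotted and Work Done inputs and their column names are assumptions for illustration:

```python
import pandas as pd

work_allotted = pd.DataFrame({
    "Team Member": ["A", "B"], "Task": ["Licenses", "Addresses"], "Count Assigned": [20, 15],
})
work_done = pd.DataFrame({
    "Team Member": ["A"], "Task": ["Licenses"], "Subtask": [None],
    "Comments": ["OR batch"], "Count Completed": [18],
})

# Outer join of allotted and done data.
table = work_allotted.merge(work_done, on=["Team Member", "Task"], how="outer")

# Subtask fallback: Comments if Subtask is missing, otherwise the Task itself;
# rows with no usable subtask would be dropped here.
table["Subtask"] = table["Subtask"].fillna(table["Comments"]).fillna(table["Task"])
table = table.dropna(subset=["Subtask"])

# Members marked absent by their supervisor become "Not Available" (placeholder set);
# all remaining blanks are shown as '-'.
absent = {"B"}
table["Count Completed"] = table["Count Completed"].astype(object)
table.loc[table["Team Member"].isin(absent), "Count Completed"] = "Not Available"
table = table.fillna("-")

html_table = table.to_html(index=False)        # embedded in the e-mail body
table.to_excel("work_done.xlsx", index=False)  # supplementary Excel file
```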
Output
Figure 4.5 CRB Work Done Output Email Head
Weekly Work Done
1. Activity Wise
Data for the last 7 days is acquired via a SQL query.
The activity-wise table is generated by a groupby on task (Activity).
Allocated is the sum of the counts allocated for the group.
Completed is the sum of the counts completed for the group.
Completed but not assigned is the sum of (counts completed - counts assigned) where counts completed - counts assigned > 0.
Total Completed is the sum of Completed and Completed but not assigned for the group.
Pending is the sum of (counts assigned - counts completed) for the group where counts assigned > counts completed.
The result is saved in an Excel sheet named "Activity_Wise" (see the sketch after this list).
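A sketch of the activity-wise aggregation in pandas, following the definitions above literally; the base-data column names are assumptions:

```python
import pandas as pd

# Illustrative base data for the last 7 days (one row per analyst per day).
base = pd.DataFrame({
    "task": ["Licenses", "Licenses", "Addresses"],
    "counts_assigned": [20, 10, 15],
    "counts_completed": [18, 14, 9],
})

diff = base["counts_completed"] - base["counts_assigned"]
base["extra"] = diff.clip(lower=0)        # completed but not assigned
base["pending"] = (-diff).clip(lower=0)   # assigned but not completed

activity_wise = base.groupby("task").agg(
    Allocated=("counts_assigned", "sum"),
    Completed=("counts_completed", "sum"),
    Completed_but_not_assigned=("extra", "sum"),
    Pending=("pending", "sum"),
).reset_index()
activity_wise["Total Completed"] = (
    activity_wise["Completed"] + activity_wise["Completed_but_not_assigned"]
)

activity_wise.to_excel("weekly_work_done.xlsx", sheet_name="Activity_Wise", index=False)
```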
Output
2. Researcher Activity Wise
Data for the last 7 days is acquired via a SQL query.
The researcher-activity-wise table is generated by a groupby on task and team member.
No. of Days is generated by aggregating the Date column.
Allocated is the sum of the counts allocated for the group.
Total Completed is the sum of the counts completed for the group.
Work done per man-day is counts completed divided by No. of Days.
The result is saved in a second Excel sheet named "Researcher_Activity_Wise".
The initial result of the SQL query is saved in a third Excel sheet named "Base_data".
Output
Figure 4.8 CRB Weekly Work Done Output Researcher Activity Wise
Email
All the sheets are combined and saved in one Excel file.
This Excel file is then sent via e-mail to the relevant authorities (a sketch follows below).
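A sketch of combining the sheets into one workbook and e-mailing it, using pandas ExcelWriter and the standard-library email/smtplib modules; the DataFrames, addresses and SMTP host are placeholders:

```python
import smtplib
from email.message import EmailMessage
import pandas as pd

# Placeholder result frames (in practice these come from the weekly aggregations).
activity_wise = pd.DataFrame({"task": ["Licenses"], "Allocated": [30]})
researcher_wise = pd.DataFrame({"team_member": ["A"], "Total Completed": [18]})
base = pd.DataFrame({"task": ["Licenses"], "counts_completed": [18]})

# Write all sheets into one Excel file.
with pd.ExcelWriter("weekly_work_done.xlsx") as writer:
    activity_wise.to_excel(writer, sheet_name="Activity_Wise", index=False)
    researcher_wise.to_excel(writer, sheet_name="Researcher_Activity_Wise", index=False)
    base.to_excel(writer, sheet_name="Base_data", index=False)

# E-mail the workbook to the relevant authorities (addresses and host are placeholders).
msg = EmailMessage()
msg["Subject"] = "CRB Weekly Work Done"
msg["From"] = "[email protected]"
msg["To"] = "[email protected]"
msg.set_content("Please find attached the weekly work done report.")
with open("weekly_work_done.xlsx", "rb") as f:
    msg.add_attachment(f.read(), maintype="application", subtype="octet-stream",
                       filename="weekly_work_done.xlsx")
with smtplib.SMTP("smtp.example.com", 587) as smtp:
    smtp.starttls()
    smtp.send_message(msg)
```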
Output Sample
5. Project 4 - License Changes
Executive Summary
The aim of this project is to showcase the license changes of the database fields to the clients. The dashboard tab created for this should be communicative enough that the client gets a concrete idea about the license changes of each field. The dashboard will also contain historic data for the client's reference, so old data is not lost but preserved. The dashboard tab will be incrementally updated every week.
The dashboard tab will also help the client assess the work done by the team in increasing the completion percentage of the fields.
The final delivered dashboard tab runs on a single script which is fully automated and tested. All scripts are scheduled to run every Saturday so that the dashboard can be refreshed on Monday morning.
In addition, there are multiple checks in the scripts which act as failsafes: if anything goes wrong, the scripts exit. The dashboard tab acts as a means to gain insight into the team's efforts on the updating and changing of licenses on a regular basis, a requirement currently not fulfilled by any other dashboard tab.
A dashboard tab therefore communicates the team's efforts to the clients and also helps increase ZIGRAM's revenues in the long term. The project also aims to reduce ZIGRAM's weekly client calls with operations leaders, saving time for both parties, since the client will have access to all license-change insights 24x7 with regular updates every week.
Solution
The problem is large and there was no well-defined final solution at the beginning. To start with, the student had to survey the team leaders of the project for weeks to arrive at an agreed final output. The company uses the concept of a storyboard, where the intended dashboard tab is projected in the form of a PowerPoint presentation that showcases the logic to be used for the manipulations leading to the solution.
The end goal is a dashboard tab that is simple to understand and very effective in communicating numbers to the clients of the project. The tab showcases the work done on the licenses by ZIGRAM from 1st October 2019 and will be refreshed every week using static files.
Methodology adopted for solution
Getting Data
Get all links from the batch exports API.
Keep only those links which are JSON.
Decrypt them by replacing a fixed part of the string with another.
Go to the specific links to get the individual and business licenses files.
Identify these files as either business or individual.
Save these files by creating respective folders based on the date of the CRB extract.
Convert the folder names to incremental numbers (see the sketch below).
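A minimal sketch of this collection step with the requests library; the API endpoint, the string replacement and the link layout are all placeholders:

```python
import os
import requests

# Placeholder batch-exports API returning a list of export links.
links = requests.get("https://api.example.com/batch_exports").json()["links"]

# Keep only JSON links and "decrypt" them by swapping a fixed part of the URL.
json_links = [l.replace("/encrypted/", "/exports/") for l in links if l.endswith(".json")]

for link in json_links:
    resp = requests.get(link)
    kind = "business" if "business" in link else "individual"
    extract_date = link.rstrip("/").split("/")[-2]  # assumed: extract date encoded in the link
    folder = os.path.join("licenses", extract_date)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, f"{kind}_licenses.json"), "wb") as f:
        f.write(resp.content)
```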
Compare files received
Now create a dictionary of the dates involved in the individual and business licenses files.
Use these dates to compare the files and generate trackers.
Append all individual trackers into one file (the complete individual tracker).
Append all business trackers into one file (the complete business tracker).
Manipulation on data
Append these two files into one dataframe for further processing.
Merge this result with a SQL query on v_entity_overview to get region, country, licensing authority id and name for the licenses from the DB.
Get some more information from entity_mapping_data for address_region and country when they are not available in v_entity_overview.
Get the end of week as Thursday from the updated_at column.
Sort values.
Get FEID and its lag for comparison to get the date difference.
Drop some unwanted columns.
Create a super unique id for unique row identification by joining the unique entity, updated_at and backup date.
Get licenses created, inactive licenses, both changed, only date changed and only status changed by creating flags.
To get both status and date changed, the data is arranged so that the values sit in the same row, appended using the super unique id.
Drop any unwanted columns (a sketch of the Thursday and super-unique-id steps follows below).
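A sketch of two of these steps in pandas, assuming "end of week as Thursday" means rolling each updated_at forward to the following Thursday; all column names are illustrative:

```python
import pandas as pd

licenses = pd.DataFrame({
    "unique_entity": ["2453", "83964"],
    "updated_at": pd.to_datetime(["2020-06-02", "2020-06-05"]),
    "backup_date": pd.to_datetime(["2020-05-29", "2020-05-29"]),
})

# End of week as Thursday: dates already on a Thursday stay put,
# everything else rolls forward to the next Thursday (weekday 3).
days_ahead = (3 - licenses["updated_at"].dt.weekday) % 7
licenses["end_of_week"] = licenses["updated_at"] + pd.to_timedelta(days_ahead, unit="D")

# Super unique id for row identification: unique entity + updated_at + backup date.
licenses["super_unique_id"] = (
    licenses["unique_entity"] + "_"
    + licenses["updated_at"].dt.strftime("%Y-%m-%d") + "_"
    + licenses["backup_date"].dt.strftime("%Y-%m-%d")
)
```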
Fill the empty fields going into the groupby with '0'.
Create 15 groupby aggregations, covering all possible combinations of the groupby columns.
Append and sort all results into one dataframe.
Outer Join to get other combinations
Get the total entities combination and licenses combination results using a SQL query, with all combinations of entity and licenses grouped by and aggregated into results.
Outer join the existing dataframe with both results of the query, one by one.
Replace '0' with an empty string.
Create aggregate_id.
Sort the data again.
Shift aggregate_id by 1 to fill empty entity values using bfill.
Repeat the same step to fill empty licenses values.
Drop rows which do not make sense logically according to the Power BI visuals.
Reporting data
Import the latest data from an Excel file.
Send it to the SQL server.
Import the structured reporting data into the script for the current week and the previous week.
This is done to give a comparative analysis of the reporting data.
Format the data.
Outer join both reporting dataframes to the existing dataframe by creating an end-of-week column inside the reporting data to avoid future issues.
Rename a few columns as required by the Power BI visual.
Send the table back to the SQL server (a round-trip sketch follows below).
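A sketch of the Excel-to-SQL round trip with pandas and SQLAlchemy; the connection string, table name and query are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder SQL Server connection (DSN and driver details are assumptions).
engine = create_engine("mssql+pyodbc://user:password@crb_dsn")

# Import the latest reporting data from Excel and send it to the server.
reporting = pd.read_excel("reporting_data.xlsx")
reporting.to_sql("u_reporting_data", engine, if_exists="replace", index=False)

# Read back structured reporting data for the current and previous week
# for the comparative analysis (query is illustrative).
last_two_weeks = pd.read_sql(
    "SELECT * FROM u_reporting_data WHERE end_of_week >= DATEADD(week, -2, GETDATE())",
    engine,
)
```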
Final data
Update u_tracker_index with the new backup_date and master_date.
Left join with the state code table to get full state/province names.
Left join with the country code table to get full country names.
Group by to get cumulative records and licenses.
Append the incremental data to the SQL server in the table u_license_changes_aggregation (see the sketch below).
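A sketch of the final joins and cumulative aggregation in pandas; the lookup tables and counts below are illustrative, and the write-back call is left commented out:

```python
import pandas as pd

changes = pd.DataFrame({
    "state_code": ["OR", "OR", "CA"],
    "end_of_week": pd.to_datetime(["2020-05-28", "2020-06-04", "2020-06-04"]),
    "records": [3, 5, 2],
    "licenses": [4, 6, 2],
})
state_codes = pd.DataFrame({"state_code": ["OR", "CA"],
                            "state_name": ["Oregon", "California"]})

# Left join to get full state/province names (country codes would be joined the same way).
changes = changes.merge(state_codes, on="state_code", how="left")

# Cumulative records and licenses per state over time.
changes = changes.sort_values(["state_code", "end_of_week"])
cums = changes.groupby("state_code")[["records", "licenses"]].cumsum()
changes["cum_records"] = cums["records"]
changes["cum_licenses"] = cums["licenses"]

# Append the incremental data to the aggregation table on the SQL server, e.g.:
# changes.to_sql("u_license_changes_aggregation", engine, if_exists="append", index=False)
```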
Proposed Output
Delivered Output (Filter- US, OR, all, OR-OLCC)
6. Other Tasks Completed
1. Weekly dashboard update – The CRB Dashboard has to be updated every Monday, which takes about half a day. The dashboard is a direct image of ZIGRAM's efforts presented to the clients, so it has to be perfect; there is no room for mistakes here.
2. Generate test datasets – ZIGRAM has multiple projects and all of them need a lot of data to function and improve accuracy. In the first 3 weeks of my internship, I generated approximately 5000 rows of test data for 2 different projects.
3. Storyboards – The dashboards deployed by me were also designed by me. ZIGRAM uses the concept of storyboards, where the final output is first mocked up in a PowerPoint presentation and then visualized. It took us weeks to arrive at a common understanding of the final view.
4. Monthly reports – I was also tasked with preparing monthly work-done reports for the months of April and May, which were shared directly with the client.
5. Setting up Amazon Linux – Since all the scripts are deployed on an Amazon Linux server, I had to set up its instance along with all the software and libraries required for the scripts to function, run and be scheduled properly.
7. Conclusion
Working with a dataset as huge as CRB is a challenge in itself. CRB has dozens of tables already in use, which makes it very fragile and leaves no room for mistakes. Since all my projects were related to dashboards and numbers, there were multiple rounds of quality checks on the results, which would sometimes last a week. The most difficult part to get through was the completion rates grouped-sum logic and its quality check.
Overall, I delivered 53 scripts and 2 dashboard tabs during my tenure as a data science intern at ZIGRAM. We tried to fulfil all the demands of the client in the most cost-effective manner. The CRB Work Analysis project was quoted at a recurring cost of about $1000 per month when a professional software company working in this area was consulted. By building it ourselves, we delivered the project completely free of cost using free Google APIs, without compromising security. Since the project is free of any subscriptions, there will be no recurring future costs.
There were many problems in getting the latest CRB extracts on the API deployed on the S3 bucket. All the scripts depend on the consistency of the CRB extract links. Over time, however, the extracts have become fairly consistent: the problem of not getting extracts has been reduced from 1 in 7 days to 1 in 40 days as of now.
Overall, working with such huge data has its own ups and downs. The experience I gained from working with this data will always help me analyze data better and achieve targets faster, since I now have a thorough insight into the technical, logical and syntactical problems faced by an analyst.
8. Bibliography and References
1. https://pandas.pydata.org/
Reference for the Pandas library of Python
2. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Reference for DataFrames, which are the backbone of all the analysis
3. https://aws.amazon.com/mp/linux/
Setting up the EC2 instance and libraries for smooth functioning of the scripts and other requirements
4. https://www.udemy.com/course/useful-excel-for-beginners/
Useful Excel required for smooth handling of Excel-related issues in data analysis