8th Semester Report CRB ZIGRAM
Faculty Internship Guide: Mr. Yogesh Bharadwaj Industry Guide: Mr. Ankit Gadi
Designation: Assistant Professor Designation: Data Scientist
Department of Computer Science Engineering ZIGRAM
I hereby declare that the project work presented in this report, titled “Cannabis Related
Businesses”, is an authentic record of my own work carried out at “ZIGRAM” in fulfilment of
the six-month project semester requirement for the award of the degree of B.Tech. in “Computer
Science & Engineering”, JECRC University, under the guidance of “Mr. Yogesh Bhardwaj” and
“Mr. Ankit Gadi”, during the final semester of 2020.
Signature of student
Certified that the above statement made by the student is correct to the best of our knowledge
Signature
Abhishek Bali
CEO, ZIGRAM
ii
Acknowledgement
I would like to acknowledge my debt to each and every person associated with the development
of this project. The project demanded huge commitment from everyone involved, and I am
thankful for the patience with which my guides stood by me through all my queries and problems
till the end. The project is the result of the efforts of all the people associated with it, directly or
indirectly, who helped me complete it successfully. Without their support and encouragement the
effort would have been futile; it was their constant morale-building, diligence and hard work that
enabled the project to materialize and contributed to its success. I would like to express my
appreciation to all the people who have contributed.
iv
December 15, 2019
Zigram Data Technologies Private Limited
Gurgaon, India
2. Date of Joining: Your internship will begin effective from January 6, 2020 or any other date
mutually agreed upon.
3. Location: Your current place of work is Gurgaon. Your place of work may be changed to
any other location within India or overseas as directed by the Company from time to time.
4. Period of Internship: The Internship period being offered is for a period of 6 months from the
date of your joining the company, extendable by 1 month in case of issues of performance or
deliverables.
5. Remuneration: The internship will carry a stipend of INR 10,000 per month. The company
will not be responsible for any taxation, accounting or other associated terms / policies and you
will not be eligible for any taxable and non-taxable allowances and benefits, and other
payments, payables or bonuses.
6. Project & Deliverables: You will be given an appropriate project title, scope and deliverables.
These will be tracked on a regular basis and deliverables assessed / defined monthly. In addition,
there will be a mid-term assessment of your work to assess performance, output and appropriate
steps of development. Deliverables, action items and requirements may change or be considered
dynamic or be prioritized based on new or emergent requirements or concerns.
7. Reporting: Your reporting within the organization may be towards two different leaders i.e.
Admin & Project Supervisor. They will be formally introduced to you on the day of joining.
8. Obligations of Tax: Any amount Payable by the Company to you towards compensation,
allowances and/or other payments shall be subject to deduction of withholding taxes and/or
any other taxes under applicable law. All requirements under the applicable tax laws in India
and outside India, including tax compliance and filing of tax returns, assessments etc. of your
personal income shall be fulfilled by you.
10. Whole time and attention: During your internship with the Company, you shall devote your
best efforts to promote the company's business and may not, without prior written consent of
the company, engage or be interested (directly or indirectly) in any other business or
employment.
11. Leave: In addition to public holidays reserved by the company, you shall also be entitled to
leaves as per the company leave policy and prorated from your date of joining. Further there
are other leave entitlements, details of which will be made available at the time of joining.
These are governed by the company's personnel policy, as applicable and can be modified from
time to time.
12. Termination: You may terminate this internship by giving 45 days of notice in writing or by
paying the equivalent of the TFC amount in lieu thereof. The company reserves the right to
terminate your services without any compensation or notice thereof, if you are found to be in
moral breach of your responsibilities, or following a code of conduct, not in line with the
company's guidelines and values, or if any information provided by you during the course of
your interview or otherwise, is at any time, found to be either wrong or not disclosed, or if you
are in contravention of the terms of this letter.
13. Disclosure: You shall disclose to the company your business interests, whether or not they are
similar to or in conflict with the businesses or activities of the company, and all circumstances
in respect thereof and whether there is or might be a conflict of interest between the company
and you.
14. Company Policies: You will be covered by the company’s policies, as applicable at your
level. The company reserves the right to amend the policies from time to time.
15. Return of Company property: You shall promptly upon request by the company and in any
event upon the termination of your internship deliver to the company all list of clients or
customers, correspondence and all other documents, papers and records in whatever form,
codes and technology related items, including but not limited to electronically held data
containing or referring to any trade secrets or confidential information concerning the business
of the company which may have been prepared by you or come into your position, custody or
control in the course of your internship (including any prior employment with the company).
You shall not keep copies of these items.
16. Compliance Rules: You shall be subject to compliance rules as determined by the company
from time to time or as may be imposed by any regulatory body. It is your responsibility to
ensure that you are aware of the compliance rules in force from time to time and that you adhere
to them. From time to time the company may require that you sign undertakings that you will
abide by the then existing rules and regulations.
In the event of your background verification report being unsatisfactory to the company, the company
reserves the right to revoke your offer of internship or terminate your internship in the event of you
having commenced your internship with the company prior to receiving your verification report.
Letter of Experience
15 June 2020
To whom it may concern
This letter is to certify that Abhishek Kumar Agarwal has successfully completed the internship program with
Zigram Data Technologies Private Limited. His internship tenure was from January 06, 2020 to June 08,
2020. Abhishek was actively and diligently involved in the projects and tasks assigned to him.
During the span, we found him to be a punctual and hardworking person. He was able to demonstrate his
knowledge by practically leveraging it in various aspects of the business.
Sincerely,
Abhishek Bali
CEO, ZIGRAM
Declaration……………………………………………………………………………………………..……..i
Acknowledgement………………………………………………………………………………………....ii
Offer Letter………………………………………………………………………………………………..iii
Joining Report……………………………………………………………………………………………vii
Completion Certificate……………………………………………………………………….................viii
List of Figures………………………………………………………………………………………….....xi
Abstract..……………………………………………………………………………………………...…xiii
Company Profile…………………………………………………………………………………………...1
Introduction To CRB……………………………………………………………………………………...4
Projects Assigned…………………………………………………………………………………..............8
1. Methodology……..…………………………………………………………………………....………...9
1.1 DataFrame…………..…………………………………………………………………………………9
1.2 Microsoft SQL Server……………………..…………………………………………………………10
1.3 Power BI………………………………………………………………………………………............14
1.4 Amazon s3 Bucket..…………………………………………………………………………………...15
1.5 Google Cloud Platform……………...………………………………………………………………..17
1.6 Amazon Linux….………………………………………….………………………………………….22
1.7 Google Sheets…………………………………………………………………………………………24
1.8 Pandas……………………….......……………...…………………………………………………….26
2. Project 1 – CRB Error Analysis...…………………………………………………………………….28
Executive summary………………………………………………………………………………............28
Solution …………………………………………………………………………………………………...29
Methodology adopted for solution……………………………………………………………………….30
Output sample ………………………………………………………………………………………........32
3. Project 2 – CRB Completion Rates.......……………………………………………………………....34
Executive Summary ……………………………………………………………………………………...34
Solution …………………………………………………………………………………………………...35
Methodology adopted for solution……………………………………………………………………….36
Fields for changes in Completion rate tab………………………………………………………………43
Order of Script Run………………………………………………………………………………………48
Output samples…………………………………………………………………………………………...49
x
4. Project -3 Work Done Analysis……………………………………………………………………….59
Executive Summary ……………………………………………………………………………………...59
Solution …………………………………………………………………………………………………...60
Methodology adopted for solution ………………………………………………………………………61
Output Sample……………………………………………………………………………………………67
5. Project -4 License Changes……………………………………………………………………………68
Executive Summary ……………………………………………………………………………………...68
Solution …………………………………………………………………………………………………...69
Methodology adopted for solution………………………………………………………………………70
Proposed Output …………………………………………………………………………………………72
Delivered Output…………………………………………………………………………………………72
6. Other Tasks Completed………………………………....…………………………………………….74
7. Conclusion………....…………………………………………………………………………………...75
8. Bibliography and References………………………………………………………………………….76
xi
List of Figures
xii
Figure 3.24 License Data Fields(Quarterly Filter)(Individual filter)............................................. 58
Figure 4.1 CRB Work Analysis- Supervisor Input Page.................................................................61
Figure 4.2 CRB Work Analysis- Analyst Input Page……............................................................ 61
Figure 4.3 CRB Work Allotted Output Email................................................................................ 62
Figure 4.4 CRB Work Done Output Table.................................................................................... 63
Figure 4.5 CRB Work Done Output Email Head........................................................................... 64
Figure 4.6 CRB Work Done Output Email Tail............................................................................. 64
Figure 4.7 CRB Weekly Work Done Output Activity Wise…...................................................... 65
Figure 4.8 CRB Weekly Work Done Output Researcher Activity Wise……................................ 66
Figure 4.9 CRB Weekly Work Done Output Email………………………………..…................. 67
Figure 5.1 CRB License Changes Tab Proposed Output………………..……..…........................ 72
Figure 5.2 CRB License Changes Tab Delivered Output snippet 1…………………………....... 72
Figure 5.3 CRB License Changes Tab Delivered Output snippet 2…………………………..…..73
xiii
Abstract
Cannabis has a huge market in countries where it has been legalised, but it is also open to abuse,
and a large number of people and businesses are connected to it in one way or another. The CRB
project maintains a huge database of people and businesses holding cannabis licenses, captured
in minute detail. The details are routinely updated and new records are continuously added to
keep the database up to date. The database is presented through a dashboard built on Power BI.
This dashboard is designed for the project’s clients, who use it to follow the progress of the database.
The dashboard presently has 7 tabs. Another tab is required to showcase the changes in the licenses
to the clients; it should express all the relevant information and insights effectively. This new tab
is named ‘License Changes’.
Further, a new tab is required to show the completion rates of all the important fields of the database,
which help in the in-depth analysis of each entity. The Completion Rates tab would also give
a comparative analysis of the work done by the CRB team across timelines such as weekly, monthly
and quarterly.
There is also a requirement for error analysis on the existing dashboard page, identifying the
common errors that currently exist in the database along with their solutions, so that the person
rectifying the errors in the database only has to replace the erroneous value with the right one.
The clients also need a thorough analysis of the work done by the CRB team. This needs to
be delivered in a structured data format which is readily available for analysis.
During the internship tenure I solved all these issues; the solutions are deployed on an
Amazon Linux instance as fully automated scripts scheduled to run automatically.
xiv
Company Profile
ZIGRAM is a high impact organization which operates in the Data Asset space. The team is
made up of professionals from varied domains like data science, technology, sales, financial
services, research and business consulting. The aim is to deliver value to clients by Building and
Managing Data Assets across use cases - thereby boosting revenues and reducing the cost of
doing business, in a data driven world.
ZIGRAM
Z – Zeno - Founder of the Stoic school of philosophy
I - Issac Asimov – Finest writer & creator of Science Fiction
G - Sir Francis Galton - One of the greatest polymaths of our time
R - S. Ramanujan - Famous mathematician with almost no formal training
A – Augustus - Founder of the Roman Empire and Pax Romana
M - John McCarthy - Computer scientist known as the 'Father of AI'
What is a Data Asset?
A Data Asset is a structured, comprehensive and validated database of information which has
been built for a specific use case and in response to a problem.
Build For Purpose - Purpose-built Data Assets are designed to meet specific business
requirements as defined by the customer & use cases, oriented to solve specific problems.
Comprehensive - Data Assets which include all the data points which are necessary,
relevant to the specific use case.
Validated - Data Assets which have been created from valid sources which are relevant,
up to date and can be audited.
Structured - Data Assets which are constructed and designed according to a defined plan
in order to address a specific use case.
ZIGRAM’s Work
Data Applications
Applications created by using multiple technologies including automation, analytics, machine
learning and AI to help build Data Assets –Faster, Cheaper and Better
Data Products
Data Asset products built in-house, either in partnership with other players or wholly managed,
with subscription, application or API based access - Solutions For Specific Use-cases
Data Services
Deploying experienced resources, subject matter experts and specialists to execute projects and
operations across the Data Asset lifecycle –From Conceptualization To Delivery
1
ZIGRAM’s Expertise
Operations
Conducting Research, Developing Processes in order to build a validated, structured and
dependable Data Asset.
Core Data:
Profile Development
Online Research
Data Projects
Data Asset framework
Remediation
Enhancement
Enrichment
Maintenance
Data Science
Using Scientific Methods, Processes, Algorithms and Systems to extract knowledge and
insights from structured and unstructured data.
Insights & Efficiency:
Analytics
Machine Learning
NLP
Deep Learning
Statistical Learning
Hypothesis Testing
Data Wrangling
Predictive Modelling
Technology
Solutions for Data Management, Development and Products or services that are based
on Data generated by both humans and machines.
Delivery & Development
Automation
Application Development
Cloud Technology
APIs
2
Data Architecture
Extraction/Mining
Scrapers & Crawlers
External Services
Representation
Use of Tools to effectively represent Data in order to build actionable insights and
visually communicate a quantitative message.
Reporting & Showcasing
Dashboards
Visual Analytics
Relationship Maps
Reporting
Infographics
Key Members
Mr. Abhishek Bali – CEO
Mr. Rahul Pagarre – CFO
Mr. Ankit Gadi – Data Scientist (Data Science Team Lead)
Mr. Ritesh Mohan – Engineering Team Lead
Ms. Jyoti Chakrabortty – Operations Team Lead
Mr. Siddharth Sabu - Manager
3
1. Introduction To CRB
The bill, the Marijuana Businesses Access to Banking Act of 2015, provides a safe harbor for
depository institutions providing financial services to a marijuana-related legitimate business,
to the extent that it prohibits a federal banking regulator from:
(1) Terminating or limiting the deposit or share insurance of a depository institution solely
because it provides financial services to a marijuana-related legitimate business; or
(2) Prohibiting, penalizing, or otherwise discouraging a depository institution from offering
such services.
4
Figure - Marijuana Life Cycle Flow Diagram
5
Banking MRBs
Banks or FIs who maintain account relationships with MRBs should enhance policies,
procedures and monitoring controls to:
Identify marijuana-related relationships at account opening
Evaluate and document the potential risks posed by marijuana dispensaries
FIs need to revise their AML programs to address MRBs, including SAR policies and
procedures addressing a three-tiered marijuana-specific reporting approach
Ensure MRB relationships are appropriately considered within the bank’s suspicious
activity monitoring and other applicable reporting systems
Periodically scrub customer base names and addresses against a listing of approved
marijuana dispensaries’ names, owners and addresses to identify any potential unknown
MRB accounts
Update the AML/BSA training program
In assessing the risk of providing services to a marijuana-related business, an FI should
conduct customer due diligence that includes:
(i) Verifying with the appropriate state authorities whether the business is duly licensed and
registered
(ii) Reviewing the license application (and related documentation) submitted by the business
for obtaining a state license to operate its marijuana-related business
(iii) Requesting from state licensing and enforcement authorities available information about
the business and related parties
(iv) Developing an understanding of the normal and expected activity for the business,
including the types of products to be sold and the type of customers to be served (e.g.,
medical versus recreational customers)
(v) Ongoing monitoring of publicly available sources for adverse information about the
business and related parties
(vi) Ongoing monitoring for suspicious activity, including for any of the red flags described
in this guidance
(vii) Refreshing information obtained as part of customer due diligence on a periodic basis
and commensurate with the risk.
6
What does MRB Monitor do?
MRB Monitor helps financial institutions ("FIs") mitigate regulatory, reputational and
financial risk related to the marijuana industry with its industry-leading data and subject
matter expertise
As the only data vendor solely focused on the marijuana industry, MRB Monitor has the
largest and most comprehensive database of marijuana-related businesses ("MRBs") and
beneficial owners (“BOs”) for use in relationship screening and due diligence
MRB Monitor is also helping to define the terminology and risk framework utilized by
FIs when developing marijuana-related policies and procedures
7
Projects Assigned:
1. Project 1 - The MRB database has very specific standards for the input data format and
its update process. However, the input is entered by humans, so mistakes are bound to
happen. The work done by anyone is always reviewed by a checker, but even then some
mistakes go unnoticed. Identify the most common mistakes in the database. Create a
script in Python using pandas DataFrames so that the most common mistakes are found
by applying specific checks to the database. Use the CRB extract as input to the DataFrames.
Also log process errors.
2. Project 2 - The CRB dashboard is built using Power BI. Use the S3 bucket, the SQL Server
database and the CRB extracts as input and create a Completion Rates tab for the CRB
dashboard, which will be used by clients to see insights related to the completion of fields.
Reflect the completion rates starting from 11th July 2019. Also log process errors.
3. Project 3 - The CRB team does a lot of work which needs to be quantified. Prepare a
Google Sheet which can be used to analyse the work done by the CRB team. Create
automated scripts to transfer the data to SQL Server every day, trigger the scripts, empty
the sheets every day and generate daily and weekly reports. Also log process errors.
4. Project 4 - The CRB dashboard is built using Power BI. Use the S3 bucket logins as input
and create a License Changes tab for the CRB dashboard, which will be used by clients to
see insights related to the changes being made in the MRB database. Use Python, SQL
Server and Power BI to achieve the target. Reflect the changes in the dashboard from
1st October 2019. Also log process errors.
8
Methodology
1.1 DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the
most commonly used pandas object. Like Series, DataFrame accepts many different kinds of
input: a dict of 1D ndarrays, lists, dicts or Series; a 2-D NumPy ndarray; a structured or record
ndarray; a Series; or another DataFrame.
If axis labels are not passed, they will be constructed from the input data based on common sense
rules.
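As a simple illustration of the inputs described above, the following sketch constructs a DataFrame
from a dict of Series and from a dict of lists; the column names and values are made up for
illustration and are not taken from the CRB data.

import pandas as pd

# A dict of Series: keys become column labels and the union of the Series
# indexes becomes the row labels.
data = {
    "licenses": pd.Series([120, 85, 40], index=["CA", "CO", "WA"]),
    "entities": pd.Series([300, 150], index=["CA", "CO"]),
}
df = pd.DataFrame(data)   # row labels are built from the Series indexes
print(df)                 # the missing "WA"/"entities" combination appears as NaN

# The same idea with a dict of lists; row labels default to a RangeIndex 0..n-1.
df2 = pd.DataFrame({"state": ["CA", "CO", "WA"], "licenses": [120, 85, 40]})
print(df2.dtypes)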
9
1.2 Microsoft SQL Server
Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different
audiences and for workloads ranging from small single-machine applications to large Internet-
facing applications with many concurrent users.
Support tools include SQL Server Profiler, BI tools, SQL Server Management Studio, and
Database Tuning Advisor.
Provides advanced customization choices for datatype mappings and for deleting and
renaming objects.
Displays error and warning messages about the migration in an advanced window.
A single, integrated environment for SQL Server Database Engine management and
authoring.
10
From SQL Server 2016 onward, the product is supported on x64 processors only.
The current version is Microsoft SQL Server 2019, released November 4, 2019.
Editions
Microsoft makes SQL Server available in multiple editions, with different feature sets and
targeting different users. These editions are:
Mainstream editions
1. Enterprise
SQL Server Enterprise Edition includes both the core database engine and add-on services, with
a range of tools for creating and managing a SQL Server cluster. It can manage databases as
large as 524 petabytes and address 12 terabytes of memory and supports 640 logical processors
(CPU cores).
2. Standard
SQL Server Standard edition includes the core database engine, along with the stand-alone
services. It differs from Enterprise edition in that it supports fewer active instances (number of
nodes in a cluster) and does not include some high-availability functions such as hot-add
memory (allowing memory to be added while the server is still running), and parallel indexes.
3. Web
4. Business Intelligence
Introduced in SQL Server 2012 and focusing on Self Service and Corporate Business
Intelligence. It includes the Standard Edition capabilities and Business Intelligence tools:
PowerPivot, Power View, the BI Semantic Model, Master Data Services, Data Quality Services
and xVelocity in-memory analytics.
5. Workgroup
11
SQL Server Workgroup Edition includes the core database functionality but does not include the
additional services. Note that this edition has been retired in SQL Server 2012.
6. Express
SQL Server Express Edition is a scaled down, free edition of SQL Server, which includes the
core database engine. While there are no limitations on the number of databases or users
supported, it is limited to using one processor, 1 GB memory and 10 GB database files (4 GB
database files prior to SQL Server Express 2008 R2). It is intended as a replacement for MSDE.
Two additional editions provide a superset of features not in the original Express Edition. The
first is SQL Server Express with Tools, which includes SQL Server Management Studio Basic.
SQL Server Express with Advanced Services adds full-text search capability and reporting
services.
Visual Studio
Microsoft Visual Studio includes native support for data programming with Microsoft SQL
Server. It can be used to write and debug code to be executed by SQL CLR. It also includes a
data designer that can be used to graphically create, view or edit database schemas. Queries can
be created either visually or using code. SSMS 2008 onwards provides IntelliSense for SQL
queries as well.
SQL Server Management Studio is a GUI tool included with SQL Server 2005 and later for
configuring, managing, and administering all components within Microsoft SQL Server. The
tool includes both script editors and graphical tools that work with objects and features of the
server. SQL Server Management Studio replaces Enterprise Manager as the primary
management interface for Microsoft SQL Server since SQL Server 2005. A version of SQL
Server Management Studio is also available for SQL Server Express Edition, for which it is
known as SQL Server Management Studio Express (SSMSE).
A central feature of SQL Server Management Studio is the Object Explorer, which allows the
user to browse, select, and act upon any of the objects within the server. It can be used to visually
observe and analyze query plans and optimize the database performance, among others. SQL
Server Management Studio can also be used to create a new database, alter any existing database
12
schema by adding or modifying tables and indexes, or analyze performance. It includes the query
windows which provide a GUI based interface to write and execute queries.
SQL Server Operations Studio (Preview) is a cross platform query editor available as an optional
download. The tool allows users to write queries; export query results; commit SQL scripts to
Git repositories and perform basic server diagnostics. SQL Server Operations Studio supports
Windows, Mac and Linux systems.
It was released to General Availability in September 2018, at which point it was also renamed
to Azure Data Studio. The functionality remains the same as before.
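In the projects described later, tables such as t_completion_rate and u_completion_rate are read
from and written to SQL Server from Python. The sketch below shows how such a connection
might look using pandas with SQLAlchemy; the server name, database, credentials and driver
string are placeholders, not the actual project configuration.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- server, database and credentials are illustrative.
engine = create_engine(
    "mssql+pyodbc://user:password@my-sql-server/crb_db"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Read an existing table into a DataFrame.
rates = pd.read_sql("SELECT TOP 100 * FROM t_completion_rate", engine)

# Append rows from a DataFrame to another table.
rates.to_sql("u_completion_rate", engine, if_exists="append", index=False)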
13
1.3 Power BI
Key Components
Power BI Desktop
The Windows-desktop-based application for PCs and desktops, primarily for designing
and publishing reports to the Service.
Power BI Service
The SaaS (software as a service) based online service, formerly known as Power BI for
Office 365 and now referred to as PowerBI.com or simply Power BI.
Power BI Mobile Apps
The Power BI Mobile apps for Android and iOS devices, as well as for Windows phones
and tablets.
Power BI Gateway
Gateways used to sync external data in and out of Power BI. In Enterprise mode, can
also be used by Flows and PowerApps in Office 365.
Power BI Embedded
The Power BI REST API can be used to build dashboards and reports into custom
applications that serve both Power BI users and non-Power BI users.
Power BI Report Server
An On-Premises Power BI Reporting solution for companies that won't or can't store data
in the cloud-based Power BI Service.
Power BI Visuals Marketplace
A marketplace of custom visuals and R-powered visuals.
14
1.4 Amazon S3 Bucket
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services
(AWS) that provides object storage through a web service interface. Amazon S3 uses the same
scalable storage infrastructure that Amazon.com uses to run its global e-commerce network.
Amazon S3 can be employed to store any type of object which allows for uses like storage for
Internet applications, backup and recovery, disaster recovery, data archives, data lakes for
analytics, and hybrid cloud storage. AWS launched Amazon S3 in the United States on March
14, 2006, then in Europe in November 2007.
Although Amazon Web Services (AWS) does not publicly provide the details of S3's technical
design, Amazon S3 manages data with an object storage architecture which aims to provide
scalability, high availability, and low latency with 99.999999999% durability and between
99.95% to 99.99% availability (though there is no service-level agreement for durability).
The basic storage units of Amazon S3 are objects which are organized into buckets. Each object
is identified by a unique, user-assigned key. Buckets can be managed using either the console
provided by Amazon S3, programmatically using the AWS SDK, or with the Amazon S3 REST
application programming interface (API). Objects can be managed using the AWS SDK or with
the Amazon S3 REST API and can be up to five terabytes in size with two kilobytes of metadata.
Additionally, objects can be downloaded using the HTTP GET interface.
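The CRB extracts used in the projects below are downloaded from an S3 bucket, and zipped
backups are uploaded back to it. A minimal sketch using boto3 is shown here; the bucket and key
names are placeholders, not the project's actual bucket layout.

import boto3

s3 = boto3.client("s3")   # credentials are picked up from the environment / AWS config

# Download the latest CRB extract (placeholder bucket and key names).
s3.download_file("crb-extracts-bucket", "extracts/crb_extract_latest.xlsx",
                 "crb_extract_latest.xlsx")

# Upload a compressed grouped-sum backup for safekeeping (placeholder key).
s3.upload_file("grouped_sum_2020-06-05.zip", "crb-extracts-bucket",
               "backups/grouped_sum_2020-06-05.zip")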
Notable users
Photo hosting service SmugMug has used Amazon S3 since April 2006. They
experienced a number of initial outages and slowdowns, but after one year they described
it as being "considerably more reliable than our own internal storage" and claimed to
have saved almost $1 million in storage costs.
Netflix uses Amazon S3 as their system of record. Netflix implemented a tool, S3mper,
to address the Amazon S3 limitations of eventual consistency. S3mper stores the file
system metadata: filenames, directory structure, and permissions in Amazon
DynamoDB.
reddit is hosted on Amazon S3.
Bitcasa, and Tahoe-LAFS-on-S3, among others, use Amazon S3 for online backup and
synchronization services. In 2016, Dropbox stopped using Amazon S3 services and
developed its own cloud server.
15
Mojang hosts Minecraft game updates and player skins on Amazon S3.
Tumblr, Formspring, and Pinterest host images on Amazon S3.
Swiftype's CEO has mentioned that the company uses Amazon S3.
Amazon S3 was used by some enterprises as a long term archiving solution until Amazon
Glacier was released in August 2012.
The API has become a popular method to store objects. As a result, many applications
have been built to natively support the Amazon S3 API which includes applications that
write data to Amazon S3 and Amazon S3-compatible object stores
16
1.5 Google Cloud Platform
Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that
runs on the same infrastructure that Google uses internally for its end-user products, such
as Google Search, Gmail and YouTube. Alongside a set of management tools, it provides a
series of modular cloud services including computing, data storage, data analytics and machine
learning. Registration requires a credit card or bank account details.
Google Cloud Platform provides infrastructure as a service, platform as a service, and serverless
computing environments.
In April 2008, Google announced App Engine, a platform for developing and hosting web
applications in Google-managed data centers, which was the first cloud computing service from
the company. The service became generally available in November 2011. Since the
announcement of the App Engine, Google added multiple cloud services to the platform.
Google Cloud Platform is a part of Google Cloud, which includes the Google Cloud Platform
public cloud infrastructure, as well as G Suite, enterprise versions of Android and Chrome OS,
and application programming interfaces (APIs) for machine learning and enterprise mapping
services.
Products
Compute
App Engine - Platform as a Service to deploy Java, PHP, Node.js, Python, C#, .Net, Ruby
and Go applications.
Compute Engine - Infrastructure as a Service to run Microsoft Windows and Linux
virtual machines.
Kubernetes Engine (GKE) or GKE on-prem offered as part of Anthos platform -
Containers as a Service based on Kubernetes.
Cloud Functions - Functions as a Service to run event-driven code written in Node.js,
Python or Go.
Cloud Run - Compute execution environment based on Knative. Offered as Cloud Run
(fully managed) or as Cloud Run for Anthos.
17
Storage & Databases
Cloud Storage - Object storage with integrated edge caching to store unstructured data.
Cloud SQL - Database as a Service based on MySQL and PostgreSQL.
Cloud Bigtable - Managed NoSQL database service.
Cloud Spanner - Horizontally scalable, strongly consistent, relational database service.
Cloud Datastore - NoSQL database for web and mobile applications.
Persistent Disk - Block storage for Compute Engine virtual machines.
Cloud MemoryStore - Managed in-memory data store based on Redis.
Local SSD: High performance, transient, local block storage.
Filestore: High performance file storage for Google Cloud users.
Networking
VPC - Virtual Private Cloud for managing the software defined network of cloud
resources.
Cloud Load Balancing - Software-defined, managed service for load balancing the
traffic.
Cloud Armor - Web application firewall to protect workloads from DDoS attacks.
Cloud CDN - Content Delivery Network based on Google's globally distributed edge
points of presence.
Cloud Interconnect - Service to connect a data center with Google Cloud Platform
Cloud DNS - Managed, authoritative DNS service running on the same infrastructure as
Google.
Network Service Tiers - Option to choose Premium vs Standard network tier for higher-
performing network.
18
Big Data
Cloud AI
Cloud AutoML - Service to train and deploy custom machine learning models. As of
September 2018, the service is in Beta.
Cloud TPU - Accelerators used by Google to train machine learning models.
Cloud Machine Learning Engine - Managed service for training and building machine
learning models based on mainstream frameworks.
Cloud Job Discovery - Service based on Google's search and machine learning
capabilities for the recruiting ecosystem.
Dialogflow Enterprise - Development environment based on Google's machine learning
for building conversational interfaces.
Cloud Natural Language - Text analysis service based on Google Deep Learning models.
Cloud Speech-to-Text - Speech to text conversion service based on machine learning.
Cloud Text-to-Speech - Text to speech conversion service based on machine learning.
Cloud Translation API - Service to dynamically translate between thousands of available
language pairs
Cloud Vision API - Image analysis service based on machine learning
Cloud Video Intelligence - Video analysis service based on machine learning
19
Management Tools
Cloud Identity - Single sign-on (SSO) service based on SAML 2.0 and OpenID.
Cloud IAM - Identity & Access Management (IAM) service for defining policies based
on role-based access control.
Cloud Identity-Aware Proxy - Service to control access to cloud applications running on
Google Cloud Platform without using a VPN.
Cloud Data Loss Prevention API - Service to automatically discover, classify, and redact
sensitive data.
Security Key Enforcement - Two-step verification service based on a security key.
Cloud Key Management Service - Cloud-hosted key management service integrated
with IAM and audit logging.
Cloud Resource Manager - Service to manage resources by project, folder, and
organization based on the hierarchy.
Cloud Security Command Center - Security and data risk platform for data and services
running in Google Cloud Platform.
Cloud Security Scanner - Automated vulnerability scanning service for applications
deployed in App Engine.
Access Transparency - Near real-time audit logs providing visibility to Google Cloud
Platform administrators.
VPC Service Controls - Service to manage security perimeters for sensitive data in
Google Cloud Platform services.
20
IoT
Cloud IoT Core - Secure device connection and management service for Internet of
Things.
Edge TPU - Purpose-built ASIC designed to run inference at the edge. As of September
2018, this product is in private beta.
Cloud IoT Edge - Brings AI to the edge computing layer.
API Platform
Maps Platform - APIs for maps, routes, and places based on Google Maps.
Apigee API Platform - Lifecycle management platform to design, secure, deploy,
monitor, and scale APIs.
API Monetization - Tool for API providers to create revenue models, reports, payment
gateways, and developer portal integrations.
Developer Portal - Self-service platform for developers to publish and manage APIs.
API Analytics - Service to analyze API-driven programs through monitoring, measuring,
and managing APIs.
Apigee Sense - Enables API security by identifying and alerting administrators to
suspicious API behaviors.
Cloud Endpoints - An NGINX-based proxy to deploy and manage APIs.
21
1.6 Amazon Linux
An Amazon Machine Image (AMI) is a special type of virtual appliance that is used to create a
virtual machine within the Amazon Elastic Compute Cloud ("EC2"). It serves as the basic unit
of deployment for services delivered using EC2. Like all virtual appliances, the main component
of an AMI is a read-only filesystem image that includes an operating system (e.g., Linux, Unix,
or Windows) and any additional software required to deliver a service or a portion of it.
An AMI includes the following:
A template for the root volume for the instance (for example, an operating system, an
application server, and applications)
Launch permissions that control which AWS accounts can use the AMI to launch
instances
A block device mapping that specifies the volumes to attach to the instance when it's
launched
The AMI filesystem is compressed, encrypted, signed, split into a series of 10 MB chunks and
uploaded into Amazon S3 for storage. An XML manifest file stores information about the AMI,
including name, version, architecture, default kernel id, decryption key and digests for all of the
filesystem chunks.
An AMI does not include a kernel image, only a pointer to the default kernel id, which can be
chosen from an approved list of safe kernels maintained by Amazon and its partners (e.g., Red
Hat, Canonical, Microsoft). Users may choose kernels other than the default when booting an
AMI.
When it launched in August 2006, the EC2 service offered Linux and later Sun Microsystems'
OpenSolaris and Solaris Express Community Edition. In October 2008, EC2 added the Windows
Server 2003 and Windows Server 2008 operating systems to the list of available operating
systems. As of December 2010, it has also been reported to run FreeBSD; in March 2011,
NetBSD AMIs became available. In November 2012, Windows Server 2012 support was added.
Amazon has its own Linux distribution that is largely binary compatible with Red Hat Enterprise
Linux, and therefore CentOS. This offering has been in production since September 2011, and
in development since 2010. The final release of the original Amazon Linux is version 2018.03
22
and uses version 4.14 of the Linux kernel. Amazon Linux 2 was announced in June 2018, and is
updated on a regular basis
Types of images
23
1.7 Google Sheets
Google Sheets is a spreadsheet program included as part of a free, web-based software office
suite offered by Google within its Google Drive service. The service also includes Google Docs
and Google Slides, a word processor and presentation program respectively. Google Sheets is
available as a web application, mobile app for Android, iOS, Windows, BlackBerry, and as a
desktop application on Google's ChromeOS. The app is compatible with Microsoft Excel file
formats. The app allows users to create and edit files online while collaborating with other users
in real-time. Edits are tracked by user with a revision history presenting changes. An editor's
position is highlighted with an editor-specific color and cursor and a permissions system
regulates what users can do. Updates have introduced features using machine learning, including
"Explore", offering answers based on natural language questions in a spreadsheet.
Google Sheets is available as a web application supported on Google Chrome, Mozilla Firefox,
Internet Explorer, Microsoft Edge, and Apple Safari web browsers. Users can access all
spreadsheets, among other files, collectively through the Google Drive website. In June 2014,
Google rolled out a dedicated website homepage for Sheets that contain only files created with
Sheets. In 2014, Google launched a dedicated mobile app for Sheets on the Android and iOS
mobile operating systems. In 2015, the mobile website for Sheets was updated with a "simpler,
more uniform" interface, and while users can read spreadsheets through the mobile websites,
users trying to edit will be redirected towards the mobile app to eliminate editing on the mobile
web.
Google Sheets serves as a collaborative tool for cooperative editing of spreadsheets in real-time.
Documents can be shared, opened, and edited by multiple users simultaneously and users are
able to see character-by-character changes as other collaborators make edits. Changes are
automatically saved to Google's servers, and a revision history is automatically kept so past edits
may be viewed and reverted to. An editor's current position is represented with an editor-specific
color/cursor, so if another editor happens to be viewing that part of the document they can see
edits as they occur. A sidebar chat functionality allows collaborators to discuss edits. The
revision history allows users to see the additions made to a document, with each author
distinguished by color. Only adjacent revisions can be compared, and users cannot control how
frequently revisions are saved. Files can be exported to a user's local computer in a variety of
24
formats such as PDF and Office Open XML. Sheets supports tagging for archival and
organizational purposes.
Other Functionalities
A simple find and replace tool is available. The service includes a web clipboard tool that allows
users to copy and paste content between Google Sheets and Docs, Slides, and Drawings. The
web clipboard can also be used for copying and pasting content between different computers.
Copied items are stored on Google's servers for up to 30 days. Google offers an extension for
the Google Chrome web browser called Office editing for Docs, Sheets and Slides that enables
users to view and edit Microsoft Excel documents on Google Chrome, via the Sheets app. The
extension can be used for opening Excel files stored on the computer using Chrome, as well as
for opening files encountered on the web (in the form of email attachments, web search results,
etc.) without having to download them. The extension is installed on Chrome OS by default. As
of June 2019, this extension is no longer required since the functionality exists natively.
Google Cloud Connect was a plug-in for Microsoft Office 2003, 2007 and 2010 that could
automatically store and synchronize any Excel document to Google Sheets (before the
introduction of Drive). The online copy was automatically updated each time the Microsoft
Excel document was saved. Microsoft Excel documents could be edited offline and synchronized
later when online. Google Cloud Connect maintained previous Microsoft Excel document
versions and allowed multiple users to collaborate by working on the same document at the same
time. However, Google Cloud Connect has been discontinued as of April 30, 2013, as, according
to Google, Google Drive achieves all of the above tasks, "with better results".
While Microsoft Excel retains the 1900 leap year bug, Google Sheets 'fixes' it by shifting
all dates prior to March 1, 1900, so entering "0" and formatting it as a date returns
December 30, 1899. Excel, on the other hand, treats "0" as December 31, 1899, which is
formatted to read January 0, 1900.
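Project 3, described later, reads such a sheet every day before loading the data into SQL Server
and then empties it. The sketch below shows how a Google Sheet might be pulled into a pandas
DataFrame using the gspread library; the credentials file, spreadsheet and worksheet names are
placeholders, not the project's actual names.

import gspread
import pandas as pd

# Authenticate with a service-account JSON key (placeholder file name).
gc = gspread.service_account(filename="service_account.json")

# Open the work-analysis spreadsheet and worksheet (placeholder names).
sheet = gc.open("CRB Work Analysis").worksheet("Analyst Input")

# Pull all rows as dictionaries and load them into a DataFrame.
df = pd.DataFrame(sheet.get_all_records())
print(df.head())

# Clear the worksheet once the data has been transferred, as done daily in Project 3.
sheet.clear()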
25
1.8 Pandas
In computer programming, pandas is a software library written for the Python programming
language for data manipulation and analysis. In particular, it offers data structures and operations
for manipulating numerical tables and time series. It is free software released under the three-
clause BSD license. The name is derived from the term "panel data", an econometrics term for
data sets that include observations over multiple time periods for the same individuals.
Library features
Dataframes
Pandas is mainly used for data analysis. It allows importing data from various file formats
such as comma-separated values, JSON, SQL and Microsoft Excel, and supports various data
manipulation operations such as merging, reshaping and selecting, as well as data cleaning and
data wrangling.
26
History
Developer Wes McKinney started working on pandas in 2008 while at AQR Capital
Management out of the need for a high performance, flexible tool to perform quantitative
analysis on financial data. Before leaving AQR he was able to convince management to allow
him to open source the library.
Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor
to the library.
27
2. Project 1 – CRB Error Analysis
Executive Summary
The CRB database holds a huge amount of data which is added, updated and deleted manually.
Since the data is supplied manually, there are many mistakes which cannot be identified easily.
The common mistakes therefore need to be pointed out along with possible solutions, so that the
person rectifying the errors knows the exact position of each error and its solution. This will help
in saving time and increase productivity.
The script currently checks 16 different types of errors and has been tested and deployed
on an Amazon Linux server. The script is also fully automated, so the user does not even have
to run it or provide any input. It is scheduled to run at 12 PM every Monday and e-mail the
results to the concerned authorities.
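The weekly e-mail step mentioned above can be sketched as follows; the SMTP host, sender,
recipients, credentials and attachment name are placeholder values, not the actual configuration
used by the script.

import smtplib
from email.message import EmailMessage
from pathlib import Path

msg = EmailMessage()
msg["Subject"] = "CRB Error Analysis - weekly results"
msg["From"] = "reports@example.com"            # placeholder sender
msg["To"] = "crb-team@example.com"             # placeholder recipients
msg.set_content("Please find attached this week's CRB error analysis output.")

# Attach the generated error workbook (placeholder file name).
report = Path("business_errors.xlsx")
msg.add_attachment(
    report.read_bytes(),
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename=report.name,
)

with smtplib.SMTP("smtp.example.com", 587) as server:    # placeholder SMTP host
    server.starttls()
    server.login("reports@example.com", "app-password")  # placeholder credentials
    server.send_message(msg)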
28
Solution
The solution lies in the use of pandas DataFrames, which provide multiple functionalities for
manipulating and handling data and make it straightforward to implement the checks in
Python.
The CRB extract has 20 tables with massive amounts of data embedded in one file. The data
is so large that, even on an i7 processor, reading and manipulating it takes about 12 hours,
because all 20 tables are embedded in one Excel file distributed across multiple sheets. The
data will also keep growing in the future, which has to be kept in mind. The student also
needed to work out the error patterns and the final shape of the error table to be created.
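The first step of the methodology that follows is to split the single 20-sheet extract into separate
files. A minimal sketch of how that split might be done with pandas is shown below; the file name
is a placeholder.

import pandas as pd

# Read every sheet of the CRB extract at once (placeholder file name);
# sheet_name=None returns a dict of {sheet_name: DataFrame}.
sheets = pd.read_excel("crb_extract_latest.xlsx", sheet_name=None)

# Save each of the ~20 tables to its own file, named after the sheet, so that
# later steps work on one small file at a time instead of one huge workbook.
for name, frame in sheets.items():
    frame.to_excel(f"{name}.xlsx", index=False)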
29
Methodology adopted for Solution
Separating all 20 tables into their respective Excel files to remove the n^2 complexity of
manipulations done on the data. This is done in Python by reading the extract (as sketched
above) and saving each sheet as a separate file named after the sheet. Now we can move
towards the solution.
The first step of the final solution is to take the date of the CRB extract as input from the
user, so that the logs record the date on which the CRB extract file was generated. Since the
extract is downloaded from the S3 bucket, the extract downloaded is always the latest one;
therefore the backup date entered is the latest date of the extract. The date of creation of the
errors file is also recorded.
All 20 Excel files generated from the CRB extract are then imported into pandas
DataFrames. With the split files, all 20 DataFrames load in about 4 minutes, down from the
original 12 hours.
Now independent functions are defined for identifying errors.
The major errors checked in these functions include
1. Checking if spaces are present at the ends of a string. Also if there are
more than 1 space present at the same place anywhere in the required
columns for checking.
2. Checking if the zip code is present in the right format, given that data from
multiple countries is present in the database.
3. Mapping of states to their respective countries and creating error flags
otherwise.
4. If locality is present in the database then state should be present. If state
is present then country should be present. The vice-versa is not true
always.
5. Checking of single space at specific places
6. Checking the right format of PO Box number.
7. Checking if salutations are present or not in the database
8. Checking the right syntax of phone number
9. Checking right syntax of date format
10. Checking of the right business identification formats
An Excel sheet is then defined containing the list of all the checks on the different columns;
no column is checked outside of this list. This helps a person who does not know the code
understand which checks have been done on the extract.
30
The created functions are then applied, depending on the type of check, to different
columns of the different DataFrames and the results saved, using apply(), lambda functions
or both, while also taking care of the naming convention required in the results (a simplified
sketch of this check-and-unpivot pattern follows this list). The problem of empty columns
was also handled so that it does not create problems in the future.
The output data format is then prepared, deciding which columns are needed and their
placement. The melt function comes in handy to unpivot columns and convert the tables to
the required format.
The problem of empty tables is handled in order to avoid errors while appending files.
The files are appended in such a way that the naming convention holds true and errors do
not get mixed up due to shape issues.
All business error files are concatenated into one DataFrame and all individual error files
into another.
The solution to the space issue is provided right in the error files, so it is easy to update the
database with the right set of values.
The results are sorted by unique keys so that all errors related to one row are clustered
together, for the ease of the person who will remedy the errors in the database.
Using ExcelWriter, the errors are written to the respective Excel files.
The final files are emailed to the concerned authorities for rectification of the errors.
The script has been scheduled on an Amazon Linux server which automatically runs it
every Monday and sends the results directly to the authorities by email. Thus, the script is
fully automated.
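As an illustration of the check-and-unpivot pattern referred to above, the sketch below flags
values containing leading/trailing or repeated spaces in a few columns, melts the result into a
long error table and attaches the corrected value as the suggested fix. The column names, sample
rows and error labels are simplified placeholders, not the exact 16 checks used in the project.

import pandas as pd

def space_error(value):
    # Return an error label if the value has leading/trailing or repeated spaces.
    if not isinstance(value, str):
        return None
    if value != value.strip():
        return "leading/trailing space"
    if "  " in value:
        return "multiple consecutive spaces"
    return None

# Placeholder slice of the business_addresses table.
addresses = pd.DataFrame({
    "unique_id": ["B1", "B2", "B3"],
    "locality": ["Denver ", "Portland", "San  Francisco"],
    "region": ["CO", " OR", "CA"],
})

checked = ["locality", "region"]
flags = addresses[checked].applymap(space_error)   # one flag column per checked column
flags["unique_id"] = addresses["unique_id"]

# Unpivot to one row per (entity, column, error) and keep only rows with an error.
errors = flags.melt(id_vars="unique_id", var_name="attribute", value_name="error")
errors = errors.dropna(subset=["error"]).sort_values("unique_id")

# For space issues the suggested fix is simply the re-spaced original value.
errors["suggested_value"] = errors.apply(
    lambda r: " ".join(
        str(addresses.loc[addresses.unique_id == r.unique_id, r.attribute].iloc[0]).split()
    ),
    axis=1,
)
print(errors)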
31
Output Sample
32
Figure 2.3 Email Output CRB Error Analysis
33
3. Project 2 – CRB Completion Rates
Executive Summary
The aim of this project is to showcase the completion rates of the fields of the database to the
clients. The dashboard tab created should be so communicative that the client gets a concrete
idea about the completion rates of the fields. The dashboard will also contain historic data for
the client's reference, meaning that old data is not lost but retained. The dashboard tab will be
incrementally updated every week.
The dashboard tab will also help the client assess the work done by the team in increasing the
percentage of completion of the fields.
The final delivered dashboard tab runs on 8 scripts which overall execute 48 scripts, all
completely automated, tested and deployed on an Amazon Linux server. All scripts are
scheduled to run on Saturday of every week so that the dashboard can be refreshed on Monday
morning.
In addition, there are multiple checks built into the scripts which act as a failsafe: if anything
goes wrong, the scripts exit. There are also options to roll back to previous data whenever a
script fails, as all previous data is uploaded to the S3 bucket in compressed .parquet format.
34
Solution
The problem is large, and there was no well-defined final solution at the beginning. To start
with, the student had to survey the team leaders of the project for weeks, trying to arrive at
an agreed final output. The company uses the concept of a storyboard, in which the proposed
look of the dashboard tab is presented as a PowerPoint presentation along with the logic to
be used for the manipulations that lead to the solution.
The end goal is a dashboard tab that is simple to understand and very effective in
communicating numbers to the clients of the project. The tab showcases the completion rates
achieved by ZIGRAM from 11th July 2019.
35
Methodology adopted for solution
Separating all 20 tables into their respective Excel files to remove the n^2 complexity of
manipulations done on the data; this is done in Python by reading the extract and saving
each sheet as a separate file named after the sheet (the same split used in Project 1)
Get the extracts of the latest Friday of the latest week and of the previous week
36
Completion snapshot extract making process
6 CRB extracts are downloaded from the S3 bucket for the completion snapshot
These are then split into sheet-specific Excel files and saved in well-defined folders
The backup and master dates are saved in 3 dictionaries in pickle format, to be used later by
the runner scripts
Once the extracts are downloaded, each extract is picked up one by one and all its sheets
are saved as separate Excel files named after the sheets
The CRB base data script is then run; it shifts the current data to backup tables, replaces the
current CRB database with the new tables and adds a row to the table
dump_cycle_index_temp
Figure 3.2 CRB Completion Snapshot Extract Maker Algorithm Flow Diagram
37
Completion Rates Algorithm
The dates pickle is loaded and checked to confirm that it contains the latest Friday's date as
the master date; if not, the script exits
Extract Excel files are picked up one by one; trackers are created and saved as pickles
The Entity_Mapping_Data table is updated with the new entities present in the mastersheets
using an SQL query
The trackers created are concatenated, and constraints are set on the updated_at column so
that the entity-merging issue is taken care of
In the merging issue, when entities are deleted and re-added under another entity, the
updated_at in the trackers takes the created_at of the already present entity, which lies
outside the range of the backup and master dates
To solve this, if updated_at is behind the backup date it is scaled up to the backup date, and
if it is ahead of the master date it is scaled back to the master date. Example - if the backup
date is “2020-06-06” and updated_at in the tracker is “2018-06-07”, then updated_at is
scaled to “2020-06-06”
Flags are now created for different sets of features, e.g. 'added but to null', 'changed but to
null', etc.
The previous grouped-sum max dates are taken as input and appended to the current
concatenated trackers
Using filters.xlsx, completion rates pertaining to the filters state, country, tier and
licensing_authority_id are generated for the dates between the backup date and the master
date
The grouped sum has to be rectified, as it now contains some fields with more than one
row each. To solve this, the grouped sum is sorted in ascending order including updated_at;
duplicates are then removed without including updated_at, and only the last values are kept
(a simplified sketch of this fix follows this list)
38
The grouped sum is saved in .parquet format to be used for the next run, and the completion
rates are appended to the u_completion_rate table and replaced in the t_completion_rate table
The final mapped grouped sum, appended with the date of the script run, is zipped and sent
to the S3 bucket for safekeeping and as a failsafe for future purposes
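A simplified sketch of the two fixes described above, clipping updated_at into the backup-master
window and de-duplicating the grouped sum while keeping the last values, is shown below; the
column names and sample rows are illustrative, not the actual table layout.

import pandas as pd

backup_date = pd.Timestamp("2020-06-06")
master_date = pd.Timestamp("2020-06-12")

tracker = pd.DataFrame({
    "unique_id": ["E1", "E2"],
    "attribute": ["box_number", "country"],
    "updated_at": pd.to_datetime(["2018-06-07", "2020-06-10"]),
})

# Scale updated_at into the [backup_date, master_date] window to handle merged entities.
tracker["updated_at"] = tracker["updated_at"].clip(lower=backup_date, upper=master_date)

grouped_sum = pd.DataFrame({
    "unique_id": ["E1", "E1", "E2"],
    "attribute": ["box_number", "box_number", "country"],
    "filled": [1, 1, 1],
    "updated_at": pd.to_datetime(["2020-06-06", "2020-06-08", "2020-06-10"]),
})

# Sort in ascending order including updated_at, then drop duplicates without it,
# keeping only the last row, so each field contributes a single most recent value.
grouped_sum = grouped_sum.sort_values(list(grouped_sum.columns))
keys = [c for c in grouped_sum.columns if c != "updated_at"]
grouped_sum = grouped_sum.drop_duplicates(subset=keys, keep="last")
print(grouped_sum)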
39
Figure 3.4 Generalized Tracker Creation Process
40
Completion snapshot
Completion Snapshot determines the completion rates which were affected during a
specific timeline like weekly, monthly or quarterly
A dates pickle is introduced for the weekly, monthly and quarterly processes; it contains the
backup and master dates for which each process will be run, as a dictionary with numbered
dates as values and folder names as keys
The dates pickle is calculated from the latest Thursday, the penultimate Thursday, last
month's last Thursday and the latest quarter's Thursday (a sketch of this date calculation
follows this list)
Examples – dates_weekly={1: [30, 5, 2020], 2: [5, 6, 2020]}, dates_monthly={1: [30, 5,
2020], 2: [5, 6, 2020]}, dates_quarterly={1: [4, 4, 2020], 2: [5, 6, 2020]}
Extract files are picked up one by one; trackers are created for each process and saved as
pickles in loops
A total of 120 trackers are created, which takes about 1 hour 30 minutes to complete
In the tracker creation process, the tracker between the backup and master date files is
created first. Thereafter, a tracker between an empty extract and the backup date file is
created, for only those entities which were present in the trackers between the backup date
and the master date
A new loop is introduced to calculate completion rates for each process. The reason
behind introducing new loop is to help in debugging code for future purposes
The merging issue arises here as well: when an entity is deleted and added into another entity, the updated_at in the trackers takes the created_at of the already present entity, which is out of range of the backup and master dates.
To solve this, if updated_at is behind the backup date it is scaled up to the backup date, and if it is ahead of the master date it is scaled back to the master date. Example: if the backup date is "2020-06-06" and updated_at in the tracker is "2018-06-07", then updated_at is scaled to "2020-06-06".
Flags are now created for different sets of features, for example "added but to null", "changed but to null", etc.
Using filters.xlsx, completion rates for the filters state, country, tier and licensing_authority_id are generated for the dates between the backup date and the master date.
State and country are mapped to the completion rates and the timeline (i.e. weekly, monthly or quarterly) is added.
The completion rates for the completion snapshot are replaced in the original table for each timeline, one by one.
The completion snapshot does not require any previous grouped sum, as all the required data is created within the script itself on every run.
Fields for changes in Completion rate tab
1. Business_addresses
Box_number
Country
Locality
Postal_code
Premise
Region
Street
Sub_locality
Sub_premise
Sub_region
2. Business_aliases
Created_at
Alias_name
Alias_name_type
Quality
3. Business_emails
Address
4. Business_licenses
License_identifier
Med_rec_classification
Status_date
Uniform_license_status_description
Uniform_license_type_description
5. Business_mastersheet
Business_license
Cik
Created_at
Fein
Name
Name_type
Stock_exchange
Stock_symbol
Tier
url
6. Business_phones
Classification
Number
7. Business_registrations
Classification
Number
State
8. Business_source_docs
Attachment_content_type
Attachment_file_name
Attachment_file_size
Attachment_updated_at
Description
Note
url
9. Business_subordinates
Role
Subordinate_link
Subordinate_name
Subordinate_type
Title
10. Business_superiors
Title
Superior_link
Superior_name
Superior_type
Role
11. Business_web_sites
Classification
url
12. Individual_addresses
box_number
country
locality
postal_code
premise
region
street
sub_locality
sub_region
sub_premise
13. Individual_alias
created_at
education_suffix
first_name
generation_suffix
last_name
middle_name
prefix
quality
14. Individual_emails
address
15. Individual_licenses
License_identifier
Med_rec_classification
Status_date
Uniform_license_status_description
Uniform_license_type_description
16. Individual_master
Created_at
Date_of_birth
Education_suffix
First_name
Generation_suffix
Last_name
Middle_name
Prefix
url
17. Individual_phones
Classification
Number
18. Individual_source_docs
Attachment_content_type
Attachment_file_name
Attachment_file_size
Attachment_updated_at
Description
Note
url
19. Individual_subordinates
Role
Subordinate_link
Subordinate_name
Subordinate_type
Title
20. Individual_web_sites
Classification
url
Order of Script Run
1_CRB_Extract_Maker.py (45 min) (Scheduled at 12 PM on Saturday)
2_completion_snapshot_Extract_Maker.py (1 hour 30 min) (Scheduled at 1 PM on Saturday)
3_CRB_Base_Data_Update_Script.py (10 min) (Scheduled at 3 PM on Saturday)
4_completion_rate_Runner.py (1 hour 30 min) (Scheduled at 3:45 PM on Saturday)
5_completion_snapshot_Runner.py (4 hours 20 min) (Scheduled at 6:30 PM on Saturday)
6_t_completion_rate.ipynb (5 min)
7_excel_to_sql.py (5 min)
8_CRB_tables.py (5 min)
Output Samples
1. Tracker output
unique_id entity licensing_authority_id from to attribute data-field flag updated_at backup_date master_date
2453_addresses_2018-03-08T23:04:18.275Z 2453 # 715 business_addresses box_number deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
3206_addresses_2018-03-08T23:02:23.088Z 3206 PO Box 285 business_addresses box_number deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
83964_addresses_2020-03-12T17:55:21.764Z 83964 Park Park City business_addresses locality changed 2020-06-05T21:44:09.723Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
100547_addresses_2020-03-05T10:21:32.773Z 100547 Murray business_addresses locality deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
109749_addresses_2020-05-29T04:44:24.017Z 109749 Fairmont business_addresses locality deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
109749_addresses_2020-05-29T04:44:24.920Z 109749 Weirton business_addresses locality deleted 2020-05-29T00:00:00.000Z 2020-05-29T00:00:00.000Z 2020-06-05T00:00:00.000Z
2. Completion Rate
3. CRB Dashboard
1. Main Page
3. Completion rates Page (Business Filter)
5. License Datafields (Individual Filter)
4. Completion Snapshot
1. Completion rates (Weekly Filter) (Business filter)
2. Completion rates (Weekly Filter) (Individual filter)
4. Completion rates (Monthly Filter) (Individual filter)
6. Completion rates (Quarterly Filter) (Individual filter)
8. License Data Fields (Weekly Filter) (Individual filter)
10. License Data Fields (Monthly Filter) (Individual filter)
12. License Data Fields (Quarterly Filter) (Individual filter)
4. Project 3 - Work Done Analysis
Executive Summary
CRB Work Analysis is a ZIGRAM initiative to obtain a comprehensive analysis of the work being done by the CRB team on a day-to-day basis. The aim of this project is to assess the productivity of the team on the basis of tasks assigned, tasks completed and other parameters. It also helps assess the supervisors of the team through an automated process.
The work done is now also shared with the client to showcase the team's output, and it has helped ZIGRAM increase revenue from the clients.
The project has helped standardize the team's work-reporting process and has also helped the operations leaders reward the good performers and support the weaker ones just by looking at the Work Done reports, saving their time and effort as well.
Solution
The problem is large and there was no well-defined final solution at the beginning; the mentors had no ready solution either, since the idea was new.
The project is built on Google Cloud Platform and uses the Google Drive and Google Sheets APIs.
The cloud project access currently permits 100 requests per 100 seconds. As of now, a maximum of 60 requests is used in every run of the scripts.
Methodology adopted for solution
Input
1. Supervisor Page
Work Allotted Script
All team supervisors fill their respective Google Sheets with the work allotted to their specific analysts.
At 10:15 AM their sheets are compiled and a string is generated based on their inputs of the work allotted to each team member, as sketched below.
After standardization, this string is e-mailed to each supervisor and to each ops leader.
The string contains the comma-separated work allotted to each team member, with the state and assigned counts in brackets.
A second string contains the names of team members who are on leave.
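A minimal sketch of the compilation step using the gspread client for the Google Sheets API; the service-account file, sheet names and column headings are assumptions made for illustration:

```python
import gspread

# Authenticate with a service-account key file (placeholder path).
gc = gspread.service_account(filename="service_account.json")

allotted_parts, on_leave = [], []
for sheet_name in ["Supervisor_A_Allotment", "Supervisor_B_Allotment"]:  # assumed sheet names
    ws = gc.open(sheet_name).sheet1
    for row in ws.get_all_records():  # one dict per filled row, keyed by the header row
        if str(row.get("Status", "")).strip().lower() == "leave":
            on_leave.append(row["Team Member"])
        else:
            allotted_parts.append(
                f'{row["Team Member"]} - {row["Task"]} ({row["State"]}, {row["Count"]})'
            )

work_allotted_string = ", ".join(allotted_parts)  # e-mailed to supervisors and ops leaders
leave_string = ", ".join(on_leave)                # second string: members on leave
```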
Output
Work Done Script
All team supervisors will have filled their respective Google Sheets with the work allotted to their specific analysts.
All team analysts and supervisors fill their respective Google Sheets with the work done by them during the day.
At 10 PM their sheets are compiled and a table is generated based on their inputs of the work allotted to and work done by each team member.
The Work Allotted and Work Done data are outer joined, and blank spaces are marked with '-'.
If a team member is marked absent by the supervisor, he/she is marked "Not Available" in the table.
If the subtask is not filled in by the analyst, the comments are considered as the subtask; if those are also unavailable, the task is considered as the subtask.
Consequently, if no subtask is available, the row is removed from consideration.
Filling the date is not required; it is created in the code itself.
If the name is not filled in the analyst sheets, it is substituted by the code in the right place.
The work done table is shown as an HTML table, and an Excel file is also sent as a supplement (a join-and-fallback sketch follows below).
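A sketch of the join and fallback logic in pandas; the compiled Work Allotted and Work Done inputs and their column names are assumptions for illustration:

```python
import pandas as pd

work_allotted = pd.DataFrame({
    "Team Member": ["A", "B"], "Task": ["Licenses", "Addresses"], "Count Assigned": [20, 15],
})
work_done = pd.DataFrame({
    "Team Member": ["A"], "Task": ["Licenses"], "Subtask": [None],
    "Comments": ["OR batch"], "Count Completed": [18],
})

# Outer join of allotted and done data.
table = work_allotted.merge(work_done, on=["Team Member", "Task"], how="outer")

# Subtask fallback: Comments if Subtask is missing, otherwise the Task itself;
# rows with no usable subtask would be dropped here.
table["Subtask"] = table["Subtask"].fillna(table["Comments"]).fillna(table["Task"])
table = table.dropna(subset=["Subtask"])

# Members marked absent by their supervisor become "Not Available" (placeholder set);
# all remaining blanks are shown as '-'.
absent = {"B"}
table["Count Completed"] = table["Count Completed"].astype(object)
table.loc[table["Team Member"].isin(absent), "Count Completed"] = "Not Available"
table = table.fillna("-")

html_table = table.to_html(index=False)        # embedded in the e-mail body
table.to_excel("work_done.xlsx", index=False)  # supplementary Excel file
```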
Output
Figure 4.5 CRB Work Done Output Email Head
Weekly Work Done
1. Activity Wise
Data for the last 7 days is acquired via a SQL query.
The activity-wise table is generated by a groupby on task (Activity).
Allocated is the sum of the counts allocated for the group.
Completed is the sum of the counts completed for the group.
Completed but not assigned is the sum of (counts completed - counts assigned) where counts completed - counts assigned > 0.
Total Completed is the sum of Completed and Completed but not assigned for the group.
Pending is the sum of (counts assigned - counts completed) for the group where counts assigned > counts completed.
The result is saved in an Excel sheet named "Activity_Wise" (see the sketch after this list).
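A sketch of the activity-wise aggregation in pandas, following the definitions above literally; the base-data column names are assumptions:

```python
import pandas as pd

# Illustrative base data for the last 7 days (one row per analyst per day).
base = pd.DataFrame({
    "task": ["Licenses", "Licenses", "Addresses"],
    "counts_assigned": [20, 10, 15],
    "counts_completed": [18, 14, 9],
})

diff = base["counts_completed"] - base["counts_assigned"]
base["extra"] = diff.clip(lower=0)        # completed but not assigned
base["pending"] = (-diff).clip(lower=0)   # assigned but not completed

activity_wise = base.groupby("task").agg(
    Allocated=("counts_assigned", "sum"),
    Completed=("counts_completed", "sum"),
    Completed_but_not_assigned=("extra", "sum"),
    Pending=("pending", "sum"),
).reset_index()
activity_wise["Total Completed"] = (
    activity_wise["Completed"] + activity_wise["Completed_but_not_assigned"]
)

activity_wise.to_excel("weekly_work_done.xlsx", sheet_name="Activity_Wise", index=False)
```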
Output
2. Researcher Activity Wise
Data for the last 7 days is acquired via a SQL query.
The researcher-activity-wise table is generated by a groupby on task and team member.
No. of Days is generated by aggregating the Date column.
Allocated is the sum of the counts allocated for the group.
Total Completed is the sum of the counts completed for the group.
Work done per man-day is counts completed divided by No. of Days.
The result is saved in a second Excel sheet named "Researcher_Activity_Wise".
The initial result of the SQL query is saved in a third Excel sheet named "Base_data".
Output
Figure 4.8 CRB Weekly Work Done Output Researcher Activity Wise
Email
All the sheets are combined and saved in one Excel file.
This Excel file is then sent via e-mail to the relevant authorities (a sketch follows below).
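A sketch of combining the sheets into one workbook and e-mailing it, using pandas ExcelWriter and the standard-library email/smtplib modules; the DataFrames, addresses and SMTP host are placeholders:

```python
import smtplib
from email.message import EmailMessage
import pandas as pd

# Placeholder result frames (in practice these come from the weekly aggregations).
activity_wise = pd.DataFrame({"task": ["Licenses"], "Allocated": [30]})
researcher_wise = pd.DataFrame({"team_member": ["A"], "Total Completed": [18]})
base = pd.DataFrame({"task": ["Licenses"], "counts_completed": [18]})

# Write all sheets into one Excel file.
with pd.ExcelWriter("weekly_work_done.xlsx") as writer:
    activity_wise.to_excel(writer, sheet_name="Activity_Wise", index=False)
    researcher_wise.to_excel(writer, sheet_name="Researcher_Activity_Wise", index=False)
    base.to_excel(writer, sheet_name="Base_data", index=False)

# E-mail the workbook to the relevant authorities (addresses and host are placeholders).
msg = EmailMessage()
msg["Subject"] = "CRB Weekly Work Done"
msg["From"] = "[email protected]"
msg["To"] = "[email protected]"
msg.set_content("Please find attached the weekly work done report.")
with open("weekly_work_done.xlsx", "rb") as f:
    msg.add_attachment(f.read(), maintype="application", subtype="octet-stream",
                       filename="weekly_work_done.xlsx")
with smtplib.SMTP("smtp.example.com", 587) as smtp:
    smtp.starttls()
    smtp.send_message(msg)
```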
Output Sample
5. Project 4 - License Changes
Executive Summary
The aim of this project is to showcase the license changes of the database fields to the clients. The dashboard tab created for this should be communicative enough that the client gets a concrete idea about the license changes of each field. The dashboard will also contain historic data for the client's reference, so old data is not lost but preserved. The dashboard tab will be incrementally updated every week.
The dashboard tab will also help the client assess the work done by the team in increasing the completion percentage of the fields.
The final delivered dashboard tab runs on a single script which is fully automated and tested. All scripts are scheduled to run every Saturday so that the dashboard can be refreshed on Monday morning.
In addition, there are multiple checks in the scripts which act as failsafes: if anything goes wrong, the scripts exit. The dashboard tab acts as a means to gain insight into the team's efforts on the updating and changing of licenses on a regular basis, a requirement currently not fulfilled by any other dashboard tab.
A dashboard tab therefore communicates the team's efforts to the clients and also helps increase ZIGRAM's revenues in the long term. The project also aims to reduce ZIGRAM's weekly client calls with operations leaders, saving time for both parties, since the client will have access to all license-change insights 24x7 with regular updates every week.
Solution
The problem is large and there was no well-defined final solution at the beginning. To start with, the student had to survey the team leaders of the project for weeks to arrive at an agreed final output. The company uses the concept of a storyboard, where the intended dashboard tab is projected in the form of a PowerPoint presentation that showcases the logic to be used for the manipulations leading to the solution.
The end goal is a dashboard tab that is simple to understand and very effective in communicating numbers to the clients of the project. The tab showcases the work done on the licenses by ZIGRAM from 1st October 2019 and will be refreshed every week using static files.
Methodology adopted for solution
Getting Data
Get all links from the batch exports API.
Keep only those links which are JSON.
Decrypt them by replacing a fixed part of the string with another.
Go to the specific links to get the individual and business licenses files.
Identify these files as either business or individual.
Save these files by creating respective folders based on the date of the CRB extract.
Convert the folder names to incremental numbers (see the sketch below).
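A minimal sketch of this collection step with the requests library; the API endpoint, the string replacement and the link layout are all placeholders:

```python
import os
import requests

# Placeholder batch-exports API returning a list of export links.
links = requests.get("https://api.example.com/batch_exports").json()["links"]

# Keep only JSON links and "decrypt" them by swapping a fixed part of the URL.
json_links = [l.replace("/encrypted/", "/exports/") for l in links if l.endswith(".json")]

for link in json_links:
    resp = requests.get(link)
    kind = "business" if "business" in link else "individual"
    extract_date = link.rstrip("/").split("/")[-2]  # assumed: extract date encoded in the link
    folder = os.path.join("licenses", extract_date)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, f"{kind}_licenses.json"), "wb") as f:
        f.write(resp.content)
```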
Compare files received
Now create a dictionary of the dates involved in the individual and business licenses files.
Use these dates to compare the files and generate trackers.
Append all individual trackers into one file (the complete individual tracker).
Append all business trackers into one file (the complete business tracker).
Manipulation on data
Append these two files into one dataframe for further processing.
Merge this result with a SQL query on v_entity_overview to get region, country, licensing authority id and name for the licenses from the DB.
Get some more information from entity_mapping_data for address_region and country when they are not available in v_entity_overview.
Get the end of week as Thursday from the updated_at column.
Sort values.
Get FEID and its lag for comparison to get the date difference.
Drop some unwanted columns.
Create a super unique id for unique row identification by joining the unique entity, updated_at and backup date.
Get licenses created, inactive licenses, both changed, only date changed and only status changed by creating flags.
To get both status and date changed, the data is arranged so that the values sit in the same row, appended using the super unique id.
Drop any unwanted columns (a sketch of the Thursday and super-unique-id steps follows below).
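A sketch of two of these steps in pandas, assuming "end of week as Thursday" means rolling each updated_at forward to the following Thursday; all column names are illustrative:

```python
import pandas as pd

licenses = pd.DataFrame({
    "unique_entity": ["2453", "83964"],
    "updated_at": pd.to_datetime(["2020-06-02", "2020-06-05"]),
    "backup_date": pd.to_datetime(["2020-05-29", "2020-05-29"]),
})

# End of week as Thursday: dates already on a Thursday stay put,
# everything else rolls forward to the next Thursday (weekday 3).
days_ahead = (3 - licenses["updated_at"].dt.weekday) % 7
licenses["end_of_week"] = licenses["updated_at"] + pd.to_timedelta(days_ahead, unit="D")

# Super unique id for row identification: unique entity + updated_at + backup date.
licenses["super_unique_id"] = (
    licenses["unique_entity"] + "_"
    + licenses["updated_at"].dt.strftime("%Y-%m-%d") + "_"
    + licenses["backup_date"].dt.strftime("%Y-%m-%d")
)
```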
Fill the empty fields going into the groupby with '0'.
Create 15 groupby aggregations, covering all possible combinations of the groupby columns.
Append and sort all results into one dataframe.
Outer Join to get other combinations
Get the total entities combination and licenses combination results using a SQL query, with all combinations of entity and licenses grouped by and aggregated into results.
Outer join the existing dataframe with both results of the query, one by one.
Replace '0' with an empty string.
Create aggregate_id.
Sort the data again.
Shift aggregate_id by 1 to fill empty entity values using bfill.
Repeat the same step to fill empty licenses values.
Drop rows which do not make sense logically according to the Power BI visuals.
Reporting data
Import the latest data from an Excel file.
Send it to the SQL server.
Import the structured reporting data into the script for the current week and the previous week.
This is done to give a comparative analysis of the reporting data.
Format the data.
Outer join both reporting dataframes to the existing dataframe by creating an end-of-week column inside the reporting data to avoid future issues.
Rename a few columns as required by the Power BI visual.
Send the table back to the SQL server (a round-trip sketch follows below).
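A sketch of the Excel-to-SQL round trip with pandas and SQLAlchemy; the connection string, table name and query are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder SQL Server connection (DSN and driver details are assumptions).
engine = create_engine("mssql+pyodbc://user:password@crb_dsn")

# Import the latest reporting data from Excel and send it to the server.
reporting = pd.read_excel("reporting_data.xlsx")
reporting.to_sql("u_reporting_data", engine, if_exists="replace", index=False)

# Read back structured reporting data for the current and previous week
# for the comparative analysis (query is illustrative).
last_two_weeks = pd.read_sql(
    "SELECT * FROM u_reporting_data WHERE end_of_week >= DATEADD(week, -2, GETDATE())",
    engine,
)
```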
Final data
Update u_tracker_index with the new backup_date and master_date.
Left join with the state code table to get full state/province names.
Left join with the country code table to get full country names.
Group by to get cumulative records and licenses.
Append the incremental data to the SQL server in the table u_license_changes_aggregation (see the sketch below).
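A sketch of the final joins and cumulative aggregation in pandas; the lookup tables and counts below are illustrative, and the write-back call is left commented out:

```python
import pandas as pd

changes = pd.DataFrame({
    "state_code": ["OR", "OR", "CA"],
    "end_of_week": pd.to_datetime(["2020-05-28", "2020-06-04", "2020-06-04"]),
    "records": [3, 5, 2],
    "licenses": [4, 6, 2],
})
state_codes = pd.DataFrame({"state_code": ["OR", "CA"],
                            "state_name": ["Oregon", "California"]})

# Left join to get full state/province names (country codes would be joined the same way).
changes = changes.merge(state_codes, on="state_code", how="left")

# Cumulative records and licenses per state over time.
changes = changes.sort_values(["state_code", "end_of_week"])
cums = changes.groupby("state_code")[["records", "licenses"]].cumsum()
changes["cum_records"] = cums["records"]
changes["cum_licenses"] = cums["licenses"]

# Append the incremental data to the aggregation table on the SQL server, e.g.:
# changes.to_sql("u_license_changes_aggregation", engine, if_exists="append", index=False)
```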
Proposed Output
Delivered Output (Filter- US, OR, all, OR-OLCC)
6. Other Tasks Completed
1. Weekly dashboard update – The CRB Dashboard has to be updated every Monday, which takes about half a day. The dashboard is a direct image of ZIGRAM's efforts presented to the clients, so it has to be perfect; there is no room for mistakes here.
2. Generate test datasets – ZIGRAM has multiple projects and all of them need a lot of data to function and improve accuracy. In the first 3 weeks of my internship, I generated approximately 5000 rows of test data for 2 different projects.
3. Storyboards – The dashboards deployed by me were also designed by me. ZIGRAM uses the concept of storyboards, where the final output is first mocked up in a PowerPoint presentation and then visualized. It took us weeks to arrive at a common understanding of the final view.
4. Monthly reports – I was also tasked with preparing monthly work-done reports for the months of April and May, which were shared directly with the client.
5. Setting up Amazon Linux – Since all the scripts are deployed on an Amazon Linux server, I had to set up its instance along with all the software and libraries required for the scripts to function, run and be scheduled properly.
7. Conclusion
Working with a dataset as huge as CRB is a challenge in itself. CRB has dozens of tables already in use, which makes it very fragile and leaves no room for mistakes. Since all my projects were related to dashboards and numbers, there were multiple rounds of quality checks on the results, which would sometimes last a week. The most difficult part to get through was the completion rates grouped-sum logic and its quality check.
Overall, I delivered 53 scripts and 2 dashboard tabs during my tenure as a data science intern at ZIGRAM. We tried to fulfil all the demands of the client in the most cost-effective manner. The CRB Work Analysis project was quoted at a recurring cost of about $1000 per month when a professional software company working in this area was consulted. By building it ourselves, we delivered the project completely free of cost using free Google APIs, without compromising security. Since the project is free of any subscriptions, there will be no recurring future costs.
There were many problems in getting the latest CRB extracts on the API deployed on the S3 bucket. All the scripts depend on the consistency of the CRB extract links. Over time, however, the extracts have become fairly consistent: the problem of not getting extracts has been reduced from 1 in 7 days to 1 in 40 days as of now.
Overall, working with such huge data has its own ups and downs. The experience I gained from working with this data will always help me analyze data better and achieve targets faster, since I now have a thorough insight into the technical, logical and syntactical problems faced by an analyst.
8. Bibliography and References
1. https://pandas.pydata.org/
Reference for the Pandas library of Python
2. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Reference for DataFrames, which are the backbone of all the analysis
3. https://aws.amazon.com/mp/linux/
Setting up the EC2 instance and libraries for smooth functioning of the scripts and other requirements
4. https://www.udemy.com/course/useful-excel-for-beginners/
Useful Excel required for smooth handling of Excel-related issues in data analysis