DBAmp Refresh and Replicate Optimizations
Certificate
I, Khushboo Yadav (05304092017), certify that the MCA Dissertation Project Report entitled
“DBAMP REFRESH AND REPLICATE OPTIMIZATIONS” is done by me and is an
authentic work carried out by me at ThoughtFocus, Gurgaon. It is submitted in partial
fulfilment of the requirements for the award of the degree of Master of Computer Applications
at the Department of IT, Indira Gandhi Delhi Technical University for Women. The matter
embodied in this project work has not been submitted earlier for the award of any degree or
diploma, to the best of my knowledge and belief.
Date:
Certified that the Project Report entitled “DBAMP REFRESH AND REPLICATE
OPTIMIZATIONS” done by the above student has been completed under my guidance. It has
not been submitted elsewhere, either in part or in full, for the award of any other degree or
diploma, to the best of my knowledge and belief.
UNDERTAKING REGARDING ANTI-PLAGIARISM
Khushboo Yadav
05304092017
ACKNOWLEDGEMENT
This dissertation would not have been possible without the guidance and help of
several individuals who, in one way or another, contributed and extended their valuable
assistance in the preparation and completion of this project. I would be failing in my
endeavour if I did not place my acknowledgement on record.
The internship opportunity I had with ThoughtFocus, Gurgaon was a big milestone in
my career development, and I feel privileged to have had the opportunity to be a part of it.
I am also grateful for the chance to meet so many wonderful people and professionals
who guided me through this internship period. I will strive to use the skills and
knowledge I have gained in the best possible way.
I would like to express my deepest thanks to my guide Mr. Sourabh Bharti, Assistant
Professor, Department of IT, for his continuous support, patience and valuable advice till
the end of my project work.
I express my sincere gratitude to Mr. Manish Singh for his inspiration, constructive
suggestions and affectionate guidance in my work, without which completing this project
would have been impossible for me.
Khushboo Yadav
(05304092017)
TABLE OF CONTENTS
S No Topic
1 Certificate
2 Acknowledgements
3 List of Tables/Figures/Symbols
4 Chapter-1: Introduction
5 Chapter-2: System Analysis
6 Chapter-3: Software Requirements Specifications
7 Chapter-4: System Design
8 Chapter-5: Implementation and Testing
9 Chapter-6: Snapshots
10 Chapter-7: Conclusion
11 Chapter-8: List of References
LIST OF TABLES
Table No Title
1 Hardware used for development
5.2 Server computer configuration
5.3 Computer configuration (minimum and recommended)
LIST OF FIGURES
Fig 1.7 Gantt chart of the project plan
LIST OF SYMBOLS
Chapter 1: Introduction
1.1 About the Organization
1.1.1 ThoughtFocus
The organization has grown rapidly since its inception, and is now a mid-sized
company, as well as part of the Blackstone portfolio. The founders hold executive
positions within the organization and are actively involved with clients and projects.
1.1.2 Blackstone
The DBAmp replicate (ActiveBatch) job is used to perform a full replication of data to the
SQL database. Currently the job runs every Saturday and takes about 15 hours to complete
the data replication.
With the passage of time there is a regular increase in data, and thus a continuous increase in
replication time. The maximum time permissible for successful execution of this job is
840 minutes; if the job has not completed by then, it fails with a timeout error,
leading to data loss.
So there is a need to validate this process by some means and try to decrease the manual
effort of the user.
Time of execution is also a big constraint in this fast-moving world, so alternative
techniques need to be explored and implemented to achieve lower execution times and
better performance of the jobs.
1.4 Terminologies used in Salesforce:
1.4.1 Apex
Apex is a programming language that enables developers to execute flow and
transaction control statements on the Force.com platform. As a language, Apex is
integrated, easy to use, data focused, rigorous, hosted, multitenant
aware, automatically upgradeable, easy to test, and versioned.
1.4.3 Salesforce Lightning
Salesforce Lightning is a component-based framework that contains the collection of
tools and technologies behind the upgrade of the Salesforce1 platform. It allows
third-party applications built by customers to sit on top of Salesforce applications.
1.4.4 Apex
Apex code is the first multitenant, on-demand programming language for developers
interested in building the next generation of business applications. Apex revolutionizes
the way developers create on-demand applications.
While many customization options are available through the Salesforce user interface,
such as the ability to define new fields, objects, workflow, and approval processes,
developers can also use the SOAP API to issue data manipulation commands such
as delete(), update() or upsert(), from client-side programs.
1.5 Methodology
Agile Scrum Methodology
First of all, for every quarter a set of epics is decided that are to be completed within that
quarter. Each epic is further divided into smaller tasks, and these epics are prioritized and
assigned to resources accordingly. Then a sprint of 10 days is created, in which a resource
must select a set of tasks from its assigned and prioritized epic to be completed within those
10 days. All this is reviewed by a scrum master. Every day a scrum meeting is held by the
scrum master to review the status of the tasks. On the last day of the sprint, a retrospective
is done by the team, the scrum master and the team director.
Component    Requirement
Processor    Intel Core i3
RAM          4 GB, DDR3
Hard Drive   500 GB
Table 1
Software used:
Tortoise Git – a Git client used for merging changes into the main branch; this works
against the Salesforce sandbox.
Hybrid Framework
Jira for task tagging
Chapter 2: System Analysis
2.1 Introduction
DBAmp is a data integration tool between Salesforce and Microsoft SQL Server. It
allows us to pull down local copies of the Salesforce DB so we can perform complex
analysis on the data without Salesforce's resource limits. We currently pull down 2
different Salesforce databases: Innovations (ours) and GSO.
There are 2 different DBAmp methods we use to pull data from Salesforce: replications
and refreshes.
1. Replication: This drops the previously replicated database and copies over the
Salesforce db into our own local db
2. Refresh: This takes diffs between the locally copied db and Salesforce's db
and layers them on top of our local copy
We currently refresh every hour and replicate once a week - replications start at 1 AM
NY time every Saturday. We replicate every week to ensure our local copy is exactly the
same as our Salesforce DB (there's a chance refreshes could miss an update if something
is updated at the exact same time as the refresh).
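In DBAmp terms, the two methods map onto its SF_Replicate and SF_Refresh stored procedures. A minimal sketch, assuming DBAmp is installed on the local SQL Server, the linked server to Salesforce is named SALESFORCE, and Account is just an example object (exact parameters should be checked against the DBAmp documentation):

    -- Replication: drop the local Account table and rebuild it in full from Salesforce
    exec SF_Replicate 'SALESFORCE', 'Account'

    -- Refresh: pull only the rows changed since the last refresh and layer them onto
    -- the local copy ('Yes' asks DBAmp to fall back to a full replicate on schema changes)
    exec SF_Refresh 'SALESFORCE', 'Account', 'Yes'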
2.2 Stakeholders:
General Partners
Limited Partners
Business Administrators
Admin
These scheduled jobs currently take almost 16 hours to replicate all the data from the
Salesforce cloud to the local SQL Server, and the growing data volume on Salesforce will
increase the job execution time rapidly.
There is a time constraint of 14 hours on the replicate job, which means that after 14 hours
of execution the job will fail even if all the tables have not been replicated yet. This failure
occurs due to the timeout error raised by the ActiveBatch job scheduler.
By default, the Replicate job uses the Web Services API to pull data
from Salesforce.com. The Web Services API is synchronous, meaning that for the
rows retrieved from Salesforce.com, an immediate response is sent indicating the
success or failure of those rows. There is a need to convert this process to an asynchronous
one, so that the time gap between two queries is decreased and the user does not need to
wait for the response.
In case of replicate job failures, the user needs to retrigger the job with some custom settings
to replicate the remaining tables (those not yet replicated), which is a time-consuming task
and takes a lot of manual effort.
In the current scenario, each table is replicated every weekend; the user has no flexibility to
skip some tables from replication, if required.
The main objective of this project is to decrease the run time of the replicate jobs.
Secondly, it aims to make the replicate jobs intelligent, so that manual efforts are automated,
and to make more use of the Salesforce UI to change settings related to the DBAmp jobs.
Scope of the system:
All the DBAmp replicate enhancements are to be applied to the DBAmp refresh and GSO
replicate jobs as well.
Domain Analysis: Every software product falls into some domain category. Experts in the
domain can be a great help in analysing general and specific requirements.
Task Analysis: Our team of analysts and developers analyses the operations for which
the new system is required. Our client already has software to perform certain operations; it
is studied, and the requirements of the proposed system are collected.
JAD: A team of experts tries to understand the client's requirements. We give our views
about what we have understood from the client meeting. Each participant pens down
his/her thoughts and presents them in front of the others.
2.6 Project Planning:
Fig 1.7 Gantt chart of the project plan
Chapter 3: System Design
3.1 Input/Output Design
Custom Metadata:
The Salesforce administrator defines custom objects and their properties, such as custom fields,
relationships to other types of data, page layouts, and a custom tab. If the administrator created a tab
for a custom object, click the custom object’s tab to view the records.
DB Amp Replication Scheduler is a custom object that allows the user to perform
replication of the tables in the Salesforce database as per their requirements.
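From the SQL Server side, such settings could be read through DBAmp's OPENQUERY support. A purely illustrative sketch: the object and field names below are hypothetical placeholders, not the actual schema, and the linked server is assumed to be named SALESFORCE:

    -- Hypothetical custom object and fields standing in for the scheduler settings
    select *
    from openquery(SALESFORCE,
        'SELECT Name, Table_Name__c, Skip_Replication__c
         FROM DBAmp_Replication_Scheduler__c')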
Salesforce Database
Tables:
Listed below are the tables that we are creating and updating in the Salesforce database:
Account
AccountContactRole
AccountPartner
Campaign
CampaignMember
CampaignMemberStatus
Case
Contact
CurrencyType
Group
GroupMember
Lead
LoginHistory
Opportunity
Organization
Partner
PartnerRole
Profile
RecordType
Task
User
UserRole
4.1 DBAmp
DBAmp is a simple yet very powerful tool that exposes Force.com as another database
to your SQL Server. It allows developers to use their familiar SQL (and SOQL as part of
OPENQUERY) to perform all the CRUD operations on Salesforce objects. The data on the
Force.com platform can be backed up completely to an on-premise SQL Server using
DBAmp with very little programming.
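For example, both access paths look like ordinary T-SQL. A minimal sketch, assuming the linked server is named SALESFORCE:

    -- Plain SQL through the linked server (four-part naming)
    select top 10 Id, Name from SALESFORCE...Account

    -- SOQL passed through to Salesforce via OPENQUERY
    select * from openquery(SALESFORCE, 'SELECT Id, Name FROM Account LIMIT 10')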
4.2 Active Batch
ActiveBatch is an Enterprise Workload Automation and Job Scheduling Tool that helps
users integrate applications, databases, and technologies into end-to-end workflows.
4.3 DBAmp Jobs (Refresh and Replicate)
Chapter 5: System Implementation
5.1 Problem Statement
As a part of this project, we are solving the problems listed below:
When we start the DBAmp Replicate job to replicate the data from Salesforce.com, the job
fails with a timeout error because of the large amount of data, so many tables cannot be
replicated. The maximum time for execution of the Replicate job is 840 minutes, and
currently it is taking more than that.
Currently, after a Replicate job failure, when we restart the Replicate job it again starts
replicating tables which are already replicated. So we are trying to replicate only the tables
which were not replicated within the time frame: we need to find where the last DBAmp
Replicate job failed and start replicating only the tables which were not replicated in the
last run.
5.2 Solution
Use the bulkapi or pkchunk option of the DBAmp Replicate job to reduce the overall run
time.
To address the timeout error in the replicate job, we use the asynchronous Bulk API for the
replicate calls so that the run time of the replicate job can be reduced, as sketched below.
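A minimal sketch of the per-table call with each option; the option strings follow DBAmp's documented pattern, but the exact spellings and batch size should be confirmed against the installed DBAmp version:

    -- Default: synchronous Web Services API
    exec SF_Replicate 'SALESFORCE', 'Account'

    -- Asynchronous Bulk API: rows are queued as a job on Salesforce
    exec SF_Replicate 'SALESFORCE', 'Account', 'bulkapi'

    -- Bulk API with primary-key chunking, useful for very large tables
    exec SF_Replicate 'SALESFORCE', 'Account', 'pkchunk,batchsize(100000)'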
5.3 Implementation
In order to fix this issue, 3 changes have been put in place to achieve successful
replication of the complete data. The expected run time is 5 hours after carrying out
the changes mentioned below:
1. Created a field in the TableRefreshes custom metadata type (ReplicateOption__c)
We can specify, for each table, which option is to be used for the replication
process. The choice of replication process is recorded against each table keeping
in mind the speed of replication, which in turn depends upon the data size of the
table. The choices are given below (a dispatch sketch follows the list):
- Web Services API (default): used if the time taken for the replication of the
table is less than 30 secs.
- pkchunk: used if the table takes more than 30 secs to replicate; the rows of
the table are replicated asynchronously in primary-key chunks.
- bulkapi: currently not being used.
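A sketch of how the per-table option could drive the DBAmp call; the variables stand in for values read from the TableRefreshes metadata, and their names are illustrative:

    -- @tableName / @option would be looped over from the TableRefreshes settings
    declare @tableName sysname = 'Account'
    declare @option varchar(50) = 'pkchunk'   -- '' (Web Services API), 'pkchunk' or 'bulkapi'

    if @option = ''
        exec SF_Replicate 'SALESFORCE', @tableName           -- default synchronous API
    else
        exec SF_Replicate 'SALESFORCE', @tableName, @option  -- asynchronous option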
2. Added a time-based constraint field in the TableRefreshes custom metadata type
(HoursConstraint)
A field, HoursConstraint, has been added to the TableRefreshes custom metadata
for each table. If the job is triggered within the hours constraint for a table, the
behaviour below is observed (a sketch of the check follows the list):
- For tables whose replication failed in the previous run, the tables are queued
for replication when the job is triggered again.
- For tables whose replication passed in the previous run, the tables are not
replicated again if the job is retriggered within the hours constraint.
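A minimal sketch of the hours-constraint check; the ReplicateLog table and its columns are hypothetical stand-ins for wherever the job records its run history:

    declare @hoursConstraint int = 24          -- per-table value from HoursConstraint
    declare @lastSuccess datetime

    select @lastSuccess = max(FinishTime)      -- last successful run for this table
    from ReplicateLog
    where TableName = 'Account' and Status = 'Succeeded'

    -- replicate only if the table never succeeded, or last succeeded outside the window
    if @lastSuccess is null
       or datediff(hour, @lastSuccess, getdate()) >= @hoursConstraint
        exec SF_Replicate 'SALESFORCE', 'Account'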
3. In-between table failures other than the timeout issue
The job will skip the failed table, continue replicating the rest of the tables, and
retry the failed tables once at the end. These changes have been made on both the
DB Amp Replicate and DB Amp Refresh jobs (a retry sketch follows the list):
- If the job fails, the logs will contain the list of the tables which failed at the
end of the job.
- If everything gets replicated into the SQL DB, the job succeeds; otherwise the
job fails.
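A sketch of the skip-and-retry flow, assuming a hypothetical work list #Tables holding the table names to replicate:

    declare @t sysname
    declare @failed table (TableName sysname)

    -- first pass: replicate each table, remember failures and keep going
    while exists (select 1 from #Tables)
    begin
        select top 1 @t = TableName from #Tables
        begin try
            exec SF_Replicate 'SALESFORCE', @t
        end try
        begin catch
            insert into @failed values (@t)    -- skip this table for now
        end catch
        delete from #Tables where TableName = @t
    end

    -- second pass: retry the failed tables once at the end of the job
    insert into #Tables select TableName from @failed
    delete from @failed
    while exists (select 1 from #Tables)
    begin
        select top 1 @t = TableName from #Tables
        begin try
            exec SF_Replicate 'SALESFORCE', @t
        end try
        begin catch
            insert into @failed values (@t)    -- still failing: report in the log
        end catch
        delete from #Tables where TableName = @t
    end

    -- tables left in @failed are listed in the job log; if none, the job succeeds
    select TableName as FailedTable from @failed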
Chapter 6: System Testing
Testing means verifying correct behaviour. Testing can be done at all stages of module
development: requirements analysis, interface design, algorithm design, implementation, and
integration with other modules. Debugging is a cyclic activity involving execution testing and
code correction. The testing that is done during debugging has a different aim than final module
testing. Final module testing aims to demonstrate correctness, whereas testing during debugging
is primarily aimed at locating errors. This difference has a significant effect on the choice of
testing strategies.
Module Testing: Module testing is the testing of complete code objects as produced by the
compiler when built from source.
System Testing: System Testing is a level of software testing where a complete and
integrated software is tested. The purpose of this test is to evaluate the system’s compliance with
the specified requirements.
6.2 Test Cases
DB Amp Replicate Job

Category: Configuration
- Custom Metadata type > Table Refreshes > Validate the field replication option is present
- Custom Metadata type > Table Refreshes > Validate the field hours constraint is present
- Custom Metadata type > Table Refreshes > Replication Option > Validate the error message upon entering values other than pkchunk and bulkapi
- Custom Metadata type > Table Refreshes > Replication Option > Validate the default blank option (Web Services API) is accepted in the field

Category: Logs Validation
- DB Amp Replicate Job > View Logs > Validate all table names are visible with last replicated details specified
- DB Amp Replicate Job > View Logs > Validate the batch size for pkchunk for the tables is visible
- DB Amp Replicate Job > View Logs > Validate the no. of rows replicated for the table is specified
- DB Amp Replicate Job > View Logs > Job failed > Validate the list of failed tables is specified

Category: Performance
- DB Amp Replicate Job > Validate the job is taking less than 8 hours for completion

Category: Functionality
- Replication of tables > pkchunk option > Validate the table is replicating based on the option specified
- Replication of tables > Web Services API option > Validate the table is replicating based on the option specified
- Re-replication of failed tables > In-between table failure > Validate the failed tables are tried for replication again
- Triggering the job within the time frame specified > Validate the tables which failed previously are getting replicated

Category: Data Validation
- Validate the data in the SQL database and Salesforce cloud are in sync
- Validate the new data created for the tables is in sync > Account, Activity, Activity Link, Branch, Contact, Coverage, Case, Task, Transaction, Project
- Job is in progress > Web Services API tables > Create a new record for a table already replicated > Validate that the table is not replicated again
- Job is in progress > pkchunk tables > Create a new record for a table not yet replicated > Validate the new data also gets replicated with the table
- Job is in progress > pkchunk tables > Create a new record for a table already replicated > Validate that the table is not replicated again
- Job is in progress > Web Services API tables > Create a new record for a table not yet replicated > Validate the new data also gets replicated with the table
DB Amp Refresh Job

Component: Re-replication of failed tables
- In-between table failure > The tables are queued again at the end for replication, after all the other tables are replicated
- In-between table failures > A table failing even after the retry > The list of the tables which failed will be listed in the logs of the job

Component: Data Reconciliation
- Trigger DB Amp Refresh Job > Data Consistency > Salesforce cloud to SQL Salesforce DB > Validate the consistency of data in both environments
6.3 Acquisition
Network Requirements
Hardware Requirements
- Server computer configuration
Item         Requirement
Processor    Intel Pentium/Celeron family, or compatible Pentium III Xeon or higher processor
RAM          Minimum: 8 GB; Recommended: 14 GB
Table 5.2

Item         Requirement
Processor    Intel Pentium/Celeron family, or compatible Pentium III Xeon or higher processor
RAM          Minimum: 2 GB; Recommended: 4 GB
Table 5.3
Software Requirements
• Databases
• VSS writer
- Requirements for database servers
Basic training will be provided by experienced senior staff on the execution and use of
the functionality, for the ease of the end user.
Summary
In Salesforce, we have a number of scheduled jobs that are responsible for replicating Salesforce
data into a local SQL Server. On Saturday, we perform a full replication of all the data to the SQL
database to ensure that both systems are in sync. This full replication currently takes 10 to 11
hours to finish, and is increasing as the volume of data in Salesforce increases. By default, this
Replicate job uses the Web Services API to pull data from Salesforce.com. The Web Services API
is synchronous, meaning that for the rows retrieved from Salesforce.com, an immediate response
is sent indicating the success or failure of those rows.
So we are trying to refactor the Replicate job and reduce its overall time using the bulkapi
option. The Bulk API is asynchronous, meaning that rows retrieved from Salesforce.com are
queued as a job. The job is executed at some time in the future, and the application must check
the status of the job at a later time to retrieve the success, failure, or unprocessed results of the
rows sent.
Also, after a Replicate job failure, when we restart the Replicate job it again starts replicating
tables which are already replicated. So we are trying to replicate only the tables which were not
replicated within the time frame: we need to find where the last DBAmp Replicate job failed and
start replicating only the tables which were not replicated in the last run.