Migration
Artifacts
PRD link
ARD –
Figma –
Teams/POC involved – iOS, DP Team, Android, Analytics
GTM Plan – Mid-September release with XP-based rollout
Gantt –
QA/E2E cases
Currently, there is a significant loss of impression and click analytics events: on average, 15-20% of the events generated on the app never reach the server. In rare instances, event loss has been as high as 70-80% in the past, making analysis difficult for the analytics team and blocking the rollout of certain features.
The proposal is to migrate the flow of events from the GTM SDK to the Swiggylytics SDK, which can dispatch batches of 100, 500, 1000, or 2000 events at a time. Swiggylytics has no limit on hits at launch and no quota that gets exhausted beyond a certain threshold.
Existing system:
FirebaseGTM
All the events inside the app flow via FirebaseGTM. This flow has a major drawback: a rate limit of 60 hits, after which the quota recharges at the rate of 1 event every 2 seconds. This often leads to event loss and requires its own specific handling to manually send events via a GET API. Sending data via FirebaseGTM also incurs extra cost for transforming the data using the Rill Job. Another major disadvantage is its event size limit: when an event is large, for example because of a large context or object value, or because we internally batch multiple events into an array, it is very likely to fail since the data is sent via a GET API through query params. Refer to this doc for more on this.
Swiggylytics
Swiggylytics is our own in-house SDK for managing the event flow. With the current implementation, we only use it to send certain high-priority events such as ads events and cart/menu attribution events. The major advantage of this SDK is its handling of the event lifecycle until the event is finally dispatched successfully. We can use it to send events in batches, thereby reducing the number of API hits significantly (a sketch of the idea follows). Another major advantage is that the Rill Job transformation logic can live on the client side itself: we do the mapping from abbreviations to column names (e.g. ov -> object_value), calculate the GeoHash, and fill in static values. This lets us skip the Rill Job layer on the server end and leads to major cost savings.
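To make the batching idea concrete, here is a minimal sketch of the kind of batched dispatch Swiggylytics enables, in contrast to one API hit per event; the class and function names (EventBatcher, dispatchBatch) are illustrative, not the actual SDK API.

// Illustrative sketch only (names hypothetical): accumulate events and flush
// them as a single batch once the configured batch size is reached.
data class Event(val name: String, val payload: Map<String, Any?>)

class EventBatcher(
    private val batchSize: Int = 100,                  // e.g. 100/500/1000/2000 per config
    private val dispatchBatch: (List<Event>) -> Unit   // one network call per batch
) {
    private val pending = ArrayDeque<Event>()

    fun add(event: Event) {
        pending.add(event)
        if (pending.size >= batchSize) flush()
    }

    fun flush() {
        if (pending.isEmpty()) return
        dispatchBatch(pending.toList())                // a single hit for the whole batch
        pending.clear()
    }
}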
Proposed Approaches:
Approach 1:
Creating a new object with field names similar to the clickstream table.
Updating the existing event trigger function and adding XP-based parameters to process the config fallback case.
Config bypass mechanism to process all events. Currently, only events specified in the config file are sent via Swiggylytics and all other events are rejected in the 'Event Validator'. The logic would be updated to support all events while prioritizing config events (see the sketch after this list).
Addition of extra columns in local storage to classify the event priority as config/non-config. Since we intend to prioritize config events, we need to classify them accordingly in local storage.
Failure handling in a serial queue to make sure all events get an equal number of retries.
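A minimal sketch of what the config-bypass in the Event Validator could look like, assuming the validator currently drops events not listed in config.json; the names EventValidator and EventPriority are illustrative.

// Illustrative sketch (names hypothetical): instead of rejecting events that are
// not present in config.json, tag them with a lower priority and let them through,
// so config events can still be prioritized downstream.
enum class EventPriority { CONFIG, NON_CONFIG }

class EventValidator(private val configEventNames: Set<String>) {
    fun validate(eventName: String): EventPriority =
        if (eventName in configEventNames) EventPriority.CONFIG
        else EventPriority.NON_CONFIG   // previously: rejected here
}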
Pros:
Since there is a single queue, processes like batching and purging are easier to maintain.
Cons:
The table needs to be maintained carefully for config and non-config events, as the two carry different priorities.
Approach 2 (Selected):
This approach creates a new database. The changes required are:
Creating a new database which interacts with a table of similar structure and uses the same DAO (see the sketch below).
Updating the existing event trigger function and adding XP-based parameters to process the config fallback case.
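Assuming the local storage is Room (the document mentions DAOs and local storage but does not show the actual setup), a sketch of a second database reusing the same entity and DAO could look like this; NonConfigEventDatabase and the field names are illustrative.

// Illustrative Room sketch (names hypothetical): a second database that reuses the
// same entity and DAO, so non-config events live in their own store.
import androidx.room.*

@Entity(tableName = "events")
data class EventTable(
    @PrimaryKey val eventId: String,
    val payload: String,
    val isInMemory: Boolean = false
)

@Dao
interface EventDao {
    @Insert
    fun insert(event: EventTable)

    @Query("SELECT * FROM events LIMIT :limit")
    fun getLimit(limit: Int): List<EventTable>
}

// Same DAO, separate database instance dedicated to non-config events.
@Database(entities = [EventTable::class], version = 1)
abstract class NonConfigEventDatabase : RoomDatabase() {
    abstract fun eventDao(): EventDao
}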
Pros:
Separation of concerns - the code will remain clean and well segregated, making it easier to maintain in the long run.
If we add another priority set of events in the future, we can build on top of this architecture.
Ease of mitigating issues - in case of any issues with the new implementation, only the new events moved to Swiggylytics (non-config) will be impacted. Config events will flow as is.
Keeping separate queues adds separation of concerns: the older flow is not hampered and the new process runs independently.
Cons:
Having two queues introduces more complexity; each queue needs its own purging, batching, and related operations.
Storage:
The flow of events in the starting phase remains the same. Currently, we send the event via SwiggylyticsEventHandler, which forwards it to SwiggyLyticsEventHandler, a custom implementation of HandlerThread; from here on, the data is processed on a background thread. SwiggyLyticsEventHandler then sends the data to EventManager, which handles RealTime and Batched events and logs anything else as an unsupported event type. Here, we add the event to a NonConfigEventsQueue. We also add a NonConfig value to the eventType enum, which helps further down the line when we retrieve data from the DB and lets us manage the lifecycle more cleanly.
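A rough sketch of the EventManager branching described above, assuming the eventType enum is extended with a NonConfig value; the class shapes are illustrative.

// Illustrative sketch: EventManager routes events by type, and the new NonConfig
// branch enqueues into its own queue. The enum value and queue name follow the
// text above; the remaining scaffolding is hypothetical.
import java.util.concurrent.LinkedBlockingDeque

enum class EventType { REAL_TIME, BATCHED, NON_CONFIG }

data class Event(val name: String, val type: EventType)

class EventManager {
    private val nonConfigEventsQueue = LinkedBlockingDeque<Event>()

    fun onEvent(event: Event) {
        when (event.type) {
            EventType.REAL_TIME -> handleRealTime(event)      // existing flow
            EventType.BATCHED -> handleBatched(event)         // existing flow
            EventType.NON_CONFIG -> nonConfigEventsQueue.add(event)  // new branch
        }
    }

    private fun handleRealTime(event: Event) { /* existing handling */ }
    private fun handleBatched(event: Event) { /* existing handling */ }
}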
Interface Updates:
● IDispatcher:
○ Observable<Event> subscribeNonConfigSendingFailed();
○ void dispatchNonConfigBatch(Batch batch);
○ Batch getDispatchedPendingNonConfigBatchList();
○ Observable<Batch> subscribeNoPendingNonConfigInDispatch();
● IEventManager:
○ Observable<Batch> subscribeNonConfigBatchAvailable();
○ LinkedBlockingDeque<Event> getNonConfigEventsQueue();
○ int getNonConfigEventsCountInQueue();
● IEventStorage:
○ void removeNonConfigOrphans();
○ Observable<List<EventTable>> getNonConfigEvents();
○ Observable<List<EventTable>> getLimitNonConfigEvents(int limit);
The above-mentioned functions are additions to the already existing interfaces. The idea is to reuse most of the existing codebase and update it wherever required. These functions are essentially counterparts of the existing RealTime and NonRealTime functions, with very similar implementations; the only differences are which events are manipulated, which queue we are dealing with, which DB we interact with, and which states are updated. For more context, see the example below.
In the function below, we get all the real-time events from AppDatabase for the given limit and also update the database by setting the is_in_memory property to true for the extracted events.

    eventTableList.addAll(appDatabase.eventDao().getLimitRealTime(limit));
    updateEventsInMemory(eventTableList, true);
    return eventTableList;
}).subscribeOn(Schedulers.io());

The non-config counterpart differs only in the database and DAO method it queries:

    eventTableList.addAll(nonConfigDatabase.eventDao().getLimitNonConfig(limit));
    updateEventsInMemory(eventTableList, true);
    return eventTableList;
}).subscribeOn(Schedulers.io());
}
As can be seen, the implementation remains very similar, with the only change being the point of interaction. All the other functions follow the same pattern: the core logic is retained and only the point of interaction/scope is updated.
Beacon Service:
At any time, two timers run in the application, for realTime and nonRealTime events, with their debounce values configured via the config.json file. For our use case the config values are the same as RealTime, so to reduce complexity we will use the realTimeTimer to trigger the batch. Unlike the usual implementation, where we trigger a batch when the batch limit is reached, we use a batch limit of 1 when events are triggered from the timer, i.e. as long as there is even a single event in the queue when the timer reaches its trigger point, we send that solo event (sketched below).
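A minimal sketch of the timer-triggered flush with a batch limit of 1, assuming a plain periodic timer; the class name NonConfigBeacon and the parameter names are illustrative, and the debounce value would come from config.json as described above.

// Illustrative sketch (names hypothetical): on every realTimeTimer tick, flush
// whatever is in the non-config queue, even if only a single event is pending
// (i.e. a batch limit of 1 for timer-triggered dispatch).
import java.util.Timer
import java.util.concurrent.LinkedBlockingDeque
import kotlin.concurrent.timerTask

class NonConfigBeacon<E>(
    private val queue: LinkedBlockingDeque<E>,
    private val dispatch: (List<E>) -> Unit,
    debounceMs: Long                             // configured via config.json in the real flow
) {
    private val realTimeTimer = Timer()

    init {
        realTimeTimer.scheduleAtFixedRate(timerTask {
            if (queue.isNotEmpty()) {            // even a single pending event triggers a send
                val batch = mutableListOf<E>()
                queue.drainTo(batch)
                dispatch(batch)
            }
        }, debounceMs, debounceMs)
    }
}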
Event Dispatch:
Rill Job:
A job runs on the server end which consumes the events we send, transforms them into a DB-readable format, inserts some values, and pushes the data further. For example, consider the event request below from FirebaseGTM.
https://100.64.1.11/client/metric/event/gtm?e=impression&sn=HUD-related_screen&on=launch-api-response&ov=%7Bstatus_code%3D0%2C-has_trackable_orders%3Dfalse%2C-has_feedback%3Dtrue%7D&op=9999&ui=60834820&us=8nj14621-0368-497a-935d-eb16c84304b5&ud=d7fc6f8611e28784&p=an&cx=-&av=1999&sqn=6563&sc=direct&rf=-&lt=30.7370204&lg=76.6398736&ts=1691553316775&itd=true&exp=-&mcb=1539541086
The data is sent in abbreviated form: event_name is denoted as e, screen_name as sn, user_id as ui, etc. The Rill job contains transformation logic that converts this data into the appropriate format.
return gtmEvent{
    Version:   "1",
    Component: "GTM",
    IPAddress: o.getHeaderM([]string{"HTTP_X_FORWARDED_FOR", "Remote_Addr"}, ""),
    ServerTimeStamp: o.getParamM([]string{}, time.Now().Format(serverTimestampFormat)),
As can be seen in the function above, it picks up the data by abbreviation, transforms it into its own struct, and then transforms it into the column-name format, as seen below.
SELECT
CastAnythingToString(header.uuid) AS uuid,
header.eventId AS event_id,
CastAnythingToString(event.version) AS version,
CastAnythingToString(event.component) AS component,
CastAnythingToString(event.appVersionCode) AS app_version_code,
CastAnythingToString(event.appVersionCode) AS appVersionCode,
CastAnythingToString(event.userId) AS user_id,
CastAnythingToString(event.userId) AS userId,
CastAnythingToString(event.sid) AS sid,
CastAnythingToString(event.tid) AS tid,
CastAnythingToString(event.context) AS context,
CastAnythingToString(event.referral) AS referral,
CastAnythingToString(event.deviceId) AS device_id,
CastAnythingToString(event.deviceId) AS deviceId,
CastAnythingToString(event.platform) AS platform,
CastAnythingToString(event.ipAddress) AS ip_address,
CastAnythingToString(event.ipAddress) AS ipAddress,
CastAnythingToString(event.userAgent) AS user_agent,
CastAnythingToString(event.userAgent) AS userAgent,
case
else CastAnythingToString(event.eventName)
end AS eventName,
CastAnythingToString(event.eventName) AS event_name,
ClientTSConversion(event.clientTimeStamp) AS client_timestamp,
CastAnythingToString(event.clientTimeStamp) AS clientTimeStamp,
CastAnythingToString(event.sequenceNumber) AS sequenceNumber,
CastAnythingToString(event.objectName) AS object_name,
CastAnythingToString(event.objectName) AS objectName,
CastAnythingToString(event.screeName) AS screen_name,
CastAnythingToString(event.screeName) AS screeName,
CastAnythingToString(event.objectValue) AS object_value,
CastAnythingToString(event.objectValue) AS objectValue,
CastAnythingToString(event.objectPosition) AS objectPosition,
CastAnythingToString(header.schemaVersion) AS schema_version,
CastAnythingToString(header.schemaVersion) AS schemaVersion,
ConvertToLong(event.sequenceNumber, Cast(-9999 AS bigint)) AS sequence_number,
event.systemTime AS systemTime,
CastAnythingToString(event.serverTimeStamp) AS server_timestamp,
CastAnythingToString(event.serverTimeStamp) AS serverTimeStamp,
CastAnythingToString(event.source) AS source,
CastAnythingToDouble(event.latitude) AS latitude,
CastAnythingToDouble(event.longitude) AS longitude,
GetGeoHash(
CastAnythingToDouble(event.latitude),
CastAnythingToDouble(event.longitude)
) AS geo_hash,
header._server_time_stamp AS _server_time_stamp,
event.extraParams as extra_params
from
GTMEvent
Transformer:
The final step of sending events via the new pipeline is the transformation job, where we convert the abbreviated column names to the original column names and fill in the header data, to make sure the event gets populated in the new table only. This is done as the final step because nothing depends on it earlier; it is only required when we send the event via the API.
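A hedged sketch of the client-side transformation described above; the abbreviation map and static values are examples taken from the Rill mapping shown earlier, and the object/function names are illustrative.

// Illustrative client-side counterpart of the Rill transformation (names and the
// abbreviation map are examples only): expand abbreviated keys to clickstream
// column names, add static values, and compute the geo hash locally.
object ClientSideTransformer {
    private val columnNames = mapOf(
        "e" to "event_name",
        "sn" to "screen_name",
        "on" to "object_name",
        "ov" to "object_value",
        "ui" to "user_id"
    )

    fun transform(raw: Map<String, String>, latitude: Double?, longitude: Double?): Map<String, Any?> {
        val expanded = raw.mapKeys { (key, _) -> columnNames[key] ?: key }
        return expanded + mapOf(
            "version" to "1",        // static values filled on the client
            "component" to "GTM",
            "geo_hash" to geoHash(latitude, longitude)
        )
    }

    // Placeholder: a real implementation would use a proper geohash library.
    private fun geoHash(lat: Double?, lng: Double?): String? =
        if (lat != null && lng != null) "geohash($lat,$lng)" else null
}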
Testing:
The major concern while moving to this change is to keep data loss to a minimum and to make sure the system survives stress situations where we pass in a bulk of events, with none of them being lost. For this, we have a temporary class, initialized as a singleton, which fires sets of events every 100 ms; with it we test scenarios like network connectivity loss, app kill, app background/foreground, etc.
@Singleton
class EventStressTester(private val swiggyEventHandler: SwiggyEventHandler) {
    // Temporary test-only class (class/argument names illustrative): fires click events
    // every 100 ms from three concurrent coroutines.
    suspend fun fireTestEvents() = withContext(Dispatchers.IO) {
        async {
            for (a in 0..50) {
                swiggyEventHandler.handleOnClickEvent(/* test event */)
                delay(100)
            }
        }
        async {
            for (a in 51..100) {
                swiggyEventHandler.handleOnClickEvent(/* test event */)
                delay(100)
            }
        }
        async {
            for (a in 101..200) {
                swiggyEventHandler.handleOnClickEvent(/* test event */)
                delay(100)
            }
        }
    }
}
This allows us to stress-test events that are fired rapidly, even concurrently, over a very short period of time. We can validate that multiple batches are created and sent successfully, and also debug failure cases.
Analytics Testing:
This further needs to be validated at the analytics end, where we would send events via both the FirebaseGTM pipeline and the Swiggylytics pipeline. A view will be created joining the two tables dp_clickstream and dp_clickstream_v2, which will help us merge the data for analysis. There would be, broadly, two rounds of validation:
Pre-release analysis:
We would trigger events to both tables, and the analytics POC would analyze the data in both tables for volume as well as uniqueness of events, where we expect an equal, if not higher, number of events via the new pipeline. We would also add debug-level logs to look for broken ends and data loss.
Post-release analysis:
Once we release the app, we would run a similar exercise on real production volumes of data and, depending on the results, scale up the XP accordingly.
Rollout Strategy:
The target release is mid-September, with dev testing starting as early as 21st August, when we start sending data and observing the data flow. The feature will be rolled out via XP and gradually scaled up based on the observed results.
XP Link:
https://xp.swiggy.in/experiment/1315/instance/3058
Snowflake query:
References:
Proposed flow
Proposal doc
Pipeline Matching
DP Table creation Ticket
Macrobenchmark Sheet
Low End Device Metric Sheet
Mid End Device Metric Sheet
Artifacts
XP Doc
Solution doc
GTM steps
Both Consumer Android app and Consumer iOS app side changes will be shipped with the mid-September release. Changes include the following. Completed:
Send GTM (Google Tag Manager) driven events via the Swiggylytics SDK as well, along with an XP.
For the test variant, events will be sent via both the GTM and Swiggylytics SDKs.
For the control variant, events will be sent via the GTM SDK only.
The XP will be rolled out gradually at 1% -> 5% -> 10% -> 20% and so on, and the analytics team will keep an eye on the data. In Progress [currently running at 5%]
During this XP period, the Analytics team needs to perform an analysis of the event loss percentage, i.e. events received by the Swiggylytics SDK should be higher than the ones received via the GTM SDK by X% (to be determined). In Progress
The first cut of the data and analysis has been shared by Shivangi Sharma, which says:
We filter for those object_names where the Swiggylytics event count was non-zero, i.e. events were flowing through the Swiggylytics pipeline.
For iOS, Swiggylytics events are higher by ~3% compared to GTM events.
Once this analysis is completed, the results meet expectations, and the XP is rolled out to 100%, the apps will disable the flag to send events via the GTM SDK and enable the flag to always send events via the Swiggylytics SDK for 100% of the events. There will be a hold-up period of X days during which both the GTM and Swiggylytics SDKs will send 100% of the events from the apps. After X days we can deprecate the GTM SDK flows.
Currently, the Portal and Instamart web teams also send events to the dp_clickstream table, and their migration is a blocker to completely deprecating the legacy pipeline of Rill job conversion. Blocked [It has been prioritized by the Portal web team, at least in their OND tech goals]
One alternative here is to have a proxy layer at DP's end to intercept, transform, and send those events via the Swiggylytics migration pipeline; this way we would not lose any data. [AI on DP team: Anshuman Singh, Deepak Jindal]
Post the scale-up decision and >90-95% app adoption, there has to be an overlap period during which both sets of data co-exist (in the respective tables STREAMS.PUBLIC.DP_CLICKSTREAM_V2 and STREAMS.PUBLIC.DP_CLICKSTREAM), so as to give Business teams time to align on shifts in metric baselines.
Post 100% rollout of the changes, Anshuman Singh is to combine the tables STREAMS.PUBLIC.DP_CLICKSTREAM_V2 and STREAMS.PUBLIC.DP_CLICKSTREAM into STREAMS.PUBLIC.DP_CLICKSTREAM only, so that the BAU flow keeps working as is. Not Started [AI: Anshuman Singh]
○ Let's assume the date is 11th Nov 2023 when the XP is 100% rolled out; post that, there will be a hold-off period of X (7) days where both the GTM and Swiggylytics SDK events will be sent 100% from both platforms.
○ Within this hold-off period, the DP team will make the changes needed to combine the dp_clickstream and dp_clickstream_v2 views into a single view, dp_clickstream. The combined view should contain the data from dp_clickstream and dp_clickstream_v2 from 9th Nov 2023, so that we get 2 days of overlapping data and do not miss any data during this transition period.
○ Once this hold-off period is over, the Apps team will disable the feature flag from the clickstream-migration app version onwards (Android: 1173, iOS: 4.8.5(6)), so that from that app version onwards the flow of events from the GTM SDK is stopped.
○ From this date onwards, only clickstream events generated via the Swiggylytics SDK will be sent.
○ Portal and IM generated events will keep getting ingested into the dp_clickstream table.
○ So, from here onwards, the combined table will contain events generated from the apps' Swiggylytics SDK as well as the Portal and IM teams' events.
○ After this migration is completed, there will still be users on app versions older than the clickstream migration who will not update their apps even after the soft and hard update nudges are enabled; for those (less than ~5% of users), events will keep getting generated via the GTM SDK and the combined table will contain those events as well.
PS:
○ Post this migration itself, the legacy DP pipeline cost will be significantly reduced.
○ S3 storage of raw data will also be deprecated post this migration.
○ 3-6 months after the 100% rollout, the number of users on older app versions should have reduced significantly, and after that all of us, including the Analytics team, can take a call to completely deprecate the Rill pipeline. However, if the absolute number of users on older app versions is still significant, we will have to take a call on routing the Rill-job-ingested traffic to awz_s3_SwiggyticEvent instead of awz_s3_GTMEvent.
Android
iOS
AIs for increasing iOS events post Clickstream migration
○ If the DP team doesn't send the response back, will it hamper any app flow or not? This matters because a cost of ~$6K/month is involved here. AI on Raj Gohil. In Progress
Objective:
Since we have started modularizing our codebase, this is another contribution towards that umbrella project. We will create a framework responsible for driving analytics-related logic. This framework will also push the events to the respective destination, either via the GTM or SwiggyLytics API.
As the idea is to create a single module of source code that drives the entire business logic of the system, multiple tech stacks available in the market support cross-platform development, e.g. React Native, Flutter and KMM. Out of these three, the decision to go with KMM was based on the following reasons:
○ Kotlin has higher performance as it is a compiled language.
○ The learning curve is smaller for iOS developers as the language paradigm matches Swift.
○ Injecting a KMM module into existing projects is much easier than with the other stacks.
○ KMM has the advantage of a faster development cycle over RN, which relies on runtime JavaScript.
● Current state:
The flow of analytics events is currently driven in a somewhat scattered way. Two provisions are maintained to push an event to the clickstream pipeline.
The GTM queue is Google Tag Manager, which acts as a mediator that transmits an event to the BE data dump. The abstractions of the whole GTM framework reside on the application side.
Swiggylytics is meant for sending high-priority events to the backend. This submodule was developed to tackle the losses (~20%) and latency in the GTM queue.
Shortcomings:
Swiggy-App env:
From the application side, there will be only a single way of communicating with the Analytics framework. The app will project the type of event that needs to be generated and synced with the BE. From that point, it is the responsibility of the Analytics framework to acknowledge it and proceed. The app passes:
○ Screen name.
○ Object value (bannerId, widgetId, etc.)
○ Extra-params (requestId, etc.)
○ Context values which are dynamic with respect to the type of event.
○ Current sids & other global-level values.
● Analytics module env:
On receiving an event, the Analytics module will communicate with the KMM module to get the respective metadata of the event. The responsibilities of the module are as follows:
The KMM framework will hold the crude business-logic code. This module will be responsible for crafting metadata in the required format. The crafted metadata will be formed using the values passed from the app side in the first place. The app will only pass values that are dynamic in nature, like id, screenName, etc.
Approach 1:
As we are segregating major sub-modules that can work independently of the application side, we can move all the analytics code into a separate framework. This can help us in the following ways:
○ The app does not need to bother about event validation, as it becomes a fire-and-forget mechanism for the app; the rest is handled by the individual module.
○ Better segregation with responsibility sharing.
○ We can create logging & alerting systems for the individual module.
○ Easy migration in case we move everything to KMM.
● Cons:
○ Need to create a layer responsible for communicating with the KMM module, which can increase dev effort to some extent.
● Approach 2:
In this approach we create the same module layers, i.e. KMM module, Analytics module & App. Here the communication is achieved via the application, where the app acts as a mediator between the two modules.
Pros:
○ Can use the existing pipeline for syncing events with Swiggylytics & GTM.
○ Development can be concluded in comparatively less time.
● Cons:
○ The app becomes a dependency, which can restrict scaling this to a larger scope.
○ We will need to import dependencies of both the KMM & Analytics modules across the app for a single sub-system.
○ A future migration to move everything to KMM will be effort-heavy, as there is no direct communication between the KMM & Analytics modules.
● Communication contract:
Since multiple static & dynamic data are involved in the process, we need well-defined classifications & provisions for passing dynamic data from the app to the KMM module.
Static data: data that will not change at runtime and can be kept as a static declaration in the KMM module. Ex: impression-brand-carousel-item-ad, click-collection-restaurant-item, etc.
Dynamic data: data that can change at runtime and will be fetched from the BE.
Communication channel: we can expose dedicated functions & class instances inside the KMM module for a particular component, which can be a UI element, general events (app-launch), etc. These functions will be parameterized with dynamic data and will always return ready-to-sync metadata (see the sketch below).
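A hedged sketch of what such a communication channel could look like inside the KMM module; the object, function, and parameter names are illustrative, not the actual contract.

// Illustrative KMM-side channel (names hypothetical): static parts of the event
// live in the shared module, dynamic values come in as parameters, and the
// function returns ready-to-sync metadata for the Analytics framework.
object BannerCarouselEvents {
    private const val OBJECT_NAME = "impression-brand-carousel-item-ad"  // static data

    fun impression(
        bannerId: String,        // dynamic data passed from the app
        screenName: String,
        requestId: String?
    ): Map<String, Any?> = mapOf(
        "object_name" to OBJECT_NAME,
        "object_value" to bannerId,
        "screen_name" to screenName,
        "extra_params" to mapOf("requestId" to requestId)
    )
}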
Metadata:
This will be the end-result data that is consumed by GTM & Swiggylytics.
Ex:
Ad-Object:
Exp-Object:
Pros:
○ We only need to create mapper functions in the Analytics framework until the data-sync mechanism is moved into KMM.
LLD:
The system will be designed such that atomicity at each abstraction is maintained to the highest possible level. Single responsibility will be kept in order to handle an event end to end.
Event Objects: This will contain folder sections for each type of event. Each nested folder will contain its respective Interface, Manager & Constant files responsible for handling the functionality of that event.
Extensions: This will contain all the extensions used in the project module.
Interface: This will hold high-level common interface files. These interfaces will be used across the project and can be implemented at all horizontal levels of the module.
Model: This will contain all the data models used for common purposes across the project module. Two dedicated folders will be maintained, input & output, which will hold the models used for data injection (input) and data outsourcing (output) for the project module.
Singleton: This will hold singleton classes used across the project.
Utils: This will hold supporting files for driving ad-hoc functionality for the project module.
CommonTest: This will contain the entire unit test suite for the project.
Design pattern:
To manage data flow in the module we need a simple but effective design foundation. Hence we will use the widely known Repository pattern to provide the core data-flow mechanism in the module.
Dependency injection:
Dependency injection will play a vital role in passing necessary data to the individual constructors. For resolving dependencies we will use the following types:
Constructor injection:
In this case, the data is injected directly into the classes by passing arguments in the constructor itself.
class ModuleRepository()

class BottomBarEventManager(
    private val repository: ModuleRepository   // constructor-injected (parameter illustrative)
) {
    // Class implementation
}

// Declare the dependencies in a Koin module (module name illustrative)
val analyticsModule = module {
    singleOf(::ModuleRepository)
    factoryOf(::BottomBarEventManager)
}
ModuleRepository:
This abstract class will act as a repository data provider for the other event managers. Its responsibility is to hold mutable & immutable application & config data in a thread-safe way and provide the required injection for other classes that need that data.
These are the actual event manager classes, which will abstract the creation, validation & outsourcing of a particular event. These classes will be injected with other critical app data in order to prepare the final event. They will also implement interfaces for driving the event creation & validation process. For every dedicated event type, a new manager class will be created.
AnalyticsSyncOutputData:
This data class will be the final output object, ready to be synced with the BE. It will hold two data properties, i.e. GTMData & SwiggylyticsData. Both of them will be constructed via dedicated functions in the manager and synced with the BE via dedicated channels.
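A small sketch of what this output object could look like; the GTMData and SwiggyLyticsData bodies are placeholders, since their actual fields are not specified here.

// Illustrative sketch: the final output of an event manager, carrying both payload
// shapes so each can be synced via its own channel. Field names are placeholders.
data class GTMData(val queryParams: Map<String, String>)
data class SwiggyLyticsData(val payload: Map<String, Any?>)

data class AnalyticsSyncOutputData(
    val gtmData: GTMData,
    val swiggyLyticsData: SwiggyLyticsData
)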
Interfaces:
Interfaces will be defined at various levels in order to create an outer blueprint of an implementation. These interfaces will contain the mandatory fields & functions that an implementer must provide in order to keep development integrity across all the classes. Interfaces can also hold optional functionality that can be implemented based on requirements.
IGTMEvent: This will provide a support interface for GTM event data, with a property holding GTMData & a function to prepare it.
ISwiggylyticsEvent: This will provide a support interface for Swiggylytics event data, with a property holding SwiggyLyticsData & a function to prepare it.
IEventValidation: This will provide support for validating an event. We will include more functions if needed.
IDynamicAppData: This will provide an interface for updating mutable app properties via dedicated channels. This interface will mostly be implemented by dedicated central classes in the repository where the data is stored.
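A hedged sketch of how these interfaces might be declared, reusing the placeholder data classes from the previous sketch; the signatures are illustrative, not the final definitions.

// Illustrative declarations only; the real interfaces may differ.
interface IGTMEvent {
    val gtmData: GTMData
    fun prepareGTMData(): GTMData
}

interface ISwiggylyticsEvent {
    val swiggyLyticsData: SwiggyLyticsData
    fun prepareSwiggyLyticsData(): SwiggyLyticsData
}

interface IEventValidation {
    fun isValid(): Boolean                  // more functions can be included if needed
}

interface IDynamicAppData {
    fun update(key: String, value: Any?)    // mutable app properties via a dedicated channel
}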
Future state:
As KMM evolves, we will evaluate the tech stack from all perspectives:
○ Mitigating major risks of failure for complex executions on the beta version, which might impact critical data & revenue.
○ Migration of major, complex pieces may be avoided for both third-party & in-house code.
○ We may be able to use the stable system from the start without worrying about the old KMM setup.
Cross-platform dev guidelines:
○ Will keep a dedicated kill switch for Ad and non-Ad event flow via KMM.
○ Will roll out using an XP with a 1% user base, monitor for any event drop, and gradually increase adoption.
○ Apps team Balvinder Gambhir Mitansh Mehrotra Sambuddha Dhar Priyam Dutta
Nihar Ranjan Chadhei Agam Mahajan
○ QA team Suresh Thangavelu Vijay S
○ Analytics team Shreyas M Kumar Keshav
○ DP team <poc to be added>
○ Ads team <poc to be added>
● Appendix:
Build size has been calculated by creating an archived build and validating the framework executable & combined app package size.
Note: The size has been evaluated with the basic foundation files & code. This may increase with future development.
iOS:
Universal: 130 MB
Android: These tests were done using a sample app with R8 enabled.
Currently, there is a significant loss of impression and click analytics events: on average, 15-20% of the events generated on the app never reach the server. In rare instances, event loss has been as high as 70-80% in the past, making analysis difficult for the analytics team and blocking the rollout of certain features.
Root cause
At first glance, it seemed that the app was successfully transmitting all the events and that the data loss was occurring on the backend. However, upon conducting a thorough investigation, we discovered that the Firebase SDK (GTM - Google Tag Manager), which we use to queue and send the events, was discarding some of the events from its queue. This was primarily due to the high volume of events being sent within a short timeframe.
At the time of initialisation of the SDK, the app gets 60 available hits, which are replenished at a rate of 1 hit every 2 seconds.
This means that initially the app can send only 60 events, and after exhausting the available hits it can send only 1 event every 2 seconds.
GTM Quotas
So if a user navigates within the app as follows: Home -> Food -> Search -> Menu
Home ~ 30 events, Food ~ 30 events, Search ~ 120 events, Menu ~ 10 events
Since the app does not have 190 hits available, the SDK does not allow it to send all 190 events; instead it sends around 80-90 events and drops the rest.
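To make the arithmetic concrete, here is a small sketch of the quota model (60 initial hits, refilled at 1 hit per 2 seconds); the session length is an assumption used only for illustration.

// Illustrative quota model: 60 initial hits plus 1 refill every 2 seconds.
// Assuming the Home -> Food -> Search -> Menu journey takes about a minute,
// roughly 60 + 60/2 = 90 of the 190 generated events can be sent, which matches
// the observed 80-90 events.
fun maxSendableEvents(initialHits: Int = 60, refillPeriodSec: Int = 2, sessionSec: Int = 60): Int =
    initialHits + sessionSec / refillPeriodSec

fun main() {
    val generated = 30 + 30 + 120 + 10           // 190 events across the four screens
    val sendable = maxSendableEvents()           // ≈ 90 under the assumed session length
    println("generated=$generated sendable=$sendable dropped=${generated - sendable}")
}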
○ Migrate the flow of events from the GTM SDK to the Swiggylytics SDK, which can dispatch batches of 100, 500, 1000, or 2000 events at a time. Swiggylytics has no limit on hits at launch and no quota that gets exhausted beyond a certain threshold.
● Testimony from the DE app
Harsh Kataria to add the pre/post analysis for the same.
A similar analysis was done for the Menu and Cart attribution migration from GTM to Swiggylytics, and there too we saw a 7-10% increase in daily events.
Efforts similar to those for the Menu and Cart attribution events migration will be needed for the dp_clickstream migration too.
Prerequisites
The DP team needs to migrate the conventional and complex dp_clickstream schema to the schema registry.
Getting even a column addition through the conventional dp_clickstream flow took a lot of time during the Widget Ranking project.
● Owned by Anshuman Singh
○ Problem Statement
○ Current Data Flow
○ Issues with current flow
○ Proposed Data Flow
○ Optimisation/Saving
● Problem Statement
Currently, the client app team sends around ~2 billion events per day for clickstream data via the legacy event-ingestion data flow, and we then transform this data 1:1 (so we have 2 kinds of data: raw & transformed).
In this flow we ingest the clickstream data into Kafka twice (raw and transformed), store it twice in S3 (raw & transformed), and have an additional transformation layer for the 1:1 events.
The proposal is to directly send the transformed data from the client end, which can save a large amount of cost and make the pipeline generic.
View: default.dp_clickstream
Delta table: streams_delta.dp_clickstream
Orc table: streams_orc.dp_clickstream
The above view points to the delta & ORC tables.
○ Then data is synced to Snowflake via Snowpipe
Snowflake table: streams.public.dp_clickstream
More details regarding the custom changes being done at the event-collector can be found here: DP Clickstream Flow Details | Understanding flow
Optimisation/Saving
○ If we start sending the transformed data directly from the client via the current generic pipeline, we can save approximately 5K-7K monthly.
○ Below are the approximate cost savings:
○ The pipeline would be more streamlined and generic, and would have support for schema evolution.
From <https://swiggy.atlassian.net/wiki/spaces/DP/pages/3862069249/DP+Clickstream+Flow+Optimisation>
1.1 Overview
○ Our QAs have found multiple disparities in events when testing them via the ARD Automator tool.
For example:
1. https://swiggy.slack.com/archives/GBZMDADPZ/p1706709761046829
2. https://swiggy.slack.com/archives/GBZMDADPZ/p1690894571135269
○ The most common issue in these disparities is a difference in keys, for example Android sending "Favourite" while iOS sends "favourite". These issues would not arise if there were a common SDK generating the payload of these events.
● 4. GTM
In the first milestone, we have migrated all the events of the accounts page to this SDK.
We are currently running an XP and monitoring whether there is any data loss.
XP link - https://xp.swiggy.in/experiment/1467/instance/3701
Once we move to 100% adoption of the Klytics SDK, after the ARD is provided by the analyst and the walkthrough is done, all devs must add their ARD events via this SDK only going forward.
6. Future Plans
ARD Automator:
ARD Automation
Slack Channel for Bug reports, feedback, announcements and releases : #android-toolkit
Problem
Our current Analytics workflow has been plagued by manual and fragmented processes,
resulting in numerous challenges, such as:
Lack of a Single Source of Truth: ARD specifications scattered across various documents
make it challenging to maintain a clear understanding of expected data.
Manual Verification and Regression Testing: The need for manual verification and
regression testing consumes valuable time and can overlook crucial checks due to the volume
of events.
Limited Coverage: Due to the sheer number of events flowing through, it wasn't possible to cover all Analytics events in regression testing.
Solution
We are proposing an improved and automated workflow to once and for all resolve all these
issues. The following workflow ensures we have a simple yet effective way to write contracts
between teams and how to verify this in an automated way.
Key Features
A Single Source of Truth: We leverage Git to establish a central repository for contracts. This
repository will serve as the authoritative source for ARD specifications, ensuring clarity and
consistency.
Easy contract Maintenance: We ensure creation and modification of contracts are easy with
tooling with GUI forms, bulk imports, and local testing workflows.
Complex Payload Validators: The payload validators allow for matching complex structures, enabling us to cover almost all types of events.
Automated Contract Verification: Our workflow includes tooling and processes that automate
the verification of ARD contracts, reducing manual effort and increasing accuracy. This will
significantly enhance our efficiency and reliability in verification and regression testing. Thanks
to the automated process, we can now efficiently handle a significantly larger number of events
than we could before.
We already have the best tool for this, Git. So we made a repository to hold consumer app
contracts
https://github.com/swiggy-private/consumer-app-contracts
Since the initial M1 release, the tool has undergone significant modifications to address various requirements put forth by different teams. Some of the major changes are:
accommodates complex validators. This includes pattern-based regex and nested JSON
validators, to cover almost any payload.
Transition from GTM to Generic Payload Verification - The initial focus on GTM events for
the Consumer App has been broadened to cover generic event verification, allowing for
validation of any event from any team.
Migration from YAML-based contracts to fully automated JSON contracts - The contracts underwent a transition to a fully automated workflow utilizing JSON instead of the original handwritten YAML. This shift was prompted by the need to mitigate human errors, especially with the introduction of intricate, nested validators. A user-friendly GUI facilitates ease of contract creation and editing.
Local Contract Testing - Capability to load and test local contract files before uploading them
to the server.
Bulk Event Import - To speed up contract creation, we introduced bulk event imports. This fast-tracked adoption significantly.
Folder Structure Implementation - Given the tool's widespread adoption across various
teams, the incorporation of a simple folder structure was deemed essential. This structure aids
in organizing contracts based on teams or workflows, thereby streamlining the adoption
process.
Request data exporter - Provision for exporting parsed properties to CSV to aid in debugging
and analyzing instances where requests do not match. This feature facilitates seamless sharing
and swift identification of discrepancies.
Replay of Requests - We had originally only supported live events coming from an emulator or
real device, but some teams preferred to use a HAR file (dump of all requests) to verify instead.
We quickly added this in to enable replaying of the requests from the dump, enabling more
teams and improving the speed of execution.
These modifications collectively represent a significant evolution of the tool, enhancing its
versatility, usability, and efficiency in meeting diverse requirements and facilitating seamless
integration across teams and workflows.
Adoption
Team – Status
Consumer Tech – Onboarded with 220+ events created (50% of total)
Dineout – Onboarded with 26 events created (17% of total)
Minis – Onboarded with 14 events created
DE – Onboarding
IM – Onboarding
Vendor – Onboarding
Insanely Good – Onboarding
Validators
JSON Validator - Can contain complex objects and arrays, with each node being any other validator (see the sketch below).
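To illustrate the idea (this is not the tool's actual implementation), a recursive validator can be sketched as follows, where each node of the expected payload is itself a validator (exact match, regex, or nested JSON).

// Illustrative sketch of nested payload validation; names and shapes are hypothetical.
sealed interface Validator {
    fun matches(actual: Any?): Boolean
}

data class ExactValidator(val expected: Any?) : Validator {
    override fun matches(actual: Any?) = actual == expected
}

data class RegexValidator(val pattern: String) : Validator {
    override fun matches(actual: Any?) = actual is String && pattern.toRegex().matches(actual)
}

data class JsonValidator(val fields: Map<String, Validator>) : Validator {
    override fun matches(actual: Any?): Boolean {
        @Suppress("UNCHECKED_CAST")
        val map = actual as? Map<String, Any?> ?: return false
        return fields.all { (key, validator) -> validator.matches(map[key]) }
    }
}

// Example contract: screen_name must equal "menu", object_value must be numeric.
val sampleContract = JsonValidator(
    mapOf("screen_name" to ExactValidator("menu"), "object_value" to RegexValidator("\\d+"))
)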
Onboarding - TLDR
Generate some events on the app and the tool will receive them.
Workflow
ARD creation
The PR needs to be approved by both the iOS and Android POCs. The Analytics POC can verify but, due to access limitations, might not be able to review PRs. The Analytics POC can verify the contract on the toolkit tool after the PR is merged in.
Verification / QA regression
You can also see all the requests flowing in to understand extra or unwanted events
Demo
https://drive.google.com/file/d/1T0wjZnZxg2mG1r-IAp-w_b6_zOtYGaHH/view?usp=sharing
Sample PR
https://github.com/swiggy-private/consumer-app-contracts/pull/22
https://swiggy.slack.com/archives/GBZMDADPZ/p1712153844159739