How to Configure and Customize the Go SDK for Azure Cosmos DB

By Abhishek Gupta
The Go SDK for Azure Cosmos DB is built on top of the core Azure Go SDK package, which implements several patterns that are applied throughout the SDK. The core SDK is designed to be quite customizable, and its configuration can be applied with the ClientOptions struct when creating a new Cosmos DB client object using NewClient (and other similar functions). If you peek inside the azcore.ClientOptions struct, you will notice that it has many options for configuring the HTTP client, retry policies, timeouts, and other settings. In this blog, we will cover how to make use of (and extend) these common options when building applications with the Go SDK for Cosmos DB. I have provided code snippets throughout this blog. Refer to this GitHub repository for runnable examples.

Retry Policies

Common retry scenarios are handled in the SDK. You can dig into cosmos_client_retry_policy.go for more info. Here is a summary of the errors for which retries are attempted:

| Error Type / Status Code | Retry Logic |
| ------------------------ | ----------- |
| Network Connection Errors | Retry after marking the endpoint unavailable and waiting for defaultBackoff. |
| 403 Forbidden (with specific substatuses) | Retry after marking the endpoint unavailable and updating the endpoint manager. |
| 404 Not Found (specific substatus) | Retry by switching to another session or endpoint. |
| 503 Service Unavailable | Retry by switching to another preferred location. |

Let's see some of these in action.

Non-Retriable Errors

For example, here is a function that tries to read a database that does not exist.

Go

func retryPolicy1() {
	c, err := auth.GetClientWithDefaultAzureCredential("https://demodb.documents.azure.com:443/", nil)
	if err != nil {
		log.Fatal(err)
	}

	azlog.SetListener(func(cls azlog.Event, msg string) {
		// Log retry-related events
		switch cls {
		case azlog.EventRetryPolicy:
			fmt.Printf("Retry Policy Event: %s\n", msg)
		}
	})
	// Set logging level to include retries
	azlog.SetEvents(azlog.EventRetryPolicy)

	db, err := c.NewDatabase("i_dont_exist")
	if err != nil {
		log.Fatal("NewDatabase call failed", err)
	}

	_, err = db.Read(context.Background(), nil)
	if err != nil {
		log.Fatal("Read call failed: ", err)
	}
}

The azcore logging implementation is configured using SetListener and SetEvents to write retry policy event logs to standard output. See the Logging section in the azcosmos package README for details.

Let's look at the logs generated when this code is run:

Plain Text

//....
Retry Policy Event: exit due to non-retriable status code
Retry Policy Event: =====> Try=1 for GET https://demodb.documents.azure.com:443/dbs/i_dont_exist
Retry Policy Event: response 404
Retry Policy Event: exit due to non-retriable status code
Read call failed: GET https://demodb-region.documents.azure.com:443/dbs/i_dont_exist
--------------------------------------------------------------------------------
RESPONSE 404: 404 Not Found
ERROR CODE: 404 Not Found

When a request is made to read a non-existent database, the SDK gets a 404 (Not Found) response for the database. This is recognized as a non-retriable error, and the SDK stops retrying. Retries are only performed for retriable errors (like network issues or certain status codes). The operation failed because the database does not exist.

Retriable Errors - Invalid Account

This function tries to create a Cosmos DB client using an invalid account endpoint. It sets up logging for retry policy events and attempts to create a database.
Go func retryPolicy2() { c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", nil) if err != nil { log.Fatal(err) } azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) // Set logging level to include retries azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } Let's look at the logs generated when this code is run, and see how the SDK handles retries when the endpoint is unreachable: Plain Text //.... Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #1, Delay=682.644105ms Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #2, Delay=2.343322179s Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #3, Delay=7.177314269s Retry Policy Event: =====> Try=4 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: MaxRetries 3 exceeded failed to retrieve account properties: Get "https://iamnothere.docume Each failed attempt is logged, and the SDK retries the operation several times (three times to be specific), with increasing delays between attempts. After exceeding the maximum number of retries, the operation fails with an error indicating the host could not be found - the SDK automatically retries transient network errors before giving up. But you don't have to stick to the default retry policy. You can customize the retry policy by setting the azcore.ClientOptions when creating the Cosmos DB client. Configurable Retries Let's say you want to set a custom retry policy with a maximum of two retries and a delay of one second between retries. You can do this by creating a policy.RetryOptions struct and passing it to the azcosmos.ClientOptions when creating the client. 
Go func retryPolicy3() { retryPolicy := policy.RetryOptions{ MaxRetries: 2, RetryDelay: 1 * time.Second, } opts := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ Retry: retryPolicy, }, } c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", &opts) if err != nil { log.Fatal(err) } log.Println(c.Endpoint()) azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } Each failed attempt is logged, and the SDK retries the operation according to the custom policy — only two retries, with a 1-second delay after the first attempt and a longer delay after the second. After reaching the maximum number of retries, the operation fails with an error indicating the host could not be found. Plain Text Retry Policy Event: =====> Try=1 for GET https://iamnothere.documents.azure.com:443/ //.... Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #1, Delay=1.211970493s Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #2, Delay=3.300739653s Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: MaxRetries 2 exceeded failed to retrieve account properties: Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host exit status 1 Note: The first attempt is not counted as a retry, so the total number of attempts is three (1 initial + 2 retries). You can customize this further by implementing fault injection policies. This allows you to simulate various error scenarios for testing purposes. Fault Injection For example, you can create a custom policy that injects a fault into the request pipeline. Here, we use a custom policy (FaultInjectionPolicy) that simulates a network error on every request. Go type FaultInjectionPolicy struct { failureProbability float64 // e.g., 0.3 for 30% chance to fail } // Implement the Policy interface func (f *FaultInjectionPolicy) Do(req *policy.Request) (*http.Response, error) { if rand.Float64() < f.failureProbability { // Simulate a network error return nil, &net.OpError{ Op: "read", Net: "tcp", Err: errors.New("simulated network failure"), } } // no failure - continue with the request return req.Next() } This can be used to inject custom failures into the request pipeline. The function configures the Cosmos DB client to use this policy, sets up logging for retry events, and attempts to create a database. 
Go

func retryPolicy4() {
	opts := azcosmos.ClientOptions{
		ClientOptions: policy.ClientOptions{
			PerRetryPolicies: []policy.Policy{&FaultInjectionPolicy{failureProbability: 0.6}},
		},
	}

	c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", &opts)
	if err != nil {
		log.Fatal(err)
	}

	azlog.SetListener(func(cls azlog.Event, msg string) {
		// Log retry-related events
		switch cls {
		case azlog.EventRetryPolicy:
			fmt.Printf("Retry Policy Event: %s\n", msg)
		}
	})
	// Set logging level to include retries
	azlog.SetEvents(azlog.EventRetryPolicy)

	_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test_1"}, nil)
	if err != nil {
		log.Fatal(err)
	}
}

Take a look at the logs generated when this code is run. Each request attempt fails due to the simulated network error. The SDK logs each retry, with increasing delays between attempts. After reaching the maximum number of retries (default = 3), the operation fails with an error indicating a simulated network failure. Note: This can change depending on the failure probability you set in the FaultInjectionPolicy. In this case, we set it to 0.6 (a 60% chance to fail), so you may see different results each time you run the code.

Plain Text

Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/
//....
Retry Policy Event: MaxRetries 0 exceeded
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #1, Delay=794.018648ms
Retry Policy Event: =====> Try=2 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #2, Delay=2.374693498s
Retry Policy Event: =====> Try=3 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #3, Delay=7.275038434s
Retry Policy Event: =====> Try=4 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: MaxRetries 3 exceeded
Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #1, Delay=968.457331ms
2025/05/05 19:53:50 failed to retrieve account properties: read tcp: simulated network failure
exit status 1

Do take a look at Custom HTTP pipeline policies in the Azure SDK for Go documentation for more information on how to implement custom policies.

HTTP-Level Customizations

There are scenarios where you may need to customize the HTTP client used by the SDK. For example, when using the Cosmos DB emulator locally, you may want to skip certificate verification so you can connect without SSL errors during development or testing. TLSClientConfig allows you to customize TLS settings for the HTTP client, and setting InsecureSkipVerify: true disables certificate verification – useful for local testing but insecure for production.
Go func customHTTP1() { // Create a custom HTTP client with a timeout client := &http.Client{ Transport: &http.Transport{ TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, }, } clientOptions := &azcosmos.ClientOptions{ ClientOptions: azcore.ClientOptions{ Transport: client, }, } c, err := auth.GetEmulatorClientWithAzureADAuth("http://localhost:8081", clientOptions) if err != nil { log.Fatal(err) } _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } All you need to do is pass the custom HTTP client to the ClientOptions struct when creating the Cosmos DB client. The SDK will use this for all requests. Another scenario is when you want to set a custom header for all requests to track requests or add metadata. All you need to do is implement the Do method of the policy.Policy interface and set the header in the request: Go type CustomHeaderPolicy struct{} func (c *CustomHeaderPolicy) Do(req *policy.Request) (*http.Response, error) { correlationID := uuid.New().String() req.Raw().Header.Set("X-Correlation-ID", correlationID) return req.Next() } Looking at the logs, notice the custom header X-Correlation-ID is added to each request: Plain Text //... Request Event: ==> OUTGOING REQUEST (Try=1) GET https://ACCOUNT_NAME.documents.azure.com:443/ Authorization: REDACTED User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin) X-Correlation-Id: REDACTED X-Ms-Cosmos-Sdk-Supportedcapabilities: 1 X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT X-Ms-Version: 2020-11-05 Request Event: ==> OUTGOING REQUEST (Try=1) POST https://ACCOUNT_NAME-region.documents.azure.com:443/dbs Authorization: REDACTED Content-Length: 27 Content-Type: application/query+json User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin) X-Correlation-Id: REDACTED X-Ms-Cosmos-Sdk-Supportedcapabilities: 1 X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT X-Ms-Documentdb-Query: True X-Ms-Version: 2020-11-05 OpenTelemetry Support The Azure Go SDK supports distributed tracing via OpenTelemetry. This allows you to collect, export, and analyze traces for requests made to Azure services, including Cosmos DB. The azotel package is used to connect an instance of OpenTelemetry's TracerProvider to an Azure SDK client (in this case, Cosmos DB). You can then configure the TracingProvider in azcore.ClientOptions to enable automatic propagation of trace context and emission of spans for SDK operations. Go func getClientOptionsWithTracing() (*azcosmos.ClientOptions, *trace.TracerProvider) { exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint()) if err != nil { log.Fatalf("failed to initialize stdouttrace exporter: %v", err) } tp := trace.NewTracerProvider(trace.WithBatcher(exporter)) otel.SetTracerProvider(tp) op := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ TracingProvider: azotel.NewTracingProvider(tp, nil), }, } return &op, tp } The above function creates a stdout exporter for OpenTelemetry (prints traces to the console). It sets up a TracerProvider, registers this as the global tracer, and returns a ClientOptions struct with the TracingProvider set, ready to be used with the Cosmos DB client. Go func tracing() { op, tp := getClientOptionsWithTracing() defer func() { _ = tp.Shutdown(context.Background()) }() c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", op) //.... 
container, err := c.NewContainer("existing_db", "existing_container") if err != nil { log.Fatal(err) } //ctx := context.Background() tracer := otel.Tracer("tracer_app1") ctx, span := tracer.Start(context.Background(), "query-items-operation") defer span.End() query := "SELECT * FROM c" pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil) for pager.More() { queryResp, err := pager.NextPage(ctx) if err != nil { log.Fatal("query items failed:", err) } for _, item := range queryResp.Items { log.Printf("Queried item: %+v\n", string(item)) } } } The above function calls getClientOptionsWithTracing to get tracing-enabled options and a tracer provider, and ensures the tracer provider is shut down at the end (flushes traces). It creates a Cosmos DB client with tracing enabled, executes an operation to query items in a container. The SDK call is traced automatically, and exported to stdout in this case. You can plug in any OpenTelemetry-compatible tracer provider and traces can be exported to various backend. Here is a snippet for Jaeger exporter. The traces are quite large, so here is a small snippet of the trace output. Check the query_items_trace.txt file in the repo for the full trace output: Go //... { "Name": "query_items democontainer", "SpanContext": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "f2c892bec75dbf5d", "TraceFlags": "01", "TraceState": "", "Remote": false }, "Parent": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "b833d109450b779b", "TraceFlags": "01", "TraceState": "", "Remote": false }, "SpanKind": 3, "StartTime": "2025-05-06T17:59:30.90146+05:30", "EndTime": "2025-05-06T17:59:36.665605042+05:30", "Attributes": [ { "Key": "db.system", "Value": { "Type": "STRING", "Value": "cosmosdb" } }, { "Key": "db.cosmosdb.connection_mode", "Value": { "Type": "STRING", "Value": "gateway" } }, { "Key": "db.namespace", "Value": { "Type": "STRING", "Value": "demodb-gosdk3" } }, { "Key": "db.collection.name", "Value": { "Type": "STRING", "Value": "democontainer" } }, { "Key": "db.operation.name", "Value": { "Type": "STRING", "Value": "query_items" } }, { "Key": "server.address", "Value": { "Type": "STRING", "Value": "ACCOUNT_NAME.documents.azure.com" } }, { "Key": "az.namespace", "Value": { "Type": "STRING", "Value": "Microsoft.DocumentDB" } }, { "Key": "db.cosmosdb.request_charge", "Value": { "Type": "STRING", "Value": "2.37" } }, { "Key": "db.cosmosdb.status_code", "Value": { "Type": "INT64", "Value": 200 } } ], //.... Refer to Semantic Conventions for Microsoft Cosmos DB. What About Other Metrics? When executing queries, you can get basic metrics about the query execution. The Go SDK provides a way to access these metrics through the QueryResponse struct in the QueryItemsResponse object. This includes information about the query execution, including the number of documents retrieved, etc. Plain Text func queryMetrics() { //.... container, err := c.NewContainer("existing_db", "existing_container") if err != nil { log.Fatal(err) } query := "SELECT * FROM c" pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil) for pager.More() { queryResp, err := pager.NextPage(context.Background()) if err != nil { log.Fatal("query items failed:", err) } log.Println("query metrics:\n", *queryResp.QueryMetrics) //.... } } The query metrics are provided as a simple raw string in a key-value format (semicolon-separated), which is very easy to parse. 
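Since the string is just a set of semicolon-separated key=value pairs, a few lines of Go are enough to turn it into a map that you can log or feed into your own metrics pipeline. Below is a minimal sketch; the parseQueryMetrics helper is our own illustration, not something the SDK provides:

Go

import "strings"

// parseQueryMetrics converts the raw metrics string returned in
// QueryItemsResponse.QueryMetrics (e.g. "retrievedDocumentCount=9;totalExecutionTimeInMs=0.34")
// into a map for easier inspection.
func parseQueryMetrics(raw string) map[string]string {
	metrics := make(map[string]string)
	for _, pair := range strings.Split(raw, ";") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) == 2 {
			metrics[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
		}
	}
	return metrics
}

With a metrics string like the example that follows, parseQueryMetrics(*queryResp.QueryMetrics)["retrievedDocumentCount"] would return "9".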
Here is an example:

Plain Text

totalExecutionTimeInMs=0.34;queryCompileTimeInMs=0.04;queryLogicalPlanBuildTimeInMs=0.00;queryPhysicalPlanBuildTimeInMs=0.02;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.07;indexLookupTimeInMs=0.00;instructionCount=41;documentLoadTimeInMs=0.04;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=9;retrievedDocumentSize=1251;outputDocumentCount=9;outputDocumentSize=2217;writeOutputTimeInMs=0.02;indexUtilizationRatio=1.00

Here is a breakdown of the metrics you can obtain from the query response:

| Metric | Unit | Description |
| ------------------------------ | ----- | ------------------------------------------------------------ |
| totalExecutionTimeInMs | ms | Total time taken to execute the query, including all phases. |
| queryCompileTimeInMs | ms | Time spent compiling the query. |
| queryLogicalPlanBuildTimeInMs | ms | Time spent building the logical plan for the query. |
| queryPhysicalPlanBuildTimeInMs | ms | Time spent building the physical plan for the query. |
| queryOptimizationTimeInMs | ms | Time spent optimizing the query. |
| VMExecutionTimeInMs | ms | Time spent executing the query in the Cosmos DB VM. |
| indexLookupTimeInMs | ms | Time spent looking up indexes. |
| instructionCount | count | Number of instructions executed for the query. |
| documentLoadTimeInMs | ms | Time spent loading documents from storage. |
| systemFunctionExecuteTimeInMs | ms | Time spent executing system functions in the query. |
| userFunctionExecuteTimeInMs | ms | Time spent executing user-defined functions in the query. |
| retrievedDocumentCount | count | Number of documents retrieved by the query. |
| retrievedDocumentSize | bytes | Total size of documents retrieved. |
| outputDocumentCount | count | Number of documents returned as output. |
| outputDocumentSize | bytes | Total size of output documents. |
| writeOutputTimeInMs | ms | Time spent writing the output. |
| indexUtilizationRatio | ratio | Ratio of index utilization (1.0 means fully utilized). |

Conclusion

In this blog, we covered how to configure and customize the Go SDK for Azure Cosmos DB. We looked at retry policies, HTTP-level customizations, OpenTelemetry support, and how to access query metrics. The Go SDK for Azure Cosmos DB is designed to be flexible and customizable, allowing you to tailor it to your specific needs. For more information, refer to the package documentation and the GitHub repository. I hope you find this useful!

Resources

Go SDK for Azure Cosmos DB
Core Azure Go SDK package
ClientOptions
NewClient
The Cypress Edge: Next-Level Testing Strategies for React Developers

By Raju Dandigam
Introduction Testing is the backbone of building reliable software. As a React developer, you’ve likely heard about Cypress—a tool that’s been making waves in the testing community. But how do you go from writing your first test to mastering complex scenarios? Let’s break it down together, step by step, with real-world examples and practical advice. Why Cypress Stands Out for React Testing Imagine this: You’ve built a React component, but it breaks when a user interacts with it. You spend hours debugging, only to realize the issue was a missing prop. Cypress solves this pain point by letting you test components in isolation, catching errors early. Unlike traditional testing tools, Cypress runs directly in the browser, giving you a real-time preview of your tests. It’s like having a pair of eyes watching every click, hover, and API call. Key Advantages: Real-Time Testing: Runs in the browser with instant feedback.Automatic Waiting: Eliminates flaky tests caused by timing issues.Time Travel Debugging: Replay test states to pinpoint failures.Comprehensive Testing: Supports unit, integration, and end-to-end (E2E) tests Ever felt like switching between Jest, React Testing Library, and Puppeteer is like juggling flaming torches? Cypress simplifies this by handling component tests (isolated UI testing) and E2E tests (full user flows) in one toolkit. Component Testing vs. E2E Testing: What’s the Difference? Component Testing: Test individual React components in isolation. Perfect for verifying props, state, and UI behavior.E2E Testing: Simulate real user interactions across your entire app. Great for testing workflows like login → dashboard → checkout. Think of component tests as “microscope mode” and E2E tests as “helicopter view.” You need both to build confidence in your app. Setting Up Cypress in Your React Project Step 1: Install Cypress JavaScript npm install cypress --save-dev This installs Cypress as a development dependency. Pro Tip: If you’re using Create React App, ensure your project is ejected or configured to support Webpack 5. Cypress relies on Webpack for component testing. Step 2: Configure Cypress Create a cypress.config.js file in your project root: JavaScript const { defineConfig } = require('cypress'); module.exports = defineConfig({ component: { devServer: { framework: 'react', bundler: 'webpack', }, }, e2e: { setupNodeEvents(on, config) {}, baseUrl: 'http://localhost:3000', }, }); Step 3: Organize Your Tests JavaScript cypress/ ├── e2e/ # E2E test files │ └── login.cy.js ├── component/ # Component test files │ └── Button.cy.js └── fixtures/ # Mock data This separation ensures clarity and maintainability. Step 4: Launch the Cypress Test Runner JavaScript npx cypress open Select Component Testing and follow the prompts to configure your project. 
Writing Your First Test: A Button Component The Component Create src/components/Button.js: JavaScript import React from 'react'; const Button = ({ onClick, children, disabled = false }) => { return ( <button onClick={onClick} disabled={disabled} data-testid="custom-button" > {children} </button> ); }; export default Button; The Test Create cypress/component/Button.cy.js: JavaScript import React from 'react'; import Button from '../../src/components/Button'; describe('Button Component', () => { it('renders a clickable button', () => { const onClickSpy = cy.spy().as('onClickSpy'); cy.mount(<Button onClick={onClickSpy}>Submit</Button>); cy.get('[data-testid="custom-button"]').should('exist').and('have.text', 'Submit'); cy.get('[data-testid="custom-button"]').click(); cy.get('@onClickSpy').should('have.been.calledOnce'); }); it('disables the button when the disabled prop is true', () => { cy.mount(<Button disabled={true}>Disabled Button</Button>); cy.get('[data-testid="custom-button"]').should('be.disabled'); }); }); Key Takeaways: Spies:cy.spy() tracks function calls.Selectors:data-testid ensures robust targeting.Assertions: Chain .should() calls for readability.Aliases:cy.get('@onClickSpy') references spies. Advanced Testing Techniques Handling Context Providers Problem: Your component relies on React Router or Redux. Solution: Wrap it in a test provider. Testing React Router Components: JavaScript import { MemoryRouter } from 'react-router-dom'; cy.mount( <MemoryRouter initialEntries={['/dashboard']}> <Navbar /> </MemoryRouter> ); Testing Redux-Connected Components: JavaScript import { Provider } from 'react-redux'; import { store } from '../../src/redux/store'; cy.mount( <Provider store={store}> <UserProfile /> </Provider> ); Leveling Up: Testing a Form Component Let’s tackle a more complex example: a login form. 
The Component Create src/components/LoginForm.js: JavaScript import React, { useState } from 'react'; const LoginForm = ({ onSubmit }) => { const [email, setEmail] = useState(''); const [password, setPassword] = useState(''); const handleSubmit = (e) => { e.preventDefault(); if (email.trim() && password.trim()) { onSubmit({ email, password }); } }; return ( <form onSubmit={handleSubmit} data-testid="login-form"> <input type="email" value={email} onChange={(e) => setEmail(e.target.value)} data-testid="email-input" placeholder="Email" /> <input type="password" value={password} onChange={(e) => setPassword(e.target.value)} data-testid="password-input" placeholder="Password" /> <button type="submit" data-testid="submit-button"> Log In </button> </form> ); }; export default LoginForm; The Test Create cypress/component/LoginForm.spec.js: JavaScript import React from 'react'; import LoginForm from '../../src/components/LoginForm'; describe('LoginForm Component', () => { it('submits the form with email and password', () => { const onSubmitSpy = cy.spy().as('onSubmitSpy'); cy.mount(<LoginForm onSubmit={onSubmitSpy} />); cy.get('[data-testid="email-input"]').type('[email protected]').should('have.value', '[email protected]'); cy.get('[data-testid="password-input"]').type('password123').should('have.value', 'password123'); cy.get('[data-testid="submit-button"]').click(); cy.get('@onSubmitSpy').should('have.been.calledWith', { email: '[email protected]', password: 'password123', }); }); it('does not submit if email is missing', () => { const onSubmitSpy = cy.spy().as('onSubmitSpy'); cy.mount(<LoginForm onSubmit={onSubmitSpy} />); cy.get('[data-testid="password-input"]').type('password123'); cy.get('[data-testid="submit-button"]').click(); cy.get('@onSubmitSpy').should('not.have.been.called'); }); }); Key Takeaways: Use .type() to simulate user input.Chain assertions to validate input values.Test edge cases, such as missing fields. Authentication Shortcuts Problem: Testing authenticated routes without logging in every time.Solution: Use cy.session() to cache login state. JavaScript beforeEach(() => { cy.session('login', () => { cy.visit('/login'); cy.get('[data-testid="email-input"]').type('[email protected]'); cy.get('[data-testid="password-input"]').type('password123'); cy.get('[data-testid="submit-button"]').click(); cy.url().should('include', '/dashboard'); }); cy.visit('/dashboard'); // Now authenticated! }); This skips redundant logins across tests, saving time. Handling API Requests and Asynchronous Logic Most React apps fetch data from APIs. Let’s test a component that loads user data. The Component Create src/components/UserList.js: JavaScript import React, { useEffect, useState } from 'react'; import axios from 'axios'; const UserList = () => { const [users, setUsers] = useState([]); const [loading, setLoading] = useState(false); useEffect(() => { setLoading(true); axios.get('https://api.example.com/users') .then((response) => { setUsers(response.data); setLoading(false); }) .catch(() => setLoading(false)); }, []); return ( <div data-testid="user-list"> {loading ? 
( <p>Loading...</p> ) : ( <ul> {users.map((user) => ( <li key={user.id} data-testid={`user-${user.id}`}> {user.name} </li> ))} </ul> )} </div> ); }; export default UserList; The Test Create cypress/component/UserList.spec.js: JavaScript import React from 'react'; import UserList from '../../src/components/UserList'; describe('UserList Component', () => { it('displays a loading state and then renders users', () => { cy.intercept('GET', 'https://api.example.com/users', { delayMs: 1000, body: [{ id: 1, name: 'John Doe' }, { id: 2, name: 'Jane Smith' }], }).as('getUsers'); cy.mount(<UserList />); cy.get('[data-testid="user-list"]').contains('Loading...'); cy.wait('@getUsers').its('response.statusCode').should('eq', 200); cy.get('[data-testid="user-1"]').should('have.text', 'John Doe'); cy.get('[data-testid="user-2"]').should('have.text', 'Jane Smith'); }); it('handles API errors gracefully', () => { cy.intercept('GET', 'https://api.example.com/users', { statusCode: 500, body: 'Internal Server Error', }).as('getUsersFailed'); cy.mount(<UserList />); cy.wait('@getUsersFailed'); cy.get('[data-testid="user-list"]').should('be.empty'); }); }); Why This Works: cy.intercept() mocks API responses without hitting a real server.delayMs simulates network latency to test loading states.Testing error scenarios ensures your component doesn’t crash. Best Practices for Sustainable Tests Isolate Tests: Reset state between tests using beforeEach hooks.Use Custom Commands: Simplify repetitive tasks (e.g., logging in) by adding commands to cypress/support/commands.js.Avoid Conditional Logic: Don’t use if/else in tests—each test should be predictable.Leverage Fixtures: Store mock data in cypress/fixtures to keep tests clean. Use Data Attributes as Selectors Example: data-testid="email-input" instead of #email or .input-primary.Why? Class names and IDs change; test IDs don’t. Mock Strategically Component Tests: Mock child components with cy.stub().E2E Tests: Mock APIs with cy.intercept(). Keep Tests Atomic Test one behavior per block: One test for login success.Another for login failure. Write Resilient Assertions Instead of: JavaScript cy.get('button').should('have.class', 'active'); Write: JavaScript cy.get('[data-testid="status-button"]').should('have.attr', 'aria-checked', 'true'); Cypress Time Travel Cypress allows users to see test steps visually. Use .debug() to pause and inspect state mid-test. JavaScript cy.get('[data-testid="submit-button"]').click().debug(); FAQs: Your Cypress Questions Answered Q: How do I test components that use React Router? A: Wrap your component in a MemoryRouter to simulate routing in your tests: JavaScript cy.mount( <MemoryRouter> <YourComponent /> </MemoryRouter> ); Q: Can I run Cypress tests in CI/CD pipelines? A: Absolutely! You can run your tests head less in environments like GitHub Actions using the command: JavaScript cypress run Q: How do I run tests in parallel to speed up CI/CD? A: To speed up your tests, you can run them in parallel with the following command: JavaScript npx cypress run --parallel Q: How do I test file uploads? A: You can test file uploads by selecting a file input like this: JavaScript cy.get('input[type="file"]').selectFile('path/to/file.txt'); Wrapping Up Cypress revolutionizes testing by integrating it smoothly into your workflow. Begin with straightforward components and progressively address more complex scenarios to build your confidence and catch bugs before they affect users. 
Keep in mind that the objective isn't to achieve 100% test coverage; rather, it's about creating impactful tests that ultimately save you time and prevent future headaches.

Trend Report

Generative AI

AI technology is now more accessible, more intelligent, and easier to use than ever before. Generative AI, in particular, has transformed nearly every industry exponentially, creating a lasting impact driven by its (delivered) promises of cost savings, manual task reduction, and a slew of other benefits that improve overall productivity and efficiency. The applications of GenAI are expansive, and thanks to the democratization of large language models, AI is reaching every industry worldwide.

Our focus for DZone's 2025 Generative AI Trend Report is on the trends surrounding GenAI models, algorithms, and implementation, paying special attention to GenAI's impacts on code generation and software development as a whole. Featured in this report are key findings from our research and thought-provoking content written by everyday practitioners from the DZone Community, with topics including organizations' AI adoption maturity, the role of LLMs, AI-driven intelligent applications, agentic AI, and much more.

We hope this report serves as a guide to help readers assess their own organization's AI capabilities and how they can better leverage those in 2025 and beyond.


Refcard #158

Machine Learning Patterns and Anti-Patterns

By Tuhin Chattopadhyay

Refcard #269

Getting Started With Data Quality

By Miguel Garcia

More Articles

*You* Can Shape Trend Reports: Join DZone's Software Supply Chain Security Research

Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you choose) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below. Software Supply Chain Security Research Supply chains aren't just for physical products anymore; they're a critical part of how software is built and delivered. At DZone, we're taking a closer look at the state of software supply chain security to understand how development teams are navigating emerging risks through smarter tooling, stronger practices, and the strategic use of AI. Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. We're exploring key topics such as: SBOM adoption and real-world usageThe role of AI and ML in threat detectionImplementation of zero trust security modelsCloud and open-source security posturesModern approaches to incident response Join the Security Research We’ve also created some painfully relatable memes about the state of software supply chain security. If you’ve ever muttered “this is fine” while scanning dependencies, these are for you! Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help! —The DZone Content and Community team

By Lauren Forbes
Mastering Advanced Traffic Management in Multi-Cloud Kubernetes: Scaling With Multiple Istio Ingress Gateways

In my experience managing large-scale Kubernetes deployments across multi-cloud platforms, traffic control often becomes a critical bottleneck, especially when dealing with mixed workloads like APIs, UIs, and transactional systems. While Istio’s default ingress gateway does a decent job, I found that relying on a single gateway can introduce scaling and isolation challenges. That’s where configuring multiple Istio Ingress Gateways can make a real difference. In this article, I’ll walk you through how I approached this setup, what benefits it unlocked for our team, and the hands-on steps we used, along with best practices and YAML configurations that you can adapt in your own clusters. Why Do We Use an Additional Ingress Gateway? Using an additional Istio Ingress Gateway provides several advantages: Traffic isolation: Route traffic based on workload-specific needs (e.g., API traffic vs. UI traffic or transactional vs. non-transactional applications).Multi-tenancy: Different teams can have their gateway while still using a shared service mesh.Scalability: Distribute traffic across multiple gateways to handle higher loads efficiently.Security and compliance: Apply different security policies to specific gateway instances.Flexibility: You can create any number of additional ingress gateways based on project or application needs.Best practices: Kubernetes teams often use Horizontal Pod Autoscaler (HPA), Pod Disruption Budget (PDB), Services, Gateways, and Region-Based Filtering (via Envoy Filters) to enhance reliability and performance. Understanding Istio Architecture Istio IngressGateway and Sidecar Proxy: Ensuring Secure Traffic Flow When I first began working with Istio, one of the key concepts that stood out was the use of sidecar proxies. Every pod in the mesh requires an Envoy sidecar to manage traffic securely. This ensures that no pod can bypass security or observability policies. Without a sidecar proxy, applications cannot communicate internally or with external sources.The Istio Ingress Gateway manages external traffic entry but relies on sidecar proxies to enforce security and routing policies.This enables zero-trust networking, observability, and resilience across microservices. How Traffic Flows in Istio With Single and Multiple Ingress Gateways In an Istio service mesh, all external traffic follows a structured flow before reaching backend services. The Cloud Load Balancer acts as the entry point, forwarding requests to the Istio Gateway Resource, which determines traffic routing based on predefined policies. Here's how we structured the traffic flow in our setup: Cloud Load Balancer receives external requests and forwards them to Istio's Gateway Resource.The Gateway Resource evaluates routing rules and directs traffic to the appropriate ingress gateway: Primary ingress gateway: Handles UI requests.Additional ingress gateways: Route API, transactional, and non-transactional traffic separately.Envoy Sidecar Proxies enforce security policies, manage traffic routing, and monitor observability metrics.Requests are forwarded to the respective Virtual Services, which process and direct them to the final backend service. This structure ensures better traffic segmentation, security, and performance scalability, especially in multi-cloud Kubernetes deployments. Figure 1: Istio Service Mesh Architecture – Traffic routing from Cloud Load Balancer to Istio Gateway Resource, Ingress Gateways, and Service Mesh. 
Key Components of Istio Architecture

Ingress gateway: Handles external traffic and routes requests based on policies.
Sidecar proxy: Ensures all service-to-service communication follows Istio-managed rules.
Control plane: Manages traffic control, security policies, and service discovery.

Organizations can configure multiple Istio Ingress Gateways by leveraging these components to enhance traffic segmentation, security, and performance across multi-cloud environments.

Comparison: Single vs. Multiple Ingress Gateways

We started with a single ingress gateway and quickly realized that as traffic grew, it became a bottleneck. Splitting traffic using multiple ingress gateways was a simple but powerful change that drastically improved routing efficiency and fault isolation. On the other hand, multiple ingress gateways allowed better traffic segmentation for APIs, UI, and transaction-based workloads, improved security enforcement by isolating sensitive traffic, and scalability and high availability, ensuring each type of request is handled optimally. The following diagram compares a single Istio Ingress Gateway with multiple ingress gateways for handling API and web traffic.

Figure 2: Single vs. Multiple Istio Ingress Gateways – Comparing routing, traffic segmentation, and scalability differences.

Key takeaways from the comparison:

A single Istio Ingress Gateway routes all traffic through a single entry point, which may become a bottleneck.
Multiple ingress gateways allow better traffic segmentation, handling API traffic and UI traffic separately.
Security policies and scaling strategies can be defined per gateway, making it ideal for multi-cloud or multi-region deployments.

| Feature | Single Ingress Gateway | Multiple Ingress Gateways |
| ----------------- | --------------------------------------------------------- | ----------------------------------------------------- |
| Traffic Isolation | No isolation, all traffic routes through a single gateway | Different gateways for UI, API, transactional traffic |
| Resilience | If the single gateway fails, traffic is disrupted | Additional ingress gateways ensure redundancy |
| Scalability | Traffic bottlenecks may occur | Load distributed across multiple gateways |
| Security | Same security rules apply to all traffic | Custom security policies per gateway |

Setting Up an Additional Ingress Gateway

How Additional Ingress Gateways Improve Traffic Routing

We tested routing different workloads (UI, API, transactional) through separate gateways. This gave each gateway its own scaling behavior and security profile. It also helped isolate production incidents — for example, UI errors no longer impacted transactional requests. The diagram below illustrates how multiple Istio Ingress Gateways efficiently manage API, UI, and transactional traffic.

Figure 3: Multi-Gateway Traffic Flow – External traffic segmentation across API, UI, and transactional ingress gateways.

How it works:

Cloud Load Balancer forwards traffic to the Istio Gateway Resource, which determines routing rules.
Traffic is directed to different ingress gateways: the primary ingress gateway handles UI traffic, the API ingress gateway handles API requests, and the transactional ingress gateway ensures financial transactions and payments are processed securely.
The Service Mesh enforces security, traffic policies, and observability.

Step 1: Install Istio and Configure Operator

For our setup, we used Istio’s Operator pattern to manage lifecycle operations. It’s flexible and integrates well with GitOps workflows.

Prerequisites

Kubernetes cluster with Istio installed
Helm installed for deploying Istio components

Ensure you have Istio installed.
If not, install it using the following commands: Plain Text curl -L https://istio.io/downloadIstio | ISTIO_VERSION=$(istio_version) TARGET_ARCH=x86_64 sh - export PATH="$HOME/istio-$ISTIO_VERSION/bin:$PATH" Initialize the Istio Operator Plain Text istioctl operator init Verify the Installation Plain Text kubectl get crd | grep istio Alternative Installation Using Helm Istio Ingress Gateway configurations can be managed using Helm charts for better flexibility and reusability. This allows teams to define customizable values.yaml files and deploy gateways dynamically. Helm upgrade command: Plain Text helm upgrade --install istio-ingress istio/gateway -f values.yaml This allows dynamic configuration management, making it easier to manage multiple ingress gateways. Step 2: Configure Additional Ingress Gateways With IstioOperator We defined separate gateways in the IstioOperator config (additional-ingress-gateway.yaml) — one for UI and one for API — and kept them logically grouped using Helm values files. This made our Helm pipelines cleaner and easier to scale or modify. Below is an example configuration to create multiple additional ingress gateways for different traffic types: YAML apiVersion: install.istio.io/v1alpha1 kind: IstioOperator metadata: name: additional-ingressgateways namespace: istio-system spec: components: ingressGateways: - name: istio-ingressgateway-ui enabled: true k8s: service: type: LoadBalancer - name: istio-ingressgateway-api enabled: true k8s: service: type: LoadBalancer Step 3: Additional Configuration Examples for Helm We found that adding HPA and PDB configs early helped ensure we didn’t hit availability issues during upgrades. This saved us during one incident where the default config couldn’t handle a traffic spike in the API gateway. Below are sample configurations for key Kubernetes objects that enhance the ingress gateway setup: Horizontal Pod Autoscaler (HPA) YAML apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: ingressgateway-hpa namespace: istio-system spec: minReplicas: 2 maxReplicas: 10 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 80 scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: istio-ingressgateway Pod Disruption Budget (PDB) YAML apiVersion: policy/v1 kind: PodDisruptionBudget metadata: name: ingressgateway-pdb namespace: istio-system spec: minAvailable: 1 selector: matchLabels: app: istio-ingressgateway Region-Based Envoy Filter YAML apiVersion: networking.istio.io/v1alpha3 kind: EnvoyFilter metadata: name: region-header-filter namespace: istio-system spec: configPatches: - applyTo: HTTP_FILTER match: context: GATEWAY listener: filterChain: filter: name: envoy.filters.network.http_connection_manager subFilter: name: envoy.filters.http.router proxy: proxyVersion: ^1\.18.* patch: operation: INSERT_BEFORE value: name: envoy.filters.http.lua typed_config: '@type': type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua inlineCode: | function envoy_on_response(response_handle) response_handle:headers():add("X-Region", "us-eus"); end Step 4: Deploy Additional Ingress Gateways Apply the configuration using istioctl: Plain Text istioctl install -f additional-ingress-gateway.yaml Verify that the new ingress gateways are running: Plain Text kubectl get pods -n istio-system | grep ingressgateway After applying the configuration, we monitored the rollout using kubectl get pods and validated each gateway's service endpoint. 
Naming conventions like istio-ingressgateway-ui really helped keep things organized. Step 5: Define Gateway Resources for Each Ingress Each ingress gateway should have a corresponding gateway resource. Below is an example of defining separate gateways for UI, API, transactional, and non-transactional traffic: YAML apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: my-ui-gateway namespace: default spec: selector: istio: istio-ingressgateway-ui servers: - port: number: 443 name: https protocol: HTTPS hosts: - "ui.example.com" Repeat similar configurations for API, transactional, and non-transactional ingress gateways. Make sure your gateway resources use the correct selector. We missed this during our first attempt, and traffic didn’t route properly — a simple detail, big impact. Step 6: Route Traffic Using Virtual Services Once the gateways are configured, create Virtual Services to control traffic flow to respective services. YAML apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: my-api-service namespace: default spec: hosts: - "api.example.com" gateways: - my-api-gateway http: - route: - destination: host: my-api port: number: 80 Repeat similar configurations for UI, transactional, and non-transactional services. Just a note that VirtualServices gives you fine-grained control over traffic. We even used them to test traffic mirroring and canary rollouts between the gateways. Resilience and High Availability With Additional Ingress Gateways One of the biggest benefits we noticed: zero downtime during regional failovers. Having dedicated gateways meant we could perform rolling updates with zero user impact. This model also helped us comply with region-specific policies by isolating sensitive data flows per gateway — a crucial point when dealing with financial workloads. If the primary ingress gateway fails, additional ingress gateways can take over traffic seamlessly.When performing rolling upgrades or Kubernetes version upgrades, separating ingress traffic reduces downtime risk.In multi-region or multi-cloud Kubernetes clusters, additional ingress gateways allow better control of regional traffic and compliance with local regulations. Deploying additional IngressGateways enhances resilience and fault tolerance in a Kubernetes environment. Best Practices and Lessons Learned Many teams forget that Istio sidecars must be injected into every application pod to ensure service-to-service communication. Below are some lessons we learned the hard way When deploying additional ingress gateways, consider implementing: Horizontal Pod Autoscaler (HPA): Automatically scale ingress gateways based on CPU and memory usage.Pod Disruption Budgets (PDB): Ensure high availability during node upgrades or failures.Region-Based Filtering (EnvoyFilter): Optimize traffic routing by dynamically setting request headers with the appropriate region.Dedicated services and gateways: Separate logical entities for better security and traffic isolation.Ensure automatic sidecar injection is enabled in your namespace: Plain Text kubectl label namespace <your-namespace> istio-injection=enabled Validate that all pods have sidecars using: Plain Text kubectl get pods -n <your-namespace> -o wide kubectl get pods -n <your-namespace> -o jsonpath='{.items[*].spec.containers[*].name}' | grep istio-proxy Without sidecars, services will not be able to communicate, leading to failed requests and broken traffic flow. 
When upgrading additional ingress gateways, consider the following: Delete old Istio configurations (if needed): If you are upgrading or modifying Istio, delete outdated configurations: Plain Text kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io istio-sidecar-injector kubectl get crd --all-namespaces | grep istio | awk '{print $1}' | xargs kubectl delete crd Ensure updates to proxy version, deployment image, and service labels during upgrades to avoid compatibility issues. YAML proxyVersion: ^1.18.* image: docker.io/istio/proxyv2:1.18.6 Scaling down Istio Operator: Before upgrading, scale down the Istio Operator to avoid disruptions. Plain Text kubectl scale deployment -n istio-operator istio-operator --replicas=0 Backup before upgrade: Plain Text kubectl get deploy,svc,cm,secret -n istio-system -o yaml > istio-backup.yaml Monitoring and Observability With Grafana With Istio's built-in monitoring, Grafana dashboards provide a way to segregate traffic flow by ingress type: Monitor API, UI, transactional, and non-transactional traffic separately.Quickly identify which traffic type is affected when an issue occurs in Production using Prometheus-based metricsIstio Gateway metrics can be monitored in Grafana & Prometheus to track traffic patterns, latency, and errors.It provides real-time metrics for troubleshooting and performance optimization.Using PrometheusAlertmanager, configure alerts for high error rates, latency spikes, and failed request patterns to improve reliability. FYI, we extended our dashboards in Grafana to visualize traffic per gateway. This was a game-changer — we could instantly see which gateway was spiking and correlate it to service metrics. Prometheus alerting was configured to trigger based on error rates per ingress type. This helped us catch and resolve issues before they impacted end users. Conclusion Implementing multiple Istio Ingress Gateways significantly transformed the architecture of our Kubernetes environments. This approach enabled us to independently scale different types of traffic, enforce custom security policies per gateway, and gain enhanced control over traffic management, scalability, security, and observability. By segmenting traffic into dedicated ingress gateways — for UI, API, transactional, and non-transactional services — we achieved stronger isolation, improved load balancing, and more granular policy enforcement across teams. This approach is particularly critical in multi-cloud Kubernetes environments, such as Azure AKS, Google GKE, Amazon EKS, Red Hat OpenShift, VMware Tanzu Kubernetes Grid, IBM Cloud Kubernetes Service, Oracle OKE, and self-managed Kubernetes clusters, where regional traffic routing, failover handling, and security compliance must be carefully managed. By leveraging best practices, including: Sidecar proxies for service-to-service securityHPA (HorizontalPodAutoscaler) for autoscalingPDB (PodDisruptionBudget) for availabilityEnvoy filters for intelligent traffic routingHelm-based deployments for dynamic configuration Organizations can build a highly resilient and efficient Kubernetes networking stack. Additionally, monitoring dashboards like Grafana and Prometheus provide deep observability into ingress traffic patterns, latency trends, and failure points, allowing real-time tracking of traffic flow, quick root-cause analysis, and proactive issue resolution. 
By following these principles, organizations can optimize their Istio-based service mesh architecture, ensuring high availability, enhanced security posture, and seamless performance across distributed cloud environments. References Istio Architecture OverviewIstio Ingress Gateway vs. Kubernetes IngressIstio Install Guide (Using Helm or Istioctl)Istio Operator & Profiles for Custom Deployments Best Practices for Istio Sidecar InjectionIstio Traffic Management: VirtualServices, Gateways & DestinationRules

By Prabhu Chinnasamy
Measuring the Impact of AI on Software Engineering Productivity

It is hard to imagine that, not long ago, AI was not front and center of our everyday news, let alone of the software engineering world. The advent of LLMs, coupled with the existing compute power, catapulted the use of AI into our everyday lives and, in particular, into the life of a software engineer. This article breaks down some of the use cases of AI in software engineering and suggests a path to investigate the key question: Did we actually become more productive?

It has only been a few years since the inception of GitHub Copilot in 2021. Since then, AI-assisted coding tools have had a significant impact on software engineering practices. As of 2024, it is estimated that 75% of developers use some kind of AI tool. Often, these tools are not fully rolled out in organizations and are used on the side. However, Gartner estimates that we will reach 90% enterprise adoption by 2028. Today there are dozens of tools that help, or claim to help, software engineers in their daily lives. Besides GitHub Copilot, ChatGPT, and Google Gemini, common tools include GitLab Duo, Claude, JetBrains AI, Cody, Bolt, Cursor, and AWS CodeWhisperer. Updates are reported almost daily, leading to new and advanced solutions.

AI Assisted Coding with GitHub Copilot

AI in Software Engineering: What is Changing?

Looking at the use cases inside engineering organizations, we can identify a number of key purposes:

Building proof of concepts and scaffolding quickly for new products. Engineers use AI-based solutions that leverage intrinsic knowledge about frameworks for generating initial blueprints and solutions. Solutions include Bolt, v0, and similar.

Writing new code, iterating on existing code, and using AI as a perceived productivity assistant. The purpose is to quickly iterate on existing solutions and have an AI-supported “knowledge base” and assistance. This type of AI not only produces code but, to a degree, replaces expert knowledge forums and sites such as Stack Overflow. This is the space where we have seen the most success, with solutions being embedded in the IDE, connected to repositories, and tightly integrated into the software development process.

Automating engineering processes through Agentic AI. The latest approach is increasing the level of automation on niche tasks as well as connecting across tasks and development silos. Besides automating more mundane tasks, Agentic AI is shaping up to be helpful in creating test cases, optimizing build pipelines, and managing the whole planning-to-release process, and it is an area of much ongoing development.

For the purpose of this article, let us focus on the most mature technology: AI-assisted coding solutions. Despite all the progress and the increasing adoption of AI, the main question remains: Are we any more productive? Productivity means getting done what needs to be done with a particular benefit in mind. Producing more code can be a step in the right direction, but it might also have unintended consequences: producing low-quality code, code that works but does not meet the intention, or junior developers blindly accepting code, leading to issues down the road. Obviously, a lot depends on the skill of the prompt engineer (asking the right question), the ability to iterate on the AI-generated code (the expertise and experience of the developer), and of course on the maturity of the AI technology. Let us dive into the productivity aspect in more detail.
AI and Productivity: The Big Unknown

One of the key questions in rolling out AI tools across an engineering organization is judging their productivity impact. How do we know if and when AI-assisted coding really helps our organization become more productive? What are good indicators, and what might be good metrics to measure and track productivity over time? Firstly, as mentioned above, productivity does not mean simply writing more code. More code is just more code. It does not necessarily do anything useful or add something to the product that is actually needed. Nonetheless, more code produced quickly is helpful if it solves a business problem. Internal indicators for this can be that feature tickets get resolved quicker, code reviews are accepted (quickly), and security and quality criteria are met—either through higher pre-release pass rates or a lower volume of incident tickets post-release. As such, some common indicators for productivity are:

The throughput of your accepted coding activities, for instance defined by the number of PRs you get approved and merged in a week.
The number of feature tickets or tasks that can be resolved in a sprint, for instance measured by the number of planning tickets you can complete.
The quality and security standard of your coding activities. For instance, does AI coding assistance generate fewer security issues, do quality tests pass more often, or do code reviews take less time and fewer cycles?
The time it takes to get any of the above done and a release out of the door. Do you release more often? Are your release pipelines more reliable?

All things being equal, in a productive AI-assisted coding organization we would expect you to be able to ship more or ship faster—ideally both.

ROI: Measuring the Impact of AI

The best time to measure your engineering productivity is today. Productivity is never a single number, and the trend is important. Having a baseline to measure the current state against future organizational and process improvements is crucial to gauge any productivity gains. If you haven't invested heavily in AI tooling yet but are planning to, it is a good time to establish a baseline. If you have invested in AI, it is essential to track ongoing changes over time. You can do this with manual investigation at certain points in time, or automatically and continuously with software engineering intelligence platforms such as Logilica, which not only track your ongoing metrics but also enable you to look forensically into past and future project states. There are a number of key metrics we suggest tracking to see if your AI investment pays off. We suggest centering them around the following dimensions:

Speed of delivery: Are you able to deliver faster than before? That is, are you able to respond to customer needs and market demand more quickly and more flexibly? Indicators are your release cadence, your lead time for releases, lead time for customer and planning tickets, and even cycle times for individual code activities (PRs).
Features shipped: Are you able to actually ship more? Not just producing more code, but finishing more planned tasks, approving and merging more code activity (PRs), and shipping more or larger releases? Throughput metrics are important if they are balanced with time and quality metrics.
Level of predictability: One main challenge in software engineering is delivering on target without letting deadlines or scope slip. Do your AI initiatives help you with this?
For instance, do you hit target dates more reliably? Does your sprint planning improve and, conversely, are you able to reduce your sprint overruns? Does your effort align more reliably with business expectations, e.g., do you track whether new feature work increases while bug fixing/technical debt decreases?
Security/quality expectations: Does your downstream release pipeline improve, with fewer build failures? Do you hit your testing and security scanning criteria? Do you see fewer support tickets since the introduction of AI? Is there a change in user sentiment that supports your ongoing investment?
Developer team health: Lastly, does the introduction of AI positively impact your developer team health, leading to less overload and happier teams? This is a big one and much less clear-cut than one might expect. While AI-assisted development can produce more code more quickly, it is unclear whether it creates more burden elsewhere. For instance, more code means more code reviews, easily making humans a bottleneck again, which leads to frustration and burnout. Also, AI-generated code might be larger, leading to larger PRs where the submitter has less confidence in their own AI-assisted code. QA/security might feel the extra burden, and customers may report more bugs that take longer to resolve.

Overall, it is essential to track engineering processes and key metrics across multiple dimensions simultaneously, ensuring that your AI investment actually delivers positive, measurable productivity gains.

Tracking the Impact of AI Assisted Development

Outlook

AI-assisted development has arrived. It is a new reality that will rapidly permeate all parts of the software development lifecycle. As such, it is critical to build up the expertise and strategies to use the technology in the most beneficial way. Ensuring success requires the right level of visibility into software engineering processes to provide the essential observability for decision makers. Those decisions are two-fold: justifying the investment to executive teams with data-driven evidence, and setting the right guardrails for productivity improvements and process goals. There is an inevitable hype cycle around AI-assisted coding. To look beyond the hype, it is important to measure the positive impact and steer adoption in the right direction, ensuring a positive business impact. Software Engineering Intelligence (SEI) platforms connect with your engineering data and give you visibility into your processes and bottlenecks to answer the questions above. These platforms automate the measurement and analytics process so you can focus on data-driven decision making. In future parts of this series we will dive into the details of how predictive models can be applied to your engineering processes, how you can use AI to monitor your software engineering AI, and how SEI platforms can help you build high-performance engineering organizations.

By Ralf Huuck
Simplify Authorization in Ruby on Rails With the Power of Pundit Gem
Simplify Authorization in Ruby on Rails With the Power of Pundit Gem

Hi, I'm Denis, a backend developer. I've recently been working on building a robust all-in-one CRM system for HR and finance, website, and team management. Using the Pundit gem, I was able to build an efficient role-based access system, and now I'd like to share my experience. Managing authorization efficiently became a crucial challenge as this system expanded, requiring a solution that was both scalable and easy to maintain. In Ruby on Rails, handling user access can quickly become complex, but the Pundit gem keeps it manageable. In this article, I will explore how Pundit simplifies authorization, its core concepts, and how you can integrate it seamlessly into your Rails project. By the end of this guide, you'll hopefully have a clear understanding of how to use Pundit effectively to manage user permissions and access control in your Rails applications.

Why Choose Pundit for Authorization?

Pundit is a lightweight and straightforward authorization library for Rails that follows the policy-based authorization pattern. Unlike other authorization libraries, such as CanCanCan, which rely on rule-based permissions, Pundit uses plain Ruby classes to define access rules. Some key benefits of Pundit include:

Simplicity: Uses plain old Ruby objects (POROs) to define policies, making them easy to read and maintain.
Flexibility: Provides fine-grained control over permissions, allowing complex authorization rules to be implemented easily.
Scalability: Easily extendable as your application grows without adding unnecessary complexity.
Security: Encourages explicit authorization checks, reducing the risk of unauthorized access to resources.

Unlike role-based systems that define broad roles, Pundit allows for granular, action-specific permission handling. This approach improves maintainability and prevents bloated permission models.

Pundit vs. CanCanCan

Pundit and CanCanCan are both popular authorization libraries for Rails, but they take different approaches:

Feature | Pundit | CanCanCan
Authorization Method | Policy-based (separate classes for each resource) | Rule-based (centralized abilities file)
Flexibility | High (you define logic explicitly) | Medium (relies on DSL rules)
Complexity | Lower (straightforward Ruby classes) | Higher (complex rules can be harder to debug)
Performance | Generally better for large applications | Can slow down with many rules

If you need explicit, granular control over access, Pundit is often the better choice. If you prefer a more declarative, centralized way of defining permissions, CanCanCan might be suitable.

Getting Started With Pundit

Before diving into using Pundit, it's important to understand how it fits into Rails' authorization system. By relying on clear, independent policies, Pundit keeps your code maintainable and easy to follow. Now, let's walk through the setup process and see how you can start using Pundit to manage access control in your application.

1. Installing Pundit Gem

To begin using Pundit in your Rails project, add it to your Gemfile, run bundle install, and then run Pundit's install generator (a sketch of these steps appears below). The generator creates an ApplicationPolicy base class that will be used for defining your policies. This base class provides default behavior for authorization checks and serves as a template for the specific policies you create for different models.

2. Defining Policies

Policies in Pundit are responsible for defining authorization rules for a given model or resource. A policy is simply a Ruby class stored inside the app/policies/ directory.
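A minimal sketch of the installation steps above, reconstructed here because the original listings are not included in this excerpt; the ApplicationPolicy shown is the typical shape of Pundit's generated template, trimmed for brevity.

Ruby
# Gemfile
gem "pundit"

# From the shell:
#   bundle install
#   rails generate pundit:install      # creates app/policies/application_policy.rb
#   rails generate pundit:policy post  # optional: scaffolds app/policies/post_policy.rb

# app/policies/application_policy.rb (typical generated shape, trimmed)
class ApplicationPolicy
  attr_reader :user, :record

  def initialize(user, record)
    @user = user
    @record = record
  end

  # One predicate per action; the generated methods deny access by default.
  def show?
    false
  end

  def update?
    false
  end

  class Scope
    def initialize(user, scope)
      @user = user
      @scope = scope
    end

    def resolve
      raise NotImplementedError, "You must define #resolve in #{self.class}"
    end

    private

    attr_reader :user, :scope
  end
end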
For example, generating a policy for a Post model creates a PostPolicy class inside app/policies/post_policy.rb. Each method in the policy defines an action (e.g., show?, update?, destroy?) and returns true or false based on whether the user has permission to perform that action. Keeping policy methods small and specific makes them easy to read and debug.

3. Using Policies in Controllers

In your controllers, you can leverage Pundit's authorize method to enforce policies. In a PostsController, authorize @post automatically maps to PostPolicy and calls the appropriate method based on the controller action. This ensures authorization is consistently checked before performing actions on a resource.

4. Handling Authorization at the View Level

Pundit provides the policy helper, which allows you to check permissions in views. You can also use policy_scope to filter records based on permissions. This ensures that only authorized data is displayed to the user, preventing unauthorized access even at the UI level (although loading data with a policy scope is recommended outside the view layer).

5. Custom Scopes for Querying Data

Pundit allows you to define custom scopes for fetching data based on user roles by adding a Scope class to PostPolicy and calling policy_scope in the controller. This ensures users only see records they are authorized to view, adding an extra layer of security and data privacy. In our experience, it is often necessary to load data from another scope, in which case you need to pass additional parameters when loading data from the policy scope. Also, when you have several scopes for one policy, you can specify which one you need (by default, Pundit calls the scope's resolve method).

6. Rescuing a Denied Authorization in Rails

It's important not only to verify authorization correctly but also to handle errors and access permissions properly. In my implementation, I used role-based access rules to ensure secure and flexible control over user permissions, preventing unauthorized actions while maintaining a smooth user experience. I won't dwell on them in this article, as I described them in detail in one of my recent CRM overviews. Pundit raises a Pundit::NotAuthorizedError that you can rescue_from in your ApplicationController, and you can customize the user_not_authorized handler per controller to change how your application behaves when access is denied.

Best Practices for Using Pundit

To get the most out of Pundit, it's essential to follow best practices that keep your authorization logic clean, efficient, and scalable. Let's explore some key strategies to keep your policies well-structured and your application secure.

1. Creating a Separate Module: A Clean and Reusable Approach

A well-structured application benefits from modularization, reducing repetitive code and improving maintainability. A dedicated module encapsulates authorization logic, making it easy to reuse across multiple controllers. Let's break it down. The load_and_authorize_resource method is a powerful helper that:

Loads the resource based on the controller action.
Authorizes the resource using Pundit.
Automatically assigns the loaded resource to an instance variable for use in the controller actions.

This means that controllers no longer need to explicitly load and authorize records, reducing boilerplate code (a consolidated sketch of these pieces follows below).
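The original listings are not included in this excerpt, so here is a consolidated sketch of the pieces described above, assuming a typical Pundit setup. Everything beyond Pundit's own API (the model attributes, redirect target, and the AuthorizationMethods module) is illustrative rather than the author's actual code.

Ruby
# app/policies/post_policy.rb
class PostPolicy < ApplicationPolicy
  def show?
    true
  end

  def update?
    user.admin? || record.author == user   # assumes Post#author and User#admin?
  end

  # Query-level authorization: only return records this user may see.
  class Scope < ApplicationPolicy::Scope
    def resolve
      user.admin? ? scope.all : scope.where(published: true)
    end
  end
end

# app/controllers/posts_controller.rb
class PostsController < ApplicationController
  def index
    @posts = policy_scope(Post)   # resolved via PostPolicy::Scope#resolve
  end

  def update
    @post = Post.find(params[:id])
    authorize @post               # calls PostPolicy#update? and raises if it returns false
    @post.update(post_params)
  end

  private

  def post_params
    params.require(:post).permit(:title, :body)   # illustrative attributes
  end
end

# When a policy has several scopes, a specific one can be invoked explicitly:
#   PostPolicy::DraftScope.new(current_user, Post).resolve

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  include Pundit::Authorization   # `include Pundit` on older Pundit versions
  include AuthorizationMethods

  rescue_from Pundit::NotAuthorizedError, with: :user_not_authorized

  private

  def user_not_authorized
    redirect_back fallback_location: root_path, alert: "You are not authorized to perform this action."
  end
end

# app/controllers/concerns/authorization_methods.rb
# Hypothetical version of the module the author describes; the real one is not shown here.
module AuthorizationMethods
  extend ActiveSupport::Concern

  class_methods do
    def load_and_authorize_resource(model)
      before_action do
        @resource =
          case action_name
          when "index"         then policy_scope(model)
          when "new", "create" then model.new
          else                      model.find(params[:id])
          end
        authorize @resource unless action_name == "index"
      end
    end
  end
end

In a view, policy(@post).update? can then guard edit links, keeping the UI consistent with the controller-level checks.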
For example, the load_and_authorize_resource helper dynamically loads records based on the controller action:

index: Loads all records.
new/create: Initializes a new record.
Other actions: Fetches a specific record using a flexible find strategy.

This makes it easy to add authorization without cluttering individual controllers.

2. Applying It in a Controller

With the AuthorizationMethods module included in ApplicationController, controllers become much cleaner. For example, in PostsController, loading and authorizing a Post record is as simple as a single load_and_authorize_resource call. With load_and_authorize_resource, the controller:

✅ Automatically loads Post records
✅ Ensures authorization is enforced
✅ Remains clean and maintainable

Other Best Practices for Pundit

Keep policies concise and focused. Each policy should only contain logic related to authorization to maintain clarity and separation of concerns.
Use scopes for query-level authorization. This ensures that unauthorized data is never retrieved from the database, improving both security and efficiency.
Always call authorize in controllers. This prevents accidental exposure of sensitive actions by ensuring explicit permission checks.
Avoid authorization logic in models. Keep concerns separate by handling authorization in policies rather than embedding logic within models.

Wrap Up

Pundit simplifies authorization in Ruby on Rails by providing a clean and structured way to define and enforce permissions. By using policies, scopes, and controller-based authorization, you can create secure, maintainable, and scalable applications with minimal complexity. If you're building a Rails app that requires role-based access control, Pundit is a powerful tool that can streamline your authorization logic while keeping your codebase clean and easy to manage.

By Denys Kozlovskyi
Cookies Revisited: A Networking Solution for Third-Party Cookies
Cookies Revisited: A Networking Solution for Third-Party Cookies

Cookies are fundamental aspects of a web application that end users and developers frequently deal with. A cookie is a small piece of data that is stored in a user's browser. The data element is used as a medium to communicate information between the web browser and the application's server-side layer. Cookies serve various purposes, such as remembering a user's credentials (not recommended), targeting advertisements (tracking cookies), or helping to maintain a user's authentication status in a web application. Several fantastic articles have been written on cookies over the years. This article focuses on handling cross-domain, aka third-party, cookies.

Cookie Types

Before jumping straight into the main goal of this article, let's briefly highlight the categories into which we can break cookies down. One category is based on the type of use case they solve, and the other is based on the ownership of the cookie.

Breakdown Based on Use Cases

Session Cookie

As the name suggests, a session cookie is used to manage a user's web session. Typically, the server sends the cookie back to the browser after successful authentication in the "Set-Cookie" response header. The browser sends the cookie as part of the request in subsequent calls to the server. The server validates the cookie to make sure the user is still authenticated before responding with data. If the user logs out or the session times out, the cookie is invalidated. Likewise, if the user closes the browser, the session cookie becomes inaccessible.

JavaScript
// Setting a session cookie
document.cookie = "session_cookie=value; path=/";

Persistent Cookie

A persistent cookie is a cookie that doesn't die when the browser is closed or when the user signs out of a web application. Its purpose is to retain some information on the user's workstation for a longer period of time. One common use case for a persistent cookie is two-factor authentication on a website. We've all encountered this experience, especially when logging into online banking portals. After entering our user ID and password, we're often prompted for a second layer of authentication. This second factor is typically a one-time passcode (OTP), which is sent either to our mobile device via SMS or voice call, or to our email address (though using email is generally discouraged, as email accounts are more prone to compromise). Generally, the second-factor authentication screen gives us the option to remember the device. If we choose that option, the application typically generates a random code and persists it on the server side. The application sets that random code as a persistent cookie and sends it back to the browser. During subsequent logins, the client-side code of the application sends the persistent cookie in the request after successful authentication. If the server-side code finds the persistent cookie valid, it doesn't prompt the user for second-factor authentication. Otherwise, the user is challenged for the OTP code.
JavaScript
// Setting a persistent cookie for 30 days
var expirationDate = new Date();
expirationDate.setDate(expirationDate.getDate() + 30);
document.cookie = "persistent_cookie=value; expires=" + expirationDate.toUTCString() + "; path=/";

Tracking Cookie

Unlike session or persistent cookies, which have been common since the inception of cookie-based solutions in web applications, tracking cookies are comparatively new and mostly a phenomenon of the past decade. Here, a website tracks users' browsing activity and stores it in the browser. Later, it is used to display relevant advertisements to users when they access the internet from the same browser. As a tracking cookie is used to capture user data, websites that implement tracking cookies prompt users to accept or reject them.

JavaScript
// Setting a tracking cookie with SameSite=None and Secure option for cross-site access
var expirationDate = new Date();
expirationDate.setDate(expirationDate.getDate() + 365); // Expires after 1 year
document.cookie = "tracking_cookie=value; expires=" + expirationDate.toUTCString() + "; path=/; SameSite=None; Secure";

Breakdown Based on Ownership

First-Party Cookie

Imagine we open the URL www.abc.com in a browser tab. The website uses cookies, and a cookie is set in the browser. As the URL in the browser, www.abc.com, matches the domain of the cookie, it is a first-party cookie. In other words, a cookie issued in the browser for the website address present in the browser address bar is a first-party cookie.

Third-Party Cookie

Now, imagine there is a webpage within www.abc.com that loads content from a different website, www.xyz.com. Typically, this is done using an iFrame HTML tag. The cookie issued by www.xyz.com is called a third-party cookie. As the domain of the cookie for www.xyz.com doesn't match the URL present in the address bar of the browser (www.abc.com), the cookie is considered third-party.

Solving the Third-Party Cookie Access Issue

For privacy reasons, Safari on Mac and iOS, and Chrome in Incognito mode, block third-party cookies. Even if the third-party cookie is set using the SameSite=None; Secure attributes, Safari and Chrome Incognito will block it. Therefore, the iFrame-based content embedding example explained above will not work in these browsers. To solve the problem, some networking work needs to be done:

An alias record, such as xyz-thirdparty.abc.com, needs to be created.
The alias record xyz-thirdparty.abc.com needs to have www.xyz.com as the target endpoint in the network configuration.
A certificate needs to be generated with the CN and Subject Alternative Name as xyz-thirdparty.abc.com by a Certificate Authority (e.g., VeriSign). The certificate needs to be installed in the infrastructure (e.g., reverse proxy, web server, load balancer, etc.) of www.xyz.com.
The iFrame code should use xyz-thirdparty.abc.com as the target URL instead of www.xyz.com.

This way, the cookie issued by www.xyz.com will actually be issued under the alias record xyz-thirdparty.abc.com. As the domain of the cookie is abc.com, which matches the domain of the URL present in the browser address bar (www.abc.com), the cookie will be treated as first-party. The application using the iFrame will work in Safari and in Chrome Incognito mode.

Note: The subdomain for the alias record could be anything, like foo.abc.com. I have used xyz-thirdparty as the alias subdomain for demonstration purposes only.
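To make the setup concrete, here is an illustrative sketch of the moving parts; the hostnames, record, and cookie values are placeholders taken from the example above, not a real configuration.

Plain Text
# DNS zone for abc.com -- alias the third-party host under the first-party domain
xyz-thirdparty.abc.com.    CNAME    www.xyz.com.

# Page served from www.abc.com embeds the content through the alias host
<iframe src="https://xyz-thirdparty.abc.com/widget"></iframe>

# Response served by www.xyz.com's infrastructure behind the alias host; the cookie's
# registrable domain (abc.com) now matches the page in the address bar, so it is treated as first-party
HTTP/1.1 200 OK
Set-Cookie: xyz_session=abc123; Domain=xyz-thirdparty.abc.com; Path=/; Secure; HttpOnly; SameSite=Lax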
The diagram below demonstrates the networking solution.

Network configuration for cross-domain iFrame

Consideration

The www.xyz.com website must use X-Frame-Options headers in its infrastructure (e.g., reverse proxy) and whitelist www.abc.com as a website that is allowed to iFrame it. Otherwise, even with the alias record solution, www.abc.com will not be able to iFrame www.xyz.com. As a side note, the X-Frame-Options header is used to control whether a website can be iFrame-d at all and, if yes, which specific websites can iFrame it. This is done to protect a website from clickjacking attacks.

Conclusion

Protecting end users and websites from malicious attacks is critical in the modern web. Browsers are becoming more restrictive with added controls. However, there are legitimate use cases where cross-domain communication needs to happen in a browser, like embedding one website within another in an iFrame. Third-party cookies become a challenge in cross-domain iFrame-based implementations. This article demonstrated how to implement this feature using network configuration.

References

Saying goodbye to third-party cookies in 2024
X-Frame-Options

By Dipankar Saha
Top Book Picks for Site Reliability Engineers
Top Book Picks for Site Reliability Engineers

I believe reading is fundamental. Site reliability engineers (SREs) need deep knowledge across a wide range of subjects, such as coding, operating systems, computer networking, large-scale distributed systems, SRE best practices, and more, to be successful at their job. In this article, I discuss a few books that will help SREs become better at their job.

1. Site Reliability Engineering, by the Google SRE team

Google originally coined the term "Site Reliability Engineering." This book is a must-read for anyone interested in site reliability engineering. It covers a wide range of topics that SREs focus on day to day, such as SLOs, eliminating toil, monitoring distributed systems, release management, incident management, infrastructure, and more. This book gives an overview of the different elements that SREs work on. Although it has many topics specific to Google, it provides a good framework and mental model for various SRE topics. The online version is freely available, so there is no excuse not to read it. The free online version of this book is available here.

2. The Site Reliability Workbook, by the Google SRE team

After the success of the original site reliability engineering book, the Google SRE team released this book as a continuation, adding more implementation details to the topics in the first book. One of my favorite chapters in the book is "Introducing Non-Abstract Large Scale System Design," and I have read it multiple times. Like their first book, this one is also available to read online for free. You can read this book for free here.

3. Systems Performance, by Brendan Gregg

I was introduced to Brendan Gregg's work through his famous blog post "Linux Performance Analysis in 60,000 Milliseconds." This book introduced me to the USE Method, a methodology that can help you quickly troubleshoot performance issues. USE stands for utilization, saturation, and errors. The book covers topics such as Linux kernel internals, various observability tools (to analyze CPU, memory, disk, file systems, and network), and application performance. The USE Method helped me apply methodical problem solving while troubleshooting complex distributed system issues. This book can help you gain a deeper understanding of troubleshooting performance issues on a Linux operating system. More information about this book can be found here.

4. The Linux Programming Interface, by Michael Kerrisk

Having a deeper understanding of operating systems can provide a valuable advantage for SREs. Most of the time, SREs use many commands to configure and troubleshoot various OS-related issues. However, understanding how operating systems work internally helps make troubleshooting easier. This book provides a deeper understanding of the Linux OS and focuses on its system call interface. A majority of teams and companies use Linux to run production systems. However, you may work in teams where other operating systems, like Windows, are used. If that is the case, then including a book specific to that OS in your reading list is worthwhile. You can check out the above-mentioned book here.

5. TCP/IP Illustrated: The Protocols, Volume 1, by Kevin Fall and Richard Stevens

This book is great for learning about core networking protocols such as IP (Internet Protocol), ICMP (Internet Control Message Protocol), ARP (Address Resolution Protocol), UDP (User Datagram Protocol), and TCP (Transmission Control Protocol).
Having a strong understanding of the TCP/IP protocol suite, and of how to use various tools to debug networking issues, is one of the core skills for SREs. This book provides the reader with a strong understanding of how the protocols work under the hood. Details about the book are found here.

6. The Illustrated Network: How TCP/IP Works in a Modern Network, by Walter Goralski

While TCP/IP Illustrated provides an in-depth explanation of the core TCP/IP protocols, this book focuses on the fundamental principles and how they work in a modern networking context. It is a great addition to your library alongside TCP/IP Illustrated, providing a deeper and broader understanding of the TCP/IP protocols. More about this book can be found here.

7. Designing Data-Intensive Applications, by Martin Kleppmann

This is a great book for understanding how distributed systems work through the lens of data-oriented systems. If you are working on distributed database systems, this book is a must-read. I personally learned a lot from this book because I currently work as an SRE on CosmosDB (a globally distributed database service). What makes this book specifically useful for SREs is that it focuses on the reliability, scalability, and maintainability of data-intensive applications. It dives deep into distributed database concepts such as replication, partitioning, transactions, and the problems of distributed consensus. You can learn more about this book here.

8. Building Secure and Reliable Systems, by the Google SRE team

This book extends the principles of site reliability engineering to encompass security, arguing that security and reliability are not separate concerns but are deeply related and should be addressed together. It advocates integrating security practices into every stage of the system lifecycle, from design and development to deployment and operations. Google has made this book available for free here.

9. Domain-Specific Books

Often, SREs work in specific domains such as databases, real-time communication systems, ERP/CRM systems, AI/ML systems, and more, and having a general understanding of these domains is important to be effective at your job. Including a book in your reading list that provides a breadth of knowledge about your domain is a great idea.

Conclusion

By reading these books, you can develop a deeper understanding of subjects such as coding, operating systems, computer networking, distributed systems, and SRE principles, which will help you become a better site reliability engineer. Personally, these books helped me broaden the essential knowledge I need to perform my job as an SRE effectively, and they also helped me while pursuing opportunities across teams and organizations. Happy reading!

By Krishna Vinnakota
How the Go Runtime Preempts Goroutines for Efficient Concurrency
How the Go Runtime Preempts Goroutines for Efficient Concurrency

Go's lightweight concurrency model, built on goroutines and channels, has made it a favorite for building efficient, scalable applications. Behind the scenes, the Go runtime employs sophisticated mechanisms to ensure thousands (or even millions) of goroutines run fairly and efficiently. One such mechanism is goroutine preemption, which is crucial for ensuring fairness and responsiveness. In this article, we'll dive into how the Go runtime implements goroutine preemption, how it works, and why it's critical for compute-heavy applications. We'll also use clear code examples to demonstrate these concepts.

What Are Goroutines and Why Do We Need Preemption?

A goroutine is Go's abstraction of a lightweight thread. Unlike heavy OS threads, a goroutine is incredibly memory-efficient — it starts with a small stack (typically 2 KB), which grows dynamically. The Go runtime schedules goroutines on a pool of OS threads, following an M:N scheduling model, where N goroutines are multiplexed onto M OS threads. While Go's cooperative scheduling usually suffices, there are scenarios where long-running or tight-loop goroutines can hog the CPU, starving other goroutines. Example:

Go package main func hogCPU() { for { // Simulating a CPU-intensive computation // This loop never yields voluntarily } } func main() { // Start a goroutine that hogs the CPU go hogCPU() // Start another goroutine that prints periodically go func() { for { println("Running...") } }() // Prevent main from exiting select {} }

In the above code, hogCPU() runs indefinitely without yielding control, potentially starving the goroutine that prints messages (println). In earlier versions of Go (pre-1.14), such a pattern could lead to poor responsiveness, as the scheduler wouldn't get a chance to interrupt the CPU-hogging goroutine.

How Goroutine Preemption Works in the Go Runtime

1. Cooperative Scheduling

Go's scheduler relies on cooperative scheduling, where goroutines voluntarily yield control at certain execution points: Blocking operations, such as waiting on a channel:

Go func blockingExample(ch chan int) { val := <-ch // Blocks here until data is sent on the channel println("Received:", val) }

Function calls, which naturally serve as preemption points:

Go func foo() { bar() // Control can yield here since it's a function call }

While cooperative scheduling works for most cases, it fails for compute-heavy or tight-loop code that doesn't include any blocking operations or function calls.

2. Forced Preemption for Tight Loops

Starting with Go 1.14, forced preemption was introduced to handle scenarios where goroutines don't voluntarily yield — for example, in tight loops. Let's revisit the hogCPU() loop:

Go func hogCPU() { for { // Simulating tight loop } }

In Go 1.14+, the compiler automatically inserts preemption checks within such loops. These checks periodically verify whether the goroutine's execution should be interrupted. For example, the runtime monitors whether the preempt flag for the goroutine is set, and if so, the goroutine pauses execution, allowing the scheduler to run other goroutines.

3. Code Example: Preemption in Action

Here's a practical example of forced preemption in Go:

Go package main import ( "time" ) func tightLoop() { for i := 0; i < 1e10; i++ { if i%1e9 == 0 { println("Tight loop iteration:", i) } } } func printMessages() { for { println("Message from goroutine") time.Sleep(100 * time.Millisecond) } } func main() { go tightLoop() go printMessages() // Prevent main from exiting select {} }

What Happens?
Without preemption, the tightLoop() goroutine could run indefinitely, starving printMessages(). With forced preemption (Go 1.14+), the runtime interrupts tightLoop() periodically via inserted preemption checks, allowing printMessages() to execute concurrently.

4. How the Runtime Manages Preemption

Preemption Flags

Each goroutine has metadata managed by the runtime, including a g.preempt flag. If the runtime detects that a goroutine has exceeded its time quota (e.g., it's executing a CPU-heavy computation), it sets the preempt flag for that goroutine. Preemption checks inserted by the compiler read this flag and pause the goroutine at predetermined safepoints.

Safepoints

Preemption only occurs at strategic safepoints, such as during function calls or other preemption-friendly execution locations. This allows the runtime to preserve memory consistency and avoid interrupting sensitive operations.

Preemption Example: Tight Loop Without Function Calls

Let's look at a micro-optimized tight loop without function calls:

Go func tightLoopWithoutCalls() { for i := 0; i < 1e10; i++ { // Simulating CPU-heavy operations } }

For this code: The Go compiler inserts preemption checks during the compilation phase. These checks ensure fairness by periodically pausing execution and allowing other goroutines to run. To see preemption in effect, you could monitor your application's thread activity using profiling tools like pprof or visualize execution using Go's trace tool (go tool trace).

Garbage Collection and Preemption

Preemption also plays a key role in garbage collection (GC). For example, during a "stop-the-world" GC phase: The runtime sets the preempt flag for all goroutines. Goroutines pause execution at safepoints. The GC safely scans memory, reclaims unused objects, and resumes all goroutines once it's done. This seamless integration ensures memory safety while maintaining concurrency performance.

Conclusion

Goroutine preemption is one of the innovations that make Go a compelling choice for building concurrent applications. While cooperative scheduling works for most workloads, forced preemption ensures fairness in compute-intensive scenarios. Whether you're writing tight loops, managing long-running computations, or balancing thousands of lightweight goroutines, you can rely on Go's runtime to handle scheduling and preemption seamlessly. Preemption paired with Go's garbage collection mechanisms results in a robust runtime environment, ideal for responsive and scalable software.

By Dinoja Padmanabhan
Segmentation Violation and How Rust Helps Overcome It
Segmentation Violation and How Rust Helps Overcome It

Segmentation faults are a common issue when working in low-level programming languages like C. These faults occur when a program attempts to access memory it’s not supposed to, leading to crashes and unpredictable behavior. While C offers granular control over memory management, it leaves developers responsible for ensuring safety, often resulting in bugs such as buffer overflows, use-after-free errors, and dangling pointers. Enter Rust, a systems programming language designed to address these issues with its built-in safety mechanisms. Rust prevents segmentation faults and other memory-related errors through its ownership system, borrow checker, and strict type and lifetime guarantees — all features enforced at compile time. This article will compare C's approach to segmentation faults with Rust's implementation and explain why Rust significantly reduces memory-related bugs. What Causes Segmentation Faults in C? Segmentation faults in C occur when a program interacts improperly with memory. Common causes include: 1. Null Pointer Dereferencing Attempting to access memory through a null or uninitialized pointer. C int *ptr = NULL; // Pointer is NULL *ptr = 42; // Dereferencing NULL -> segmentation fault Issue: We are trying to dereference NULL pointers, and there is no protection in C against accessing memory via an invalid pointer. 2. Buffer Overflows Reading or writing beyond the bounds of an allocated buffer. C char buffer[10]; buffer[10] = 'A'; // Access beyond the bounds of the allocated buffer 3. Dangling Pointers Accessing memory after it has been freed. C int *ptr = malloc(sizeof(int)); free(ptr); *ptr = 10; // Segmentation fault: Pointer points to freed memory. 4. Double Free Mismanaging heap memory by freeing the same pointer twice. C int *ptr = malloc(sizeof(int)); free(ptr); /* do something else */ free(ptr); // Undefined behavior: Double free. While extremely powerful, C's manual memory management opens the door to a wide range of memory-related bugs that cause segmentation faults. How Rust Prevents Segmentation Faults Rust was designed to be a safe systems programming language, offering low-level control similar to C but with safeguards that prevent memory-related errors. Here's how Rust compares: 1. Ownership System Rust’s ownership model ensures that each piece of memory is owned by a single variable. Once ownership moves (via assignment or function passing), the original variable becomes inaccessible, preventing dangling pointers and use-after-free errors. Example (Safe management of ownership): Rust fn main() { let x = String::from("Hello, Rust"); let y = x; // Ownership moves from `x` to `y` println!("{}", x); // Error: `x` is no longer valid after transfer } How it prevents errors: Ensures memory is cleaned up automatically when the owner goes out of scope.Eliminates dangling pointers by prohibiting the use of invalidated references. 2. Null-Free Constructs With Option Rust avoids null pointers by using the Option enum. Instead of representing null with a raw pointer, Rust forces developers to handle the possibility of absence explicitly. Example (Safe null handling): Rust fn main() { let ptr: Option<&i32> = None; // Represents a "safe null" match ptr { Some(val) => println!("{}", val), None => println!("Pointer is null"), } } How it prevents errors: No implicit "null" values — accessing an invalid memory location is impossible.Eliminates crashes caused by dereferencing null pointers. 3. 
Bounds-Checked Arrays Rust checks every array access at runtime, preventing buffer overflows. Any out-of-bounds access results in a panic (controlled runtime error) instead of corrupting memory or causing segmentation faults. Rust fn main() { let nums = [1, 2, 3, 4]; println!("{}", nums[4]); // Error: Index out of bounds } How it prevents errors: Protects memory by ensuring all accesses are within valid bounds.Eliminates potential exploitation like buffer overflow vulnerabilities. 4. Borrow Checker Rust’s borrowing rules enforce safe memory usage by preventing mutable and immutable references from overlapping. The borrow checker ensures references to memory never outlive their validity, eliminating many of the concurrency-related errors encountered in C. Example: Rust fn main() { let mut x = 5; let y = &x; // Immutable reference let z = &mut x; // Error: Cannot borrow `x` mutably while it's borrowed immutably. } How it prevents errors: Disallows aliasing mutable and immutable references.Prevents data races and inconsistent memory accesses. 5. Automatic Memory Management via Drop Rust automatically cleans up memory when variables go out of scope, eliminating the need for explicit malloc or free calls. This avoids double frees or dangling pointers, common pitfalls in C. Rust struct MyStruct { value: i32, } fn main() { let instance = MyStruct { value: 10 }; // Memory is freed automatically at the end of scope. } How it prevents errors: Ensures every allocation is freed exactly once, without developer intervention.Prevents use-after-free errors with compile-time checks. 6. Runtime Safety With Panics Unlike C, where errors often lead to undefined behavior, Rust prefers deterministic "panics." A panic halts execution and reports the error in a controlled way, preventing access to invalid memory. Why Choose Rust Over C for Critical Applications Rust fundamentally replaces C’s error-prone memory management with compile-time checks and safe programming constructs. Developers gain the low-level power of C while avoiding costly runtime bugs like segmentation faults. For systems programming, cybersecurity, and embedded development, Rust is increasingly favored for its reliability and performance. Rust demonstrates that safety and performance can coexist, making it a great choice for projects where stability and correctness are paramount. Conclusion While C is powerful, it leaves developers responsible for avoiding segmentation violations, leading to unpredictable bugs. Rust, by contrast, prevents these issues through ownership rules, the borrow checker, and its strict guarantee of memory safety at compile time. For developers seeking safe and efficient systems programming, Rust is a better option.

By Dinoja Padmanabhan
Beyond Linguistics: Real-Time Domain Event Mapping with WebSocket and Spring Boot
Beyond Linguistics: Real-Time Domain Event Mapping with WebSocket and Spring Boot

By definition, a markup language is a system used for annotating a document in a way that is syntactically distinguishable from the text. Essentially, it provides a way to structure and format text using tags or symbols that are embedded within the content. Markup languages are used to define elements within a document, such as headings, paragraphs, lists, links, and images. HyperText Markup Language (HTML) is the most common of them. There are many others, such as XML, SGML, Markdown, MathML, and BBCode, to name a few. This article articulates the need for, and presents a minimally working version of, what we call "domain markup event mapping." Before an unfamiliar term introduced abruptly makes the audience assume otherwise, let us illustrate the experience as the real-time commentary of an event, say a cricket match, on popular online news media. ESPN, Yahoo Cricket, Cricbuzz.com, Star Sports, and BBC are among the top players in this area. I remember how they used to be 15 years ago, and now they've evolved to deliver real-time updates. With advanced backend systems, communication protocols, better design approaches, and of course modern browser technologies, they have always been at the forefront of providing their users the most intuitive updates, compensating for the absence of audio and video. Our audience must have noticed that Google and other websites have already adopted animated and visually intuitive UI components to make the user experience better. This article focuses on the need for a domain-specific markup event mapper for all such use cases and illustrates an approach to creating a minimalistic update system using Spring Boot and WebSocket.

Domain Specific Markup Event Mapper: The client (such as the web browser) receives an event. For the sake of generality, neither the communication protocol nor the message format (e.g., whether it is a text or binary message) is assumed. The message converter yields an event object that the client understands, such as a JSON object that the browser-side script knows how to handle and render. We must now agree that not all notifications, and the events they carry, belong to the same category, and therefore they should not all be rendered in the same way. A running commentary, for example, may be rendered like scrolling hypertext, while a boundary, an over-boundary, or the fall of a wicket might require a special effect when rendered, to stand distinct. The role of the markup look-up engine is to identify a suitable engine given the category of an event. It delegates the event to a specific rendering strategy if one for the category is registered with (known to) the system (a client-side UI such as a browser). If none is found, there needs to be a fallback strategy. The four components that appear black in the image above are abstractions that the domain landscape must provide, as we propose in this article. The rendering techniques for a cricket match must differ from those for a soccer match, and coverage of an electoral poll must be presented differently than sports. We must now wear the developer's hat and gear up to put the theory into practice. We aim to make the notification system minimalist with the following:

A media operator section that posts updates.
The intermediary backend that sends the notifications.
For simplicity, we will not use any broker or third party cloud messaging system.We've chosen vanilla WebSocket as mode of communication although other approaches such as periodical long polls, server-sent-event, SockJS can be used with their respective pros and cons.The viewers’ section to consume (experience) the notification. We create a spring boot application with spring-boot-starter-websocket and spring-boot-starter-thymeleaf. XML <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <parent> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-parent</artifactId> <version>3.4.1</version> <relativePath /> <!-- lookup parent from repository --> </parent> <groupId>com.soham.demo</groupId> <artifactId>visually-appealing-realtime-update</artifactId> <version>0.0.1-SNAPSHOT</version> <name>visually-appealing-realtime-update</name> <description>visually-appealing-realtime-update</description> <url /> <licenses> <license /> </licenses> <developers> <developer /> </developers> <scm> <connection /> <developerConnection /> <tag /> <url /> </scm> <properties> <java.version>17</java.version> </properties> <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-websocket</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-devtools</artifactId> <scope>runtime</scope> <optional>true</optional> </dependency> <dependency> <groupId>org.projectlombok</groupId> <artifactId>lombok</artifactId> <optional>true</optional> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-test</artifactId> <scope>test</scope> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-thymeleaf</artifactId> </dependency> </dependencies> <build> <finalName>${project.artifactId}</finalName> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <configuration> <annotationProcessorPaths> <path> <groupId>org.projectlombok</groupId> <artifactId>lombok</artifactId> </path> </annotationProcessorPaths> </configuration> </plugin> <plugin> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-maven-plugin</artifactId> <configuration> <excludes> <exclude> <groupId>org.projectlombok</groupId> <artifactId>lombok</artifactId> </exclude> </excludes> </configuration> </plugin> </plugins> </build> </project> We expose a WebSocket end point where clients can connect to establish a WebSocket session. GET /chat HTTP/1.1 Host: example.com Upgrade: websocket Connection: Upgrade Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw== Sec-WebSocket-Version: 13 The server responds with a status code 101 (Switching Protocols): HTTP/1.1 101 Switching Protocols Upgrade: websocket Connection: Upgrade Sec-WebSocket-Accept: HSmrc0sMlYUkAGmm5OPpG2HaGWk= We expose an open (non-restricted ) endpoint "score" for clients to connect and configure it to allow traffic from anywhere. This is just for illustration and is not suitable on production grade. 
Java @Configuration @EnableWebSocket public class CricketScoreWebSocketConfig implements WebSocketConfigurer { private static final String PATH_SCORE = "score"; @Override public void registerWebSocketHandlers(WebSocketHandlerRegistry registry) { registry.addHandler(new TextNotificationHandler(), PATH_SCORE).setAllowedOrigins("*"); } } To pour in minimal effort, assuming that there will be exactly one bulletin entry operator, we'll create the class TextNotificationHandler. Java @Slf4j public class TextNotificationHandler extends TextWebSocketHandler { private Set<WebSocketSession> sessions = Collections.synchronizedSet(new HashSet<>()); @Override public void afterConnectionEstablished(WebSocketSession session) throws Exception { log.debug("afterConnectionEstablished :: session established remote host: {}",session.getRemoteAddress()); sessions.add(session); log.debug("afterConnectionEstablished :: connection from: {} is added. Current Open session count : {}",session.getRemoteAddress(),sessions.size()); } @Override protected void handleTextMessage(WebSocketSession session, TextMessage message) throws Exception { for (WebSocketSession webSocketSession : sessions) { if (webSocketSession.isOpen()) { webSocketSession.sendMessage(message); } } } @Override public void afterConnectionClosed(WebSocketSession session, CloseStatus status) throws Exception { log.debug("afterConnectionEstablished :: session closed remote host: {}",session.getRemoteAddress()); sessions.remove(session); log.debug("afterConnectionEstablished :: connection from: {} is removed. Current Open session count : {}",session.getRemoteAddress(),sessions.size()); } } Now, we create the two HTML files under src/resource/templates. HTML <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Media Operator For XYZ Television Cricket Match IND vs ENG</title> <style> body { margin: 0; height: 100%; width: 100%; } canvas { display: block; } .container { display: flex; height: 100vh; width: 100%; } .left-section { background-color: lightblue; width: 80%; padding: 20px; box-sizing: border-box; } .right-section { width: 20%; background-color: lightcoral; padding: 20px; box-sizing: border-box; } </style> </head> <body> <div class="container"> <div class="left-section"> <canvas id="cricketField"></canvas> </div> <div class="right-section"> <textarea rows="10" cols="50" id="tb_Comment" placeholder="Message here"></textarea> <button onclick="sendTextMsg()">Send</button> <fieldset> <legend>Quick Panel</legend> <button onclick="sendToastMsg('boundary')">Boundary</button> <button onclick="sendToastMsg('over-boundary')">Over-boundary</button> <button onclick="sendToastMsg('out')">OUT!!</button> <button onclick="sendToastMsg('100')">100* NOT OUT</button> </fieldset> </div> </div> <script> let socket; let dto; window.onload = function () { socket = new WebSocket("ws://"+window.location.host+"/score"); socket.onmessage = function (event) { console.log(event); }; }; const canvas = document.getElementById('cricketField'); const ctx = canvas.getContext('2d'); canvas.width = window.innerWidth; canvas.height = window.innerHeight; function drawCricketField() { ctx.fillStyle = 'green'; ctx.fillRect(0, 0, canvas.width, canvas.height); const centerX = canvas.width / 2; const centerY = canvas.height / 2; const fieldWidth = 600; const fieldHeight = 400; // Draw the oval cricket field ctx.fillStyle = 'lightgreen'; ctx.beginPath(); ctx.ellipse(centerX, centerY, fieldWidth / 2, 
fieldHeight / 2, 0, 0, Math.PI * 2); ctx.fill(); // Draw pitch and creases ctx.fillStyle = 'white'; ctx.fillRect(centerX - 3, centerY - 150, 6, 300); // Pitch ctx.fillRect(centerX - 150, centerY - 3, 300, 6); // Crease // Draw stumps ctx.fillRect(centerX - 6, centerY - 160, 4, 20); ctx.fillRect(centerX - 2, centerY - 160, 4, 20); ctx.fillRect(centerX + 2, centerY - 160, 4, 20); ctx.fillRect(centerX - 6, centerY + 140, 4, 20); ctx.fillRect(centerX - 2, centerY + 140, 4, 20); ctx.fillRect(centerX + 2, centerY + 140, 4, 20); } let drawing = false; let startX = 0; let startY = 0; let currentX = 0; let currentY = 0; canvas.addEventListener('mousedown', (e) => { const rect = canvas.getBoundingClientRect(); const mouseX = e.clientX - rect.left; const mouseY = e.clientY - rect.top; // console.log(mouseX+" " +mouseY+ctx.isPointInPath(mouseX, mouseY)); // if (ctx.isPointInPath(mouseX, mouseY)) { drawing = true; startX = mouseX; startY = mouseY; currentX = mouseX; currentY = mouseY; // } }); canvas.addEventListener('mousemove', (e) => { if (drawing) { const rect = canvas.getBoundingClientRect(); const mouseX = e.clientX - rect.left; const mouseY = e.clientY - rect.top; currentX = mouseX; currentY = mouseY; clearCanvas(); drawLine(startX, startY, currentX, currentY); } }); canvas.addEventListener('mouseup', () => { drawing = false; }); canvas.addEventListener('mouseout', () => { drawing = false; }); function clearCanvas() { ctx.clearRect(0, 0, canvas.width, canvas.height); drawCricketField(); } function drawLine(startX, startY, endX, endY) { ctx.beginPath(); ctx.moveTo(startX, startY); ctx.lineTo(endX, endY); ctx.strokeStyle = 'red'; ctx.lineWidth = 5; // Increase the line width for better visibility ctx.stroke(); ctx.closePath(); dto = {}; dto.startX = startX; dto.startY = startY; dto.endX = endX; dto.endY = endY; sendMessage("VISUAL", dto); } function sendMessage(strType, dto) { dto.id = Date.now(); dto.type = strType; socket.send(JSON.stringify(dto)); } function sendTextMsg() { dto = {}; dto.message = document.getElementById("tb_Comment").value; sendMessage("TEXT", dto); } function sendToastMsg(msg) { dto = {}; dto.message = msg; sendMessage("TOAST", dto); } drawCricketField(); </script> </body> </html> The full source code is available here . You can also run the application from Docker hub by this command using the target port that you prefer. Dockerfile docker run -p 9876:9876 sohamsg/dockerhub:websocket-cricket-match-commetary However, the proposition the author made at the very beginning can now be revisited to understand its usage and need. Currently, the code is written to cater to selected specific use cases considering a cricket match. However, all these codes are created by individual teams/developers, though they were targeting the same thing of course, in different ways and USPs. To help visualize the components, let us take this enum, which is used in the mark up classifier below: Java public enum EvtType { VISUAL,TEXT,TOAST , // Keep adding your event types for another domain } class MarkUpClassifierService{ public Optional<EvtType> classifyMessagge(AbstractMessage message){ return classifierStgragey.apply(message); } /** Define your strategy to extract the category of the message. return empty Optional unless a category is found */ private Function<AbstractMessage,Optional<EvtType>> classifierStgragey; } The mark up look-up engine looks for a mark up strategy and the mark up implementation simply renders them, fetching the strategies from the server to browser/client only once. 
A CDN can be used, too! The perinodal structure looks like this: Java interface IMarkup{ public void markup(AbstractMessage message, OutputStream outStream); } @Service @SLF4j @RequiredArgsConstructor // We use it as regsitered bean in Spring but this is not specific to any framework class MarkupLookupService{ private final MarkupRegistry regitry; // We use it as regsitered bean in Spring but this is not specific to any framework public Optional<IMarkup> lookupMarkup(EvtType evtType){ if(regitry.supports(evtType)){ return registry.get(evtType);//write your look up logic }else{ return Optional.<IMarkup>empty(); } } } HTML <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Cricket Match IND vs ENG </title> <style> body { margin: 0; overflow: hidden; } canvas { display: block; } .container { display: flex; height: 100vh; width: 100%; } .left-section { background-color: lightblue; width: 80%; padding: 20px; box-sizing: border-box; } .right-section { width: 20%; background-color: white; padding: 20px; box-sizing: border-box; } #toast { visibility: hidden; min-width: 250px; margin-left: -125px; background-color: #333; color: #fff; text-align: center; border-radius: 5px; padding: 16px; position: fixed; z-index: 1; left: 50%; bottom: 30px; font-size: 37px; } #toast.show { visibility: visible; animation: fadein 0.5s, fadeout 0.5s 2.5s; } @keyframes fadein { from { top: -50px; opacity: 0; } to { top: 30px; opacity: 1; } } @keyframes fadeout { from { top: 30px; opacity: 1; } to { top: -50px; opacity: 0; } } </style> </head> <body> <div class="container"> <div class="left-section"> <canvas id="cricketField"></canvas> <div id="toast"></div> </div> <div class="right-section"> <textarea rows="10" cols="50" id="tb_Comment" placeholder="Message here" style="color:gold;background-color:black;read-only:true" readonly></textarea> <button onclick="sendTextMsg()">Send</button> </div> </div> <script> let socket; let dto; window.onload = function () { socket = new WebSocket("ws://"+window.location.host+"/score"); socket.onmessage = function (event) { var d = event.data; var data = JSON.parse(d); console.log(data + " " + data.startX); clearCanvas(); if ("VISUAL" === data.type) { drawLine(data.startX, data.startY, data.endX, data.endY); } else if("TEXT"===data.type){ document.getElementById("tb_Comment").value += data.message + "\n"; }else if("TOAST"===data.type){ showToast(data.message); }else{ console.error("Unsupported message type "+d); } }; }; const canvas = document.getElementById('cricketField'); const ctx = canvas.getContext('2d'); canvas.width = window.innerWidth; canvas.height = window.innerHeight; function drawCricketField() { ctx.fillStyle = 'green'; ctx.fillRect(0, 0, canvas.width, canvas.height); const centerX = canvas.width / 2; const centerY = canvas.height / 2; const fieldWidth = 600; const fieldHeight = 400; // Draw the oval cricket field ctx.fillStyle = 'lightgreen'; ctx.beginPath(); ctx.ellipse(centerX, centerY, fieldWidth / 2, fieldHeight / 2, 0, 0, Math.PI * 2); ctx.fill(); // Draw pitch and creases ctx.fillStyle = 'white'; ctx.fillRect(centerX - 3, centerY - 150, 6, 300); // Pitch ctx.fillRect(centerX - 150, centerY - 3, 300, 6); // Crease // Draw stumps ctx.fillRect(centerX - 6, centerY - 160, 4, 20); ctx.fillRect(centerX - 2, centerY - 160, 4, 20); ctx.fillRect(centerX + 2, centerY - 160, 4, 20); ctx.fillRect(centerX - 6, centerY + 140, 4, 20); ctx.fillRect(centerX - 2, centerY + 140, 4, 20); 
ctx.fillRect(centerX + 2, centerY + 140, 4, 20); } let drawing = false; let startX = 0; let startY = 0; let currentX = 0; let currentY = 0; canvas.addEventListener('mousedown', (e) => { const rect = canvas.getBoundingClientRect(); const mouseX = e.clientX - rect.left; const mouseY = e.clientY - rect.top; // console.log(mouseX+" " +mouseY+ctx.isPointInPath(mouseX, mouseY)); // if (ctx.isPointInPath(mouseX, mouseY)) { drawing = true; startX = mouseX; startY = mouseY; currentX = mouseX; currentY = mouseY; // } }); canvas.addEventListener('mousemove', (e) => { if (drawing) { const rect = canvas.getBoundingClientRect(); const mouseX = e.clientX - rect.left; const mouseY = e.clientY - rect.top; currentX = mouseX; currentY = mouseY; clearCanvas(); drawLine(startX, startY, currentX, currentY); } }); canvas.addEventListener('mouseup', () => { drawing = false; }); canvas.addEventListener('mouseout', () => { drawing = false; }); function clearCanvas() { ctx.clearRect(0, 0, canvas.width, canvas.height); drawCricketField(); } function drawLine(startX, startY, endX, endY) { ctx.beginPath(); ctx.moveTo(startX, startY); ctx.lineTo(endX, endY); ctx.strokeStyle = 'red'; ctx.lineWidth = 5; // Increase the line width for better visibility ctx.stroke(); ctx.closePath(); } function showToast(message) { var toast = document.getElementById("toast"); toast.className = "show"; toast.textContent = mapToVisual(message); setTimeout(function () { toast.className = toast.className.replace("show", ""); }, 3000); } function mapToVisual(msg){ switch(msg){ case "100": return "100*"; case "out": return "OUT!!"; case "boundary": return "Boundary"; case "over-boundary": return "Over-boundary"; } return ""; } drawCricketField(); </script> </body> </html>
The logic for marking up different types of events differently on the client can be implemented in multiple ways; we list only a few:
Writing the strategy in client-side code as a library (e.g., a JavaScript library). The downside is that updating the logic is prone to errors, as with any scripting.
Caching and CDN: ensuring that updates are reflected and nothing is cached beyond the current session.
Writing the strategy on the server side and sending transpiled script back. The client-side code and the backend are then no longer loosely coupled.
We will cover each approach in detail some other time.
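To make the look-up path concrete before closing, here is a hypothetical mark-up implementation and a simple map-backed registry that MarkupLookupService could delegate to. This is a minimal sketch under assumptions: the names ToastMarkup and InMemoryMarkupRegistry, the shape of the MarkupRegistry interface, and the getMessage() accessor are not part of the original code, and real code would JSON-escape the message.

Java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical mark-up for TOAST events: writes the JSON shape the browser's showToast() handler expects.
class ToastMarkup implements IMarkup {
    @Override
    public void markup(AbstractMessage message, OutputStream outStream) {
        String json = "{\"type\":\"TOAST\",\"message\":\"" + message.getMessage() + "\"}"; // escape in real code
        try {
            outStream.write(json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// A simple in-memory registry; the MarkupRegistry interface shape is inferred from the look-up service above.
class InMemoryMarkupRegistry implements MarkupRegistry {
    private final Map<EvtType, IMarkup> markups = new EnumMap<>(EvtType.class);

    InMemoryMarkupRegistry() {
        markups.put(EvtType.TOAST, new ToastMarkup());
        // Register VISUAL, TEXT, and any new event types the same way.
    }

    @Override
    public boolean supports(EvtType evtType) {
        return markups.containsKey(evtType);
    }

    @Override
    public Optional<IMarkup> get(EvtType evtType) {
        return Optional.ofNullable(markups.get(evtType));
    }
}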

By Soham Sengupta
Issue and Present Verifiable Credentials With Spring Boot and Android

As digital identity ecosystems evolve, the ability to issue and verify digital credentials in a secure, privacy-preserving, and interoperable manner has become increasingly important. Verifiable Credentials (VCs) offer a W3C-standardized way to present claims about a subject, such as identity attributes or qualifications, in a tamper-evident and cryptographically verifiable format. Among the emerging formats, Selective Disclosure JSON Web Tokens (SD-JWTs) stand out for enabling holders to share only selected parts of a credential while ensuring its authenticity can still be verified. In this article, we demonstrate the issuance and presentation of Verifiable Credentials using the SD-JWT format, leveraging Spring Boot microservices on the backend and a Kotlin-based Android application acting as the wallet on the client side. We integrate support for the recent OpenID for Verifiable Credential Issuance (OIDC4VCI) and OpenID for Verifiable Presentations (OIDC4VP) protocols, which extend the OpenID Connect (OIDC) framework to enable secure and user-consented credential issuance and verification flows. OIDC4VCI defines how a wallet can request and receive credentials from an issuer, while OIDC4VP governs how those credentials are selectively presented to verifiers in a trusted and standardized way. By combining these technologies, this demonstration offers a practical, end-to-end exploration of how modern identity wallet architectures can be assembled using open standards and modular, developer-friendly tools. While oversimplified for demonstration purposes, the architecture mirrors key principles found in real-world initiatives like the European Digital Identity Wallet (EUDIW), making it a useful foundation for further experimentation and learning.

Issue Verifiable Credential

Let's start with a sequence diagram describing the flow and the participants involved:

The WalletApp is the Android app, implemented in Kotlin with Authlete's Library for SD-JWT included.
The Authorization Server is a Spring Authorization Server configured for the Authorization Code flow with Proof Key for Code Exchange (PKCE). It authenticates users who request their Verifiable Credential via the mobile app, obtains their authorization consent, and issues access tokens with a predefined scope named "VerifiablePortableDocumentA1" for our demo.
The Issuer is a Spring Boot microservice acting as an OAuth 2.0 Resource Server, delegating its authority management to the Authorization Server introduced above. It offers endpoints to authorized wallet instances for Credential Issuance, performs Credential Request validations, and generates and serves Verifiable Credentials in SD-JWT format to the requestor. It also utilizes Authlete's Library for SD-JWT, this time on the server side.
The Authentic Source is, in this demo, part of the Issuer codebase (in reality, it could be a completely separate but trusted entity) and has an in-memory "repository" of user attributes. These attributes are retrieved by the Issuer and enclosed in the produced SD-JWT as "Disclosures."

Credential Request Proof and Benefits of SD-JWT

When a wallet app requests a Verifiable Credential, it proves possession of a cryptographic key using a mechanism called JWT proof. This proof is a signed JSON Web Token that the wallet creates and sends along with the credential request. The issuer verifies this proof and includes a reference to the wallet's key (as a cnf claim) inside the SD-JWT credential.
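To give a feel for what such a proof looks like, the sketch below builds and verifies a proof-style JWT with the Nimbus JOSE + JWT library. This is not the demo's code (the demo relies on Authlete's Library for SD-JWT and Spring Security); the typ value, issuer URL, and nonce are assumptions made for illustration only.

Java
import com.nimbusds.jose.JOSEObjectType;
import com.nimbusds.jose.JWSAlgorithm;
import com.nimbusds.jose.JWSHeader;
import com.nimbusds.jose.crypto.ECDSASigner;
import com.nimbusds.jose.crypto.ECDSAVerifier;
import com.nimbusds.jose.jwk.Curve;
import com.nimbusds.jose.jwk.ECKey;
import com.nimbusds.jose.jwk.gen.ECKeyGenerator;
import com.nimbusds.jwt.JWTClaimsSet;
import com.nimbusds.jwt.SignedJWT;

import java.util.Date;

public class JwtProofSketch {
    public static void main(String[] args) throws Exception {
        // 1. Wallet side: generate (or load) the wallet's key pair.
        ECKey walletKey = new ECKeyGenerator(Curve.P_256).keyID("wallet-key-1").generate();

        // 2. Build the proof JWT: the public JWK travels in the header, the audience is the
        //    credential issuer, and the nonce would come from the issuer (c_nonce).
        JWSHeader header = new JWSHeader.Builder(JWSAlgorithm.ES256)
                .type(new JOSEObjectType("openid4vci-proof+jwt")) // typ used by OIDC4VCI proofs (assumed here)
                .jwk(walletKey.toPublicJWK())
                .build();
        JWTClaimsSet claims = new JWTClaimsSet.Builder()
                .audience("https://issuer.example.com")   // hypothetical credential issuer identifier
                .issueTime(new Date())
                .claim("nonce", "c-nonce-from-issuer")     // hypothetical c_nonce value
                .build();
        SignedJWT proof = new SignedJWT(header, claims);
        proof.sign(new ECDSASigner(walletKey));

        // 3. Issuer side: verify the signature with the key embedded in the header.
        SignedJWT received = SignedJWT.parse(proof.serialize());
        ECKey walletPublicKey = received.getHeader().getJWK().toECKey();
        boolean valid = received.verify(new ECDSAVerifier(walletPublicKey));
        System.out.println("Proof signature valid: " + valid);
        // The issuer would then bind the issued credential to this key via a cnf claim in the SD-JWT.
    }
}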
This process binds the credential to the wallet that requested it, ensuring that only that wallet can later prove possession. The issued credential uses the Selective Disclosure JWT (SD-JWT) format, which gives users fine-grained control over what information they share. Unlike traditional JWTs that expose all included claims, SD-JWTs let the holder (the wallet user) disclose only the specific claims needed, such as name or age, while keeping the rest private. This enables privacy-preserving data sharing without compromising verifiability. So even when only a subset of claims is disclosed, the original issuer's signature remains valid! The Verifier can still confirm the credential's authenticity, ensuring trust in the data while respecting the holder's choice to share minimally.

Now that the wallet holds a Verifiable Credential, the next step is to explore its practical use: selective data sharing with a Relying Party (Verifier). This is done through a Verifiable Presentation, which allows the user to consent to sharing only specific claims from their credential, just enough to, for example, access a protected resource, complete a KYC process, or prove eligibility for a service, without revealing unnecessary personal information.

Present Verifiable Credential

The following sequence diagram outlines a data-sharing scenario where a user is asked to share specific personal information to complete a task or procedure. This typically occurs when the user visits a website or application (the Verifier) that requires certain details, and the user's digital wallet facilitates the sharing process with their consent.

The Verifier is a Spring Boot microservice with the following responsibilities:

Generates a Presentation Definition associated with a specific requestId
Serves pre-stored Presentation Definitions by requestId
Accepts and validates the incoming vp_token posted by the wallet client

Again, Authlete's Library for SD-JWT is the primary tool here.

At the core of this interaction is the Presentation Definition, a JSON-based structure defined by the Verifier that outlines the type of credential data it expects (e.g., name, date of birth, nationality). This definition acts as a contract and is retrieved by the wallet during the flow. The wallet interprets it dynamically to determine which of the user's stored credentials, and which specific claims within them, can fulfill the request. Once a suitable credential is identified (such as the previously issued SD-JWT), the wallet prompts the user to consent to share only the required information. This is where Selective Disclosure comes into play.

The wallet then prepares a Verifiable Presentation token (vp_token), which encapsulates the selectively disclosed parts of the credential along with a binding JWT. This binding JWT, signed using the wallet's private key, serves two critical functions: it proves possession of the credential, and it cryptographically links the presentation to this specific session or verifier (e.g., through audience and nonce claims). It also includes a hash derived from the presented data, ensuring its integrity.
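To illustrate the mechanics behind selective disclosure, the following sketch shows the tilde-separated SD-JWT presentation layout and how a disclosure digest is computed (SHA-256 over the base64url-encoded disclosure, itself base64url-encoded), which is what lets a verifier match disclosed claims against the digests embedded in the signed credential. This is a self-contained, JDK-only illustration, not the Authlete library's API; the salt and claim values are made up.

Java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class SdJwtDisclosureSketch {
    public static void main(String[] args) throws Exception {
        // A disclosure is a base64url-encoded JSON array: [salt, claim name, claim value].
        String disclosureJson = "[\"_26bc4LT-ac6q2KI6cBW5es\",\"family_name\",\"Doe\"]"; // made-up values
        String disclosure = base64Url(disclosureJson.getBytes(StandardCharsets.UTF_8));

        // The digest of the disclosure is what appears in the credential's "_sd" array.
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(disclosure.getBytes(StandardCharsets.US_ASCII));
        System.out.println("Digest to look for in _sd: " + base64Url(hash));

        // A presentation (the vp_token content) is the issuer-signed JWT, the disclosures the user
        // agreed to share, and the key-binding JWT, joined with '~':
        String presentation = "<issuer-signed JWT>~" + disclosure + "~<key-binding JWT>";
        System.out.println(presentation);
    }

    private static String base64Url(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }
}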
On the Verifier side, the backend microservice performs a series of validations:

It retrieves the Issuer's public key to verify the original SD-JWT's signature.
It verifies the binding JWT, confirming both its signature and that the hash of the disclosed data matches the expected value, thereby ensuring the credential hasn't been tampered with and that it is bound to this particular transaction.
It checks additional metadata, such as audience, nonce, and expiration, to ensure the presentation is timely and intended for this verifier.

While this demo focuses on the core interactions, a production-ready verifier would also validate that the Presentation Definition was fully satisfied, ensuring all required claims or credential types were present and correctly formatted. Once all validations pass, the Verifier issues a response back to the wallet, for example, redirecting to a URI for further interaction, marking the user as verified, or simply displaying the successful outcome of the verification process.

Takeaways

The complete source code for both the backend microservices and the Android wallet application is available on GitHub:

Backend (Spring Boot microservices): spring-boot-vci-vp
Android wallet (Kotlin): android-vci-vp

The README files of both repositories contain instructions and additional information, including how to run the demo, how to examine the SD-JWT and vp_token using sites like https://www.sdjwt.co, a sample Presentation Definition, and more.

Video

Finally, you can watch a screen recording that walks through the entire flow on YouTube.

By Kyriakos Mandalas DZone Core CORE
