
Martin Joo - Performance with Laravel

Building an analytics platform


Architecture
v1
v2
v3
v4
v5
PageView service
Analytics service
Page views
Unique visitors
Most visited pages
Measuring performance
Indexing
site_id index
visited_at and site_id composite index
site_id and visited_at composite index
Denormalization
Hashing URLs
Redis
Sets
Sorted sets
Page views
Querying page views
Daily and monthly page views
Scaling down the problem
Inserting page views
Time complexity
Unique visitors
Querying unique visitors
Daily and monthly unique visitors
HyperLogLog
Daily unique visitors
Monthly unique visitors
Hourly unique visitors
Scaling down the problem
Inserting unique visitors
Time complexity
Memoization
Most visited pages
Querying most visited pages
Scaling down the problem
Inserting most visited page views
Time complexity
Cleaning up data
Falling back to MySQL
Avoiding accidental data loss
Caching
Conclusion
Architecture v6: CQRS
The event stream


Building an analytics platform


In this chapter, we're going to build a basic analytics dashboard such as Splitbee, Fathom, or Google
Analytics. I'll use Splitbee as an example but the main principles and the performance issues are similar
across all these platforms. If you don't know too much about these platforms, here's an executive summary
for you:

You include a script provided by Splitbee into your website

Whenever a user visits a page on your site, the script sends a request to Splitbee

When you log into Splitbee you'll see an overview of your traffic

This is what the dashboard looks like:

There are two important terms:

Page views: if you go to a site and visit three pages it counts as three page views

Unique users: it might be three page views but there's only one user, you.

It also shows the most visited pages in a given time period and the sources of the traffic.

There's a date filter where you can select options such as: last 24 hours, last 7 days, etc. We're going to copy
all these features except traffic sources. It's not related to performance at all, so it's not interesting for us
now.

You're probably already thinking about GROUP BY and COUNT(*) queries but let's take a step back first.


Architecture
For the first part of this chapter exact traffic numbers are not that important. Let's just assume that the
database contains 10,000,000 page view records and we need to handle 100,000 requests a day.

v1
If we want to build a reliable analytics platform that handles high load we can't just do that:

This is a "standard approach" where we have an API that accepts HTTP requests from the scripts embedded
into users' websites, saves them into a database, and then serves them to a frontend.

For simplicity, I'll use the following terms:

When a user uses the dashboard via the frontend it's a user request

When a user's website sends an HTTP request it's a website request

This approach won't work very well for a few reasons:

The number of website requests is extremely high

Having the same server to serve user and website requests is probably a bad idea


MySQL will store many millions of rows and there's a good chance that aggregated queries made by
user requests will be quite slow. Meanwhile, the database has to constantly run INSERT queries
because of the incoming website requests. Having a huge database that needs to handle a large
number of write and read queries as well is probably not a good idea. For example, if you index your
database right then reading will be faster but writing is going to be slower as mentioned in the
database index chapter.

If we have only one instance of the API it's a huge single point of failure. If it's down, everything is
down. If it's slow, everything is slow.

If we have only one instance of the API and it's down then incoming requests in this period cannot be
easily "replayed" later. So the data is lost.

v2
Of course, the low-hanging fruit is to scale the API up:

Horizontally scaling the API is a must-have in this application. It means that we run multiple instances of the
API and a load balancer balances the incoming traffic across the instances.

This way we eliminated the single point of failure problem, right? Let's think about two kinds of problems:


Assume we have autoscaling, meaning that the number of instances can grow as incoming traffic
grows. Let's say traffic grows pretty quickly and instance #1 is overloaded and is out of memory. The
autoscaler spins up new instances and future traffic goes to them. But what happened with the
requests handled by instance #1? Let's say there were 20 requests being processed when it ran out of
memory. What happened to them? It's hard to say, maybe some of them were processed successfully
maybe not. We don't know that exactly. I mean, it's not a huge concern and is probably acceptable but
still, there's some unreliability here. Unfinished requests are probably lost as well.

But now let's imagine we have a pretty ugly bug in the code. After the controller accepts the request
but before we execute the INSERT statement there's a bug that causes an exception so the request
fails. In this case, it doesn't matter how many replicas you have. All the incoming requests are lost
while the bug is present. Of course, you can recover them from nginx access logs or an audit table but
it's going to be a pretty bad evening.

Even if we have infinite scale the API can still cause some pretty bad problems.

Another problem is that we should not treat user requests and website requests with the same priority.
Even if we scale the API, the database can still experience high latency periods. Right now, if there are
hundreds of incoming website requests (keeping MySQL occupied) and then a user wants to see the
dashboard he/she might experience very high latency. In this scenario, the user request should clearly have
higher priority. It doesn't really matter if the application processes a website request right now, or 3
seconds from now.


v3
The source of these issues is the absence of a job queue and workers:

Review the three problems from v2:

Data loss is possible when the traffic is growing faster than the autoscaler. This issue is very very
unlikely now. The API only accepts the incoming website requests and then immediately dispatches a
job to a queue. So even if the load is super high, these requests now consume significantly less CPU
and memory. The probability that the queue is out of memory, or space, or CPU is pretty low if we
choose the right technology and/or resource. If the worker runs out of resources the job will fail and
will be put in the failed_jobs table and can be processed later. So data loss and outages are significantly less likely now.

The code that inserts the page view records has a fatal bug. In this case, the job will fail and will be put
in the failed_jobs table and can be processed later. The worst-case scenario is that we cannot insert
new rows while the bug is present so users will see outdated data on the dashboard. The situation is
much better compared to v2.

The third problem was that we treated user and website requests with the same priority. We can now
easily prioritize user requests over workers by specifying a nice value (as we discussed in the queue
chapter). However, it's still a problem since there's a possibility that 100 workers are inserting into the
database at the exact moment when a user wants to see the dashboard. Or maybe 500 users at the
same time. We can use nice values but they won't "re-prioritize" already running processes.
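As a reminder of how that prioritization can be configured: the nice value is applied where the worker processes are started. Here's a sketch, assuming you run the workers with Laravel Horizon and that your Horizon version supports the nice supervisor option (the values are illustrative, not this app's real config):

// config/horizon.php (illustrative sketch, assumed values)
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['default'],
            'maxProcesses' => 10,
            // A positive nice value lowers the workers' CPU priority, so the
            // processes serving user requests win when the CPU is saturated
            'nice' => 10,
        ],
    ],
],

As noted above, this only affects how the OS schedules CPU time for the worker processes; it does not re-prioritize queries that are already running inside MySQL.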


v4
One solution to the last problem is using a very fast cache-like database to serve user requests:

It's getting more and more complicated so here's a summary of how this design handles incoming website
requests:

The embedded script on the user's website sends an HTTP request to the API

The API dispatches a job


The worker inserts the new record into MySQL

It also updates a cache-like Redis database

When a user requests the dashboard the API reads it from Redis. If data is not available it reads it from
MySQL
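To make the last step a bit more concrete, here's a simplified sketch of the read path. The key format and the queryPageViewsFromMySql helper are made up for illustration; the real implementation, built on Redis data structures, comes later in the chapter:

use Illuminate\Support\Facades\Redis;

public function pageViews(Site $site, string $period): array
{
    // Try the cache-like Redis database first
    $cached = Redis::get("page-views:{$site->id}:{$period}");

    if ($cached !== null) {
        return json_decode($cached, true);
    }

    // Fall back to MySQL when Redis has nothing for this site and period
    return $this->queryPageViewsFromMySql($site, $period);
}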

By maintaining a Redis DB to serve user requests we made the whole application significantly faster. Like
orders of magnitude faster. Later, I'm going to show you the exact numbers. Another note is that most
developers think about Redis as a key-value cache where we set a string key and a string value (just as
Cache::set('key', 'value') works). However, it's a much more complicated database engine. We'll use
different data structures to make this cache-like database as efficient as possible. But it's going to be a bit
more complicated than a typical cache.

There's still a problem with this design, in my opinion. We still use the same component to handle user and
website requests. The analytics API. These are two completely different scenarios. They scale differently. In
an application that handles millions of requests, you just can't risk bringing your entire frontend (user
requests) down because of your background processes (website requests). And it's true vice versa. You
cannot hurt background processes because of a spike in frontend traffic caused by a new marketing
campaign. In my opinion, at least.

Another "issue" is that we bootstrap an entire Laravel application with 63MB of vendor code just to accept a
request and dispatch a job. Of course, you can always install Octane if you have the knowledge in your
team. But often people don't have experience with it. By extracting it out into a new component you can
easily write 50 lines of Node or Go to handle such a small but crucial task.


v5
To solve these issues we can extract a new component called "PageView API:"

Do you see what we just did? We completely disconnected the Analytics API from website requests.

Now we can:

Monitor them separately. It's a big win because the two services have completely different traffic and
load patterns.

Optimize them based on their specific problems.

Scale them differently. It's a big win because the PageView API will get significantly more traffic than
the Analytics API.


Implement the PageView API in Node or Go. Or use Octane. For example, if you don't have too much
production experience with it (as most developers don't, I think) it's better to try it out in one isolated
service that won't risk your entire application, only a part of it.

Now it's a much better, more reliable architecture than before. If you now take a look at it, can you spot the
remaining single point of failure? Yes, the queue and the database. They'll both have an extraordinary
amount of load so it's probably best to use some sort of managed service instead of spinning up our own
Redis instance on a $20 VPS.

Later, there's going to be version 6 of this architecture, but that's enough for a high-level overview for now.
Let's write some code.

By the way, are we building a microservice application? Technically, yes. But at the end of the day, it's just an
app with two APIs because they do two different things. However, if you like the m-word I have good news
for you: later, we're going to implement event sourcing and CQRS as well.


PageView service
The PageView API is quite simple, so let's start with that one.

The database contains only two tables: sites and page_views . Each user can register sites and each page
view belongs to exactly one site. sites looks like this:

ID name domain user_id

1 My Blog https://martinjoo.dev 1

2 Your Blog https://yourblog.com 2

And page_views contains the following data:

ID site_id uri browser device ip visited_at

1 1 /blog Internet Explorer 1.0 PC 1.2.3.4 2024-03-07 19:40:21

2 2 /about Internet Explorer 1.5 Laptop 2.3.4.5 2024-03-07 19:40:22

3 2 /contact Internet Explorer 2 Windows XP PC 3.4.5.6 2024-03-07 19:40:23
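A minimal migration for page_views could look something like this (an assumed sketch based on the columns above; note that there are deliberately no indices yet):

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('page_views', function (Blueprint $table) {
            $table->id();
            $table->unsignedBigInteger('site_id');
            $table->string('uri');
            $table->string('browser')->nullable();
            $table->string('device')->nullable();
            $table->string('ip');
            $table->timestamp('visited_at');
        });
    }
};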

The PageView service exposes only one API endpoint:

POST /api/page-views

This API accepts requests from the embedded script on users' websites.

The request looks like this:

class StorePageViewRequest extends FormRequest
{
    public function authorize(): bool
    {
        return true;
    }

    public function rules(): array
    {
        return [
            'site_id' => ['required', 'integer', 'exists:sites,id'],
            'browser' => ['required', 'string', 'sometimes', 'nullable'],
            'device' => ['required', 'string', 'sometimes', 'nullable'],
            'uri' => ['required', 'string'],
        ];
    }
}

For this demo, we assume that the request contains the user's browser and device information. Other than
that it contains the uri which is the page the visitor visited on the user's website. It also contains the
site_id which is an argument to the embedded script.

The Controller is very straightforward:

namespace App\Http\Controllers;

class PageViewController
{
    public function store(StorePageViewRequest $request)
    {
        $data = [
            ...$request->validated(),
            'ip' => $request->ip(),
            'visited_at' => now(),
        ];

        SavePageViewJob::dispatch(PageViewData::from($data));
    }
}

It adds some extra data to the request, such as the IP address and the current date, then creates a DTO
from this array and dispatches a job. If you don't know what a DTO is, I've published multiple articles about
them.


The DTO itself is literally just a container of properties. Just like an associative array, but with types:

namespace App\DataTransferObjects;

class PageViewData
{
    public function __construct(
        public string $site_id,
        public string $uri,
        public string $browser,
        public string $device,
        public string $ip,
        public Carbon $visited_at,
    ) {}

    public static function from(array $data): self
    {
        return new self(
            site_id: $data['site_id'],
            uri: $data['uri'],
            browser: $data['browser'],
            device: $data['device'],
            ip: $data['ip'],
            visited_at: $data['visited_at'],
        );
    }
}


And the last piece of the puzzle is the SavePageViewJob , which couldn't be simpler:

class SavePageViewJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(private readonly PageViewData $data)
    {
    }

    public function handle(): void
    {
        PageView::create([
            'browser' => $this->data->browser,
            'device' => $this->data->device,
            'ip' => $this->data->ip,
            'visited_at' => $this->data->visited_at,
            'site_id' => $this->data->site_id,
            'uri' => $this->data->uri,
        ]);

        // Later, we're going to come back here
    }
}

It just creates a database record from the DTO. As easy as that.
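The PageView model itself isn't shown here. A minimal version (an assumption) only needs to make these columns mass assignable and, since the table has no created_at / updated_at columns, disable the default timestamps:

namespace App\Models;

use Illuminate\Database\Eloquent\Model;

class PageView extends Model
{
    // No created_at/updated_at; visited_at is set explicitly by the job
    public $timestamps = false;

    protected $fillable = [
        'site_id', 'uri', 'browser', 'device', 'ip', 'visited_at',
    ];

    protected $casts = [
        'visited_at' => 'datetime',
    ];
}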

That's it for the PageView service for now. No optimizations yet, just an API endpoint. I wanted to show you
that first, so you can have a better understanding of the application.


Analytics service
We're going to implement three important features from SplitBee's dashboard:

Page views

Unique visitors

Most visited pages

I already showed you a picture but here it is again:

We're going to support five different date filters:

Last 24 hours

Last 7 days

Last 30 days

Last 6 months

Last 12 months

One thing that varies based on the date filter is the "resolution" of the chart. In the image above you can see
the last 30 days and each bar on the chart represents one day. If I change it to last 24 hours it's going to
display the data by hours:


When you request the last 6 or 12 months you'll see months on the charts. That's something we need to
handle later.

There are three API endpoints:

GET /api/sites/{site}/unique-visitors
GET /api/sites/{site}/page-views
GET /api/sites/{site}/most-visited-pages

Each of them accepts one query param called period :

GET /api/sites/{site}/unique-visitors?period=24h
GET /api/sites/{site}/page-views?period=7d
GET /api/sites/{site}/most-visited-pages?period=6m

unique-visitors and page-views return a response such as this one:


{
    "labels": [
        "2024-02-07",
        "2024-02-08",
        "2024-02-09",
        ...
    ],
    "data": [
        3973,
        3906,
        3958,
        ...
    ]
}

It's easy for a frontend chart library (such as Chart.js) to render these values on a chart.

most-visited-pages returns a simple map:

{
"/blog": 90,
"/": 55,
"/about": 51,
"/contact": 49,
"/news": 48
}


Page views
There are lots of different ways to structure our codebase. On my blog you can read a lot about different
techniques. In this example, the main goal is performance so I'll use some pretty simple but effective service
classes.

The controller looks like this:

namespace App\Http\Controllers;

class PageViewController
{
    public function index(Site $site, Request $request, Dashboard $dashboard)
    {
        $period = DateFilterPeriod::from($request->get('period'));

        $dateFilter = DateFilter::fromPeriod($period);

        return match ($dateFilter->resolution) {
            'day' => $dashboard->pageViews($site)->byDays($dateFilter, $period),
            'hour' => $dashboard->pageViews($site)->byHours($dateFilter, $period),
            'month' => $dashboard->pageViews($site)->byMonths($dateFilter, $period),
        };
    }
}

DateFilterPeriod is a simple Enum that holds the possible filter periods:


namespace App\Enums;

enum DateFilterPeriod: string
{
    case TwentyFourHours = '24h';
    case SevenDays = '7d';
    case ThirtyDays = '30d';
    case SixMonths = '6m';
    case TwelveMonths = '12m';
}

This might seem a bit odd at first, but these values are static. They won't change. There aren't gonna be new
values either. Maybe after the app matures users request a 3-year or 5-year period but that's just about it.
We captured all possible periods in one enum.

Then we need to convert these 7d and 30d periods to dates that we can use in MySQL queries. The
DateFilter value object is used to express a start and end date and can be created from a period:

namespace App\ValueObjects;

class DateFilter
{
    public function __construct(
        public Carbon $startDate,
        public Carbon $endDate,
        public string $resolution,
    ) {}

    public static function fromPeriod(DateFilterPeriod $period): self
    {
        return match ($period) {
            DateFilterPeriod::TwentyFourHours => new self(
                now()->subHours(24), now(), 'hour',
            ),
            DateFilterPeriod::SevenDays => new self(
                now()->subDays(7)->startOfDay(), now(), 'day',
            ),
            DateFilterPeriod::ThirtyDays => new self(
                now()->subDays(30)->startOfDay(), now(), 'day',
            ),
            DateFilterPeriod::SixMonths => new self(
                now()->subMonths(6)->startOfDay(), now(), 'month',
            ),
            DateFilterPeriod::TwelveMonths => new self(
                now()->subMonths(12)->startOfDay(), now(), 'month',
            ),
            default => new self(
                now()->subHours(24), now(), 'hour',
            ),
        };
    }
}

For example, when the period is 24h or DateFilterPeriod::TwentyFourHours we create an object with
these arguments:

new self(now()->subHours(24), now(), 'hour')

The third argument is the "resolution." It determines if it needs to show hours, days, or months on the
chart. All of these translate to different SQL queries.

After we have the DateFilter the Controller invokes a service class to get the results:

return $dashboard->pageViews($site)->getData($dateFilter, $period);

$dashboard is a service class located in app\Services\Dashboard\Dashboard.php . All it does is return new instances of other services:


namespace App\Services\Dashboard;

class Dashboard
{
    public function pageViews(Site $site): PageViews
    {
        return app(PageViews::class, ['site' => $site]);
    }

    public function uniqueVisitors(Site $site): UniqueVisitors
    {
        return app(UniqueVisitors::class, ['site' => $site]);
    }

    public function pages(Site $site): Pages
    {
        return app(Pages::class, ['site' => $site]);
    }
}

These three services ( PageViews , UniqueVisitors , and Pages ) contain business logic for a single feature,
such as generating the data for the unique visitor chart. Of course, they could have been actions as well.

There's a cool feature of Laravel's service container resolution I didn't know about until writing these lines.
You can pass arguments to the app function that cannot be resolved by Laravel. In this example, I pass the
$site variable:

return app(PageViews::class, ['site' => $site]);

If you're not familiar with the service container, this line is equivalent to this one:

return new PageViews($site);

But using the app helper Laravel will inject other dependencies to the PageViews class (such as other
services, third-party SDKs, API keys, etc).

In the PageViews class there's the query that returns page views grouped by hours:


DB::table('page_views')
    ->select(
        DB::raw('DATE_FORMAT(visited_at, "%H:00") as date'),
        DB::raw('COUNT(*) AS total'),
    )
    ->whereBetween('visited_at', [$dateFilter->startDate, $dateFilter->endDate])
    ->where('site_id', $this->site->id)
    ->groupBy('date')
    ->orderBy('date', 'asc')
    ->get();

That's quite simple, actually. DATE_FORMAT(visited_at, "%H:00") gives us a format such as 18:00 or
23:00 . We need to group the results based on this expression. Other than that, it's just a COUNT(*) and a
WHERE BETWEEN .

The only problem is that the query only returns dates where there was at least 1 page view. For example, in
my example database the last 7 days look like this:

[
{
"date": "2024-03-01",
"total": 4013
},
{
"date": "2024-03-06",
"total": 10
},
{
"date": "2024-03-07",
"total": 3
}
]

But the API should return all 7 days even if there was no traffic at all. So the desired outcome is this:


[
{
"date": "2024-03-01",
"total": 4013
},
{
"date": "2024-03-02",
"total": 0
},
{
"date": "2024-03-03",
"total": 0
},
{
"date": "2024-03-04",
"total": 0
},
{
"date": "2024-03-05",
"total": 0
},
{
"date": "2024-03-06",
"total": 10
},
{
"date": "2024-03-07",
"total": 3
}
]

We need to fill in the missing dates. To handle these date-related things (we'll have more later) I created a
new service called DateService :


class DateService
{
    /**
     * @return Collection<Carbon>
     */
    public function getHoursBetween(DateFilter $dateFilter): Collection
    {
        $dates = [];

        $date = $dateFilter->startDate->clone();

        while (true) {
            $dates[] = $date->clone();

            if (
                $date->isSameHour($dateFilter->endDate) &&
                $date->isSameDay($dateFilter->endDate)
            ) {
                break;
            }

            if (count($dates) >= 24) {
                break;
            }

            $date->addHour();
        }

        return collect($dates);
    }
}


This function returns the hours between two dates. The way it works is quite simple:

It sets the $date variable to the start date and starts an infinite loop

In every iteration, it pushes a clone of $date into the $dates array and then adds one hour to it

If $date reaches the end date (same day, same hour) or we already have 24 items, it exits from the
loop

One important thing. Lots of Carbon methods are not "pure" so they change the original object instead of
returning a new one. For example, if you do this:

$date = Carbon::parse('2024-03-08 21:50:12');

$oneHourLater = $date->addHour();

Each of those variables will reference the same object and each date will be 2024-03-08 22:50:12 :


Both $date and $oneHourLater have the same ID 1709938212 . This is why I use the clone method when
collecting the hours:

while (true) {
    $dates[] = $date->clone();

    // ...

    $date->addHour();
}


So we have all the hours between the two dates. The next step is to return an object such as this:

{
    "labels": [
        "18:00",
        "19:00",
        "20:00",
        ...
    ],
    "data": [
        173,
        96,
        158,
        ...
    ]
}

This is how we can produce this structure:

$dataByDates = [];

foreach ($data as $item) {
    $dataByDates[$item->date] = $item->total;
}

$allDates = $this->dateService->getHoursBetween($dateFilter);

return [
    'labels' => $allDates->map(fn (Carbon $date) => $date->format('H:00')),
    'data' => $allDates->map(fn (Carbon $date) => $dataByDates[$date->format('H:00')] ?? 0),
];

First, we map the query result into a hash map and call it $dataByDates :


[
    '18:00' => 173,
    '19:00' => 96,
    '20:00' => 158,
    // ...
]

Then we calculate the hours and return an array with labels and data. labels is the result of
getHoursBetween formatted as H:00 which results in 20:00 , etc.

Then we go through $allDates and check if $dataByDates has a value for a given hour. If not we return
0.

This is what the whole PageViews class looks like so far:

namespace App\Services\Dashboard;

class PageViews
{
    public function __construct(
        private Site $site,
        private DateService $dateService
    ) {}

    public function getData(DateFilter $dateFilter): array
    {
        $data = DB::table('page_views')
            ->select(
                DB::raw('DATE_FORMAT(visited_at, "%H:00") as date'),
                DB::raw('COUNT(*) AS total'),
            )
            ->whereBetween('visited_at', [
                $dateFilter->startDate,
                $dateFilter->endDate
            ])
            ->where('site_id', $this->site->id)
            ->groupBy('date')
            ->orderBy('date', 'asc')
            ->get();

        $dataByDates = [];

        foreach ($data as $item) {
            $dataByDates[$item->date] = $item->total;
        }

        $allDates = $this->dateService->getHoursBetween($dateFilter);

        return [
            'labels' => $allDates->map(fn (Carbon $date) => $date->format('H:00')),
            'data' => $allDates->map(fn (Carbon $date) => $dataByDates[$date->format('H:00')] ?? 0),
        ];
    }
}

With this we can serve a request with a 24h period:

GET /api/sites/1/page-views?period=24h

Now let's think about what the difference is if the user wants to see a daily chart:

In the query, we need to use a different date expression that converts the date to the 2024-03-09 format.
The date function is perfect for that so the select expression will be: select date(visited_at) as date

We also need to replace getHoursBetween since now we need to calculate the days between two dates.
The function is probably the same as before but instead of $date->addHour() we need to use $date->addDay()

That sounds quite easy. What if the user wants a monthly chart?

The MySQL date expression becomes DATE_FORMAT(visited_at, "%Y-%m") as date which returns
values such as 2024-03

We need a getMonthsBetween function that uses $date->addMonth() (a sketch of both helpers follows below)

We also need to use some kind of variable in the final return array since it also uses date formats:


return [
    'labels' => $allDates->map(fn (Carbon $date) => $date->format('H:00')),
    'data' => $allDates->map(fn (Carbon $date) => $dataByDates[$date->format('H:00')] ?? 0),
];

So there are two variables:

The MySQL date expression

The date format to use in the Carbon format function
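getDaysBetween and getMonthsBetween aren't shown here. Following the same pattern as getHoursBetween , they could look something like this (a sketch, not the exact implementation):

public function getDaysBetween(DateFilter $dateFilter): Collection
{
    $dates = [];

    $date = $dateFilter->startDate->clone()->startOfDay();

    // Collect every day from the start date up to and including the end date
    while ($date->lessThanOrEqualTo($dateFilter->endDate)) {
        $dates[] = $date->clone();

        $date->addDay();
    }

    return collect($dates);
}

public function getMonthsBetween(DateFilter $dateFilter): Collection
{
    $dates = [];

    $date = $dateFilter->startDate->clone()->startOfMonth();

    // Collect every month from the start date up to and including the end date
    while ($date->lessThanOrEqualTo($dateFilter->endDate)) {
        $dates[] = $date->clone();

        $date->addMonth();
    }

    return collect($dates);
}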

With these and the new date helper functions we can make a generic function that handles all three use
cases:

private function getData(DateFilter $dateFilter, DateFilterPeriod $period): array
{
    $dateFormat = match ($period) {
        DateFilterPeriod::TwentyFourHours => [
            'format' => 'H:00',
            'expression' => 'DATE_FORMAT(visited_at, "%H:00") as date',
            'allDates' => $this->dateService->getHoursBetween($dateFilter),
        ],
        DateFilterPeriod::SevenDays, DateFilterPeriod::ThirtyDays => [
            'format' => 'Y-m-d',
            'expression' => 'DATE(visited_at) as date',
            'allDates' => $this->dateService->getDaysBetween($dateFilter),
        ],
        DateFilterPeriod::SixMonths, DateFilterPeriod::TwelveMonths => [
            'format' => 'Y-m',
            'expression' => 'DATE_FORMAT(visited_at, "%Y-%m") as date',
            'allDates' => $this->dateService->getMonthsBetween($dateFilter),
        ],
    };

    $data = DB::table('page_views')
        ->select(
            DB::raw($dateFormat['expression']),
            DB::raw('COUNT(*) AS total'),
        )
        ->whereBetween('visited_at', [
            $dateFilter->startDate,
            $dateFilter->endDate
        ])
        ->where('site_id', $this->site->id)
        ->groupBy('date')
        ->orderBy('date', 'asc')
        ->get();

    $dataByDates = [];

    foreach ($data as $item) {
        $dataByDates[$item->date] = $item->total;
    }

    return [
        'labels' => $dateFormat['allDates']->map(fn (Carbon $date) => $date->format($dateFormat['format'])),
        'data' => $dateFormat['allDates']->map(fn (Carbon $date) => $dataByDates[$date->format($dateFormat['format'])] ?? 0),
    ];
}

You could also move the match expression into the enum itself; however, in my opinion an enum should not
contain database-specific code and should not depend on service classes.

Now we have the data for the "Page Views" chart. Let's move to unique visitors, most visited pages and then
we'll start profiling and optimizing.


Unique visitors
I won't go into that much detail here because the main principles are the same as before; the main
difference is the query itself.

To get unique visitors we need to group by the visited_at date but we need to count only distinct IP
addresses:

$data = DB::table('page_views')
    ->select(
        DB::raw($dateFormat['expression']),
        DB::raw('COUNT(DISTINCT(ip)) AS total'),
    )
    ->whereBetween('visited_at', [$dateFilter->startDate, $dateFilter->endDate])
    ->where('site_id', $this->site->id)
    ->groupBy('date')
    ->orderBy('date', 'asc')
    ->get();

Other than that, everything is exactly the same as with page views so let's move on.


Most visited pages


This one is easier than the others since we don't need a chart, just a list of the most visited pages of a site:

class Pages
{
    public function __construct(
        private Site $site,
    ) {}

    public function getData(DateFilter $dateFilter): array
    {
        $data = DB::table('page_views')
            ->select('uri', DB::raw('COUNT(*) as total'))
            ->whereBetween('visited_at', [
                $dateFilter->startDate,
                $dateFilter->endDate
            ])
            ->where('site_id', $this->site->id)
            ->groupBy('uri')
            ->orderByDesc('total')
            ->limit(5)
            ->get();

        $results = [];

        foreach ($data as $item) {
            $results[$item->uri] = $item->total;
        }

        return $results;
    }
}

It groups the results by uri and counts them. Finally, it returns an object with the top 5 pages:


{
"/blog": 1539,
"/about": 973,
"/contact": 831,
"/news": 736,
"/": 560
}

Measuring performance
Before optimizing we need to measure the current performance to establish some baselines.

Right now, this is what the database looks like in numbers:

Total page_views row numbers: 4,500,000

page_views for the site with the ID of 1: 1,500,000

The table doesn't have indices right now

These are pretty small numbers for an analytics platform. Let's see how the current system performs:

Page views for the last 24 hours took 912ms and for the last 12 months, it took 1275ms. It's clear that the
more data we have the more time it takes. Which is pretty bad. Right now it only needs to search in
1,500,000 records which is only a few hours of traffic for lots of sites.

The situation is even worse for unique visitors:


But the absolute worst endpoint is the most visited pages:

By the end of this chapter, these requests will run under 500ms.

I was quite surprised that the most visited pages query took 5x more time than the page views query.
Before we add indices, let's run an explain and try to understand what causes that.

Just a reminder, this is the page views query:

select
DATE_FORMAT(visited_at, "%Y-%m") as date,
COUNT(*) AS total
from
`page_views`
where
`visited_at` between '2023-03-10 00:00:00'
and '2024-03-10 13:53:33'
and `site_id` = 1
group by
`date`
order by
`date` asc

This is explain 's output:

It's a full table scan query (which is okay for now) that doesn't use any indices. There are three things in the
extra column:


Using where simply means the query uses a where clause

Using temporary means that MySQL needs to create a temporary table to process the query. It typically
occurs when the query involves complex operations that require temporary storage to hold
intermediate results. It's caused by the GROUP BY statement.

Using filesort means that MySQL needs to perform a filesort operation to sort the result set. It's caused
by the ORDER BY statement.

Of course, it's bad since it's a full table scan involving lots of I/O operations and it also creates temporary
tables.

Now let's take a look at the most visited pages query:

select
`uri`,
COUNT(*) as total
from
`page_views`
where
`visited_at` between '2023-03-10 00:00:00'
and '2024-03-10 12:57:43'
and `site_id` = 1
group by
`uri`
order by
`total` desc
limit
5

Here's the explain:

It's exactly the same execution plan. Not even a single bit of difference. So what causes the 5x difference?

If it's not the way the query is executed then it must be something with the data itself.

With my amazing design skills, I highlighted the differences between the two queries:


In the page views query we use date for select , group by , and order by . However, in the most visited
pages query we use uri in these expressions. uri is a varchar(255) and visited_at (which is the base
for date ) is a timestamp just as a usual created_at in any Laravel migration.

Maybe the uri column takes more space and URIs are long in general so it just takes longer for MySQL to
get the results? If that's the case, these two queries should have a big difference in runtime:

-- This should be faster
select DATE_FORMAT(visited_at, "%Y-%m") as date
from page_views
where site_id = 1
and visited_at between "2023-03-11 00:00:00" and "2024-03-11 23:59:59"

-- This should be slower
select uri
from page_views
where site_id = 1
and visited_at between "2023-03-11 00:00:00" and "2024-03-11 23:59:59"

On my laptop, the results are:

The first query took 1.6s and returned 1.4 million rows

The second one took 1.9s and returned the same number of rows

URIs are longer than dates in general, but it seems like it only explains 10-20% of the time difference.

Next, let's see the results with GROUP BY :

-- This should be faster
select DATE_FORMAT(visited_at, "%Y-%m") as date
from page_views
where site_id = 1
and visited_at between "2023-03-11 00:00:00" and "2024-03-11 23:59:59"
group by `date`

-- This should be slower
select uri
from page_views
where site_id = 1
and visited_at between "2023-03-11 00:00:00" and "2024-03-11 23:59:59"
group by uri

The results are:

The first query took 1.5s and returned 13 rows

The second one took 4.6s and returned 634,000 rows

Here's the big difference! This simple group by made the second query 3x slower. And you probably
already guessed why that's the case:

Grouping by a formatted date results in very few unique values. There are only 12 months, 31 days, 24
hours. The number of unique values is small.

Grouping by columns such as uri results in a very high number of unique values. In this example, I have
634,000 URIs for site 1.

Working with an array of 13 vs 634,000 is obviously much faster and less resource-intensive. Soon, we're
going to tackle this problem.

Now that we have an understanding of the problem, let me point out one last thing before moving on. The
original most visited pages query had a limit clause:

group by
`uri`
order by
`total` desc
limit 5

Even though it returns only 5 records, the query took more than 6 seconds. When you add a limit clause it's
easy to think "ohh it's just 5 records. The small result set must be blazing fast." Well, not really. MySQL still
needs to read, group, sort, and evaluate 634,000 records. Don't fall into the "I used a limit" trap.


Indexing
As we learned in the MySQL index chapter, there are no "universal" indices that will work in every
circumstance. We add indices to queries. In this application, obviously, the dashboard is the most important
feature. Right now, it has three queries and I would say they are equally important to users. So let's treat
them as equals and try to come up with the best possible indices.

These are the queries:

-- Page views
select DATE_FORMAT(visited_at, "%Y-%m") as date, COUNT(*) AS total
from `page_views`
where
`visited_at` between '2023-03-10 00:00:00' and '2024-03-10 13:53:33'
and `site_id` = 1
group by `date`
order by `date` asc

-- Unique visitors
select DATE_FORMAT(visited_at, "%Y-%m") as date, COUNT(DISTINCT(ip)) AS total
from `page_views`
where
`visited_at` between '2023-03-10 00:00:00' and '2024-03-10 13:40:23'
and `site_id` = 1
group by `date`
order by `date` asc

-- Most visited pages
select `uri`, COUNT(*) as total
from `page_views`
where
`visited_at` between '2023-03-10 00:00:00' and '2024-03-10 12:57:43'
and `site_id` = 1
group by `uri`
order by `total` desc
limit 5


Without indices here are the results:

Page views for the last year: 1.6s

Unique visitors for the last year: 3.5s

Most visited pages for the last year: 7s

It's obvious that site_id should be a foreign key and an index:
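The migration for that can be as simple as this (an assumed sketch; in MySQL, adding the foreign key constraint also creates an index on site_id ):

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('page_views', function (Blueprint $table) {
            // The constraint implicitly creates an index on site_id
            $table->foreign('site_id')->references('id')->on('sites');
        });
    }
};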

Now the results are:

Page views for the last year: 1.9s (from 1.6s)

Unique visitors for the last year: 3.7s (from 3.5s)

Most visited pages for the last year: 7.5s (from 7s)

Well, everything just got 15-20% worse.

Let's quickly recap what we have learned so far:

Grouping by URI instead of a date makes a query 3x slower

Adding a foreign key makes everything slower

Now, let's try to understand why all queries got slower. Of course, the execution time can vary a bit and I'm
measuring it on my local machine which results in even more variance (for example, the first time I might
have had 12 Chrome tabs open vs 33 now, etc), but we can say all queries got somewhat worse.

Here's the explain output for the query with an index:

Compare that to the one without an index:


In theory, we should do better since now MySQL uses the site_id index and instead of a full table scan
( ALL ) it performs a ref or index range scan type query. Meaning, it uses the site_id index to find
the starting point of a range and only scans values from that node. Remember, the site_id index isn't
unique (we have lots of page_views records for the same site) so even if you search for site_id = 1 it
always means a range. A range of page_views records where the site_id is equal to 1.

The row count is also much lower: instead of ~4m rows MySQL only needs to search in ~2m rows. The
output looks clearly better but the results are a bit worse (or at least not better).

Fortunately, MySQL 8 comes with a new type of explain . It's called EXPLAIN ANALYZE . It works the same as
EXPLAIN so we need to add EXPLAIN ANALYZE before the query, but instead of a single row, it returns a
more detailed execution plan.

Disclaimer: EXPLAIN ANALYZE will actually run your query! So don't analyze delete statements on your
production database.

Without an index EXPLAIN ANALYZE returns the following:

-> Sort: date (actual time=1976..1976 rows=13 loops=1)
    -> Table scan on <temporary> (actual time=1976..1976 rows=13 loops=1)
        -> Aggregate using temporary table (actual time=1976..1976 rows=13 loops=1)
            -> Filter: ((page_views.site_id = 1) and (page_views.visited_at between '2023-03-10' and '2024-03-10')) (cost=456970 rows=47880) (actual time=0.124..1554 rows=1.42e+6 loops=1)
                -> Table scan on page_views (cost=456970 rows=4.31e+6) (actual time=0.0969..944 rows=4.44e+6 loops=1)

We need to read it backward. The last line is the first step MySQL takes. So this is how the query is executed:

Table scan on page_views (caused by select )

Filter: ((page_views.site_id = 1) and (page_views.visited_at between ...) (caused by where )

Aggregate using temporary table (caused by group by )

Table scan on temporary (caused by group by )

Sort: date (caused by order by )

This is because lower-level nodes are the ones that read data, and then higher-level nodes do something
with that data such as filtering, aggregating, and sorting. But in a query, we write it in reverse order. select
and from are always the first expressions, and group by or limit are always the last ones.


If you don't want to run the query but still want to see the detailed execution plan, you can use explain
format=tree <query> .

How to read a line such as this:

-> Table scan on page_views (cost=456970 rows=4.31e+6) (actual time=0.0969..944 rows=4.44e+6 loops=1)

The first part is the actual operation which is quite easy to understand:

Table scan on page_views

MySQL makes a full table scan. In other words, it reads every row from the filesystem.

The next part is:

(cost=456970 rows=4.31e+6)

These are the estimates by the optimizer. MySQL estimates that it needs to read 4.31 million rows and it
costs 456970 . This cost is an internal value calculated by MySQL's optimizer. I don't know how it's calculated
and I only use it to compare operations with each other. So 456970 doesn't tell us anything useful in
isolation. We need to compare it to other cost values.

And the last part is:

(actual time=0.0969..944 rows=4.44e+6 loops=1)

These are the actual values after running the query. Time is the most interesting and it has two parts:

It took MySQL 0.0969ms to read the first row

And it took 944ms to read all the rows

The operation actually returned 4.44 million rows and MySQL ran this only once (loop=1). The loop value
increases if you use JOIN statements in the query. Then time values become average values. So if the loop
was equal to 2 then 944ms would be the average value. Meaning, the total execution time of reading the
data would be 2*944ms or 1.8s.

It's important to note that actual time values include all the child node's execution times.

Here's the whole expression simplified, showing only the actual time values:

No. 44 / 145
Martin Joo - Performance with Laravel

-> Sort: date (actual time=1976..1976)
    -> Table scan on <temporary> (actual time=1976..1976)
        -> Aggregate using temporary table (actual time=1976..1976)
            -> Filter: (...) (actual time=0.124..1554)
                -> Table scan on page_views (actual time=0.0969..944)

We can read it like this:

The table scan took 944ms

Then the filter took 1554ms but this value also includes the table scan. So the actual time for the
where expression was 610ms

Then I honestly don't know why the remaining steps show 0ms of additional time. I tried it multiple times
but it is always 0. Maybe it's a bug in explain analyze .

So we know exactly that reading data takes 944ms and running the where expression takes another 600ms.

Now let's talk about the number of rows. There are two numbers: the estimated and the actual number of
rows. If the two differ significantly from each other it's a hint that something is wrong with the query. The
reason is that MySQL thinks it needs to query 100 rows (for example) so the optimizer chooses a plan that is
optimized for a small number of rows. But in reality, it needs to scan 1 million rows. So the execution plan
chosen by the optimizer is not the optimal one. Let's see the current numbers.

Table scan:

Table scan on page_views (cost=456970 rows=4.31e+6) (actual time=0.0969..944 rows=4.44e+6 loops=1)

It's 4.31 million vs 4.44 million. It's not a big difference at all. MySQL estimated the number of rows quite
well.

Filter:

Filter: (cost=456970 rows=47880) (actual time=0.124..1554 rows=1.42e+6 loops=1)

It's 47,880 vs 1.42 million. That's a big difference.

After running explain analyze , two important things were revealed:

Running a simple where expression takes almost as much time as reading 4 million rows

When running the where expression MySQL thinks there are only 47k rows satisfying the filters but in
fact there are 1.42 million


site_id index
We discovered some inefficiencies in the query which is no surprise given we haven't used any indices. Our
next move was to add a foreign key and index to the site_id column but the query got even worse.

Here's the explain analyze output with an index:

-> Sort: date (actual time=2011..2011 rows=13 loops=1)
    -> Table scan on <temporary> (actual time=2011..2011 rows=13 loops=1)
        -> Aggregate using temporary table (actual time=2011..2011 rows=13 loops=1)
            -> Filter: (page_views.visited_at between '2023-03-10 00:00:00' and '2024-03-10 13:53:33') (cost=101195 rows=239401) (actual time=9.08..1586 rows=1.42e+6 loops=1)
                -> Index lookup on page_views using site_id (site_id=1) (cost=101195 rows=2.15e+6) (actual time=9.03..1074 rows=1.45e+6 loops=1)

The only difference is that MySQL does an index lookup instead of a full table scan. But it's not faster. In this
case, it's a little bit slower than before. And the filter is also a bit slower. So it uses the index but there's no
increase in speed.

In fact, there are some cases when it takes 2.5s to run:

-> Sort: date (actual time=2699..2699 rows=13 loops=1)
    -> Table scan on <temporary> (actual time=2699..2699 rows=13 loops=1)
        -> Aggregate using temporary table (actual time=2699..2699 rows=13 loops=1)
            -> Filter: (page_views.visited_at between '2023-03-10 00:00:00' and '2024-03-10 13:53:33') (cost=95406 rows=239247) (actual time=3.08..2280 rows=1.42e+6 loops=1)
                -> Index lookup on page_views using site_id (site_id=1) (cost=95406 rows=2.15e+6) (actual time=2.98..1771 rows=1.45e+6 loops=1)

Unfortunately, just by looking at these statements, it's almost impossible to tell why it's slower with an
index. However, if we run a simple explain we can get a hint:


In the extra column there's no Using index . Using index means that the query uses only the index.
This is not the case here. This query uses an index but also performs disk I/O operations.

Here's what's happening:

The index contains only site IDs

MySQL reads 1.45 million rows from the index

But the query needs visited_at dates for the select and the where expressions as well

So it performs tens of thousands of disk I/O operations to read the visited_at column for each
row

This is why the query got worse. In some cases, it takes 1.7s only to read data from the index. Of course,
that 1.7s includes disk I/O as well.


visited_at and site_id composite index


The solution to that problem is to put the data on the index. I created a new, composite index with both
columns:

create index `visited_at_site_id_idx` on `page_views` (`visited_at`, `site_id`) using btree;

Now the query took 1.7s:

-> Sort: date (actual time=1713..1713 rows=13 loops=1)
    -> Table scan on <temporary> (actual time=1713..1713 rows=13 loops=1)
        -> Aggregate using temporary table (actual time=1713..1713 rows=13 loops=1)
            -> Filter: ((page_views.site_id = 1) and (page_views.visited_at between '2023-03-10 00:00:00' and '2024-03-10 13:53:33')) (cost=436337 rows=215413) (actual time=0.0635..1279 rows=1.42e+6 loops=1)
                -> Covering index range scan on page_views using visited_at_site_id_idx over ('2023-03-10 00:00:00' <= visited_at <= '2024-03-10 13:53:33' AND site_id = 1) (cost=436337 rows=2.15e+6) (actual time=0.0514..676 rows=4.34e+6 loops=1)

We're back to where we started. For some reason, the query still takes 1.7s, and reading data from the
index takes 676ms which is quite a lot.

Can you spot the problem? Site 1 has 1.4 million page views overall. Now take a closer look at this line:

-> Covering index range scan on page_views using visited_at_site_id_idx over ('2023-03-10 00:00:00' <= visited_at <= '2024-03-10 13:53:33' AND site_id = 1) (cost=436337 rows=2.15e+6) (actual time=0.0514..676 rows=4.34e+6 loops=1)

MySQL returns 4.34 million rows (the whole table) instead of 1.4m

As we discussed earlier, an index is sorted. But you can only sort a dataset based on one column.

Here's a representation of the visited_at_site_id_idx index as a tree:


1034, 1, 972, etc. are site IDs. Nodes for site #1 are in a completely random order since the tree is sorted by
visited_at , the first column in the index.
Here's the table representation of the index to better understand it:

visited_at site_id

2023-03-10 00:00:00 972

2023-03-10 00:00:02 1

2023-03-10 00:00:03 12

2023-03-10 00:00:04 1034

2023-03-10 00:00:05 1

2023-03-10 00:00:06 19

2023-03-10 00:00:07 6

When MySQL sees this where expression:

where
`visited_at` between '2023-03-10 00:00:00' and '2024-03-10 13:53:33'
and `site_id` = 1

It can only use the first column in the index to look for a range. Which is, of course, visited_at , so it
returns everything between March 2023 and March 2024. In my demo database, every page_views record
has a visited_at date between these dates, so it means the whole table.


With this composite index, we avoided I/O operations but we made MySQL work with 3x as much data as it
should. And please remember cardinality, which we have already discussed in the DB indexing chapter.


site_id and visited_at composite index


If our theory is right, we should just flip the columns in the index:

create index `site_id_visited_at_idx` on `page_views` (`site_id`, `visited_at`) using btree;

By doing so, the index looks like this as a table:

site_id visited_at

1 2023-03-10 00:00:02

1 2023-03-10 00:00:05

6 2023-03-10 00:00:07

12 2023-03-10 00:00:03

19 2023-03-10 00:00:06

972 2023-03-10 00:00:00

1034 2023-03-10 00:00:04

Now it should be pretty easy to look up records in the index based on the site_id .

With this index, the query takes 1.1s with the following execution plan:

-> Sort: date (actual time=1173..1173 rows=13 loops=1)
    -> Table scan on <temporary> (actual time=1173..1173 rows=13 loops=1)
        -> Aggregate using temporary table (actual time=1173..1173 rows=13 loops=1)
            -> Filter: ((page_views.site_id = 1) and (page_views.visited_at between '2023-03-10 00:00:00' and '2024-03-10 13:53:33')) (cost=436337 rows=2.15e+6) (actual time=0.0668..751 rows=1.42e+6 loops=1)
                -> Covering index range scan on page_views using site_id_visited_at_idx over (site_id = 1 AND '2023-03-10 00:00:00' <= visited_at <= '2024-03-10 13:53:33') (cost=436337 rows=2.15e+6) (actual time=0.0563..219 rows=1.42e+6 loops=1)

The index range scan takes 219ms instead of 676ms and the actual row number is 1.42m. We reduced the
number of rows by a factor of 3 and the index scan was 3x faster.


If we now run a simple explain it gives the following items in the extra column:

Using where; Using index; Using temporary; Using filesort

Using index means the query uses only the index and does not perform disk I/O.

Just by playing with the columns and the column order in the index, we had made MySQL:

Perform tens of thousands of unwanted I/O operations

Scan 4.3 million records instead of 1.4m

Finally, the query only uses the index. It processes the right number of rows. And it still runs in just over 1s.
The where expression still needs ~500ms to run, and grouping and sorting take another ~400ms. Which is
not optimal.

Before moving on, let's summarize how all queries perform with this new index:

Query                 Time with index    Time without index

Page views            1.1s               1.6s

Unique visitors       3.2s               3.5s

Most visited pages    5.3s               7s

As you can see the situation is a bit better, but still terrible.

In the last explain analyze we've seen that filtering the results takes 500ms, and grouping and sorting
takes another 400ms. Which is quite a lot. Just a quick reminder, this is what the query looks like:

select DATE_FORMAT(visited_at, "%Y-%m") as `date`, COUNT(*) AS total
...
group by `date`
order by `date` asc

As you can see it uses the date_format function. I don't know how exactly MySQL performs these select,
group by, and order by operations but in the worst-case scenario, date_format is invoked 1.4 million times
and it makes the grouping process a lot harder.

What happens if we remove the function and (just for a moment) simply use the visited_at column?


select visited_at as `date`, COUNT(*) AS total
...
group by `date`
order by `date` asc

Now the query took around 800-900ms and had the following execution plan:

-> Group (no aggregates) (cost=375309 rows=1575) (actual time=0.0808..912 rows=83285 loops=1)
    -> Filter: ((page_views.site_id = 1) and (page_views.visited_at between '2023-03-10 00:00:00' and '2024-03-10 13:53:33')) (cost=251264 rows=1.24e+6) (actual time=0.0384..780 rows=1.49e+6 loops=1)
        -> Covering index range scan on page_views using site_id_visited_at_idx over (site_id = 1 AND '2023-03-10 00:00:00' <= visited_at <= '2024-03-10 13:53:33') (cost=251264 rows=1.24e+6) (actual time=0.0302..226 rows=1.49e+6 loops=1)

It still uses the index in the same way. The filter is the same as before but grouping and sorting become a bit
simpler and faster. There's no need for temporary tables or table scans, and instead of ~400ms it takes only
132ms. So it looks like we can gain 20-25% overall performance (1.1s-1.2s vs 800-900ms) just by removing
the date_format function. But how can we do that?


Denormalization
Denormalization means storing redundant or precomputed data in the table to improve query
performance.

What if instead of this:

select date_format(visited_at, "%Y") as date

we do something like this:

select visited_at_year

So instead of relying on one visited_at column and manipulating it with the date_format function we
extract the different parts into multiple columns:

visited_at_year

visited_at_month

visited_at_day

visited_at_hour
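Adding the new columns is a straightforward migration (a sketch; the column type matches the smallint unsigned mentioned a bit later):

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('page_views', function (Blueprint $table) {
            // Each part of visited_at gets its own small integer column
            $table->unsignedSmallInteger('visited_at_year');
            $table->unsignedSmallInteger('visited_at_month');
            $table->unsignedSmallInteger('visited_at_day');
            $table->unsignedSmallInteger('visited_at_hour');
        });
    }
};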

This way MySQL can work with simple int numbers instead of functions. A where expression would look like
this:

where
(
    (visited_at_year = 2023 and visited_at_month = 3 and visited_at_day >= 10)
    or
    (visited_at_year = 2023 and visited_at_month > 3)
    or
    (visited_at_year = 2024 and visited_at_month = 3 and visited_at_day <= 10)
    or
    (visited_at_year = 2024 and visited_at_month < 3)
)
and site_id = 1;


This query returns page_views records between 2023-03-10 and 2024-03-10 .

The first part selects rows that were created after 2023-03-10 but before 2023-04-01 (because
visited_at_month = 3 )

The second part selects everything after 2023-03-31 but before 2024-01-01 (because
visited_at_month > 3 )

The third part selects rows from 2024 March before 2024-03-10

And the last one selects page views in 2024 before March

This particular query is used for monthly charts such as the last 12 months, and the last 6 months.

To get the last 30 or 7 days the query looks like this:

where
(
    (visited_at_year = 2024 and visited_at_month = 3 and visited_at_day >= 10)
    or
    (visited_at_year = 2024 and visited_at_month = 4 and visited_at_day <= 10)
)
and site_id = 1

This query returns records between 2024-03-10 and 2024-04-10 .

The first part selects rows that were created after 2024-03-10

The second selects the ones that were created before 2024-04-10

Finally, this is an hourly query that returns the data for the last 24 hours chart:

where
(
    visited_at_year = 2024 and visited_at_month = 3 and visited_at_day = 10 and visited_at_hour > 15
    or
    visited_at_year = 2024 and visited_at_month = 3 and visited_at_day = 11 and visited_at_hour <= 15
)
and site_id = 1


This query returns page views between 2024-03-10 15:00:00 and 2024-03-11 15:00:00 .

The first part returns rows that were created on the 10th of March after 15:00:00

The second part returns rows that were created on the 11th of March at or before 15:00:00

Sure, these queries are more complicated so let's see if it's worth it from a performance point of view.

First, we need to add a new index that contains the new columns and the site_id :

create index `site_id_visited_at_idx` on `page_views`
(`site_id`, `visited_at_year`, `visited_at_month`, `visited_at_day`, `visited_at_hour`)
USING BTREE

All of these columns are smallint unsigned .
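Just for reference, here's a minimal sketch of what the migration for these columns and the composite index could look like. The column names, types, and the index name come from this chapter; the migration boilerplate, and anything about defaults or backfilling existing rows, is my assumption:

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('page_views', function (Blueprint $table) {
            // The denormalized date parts, all smallint unsigned
            $table->unsignedSmallInteger('visited_at_year');
            $table->unsignedSmallInteger('visited_at_month');
            $table->unsignedSmallInteger('visited_at_day');
            $table->unsignedSmallInteger('visited_at_hour');

            // The composite index used by the date filter queries
            $table->index(
                ['site_id', 'visited_at_year', 'visited_at_month', 'visited_at_day', 'visited_at_hour'],
                'site_id_visited_at_idx',
            );
        });
    }
};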

The whole query looks like this:

select concat_ws('-', visited_at_year, visited_at_month) as `date`, count(*) as total
from page_views
where
(
    (visited_at_year = 2023 and visited_at_month = 3 and visited_at_day >= 10)
    or
    (visited_at_year = 2023 and visited_at_month > 3)
    or
    (visited_at_year = 2024 and visited_at_month = 3 and visited_at_day <= 10)
    or
    (visited_at_year = 2024 and visited_at_month < 3)
)
and site_id = 1
group by `date`
order by `date`;

The result is 791ms. That's a 33% improvement compared to 1.1s

This is the execution plan:


-> Sort: date (actual time=791..791 rows=13 loops=1)
    -> Table scan on <temporary> (actual time=791..791 rows=13 loops=1)
        -> Aggregate using temporary table (actual time=791..791 rows=13 loops=1)
            -> Filter: ((page_views.site_id = 1) and (((page_views.visited_at_month = 3) and (page_views.visited_at_year = 2023) and (page_views.visited_at_day >= 10)) or ((page_views.visited_at_year = 2023) and (page_views.visited_at_month > 3)) or ((page_views.visited_at_month = 3) and (page_views.visited_at_year = 2024) and (page_views.visited_at_day <= 10)) or ((page_views.visited_at_year = 2024) and (page_views.visited_at_month < 3)))) (cost=367003 rows=1.81e+6) (actual time=0.297..378 rows=1.24e+6 loops=1)
                -> Covering index range scan on page_views using site_id_visited_at over (site_id = 1 AND visited_at_year = 2023 AND visited_at_month = 3 AND 10 <= visited_at_day) OR (site_id = 1 AND visited_at_year = 2023 AND 3 < visited_at_month) OR (2 more) (cost=367003 rows=1.81e+6) (actual time=0.247..243 rows=1.24e+6 loops=1)

It's a bit of a mess, so here's a simplified version:

-> Sort: date (actual time=791..791 rows=13 loops=1)
    -> Table scan on <temporary> (actual time=791..791 rows=13 loops=1)
        -> Aggregate using temporary table (actual time=791..791 rows=13 loops=1)
            -> Filter (cost=367003 rows=1.81e+6) (actual time=0.297..378 rows=1.24e+6 loops=1)
                -> Covering index range scan (cost=367003 rows=1.81e+6) (actual time=0.247..243 rows=1.24e+6 loops=1)

Step-by-step:

Reading from the index: 243ms

Filtering: 135ms

Grouping and sorting: 413ms

Here's a comparison with the previous query (using the visited_at column):


Step                   Time using denormalized columns    Time using visited_at    Improvement
Reading from index     243ms                              219ms                    -10%
Filtering              135ms                              532ms                    75%
Grouping and sorting   413ms                              422ms                    1%
Overall                728ms                              1100ms                   33%

Reading from the index got a bit slower which can be explained by the size of the index. Earlier it contained
only two columns, now it contains 5 of them. So it makes the reading just a bit slower. Filtering got much
much faster thanks to the denormalized columns. Overall, the 28% performance improvement is quite good
in my opinion.

The cost of this improvement is:

More complicated where statements

Extra code to calculate the new columns when creating a new page_views record

We've already seen the new where expressions. Here's the extra code needed to maintain the new
columns:

namespace App\Models;

class PageView extends Model
{
    protected static function boot()
    {
        parent::boot();

        static::creating(function (PageView $pageView) {
            $pageView->visited_at_year = $pageView->visited_at->format('Y');
            $pageView->visited_at_month = $pageView->visited_at->format('m');
            $pageView->visited_at_day = $pageView->visited_at->format('d');
            $pageView->visited_at_hour = $pageView->visited_at->format('H');
        });
    }
}


This is the whole query using query builder:

$data = DB::table('page_views')
    ->select(
        DB::raw("concat_ws('-', visited_at_year, lpad(visited_at_month, 2, '0')) as date"),
        DB::raw('COUNT(*) as total'),
    )
    ->where(function ($query) use ($dateFilter) {
        $query
            ->orWhere(function ($subQuery) use ($dateFilter) {
                $subQuery
                    ->where('visited_at_year', $dateFilter->startDate->format('Y'))
                    ->where('visited_at_month', $dateFilter->startDate->format('m'))
                    ->where('visited_at_day', '>=', $dateFilter->startDate->format('d'));
            })
            ->orWhere(function ($subQuery) use ($dateFilter) {
                $subQuery
                    ->where('visited_at_year', $dateFilter->startDate->format('Y'))
                    ->where('visited_at_month', '>', $dateFilter->startDate->format('m'));
            })
            ->orWhere(function ($subQuery) use ($dateFilter) {
                $subQuery
                    ->where('visited_at_year', $dateFilter->endDate->format('Y'))
                    ->where('visited_at_month', $dateFilter->endDate->format('m'))
                    ->where('visited_at_day', '<=', $dateFilter->endDate->format('d'));
            })
            ->orWhere(function ($subQuery) use ($dateFilter) {
                $subQuery
                    ->where('visited_at_year', $dateFilter->endDate->format('Y'))
                    ->where('visited_at_month', '<', $dateFilter->endDate->format('m'));
            });
    })
    ->where('site_id', $this->site->id)
    ->groupBy('date')
    ->orderBy('date')
    ->get();

I think these changes are worth the performance improvement.

There are different variations of this query. You can find them all in a trait called BuildsDateFilterQuery .

Here's a summary table that contains the results so far:

Step                                        Result
Starting point                              1.2s
site_id index                               2s
visited_at and site_id composite index      1.7s
site_id and visited_at composite index      1.1s
Denormalization                             791ms

Here's a summary of the different requests in Telescope:

As you can see we are now able to query 24-hour data in 277ms without any cache or "architectural"
change. Querying the last 6 months takes 423ms. Before you say "well, it's quite a lot" let's take a look at
what happens when I click "last 6 months" in Splitbee:

These are requests way above 1 second. It takes a total of 3-4 seconds while the charts load the data. And
my site doesn't have 1.5 million page views, only 150k:


So I think a range from 277ms to 808ms is not that bad. However, by the end of the chapter, it's going to be down in the 50-80ms range (and not by using Cache::get() ).


Hashing URLs
If you go back to the beginning of this chapter you can see that most-visited-pages requests (queries,
really) took 5.7s for the last 12 months. It's quite a lot.

The query looks like this:

select
uri, count(*) as total
from
`page_views`
where visited_at between "2023-03-10" and "2024-03-10"
and site_id = 1
group by uri
order by total desc
limit 5

Of course, now we have visited_at_* columns, however, it won't help in this case because this query only
uses the visited_at date to filter results. So using the denormalized column won't give us any benefits.

This is the execution plan using the denormalized columns:


"% Limit: 5 row(s) (actual time=5583".5583 rows=5 loops=1)


"% Sort: total DESC, limit input to 5 row(s) per chunk (actual
time=5583".5583 rows=5 loops=1)
"% Table scan on <temporary> (actual time=5476".5552 rows=570299
loops=1)
"% Aggregate using temporary table (actual time=5476".5476 rows=570298
loops=1)
"% Filter: ((page_views.site_id = 1) and
(((page_views.visited_at_month = 3) and (page_views.visited_at_year = 2023)
and (page_views.visited_at_day "+ 10)) or ((page_views.visited_at_year = 2023)
and (page_views.visited_at_month > 3)) or ((page_views.visited_at_month = 3)
and (page_views.visited_at_year = 2024) and (page_views.visited_at_day "/ 10))
or ((page_views.visited_at_year = 2024) and (page_views.visited_at_month <
3)))) (cost=276838 rows=1.82e+6) (actual time=0.069".739 rows=1.24e+6
loops=1)
"% Covering index scan on page_views using site_id_visited_at
(cost=276838 rows=2.57e+6) (actual time=0.065".551 rows=2.65e+6 loops=1)

Still between 5.5s and 6s. As you can see, aggregating the results takes 4.7s, which is awful.

We use the uri column in the group by which is a varchar(255) column and contains a URI such as
Sony-Lightweight-Super-Compact-Extra-Durable-Waterproof/dp/B0C29CL98P . This one comes from
Amazon. They can be long. Very long. And it causes problems.

One technique we can use is to hash these URLs into unified 40-char long SHA1 values such as
e2b3cedbfad05bbd3e918acc0ccbcc5ba3b0454a . This is what the process looks like:

We add a new column called hashed_uri

When creating a new page_views record we calculate the hash value of the URI using PHP's sha1
function. Of course, it makes writing slower. In this application, it's acceptable for three reasons:

They happen in the background so it's okay if they take a little bit longer. It doesn't affect the user
experience

They don't have to be real-time. These are not heart rate data or something super crucial. It's okay
if we have a slight delay.

Reads are just way more important in this app. Even if a write takes 5 minutes (including the job sitting in the queue for 4 minutes 59 seconds when there's a high load) it's not a problem for the users. They'd rather see the last 7 days, 30 days, and 6 months on the UI fast.

When querying we use the new hashed_uri column

After we have the top 5 pages we query the real URIs based on the hash values

The benefits are:


SHA1 hash values have a fixed length of 40 characters so we can use CHAR(40) in MySQL

The character_set can be ascii since hash values don't contain special characters

The grouping will become much faster (hopefully)

I know these look like small details. But having a fixed 40-char column in ascii makes a huge difference
when you need to aggregate 2.5 million rows. An ascii character takes 1 byte of storage no matter what. A
UTF8 can take up 1-4 bytes depending on the exact character. In the worst-case scenario, that can be a 4x
difference between the two of them. That means lots of memory and CPU.

To demonstrate these points, I'll try different settings and measure them:

nullable varchar(255) utf8

nullable varchar(40) utf8

nullable char(40) utf8

nullable char(40) ascii

not nullable char(40) ascii

First of all, this is the code needed to calculate the hash values:

class PageView extends Model
{
    protected static function boot()
    {
        parent::boot();

        static::creating(function (PageView $pageView) {
            $pageView->visited_at_year = $pageView->visited_at->format('Y');
            $pageView->visited_at_month = $pageView->visited_at->format('m');
            $pageView->visited_at_day = $pageView->visited_at->format('d');
            $pageView->visited_at_hour = $pageView->visited_at->format('H');

            $pageView->hashed_uri = sha1($pageView->uri);
        });
    }
}


The query I'm running is the exact same as before but uses the denormalized column (no performance gain
from it), and the new hashed_uri column:

select
    hashed_uri, count(*) as total
from
    `page_views`
where
(
    (visited_at_year = 2023 and visited_at_month = 3 and visited_at_day >= 10)
    or
    (visited_at_year = 2023 and visited_at_month > 3)
    or
    (visited_at_year = 2024 and visited_at_month = 3 and visited_at_day <= 10)
    or
    (visited_at_year = 2024 and visited_at_month < 3)
)
and site_id = 1
group by hashed_uri
order by total desc
limit 5

Setting #1

This is the first setting:

`hashed_uri` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,

The result is 4.8s with the following plan:


-> Limit: 5 row(s) (actual time=4820..4820 rows=5 loops=1)
    -> Sort: total DESC, limit input to 5 row(s) per chunk (actual time=4820..4820 rows=5 loops=1)
        -> Table scan on <temporary> (actual time=4717..4791 rows=570299 loops=1)
            -> Aggregate using temporary table (actual time=4717..4717 rows=570298 loops=1)
                -> Filter: (cost=275704 rows=1.81e+6) (actual time=0.0825..805 rows=1.24e+6 loops=1)
                    -> Covering index scan on page_views using site_id_visited_at (cost=275704 rows=2.56e+6) (actual time=0.0772..611 rows=2.65e+6 loops=1)

It's almost 1s faster than using the un-hashed URIs but it's still quite slow. Most of the time is spent
grouping the results.

Setting #2

The second setting:

`hashed_uri` varchar(40) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,

Instead of varchar(255) it's varchar(40)

The result is 4.55s with the following plan:

-> Limit: 5 row(s) (actual time=4551..4551 rows=5 loops=1)
    -> Sort: total DESC, limit input to 5 row(s) per chunk (actual time=4551..4551 rows=5 loops=1)
        -> Table scan on <temporary> (actual time=4449..4522 rows=570299 loops=1)
            -> Aggregate using temporary table (actual time=4449..4449 rows=570298 loops=1)
                -> Filter: (cost=381020 rows=1.75e+6) (actual time=0.0334..516 rows=1.24e+6 loops=1)
                    -> Covering index range scan on page_views using site_id_visited_at (cost=381020 rows=1.75e+6) (actual time=0.0302..382 rows=1.24e+6 loops=1)

It's a bit better but still slow.


Setting #3

The third one is:

`hashed_uri` char(40) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,

Instead of varchar(40) it's char(40)

The result is 7.7s with the following plan:

-> Limit: 5 row(s) (actual time=7718..7718 rows=5 loops=1)
    -> Sort: total DESC, limit input to 5 row(s) per chunk (actual time=7718..7718 rows=5 loops=1)
        -> Table scan on <temporary> (actual time=7466..7674 rows=570299 loops=1)
            -> Aggregate using temporary table (actual time=7466..7466 rows=570298 loops=1)
                -> Filter: (cost=381344 rows=1.75e+6) (actual time=0.0569..651 rows=1.24e+6 loops=1)
                    -> Covering index range scan on page_views using site_id_visited_at (cost=381344 rows=1.75e+6) (actual time=0.0524..502 rows=1.24e+6 loops=1)

It just got way worse than before. It spends almost 7 seconds on the group by only. So using a fixed length
column is a bad thing after all? Well, let's try ascii first.

Setting #4

The 4th one is:

`hashed_uri` char(40) CHARACTER SET ascii COLLATE ascii_general_ci DEFAULT NULL,

Instead of utf8 it's ascii

The result is 3.1s with the following plan:


-> Limit: 5 row(s) (actual time=3144..3144 rows=5 loops=1)
    -> Sort: total DESC, limit input to 5 row(s) per chunk (actual time=3144..3144 rows=5 loops=1)
        -> Table scan on <temporary> (actual time=3037..3111 rows=570299 loops=1)
            -> Aggregate using temporary table (actual time=3037..3037 rows=570298 loops=1)
                -> Filter: (cost=364160 rows=1.75e+6) (actual time=0.053..563 rows=1.24e+6 loops=1)
                    -> Covering index range scan on page_views using site_id_visited_at (cost=364160 rows=1.75e+6) (actual time=0.0486..426 rows=1.24e+6 loops=1)

It still spends most of the time on the group by .

Setting #5

The last step is to make the column NOT NULLABLE :

`hashed_uri` char(40) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL

The result is 2.99s with the following plan:

-> Limit: 5 row(s) (actual time=2995..2995 rows=5 loops=1)
    -> Sort: total DESC, limit input to 5 row(s) per chunk (actual time=2995..2995 rows=5 loops=1)
        -> Table scan on <temporary> (actual time=2892..2965 rows=570299 loops=1)
            -> Aggregate using temporary table (actual time=2892..2892 rows=570298 loops=1)
                -> Filter: (cost=385085 rows=1.84e+6) (actual time=0.0532..481 rows=1.24e+6 loops=1)
                    -> Covering index range scan on page_views using site_id_visited_at (cost=385085 rows=1.84e+6) (actual time=0.0487..345 rows=1.24e+6 loops=1)

Here's a quick summary:


Setting                         Last 12 months query execution time
Raw URIs                        ~6s
Hash URIs                       5.5s
nullable varchar(255) utf8      4.8s
nullable varchar(40) utf8       4.55s
nullable char(40) utf8          7.7s
nullable char(40) ascii         3.1s
not nullable char(40) ascii     2.99s

It's crazy how you can improve the performance of a query just by using a single trick and the right data
types. So here's what we have just learned:

When you have long string data that is used in a group or where expression try to "normalize" that
data somehow. In this case, we used a hash function to shorten the data to a fixed length. In some
cases, you can also use only a part of a column. For example, let's say you store products in your table
and you have a product_code column that is 20-char long. The first 2 characters identify the product
category. You can use the LEFT function in group by or where expressions such as this:

select *
from products
where left(product_code, 2) = 'PC';

In Laravel we tend to do $table->string() whenever we want to store a string. However, if you have a fixed-length string just use char if you can (see the migration sketch right after this list).

Character encoding matters. ascii takes up a lot less storage and is easier to deal with. Whenever you have something like a product code, barcode, or some other string-based identifier that only contains ASCII characters, you can use the ascii character set to make MySQL's life easier.

Even small things such as nullable vs not nullable can mean a small 2% or 3% increase here and
there. It won't change your life, but in the long term, it makes your database a bit faster.

We all know that varchar means variable length. So a column won't take up 255 bytes if it contains
the string "hello". It stores the string and the length of it. However, there are situations when MySQL
converts a varchar(255) into a char(255) during a query. It happens when it creates temporary
tables during some join , group by , or order by . So the column size can matter in these situations.
In my opinion, you should always try to use a more specific value than 255.
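As promised above, here's a hedged sketch of how the final hashed_uri column (setting #5) could be declared in a Laravel migration. The char(40) / ascii / not nullable definition comes from this section; the surrounding migration boilerplate is an assumption:

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

Schema::table('page_views', function (Blueprint $table) {
    // char(40), ascii, not nullable -- the fastest setting measured above
    $table->char('hashed_uri', 40)
        ->charset('ascii')
        ->collation('ascii_general_ci');
});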

So now we have a query that returns hashed URIs:


hashed_uri                                   total
e2b3cedbfad05bbd3e918acc0ccbcc5ba3b0454a     1148
69598452eee500b84e4e8140c146ea867faadf5b     687
14c7ba2a256d28806397281ef0c80da9d90a2d07     685
384468fc1563a56b540a7b42a657a0c86c1055e8     668
6d2cce9f4eb799945e2dd3d69143bfd16ffb5a97     614

Obviously, it's not very user-friendly. We need to resolve the hashes back to the original URIs, which can be done with this simple query:

select distinct uri
from page_views
where hashed_uri in (
    "e2b3cedbfad05bbd3e918acc0ccbcc5ba3b0454a",
    "69598452eee500b84e4e8140c146ea867faadf5b",
    "14c7ba2a256d28806397281ef0c80da9d90a2d07",
    "384468fc1563a56b540a7b42a657a0c86c1055e8",
    "6d2cce9f4eb799945e2dd3d69143bfd16ffb5a97"
)
limit 5

Yes, it's an extra query but it only takes between 5ms and 10ms to run:


Using query builder, the queries look like this:

public function getData(DateFilter $dateFilter): array
{
    $hashedUris = DB::table('page_views')
        ->select('hashed_uri', DB::raw('COUNT(*) as total'))
        ->where(function ($query) use ($dateFilter) {
            $query
                ->orWhere(function ($subQuery) use ($dateFilter) {
                    $subQuery
                        ->where('visited_at_year', $dateFilter->startDate->format('Y'))
                        ->where('visited_at_month', $dateFilter->startDate->format('m'))
                        ->where('visited_at_day', '>=', $dateFilter->startDate->format('d'));
                })
                ->orWhere(function ($subQuery) use ($dateFilter) {
                    $subQuery
                        ->where('visited_at_year', $dateFilter->startDate->format('Y'))
                        ->where('visited_at_month', '>', $dateFilter->startDate->format('m'));
                })
                ->orWhere(function ($subQuery) use ($dateFilter) {
                    $subQuery
                        ->where('visited_at_year', $dateFilter->endDate->format('Y'))
                        ->where('visited_at_month', $dateFilter->endDate->format('m'))
                        ->where('visited_at_day', '<=', $dateFilter->endDate->format('d'));
                })
                ->orWhere(function ($subQuery) use ($dateFilter) {
                    $subQuery
                        ->where('visited_at_year', $dateFilter->endDate->format('Y'))
                        ->where('visited_at_month', '<', $dateFilter->endDate->format('m'));
                });
        })
        ->where('site_id', $this->site->id)
        ->groupBy('hashed_uri')
        ->orderByDesc('total')
        ->limit(5)
        ->get();

    $data = DB::table('page_views')
        ->selectRaw("distinct(uri)")
        ->whereIn('hashed_uri', $hashedUris->pluck('hashed_uri'))
        ->get();

    $results = [];

    foreach ($data as $i => $item) {
        $results[$item->uri] = $hashedUris[$i]->total;
    }

    return $results;
}

It's more complicated than it was before but that's a trade-off. If you want your code to be fast and efficient you usually need to sacrifice some readability.

Let's recap our progress so far. At the beginning of the chapter, we started with some pretty ugly numbers. Now the situation is a little bit better:

Query               Execution time before    Execution time now

Page Views
- Last 12 months    1275ms                   714ms
- Last 6 months     1077ms                   361ms
- Last 30 days      895ms                    232ms
- Last 7 days       872ms                    228ms
- Last 24 hours     912ms                    82ms

As you can see, at the beginning it didn't really matter if we queried the last 24 hours or the last 6 months.
The difference was slight. The queries were pretty slow either way. Now there's a significant decrease in
execution time if we query less data. However, the execution time is not directly proportional to the amount
of data which is a good thing. For example, the last 12 months means 365 days which is 52 times larger than
7. However, querying the last 12 months doesn't take 52x more time than the last 7 days. It's only a 3x
difference.


Here's the summary for unique visitors:

Query               Execution time before    Execution time now

Unique Visitors
- Last 12 months    3710ms                   1380ms
- Last 6 months     2284ms                   703ms
- Last 30 days      1077ms                   394ms
- Last 7 days       966ms                    390ms
- Last 24 hours     1281ms                   95ms

Generally speaking, it's 3-4x faster than it was before.

And finally, the most visited pages:

Query               Execution time before    Execution time now

Most Visited Pages
- Last 12 months    6500ms                   2952ms
- Last 6 months     3350ms                   1213ms
- Last 30 days      1094ms                   449ms
- Last 7 days       1041ms                   444ms
- Last 24 hours     1053ms                   77ms

Similar situation here. 2-3x faster on average.

Aren't these 1200ms and 2900ms queries a problem? Well, on the one hand, yeah, they are slow, on the
other hand, it's not necessarily a problem. Here's why:

I tested these queries with a site that has ~1.5m page_views . My blog has been live for about 2 years
and it has 200k views. Let's say 1.5m is an average customer. Querying the last 12 months and waiting
for a 2.9s query is the worst that can happen to an average user. To put it another way: running a 2.9s
query against our database is the "worst average" user we can get.

Imagine you're a marketing guy and you walk into the office on a Wednesday morning. What is more
important to you? The last 12 months or the last 24 hours? The last 6 months or this week? Imagine
you're running a black Friday campaign. What do you want to check every 5 minutes? The last 12
months or the last 24 hours hour by hour? Of course, when the campaign ends you want to compare
this November to last year's November. But generally speaking, querying the last day, week, and
month is by far the most common use case in an application such as this one. We can handle the last
24 hours just under 100ms. 444ms is the slowest value for the last 7 days. The app can serve the last
30 days just under 500ms in every situation. The point is that we can handle the "average average"
scenario (average average means that a site with 1.5m page views checks the last 30 days) and we can
handle the "best average" scenario just under 100ms (a site with 1.5m page views checks the last 24
hours).


Redis
Redis is a beast.

It can do so much more than the usual

Cache"&set('foo', 'bar');

Cache"&get('foo');

but in my experience developers don't use it to its full extent because they don't know the possibilities. Most of us know Redis as a good cache and job queue solution and that's it. Because of that, often we don't even know it can be used to solve other problems.

Think about the queries from the previous chapters. Think about the table. Think about the optimizations we did. What's the fundamental problem? Why are the queries slow? What's the main reason we have to fight MySQL just to get a yearly overview of page views? The problem is aggregated views.

This (the page_views table):

id    site_id    visited_at             uri
1     1          2024-03-30 14:47:23    /blog
2     1          2024-03-30 14:48:11    /
3     1          2024-03-30 14:48:51    /blog/measuring-performance-in-laravel

And this (the most visited URIs on the UI):

uri                                        total
/blog                                      1478
/                                          1223
/blog/measuring-performance-in-laravel     917

are so different from each other. Modern applications present data in an aggregated, document-like
structure while MySQL stores them in a flat shape. This is one of the reasons it's hard to optimize these
kinds of queries with a larger dataset.

That's one of the reasons why document stores and key-value databases have become so popular in recent
years.


Redis is an in-memory key-value database. Usually, this description has two implications:

It's in-memory so you shouldn't store long-term data in it

It's a key-value store so it cannot handle complicated structures only strings

Let's debunk these. Redis offers two kinds of persistence solutions:

Append only file (AOF) is a very large log file that contains every operation and can be replayed against
your database. Imagine if MySQL stored every insert , update , delete operation and you can click a
"replay" button to run the same operations on a fresh database. It's very reliable. This log file can get
pretty big so it's a bit slow.

Redis database (RDB) is a file that contains a snapshot of your database. These snapshots can be made on a regular basis and you can rotate these RDB files just like your log files. It's similar to a sqlite file. RDB is usually faster than AOF, but it's less durable: if you have an outage, you can lose the writes that happened since the last snapshot.

In practice, you should mix the two solutions so it's going to be fast and reliable. As you can see, Redis has
two kinds of solutions to persistence so we can store long-term data in it.

Yes, Redis is a key-value store but it has lots of different data structures that can be used to a wide range of
problems. Just to name a few:

Strings

Lists

Sets

Sorted sets

Hash

Stream

Pub/sub

HyperLogLog

When you use

Cache"&set('foo', 'bar');

under the hood Laravel creates a new key called foo which is a string containing the value bar .

When you use

MyJob"&dispatch();

under the hood, Laravel adds a new item to a Redis list (probably using the command RPUSH ) with the
serialized version of your job. When you run


php artisan queue:work

Laravel starts a process that retrieves the oldest item from the list (FIFO, probably using the Redis command
LPOP ) deserializes the job class, and executes it. In this article, I present the basic mechanism of a job
queue (without using Laravel's built-in tools).

In a hash you can store objects with properties and values such as this one:

$site = [
    'name' => 'Amazon',
    'url' => 'amazon.com',
    // ...
];
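As a quick illustration (not code from the sample app), storing and reading such an object with Laravel's Redis facade could look like this; the site:1 key is just a made-up example:

use Illuminate\Support\Facades\Redis;

Redis::hset('site:1', 'name', 'Amazon');
Redis::hset('site:1', 'url', 'amazon.com');

Redis::hgetall('site:1');
// ['name' => 'Amazon', 'url' => 'amazon.com']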

streams and pub/sub can be used to communicate across different applications or services. You can build a whole microservice architecture around streams . You can implement notification systems with pub/sub . You can implement event sourcing with streams . You can use them to collect data from other sources such as IoT devices, or we can use them in this very application. The PageView service, instead of dispatching a job, can add a new event to a stream when a page view happens and the analytics service can consume the stream and build the database.


Sets
A set is a unique list of strings while a sorted set is a sorted version of it. It's very similar to a list or an
array but it doesn't contain duplicates.

You can think of a Redis set as a collection that doesn't have duplicates:

$set = collect(['value1', 'value2', 'value1'])->unique();

// ['value1', 'value2']

Anytime you add a new item Redis will guarantee the uniqueness.

You can think of a sorted set as an associative array that has a value and a score. The score is the basis of
the sorting:

$players = [
    [
        'name' => 'Mike',
        'score' => 1,
    ],
    [
        'name' => 'John',
        'score' => 3,
    ],
];

Anytime you add a new item Redis will sort them based on the score property.

They don't sound too interesting, right? However, the single fact that a sorted set is sorted at all times will
improve our response time from 2.9s to ~50ms when users want to see the most visited pages of the last 12
months. We will not just improve the most visited pages query but all of them. Using a simple, boring sorted
set. First, let's see how they work and then we start implementing them in the application.

Let's get you familiar with some Redis commands. You can spin up a server by running this:

docker run --name redis-db --rm -d -p 63790:6379 redis redis-server --save 60 1 --loglevel warning

It's going to listen on localhost:63790 . Then you can open a shell with the following command:


docker exec -it 9d9f053bfd85 sh

9d9f053bfd85 is the container ID which you can find out by running docker ps .

Then you need to run redis-cli to access the CLI where you can run commands and see the contents of
your database. For example keys * lists every key in the DB.

Let's try a simple set first. In Redis, we don't need to create keys, we just add elements. Redis commands are
pretty ugly at first, but they follow a convention. For example, if we want to add an item to a set the
command is SADD where the S refers to the data type which is Set in this case. And of course, add is the
command we want to run.

So here's how we can add a new item to a set:

SADD usernames mike
SADD usernames john
SADD usernames john

The first argument usernames is the name of the set. This is the key. The second one is the value.

The last command returns 0 which indicates that the second john wasn't added to the set because it's a
collection of unique values. We can validate it by running SMEMBERS usernames :

SMEMBERS returns all members (items) of a set.
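The same commands can be run from Laravel through the Redis facade. This is just an illustrative sketch, not code from the sample app:

use Illuminate\Support\Facades\Redis;

Redis::sadd('usernames', 'mike');
Redis::sadd('usernames', 'john');
Redis::sadd('usernames', 'john'); // returns 0, "john" is already a member

Redis::smembers('usernames');
// ['mike', 'john']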


Sorted sets
Sorted sets are a bit more "complicated." As I said, they have a value and a "score." The score can be any number. It is the basis of the sorting. All sorted set commands start with the letter Z . So this is how we can add items to a sorted set:

ZADD leaderboard 8 "mike"

This adds the value mike with the score 8 to a sorted set called leaderboard .

To list the members of the sorted set we can use the ZRANGE command:

ZRANGE leaderboard 0 -1

It takes two arguments: start and end . 0 is the start and -1 is the end. -1 is a special number that
means all members. So this command returns everything from position 0 to the end of the set:

As you can see, I added three items: John has a score of 4, Mike has 8, and Joe has 9 so the ascending order
is:

John (4)

Mike (8)

Joe (9)

If we also want to see the scores we can use the WITHSCORES flag in the command:


Since this set represents some kind of leaderboard, we probably need to get the items in descending order.
We can use the ZREVRANGE for this. REV stands for reverse so you can actually make sense of Redis
commands:

Z for sorted set

REV for reverse

RANGE for expressing that the command returns a range of items

With this command, we have the full leaderboard in descending order. If you want to query the top 3
contenders you can run

ZREVRANGE leaderboard 0 2 WITHSCORES

The first item should be $leaderboard[0] and the last one should be $leaderboard[2] .

Another useful command is ZREVRANGEBYSCORE which can be used to query items based on their scores
instead of indexes. It stands for:

Z for sorted set


REV for reverse

RANGE for a range query

BYSCORE for expressing that we want to query items based on their scores instead of indexes

This is how we can query players below 5 points:

ZREVRANGEBYSCORE leaderboard 5 0 WITHSCORES

The result is:


Page views
If you want to follow along just seed the database of the sample app, and then, run:

php artisan app:dispatch

It will dispatch lots of BuildRedisDataJob so you can start a few workers and they will build up the data in
Redis.

A sorted set is a perfect data structure to store page views. In the queries, we need three kinds of
"resolution":

Hourly for the last 24 hours chart

Daily for the last 7 and 30 days charts

Monthly for the last 6 and 12 months charts

In MySQL, we solve these by grouping the records based on their date. However, in Redis, there's no
concept such as group by . It requires us to think in terms of data structures instead of queries. In MySQL,
the hard part is usually figuring out the best queries and indexes. In Redis, it's usually finding the best data
structure.

For now, let's focus on the last 24-hour query. If you remember, earlier I showed a helper function that
returns all the hours between two dates in a DateFilter object. It looks like this:

"( In the actual codebase $dateFilter is an object not array and


"( dates are Carbons not strings
$dateFilter = [
'startDate' "# '2024-04-01 15:00:00',
'endDate' "# '2024-04-01 15:00:00',
];

"( It contains all 24 hours between 'startDate' and 'endDate'


$allDates = [
'2024-04-01 15:00:00',
'2024-04-01 16:00:00',
'2024-04-01 17:00:00',
""$
'2024-04-02 15:00:00',
];

This was used to return all the dates between the periods even if there was no data.


We can use this array to our advantage. What if we store page views every hour with a count? Something
like this:

{
'2024-04-01 15:00:00': 92
}

This means that there were 92 page views in that hour. Expressed as a zset it looks like this:

{
    value: "2024-04-01 15:00:00", score: 92
    value: "2024-04-01 16:00:00", score: 73
    value: "2024-04-01 17:00:00", score: 42
    // ...
}

The key would be page_views:1:hourly where 1 is the site ID. It contains the hourly count of page views
for a given site for the last 24 hours. Why only 24? Because that's the only hourly chart we have. If there was
a last 48 hours chart then the zset should contain 48 items.

In a sorted set, members are unique. We can use that to store each hour only once. So each date should be
in the format of H:00:00 . Minutes and seconds are set to 0.

I'm only using human-readable dates for demonstration purposes. In the app, the zset contains
timestamps:

{
    value: "1712066400", score: 92
    // ...
}

This, to me, looks like a pretty good data structure to serve page view queries. It's pretty close to the actual
API response that looks like this:


{
"labels": [
"2024-04-01 15:00:00",
"2024-04-01 16:00:00"
],
"data": [
92,
73
]
}

As you can see, we need to sort the data based on the date, not the page views count. So why not store
timestamps as scores? Something like this:

{
value: 92, score: 1712066400
}

This would be the perfect sorted set to solve this problem. Querying would be that simple:

zrangebyscore page_views:1:hourly {end_date} {start_date}

zrangebyscore returns a range of members ordered by scores. {end_date} and {start_date} is the date filter where we have the start and the end timestamps. Since scores are timestamps (every score represents an hour), it filters out only the hours needed and it returns them in reversed order.

So if we have a sorted set such as this (timestamps represented as dates for better readability):

{
value: 62, score: 2024-04-05 10:00:00
value: 74, score: 2024-04-05 09:00:00
value: 92, score: 2024-04-05 08:00:00
}

And we want to query the last 24 hours where the first date is 09:00 the command looks like this:


zrangebyscore page_views:1:hourly 2024-04-06 09:00:00 2024-04-05 09:00:00

The results would be:

{
value: 74, score: 2024-04-05 09:00:00
value: 62, score: 2024-04-05 10:00:00
}

Filtered and sorted in ascending order, just as we need.

Unfortunately, we have a problem with this structure: values in a sorted set are unique. There's no way you
can guarantee the uniqueness of page view numbers.

That's the reason I use a zset where values are timestamps. We can guarantee the uniqueness of them.

{
value: 2024-04-05 10:00:00, score: 62
value: 2024-04-05 09:00:00, score: 74
value: 2024-04-05 08:00:00, score: 92
}


Querying page views

Knowing that limitation, unfortunately, we cannot use one simple Redis command that returns the final
results because:

zrange and zrevrange work with indexes. For example, running zrange page_views:1:hourly 0 1
will return the first and second members from the set

zrangebyscore and zrevrangebyscore work with scores as we have seen. But the scores are page
view counts and we need to sort the set by timestamps.

What we can do is query the whole set and sort it in PHP. It sounds far from optimal, but in a second I'll
explain why it's not that bad. Here's the plan:

Query the whole zset

Filter the required timestamps

Build an array that contains the results

Each step is quite straightforward.

Query the whole zset

This is done by running: zrange page_views:1:hourly 0 -1 WITHSCORES

0 is the start index -1 is the end index. -1 is a special number meaning "query till the last member." The
WITHSCORES option means Redis returns not just the members but the scores as well. We need both of
them to build an array with the results.

In PHP, the command looks like this:

public function getData(DateFilter $dateFilter, array $dateFormats): array
{
    $response = Redis::zrange(
        'page_views:' . $this->site->id . ':' . $dateFilter->resolution,
        0,
        -1,
        'WITHSCORES',
    );
}

When a user queries a chart, in the Controller we build a DateFilter object. That object contains a
resolution property which can be hourly , daily , or monthly based on the user's filter.
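The real DateFilter class lives in the sample codebase. Based on how it's used in this chapter, a minimal sketch of it could look something like this (the exact property and constructor details are assumptions):

use Carbon\Carbon;

class DateFilter
{
    public function __construct(
        public Carbon $startDate,
        public Carbon $endDate,
        public string $resolution, // 'hourly', 'daily', or 'monthly'
    ) {
    }
}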


Filter the required timestamps

This zrange command returns an array such as this one:

$response = [
    1711929600 => "101",
    1677628800 => "10812",
    1709251200 => "99955",
    1706745600 => "103655",
    1698796800 => "105999",
    1685577600 => "106532",
    1680307200 => "107142",
    1693526400 => "107398",
    1688169600 => "110179",
    1690848000 => "110190",
    1682899200 => "110486",
    1704067200 => "110531",
    1696118400 => "110675",
    1701388800 => "110810",
];

The next step is to filter this array based on the timestamps we actually need. The best way to do that is a
binary search that requires the array to be sorted:

$timestamps = array_keys($response);

sort($timestamps);

$startIndex = $this->dateService->findIndex($timestamps, $dateFilter->startDate->timestamp, 'start');

$endIndex = $this->dateService->findIndex($timestamps, $dateFilter->endDate->timestamp, 'end');

$timestampsInRange = array_slice($timestamps, $startIndex, $endIndex - $startIndex + 1);


First, the array is sorted. findIndex implements a basic binary search. You can check it in the DateService
class. The last step is to get the sub-array containing only the interesting rows.
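The actual findIndex implementation is in the DateService class. As a rough sketch (the 'start'/'end' handling below is my assumption about how the closest index is picked when there's no exact match), a binary search like this could do the job:

function findIndex(array $sortedTimestamps, int $needle, string $direction): int
{
    $low = 0;
    $high = count($sortedTimestamps) - 1;

    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);

        if ($sortedTimestamps[$mid] === $needle) {
            return $mid;
        }

        if ($sortedTimestamps[$mid] < $needle) {
            $low = $mid + 1;
        } else {
            $high = $mid - 1;
        }
    }

    // No exact match: 'start' returns the first index inside the range,
    // 'end' returns the last one.
    return $direction === 'start' ? $low : $high;
}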

Build an array that contains the results

That's the easiest part. We have the timestamps in the required range and the whole zset with the page
view counts. We need a third array that combines them:

foreach ($timestampsInRange as $timestamp) {
    $results[
        Carbon::createFromTimestamp($timestamp)->format($dateFormats['format'])
    ] = (int) $response[$timestamp];
}

return $results;

$dateFormats['format'] contains a string such as H:00 as described in an earlier chapter.

After this loop, $results looks like this:

$results = [
    '09:00' => 92,
    '10:00' => 63,
    // ...
];

The class has another function that transforms this array into an API response. We'll take a look at that
later. For now, it's not important. Just try to focus on how we use Redis to get the required data.

The request took 65ms to complete with 10,000 page views in the last 24 hours. The actual zrange call took
only 5.4ms:


Using MySQL to run the same request took 76ms. The query itself took 28ms:

It's not much of an improvement, is it? The zrange call is much faster than the SQL query, but the PHP code
takes a little time to run. Before we analyze the situation in more depth, let's take care of daily and monthly
page views.


Daily and monthly page views

The page_views:1:hourly zset has only one purpose: storing the page view counts in the last 24 hours.

Similarly, we will use two other sorted sets per site:

page_views:1:daily to store daily page view counts in the last 30 days.

page_views:1:monthly to store monthly page view counts in the last 12 months.

Why exactly 30 days and 12 months? Because that's what the application needs based on the time filters available
for users. If you don't remember they are:

Last 24 hours

Last 7 days

Last 30 days

Last 6 months

Last 12 months

With these 3 sorted sets per site, we can satisfy all the requirements. But what if we need custom date
filters? Where users can select their own interval? We cover every scenario with these sets (except a filter
that is greater than a year), for example:

2024, 1st of January - 2024, 1st of April can be served using the page_views:1:monthly set

2024, 1st of January - 2024, 15th of January can be served using the page_views:1:daily set

etc

The only thing that cannot be done is a filter for more than a year. Which is very easy to solve. You're going
to see it when we talk about how to create the sorted sets.

page_views:1:daily has members such as these:

{
value: "2024-04-05 00:00:00", score: 1146
value: "2024-04-04 00:00:00", score: 924
}

And page_views:1:monthly has members such as these:

{
value: "2024-04-01 00:00:00", score: 15671
value: "2024-03-01 00:00:00", score: 98994
}

The great thing about it is that the command won't change:


$response = Redis::zrange(
    'page_views:' . $this->site->id . ':' . $dateFilter->resolution,
    0,
    -1,
    'WITHSCORES',
);

The only thing that changes is the value of $dateFilter->resolution which can be:

hourly

daily

monthly

just as the keys in Redis.

Now let's see the results:

Querying the last 24 hours takes 49ms and the last 12 months takes 52ms. I listed a table at the end of the
previous chapter with the results of MySQL queries. Let's compare the two:

Query               MySQL     Redis    Improvement

Page Views
- Last 12 months    714ms     52ms     93%
- Last 6 months     361ms     48ms     87%
- Last 30 days      232ms     51ms     78%
- Last 7 days       228ms     47ms     79%
- Last 24 hours     82ms      49ms     40%

40% for the last 24 hours but 93% for the last 12 months. Why is that?


Scaling down the problem

One of the biggest challenges in MySQL is the size of the page_views table. As the table grows queries will
be slower. Querying page views in the last 24 hours took 82ms but in the last 12 months, it took 714ms.
When using a 1-year period MySQL needs to aggregate potentially millions of rows. Even though we only
need to show 12 items on the UI. The execution time grows with the number of records.

Now think about how many items each of the sorted sets store.

Hourly: 24 items for the last 24 hours

Daily: 30 items for the last 30 days

Monthly: 12 items for the last 12 months

The page view counts don't represent rows in a huge database anymore. They are just numbers. Each set
stores a pretty small number of items and a count for each one.

This way we achieve two important things:

Have very small and space-efficient structures in memory

Have pretty fast queries

It takes only a few hundred bytes to store page view data in these sets:

The daily set is the biggest one with 384 bytes.

To sum it up we are able to make aggregated reports about millions of page views by storing 67 numbers
(24 hourly, 31 daily, 12 monthly) that take ~1kB of memory, and an average request takes about 50ms
where the actual Redis command takes only ~5ms.

The execution time doesn't grow with the size of the data anymore. We were able to scale down the problem from millions of rows to a few dozen data points, and from hundreds or even thousands of milliseconds to a few dozen.

As I said earlier, in Redis, you spend more time thinking about how to efficiently store your data. If you are smart about this step it can do magic in your app. On the other side, in MySQL, you tend to spend more time thinking about how to query your data.


Inserting page views

Inserting page views into these sorted sets is relatively easy. We only need to determine three things:

The timestamp of the hour

The timestamp of the day

And the timestamp of the month

With Carbon it's straightforward:

$timestampHour = $pageView->visited_at->startOfHour()->timestamp;

$timestampDay = $pageView->visited_at->startOfDay()->timestamp;

$timestampMonth = $pageView->visited_at->startOfMonth()->timestamp;

startOfDay returns a date such as 2024-05-05 00:00:00 and so on.

To increment the score of a member in a zset we can use the zincrby command that looks like this:

zincrby page_views:1:daily 1 1711929600

It will increase the score of the member 1711929600 . If the member doesn't exist yet it adds it with the
given initial score.

In PHP the three commands look like this:

Redis"&zincrby('page_views:' . $pageView"%site_id . ':hourly', 1,


$timestampHour);

Redis"&zincrby('page_views:' . $pageView"%site_id . ':daily', 1,


$timestampDay);

Redis"&zincrby('page_views:' . $pageView"%site_id . ':monthly', 1,


$timestampMonth);

The resulting zset is this:


Each timestamp represents a day and each score means the page view count for that day. This image
comes from RedisInsight.

We need to execute these commands in the Page View service. This is the one that accepts HTTP requests
from the customers' websites and dispatches a job that creates the actual page_views record in the
database. You can check out the code in App\Jobs\SavePageViewJob .
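The real job also inserts the page_views row into MySQL (and, as we saw earlier, the model calculates the denormalized columns and the hash). As a simplified sketch of the Redis part only, such a job could look like this:

use App\Models\PageView;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Redis;

class SavePageViewJob implements ShouldQueue
{
    use Dispatchable;
    use Queueable;
    use SerializesModels;

    public function __construct(private PageView $pageView)
    {
    }

    public function handle(): void
    {
        $pageView = $this->pageView;

        // One timestamp per resolution: hour, day, month
        $timestampHour = $pageView->visited_at->startOfHour()->timestamp;
        $timestampDay = $pageView->visited_at->startOfDay()->timestamp;
        $timestampMonth = $pageView->visited_at->startOfMonth()->timestamp;

        // Increment the counter of the matching member in each sorted set
        Redis::zincrby('page_views:' . $pageView->site_id . ':hourly', 1, $timestampHour);
        Redis::zincrby('page_views:' . $pageView->site_id . ':daily', 1, $timestampDay);
        Redis::zincrby('page_views:' . $pageView->site_id . ':monthly', 1, $timestampMonth);
    }
}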

Later, I'm going to show you how to clean the data because otherwise, each hourly, daily, monthly set would
just grow infinitely.

Time complexity

We already discussed that the number of members in the sets and the overall size of them are super small.
This is one of the most important considerations. The other one is the time complexity. If you check out the
Redis command documentation you can see there's a time complexity for each and every command.

For example, this is what zrange 's time complexity says: O(log(N)+M) with N being the number of
elements in the sorted set and M the number of elements returned.

If you're not familiar with the Big O notation, here's a quick summary. Consider this function:


public function loop(array $items): void
{
    foreach ($items as $item) {
        echo $item;
    }
}

How many iterations does it run? Exactly N number of iterations where N is the size of the array. This is
expressed as O(n) which can be read as "this function has a time complexity of an order of N" which
translates to the following human sentence: "You perform as many iterations as the size of your array." This
is called a linear time algorithm.

The next one is constant time:

public function get(array $items, int $index): string
{
    return $items[$index];
}

This is expressed as O(1) since no matter how many elements the array has it always performs only one
operation.

The last example is quadratic time:

public function slow(array $items): void
{
    for ($i = 0; $i < count($items); $i++) {
        for ($j = 0; $j < count($items); $j++) {
            echo 'hello';
        }
    }
}

This is expressed as O(n^2) . If you have an array of 3 items it'll call echo 9 times.

So let's determine the time complexity of this:

$results = [];

$response = Redis::zrange(
    'page_views:' . $this->site->id . ':' . $dateFilter->resolution,
    0,
    -1,
    'WITHSCORES',
);

$timestamps = array_keys($response);

sort($timestamps);

$startIndex = $this->dateService->findIndex($timestamps, $dateFilter->startDate->timestamp, 'start');

$endIndex = $this->dateService->findIndex($timestamps, $dateFilter->endDate->timestamp, 'end');

$timestampsInRange = array_slice($timestamps, $startIndex, $endIndex - $startIndex + 1);

foreach ($timestampsInRange as $timestamp) {
    $results[
        Carbon::createFromTimestamp($timestamp)->format($dateFormats['format'])
    ] = (int) $response[$timestamp];
}

return $results;

We already know that zrange has a time complexity of O(log n) . As far as I know array_keys has O(n)
since it loops through the items. The PHP manual says that sort runs a quicksort which has an average
complexity of O(n log n) .

In the find* function I use a binary search. It has a complexity of O(log n) . By the way, this is how you
can interpret an O(log n) algorithm. They often halve the input size in each iteration. For example, the
binary search needs a sorted array because it takes the middle element and checks if it's greater or less
than the needle. If the middle number is less, the binary search algorithm uses the second half of the array
to run the next iteration. The second half is from the middle number to the last one. And so on and so on
until it finds the element we're looking for.

So if you have the numbers from 1 to 100 in a sorted array and you're looking for 99 this is what the
iterations look like:


It takes 7 iterations to find 99. If you calculate log2(100) it is about 6.64, so binary search has a logarithmic time complexity.

As far as I know, array_slice has a time complexity of O(k) where k is the length of the new array it
creates. We can round it up to O(n) . And finally, the foreach is O(n) of course.

So here's the whole method:

$results = [];

// O(log n) = 5
$response = Redis::zrange(
    'page_views:' . $this->site->id . ':' . $dateFilter->resolution,
    0,
    -1,
    'WITHSCORES',
);

// O(n) = 31
$timestamps = array_keys($response);

// O(n * log n) = 155
sort($timestamps);

// O(log n) = 5
$startIndex = $this->dateService->findIndex($timestamps, $dateFilter->startDate->timestamp, 'start');

// O(log n) = 5
$endIndex = $this->dateService->findIndex($timestamps, $dateFilter->endDate->timestamp, 'end');

// O(n) = 31
$timestampsInRange = array_slice($timestamps, $startIndex, $endIndex - $startIndex + 1);

// O(n) = 31
foreach ($timestampsInRange as $timestamp) {
    $results[
        Carbon::createFromTimestamp($timestamp)->format($dateFormats['format'])
    ] = (int) $response[$timestamp];
}

return $results;

If we work with the page_views:1:daily set (which is the longest) it has 31 items. So the number of iterations is about 263.


Of course, you don't need to determine these numbers on a daily basis. I don't count them when I'm working with Redis. But it's a good thing to see them at least once. And of course, if you need to write some extra code (outside of some regular Redis commands) to get your data it's a good thing to at least think about time complexity. For example, if you have an array with 1,000 items and you do a linear search instead of a binary one, the time complexity is O(n) or 1,000 instead of O(log n) or about 10. That is a 100x difference. This is still going to be fast, but we're talking about requests of 45ms and 50ms so the difference will be visible.


Unique visitors
I planted a "bug" in the unique visitors query in MySQL. The select statement looked like this:

select count(distinct(ip))

It counts unique IP addresses as unique visitors. So it means if you visited a page on the 3rd of March and
on the 2nd of April it counts as 1 unique visitor in a 30-day, 6-month, or 12-month query. Real analytics apps
don't count unique visitors like this. They operate with cookies, sessions, etc and they are pretty smart. For
example:

If you visit a site 5 times on the 9th of April between 13:00 and 14:00 it counts as one unique visit

But if you visit the same site on the 9th of April and the 19th of April it probably counts as 2 unique
visits

Each platform does it differently but probably none of them uses a naive calculation such as
count(distinct(ip)) . Of course, we won't deal with sessions, cookies, etc but we can improve it a little bit.
We can use the following rule:

If you visit a site multiple times a day it counts as one unique visit

If you visit a site multiple times through multiple days each counts as a unique visit

Or put it in other words: we will count unique visitors on a daily basis. It'll make things more complicated
which is a good thing because we can explore more about Redis.

By clarifying this new rule, we know two things for sure:

We need to "recount" IP addresses on a daily basis

The smallest "resolution" is hourly


Querying unique visitors

It means that a sorted set is probably one of the best solutions to store the data of the hourly chart. We can
use a similar data structure as before. Something like this:

{
value: 1.2.3.4, score: "2024-04-08 23:03:00",
value: 1.2.3.5, score: "2024-04-08 23:13:00",
}

This structure leverages both of the important properties of a zset :

Values are unique. Each IP address can be present only once in the set.

The set is sorted based on scores. Scores are going to be timestamps. If they are sorted we can save
some processing power because we don't need to use asort or findIndex as earlier. It's all sorted
and ready to go.

The command that queries members from the 1st of April 2024 to the 8th of April looks like this:

zrangebyscore unique_visitors:1:hourly 1711929600 1712534400 WITHSCORES

As a quick reminder:

z stands for sorted set

range means the command returns a range of members

byscore means it will search in scores not in indexes

unique_visitors:1:hourly is the name of the set where 1 is the site ID

1711929600 is the minimum score (1st of April 2024)

1712534400 is the maximum score (8th of April 2024)

WITHSCORES will also return the scores not just values

The command in redis-cli returns this structure:


With a PHP client it's going to be transformed into an associative array:

[
    '1.87.194.208' => '1712512800',
    '10.110.180.193' => '1712512800',
    '1.2.3.4' => '1712516400',
    // ...
]

At first, it seems strange that the indexes are IP addresses (the values) but they must be unique in the set so
they cannot collide, unlike timestamps.

After this data is ready we only need to count the number of IP addresses associated with a given
timestamp. In this example the result should be:

[
    '1712512800' => 2,
    '1712516400' => 1,
    // ...
]

Now timestamps can be indexes since the result is grouped so they won't collide. The actual function is
quite straightforward:


public function getData(DateFilter $dateFilter, array $dateFormats): array
{
    $results = [];

    if ($dateFilter->resolution === 'hourly') {
        $data = Redis::zrangebyscore(
            'unique_visitors:' . $this->site->id . ':hourly',
            $dateFormats['allDates'][0]->timestamp,
            $dateFormats['allDates'][count($dateFormats['allDates']) - 1]->timestamp,
            'WITHSCORES',
        );

        foreach ($data as $timestamp) {
            $formatted = Carbon::createFromTimestamp($timestamp)
                ->format($dateFormats['format']);

            if (!isset($results[$formatted])) {
                $results[$formatted] = 0;
            }

            $results[$formatted]++;
        }
    }

    return $results;
}

Of course, users want to see human-readable dates not timestamps so the function formats them.


Daily and monthly unique visitors

For counting page views each site only needs:

An hourly set with 24 members

A daily set with 31 members

And a monthly set with 12 members

As we discussed, it's small and efficient because we only need counts for an hour/day/month. However,
with unique visitors, the picture is quite different. We need to store each and every unique IP address for
each hour/day/month.

In my example database, there are 21,125 unique visitors in the last 24 hours for site 1:

It takes 2MB of memory to store these 21k visitors. The more memory a key takes the slower it gets. Now
imagine a site with 10,000,000 monthly unique visitors. The zset would take up about 950MB of memory.
Now imagine there are 10,000 sites using your analytics platform. It would cost lots of money and servers to
just store the data.

What if I told you you can store 10,000,000 unique items using only 12kB of RAM?

What if 100,000,000 items can be stored using only 12kB of memory?

In Redis we can store 18,446,744,073,709,551,616 unique items using 12kB of memory and counting them
has a time complexity of O(1).

If you don't believe me, hold my HyperLogLog.


HyperLogLog

When I first saw the word "HyperLogLog" written down I was panicking. I was thinking:

"Okay, so it's a log. But it's not an ordinary log, it's hyper. And it's not just a hyper log. It's a hyper double
log. A hyper log log? What?"

Then I read the first sentence of the documentation:

HyperLogLog is a probabilistic data structure that estimates the cardinality of a set.

Well, it didn't help. So here's my simplified explanation:

It's a set that always takes 12kB of memory no matter how many elements it has.

Since it always takes the same amount of memory it means it doesn't actually store the items. Which
means you cannot access the individual elements. You can only count them. There are no commands
such as zrange , etc.

Counting the elements has a time complexity of O(1) which is super fast.

It has a 0.81% error rate. Meaning, it's not perfect. However, the error rate is so low that it's negligible in
most cases.

You can read about how HyperLogLogs work under the hood here and here.

The prefix for HLL commands is PF . The two most important commands are:

PFADD to add an item to the log

PFCOUNT to count the items in the log

For example:

PFADD my-log 1.2.3.4
PFADD my-log 1.2.3.4
PFADD my-log 1.2.3.5

PFCOUNT my-log
2
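
The same thing from PHP, as a quick sketch (assuming the Redis facade with predis, as in the rest of the chapter):

Redis::pfadd('my-log', '1.2.3.4');
Redis::pfadd('my-log', '1.2.3.4'); // duplicate, the estimate doesn't change
Redis::pfadd('my-log', '1.2.3.5');

Redis::pfcount('my-log'); // 2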


Daily unique visitors

The idea is that we have a HLL for each day. In the following chapter I'll show you how to insert the data but
if we assume there's a HLL for every day, querying a daily chart is incredibly easy:

public function getData(DateFilter $dateFilter, array $dateFormats): array
{
    $results = [];

    if ($dateFilter->resolution === 'daily') {
        /** @var Carbon $date */
        foreach ($dateFormats['allDates'] as $date) {
            $results[$date->format($dateFormats['format'])] = Redis::pfcount(
                'unique_visitors:' . $this->site->id . ':' . $dateFilter->resolution
                    . ':' . $date->timestamp
            );
        }
    }

    if ($dateFilter->resolution === 'hourly') {
        // ...
    }

    return $results;
}

As I said earlier, $dateFormats['allDates'] contains every date for the given resolution. In this case, 30 or
31 days as Carbon objects. The function iterates through them and counts an HLL using pfcount . That's it.
We have just calculated a 7-day chart running 7 pfcount commands or a 30-day chart running 30 pfcount
commands. Each of them counts an HLL that is 12kB and the operation takes constant time regardless of
the number of elements.

Just as with page views there are going to be 31 daily HLLs.


Monthly unique visitors

Counting monthly unique visitors is a little bit different. Our rule is that IP addresses should be counted as
unique on a daily basis. If that's the case, we cannot just store a month's worth of visitors in one big
HLL. It would mean that you count as one unique visitor even if you visited the site on the 1st and on the 7th, for
example. That's not the intended behavior.

Instead of one HLL for the whole month, we need to count the individual days and sum them up.
Fortunately, it's not that complicated:

public function getData(DateFilter $dateFilter, array $dateFormats): array
{
    $results = [];

    if ($dateFilter->resolution === 'monthly') {
        $days = $this->dateService->getDaysBetween($dateFilter);

        foreach ($days as $day) {
            $count = Redis::pfcount(
                'unique_visitors:' . $this->site->id . ':daily:' . $day->timestamp
            );

            $key = $day->format($dateFormats['format']);

            if (!isset($results[$key])) {
                $results[$key] = 0;
            }

            $results[$key] += $count;
        }
    }

    if ($dateFilter->resolution === 'hourly') {
        // ...
    }

    return $results;
}


We need all the days between the start and end dates, then count the individual days and sum them up.
In the worst-case scenario, we run 365 O(1) operations. Since we serve the 12-month chart from daily
HLLs, we need to keep 365 of them at any given time.

Hourly unique visitors

Why is the hourly unique visitors query different? It uses a zset while the other two use HLLs. Well, as we
discussed earlier, IP addresses should be counted as unique on a daily basis. With a zset we can satisfy
this requirement. However, with HLLs, we cannot. There are two different solutions with HLLs:

24 HLLs for every hour. In this case, the problem is that if you visited a site at 13:01 and then again at
15:09 on the same day your IP would be counted as two unique visitors. According to the rule, it should
be counted as one.

1 HLL for the last 24 hours. In this case, we wouldn't be able to show an hourly graph.

This is why we went with a zset that has IP addresses as values and timestamps as scores.

I know it sounds overwhelming at first. Sorted sets, HyperLogLogs, hourly, daily, monthly, etc. Especially if
you're not very experienced with Redis. I advise you to seed a smaller database (using the DatabaseSeeder
class), for example, 1 site with 100 page views. Then run the BuildRedisDataJob that creates these data
structures in Redis. Install RedisInsight and look at the actual data. It's going to be much easier if you can
see the data for yourself.

Scaling down the problem

We did the same as with page views. Instead of scanning, grouping, and filtering millions of rows, we have:

A zset for the last 24 hours with a few thousand (maybe tens of thousands) unique IP addresses

365 HyperLogLogs for the last 365 days

Each HLL is super small and super fast. The sorted set can be bigger if the site has pretty huge traffic. With
21,000 unique IP addresses for the last 24 hours (which is quite a lot) taking up 3MB of memory, the whole
request takes 174ms. I think you need 5-10 million unique visitors per day before you experience any kind
of performance problems with this solution.

Here are the results:

Now let's compare the performance:


| Query | MySQL | Redis | Improvement |
|---|---|---|---|
| Unique visitors | | | |
| - Last 12 months | 1380ms | 452ms | 67% |
| - Last 6 months | 703ms | 313ms | 55% |
| - Last 30 days | 394ms | 59ms | 85% |
| - Last 7 days | 390ms | 46ms | 88% |
| - Last 24 hours | 95ms | 186ms | -48% |

Since the last 24 hours query uses a big zset it got slower. Every other query improved dramatically. And
once again, the great thing about this is that the execution time won't degrade as the number of visitors
grows.

Inserting unique visitors

Inserting unique visitors is just as easy as inserting page views:

Redis"&zadd('unique_visitors:' . $pageView"%site_id . ':hourly',


$timestampHour, $pageView"%ip);

Redis"&pfadd('unique_visitors:' . $pageView"%site_id . ':daily:' .


$timestampDay, $pageView"%ip);

Redis"&pfadd('unique_visitors:' . $pageView"%site_id . ':monthly:' .


$timestampMonth, $pageView"%ip);

The only difference is that the hourly data is stored in a zset while daily and monthly data is in multiple
HLLs.

The resulting zset and the overall structure look like this in RedisInsight:


HyperLogLogs are shown as string keys and you cannot see their values; you can only run PFCOUNT
and other commands against them.

We need to execute these commands in the Page View service. This is the one that accepts HTTP requests
from the customers' websites and dispatches a job that creates the actual page_views record in the
database. You can check out the code in App\Jobs\SavePageViewJob .
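
For reference, the $timestampHour , $timestampDay , and $timestampMonth values used above can be derived from the page view's visited_at attribute. A minimal sketch ( copy() is there so the three calls don't mutate the same Carbon instance):

$timestampHour = $pageView->visited_at->copy()->startOfHour()->timestamp;
$timestampDay = $pageView->visited_at->copy()->startOfDay()->timestamp;
$timestampMonth = $pageView->visited_at->copy()->startOfMonth()->timestamp;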

Time complexity

Now let's take a look at the code from a time complexity point-of-view. Calculating daily data is the simplest
and fastest:

if ($dateFilter"%resolution ""2 'daily') {


/** @var Carbon $date ")
foreach ($dateFormats['allDates'] as $date) {
$results[$date"%format($dateFormats['format'])] = Redis"&pfcount(
'unique_visitors:' . $this"%site"%id . ':' . $dateFilter"%resolution .
':' . $date"%timestamp
);
}
}

There are 365 daily HLLs but for the last 30 days chart we only need to scan 30 of them. So in the worst-case
scenario, we are running 30 O(1) operations ( pfcount ). It's extremely fast.


Calculating hourly data is a bit more complex and slower:

if ($dateFilter"%resolution ""2 'hourly') {


$data = Redis"&zrangebyscore(
'unique_visitors:' . $this"%site"%id . ':hourly',
$dateFormats['allDates'][0]"%timestamp,
$dateFormats['allDates'][count($dateFormats['allDates']) - 1]"%timestamp,
'WITHSCORES',
);

foreach ($data as $timestamp) {


$formatted = Carbon"&createFromTimestamp($timestamp)
"%format($dateFormats['format']);

if (!isset($results[$formatted])) {
$results[$formatted] = 0;
}

$results[$formatted]"1;
}
}

zrangebyscore is an O(log(N) + M) operation where N is the number of elements in the set and M is the
number of elements being returned. In this case, both numbers can be relatively big. Having a sorted set with
21k elements, this command takes 112ms for me. It seems fast, but in Redis, it's actually pretty slow. On top
of that, this command might return thousands or tens of thousands of items. That also takes some extra
memory because they have to be stored in the $data variable.

It consumed 10MB of RAM:


Compare that to the last 30 days request that consumed only 4MB of memory:

On top of that, after we query the data from Redis we run a loop that invokes a Carbon function for every
timestamp. It's not a heavy operation but it runs tens of thousands of times. So this is why this request
takes ~180ms compared to 59ms.


Memoization

One small thing we can do is memoization. Take a look at the daily sorted set:

These IP addresses all have the same score. The same timestamp. There can be hundreds or thousands of
IP addresses with the same timestamp. It means that this loop:

$data = Redis"&zrangebyscore(""$);

foreach ($data as $timestamp) {


$formatted = Carbon"&createFromTimestamp($timestamp)
"%format($dateFormats['format']);
}

calculates the formatted date for the same timestamp hundreds or thousands of times. We are wasting
valuable milliseconds doing the same thing over and over again.

Memoization is a simple technique where we cache results that the function has already calculated and re-
use them when it makes sense. In Laravel 11 there's a once function that does exactly that. As I'm writing
these lines Laravel 10 is the latest available version, so let's create a specialized function that works only
with this use case:


private function format(string $timestamp, string $format): string
{
    if (isset(self::$dateCache[$timestamp])) {
        return self::$dateCache[$timestamp];
    }

    $value = Carbon::createFromTimestamp($timestamp)
        ->format($format);

    self::$dateCache[$timestamp] = $value;

    return $value;
}

$dateCache is a static property of the class and is a simple array. The format function checks if the given
timestamp has already been set in the array and if that's the case it doesn't invoke the Carbon function but
returns it from the cache.

If the given timestamp has never been seen before it calculates the date using Carbon and adds the result
to $dateCache .
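
For completeness, the property itself can be declared on the class like this:

private static array $dateCache = [];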

So instead of invoking Carbon::createFromTimestamp in the loop, we can use the format function:

$data = Redis"&zrangebyscore(""$);

foreach ($data as $timestamp) {


$formatted = $this"%format($timestamp, $dateFormats['format']);
}

Here's a request that doesn't use memoization:


118ms overall. The Redis command took 32ms so this request spent 86ms in PHP.

Here's the request using memoization:


83ms overall. The Redis command took 30ms so this request spent only 53ms in PHP land. That's a 38%
improvement just by applying a simple trick.

Earlier I showed you that this request took 186ms. In that case, it had to process 21k unique visitors. Now, in
this test, I only have ~11k so this is why it was a bit faster by default. But it doesn't change the fact that
memoization is a smart move here.


Most visited pages


When querying the most visited pages the app returns a JSON such as this:

{
"/home": 18,
"/about": 16,
"/blog": 15
}

If you think about it, a sorted set is the perfect data structure to store data such as this. Since members have to
be unique, each page is stored only once, and the scores can hold the counts:

{
value: "/home", score: 18,
value: "/about", score: 16
}

The set is sorted which is quite helpful for this use case.

So the data structure should be a zset . The only question is: how do we handle hourly, daily, and monthly data?

The "best" way should be if each chart had its own zset, for example,
most_visited_pages:1:last_12_months that contains the most visited pages for site 1 in the last 12
months at any given time. All the data is there and we need to run only one command to get the top 5
elements. This approach, however, will introduce other problems because you need to maintain this zset
all the time. For example, the last 12 months mean from the 9th of April, 2023 to the 9th of April, 2024 when
I'm writing these lines. But tomorrow it changes to 10th of April. So you need a background process that
recalculates the data each and every day. When it comes to hourly charts you need to maintain the data
every hour. You need to do it for hundreds of sites. Each query will take seconds to run. Even though your
Redis command will be super fast, the background job puts a considerable amount of work on your
database and workers. If one of them fails, your users will get false data immediately. So this approach can
work, of course, but it introduces problems that are hard to see and debug. We'll go with a simpler solution.

We can store hourly, daily, and monthly page visits in separate zsets. For example:

most_visited_pages:1:daily:1709769600 :


{
value: "/home", score: 12,
value: "/blog", score: 11,
}

This sorted set stores URIs and counts for a given site on a given day. We can do the same with hourly and
monthly data as well.

Querying most visited pages

There's a Redis command called zunion :

zunion 2 most_visited_pages:1:daily:1709769600 most_visited_pages:1:daily:1709856000

This command calculates the union of the given zsets. The first argument ( 2 in this case) is the number of
zsets to be merged, and the following arguments are the names of the zsets.

The Redis documentation says:

By default, the resulting score of an element is the sum of its scores in the sorted sets where it exists.

So if both zsets look like this:

most_visited_pages:1:daily:1709769600
{
value: "/home", score: 12
}

most_visited_pages:1:daily:1709856000
{
value: "/home", score: 27
}

The result will be:

{
value: "/home", score: 39
}


If we store a zset for every day, we can calculate the most visited pages for the last 7 days by calculating the
union of 7 zsets. We can do the same with hourly, and monthly data as well.

However, the result of this command can be big. If the site has thousands of pages (such as a larger
webshop) it will return an array of thousands or tens of thousands of items which consumes lots of
unnecessary memory.

Fortunately, there's another command called zunionstore :

zunionstore my_results 2 most_visited_pages:1:daily:1709769600 most_visited_pages:1:daily:1709856000

Instead of returning the results, it stores them as a new zset ( my_results in this example). If we can
store the results, then we can query the top N members and delete the temporary set later. It's more efficient
than loading and processing thousands of items in PHP.

This is how we can query the most visited pages for any given time period:

public function getData(DateFilter $dateFilter, array $dateFormats): array
{
    $keys = collect($dateFormats['allDates'])
        ->map(fn (Carbon $date) =>
            'most_visited_pages:' . $this->site->id . ':' . $dateFilter->resolution
                . ':' . $date->timestamp
        )
        ->toArray();

    $tempKey = 'temp:' . Str::uuid();

    Redis::zunionstore($tempKey, $keys);

    $data = Redis::zrevrange($tempKey, 0, 4, 'WITHSCORES');

    Redis::del($tempKey);

    return $data;
}

As I said earlier, $dateFormats['allDates'] contains all the dates in the given time period. If the user
requests the last 24-hour chart it contains the last 24 hours as Carbon objects. The map creates a Redis key
from each of these dates. Keys like this:


most_visited_pages:1:daily:1709769600

Then the function generates a random key that is used to store the result of zunionstore , which happens
on the next line. When using predis you don't need to pass the number of keys; it takes a destination key
and an array of source keys.

Now that the union is calculated and stored at the temp key, we only need to run this command:

zrevrange temp:asdf-1234 0 4 WITHSCORES

A zset is sorted in ascending order. So if you run zrange you'll get the least visited pages first. This is why
we need to run zrevrange . It's the equivalent of ORDER BY desc . 0 is the start index (the member with the
highest score) and 4 is the end index. The WITHSCORES option returns not just the values (URIs) but the
scores (counts) as well.

The command returns an array such as this:

[
    "/home" => 127,
    "/blog" => 113,
]

Finally, the temporary zset can be deleted.

Scaling down the problem

At this point, you already know the beauty of building data structures in Redis. We can scale down the
problem from millions of records to hundreds or thousands in the worst case. Querying these smaller sets,
lists, and HLLs is orders of magnitude faster than querying millions of rows in MySQL.

We did the same thing with the most visited pages as well. All we need is:

24 hourly zsets for the last 24 hours

31 daily zsets for the last 31 days

12 monthly zsets for the last 12 months

How many records will these sets have? It's hard to say because it all depends on how many unique pages
an average customer has inside a site.

For these tests, I went with the worst-case scenario. The site I'm using has 687,462 unique pages. It
represents a quite big webshop or news site. Just for comparison, an average blog has something like a few
hundred pages. An average SaaS landing page with documentation included might have a few hundred
unique pages or a few dozen.

So the following request times represent one of the worst-case scenarios:


As you can see, when there are 687k unique URIs for a site it takes about 1400ms to store the union and
then return the top 5 elements.

How does it perform compared to MySQL?

| Query | MySQL | Redis | Improvement |
|---|---|---|---|
| Most visited pages | | | |
| - Last 12 months | 2952ms | 1402ms | 53% |
| - Last 6 months | 1213ms | 759ms | 37% |
| - Last 30 days | 449ms | 471ms | -5% |
| - Last 7 days | 444ms | 119ms | 73% |
| - Last 24 hours | 77ms | 23ms | 70% |

On average, we were able to improve the performance by a very healthy 45%.

Inserting most visited page views

Building up the sorted sets couldn't be easier:

Redis"&zincrby('most_visited_pages:' . $pageView"%site_id . ':hourly:' .


$timestampHour, 1, $pageView"%uri);

Redis"&zincrby('most_visited_pages:' . $pageView"%site_id . ':daily:' .


$timestampDay, 1, $pageView"%uri);

Redis"&zincrby('most_visited_pages:' . $pageView"%site_id . ':monthly:' .


$timestampMonth, 1, $pageView"%uri);

When a new page view happens we need to increase the score of the given URI.


We need to execute these commands in the Page View service. This is the one that accepts HTTP requests
from the customers' websites and dispatches a job that creates the actual page_views record in the
database. You can check out the code in App\Jobs\SavePageViewJob .

Time complexity

Now let's take a look at the code from a time complexity point-of-view. This is the whole function:

public function getDataFromRedis(DateFilter $dateFilter, array $dateFormats): array
{
    $keys = collect($dateFormats['allDates'])
        ->map(fn (Carbon $date) =>
            'most_visited_pages:' . $this->site->id . ':' . $dateFilter->resolution
                . ':' . $date->timestamp
        )
        ->toArray();

    $tempKey = 'temp:' . Str::uuid();

    Redis::zunionstore($tempKey, $keys);

    $data = Redis::zrevrange($tempKey, 0, 4, 'WITHSCORES');

    Redis::del($tempKey);

    return $data;
}

Generating the Redis keys is negligible. It's an O(n) operation where n is the number of dates. It's 31 in
the worst-case scenario and the loop doesn't execute any performance-heavy operation.

zunionstore on the other hand is an O(N) + O(M log(M)) operation where N is the sum of the sizes of the input
sets and M is the number of elements in the resulting set. For example, if we have 2 sets with sizes of 1,000
and 3,000 and the resulting set has 3,000 elements, zunionstore runs 4000 + (3000 * log(3000))
operations which translates to 4000 + (3000 * 11), roughly 37,000 operations.

del has a time complexity of O(1) . But only if your key is a string. In the case of a sorted set it's an O(M)
operation where M is the number of elements in the set. So, unfortunately, it can be quite "slow." This is because
del needs to reclaim the memory used by the set.

There are three ways to solve this:


Use unlink instead of del . It's an async command. Meaning, it won't block the server and it won't
reclaim all the memory in one request. When you call it, it just unlinks the key from the keyspace. So
from the client's perspective, the key won't be available anymore. And then in a different thread
unlink actually cleans up and reclaims the memory used by the key. It's a great solution, however,
predis doesn't support it (as far as I know). phpredis on the other hand does support unlink .

Don't delete the key. Leave it as it is and run a background job every hour that deletes temp keys. You
can use the command keys temp* to get all deletable keys and then run a del command.

Use the temp key as a "cache" for a few minutes. When a user requests the last 12 months' chart and
you calculate this expensive sorted set, you leave it in Redis for the next 5 minutes (for example), and if
the user requests the same chart you can skip the calculation and respond with only a zrevrange
command. In Redis, every key has a TTL (time-to-live) value. If the TTL expires the key will be deleted
automatically. You can set the TTL by running the expire command:

expire temp:asdf-1234 300

This command sets a TTL value of 5 minutes (300 seconds) for the temp key.
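
Here's a minimal sketch of that third option. The deterministic key name (instead of a random UUID) is an assumption; you'd include whatever identifies the site and the requested period:

$tempKey = 'temp:most_visited_pages:' . $this->site->id . ':' . $dateFilter->resolution;

if (!Redis::exists($tempKey)) {
    Redis::zunionstore($tempKey, $keys);

    // Let Redis delete the key automatically after 5 minutes
    Redis::expire($tempKey, 300);
}

return Redis::zrevrange($tempKey, 0, 4, 'WITHSCORES');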


Cleaning up data
The last piece of the puzzle is to clean up old data. For example, we assumed that page_views:1:hourly
should contain only 24 items for the last 24 hours. But right now whenever a page view happens we just
add it to the zset so it will end up with 24+ items and it'll grow over time.

We need to run a background job that cleans up these sorted sets and HyperLogLogs. I won't show you the
whole class because it's very repetitive, but here's how hourly page views are deleted:

namespace App\Jobs;

class CleanUpRedisDataJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function handle(): void
    {
        foreach (Site::all() as $site) {
            $this->removeHourlyPageViews($site);
        }
    }

    private function removeHourlyPageViews(Site $site): void
    {
        $timestamps = Redis::zrange(
            'page_views:' . $site->id . ':hourly', 0, -1
        );

        foreach ($timestamps as $timestamp) {
            if (Carbon::createFromTimestamp($timestamp)->diffInHours(now()) > 24) {
                Redis::zrem('page_views:' . $site->id . ':hourly', $timestamp);
            }
        }
    }
}

zrange key 0 -1 returns every member of the set so we can check whether each of them is too old.
If a timestamp is older than 24 hours it can be deleted by running zrem . The same logic applies to daily and
monthly page views. Only the diffInHours condition is different.


Daily unique visitors are stored in HyperLogLogs that don't have commands such as zrange and zrem . This
is how we can remove the unnecessary items:

private function removeDailyUniqueVisitors(Site $site): void
{
    $keys = Redis::keys('unique_visitors:' . $site->id . ':daily*');

    foreach ($keys as $key) {
        $timestamp = Str::after($key, 'daily:');

        if (Carbon::createFromTimestamp($timestamp)->diffInDays() > 365) {
            Redis::del($key);
        }
    }
}

The keys command can use wildcards such as * . In this example, it is called with:

unique_visitors:1:daily*

It returns every key that starts with this string. After the word daily there's a timestamp for the given day.
We get this part by using Str::after and then checking if the timestamp is older than 1 year.
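
For example:

Str::after('unique_visitors:1:daily:1712534400', 'daily:'); // "1712534400"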

Most visited pages have a zset for every hour/day/month so we can use a similar keys command:

private function removeHourlyMostVisitedPages(Site $site): void
{
    $keys = Redis::keys('most_visited_pages:' . $site->id . ':hourly*');

    foreach ($keys as $key) {
        $timestamp = Str::after($key, 'hourly:');

        if (Carbon::createFromTimestamp($timestamp)->diffInHours() > 24) {
            Redis::del($key);
        }
    }
}


It's almost identical to the previous one.

The job is called CleanUpRedisDataJob and is scheduled to run once every day:

namespace App\Console;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule): void
    {
        $schedule->job(CleanUpRedisDataJob::class)->daily();
    }
}

We don't need to run it every hour. It's okay if an hourly list has 48 items, for example. It won't cause bugs
or performance issues. Redis will consume a little bit more memory, however, the extra data can be helpful
for debugging sometimes.


Falling back to MySQL


Now the application has two databases: MySQL is the source of truth with all the data, and Redis is "just"
another layer on top of that with pre-processed data to get better performance.

Lots of things can happen to Redis:

You need to maintain it and you shut it down

The clean-up script accidentally removes current data

The data is just not correct because of some bug

You use the same Redis database for cache and for page views/unique visitors/most visited pages data
and you run php artisan cache:clear . Now all the data is lost. In a minute, we'll talk about how to
solve this issue.

So it's a good thing to fall back to MySQL if Redis doesn't have the data. In each service class ( PageViews ,
UniqueVisitors , and Pages ) we can do something like this:

namespace App\Services\Dashboard;

class PageViews
{
    public function __construct(
        private Site $site,
        private DateService $dateService,
    ) {}

    /**
     * @return array{labels: array<string>, data: array<int>}
     */
    public function getData(
        DateFilter $dateFilter,
        DateFilterPeriod $period
    ): array {
        $dateFormats = $this->dateService->getDateFormats($dateFilter, $period);

        $data = $this->getDataFromRedis($dateFilter, $dateFormats);

        if (empty($data)) {
            $data = $this->getDataFromDatabase($dateFilter, $dateFormats);
        }

        return [
            'labels' => $dateFormats['allDates']->map(
                fn (Carbon $day) => $day->format($dateFormats['format'])
            ),
            'data' => $dateFormats['allDates']->map(
                fn (Carbon $date) => $data[$date->format($dateFormats['format'])] ?? 0
            ),
        ];
    }
}

getData serves as an entry point to all these service classes. It tries to get the data from Redis. If it's not
there it falls back to MySQL invoking another function.

In a scenario where you have the wrong data in Redis (for example, the clean-up script removed the current
month) you have two choices:

You quickly run a script that fixes the data (something like the BuildRedisDataJob )

Or if it takes a longer time to fix everything, you can just run flushall in your Redis database, and
then your app falls back to MySQL because there's no data in Redis

Now imagine that you had to run flushall and you need to rebuild everything in Redis and that takes 20
minutes. In these 20 minutes, you'll have partial data in Redis. To avoid serving partial data from your app
you can introduce a setting, feature flag, or env variable:


public function getData(
    DateFilter $dateFilter,
    DateFilterPeriod $period
): array {
    $dateFormats = $this->dateService->getDateFormats($dateFilter, $period);

    $data = [];

    $useRedis = Setting::where('key', 'USE_REDIS')->first();

    if ($useRedis) {
        $data = $this->getDataFromRedis($dateFilter, $dateFormats);
    }

    if (empty($data)) {
        $data = $this->getDataFromDatabase($dateFilter, $dateFormats);
    }
}

You need to set this setting to false while you're rebuilding the data in Redis.


Avoiding accidental data loss


If you use the default Laravel settings your app will use one Redis database. It means that:

Cache::set('foo', 'foo')

And Redis::zadd()

Will use the same database. So when you run php artisan cache:clear or php artisan optimize:clear
your data is gone.

To avoid this situation you need two separate databases in Redis. You can set them up in database.php :

'redis' "# [
'client' "# env('REDIS_CLIENT', 'predis'),

'options' "# [
'cluster' "# env('REDIS_CLUSTER', 'redis'),
],

'default' "# [
'url' "# env('REDIS_URL'),
'host' "# env('REDIS_HOST', '127.0.0.1'),
'username' "# env('REDIS_USERNAME'),
'password' "# env('REDIS_PASSWORD'),
'port' "# env('REDIS_PORT', '6379'),
'database' "# env('REDIS_DB', '0'), "( Important
],

'cache' "# [
'url' "# env('REDIS_URL'),
'host' "# env('REDIS_HOST', '127.0.0.1'),
'username' "# env('REDIS_USERNAME'),
'password' "# env('REDIS_PASSWORD'),
'port' "# env('REDIS_PORT', '6379'),
'database' "# env('REDIS_DB', '1'), "( Important
],
],

The cache connection uses database 1 but the default one uses db 0 .


Redis::zrange() uses the default connection by default

Cache::get('foo') uses the cache connection by default

You can validate this by setting some cache keys and then changing the database in redis-cli using the
select command.

Database 1 is the cache:

Database 0 is the data store:


Caching
In a near real-time application such as this one, you cannot benefit too much from caching. However,
consider the following scenario:

A user logs in

He checks the last 24 hours

He checks the last 12 months to compare the current month to last year's

He switches back to the last 24 hours

I always do this, because I forget the previous chart when switching to another one. Imagine when a
marketing guy runs a black Friday campaign and he monitors the result. In a 5-minute window, he'll check
the same chart multiple times. And each time we run expensive queries on the backend. We can remove
some load and improve the overall UX if we apply a 5 or 10-minute cache for each chart. So when users
switch between charts the app can serve them from the cache.

It's not a game-changing thing but it'll save a few requests and implementing this is super easy:


class PageViews
{
    public function __construct(
        private Site $site,
        private DateService $dateService,
    ) {}

    /**
     * @return array{labels: array<string>, data: array<int>}
     */
    public function getData(
        DateFilter $dateFilter,
        DateFilterPeriod $period
    ): array {
        $dateFormats = $this->dateService->getDateFormats($dateFilter, $period);

        $cacheKey = 'page_views:' . $this->site->id . ':' . $period->value;

        return Cache::remember($cacheKey, 5 * 60, function () use ($dateFilter, $dateFormats) {
            $data = $this->getDataFromRedis($dateFilter, $dateFormats);

            if (empty($data)) {
                $data = $this->getDataFromDatabase($dateFilter, $dateFormats);
            }

            return [
                'labels' => $dateFormats['allDates']->map(
                    fn (Carbon $day) => $day->format($dateFormats['format'])
                ),
                'data' => $dateFormats['allDates']->map(
                    fn (Carbon $date) => $data[$date->format($dateFormats['format'])] ?? 0
                ),
            ];
        });
    }
}

The cache key looks like this: page_views:1:24h where 1 is the site ID and 24h is the chart type. The TTL is 5
minutes which covers a good portion of users switching back and forth between charts. Of course, if you
have production data and experience you can fine-tune this number.

The worst-case scenario with a cache like this: a user checks the 24-hour chart at 23:59, switches to
another chart, then switches back to the 24-hour chart at 00:01. He'll see yesterday's data for another 3
minutes.

You can validate your cache in Telescope. The first page load misses the cache and then sets it:

The second one hits it:

A cache like this is mainly good for improving UX in my experience. Unfortunately, it won't serve enough
requests to help the servers. Maybe if it's black Friday and everyone goes crazy.


Conclusion
I know it's a lot to take in. Using Redis for the first time feels very different compared to MySQL. It requires a
different way of thinking. But as you saw, we were able to gain 50%, 60%, and sometimes even 80% performance
improvements. Which is awesome.

This example was complicated. If we didn't need to handle different resolutions such as hourly, daily, and
monthly charts everything would be much, much easier. I chose this example because it was complicated
enough. If you go through these examples and understand them, you'll be able to use Redis very efficiently
in simpler situations.

I'd like to end this chapter with the same four words I started with: Redis is a beast.


Architecture v6: CQRS


Right now, the architecture looks like this:

There's one problem with it: the two services share the same database. The PageView API writes into MySQL
(using workers) and the Analytics API reads from it. But the Analytics API also writes data into it. For
example, registering users and creating sites.
This is problematic for a few reasons:

First of all, who will migrate the database? It doesn't seem like a big deal, but imagine you're working
on the PageView API, you want to add a new column, and there is no migration for the page_views
table. Do you need to go to another service to add a column that you will use in this one? On top of
that, you need to make a rule about which service runs the migrations. And every developer on the
team needs to follow that rule. It's confusing.


Tight coupling and low cohesion. Imagine you're working on the Analytics API and you need to change
the schema. For example, renaming a column. You can break the entire PageView API without even
noticing it.

These problems lead to shared responsibility. You don't really know who's in charge of the database,
which leads to even more trouble. Imagine if you have 6 services.

These problems are not strictly related to performance. These are microservice and distributed system
problems. I'll cover the possible solutions and show you some code examples but I won't implement the
whole thing. If you want to read more about these topics check out this and this article.

The solution to this one-database-two-services problem is CQRS. It stands for Command Query
Responsibility Segregation. There are three important terms here:

Command is a write operation such as a MySQL insert

Query is a read operation such as a select statement

Segregation means that CQRS likes to separate the two

It means that we have two separate databases for write and read operations. At first, it sounds weird so let
me put it into context:

The PageView API only writes data to a MySQL database

The Analytics API only reads data from a Redis database

The two systems communicate via an event stream (don't worry about it for now, there are going to be
examples later)

Whenever a new page view happens the PageView API writes it into MySQL and dispatches a
PageViewCreated event to an event store. The Analytics API reads the event store for new events and
builds its own database. In this case, it's a Redis DB with sorted sets and HyperLogLogs. In other cases, it
could be another MySQL database, a Mongo DB, or any other engine. In this specific application, the
PageView API would have another MySQL database to store users and sites. Or you can extract a third
service called User API that has its own database with user information.

What's the benefit of CQRS?

Storing data in a row-based format (MySQL in this example) is an excellent solution.

The problem comes when you want to read it. Users don't want to look at row-based data on a UI. They
want charts, infographics, visual stuff, aggregated reports, etc. Generating these data structures from a row-
based format is not always a great idea. We've seen this. Without spending serious effort on optimization
some of the queries took 6.5s to run. After optimization queries become much more complicated. You can
easily spend days if not weeks improving the performance. The root cause of the problem almost always
boils down to the single fact that the raw data in the table is so far from what users want to see on the UI.

If it's so bad then why did I say this? "Storing data in a row-based format (MySQL in this example) is an
excellent solution." Because it's true. We can store the data in MySQL and then generate hundreds of
different aggregated reports out of it using Redis, Mongo, or other document stores.

The PageView API stores the data in MySQL as a single source of truth and the Analytics API builds its own
"pre-processed" data in Redis. This way, we have the following benefits:


Querying data is super fast. The data structures are all optimized for how the users want to see them.
It's not raw anymore, it's pre-processed for the exact use case of the Analytics API.

Storing data is easy and "raw." Meaning we can add any number of new services in the future. They
can build their own specialized databases using the raw-based data.

To put it in other words: we separate the read and write operations to optimize the app.

This is how the architecture can be imagined as:

It looks confusing at first but here are the most important changes:

The PageView API writes page views into its own MySQL database (using a queue and workers)

The Analytics API doesn't communicate with MySQL anymore

On a scheduled basis (for example every minute) it dispatches a job

This job reads new events from the event stream

And writes these new events into Redis using the same techniques we've seen earlier

This way we have two completely separated databases with two different responsibilities:

MySQL stores the raw data as it is.


Redis stores a "pre-processed" version of the page views that are optimized for the Analytics API's
charts

The event stream acts as a communication layer between the two services

I added these blue rectangles to the image. They represent service boundaries. You can see that Redis is
now owned by the Analytics API and MySQL is owned by the PageView API. These are two separate
subsystems with many components. The event stream is not owned by anyone which is important. It's an
independent communication layer.

The event stream


Redis is an excellent choice for an event stream. It's easy to learn, fast, has a high throughput and it's
probably already in your tech stack which is very important in my opinion. I don't think you should learn
Kafka or RabbitMQ just to have an event stream.

There's a data type in Redis called stream. You can imagine it as a huge log file with timestamps at the
beginning of every line. Each line represents an entry in the stream. An entry is a hash or an associative
array with fields and values. We'll have only one field called data and the value will be the JSON
representation of a PageView record.

An event stream looks like this:


The three records represent three different page views that happened between the 10th of April, 2024
13:49:09 and 13:49:29. In the Entry ID column you can see the human-readable time. This is something
RedisInsight does. In the stream itself there are only the timestamp-based IDs such as 1712749769953-0 . It
contains the timestamp and an integer identifier (0 in this case). If more than one entry is created at the
same time they will get IDs such as 1712749769953-0 , 1712749769953-1 , and so on.

We can add new entries to the stream with the xadd command:

xadd page_views * data "{\"id\": 1, \"ip\": \"1.2.3.4\", \"uri\": \"/home\"}"

data is the field and the JSON string is the value. * is used if you don't want to set an ID manually. In that
case, Redis will set it for you.

This is the command the PageView API needs to execute:

namespace App\Jobs;

class SavePageViewJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(private readonly PageViewData $data)
    {
    }

    public function handle(): void
    {
        $pageView = PageView::create([
            'browser' => $this->data->browser,
            'device' => $this->data->device,
            'ip' => $this->data->ip,
            'visited_at' => $this->data->visited_at,
            'site_id' => $this->data->site_id,
            'uri' => $this->data->uri,
        ]);

        $eventStream = Redis::connection('event_stream');

        $eventStream->xadd('page_views', ['data' => json_encode($pageView)]);
    }
}


If you want to use Redis as an event stream it's probably a good idea to have a dedicated instance just for
that purpose. That instance is independent of every service and it has one job. In that case, you should add
a new entry to the database.php config:

'redis' "# [

'client' "# env('REDIS_CLIENT', 'predis'),

'options' "# [
'cluster' "# env('REDIS_CLUSTER', 'redis'),
],

'default' "# [
'url' "# env('REDIS_URL'),
'host' "# env('REDIS_HOST', '127.0.0.1'),
'username' "# env('REDIS_USERNAME'),
'password' "# env('REDIS_PASSWORD'),
'port' "# env('REDIS_PORT', '6379'),
'database' "# env('REDIS_DB', '0'),
],

'event_stream' "# [
'url' "# env('REDIS_EVENT_STREAM_URL'),
'host' "# env('REDIS_EVENT_STREAM_HOST', '127.0.0.1'),
'username' "# env('REDIS_EVENT_STREAM_USERNAME'),
'password' "# env('REDIS_EVENT_STREAM_PASSWORD'),
'port' "# env('REDIS_EVENT_STREAM_PORT', '6379'),
'database' "# env('REDIS_EVENT_STREAM_DB', '0'),
],
],

This way, you can connect to it by running: $eventStream = Redis::connection('event_stream');

So every time a new page view happens the PageView API adds a new entry to the page_views stream.

To read from a stream the xrange command can be used:


xrange page_views 1712749769953-0 +

The first argument is the start ID (timestamp) and the second one is the end ID. The + means "read every
entry till the end." So this command reads entries from 1712749769953-0 to the end.

In the Analytics API, we need to consume these events. The best way to do this is to schedule a job every
minute that reads only new events. Every time this job is executed we need to store the ID (timestamp) of
the last processed event and in the next schedule we use that to get events after that ID.
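
Scheduling it can reuse the same Kernel we saw in the clean-up section; a minimal sketch:

protected function schedule(Schedule $schedule): void
{
    $schedule->job(CleanUpRedisDataJob::class)->daily();

    $schedule->job(ProcessPageViewEventsJob::class)->everyMinute();
}

And the job itself looks like this: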

class ProcessPageViewEventsJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function handle(): void
    {
        $eventStream = Redis::connection('event_stream');

        $database = Redis::connection('default');

        $lastProcessedEventId = $database->get('last_processed_event_id') ?? '0-0';

        $pageViews = $eventStream->xrange(
            'page_views', $lastProcessedEventId, '+',
        );

        // ...

        $ids = array_keys($pageViews);

        $database->set('last_processed_event_id', $ids[count($ids) - 1]);
    }
}

There's a simple string value in Redis called last_processed_event_id . The function reads this value and
uses it as the start ID for xrange . If there's no such value it defaults to 0-0 .

xrange in predis returns the following array:


[
    "1712749769953-0" => [
        "data" => '{"id": 1, "ip": "1.2.3.4", "uri": "/home"}',
    ],
    "1712751027451-0" => [
        "data" => '{"id": 2, "ip": "1.2.3.4", "uri": "/about"}',
    ],
]

These two lines process the keys of the array (timestamps) and set the last_processed_event_id to a new
value:

$ids = array_keys($pageViews);

$database->set('last_processed_event_id', $ids[count($ids) - 1]);

The last thing to do is to process the new entries. For this we need to use the same logic I showed you
earlier. We need to build the data in Redis:

foreach ($pageViews as $pageViewJson) {
    $pageView = PageViewData::fromJson($pageViewJson);

    // copy() so the three calls don't mutate the same Carbon instance
    $timestampDay = $pageView->visited_at->copy()->startOfDay()->timestamp;

    $timestampMonth = $pageView->visited_at->copy()->startOfMonth()->timestamp;

    $timestampHour = $pageView->visited_at->copy()->startOfHour()->timestamp;

    $database->zincrby(
        'page_views:' . $pageView->site_id . ':hourly', 1, $timestampHour,
    );

    // ...
}


This is the same logic that is implemented in the SavePageViewJob class in the PageView API. The only
difference is that here we have JSON strings instead of models so my go-to approach would be to create
DTOs and use them.

As I said earlier, using an event stream and different databases for read and write operations is a distributed
system problem and not strictly related to performance, so we won't go into further details here. However,
this is the core of the communication. If you want, you can build a distributed system with CQRS based on
these few pages.
