How Digg Com Uses The LAMP Stack To Scale Upward

Uploaded by

Matt Jaynes
Copyright
© Attribution Non-Commercial (BY-NC)

How Digg.com uses the LAMP stack to scale upward http://www.computerworld.com/action/article.do?command=printArti...

How Digg.com uses the LAMP stack to scale upward


Eric Lai

April 24, 2007 (Computerworld) Digg.com credits two particular features of its LAMP (Linux, Apache, MySQL and
PHP) server cluster for helping the news aggregation site maintain speedy performance in the face of high growth.
The site, which lets its users vote on, or "digg," their favorite news stories hosted on other sites, recently passed the 1.2
million-user mark, according to Elliot White III, an engineer at San Francisco-based Digg Inc. He spoke at MySQL’s
annual conference in Santa Clara, Calif., on Tuesday.
Today, Digg.com boasts 100 servers scattered in multiple data centers that host a total of 30GB of data, but the site
started off in late 2004 as a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the
default MyISAM storage engine, White said.
As more users dug Digg, the site moved to an architecture that uses a load balancer at the front, which sends
requests to PHP servers, MySQL slave servers that feed the PHP servers, and a MySQL master server that feeds
data to the slaves.
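The master/slave split described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Digg's actual code: the `QueryRouter` class, the host names, and the rule that anything other than a `SELECT` counts as a write are all assumptions made for the example.

```python
import itertools


class QueryRouter:
    """Send writes to the master and spread reads across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # cycle() hands out the slaves in order, over and over (round-robin).
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        # Simplifying assumption: any statement that is not a SELECT is a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._slaves) if is_read else self.master


# Hypothetical host names, for illustration only.
router = QueryRouter("db-master", ["db-slave-1", "db-slave-2"])
```

With this layout, every write lands on one machine while read capacity grows by adding slaves, which suits a site where reads dominate.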
That's a fairly standard setup. But to get away from "sending raw queries against the database," White said,
Digg.com uses software called Memcached. First developed for use by the LiveJournal site, Memcached is
tailored for dynamic sites like Digg.com, which serve Web pages with content that is constantly changing and is
personalized according to user preferences, White said.
Memcached stores chunks of data that can be pulled and used to dynamically create a Web page. Conventional
caching systems, which store whole Web pages, would be too slow and inefficient for a site like Digg.
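A rough sketch of that idea, caching chunks rather than whole pages, might look like the following. Everything here is assumed for illustration: `ChunkCache` is an in-process stand-in for a Memcached client (not a real client library), and the story fields, key format, and 60-second expiry are invented.

```python
import time


class ChunkCache:
    """In-process stand-in for a cache server: get/set with an expiry time."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired, so treat it as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


db_calls = []  # records every simulated trip to the database


def load_story_from_db(story_id):
    # Placeholder for a real SQL query; the returned fields are made up.
    db_calls.append(story_id)
    return {"id": story_id, "title": f"Story {story_id}", "diggs": 42}


def get_story(cache, story_id):
    # Cache-aside: try the cache first, fall back to the database on a miss.
    key = f"story:{story_id}"
    story = cache.get(key)
    if story is None:
        story = load_story_from_db(story_id)
        cache.set(key, story, ttl_seconds=60)
    return story


def render_front_page(cache, story_ids, username):
    # The greeting is per-user; the story chunks are shared between users.
    lines = [f"Hello, {username}"]
    for story_id in story_ids:
        story = get_story(cache, story_id)
        lines.append(f"{story['title']} ({story['diggs']} diggs)")
    return "\n".join(lines)
```

Rendering the same stories for two different users touches the database only once per story, even though the two pages differ. A whole-page cache could not share anything between those two users.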
The other atypical feature of Digg’s setup is its use of what Tim Ellis, another Digg engineer, calls "sharding."
A term apparently coined by Google engineers, sharding involves breaking a database into smaller parts in order to
isolate heavy loads for better performance.
"If 90% of your data is within a certain range, and you can get that part working really fast, then you can help
customers," Ellis said. "Then it’s OK if the remaining 10% is slower."
A database can be sharded by table, date or range. It is similar to partitioning, says Ellis, but with several key
differences. Sharding usually involves divvying up data onto different physical machines. Partitioning, in contrast,
typically occurs on the same piece of hardware. And while MySQL does not natively allow sharding, it does support
partitioned tables, federated tables and clusters.
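Because MySQL does not shard natively, the routing has to live in the application. As a minimal sketch of sharding by range, assuming a hypothetical layout in which story IDs are split across three machines (the boundaries and host names are invented, and nothing here reflects how Digg actually divides its data):

```python
import bisect

# Hypothetical layout: shard i holds the story IDs below its upper bound
# and at or above the previous shard's bound.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_HOSTS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(story_id):
    """Return the host that owns story_id under range-based sharding."""
    # bisect_right counts how many upper bounds are <= story_id,
    # which is exactly the index of the owning shard.
    index = bisect.bisect_right(SHARD_UPPER_BOUNDS, story_id)
    if story_id < 0 or index >= len(SHARD_HOSTS):
        raise ValueError(f"no shard configured for story_id={story_id}")
    return SHARD_HOSTS[index]
```

If the newest stories attract most of the traffic, the shard holding the highest ID range can be tuned or given better hardware on its own, which is the "get that part working really fast" idea Ellis describes.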
Digg only recently began sharding. While sharding is helping Digg.com achieve much faster performance overall,
breaking a database into several smaller ones increases complexity, Ellis said. That can mean more work for
developers and database administrators, because common SQL operations such as table joins do not work across
separate databases. "Developers don’t like this crazy stuff. That can create pushback," he said.
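To see why the loss of joins means extra work, consider what replaces a single SQL join once the two tables live on different machines: the application has to query each side and stitch the rows together itself. The sketch below is hypothetical throughout. The two in-memory structures stand in for tables on separate shards, and the table and column names are invented.

```python
# Stand-ins for two tables that live on different database servers.
USERS_SHARD = {7: {"name": "alice"}, 9: {"name": "bob"}}
STORIES_SHARD = [
    {"id": 1, "title": "LAMP at scale", "submitter_id": 7},
    {"id": 2, "title": "Caching chunks", "submitter_id": 9},
    {"id": 3, "title": "Orphaned story", "submitter_id": 11},
]


def fetch_stories():
    # First round trip: read the stories from their shard.
    return list(STORIES_SHARD)


def fetch_users(user_ids):
    # Second round trip: look up only the users those stories refer to.
    return {uid: USERS_SHARD[uid] for uid in user_ids if uid in USERS_SHARD}


def stories_with_submitters():
    """Do in application code what a JOIN would do inside one database."""
    stories = fetch_stories()
    users = fetch_users({story["submitter_id"] for story in stories})
    return [
        {
            "title": story["title"],
            # A missing user yields None, much like a LEFT JOIN would.
            "submitter": users.get(story["submitter_id"], {}).get("name"),
        }
        for story in stories
    ]
```

Two round trips and hand-written matching logic take the place of one statement, which gives some sense of the pushback Ellis mentions.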
Digg’s current architecture includes about 20 database servers, 30 Web servers, and a few search servers
running Lucene; the balance operate as backup servers. All but one of the database servers run some version of
MySQL 5. The transaction-heavy servers as well as the backup units use the InnoDB storage engine, while the OLAP
(online analytical processing) ones use MyISAM.
Ellis acknowledges that Digg.com "is really lucky" in that 98% of the time the database is accessed, it is being read
rather than handling more intensive data writes.
"Most people come to Digg’s front page, read it and leave, which is kind of nice," said Ellis, drawing a knowing laugh
from the audience of mostly PHP developers and DBAs.
Ellis also noted that although many users have complained that upgrading to MySQL 5 from 4.1 caused performance to
drop, that was not true in Digg.com’s case.
Maintaining Digg.com's high performance as the site grows more and more popular presents challenges to Digg
engineers. For one thing, the company is unable to keep scaling by buying more physical memory. "We can’t afford that
anymore," Ellis said.
Preventing Digg’s enthusiastic developers from adding powerful but CPU-intensive features is "a political thing I
constantly have to deal with as a DBA," said Ellis.
Also, Digg was having a problem with its storage misreporting the status of data synchronizations. "Our hardware
wanted to be fast," Ellis said. "It was telling us things were synced to disk when it was not."
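For context, this is roughly what "synced to disk" means from an application's point of view. The sketch below is a generic illustration, not Digg's code, and the file name is made up. It flushes a write and then asks the operating system to push it to the storage device. If the hardware acknowledges the request before the data is actually on disk, as Ellis describes, a call like this returns successfully anyway and the software has no way of noticing.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and ask the OS to push it all the way to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's buffer into the OS page cache
        os.fsync(f.fileno())  # ask the OS to flush the page cache to storage
    # fsync returning only means the storage layer *reported* success.


# Hypothetical log file in a temporary directory, for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "votes.log")
durable_write(demo_path, b"story=1 digg\n")
```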
Finally, there is the mundane challenge of minimizing "schema cruft," or redundant tables of data which, if read, can
slow down performance, said Ellis.
"Everyone has to do this," he said.
