
Crawler

Uploaded by

varunprint1


System Design for Big Data [tinyurl]

What is tinyurl?

tinyurl is a URL-shortening service: a user enters a long URL and the service returns a shorter,
unique URL such as "http://tiny.me/5ie0V2". The highlighted part (the code after the last slash) can be
any 6-character string over [0-9, a-z, A-Z]. That gives 62^6 ≈ 56.8 billion unique strings.

How does it work?

On a Single Machine
Suppose we have a database table with three columns: id (auto-increment), actual URL, and
shortened URL.

Intuitively, we could design a hash function that maps the actual URL to the shortened URL. But a
string-to-string mapping that is both short and collision-free is not easy to compute.

Notice that each record in the database has a unique id associated with it. What if we convert the id
to a shortened URL?
Basically, we need a bijective function f(x) = y such that

• each x is associated with one and only one y;

• each y is associated with one and only one x.
In our case, the x's are integers and the y's are 6-character strings. Each 6-character string can
itself be read as a number in base 62 if we map each distinct character to a digit value,
e.g. 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z.
The problem then becomes a base conversion problem, which is a bijection (as long as the id does not
overflow 62^6 :).

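// requires: import java.util.Map;
// id is a long because 62^6 (about 56.8 billion) does not fit in an int.
public String shortUrl(long id, int base, Map<Integer, Character> map) {
    StringBuilder res = new StringBuilder();
    // Emit digits from least significant to most significant.
    while (id > 0) {
        int digit = (int) (id % base);
        res.append(map.get(digit));
        id /= base;
    }
    // Pad with the zero digit so the code is always 6 characters long.
    while (res.length() < 6) res.append(map.get(0));
    return res.reverse().toString();
}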
For each input long url, the corresponding id is auto generated (in O(1) time). The base conversion
algorithm runs in O(k) time where k is the number of digits (i.e. k=6).
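
As a minimal sketch of the single-machine scheme, assuming the digit mapping given above (0-9, then a-z, then A-Z) and a fixed width of 6 characters, the conversion and its inverse could look like this in Python. The names `ALPHABET`, `encode_id`, and `decode_id` are illustrative and do not come from the original notes.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)   # 62
WIDTH = 6              # fixed length of the short code


def encode_id(row_id: int) -> str:
    """Convert a database id to a zero-padded, 6-character base-62 code."""
    if not 0 <= row_id < BASE ** WIDTH:
        raise ValueError("id does not fit in 6 base-62 digits")
    digits = []
    while row_id > 0:
        row_id, digit = divmod(row_id, BASE)
        digits.append(ALPHABET[digit])
    # Digits were produced least significant first: reverse them,
    # then left-pad with the zero digit.
    return "".join(reversed(digits)).rjust(WIDTH, ALPHABET[0])


def decode_id(code: str) -> int:
    """Inverse of encode_id: read the code as a base-62 number."""
    row_id = 0
    for ch in code:
        row_id = row_id * BASE + ALPHABET.index(ch)
    return row_id
```

The explicit range check is what makes the "if not overflowed" condition concrete: any id from 0 to 62^6 - 1 round-trips, and anything larger is rejected rather than silently producing a 7-character code.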

On Multiple Machines
Suppose the service gets more and more traffic, so we need to distribute the data across multiple
servers.

We could use a distributed database, but maintaining one would be much more complicated
(replicating data across servers, syncing among servers to get a unique id, etc.).
Alternatively, we can use a distributed key-value datastore.
Some distributed datastores (e.g. Amazon's Dynamo) use consistent hashing: servers and inputs are
both hashed to integers, and the server responsible for an input is located from the input's hash
value. We can apply the base conversion algorithm to the hash value of the input.

The basic process can be:

Insert

1. Hash the input long URL into a single integer (the key);

2. Locate a server on the ring and store the (key, longUrl) pair on that server;
3. Compute the shortened URL from the key using base conversion (from base 10 to base 62) and return it to the user.
Retrieve

1. Convert the shortened URL back to the key using base conversion (from base 62 to base 10);
2. Locate the server containing that key and return the longUrl.
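
The insert and retrieve steps above can be sketched as follows. This is a toy, in-memory illustration under several assumptions that are not in the original: SHA-256 as the hash, keys reduced modulo 62^6 so they fit in 6 characters, one ring position per server, and plain dictionaries standing in for the servers. The names `Ring`, `hash_to_int`, `to_base62`, and `from_base62` are hypothetical.

```python
import bisect
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
KEY_SPACE = 62 ** 6  # keys must fit in a 6-character base-62 code


def hash_to_int(text: str) -> int:
    """Hash a string to an integer in [0, KEY_SPACE)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % KEY_SPACE


def to_base62(n: int) -> str:
    out = ""
    while n > 0:
        n, d = divmod(n, 62)
        out = ALPHABET[d] + out
    return out.rjust(6, ALPHABET[0])


def from_base62(code: str) -> int:
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


class Ring:
    def __init__(self, servers):
        # Each server sits on the ring at the hash of its name.
        self.points = sorted((hash_to_int(s), s) for s in servers)
        # In-memory stand-in for each server's key-value store.
        self.stores = {s: {} for s in servers}

    def locate(self, key: int) -> str:
        """First server at or after the key, wrapping around the ring."""
        i = bisect.bisect_left(self.points, (key, ""))
        return self.points[i % len(self.points)][1]

    def insert(self, long_url: str) -> str:
        key = hash_to_int(long_url)
        # Two different URLs can hash to the same key; a real service
        # would have to detect and resolve that collision.
        self.stores[self.locate(key)][key] = long_url
        return to_base62(key)

    def retrieve(self, short: str):
        key = from_base62(short)
        return self.stores[self.locate(key)].get(key)
```

Unlike the auto-increment id, a hash value is not guaranteed to be unique per URL, so this variant trades the database's id generator for a collision check that the sketch leaves out.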
---------

StackOverflow: How to code a URL shortener

I want to create a URL shortener service where you can write a long URL into an input
field and the service shortens the URL to "http://www.example.org/abcdef".

Edit: Due to the ongoing interest in this topic, I've published an efficient solution to
GitHub, with implementations for JavaScript, PHP, Python and Java. Add your
solutions if you like :)

Instead of "abcdef" there can be any other string with six characters containing a-z,
A-Z and 0-9. That makes 56~57 billion possible strings.
My approach:

I have a database table with three columns:

1. id, integer, auto-increment

2. long, string, the long URL the user entered
3. short, string, the shortened URL (or just the six characters)
I would then insert the long URL into the table. Then I would select the auto-increment
value for "id" and build a hash of it. This hash should then be inserted as "short". But
what sort of hash should I build? Hash algorithms like MD5 create too long strings. I
don't use these algorithms, I think. A self-built algorithm will work, too.
My idea:

For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:
short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.
That could be repeated until the number isn't divisible any more. Do you think this is a
good approach? Do you have a better idea?

Tags: algorithm, url

– caw, asked Apr 12 '09 at 16:29, edited Sep 27 '16 at 14:56

3 @gudge The point of those functions is that they have an inverse function. This means you can have
both encode() and decode() functions. The steps are, therefore: (1) Save URL in database (2) Get unique row
for that URL from database (3) Convert integer ID to short string with encode(), e.g. 273984 to
short string (e.g. f4a4) in your sharable URLs (5) When receiving a request for a short string (e.g.
string to an integer ID with decode() (6) Look up URL in database for given ID. For conversion,
use: github.com/delight-im/ShortURL – caw Feb 10 '15 at 10:31

@Marco, what's the point of storing the hash in the database? – Maksim Vi. Jul 11 '15 at 9:04

2 @MaksimVi. If you have an invertible function, there's none. If you had a one-way hash function, there would be
one. – caw Jul 14 '15 at 14:47

would it be wrong if we used simple CRC32 algorithm to shorten a URL? Although very unlikely of a collision (a CRC
output is usually 8 characters long and that gives us over 30 million possibilities) If a generated CRC32 output was
already used previously and was found in the database, we could salt the long URL with a random number until we
find a CRC32 output which is unique in my database. How bad or different or ugly would this be for a simple
solution? – Syed Rakib Al Hasan Mar 22 '16 at 9:41

Typical number to short string conversion approach in Java – Aniket Thakur May 14 '16 at 9:41

22 Answers

Accepted answer

I would continue your "convert number to string" approach. However, you will realize
that your proposed algorithm fails if your ID is a prime and greater than 52.
Theoretical background

You need a bijective function f. This is necessary so that you can find an inverse
function g('abc') = 123 for your f(123) = 'abc' function. This means:
• There must be no x1, x2 (with x1 ≠ x2) such that f(x1) = f(x2),
• and for every y you must be able to find an x so that f(x) = y.
How to convert the ID to a shortened URL

1. Think of an alphabet we want to use. In your case that's [a-zA-Z0-9]. It
contains 62 letters.
2. Take an auto-generated, unique numerical key (the auto-incremented id of a
MySQL table, for example).
For this example I will use $125_{10}$ (125 with a base of 10).
3. Now you have to convert $125_{10}$ to $X_{62}$ (base 62).
$125_{10} = 2 \times 62^{1} + 1 \times 62^{0} = [2, 1]$
This requires the use of integer division and modulo. A pseudo-code example:

    digits = []

    while num > 0
        remainder = modulo(num, 62)
        digits.push(remainder)
        num = divide(num, 62)

    digits = digits.reverse
Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an
array, for example) could look:
0 → a
1 → b
...
25 → z
...
52 → 0
...
61 → 9
With 2 → c and 1 → b you will receive $cb_{62}$ as the shortened URL.
http://shor.ty/cb
How to resolve a shortened URL to the initial ID

The reverse is even easier. You just do a reverse lookup in your alphabet.

1. $e9a_{62}$ will be resolved to "4th, 61st, and 0th letter in the alphabet".
$e9a_{62} = [4, 61, 0] = 4 \times 62^{2} + 61 \times 62^{1} + 0 \times 62^{0} = 19158_{10}$
2. Now find your database record with WHERE id = 19158 and do the redirect.
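
The two worked examples in this answer can be checked with a short Python sketch. It assumes the alphabet order given in the mapping (a-z, then A-Z, then 0-9); the function names `to_digits`, `encode`, and `decode` are illustrative only.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
BASE = len(ALPHABET)  # 62


def to_digits(num: int) -> list:
    """Base-62 digits of num, most significant first (the pseudo-code above)."""
    if num == 0:
        return [0]
    digits = []
    while num > 0:
        num, remainder = divmod(num, BASE)
        digits.append(remainder)
    return digits[::-1]


def encode(num: int) -> str:
    """Map each digit to its letter in the alphabet."""
    return "".join(ALPHABET[d] for d in to_digits(num))


def decode(code: str) -> int:
    """Reverse lookup: accumulate digit values from left to right."""
    num = 0
    for ch in code:
        num = num * BASE + ALPHABET.index(ch)
    return num


print(to_digits(125), encode(125))   # [2, 1] cb
print(decode("e9a"))                 # 19158
```

The zero case is handled separately because the loop in the pseudo-code produces no digits at all when `num` is 0.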
Some implementations (provided by commenters)

• Ruby
• Python
• CoffeeScript
• Haskell
• Perl
• C#

– community wiki (14 revs, 9 users; Marcel Jackwerth 75%), edited Dec 10 '14 at 4:18

12 Don't forget to sanitize the URLs for malicious javascript code! Remember that javascript can be base64 encoded
in a URL so just searching for 'javascript' isn't good enough. – Bjorn Tipling Apr 14 '09 at 8:05

2 A function must be bijective (injective and surjective) to have an inverse. – Gumbo May 4 '10 at 20:28

30 Food for thought, it might be useful to add a two character checksum to the url. That would prevent direct iteratio
of all the urls in your system. Something simple like f(checksum(id) % (62^2)) + f(id) = url_id – koblas
13:53

6 As far as sanitizing the urls go, one of the problems you're going to face is spammers using your service to mask
their URLS to avoid spam filters. You need to either limit the service to known good actors, or apply spam filtering
the long urls. Otherwise you WILL be abused by spammers. – Edward Falk May 26 '13 at 15:34

43 Base62 may be a bad choice because it has the potential to generate f* words (for
example, 3792586=='F_ck' with u in the place of _). I would exclude some characters like u/U in order to
minimize this. – Paulo Scardine Jun 28 '13 at 16:02

Why would you want to use a hash?

You can just use a simple translation of your auto-increment value to an alphanumeric
value. You can do that easily by using some base conversion. Say your character space
(A-Z, a-z, 0-9, etc.) has 40 characters: convert the id to a base-40 number and use the
characters as the digits.

– shoosh, answered Apr 12 '09 at 16:34, edited May 4 '10 at 20:25

4 asides from the fact that A-Z, a-z and 0-9 = 62 chars, not 40, you are right on the mark. – Evan Teran
16:39

Thanks! Should I use the base-62 alphabet then? en.wikipedia.org/wiki/Base_62 But how can I convert the ids to a
base-62 number? – caw Apr 12 '09 at 16:46

Using a base conversion algorithm ofcourse - en.wikipedia.org/wiki/Base_conversion#Change_of_radix

12 '09 at 16:48

Thank you! That's really simple. :) Do I have to do this until the dividend is 0? Will the dividend always be 0 at some
point? – caw Apr 12 '09 at 17:04

2 with enough resources and time you can "browse" all the URLs of of any URL shortening service. –
at 21:10

public class UrlShortener {

    private static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int BASE = ALPHABET.length();

    // Converts a non-negative id to its base-62 string.
    public static String encode(int num) {
        // Without this guard, 0 would encode to the empty string.
        if (num == 0) return String.valueOf(ALPHABET.charAt(0));
        StringBuilder sb = new StringBuilder();
        while (num > 0) {
            sb.append(ALPHABET.charAt(num % BASE));
            num /= BASE;
        }
        return sb.reverse().toString();
    }

    // Inverse of encode(); assumes every character of str is in ALPHABET.
    public static int decode(String str) {
        int num = 0;
        for (int i = 0; i < str.length(); i++)
            num = num * BASE + ALPHABET.indexOf(str.charAt(i));
        return num;
    }
}

– richard, answered Jul 11 '12 at 1:30 (edited by Feeco, Feb 16 '16 at 0:11)

I really like the idea, the only problem i have with it is that i keep getting the num variable in the decode function out
of bounds(even for long), do you have any idea how to make it work? or is it theoretical only? – user1322801
'16 at 19:07

@user1322801: Presumably you're trying to decode something that was far larger than what the encode function ca
actually handle. You could get some more mileage out of it if you converted all of the "ints" to BigInteger, but unless
you've got > 9223372036854775807 indexes, long should probably be enough. – biggusjimmus Jul 19 '16 at 6:08

How about add custom URL shortener? – Yosua Lijanto Binar Aug 31 '16 at 14:13

Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They
are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O, and other
characters so similarly that it is nearly impossible to tell them apart), and
downright error-prone. Try to use lower or upper case only.

Also, try to have a format where you mix the numbers and characters in a predefined
form. There are studies that show that people tend to remember one form better than
others (think phone numbers, where the numbers are grouped in a specific form). Try
something like num-char-char-num-char-char. I know this will lower the number of combinations,
especially if you don't have upper and lower case, but it would be more usable and
therefore useful.

– Ash, answered Apr 12 '09 at 17:50

1 Thank you, very good idea. I haven't thought about that yet. It's clear that it depends on the kind of use whether
that makes sense or not. – caw Apr 12 '09 at 18:22

14 It won't be an issue if people are strictly copy-and-pasting the short urls. – Edward Falk May 26 '13 at 15:35

1 The purpose of short url's is not to be memorable or easy to speak. Is only click or copy/paste. – hugomn
14:12

My approach: take the database ID, then Base36-encode it. I would NOT use both
upper AND lowercase letters, because that makes transmitting those URLs over the
telephone a nightmare, but you could of course easily extend the function to be a base 62
en/decoder.

– Michael Stum, answered Apr 14 '09 at 8:02

Thanks, you're right. Whether you have 2,176,782,336 possibilities or 56,800,235,584, it's the same: Both will be
enough. So I will use base 36 encoding. – caw Apr 14 '09 at 18:22

It may be obvious but here is some PHP code referenced in wikipedia to do base64 encode in
php tonymarston.net/php-mysql/converter.html – Ryan White Jul 13 '10 at 15:33

Here is my PHP 5 class.

<?php
class Bijective
{
    public $dictionary =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public function __construct()
    {
        $this->dictionary = str_split($this->dictionary);
    }

    public function encode($i)
    {
        if ($i == 0)
            return $this->dictionary[0];

        // Collect digits in an array; appending with [] to a string is an error.
        $result = array();
        $base = count($this->dictionary);

        while ($i > 0)
        {
            $result[] = $this->dictionary[($i % $base)];
            $i = (int) floor($i / $base);
        }

        $result = array_reverse($result);
        return join("", $result);
    }

    public function decode($input)
    {
        $i = 0;
        $base = count($this->dictionary);

        $input = str_split($input);

        foreach ($input as $char)
        {
            $pos = array_search($char, $this->dictionary);
            $i = $i * $base + $pos;
        }

        return $i;
    }
}

– Xeoncross, answered Nov 4 '11 at 20:10

You could hash the entire URL, but if you just want to shorten the id, do as Marcel
suggested. I wrote this Python implementation:
https://gist.github.com/778542

– bhelx, answered Jan 17 '11 at 21:35

C# version:

using System;
using System.Text;

public class UrlShortener
{
    private static String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static int BASE = 62;

    public static String encode(int num)
    {
        // Without this guard, 0 would encode to the empty string.
        if (num == 0) return ALPHABET[0].ToString();

        StringBuilder sb = new StringBuilder();
        while (num > 0)
        {
            sb.Append(ALPHABET[num % BASE]);
            num /= BASE;
        }

        // Digits were appended least significant first, so reverse them.
        StringBuilder builder = new StringBuilder();
        for (int i = sb.Length - 1; i >= 0; i--)
        {
            builder.Append(sb[i]);
        }
        return builder.ToString();
    }

    public static int decode(String str)
    {
        int num = 0;
        for (int i = 0, len = str.Length; i < len; i++)
        {
            num = num * BASE + ALPHABET.IndexOf(str[i]);
        }
        return num;
    }
}

– user1477388, answered Mar 8 '13 at 20:17, edited Aug 2 '14 at 12:30

If you don't want to re-invent the wheel ... http://lilurl.sourceforge.net/

– Alister Bulman, answered Apr 12 '09 at 17:12

"Sorry, it looks like spammers got to this. Try tinyurl instead." – takeshin Jan 31 '10 at 17:24

to the demo site. The source code is still downloadable from Sourceforge. – Alister Bulman Feb 12 '10 at 22:02

# Python 3: build the alphabet a-z, A-Z, 0-9 as a list of characters.
alphabet = ([chr(c) for c in range(97, 123)] +
            [chr(c) for c in range(65, 91)] +
            [str(d) for d in range(10)])

def lookup(k, a=alphabet):
    if isinstance(k, int):
        return a[k]
    elif isinstance(k, str):
        return a.index(k)

def encode(i, a=alphabet):
    '''Takes an integer and returns it in the given base with
    mappings for upper/lower case letters and numbers 0-9.'''
    try:
        i = int(i)
    except Exception:
        raise TypeError("Input must be an integer.")

    base = len(a)
    if i < base:
        return lookup(i, a)
    # Encode the quotient first, then append the last digit. This keeps
    # zero digits inside the number, which the earlier power-based
    # recursion dropped (e.g. 3844 must encode to "baa", not "ba").
    return encode(i // base, a) + lookup(i % base, a)

def decode(s, a=alphabet):
    '''Takes a base 62 string in our alphabet and returns it in
    base10.'''
    try:
        s = str(s)
    except Exception:
        raise TypeError("Input must be a string.")
    return sum(lookup(c, a) * pow(len(a), p)
               for p, c in enumerate(reversed(s)))

Here's my version for whomever needs it.

– MrChrisRodriguez, answered Nov 21 '09 at 22:21, edited Nov 21 '09 at 22:51

Why not just translate your id to a string? You just need a function that maps a digit
between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to
create, say, 4-letter codes, and you've got 14.7 million URLs covered.

– cr333, answered Apr 12 '09 at 16:35

// simple approach
$original_id = 56789;
$shortened_id = base_convert($original_id, 10, 36);
$un_shortened_id = base_convert($shortened_id, 36, 10);

– phirschybar, answered Dec 20 '11 at 10:59

This is probably correct but you can't chose your alphabet. – caw Dec 20 '11 at 17:51

Here is a decent URL encoding function for PHP...

// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base = 62,
    $chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') {
    $str = '';
    do {
        // fmod() returns a float, so cast it before using it as a string offset.
        $i = (int) fmod($val, $base);
        $str = $chars[$i] . $str;
        $val = ($val - $i) / $base;
    } while ($val > 0);
    return $str;
}

– Simon East, answered Feb 13 '12 at 1:10

Don't know if anyone will find this useful - it is more of a 'hack n slash' method, yet is
simple and works nicely if you want only specific chars.

$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);

// Encode ($id holds the integer to shorten; digits are written least significant first)
$str_id = '';
$base = count($dictionary);

while ($id > 0) {
    $rem = $id % $base;
    $id = ($id - $rem) / $base;
    $str_id .= $dictionary[$rem];
}

// Decode (reads the string in the same least-significant-first order)
$id_ar = str_split($str_id);
$id = 0;
for ($i = count($id_ar); $i > 0; $i--) {
    $id += array_search($id_ar[$i - 1], $dictionary) * pow($base, $i - 1);
}

– Ryan Charmley, answered Mar 29 '12 at 21:42, edited Mar 29 '12 at 22:00

This is what I use:

# Generate a [0-9a-zA-Z] string
# (Python 3: range() is not a list, so the lists are built explicitly.)
ALPHABET = ([str(d) for d in range(10)] +
            [chr(c) for c in list(range(97, 123)) + list(range(65, 91))])

def encode_id(id_number, alphabet=ALPHABET):
    """Convert an integer to a string."""
    if id_number == 0:
        return alphabet[0]

    alphabet_len = len(alphabet)  # Cache

    result = ''
    while id_number > 0:
        id_number, mod = divmod(id_number, alphabet_len)
        result = alphabet[mod] + result

    return result

def decode_id(id_string, alphabet=ALPHABET):
    """Convert a string to an integer."""
    alphabet_len = len(alphabet)  # Cache
    return sum([alphabet.index(char) * pow(alphabet_len, power)
                for power, char in enumerate(reversed(id_string))])

It's very fast and can take long integers.

– Davmuz, answered Mar 10 '11 at 18:34

For a similar project, to get a new key, I make a wrapper function around a random string
generator that calls the generator until I get a string that hasn't already been used in my
hashtable. This method will slow down once your name space starts to get full, but as you
have said, even with only 6 characters, you have plenty of namespace to work with.

– Joel Berger, answered Apr 22 '11 at 18:04

Has this approach worked out for you in the long run? – Chris May 10 '16 at 13:40

To be honest, I have no idea to which project I was referring there :-P – Joel Berger May 10 '16 at 16:34
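
The retry-until-unused idea from this answer can be sketched in Python as follows. The answer gives no code, so everything here is an assumption for illustration: the function name `new_key`, the use of a `set` as the "hashtable", and the `secrets` module as the random source.

```python
import secrets

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def new_key(used: set, length: int = 6) -> str:
    """Draw random strings until one is found that is not in `used`.

    The key is added to `used` before it is returned. The loop never
    terminates if every possible key of this length is already taken.
    """
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in used:
            used.add(candidate)
            return candidate


used_keys = set()
print(new_key(used_keys))  # a random 6-character key
```

The expected number of draws grows as the fraction of free keys shrinks, which is the slowdown the answer mentions; with 62^6 possible keys that only matters once billions of keys are in use.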

did you omit O, 0, i on purpose?

Just created a php class based on Ryan's solution.

<?php

$shorty = new App_Shorty();

echo 'ID: ' . 1000;
echo '<br/> Short link: ' . $shorty->encode(1000);
echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));

/**
 * A nice shorting class based on Ryan Charmley's suggestion; see
 * the link on stackoverflow below.
 * @author Svetoslav Marinov (Slavi) | http://WebWeb.ca
 * @see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
 */
class App_Shorty {
    /**
     * Explicitly omitted: e, i, o, u, 1, 0 because they are
     * confusing. Also use only lowercase ... as
     * dictating this over the phone might be tough.
     * @var string
     */
    private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
    private $dictionary_array = array();

    public function __construct() {
        $this->dictionary_array = str_split($this->dictionary);
    }

    /**
     * Gets ID and converts it into a string
     * (digits are written least significant first).
     * @param int $id
     */
    public function encode($id) {
        $str_id = '';
        $base = count($this->dictionary_array);

        while ($id > 0) {
            $rem = $id % $base;
            $id = ($id - $rem) / $base;
            $str_id .= $this->dictionary_array[$rem];
        }

        return $str_id;
    }

    /**
     * Converts /abc into an integer ID
     * @param string
     * @return int $id
     */
    public function decode($str_id) {
        $id = 0;
        $id_ar = str_split($str_id);
        $base = count($this->dictionary_array);

        for ($i = count($id_ar); $i > 0; $i--) {
            $id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
        }

        return $id;
    }
}

?>

– Svetoslav Marinov, answered Apr 30 '12 at 16:17, edited Apr 30 '12 at 16:42

Yes. Did you see the comment just below the class declaration ? – Svetoslav Marinov Nov 5 '15 at 10:08

I have a variant of the problem, in that I store web pages from many different authors and
need to prevent discovery of pages by guesswork. So my short URLs add a couple of
extra digits to the Base-62 string for the page number. These extra digits are generated
from information in the page record itself and they ensure that only 1 in 3844 URLs is
valid (assuming 2-digit Base-62). You can see an outline description
at http://mgscan.com/MBWL.

– Graham, answered Mar 15 '15 at 9:42
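
One way to realize the "extra digits" idea is sketched below. The answer does not say how its digits are computed, so this is a substitute technique chosen for illustration: two base-62 check digits derived from an HMAC-SHA256 of the page id under a server-side secret. `SECRET`, `check_digits`, `make_token`, and `parse_token` are hypothetical names.

```python
import hashlib
import hmac

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
BASE = len(ALPHABET)
SECRET = b"change-me"  # hypothetical server-side secret, not from the answer


def encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    out = ""
    while num > 0:
        num, d = divmod(num, BASE)
        out = ALPHABET[d] + out
    return out


def decode(code: str) -> int:
    num = 0
    for ch in code:
        num = num * BASE + ALPHABET.index(ch)
    return num


def check_digits(page_id: int) -> str:
    """Two base-62 digits derived from a keyed hash of the id."""
    mac = hmac.new(SECRET, str(page_id).encode("ascii"), hashlib.sha256).digest()
    value = int.from_bytes(mac[:4], "big") % (BASE ** 2)
    return encode(value).rjust(2, ALPHABET[0])


def make_token(page_id: int) -> str:
    return check_digits(page_id) + encode(page_id)


def parse_token(token: str):
    """Return the page id, or None if the token is malformed or forged."""
    if len(token) < 3 or any(ch not in ALPHABET for ch in token):
        return None
    prefix, body = token[:2], token[2:]
    page_id = decode(body)
    if encode(page_id) != body:  # reject non-canonical bodies such as "ab"
        return None
    if not hmac.compare_digest(check_digits(page_id), prefix):
        return None
    return page_id
```

For any given page id exactly one of the 62^2 = 3844 possible prefixes is accepted, which matches the 1-in-3844 figure in the answer. Two digits only slow down enumeration; they do not make guessing infeasible.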

Very good answer, I have created a Golang implementation of the bjf:

package bjf

import (
	"strconv"
	"strings"
)

const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// Encode converts a decimal string such as "125" to its base-62 token.
// The parse error is ignored here, as in the original version.
func Encode(num string) string {
	n, _ := strconv.ParseUint(num, 10, 64)
	t := make([]byte, 0)

	/* Special case */
	if n == 0 {
		return string(alphabet[0])
	}

	/* Map */
	for n > 0 {
		r := n % uint64(len(alphabet))
		t = append(t, alphabet[r])
		n = n / uint64(len(alphabet))
	}

	/* Reverse */
	for i, j := 0, len(t)-1; i < j; i, j = i+1, j-1 {
		t[i], t[j] = t[j], t[i]
	}

	return string(t)
}

// Decode converts a token back to its integer id. It uses integer
// arithmetic (Horner's rule) instead of floating-point math.Pow.
func Decode(token string) int {
	r := 0
	for i := 0; i < len(token); i++ {
		r = r*len(alphabet) + strings.Index(alphabet, string(token[i]))
	}
	return r
}

Hosted at github: https://github.com/xor-gate/go-bjf

– Jerry Jacobs, answered Dec 6 '15 at 20:50, edited Jan 3 '16 at 20:18

/**
 * <p>
 * Integer to character and vice-versa
 * </p>
 */
public class TinyUrl {

    private final String characterMap =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private final int charBase = characterMap.length();

    public String covertToCharacter(int num) {
        // Without this guard, 0 would map to the empty string.
        if (num == 0) return String.valueOf(characterMap.charAt(0));
        StringBuilder sb = new StringBuilder();
        while (num > 0) {
            sb.append(characterMap.charAt(num % charBase));
            num /= charBase;
        }
        return sb.reverse().toString();
    }

    public int covertToInteger(String str) {
        int num = 0;
        // Horner's rule keeps the arithmetic in integers (no Math.pow doubles).
        for (int i = 0; i < str.length(); i++)
            num = num * charBase + characterMap.indexOf(str.charAt(i));
        return num;
    }
}

class TinyUrlTest {

    public static void main(String[] args) {
        TinyUrl tinyUrl = new TinyUrl();
        int num = 122312215;
        String url = tinyUrl.covertToCharacter(num);
        System.out.println("Tiny url: " + url);
        System.out.println("Id: " + tinyUrl.covertToInteger(url));
    }
}

– Hrishikesh Mishra, answered Jul 22 '16 at 12:16

My python3 version

base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)

def encode(num: int):
    result = []
    if num == 0:
        result.append(base_list[0])

    while num > 0:
        result.append(base_list[num % base])
        num //= base
    print("".join(reversed(result)))

def decode(code: str):
    num = 0
    code_list = list(code)
    for index, code in enumerate(reversed(code_list)):
        num += base_list.index(code) * base ** index
    print(num)

if __name__ == '__main__':
    encode(341413134141)
    decode("60FoItT")

– wyx, answered Sep 19 '16 at 7:04, edited Sep 19 '16 at 7:10

Here is a Node.js implementation, similar to bit.ly, that generates a random 7-character
string. It uses Node.js crypto to generate 25 random bytes (a 50-character hex string) and
then randomly selects 7 characters from it.

var crypto = require("crypto");

exports.shortURL = new function () {
    this.getShortURL = function () {
        var sURL = '',
            // 25 random bytes, hex-encoded: a 50-character pool of [0-9a-f].
            _rand = crypto.randomBytes(25).toString('hex');
        for (var i = 0; i < 7; i++)
            sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
        return sURL;
    };
};

– Hafiz Arslan, answered Jan 5 at 7:22


Early detection of Twitter trends explained

By snikolov / November 14, 2012 / projects / 69 comments

A couple of weeks ago on Halloween night, I was out with some friends when
my advisor sent me a message to check web.mit.edu, right now. It took me a
few seconds of staring to realize that an article about my masters thesis work on
a nonparametric approach to detecting trends on Twitter was on the homepage
of MIT. Over the next few days, it was picked up
by Forbes, Wired, Mashable, Engadget, The Boston Herald and others, and my
inbox and Twitter Connect tab were abuzz like never before.
There was a lot of interest in how this thing works and in this post I want to give
an informal overview of the method Prof. Shah and I developed. But first, let me
start with a story…

A scandal

On June 27, 2012, Barclays Bank was fined $450 million for manipulating
the Libor interest rate, in what was possibly the biggest banking fraud scandal in
history. People were in uproar about this, and many took their outrage to
Twitter. In retrospect, “#Barclays” was bound to become a popular, or
“trending” topic. But how soon could one have known this with reasonable
certainty? Twitter’s algorithm detected “#Barclays” as a trend at 12:49pm GMT
following a big jump in activity around noon (Figure 1).

Figure 1

But is there something about the preceding activity that would have allowed us
to detect it earlier? It turns out that there is. We detected it at 12:03, more
than 45 minutes in advance. Overall, we were able to detect trends in
advance of Twitter 79% of the time, with a mean early advantage of
1.43 hours and an accuracy of 95%.

In this post I’ll tell you how we did it. But before diving into our approach, I want
to motivate the thinking behind it by going over another approach to detecting
trends.

The problem with parametric models

A popular approach to trend detection is to have a model of the type of activity
that comes before a topic is declared trending, and to try to detect that type of
activity. One possible model is that activity is roughly constant most of the time
but has occasional jumps. A big jump would indicate that something is becoming
popular. One way to detect trends would be to estimate a “jumpiness”
parameter, say p, from a window of activity and declare something trending or
not based on whether p exceeds some threshold.
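
To make the parametric idea concrete, here is a minimal sketch in Java. The post does not say how the jumpiness parameter p is estimated, so the definition used here (the largest ratio between consecutive activity counts) and the threshold of 3.0 are assumptions chosen purely for illustration.

```java
// Hypothetical "constant + jumps" detector; the estimator for p is an assumption.
final class JumpDetector {
    /** Estimates p as the largest ratio between consecutive activity counts. */
    static double jumpiness(double[] window) {
        double p = 1.0;
        for (int i = 1; i < window.length; i++) {
            // Add 1 to both counts so an empty bucket cannot cause division by zero.
            p = Math.max(p, (window[i] + 1.0) / (window[i - 1] + 1.0));
        }
        return p;
    }

    /** Declares a topic trending when the estimated p exceeds the threshold. */
    static boolean isTrending(double[] window, double threshold) {
        return jumpiness(window) > threshold;
    }

    public static void main(String[] args) {
        double[] flat = {10, 11, 9, 10, 12};
        double[] jump = {10, 11, 9, 40, 45};
        System.out.println(isTrending(flat, 3.0)); // false
        System.out.println(isTrending(jump, 3.0)); // true
    }
}
```

A detector like this only fires on a single sharp jump, which is exactly the limitation discussed next.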

Figure 2

This kind of method is called parametric, because it estimates parameters from
data. But such a “constant + jumps” model does not fully capture the types of
patterns that can precede a topic becoming trending. There could be several
small jumps leading up to a big jump. There could be a gradual rise and no clear
jump. Or any number of other patterns (Figure 3).

Figure 3

Of course, we could build parametric models to detect each of these kinds of
patterns. Or even one master parametric model that detects all of them. But
pretty soon, we get into a mess. Out of all the possible parametric models one
could use, which one should we pick? A priori, it is not clear.
We don’t need to do this — there’s another way.

A data-driven approach

Instead of deciding what the parametric model should be, we take
a nonparametric approach. That is, we let the data itself define the model. If we
gather enough data about patterns that precede trends and patterns that don’t,
we can sufficiently characterize all possible types of patterns that can happen.
Then instead of building a model from the data, we can use the data directly to
decide whether a new pattern is going to lead to a trend or not. You might ask:
aren’t there an unlimited number of patterns that can happen? Don’t we need
an unreasonable amount of data to characterize all these possibilities?
It turns out that we don’t, at least in this case. People acting in social networks
are reasonably predictable. If many of your friends talk about something, it’s
likely that you will as well. If many of your friends are friends with person X, it is
likely that you are friends with them too. Because the underlying system has, in
this sense, low complexity, we should expect that the measurements from that
system are also of low complexity. As a result, there should only be a few types
of patterns that precede a topic becoming trending. One type of pattern could
be “gradual rise”; another could be “small jump, then a big jump”; yet another
could be “a jump, then a gradual rise”, and so on. But you’ll never get a
sawtooth pattern, a pattern with downward jumps, or any other crazy pattern.
To see what I mean, take a look at this sample of patterns (Figure 4) and how it
can be clustered into a few different “ways” that something can become
trending.

Figure 4: The patterns of activity in black are a sample of patterns of activity
leading up to a topic becoming trending. Each subsequent cluster of patterns
represents a “way” that something can become trending.

Having outlined this data-driven approach, let’s dive into the actual algorithm.

Our algorithm

Suppose we are tracking the activity of a new topic. To decide whether a topic is
trending at some time, we take some recent activity, which we call
the observation, and compare it to example patterns of activity from topics
that became trending in the past and topics that did not.
Each of these examples takes a vote on whether the topic is trending or
not trending (Figure 5). Positive, or trending, examples vote
“trending” and negative, or non-trending, examples vote “non-trending”.
The weight of each vote depends on the similarity, or distance,
between the example and the observation according to a decaying exponential
of that distance, where a scaling parameter determines the “sphere of influence” of
each example. Essentially, each example says, with a certain confidence, “The
observation looks like me, so it should have the same label as me.” We used a
Euclidean distance between activity patterns.

Figure 5

Finally, we sum up all of the “trending” and “non-trending” votes, and see if the
ratio of these sums is greater than or less than 1.

One could think of this as a kind of weighted majority vote k-nearest-neighbors
classification. It also has a probabilistic interpretation that you can find in
Chapter 2 of my thesis.
In general, the examples will be much longer than the observations. In that
case, we look for the “best match” between the observation and the example, and
define the distance to be the minimum distance over all observation-sized chunks
of the example.
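
The voting scheme described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the code used in the thesis: the class and method names are made up, the vote weight is taken to be exp(-gamma * distance) with gamma standing in for the scaling parameter, and the two vote sums are compared directly, which is equivalent to checking whether their ratio exceeds 1.

```java
import java.util.List;

// Illustrative sketch of the weighted vote; names and the exact weight form are assumptions.
final class TrendVote {
    /** Euclidean distance between the observation and the chunk of the example starting at offset. */
    static double chunkDistance(double[] example, int offset, double[] obs) {
        double sum = 0.0;
        for (int i = 0; i < obs.length; i++) {
            double diff = example[offset + i] - obs[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Minimum distance over all observation-sized chunks of a (longer) example. */
    static double minDistance(double[] example, double[] obs) {
        double best = Double.POSITIVE_INFINITY;
        for (int off = 0; off + obs.length <= example.length; off++) {
            best = Math.min(best, chunkDistance(example, off, obs));
        }
        return best;
    }

    /** Sum of exponentially decaying votes cast by one class of examples. */
    static double votes(List<double[]> examples, double[] obs, double gamma) {
        double total = 0.0;
        for (double[] example : examples) {
            total += Math.exp(-gamma * minDistance(example, obs));
        }
        return total;
    }

    /** True when the "trending" votes outweigh the "non-trending" votes (ratio above 1). */
    static boolean isTrending(List<double[]> positive, List<double[]> negative,
                              double[] obs, double gamma) {
        return votes(positive, obs, gamma) > votes(negative, obs, gamma);
    }

    public static void main(String[] args) {
        List<double[]> trending = List.of(new double[] {1, 2, 4, 8, 16});
        List<double[]> notTrending = List.of(new double[] {5, 5, 5, 5, 5});
        System.out.println(isTrending(trending, notTrending, new double[] {2, 4, 8}, 1.0)); // true
        System.out.println(isTrending(trending, notTrending, new double[] {5, 5, 5}, 1.0)); // false
    }
}
```

A larger gamma shrinks each example's sphere of influence, so only very close examples contribute noticeably to the vote.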

This approach has some nice properties. The core computations are pretty
simple, as we only compute distances. It is scalable since computation of
distances can be parallelized. Lastly, it is nonparametric, which means we don’t
have to decide what model to use.

Results

To evaluate our approach, we collected 500 topics that trended in some time
window (sampled from previous lists of trending topics) and 500 that did not
(sampled from random phrases in tweets, with trending topics removed). We
then tried to predict, on a holdout set of 50% of the topics, which ones would
trend and which would not. For topics that both our algorithm and Twitter’s
detected as trending, we measured how early or late our algorithm was relative
to Twitter’s.

Our most striking result is that we were able to detect Twitter trends in advance
of Twitter’s trend detection algorithm most of the time, while
maintaining a low rate of error. In 79% of cases, we detected trending
topics earlier than Twitter (1.43 hours earlier on average), and we managed to
keep an accuracy of around 95% (4% false positive rate, 95% true positive rate).

Naturally, our algorithm has various parameters (most notably the scaling
parameter and the length of an observation signal) that affect the tradeoff
between the types of error and how early we can detect trends. If we are very
aggressive about detecting trends, we will have a high true positive rate and
early detection, but also a high false positive rate. If we are very conservative,
we will have a low false positive rate, but also a low true positive rate and late
detection. And there are various tradeoffs in between these extremes. Figure 6
shows a scatterplot of errors in the FPR (false positive rate) vs. TPR (true positive
rate) plane, where each point corresponds to a different combination of
parameters. The FPR-TPR plane is split up into three regions corresponding to
the aggressive (“top”), conservative (“bottom”), and in between (“center”)
strategies. Figure 6 also shows histograms of detection times for each of these
strategies.
Figure 6

Conclusion

We’ve designed a new approach to detecting Twitter trends in a nonparametric
fashion. But more than that, we’ve presented a general time series analysis
method that can be used not only for classification (as in the case of trends), but
also for prediction and anomaly detection (cartooned in Figure 7).
Figure 7

And it has the potential to work for a lot more than just predicting trends on
Twitter. We can try this on traffic data to predict the duration of a bus ride, on
movie ticket sales, on stock prices, or any other time-varying measurements.

We are excited by the early results, but there’s a lot more work ahead. We are
continuing to investigate, both theoretically and experimentally, how well this
does with different kinds and amounts of data, and on tasks other than
classification. Stay tuned!

________________________________________________________

Notes:
Thanks to Ben Lerner, Caitlin Mehl, and Coleman Shelton for reading drafts of
this.

I gave a talk about this at the Interdisciplinary Workshop on Information and
Decision in Social Networks at MIT on November 9th, and I’ve included the slides
below. A huge thank you to the many people who listened to dry runs of it and
gave me wonderful feedback.
Detecting Trends from Stanislav Nikolov

For a less technical look, Prof. Shah gave a great talk at the MIT Museum on
Friday, November 9th:


69 comments

2.
b0b0b0b
November 16, 2012 at 11:35 pm

How did you get your tweet data? Without firehose access (maybe even with), is
it possible there is an inherent bias in twitter’s infrastructure that your algorithm
has learned?

o
snikolov
November 17, 2012 at 12:02 am

You can get the tweet data through the twitter API, which gives you a random
sample of the firehose (though I have no idea if it is truly uniform).

3.
Ivan Yurchenko
November 17, 2012 at 6:19 am

Great work! Want to become able to do such things

BTW, what tools did you use to draw such pretty figures as 2, 3, 5, 7?

o
snikolov
November 17, 2012 at 3:13 pm

Thanks!

Believe it or not, Powerpoint (for mac)!

4.
Uzair
November 17, 2012 at 7:20 am

Isn’t this incredibly basic? It’s basically photometric similarity (except you
haven’t mentioned the importance of making the data scale-free here). I
thought this was fairly common for comparing time series?

o
snikolov
November 17, 2012 at 3:11 pm

Indeed, the method is surprisingly simple. As far as making the data scale-free, I
bet this would improve performance in practice (better accuracy given the
amount of data, or less data needed for a desired accuracy). In theory, though,
if we have enough data, the data itself should sufficiently cover all time scales,
without doing any extra normalization.

5.
acoulton
November 17, 2012 at 11:53 am

This is a really interesting approach. I could also see it being useful with
software/systems metrics (given access to a wide enough range of data).
For example, predicting when to scale up or down clusters based on what’s
about to happen (rather than waiting until the servers are already somewhat
overloaded). Or alerting an engineer when the system might be about to crash
(rather than waiting for the metrics that indicate it has already) so they can take
preventative action or at least be at their desk when it happens. Lots of
possibilities where currently we’re limited to rough guesses based on point in
time values and again where there’s a range of semi-predictable patterns
leading up to an event.

o
snikolov
November 17, 2012 at 3:16 pm

Yes! I’d love to apply this to other domains that are serious pain-points for
people. Monitoring a complex software system or forecasting cluster usage
would be two excellent things to try.


Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm in Storm
JAN 18TH, 2013
Table of Contents
 About Trending Topics and Sliding Windows
Sliding Windows
 Before We Start
 About storm-starter
 The Old Code and My Goals for the New Code
 Implementing the Data Structures
 SlotBasedCounter
 SlidingWindowCounter
 Rankings and Rankable
 Implementing the Rolling Top Words Topology
 Overview of the Topology
 TestWordSpout
 Excursus: Tick Tuples in Storm 0.8+
 RollingCountBolt
 Unit Test Example
 AbstractRankerBolt
 IntermediateRankingsBolt
 TotalRankingsBolt
 RollingTopWords
 Running the Rolling Top Words topology
 Example Logging Output
 What I Did Not Cover
 Summary
 Related Links
A common pattern in real-time data workflows is performing
rolling counts of incoming data points, also known as sliding
window analysis. A typical use case for rolling counts is identifying
trending topics in a user community – such as on Twitter – where
a topic is considered trending when it has been among the top N
topics in a given window of time. In this article I will describe how
to implement such an algorithm in a distributed and scalable
fashion using the Storm real-time data processing platform. The
same code can also be used in other areas such as infrastructure
and security monitoring.
Update 2014-06-04: I updated several references to point
to the latest version of storm-starter, which is now part of
the official Storm project.

About Trending Topics and Sliding Windows
First, let me explain what I mean by “trending topics” so that we
have a common understanding. Here is an explanation taken from
Wikipedia:

Trending topics
A word, phrase or topic that is tagged at a greater rate than other
tags is said to be a trending topic. Trending topics become
popular either through a concerted effort by users or because of
an event that prompts people to talk about one specific topic.
These topics help Twitter and their users to understand what is
happening in the world.

Wikipedia page on Twitter en.wikipedia.org/wiki/…

In other words, it is a measure of “What’s hot?” in a user
community. Typically, you are interested in trending topics for a
given time span; for instance, the most popular topics in the past
five minutes or the current day. So the question “What’s hot?” is
more precisely stated as “What’s hot today?” or “What’s hot this
week?”.
In this article we assume we have a system that uses the Twitter
API to pull the latest tweets from the live Twitter stream. We
assume further that we have a mechanism in place that extracts
topical information in the form of words from those tweets. For
instance, we could opt to use a simple pattern matching
algorithm that treats #hashtags in tweets as topics. Here, we
would consider a tweet such as

@miguno The #Storm project rocks for real-time distributed #data processing!

to “mention” the topics

storm
data
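
A simple pattern matching step of this kind could look like the following Java sketch. The class name, the regular expression and the decision to lowercase topics are assumptions made for illustration; the article leaves the extraction mechanism open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that treats #hashtags in a tweet as topics.
final class HashtagExtractor {
    // A '#' followed by one or more word characters (letters, digits, underscore).
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    /** Returns the lowercased hashtags of a tweet in order of appearance. */
    static List<String> topics(String tweet) {
        List<String> result = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(tweet);
        while (matcher.find()) {
            result.add(matcher.group(1).toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        String tweet = "@miguno The #Storm project rocks for real-time distributed #data processing!";
        System.out.println(topics(tweet)); // [storm, data]
    }
}
```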

We design our system so that it considers topic A more popular
than topic B (for a given time span) if topic A has been mentioned
more often in tweets than topic B. This means we only need
to count the number of occurrences of topics in tweets.
$$\text{popularity}(A) \ge \text{popularity}(B) \iff \text{mentions}(A) \ge \text{mentions}(B)$$
For the context of this article we do not care how the topics are
actually derived from user content or user activities as long as the
derived topics are represented as textual words. Then, the Storm
topology described in this article will be able to identify in real time
the trending topics in this input data using a time-sensitive
rolling count algorithm (rolling counts are also known as sliding
windows) coupled with a ranking step. The former aspect takes
care of filtering user input by time span, the latter of ranking the
most trendy topics at the top of the list.
Eventually we want our Storm topology to periodically produce
the top N of trending topics similar to the following example
output, where t0 to t2 are different points in time:
Rank @  t0     ----->  t1     ----->  t2
---------------------------------------------
1.   java   (33)   ruby   (41)   scala  (32)
2.   php    (30)   scala  (28)   python (29)
3.   scala  (21)   java   (27)   ruby   (24)
4.   ruby   (16)   python (21)   java   (22)
5.   python (15)   php    (14)   erlang (18)

In this example we can see that over time “scala” has become the
hottest trending topic.

Sliding Windows
The last background aspect I want to cover is sliding windows,
also known as rolling counts. A picture is worth a thousand words:
Figure 1: As the sliding window advances, the slice of its input
data changes. In the example above the algorithm uses the
current sliding window data to compute the sum of the window’s
elements.
A formula might also be worth a bunch of words – ok, ok, maybe
not a full thousand of them – so mathematically speaking we
could formalize such a sliding-window sum algorithm as follows:
$$\text{m-sized rolling sum} = \sum_{i=t}^{t+m} \text{element}(i)$$
where t continually advances (most often with time) and m is the
window size.
From size to time: If the window is advanced with time, say
every N minutes, then the individual elements in the input
represent data collected over the same interval of time
(here: N minutes). In that case the window size is equivalent to N
x m minutes. Simply speaking, if N=1 and m=5, then our sliding
window algorithm emits the latest five-minute aggregates every
one minute.
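
As a small illustration of the idea, the following Java sketch computes the sum of every window as it slides over an array of per-interval values. It is not taken from the article's code base: the class and method names are made up, and each window is assumed to cover exactly m consecutive elements.

```java
import java.util.Arrays;

// Illustrative sliding-window sum; each window covers exactly m consecutive elements.
final class RollingSum {
    /** Returns the sum of every m-sized window of the input, in order. */
    static long[] rollingSums(long[] elements, int m) {
        if (m <= 0 || m > elements.length) {
            return new long[0];
        }
        long[] sums = new long[elements.length - m + 1];
        long current = 0;
        for (int i = 0; i < elements.length; i++) {
            current += elements[i];           // element entering the window
            if (i >= m) {
                current -= elements[i - m];   // element leaving the window
            }
            if (i >= m - 1) {
                sums[i - m + 1] = current;    // window [i - m + 1, i] is complete
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        long[] perMinuteCounts = {3, 1, 4, 1, 5, 9};
        System.out.println(Arrays.toString(rollingSums(perMinuteCounts, 3))); // [8, 6, 10, 15]
    }
}
```

With one element per minute and m=3, each emitted value is the latest three-minute aggregate, and a new one is produced every minute.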
Now that we have introduced trending topics and sliding
windows we can finally start talking about writing code for Storm
that implements all this in practice – large-scale, distributed, in
real time.

Before We Start
About storm-starter
The storm-starter project on GitHub provides example
implementations of various real-time data processing topologies
such as a simple streaming WordCount algorithm. It also includes
a Rolling Top Words topology that can be used for computing
trending topics, the purpose of which is exactly what I want to
cover in this article.
When I began to tackle trending topic analysis with Storm I
expected that I could re-use most if not all of the Rolling Top
Words code in storm-starter . But I soon realized that the old code
would need some serious redesigning and refactoring before one
could actually use it in a real-world environment – including being
able to efficiently maintain and augment the code in a team of
engineers across release cycles.
In the next section I will briefly summarize the state of the Rolling
Top Words topology before and after my refactoring to highlight
some important changes and things to consider when writing your
own Storm code. Then I will continue with covering the most
important aspects of the new implementation in further detail.
And of course I contributed the new implementation back to the
Storm project.

The Old Code and My Goals for the New Code

Just to be absolutely clear here: I am talking about the defects of the
old code to highlight some typical pitfalls during software
development for a distributed system such as Storm. My intention
is to make other developers aware of these gotchas so that we
make fewer mistakes in our profession. I am by no means implying
that the authors of the old code did a bad job (after all, the old
code was perfectly adequate to get me started with trending
topics in Storm) or that the new implementation I came up with is
the pinnacle of coding. :-)
My initial reaction to the old code was that, frankly speaking, I
had no idea what it was doing or how it was doing it. The various logical
responsibilities of the code were mixed together in the existing
classes, clearly not abiding by the Single Responsibility Principle.
And I am not talking about academic treatments of SRP and such
– I was hands-down struggling to wrap my head around the old
code because of this.
Also, I noticed a few synchronized statements and threads being
launched manually, hinting at additional parallel operations
beyond what the Storm framework natively provides you with.
Here, I was particularly concerned with those functionalities that
interacted with the system time (calls to System.currentTimeMillis() ). I
couldn’t help the feeling that they looked prone to concurrency
issues. And my suspicions were eventually confirmed when I
discovered a dirty-write bug in the RollingCountObjects bolt code for
the slot-based counting (using long[] ) of object occurrences. In
practice this dirty-write bug in the old rolling count
implementation caused data corruption, i.e. the code was not
carrying out its main responsibility correctly – that of counting
objects. That said I’d argue that it would not have been trivial to
spot this error in the old code prior to refactoring (where it was
eventually plain to see), so please don’t think it was just
negligence on the part of the original authors. With the new tick
tuple feature in Storm 0.8 I was feeling confident that this part of
the code could be significantly simplified and fixed.
In general I figured that completely refactoring the code and
untangling these responsibilities would not only make the code
more approachable and readable for me and others – after all
the storm-starter code’s main purpose is to jumpstart Storm
beginners – but it would also allow me to write meaningful unit
tests, which would have been very difficult to do with the old
code.
What | Before refactoring | After refactoring
----------------------------------------------------------------
Storm Bolts | RollingCountObjects, RankObjects, MergeObjects | RollingCountBolt, IntermediateRankingsBolt, TotalRankingsBolt
Storm Spouts | TestWordSpout | TestWordSpout (not modified)
Data Structures | - | SlotBasedCounter, SlidingWindowCounter, Rankings, Rankable, RankableObjectWithFields
Unit Tests | - | Every class has its own suite of tests.
Additional Notes | Uses manually launched background threads instead of native Storm features to execute periodic activities. | Uses new tick tuple feature in Storm 0.8 to trigger periodic activities in Storm components.

Table 1: The state of the trending topics Storm implementation
before and after the refactoring.
The design and implementation that I will describe in the
following sections are the result of a number of refactoring
iterations. I started with smaller code changes that served me
primarily to understand the existing code better (e.g. more
meaningful variable names, splitting long methods into smaller
logical units). The more I felt comfortable the more I started to
introduce substantial changes. Unfortunately the existing code
was not accompanied by any unit tests, so while refactoring I was
in the dark, risking breaking something that I was not even aware
of breaking. I considered writing unit tests for the existing code
first and then going back to refactoring but I figured that this would
not be the best approach given the state of the code and the time
I had available.

In summary my goals for the new trending topics implementation were:

1. The new code should be clean and easy to understand, both for
the benefit of other developers when adapting or maintaining
the code and for reasoning about its correctness. Notably, the
code should decouple its data structures from the Storm sub-
system and, if possible, favor native Storm features for
concurrency instead of custom approaches.
2. The new code should be covered by meaningful unit tests.
3. The new code should be good enough to contribute it back to
the Storm project to help its community.

Implementing the Data Structures

Eventually I settled on the following core data structures for
the new distributed Rolling Count algorithm. As you will see, an
interesting characteristic is that these data structures are
completely decoupled from any Storm internals. Our Storm bolts
will make use of them, of course, but there is no dependency in
the opposite direction from the data structures to Storm.

 Classes used for counting objects: SlotBasedCounter, SlidingWindowCounter
 Classes used for ranking objects by their count: Rankings, Rankable, RankableObjectWithFields
Another notable improvement is that the new code removes any
need and use of concurrency-related code such
as synchronized statements or manually started background
threads. Also, none of the data structures are interacting with the
system time. Eliminating direct calls to system time and manually
started background threads makes the new code much simpler
and more testable than before.
No more interacting with system time in the low level data structures, yay!

// such code from the old RollingCountObjects bolt is not needed anymore
long delta = millisPerBucket(_numBuckets)
    - (System.currentTimeMillis() % millisPerBucket(_numBuckets));
Utils.sleep(delta);

SlotBasedCounter
The SlotBasedCounter class provides per-slot counts of the
occurrences of objects. The number of slots of a given counter
instance is fixed. The class provides four public methods:
SlotBasedCounter API

public void incrementCount(T obj, int slot);
public void wipeSlot(int slot);
public long getCount(T obj, int slot);
// get the *total* counts of all objects across all slots
public Map<T, Long> getCounts();

Here is a usage example:

Using SlotBasedCounter

// we want to count Object's using five slots
SlotBasedCounter<Object> counter = new SlotBasedCounter<Object>(5);

// counting
Object trackMe = ...;
int currentSlot = 0;
counter.incrementCount(trackMe, currentSlot);

// the count of an object for a given slot
long count = counter.getCount(trackMe, currentSlot);

// the total counts (across all slots) of all objects
Map<Object, Long> counts = counter.getCounts();

Internally SlotBasedCounter is backed by a Map<T, long[]> for the
actual count state. You might be surprised to see the low-
level long[] array here – wouldn’t it be better OO style to introduce
a new, separate class that is just used for the counting of a single
slot, and then we use a couple of these single-slot counters to
form the SlotBasedCounter? Well, yes we could. But for
performance reasons and for not deviating too far from the old
code I decided not to go down this route. Apart from updating the
counter – which is a WRITE operation – the most common
operation in our use case is a READ operation to get
the total counts of tracked objects. Here, we must calculate the
sum of an object’s counts across all slots. And for this it is
preferable to have the individual data points for an object close to
each other (kind of data locality), which the long[] array allows us
to do. Your mileage may vary though.
Figure 2: The SlotBasedCounter class keeps track of multiple counts of
a given object. In the example above, the SlotBasedCounter has
five logical slots which allows you to track up to five counts per
object.
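
To show how little code the Map<T, long[]> design needs, here is a simplified Java sketch of such a counter. It mirrors the four methods listed above but is not the storm-starter source; the class name SlotCounterSketch and the internals are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a slot-based counter backed by Map<T, long[]>; not the storm-starter source.
final class SlotCounterSketch<T> {
    private final Map<T, long[]> objToCounts = new HashMap<>();
    private final int numSlots;

    SlotCounterSketch(int numSlots) {
        if (numSlots <= 0) {
            throw new IllegalArgumentException("numSlots must be positive");
        }
        this.numSlots = numSlots;
    }

    /** Increments the count of obj in the given slot, creating its array on first use. */
    void incrementCount(T obj, int slot) {
        objToCounts.computeIfAbsent(obj, key -> new long[numSlots])[slot]++;
    }

    /** Returns the count of obj in a single slot, or 0 if obj is not tracked. */
    long getCount(T obj, int slot) {
        long[] counts = objToCounts.get(obj);
        return counts == null ? 0 : counts[slot];
    }

    /** Resets the given slot to zero for every tracked object. */
    void wipeSlot(int slot) {
        for (long[] counts : objToCounts.values()) {
            counts[slot] = 0;
        }
    }

    /** Returns the total count per object, summed across all slots. */
    Map<T, Long> getCounts() {
        Map<T, Long> totals = new HashMap<>();
        for (Map.Entry<T, long[]> entry : objToCounts.entrySet()) {
            long sum = 0;
            for (long count : entry.getValue()) {
                sum += count;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }
}
```

Summing an object's total is a single pass over one small array, which is the data locality argument made above.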
The SlotBasedCounter is a primitive class that can be used, for
instance, as a building block for implementing sliding window
counting of objects. And this is exactly what I will describe in the
next section.

SlidingWindowCounter
The SlidingWindowCounter class provides rolling counts of the
occurrences of “things”, i.e. a sliding window count for each
tracked object. Its counting functionality is based on the
previously described SlotBasedCounter. The size of the sliding
window is equivalent to the (fixed) number of slots of a
given SlidingWindowCounter instance. It is used by RollingCountBolt for
counting incoming data tuples.
The class provides two public methods:

SlidingWindowCounter API

public void incrementCount(T obj);
public Map<T, Long> getCountsThenAdvanceWindow();

What might be surprising to some readers is that this class does

not have any notion of time even though “sliding window”
normally means a time-based window of some kind. In our case
however the window does not advance with time but whenever
(and only when) the method getCountsThenAdvanceWindow() is called.
This means SlidingWindowCounter behaves just like a normal ring
buffer in terms of advancing from one window to the next.
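
The ring-buffer behavior can be sketched as follows. This is a simplified stand-in written for this explanation, not the storm-starter implementation: the class name is made up, the per-slot counts are kept inline instead of delegating to SlotBasedCounter, and objects whose count has dropped to zero are not removed.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sliding window counter that advances only when asked; not the storm-starter source.
final class WindowCounterSketch<T> {
    private final Map<T, long[]> objToCounts = new HashMap<>();
    private final int numSlots;
    private int headSlot = 0; // slot that currently receives increments

    WindowCounterSketch(int windowLengthInSlots) {
        if (windowLengthInSlots < 2) {
            throw new IllegalArgumentException("window needs at least two slots");
        }
        this.numSlots = windowLengthInSlots;
    }

    /** Counts one occurrence of obj in the current head slot. */
    void incrementCount(T obj) {
        objToCounts.computeIfAbsent(obj, key -> new long[numSlots])[headSlot]++;
    }

    /** Returns the totals of the current window, then moves the window forward by one slot. */
    Map<T, Long> getCountsThenAdvanceWindow() {
        Map<T, Long> totals = new HashMap<>();
        for (Map.Entry<T, long[]> entry : objToCounts.entrySet()) {
            long sum = 0;
            for (long count : entry.getValue()) {
                sum += count;
            }
            totals.put(entry.getKey(), sum);
        }
        // The oldest slot becomes the new head and is wiped so it can be reused.
        headSlot = (headSlot + 1) % numSlots;
        for (long[] counts : objToCounts.values()) {
            counts[headSlot] = 0;
        }
        return totals;
    }
}
```

Because nothing here reads the clock, a unit test can drive the window forward deterministically by calling getCountsThenAdvanceWindow() as often as needed.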
Note: While working on the code I realized that parts of my
redesign decisions – teasing apart the concerns – were close in
spirit to those of the LMAX Disruptor concurrent ring buffer, albeit
much simpler of course. Firstly, to limit concurrent access to the
relevant data structures (here: mostly what
SlidingWindowCounter is being used for). In my case I followed
the SRP and split the concerns into new data structures in a way
that actually allowed me to eliminate the need for ANY concurrent
access. Secondly, to put a strict sequencing concept in place (the
way incrementCount(T obj) and getCountsThenAdvanceWindow() interact) that
would prevent dirty reads or dirty writes from happening as was
unfortunately possible in the old, system time based code.

If you have not heard about LMAX Disruptor before, make sure to
read their LMAX technical paper (PDF) on the LMAX homepage for
inspiration. It’s worth the time!
Figure 3: The SlidingWindowCounter class keeps track of
multiple rolling counts of objects, i.e. a sliding window count for
each tracked object. Please note that the example of an 8-slot
sliding window counter above is simplified as it only shows a
single count per slot. In reality SlidingWindowCounter tracks multiple
counts for multiple objects.
Here is an illustration showing the behavior
of SlidingWindowCounter over multiple iterations:
Figure 4: Example of SlidingWindowCounter behavior for a counter of
size 4. Again, the example is simplified as it only shows a single
count per slot.

Rankings and Rankable

The Rankings class represents fixed-size rankings of objects, for
instance to implement “Top 10” rankings. It ranks its objects
descendingly according to their natural order, i.e. from largest
to smallest. This class is used by AbstractRankerBolt and its derived
bolts to track the current rankings of incoming objects over time.
Note: The Rankings class itself is completely unaware of the bolts’
time-based behavior.
The class provides five public methods:

Rankings API

public void updateWith(Rankable r);
public void updateWith(Rankings other);
public List<Rankable> getRankings();
public int maxSize(); // as supplied to constructor
public int size(); // current size, might be less than maximum size

Whenever you update Rankings with new data, it will discard any
elements that are smaller than the updated top N , where N is the
maximum size of the Rankings instance (e.g. 10 for a top 10
ranking).
Now the sorting aspect of the ranking is driven by the natural
order of the ranked objects. In my specific case, I created
a Rankable interface that in turn extends
the Comparable interface. In practice, you simply pass
a Rankable object to the Rankings class, and the latter will update its
rankings accordingly.
Using the Rankings class

Rankings topTen = new Rankings(10);
Rankable r = ...;
topTen.updateWith(r);

List<Rankable> rankings = topTen.getRankings();

As you can see it is really straightforward and intuitive in its use.

Figure 5: The Rankings class ranks Rankable objects descendingly
according to their natural order, i.e. from largest to smallest.
The example above shows a Rankings instance with a maximum
size of 10 and a current size of 8.
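
The update-and-trim behavior of a fixed-size ranking can be illustrated with the following Java sketch. It is a simplified stand-in rather than the actual Rankings class: the names TopNSketch and Entry are made up, entries are plain (object, count) pairs instead of Rankable instances, and replacing an existing entry for the same object is an assumption about the intended behavior.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified fixed-size "top N" ranking; a stand-in for the Rankings class, not its source.
final class TopNSketch {
    /** A ranked entry; plays the role of the article's Rankable. */
    static final class Entry {
        final String object;
        final long count;

        Entry(String object, long count) {
            this.object = object;
            this.count = count;
        }
    }

    private final int maxSize;
    private final List<Entry> entries = new ArrayList<>();

    TopNSketch(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.maxSize = maxSize;
    }

    /** Adds or replaces the entry for object, re-sorts descending by count, and trims to maxSize. */
    void updateWith(String object, long count) {
        entries.removeIf(entry -> entry.object.equals(object)); // keep one entry per object
        entries.add(new Entry(object, count));
        entries.sort(Comparator.comparingLong((Entry entry) -> entry.count).reversed());
        if (entries.size() > maxSize) {
            entries.remove(entries.size() - 1); // drop the smallest entry
        }
    }

    /** Returns the ranked objects from largest to smallest count. */
    List<String> rankedObjects() {
        List<String> result = new ArrayList<>();
        for (Entry entry : entries) {
            result.add(entry.object);
        }
        return result;
    }

    int size() {
        return entries.size();
    }

    int maxSize() {
        return maxSize;
    }
}
```

For example, a top-2 ranking updated with java (33), php (30) and scala (21) keeps java and php; a later update of scala to 41 moves scala to the top and drops php.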
The concrete class implementing Rankable is RankableObjectWithFields.
The bolt IntermediateRankingsBolt, for instance, creates Rankables from
incoming data tuples via a factory method of this class:
IntermediateRankingsBolt.java

@Override
void updateRankingsWithTuple(Tuple tuple) {
    Rankable rankable = RankableObjectWithFields.from(tuple);
    super.getRankings().updateWith(rankable);
}

Have a look
at Rankings, Rankable and RankableObjectWithFields for details.
If you run into a situation where you have to implement classes
like these yourself, make sure you follow good engineering
practice and add standard methods such
as equals() and hashCode() as well to your data structures.

Implementing the Rolling Top Words Topology
So where are we? In the sections above we have already
discussed a number of Java classes but not even a single one of
them has been directly related to Storm. It’s about time that we
start writing some Storm code!

In the following sections I will describe the Storm components
that make up the Rolling Top Words topology. When reading the
sections keep in mind that the “words” in this topology represent
the topics that are currently being mentioned by the users in our
imaginary system.
Overview of the Topology
The high-level view of the Rolling Top Words topology is shown in
the figure below.

Figure 6: The Rolling Top Words topology consists of instances
of TestWordSpout, RollingCountBolt, IntermediateRankingsBolt and TotalRankingsBolt.
The length of the sliding window (in secs) as well as the various
emit frequencies (in secs) are just example values – depending on
your use case you would, for instance, prefer to have a sliding
window of five minutes and emit the latest rolling counts every
minute.
The main responsibilities are split as follows:

1. In the first layer the topology runs many TestWordSpout instances
in parallel to simulate the load of incoming data – in our case
this would be the names of the topics (represented as words)
that are currently being mentioned by our users.
2. The second layer comprises multiple instances of RollingCountBolt ,
which perform a rolling count of incoming words/topics.
3. The third layer uses multiple instances
of IntermediateRankingsBolt (“I.R. Bolt” in the figure) to distribute the
load of pre-aggregating the various incoming rolling counts into
intermediate rankings. Hadoop users will see a strong similarity
here to the functionality of a combiner in Hadoop.
4. Lastly, there is the final step in the topology. Here, a single
instance of TotalRankingsBolt aggregates the incoming intermediate
rankings into a global, consolidated total ranking. The output of
this bolt are the currently trending topics in the system. These
trending topics can then be used by downstream data
consumers to provide all the cool user-facing and backend
features you want to have in your platform.
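
The division of labor between layers 3 and 4 can be sketched without
any Storm dependencies. The following is a simplified illustration
under my own assumptions, not the storm-starter implementation: the
class RankingPipelineSketch and its methods are hypothetical, the input
map stands in for the rolling counts produced by layer 2, and routing by
hash code merely imitates what a fields grouping achieves.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical, Storm-free sketch of the "intermediate rank, then total rank" flow.
class RankingPipelineSketch {

    // Keeps the n entries with the largest counts, largest first.
    static List<Map.Entry<String, Long>> topN(List<Map.Entry<String, Long>> entries, int n) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(entries);
        sorted.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    static List<Map.Entry<String, Long>> rank(Map<String, Long> rollingCounts,
                                              int numIntermediateRankers, int n) {
        // Layer 3: every word is routed to exactly one intermediate ranker,
        // so each ranker sees a disjoint subset of the words.
        List<List<Map.Entry<String, Long>>> partitions = new ArrayList<>();
        for (int i = 0; i < numIntermediateRankers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Long> entry : rollingCounts.entrySet()) {
            int ranker = Math.floorMod(entry.getKey().hashCode(), numIntermediateRankers);
            partitions.get(ranker).add(entry);
        }

        // Each intermediate ranker forwards only its local top n (pre-aggregation).
        List<Map.Entry<String, Long>> intermediate = new ArrayList<>();
        for (List<Map.Entry<String, Long>> partition : partitions) {
            intermediate.addAll(topN(partition, n));
        }

        // Layer 4: a single total ranker merges the intermediate rankings.
        return topN(intermediate, n);
    }
}
```

Because every word lands in exactly one partition, any word in the
global top n is also in the top n of its own partition, so the final
merge loses nothing while the total ranker only has to look at a small
number of pre-ranked entries.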
In code the topology wiring looks as follows in RollingTopWords:
RollingTopWords.java

builder.setSpout(spoutId, new TestWordSpout(), 2);
builder.setBolt(counterId, new RollingCountBolt(9, 3), 3)
    .fieldsGrouping(spoutId, new Fields("word"));
builder.setBolt(intermediateRankerId, new IntermediateRankingsBolt(TOP_N), 2)
    .fieldsGrouping(counterId, new Fields("obj"));
builder.setBolt(totalRankerId, new TotalRankingsBolt(TOP_N))
    .globalGrouping(intermediateRankerId);

Note: The integer parameters of the setSpout() and setBolt() methods
(do not confuse them with the integer parameters of the bolt
constructors) configure the parallelism of the Storm components.
See my article Understanding the Parallelism of a Storm
Topology for details.

TestWordSpout
The only spout we will be using is the TestWordSpout that is part
of backtype.storm.testing package of Storm itself. I will not cover the
spout in detail because it is a trivial class. The only thing it does is
to select a random word from a fixed list of five words (“nathan”,
“mike”, “jackson”, “golda”, “bertels”) and emit that word to the
downstream topology every 100ms. For the sake of this article,
we consider these words to be our “topics”, of which we want to
identify the trending ones.
Note: Because TestWordSpout selects its output words at random
(and each word having the same probability of being selected) in
most cases the counts of the various words are pretty close to
each other. This is ok for example code such as ours. In a
production setting though you most likely want to generate
“better” simulation data.
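
One way to get more interesting simulation data is to pick words with
unequal probabilities, so that some topics actually trend. The sketch
below is only an illustration under my own assumptions: the class
SkewedWordPicker and its weights are hypothetical and not part of Storm
or storm-starter, and wiring it into a spout is left out.

```java
import java.util.Random;

// Hypothetical helper (not part of Storm): picks words with unequal probabilities.
class SkewedWordPicker {
    private final String[] words;
    private final double[] cumulativeWeights; // cumulativeWeights[i] = weights[0] + ... + weights[i]
    private final Random random;

    SkewedWordPicker(String[] words, double[] weights, Random random) {
        if (words.length == 0 || words.length != weights.length) {
            throw new IllegalArgumentException("need one positive weight per word");
        }
        this.words = words.clone();
        this.cumulativeWeights = new double[weights.length];
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] <= 0) {
                throw new IllegalArgumentException("weights must be positive");
            }
            sum += weights[i];
            cumulativeWeights[i] = sum;
        }
        this.random = random;
    }

    // Draws a point in [0, totalWeight) and returns the word whose
    // cumulative weight range contains that point.
    String next() {
        double point = random.nextDouble() * cumulativeWeights[cumulativeWeights.length - 1];
        for (int i = 0; i < cumulativeWeights.length; i++) {
            if (point < cumulativeWeights[i]) {
                return words[i];
            }
        }
        return words[words.length - 1]; // guards against floating-point rounding
    }
}
```

For example, with the weights 8, 1 and 1 the first word is emitted
roughly eight times out of ten, which makes it a clear "trending topic"
in the downstream rankings.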
The spout’s output can be visualized as follows. Note that
the @XXXms milliseconds timeline is not part of the actual output.
@100ms: nathan
@200ms: golda
@300ms: golda
@400ms: jackson
@500ms: mike
@600ms: nathan
@700ms: bertels
...

Excursus: Tick Tuples in Storm 0.8+

A new and very helpful (read: awesome) feature of Storm 0.8 is
the so-called tick tuple. Whenever you want a spout or bolt to
execute a task at periodic intervals – in other words, you want to
trigger an event or activity – using a tick tuple is normally the
best practice.
Nathan Marz described tick tuples in the Storm 0.8 announcement
as follows:

Tick tuples: It’s common to require a bolt to “do something” at a
fixed interval, like flush writes to a database. Many people have
been using variants of a ClockSpout to send these ticks. The
problem with a ClockSpout is that you can’t internalize the need
for ticks within your bolt, so if you forget to set up your bolt
correctly within your topology it won’t work correctly. 0.8.0
introduces a new “tick tuple” config that lets you specify the
frequency at which you want to receive tick tuples via the
“topology.tick.tuple.freq.secs” component-specific config, and
then your bolt will receive a tuple from the __system component
and __tick stream at that frequency.
Nathan Marz on the Storm mailing
list groups.google.com/forum/#!msg/…
Here is how you configure a bolt/spout to receive tick tuples every
10 seconds:

Configuring a bolt/spout to receive tick tuples every 10 seconds

@Override
public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    int tickFrequencyInSeconds = 10;
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, tickFrequencyInSeconds);
    return conf;
}

Usually you will want to add a conditional switch to the
component’s execute method to tell tick tuples and “normal”
tuples apart:
Telling tick tuples and normal tuples apart

@Override
public void execute(Tuple tuple) {
    if (isTickTuple(tuple)) {
        // now you can trigger e.g. a periodic activity
    }
    else {
        // do something with the normal tuple
    }
}

private static boolean isTickTuple(Tuple tuple) {
    return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
        && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
}

Be aware that tick tuples are sent to bolts/spouts just like
“regular” tuples, which means they will be queued behind other
tuples that a bolt/spout is about to process via
its execute() or nextTuple() method, respectively. As such the time
interval you configure for tick tuples is, in practice, served on a
“best effort” basis. For instance, if a bolt is suffering from high
execution latency – e.g. due to being overwhelmed by the
incoming rate of regular, non-tick tuples – then you will observe
that the periodic activities implemented in the bolt will get
triggered later than expected.
I hope that, like me, you can appreciate the elegance of solely
using Storm’s existing primitives to implement the new tick tuple
feature. :-)

RollingCountBolt
This bolt performs rolling counts of incoming objects, i.e. sliding
window based counting. Accordingly it uses
the SlidingWindowCounter class described above to achieve this. In
contrast to the old implementation only this bolt (more correctly:
the instances of this bolt that run as Storm tasks) is interacting
with the SlidingWindowCounter data structure. Each instance of the
bolt has its own private SlidingWindowCounter field, which eliminates
the need for any custom inter-thread communication and
synchronization.
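
As a reminder of what such a time-agnostic counter does, here is a
minimal sketch of slot-based sliding window counting. It is a
simplified stand-in written for this explanation, not the
SlidingWindowCounter from storm-starter: the class name
SlotWindowCounterSketch is hypothetical, although the two method names
mirror the ones used by the bolt. Like the bolt's private field, it is
meant to be used from a single thread and is therefore unsynchronized.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a slot-based sliding window counter.
class SlotWindowCounterSketch<T> {
    private final Map<T, long[]> slotsByObject = new HashMap<>();
    private final int numSlots;
    private int headSlot = 0; // slot that currently receives new counts

    SlotWindowCounterSketch(int numSlots) {
        if (numSlots < 2) {
            throw new IllegalArgumentException("a sliding window needs at least two slots");
        }
        this.numSlots = numSlots;
    }

    void incrementCount(T obj) {
        slotsByObject.computeIfAbsent(obj, key -> new long[numSlots])[headSlot]++;
    }

    // Returns the per-object totals over all slots, then slides the window:
    // the oldest slot is wiped and becomes the new head slot.
    Map<T, Long> getCountsThenAdvanceWindow() {
        Map<T, Long> totals = new HashMap<>();
        for (Map.Entry<T, long[]> entry : slotsByObject.entrySet()) {
            long sum = 0;
            for (long slotCount : entry.getValue()) {
                sum += slotCount;
            }
            totals.put(entry.getKey(), sum);
        }
        headSlot = (headSlot + 1) % numSlots;
        for (long[] slots : slotsByObject.values()) {
            slots[headSlot] = 0;
        }
        // Forget objects that no longer have any counts inside the window.
        slotsByObject.values().removeIf(slots -> allZero(slots));
        return totals;
    }

    private static boolean allZero(long[] slots) {
        for (long slotCount : slots) {
            if (slotCount != 0) {
                return false;
            }
        }
        return true;
    }
}
```

The counter itself knows nothing about time. It only becomes a
time-based window once something calls getCountsThenAdvanceWindow() at a
regular interval.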
The bolt combines the previously described tick tuples (that
trigger at fixed intervals in time) with the time-agnostic behavior
of SlidingWindowCounter to achieve time-based sliding window
counting. Whenever the bolt receives a tick tuple, it will advance
the window of its private SlidingWindowCounter instance and emit its
latest rolling counts. In the case of normal tuples it will simply
count the object and ack the tuple.
RollingCountBolt

@Override
public void execute(Tuple tuple) {
    if (TupleHelpers.isTickTuple(tuple)) {
        LOG.info("Received tick tuple, triggering emit of current window counts");
        emitCurrentWindowCounts();
    }
    else {
        countObjAndAck(tuple);
    }
}

private void emitCurrentWindowCounts() {
    Map<Object, Long> counts = counter.getCountsThenAdvanceWindow();
    ...
    emit(counts, actualWindowLengthInSeconds);
}

// Emits one (object, count, actual window length) tuple per counted object.
private void emit(Map<Object, Long> counts, int actualWindowLengthInSeconds) {
    for (Entry<Object, Long> entry : counts.entrySet()) {
        Object obj = entry.getKey();
        Long count = entry.getValue();
        collector.emit(new Values(obj, count, actualWindowLengthInSeconds));
    }
}

private void countObjAndAck(Tuple tuple) {
    Object obj = tuple.getValue(0);
    counter.incrementCount(obj);
    collector.ack(tuple);
}

That’s all there is to it! The new tick tuples in Storm 0.8 and the
cleaned code of the bolt and its collaborators also make the code
much more testable (the new code of this bolt has 98% test
coverage). Compare the code above to the old implementation of
the bolt and decide for yourself which one you’d prefer adapting
or maintaining:

RollingCountObjects BEFORE Storm tick tuples and refactoring

public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    _collector = collector;
    cleaner = new Thread(new Runnable() {
        public void run() {
            Integer lastBucket = currentBucket(_numBuckets);

            while(true) {
                int currBucket = currentBucket(_numBuckets);
                if(currBucket!=lastBucket) {
                    int bucketToWipe = (currBucket + 1) % _numBuckets;
                    synchronized(_objectCounts) {
                        Set objs = new HashSet(_objectCounts.keySet());
                        for (Object obj: objs) {
                            long[] counts = _objectCounts.get(obj);
                            long currBucketVal = counts[bucketToWipe];
                            counts[bucketToWipe] = 0;
                            long total = totalObjects(obj);
                            if(currBucketVal!=0) {
                                _collector.emit(new Values(obj, total));
                            }
                            if(total==0) {
                                _objectCounts.remove(obj);
                            }
                        }
                    }
                    lastBucket = currBucket;
                }
                long delta = millisPerBucket(_numBuckets)
                    - (System.currentTimeMillis() % millisPerBucket(_numBuckets));
                Utils.sleep(delta);
            }
        }
    });
    cleaner.start();
}

public void execute(Tuple tuple) {
    Object obj = tuple.getValue(0);
    int bucket = currentBucket(_numBuckets);
    synchronized(_objectCounts) {
        long[] curr = _objectCounts.get(obj);
        if(curr==null) {
            curr = new long[_numBuckets];
            _objectCounts.put(obj, curr);
        }
        curr[bucket]++;
        _collector.emit(new Values(obj, totalObjects(obj)));
        _collector.ack(tuple);
    }
}

Unit Test Example
Since I mentioned unit testing a couple of times in the previous
section, let me briefly discuss this point in further detail. I
implemented the unit tests with TestNG, Mockito and FEST-Assert.
Here is an example unit test for RollingCountBolt , taken
from RollingCountBoltTest.
Example unit test

@Test
public void shouldEmitNothingIfNoObjectHasBeenCountedYetAndTickTupleIsReceived() {
    // given
    Tuple tickTuple = MockTupleHelpers.mockTickTuple();
    RollingCountBolt bolt = new RollingCountBolt();
    Map conf = mock(Map.class);
    TopologyContext context = mock(TopologyContext.class);
    OutputCollector collector = mock(OutputCollector.class);
    bolt.prepare(conf, context, collector);

    // when
    bolt.execute(tickTuple);

    // then
    verifyZeroInteractions(collector);
}

AbstractRankerBolt
This abstract bolt provides the basic behavior of bolts that rank
objects according to their natural order. It uses the template
method design pattern for its execute() method to allow actual bolt
implementations to specify how incoming tuples are processed,
i.e. how the objects embedded within those tuples are retrieved
and counted.
This bolt has a private Rankings field to rank incoming tuples (those
must contain Rankable objects, of course) according to their natural
order.
AbstractRankerBolt

// This method functions as a template method (design pattern).
@Override
public final void execute(Tuple tuple, BasicOutputCollector collector) {
    if (TupleHelpers.isTickTuple(tuple)) {
        getLogger().info("Received tick tuple, triggering emit of current rankings");
        emitRankings(collector);
    }
    else {
        updateRankingsWithTuple(tuple);
    }
}

abstract void updateRankingsWithTuple(Tuple tuple);

The two actual implementations used in the Rolling Top Words
topology, IntermediateRankingsBolt and TotalRankingsBolt, only need to
implement the updateRankingsWithTuple() method.
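
Stripped of all Storm types, the template method pattern used here
boils down to the following sketch. The classes RankerSketch and
AppendingRanker are hypothetical and only illustrate the control flow:
the base class fixes the "tick or normal input" decision in a final
method, and subclasses supply the one step that differs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Storm-free illustration of the template method pattern.
abstract class RankerSketch {
    private final List<String> rankings = new ArrayList<>();
    private final List<String> emitted = new ArrayList<>();

    // Template method: final, so subclasses cannot change the control flow.
    final void execute(String input, boolean isTick) {
        if (isTick) {
            // Periodic activity: publish a snapshot of the current rankings.
            emitted.add(rankings.toString());
        } else {
            // The variable step, supplied by the concrete subclass.
            updateRankingsWith(input);
        }
    }

    abstract void updateRankingsWith(String input);

    List<String> getRankings() { return rankings; }

    List<String> getEmitted() { return emitted; }
}

// A concrete ranker only implements the step that differs.
class AppendingRanker extends RankerSketch {
    @Override
    void updateRankingsWith(String input) {
        getRankings().add(input);
    }
}
```

The benefit is that the tick handling is written and tested once,
while each concrete ranker stays a few lines long.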

IntermediateRankingsBolt
This bolt extends AbstractRankerBolt and ranks incoming objects by
their count in order to produce intermediate rankings. This type of
aggregation is similar to the functionality of a combiner in
Hadoop. The topology runs many of such intermediate ranking
bolts in parallel to distribute the load of processing the incoming
rolling counts from the RollingCountBolt instances.
This bolt only needs to
override updateRankingsWithTuple() of AbstractRankerBolt :
IntermediateRankingsBolt

@Override
void updateRankingsWithTuple(Tuple tuple) {
    Rankable rankable = RankableObjectWithFields.from(tuple);
    super.getRankings().updateWith(rankable);
}

TotalRankingsBolt
This bolt extends AbstractRankerBolt and merges incoming
intermediate Rankings emitted by
the IntermediateRankingsBolt instances.
Like IntermediateRankingsBolt , this bolt only needs to override
the updateRankingsWithTuple() method:
TotalRankingsBolt

@Override
void updateRankingsWithTuple(Tuple tuple) {
    Rankings rankingsToBeMerged = (Rankings) tuple.getValue(0);
    super.getRankings().updateWith(rankingsToBeMerged);
}

Since this bolt is responsible for creating a global, consolidated
ranking of currently trending topics, the topology must run only a
single instance of TotalRankingsBolt. In other words, it must be a
singleton in the topology.
The bolt’s current code in storm-starter does not enforce this
behavior though – instead it relies on the RollingTopWords class to
configure the bolt’s parallelism correctly (if you ask yourself why
it doesn’t: that was simply an oversight on my part, oops). If you
want to improve that, you can provide a so-called per-component
Storm configuration for this bolt that sets its
maximum task parallelism to 1:
TotalRankingsBolt

@Override
public Map<String, Object> getComponentConfiguration() {
    // Config (rather than a plain HashMap) provides setMaxTaskParallelism()
    Config conf = new Config();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, emitFrequencyInSeconds);
    // run only a single instance of this bolt in the Storm topology
    conf.setMaxTaskParallelism(1);
    return conf;
}

RollingTopWords
The class RollingTopWords ties all the previously discussed code
pieces together. It implements the actual Storm topology,
configures spouts and bolts, wires them together and launches
the topology in local mode (Storm’s local mode is similar to
a pseudo-distributed, single-node Hadoop cluster).
By default, it will produce the top 5 rolling words (our trending
topics) and run for one minute before terminating. If you want to
twiddle with the topology’s configuration settings, here are the
most important:

• Configure the number of generated trending topics by setting
the TOP_N constant in RollingTopWords.
• Configure the length and emit frequencies (both in seconds) for
the sliding window counting in the constructor
of RollingCountBolt in RollingTopWords#wireTopology().
• Similarly, configure the emit frequencies (in seconds) of the
ranking bolts by using their corresponding constructors.
• Configure the parallelism of the topology by setting
the parallelism_hint parameter of each bolt and spout accordingly.
Apart from this there is nothing special about this class. And
because we have already seen the most important code snippet
from this class in the section Overview of the Topology I will not
describe it any further here.
Running the Rolling Top Words topology
Update 2014-06-04: I updated the instructions below
based on the latest version of storm-starter, which is
now part of the official Storm project.
Now that you know how the trending topics Storm code works it is
about time we actually launch the topology! The topology is
configured to run in local mode, which means you can just grab
the code to your development box and launch it right away. You
do not need any special Storm cluster installation or similar setup.

First you must check out the latest code of the storm-starter
project from GitHub:
$ git clone git@github.com:apache/incubator-storm.git
$ cd incubator-storm

Then you must build and install the (latest) Storm jars locally, see
the storm-starter README:
# Must be run from the top-level directory of the Storm code repository
$ mvn clean install -DskipTests=true

Now you can compile and run the RollingTopWords topology:

$ cd examples/storm-starter
$ mvn compile exec:java -Dstorm.topology=storm.starter.RollingTopWords

By default the topology will run for one minute and then
terminate automatically.

Example Logging Output

Here is some example logging output of the topology. The first
column is the current time in milliseconds since the topology was
started (i.e. it is 0 at the very beginning). The second column is the
ID of the thread that logged the message. I deliberately removed
some entries in the log flow to make the output easier to read.
For this reason please take a close look at the timestamps (first
column) when you want to compare the various example outputs
below.
Also, the Rolling Top Words topology has debugging output
enabled. This means that Storm itself will by default log
information such as what data a bolt/spout has emitted. For that
reason you will see seemingly duplicate lines in the logs below.

Lastly, to make the logging output easier to read here is some
information about the various thread IDs in this example run:

Thread      Java Class
Thread-37   TestWordSpout
Thread-39   TestWordSpout
Thread-19   RollingCountBolt
Thread-21   RollingCountBolt
Thread-25   RollingCountBolt
Thread-31   IntermediateRankingsBolt
Thread-33   IntermediateRankingsBolt
Thread-27   TotalRankingsBolt

Note: The Rolling Top Words code in the storm-starter repository runs
more instances of the various spouts and bolts than the code
used in this article. I downscaled the settings only to make the
figures etc. easier to read. This means your own logging output
will look slightly different.
The topology has just started to run. The spouts generate their
first output messages:

2056 [Thread-37] INFO backtype.storm.daemon.task - Emitting: wordGenerator default [golda]
2057 [Thread-19] INFO backtype.storm.daemon.executor - Processing received message source: wordGenerator:11, stream: default, id: {}, [golda]
2063 [Thread-39] INFO backtype.storm.daemon.task - Emitting: wordGenerator default [nathan]
2064 [Thread-25] INFO backtype.storm.daemon.executor - Processing received message source: wordGenerator:12, stream: default, id: {}, [nathan]
2069 [Thread-37] INFO backtype.storm.daemon.task - Emitting: wordGenerator default [mike]
2069 [Thread-21] INFO backtype.storm.daemon.executor - Processing received message source: wordGenerator:13, stream: default, id: {}, [mike]

The three RollingCountBolt instances start to emit their first
sliding window counts:

4765 [Thread-19] INFO backtype.storm.daemon.executor - Processing received message source: __system:-1, stream: __tick, id: {}, [3]
4765 [Thread-19] INFO storm.starter.bolt.RollingCountBolt - Received tick tuple, triggering emit of current window counts
4765 [Thread-25] INFO backtype.storm.daemon.executor - Processing received message source: __system:-1, stream: __tick, id: {}, [3]
4765 [Thread-25] INFO storm.starter.bolt.RollingCountBolt - Received tick tuple, triggering emit of current window counts
4766 [Thread-21] INFO backtype.storm.daemon.executor - Processing received message source: __system:-1, stream: __tick, id: {}, [3]
4766 [Thread-21] INFO storm.starter.bolt.RollingCountBolt - Received tick tuple, triggering emit of current window counts
4766 [Thread-19] INFO backtype.storm.daemon.task - Emitting: counter default [golda, 24, 2]
4766 [Thread-25] INFO backtype.storm.daemon.task - Emitting: counter default [nathan, 33, 2]
4766 [Thread-21] INFO backtype.storm.daemon.task - Emitting: counter default [mike, 27, 2]

The two IntermediateRankingsBolt instances emit their intermediate
rankings:
5774 [Thread-31] INFO backtype.storm.daemon.task - Emitting: intermediateRanker default [[[mike|27|2], [golda|24|2]]]
5774 [Thread-33] INFO backtype.storm.daemon.task - Emitting: intermediateRanker default [[[bertels|31|2], [jackson|19|2]]]
5774 [Thread-31] INFO storm.starter.bolt.IntermediateRankingsBolt - Rankings: [[mike|27|2], [golda|24|2]]
5774 [Thread-33] INFO storm.starter.bolt.IntermediateRankingsBolt - Rankings: [[bertels|31|2], [jackson|19|2]]

The single TotalRankingsBolt instance emits its global rankings:
3765 [Thread-27] INFO storm.starter.bolt.TotalRankingsBolt - Rankings: []
5767 [Thread-27] INFO storm.starter.bolt.TotalRankingsBolt - Rankings: []
7768 [Thread-27] INFO storm.starter.bolt.TotalRankingsBolt - Rankings: [[nathan|33|2], [bertels|31|2], [mike|27|2], [golda|24|2], [jackson|19|2]]
9770 [Thread-27] INFO storm.starter.bolt.TotalRankingsBolt - Rankings: [[bertels|76|5], [nathan|58|5], [mike|49|5], [golda|24|2], [jackson|19|2]]
11771 [Thread-27] INFO storm.starter.bolt.TotalRankingsBolt - Rankings: [[bertels|76|5], [nathan|58|5], [jackson|52|5], [mike|49|5], [golda|49|5]]
13772 [Thread-27] INFO storm.starter.bolt.TotalRankingsBolt - Rankings: [[bertels|110|8], [nathan|85|8], [golda|85|8], [jackson|83|8], [mike|71|8]]

Note: During the first few seconds after startup you will observe
that IntermediateRankingsBolt and TotalRankingsBolt instances will emit
empty rankings. This is normal and the expected behavior –
during the first seconds the RollingCountBolt instances will collect
incoming words/topics and fill their sliding windows before
emitting the first rolling counts to
the IntermediateRankingsBolt instances. The same kind of thing happens
for the combination of IntermediateRankingsBolt instances and
the TotalRankingsBolt instance. This is an important behavior of the
code that must be understood by downstream data consumers of
the trending topics emitted by the topology.
What I Did Not Cover
I introduced a new feature to the Rolling Top Words code that I
contributed back to storm-starter . This feature is a metric that tracks
the difference between the configured length of the sliding
window (in seconds) and the actual window length as seen in the
emitted output data.
14763 [Thread-25] WARN storm.starter.bolt.RollingCountBolt - Actual window length is 2
seconds when it should be 9 seconds (you can safely ignore this warning during the startup
phase)
This metric provides downstream data consumers with additional
meta data, namely the time range that a data tuple actually
covers. It is a nifty addition that will make the life of your fellow
data scientists easier. Typically, you will see a difference between
configured and actual window length a) during startup for the
reasons mentioned above and b) when your machines are under
high load and therefore do not respond perfectly in time. I omitted
the discussion of this new feature to prevent this article from
getting too long.
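
For readers who want a rough idea of how such a metric can be derived,
here is a minimal sketch. It is an illustration under my own
assumptions and not the code in storm-starter: the class
WindowLengthTrackerSketch is hypothetical, and the current time is
passed in explicitly so that the logic can be tested without sleeping.

```java
import java.util.Arrays;

// Hypothetical tracker: remembers when the window was advanced the last
// numSlots times, so the age of the oldest slot gives the time range that
// the data currently in the window actually covers.
class WindowLengthTrackerSketch {
    private final long[] advanceTimesMillis; // ring buffer of advance times
    private int oldest = 0;                  // index of the oldest entry

    WindowLengthTrackerSketch(int numSlots, long startMillis) {
        if (numSlots < 1) {
            throw new IllegalArgumentException("need at least one slot");
        }
        advanceTimesMillis = new long[numSlots];
        Arrays.fill(advanceTimesMillis, startMillis);
    }

    // Seconds between the oldest remembered advance and now.
    int actualWindowLengthInSeconds(long nowMillis) {
        return (int) ((nowMillis - advanceTimesMillis[oldest]) / 1000);
    }

    // Call whenever the sliding window is advanced by one slot.
    void markAdvanced(long nowMillis) {
        advanceTimesMillis[oldest] = nowMillis;
        oldest = (oldest + 1) % advanceTimesMillis.length;
    }
}
```

With a 9-second window that is advanced every 3 seconds (three slots),
this sketch reports 3, 6 and then 9 seconds during startup, stays at 9
while ticks arrive on time, and reports more than 9 when a tick is
delayed. Comparing the value against the configured length is what
produces a warning like the one shown above.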

Also, there are some minor changes in my own code that I did not
contribute back to storm-starter because I did not want to introduce
too many changes at once (such as a
refactored TestWordSpout class).

Summary
In this article I described how to implement a distributed, real-
time trending topics algorithm in Storm. It uses the latest features
available in Storm 0.8 (namely tick tuples) and should be a good
starting point for anyone trying to implement such an algorithm
for their own application. The new code is now available in the
official storm-starter repository, so feel free to take a closer look.
You might ask whether there is a use of a distributed sliding
window analysis beyond the use case I presented in this article.
And for sure there is. The sliding window analysis described here
applies to a broader range of problems than computing trending
topics. Another typical area of application is real-time
infrastructure monitoring, for instance to identify broken servers
by detecting a surge of errors originating from problematic
machines. A similar use case is identifying attacks against your
technical infrastructure, notably flood-type DDoS attacks. All of
these scenarios can benefit from sliding window analyses of
incoming real-time data through tools such as Storm.
If you think the starter code can be improved further, please
contribute your changes back to the storm-starter component in
the official Storm project.

Related Links
• Understanding the Parallelism of a Storm Topology
Posted by Michael G. Noll Jan 18th, 2013 Filed
under Java, Programming, Real-time, Storm

How can I build a web crawler from scratch?

What language or framework would you recommend? Any tips or tricks?
Answer Wiki

• Why reinvent the wheel? Use a web crawling/scraping service that
somebody else maintains (http://www.webcrawling.org)

Raghavendran Balu, Data Researcher+Engineer | Left | Nomadic

Written Jun 8, 2014
A web crawler might sound like a simple fetch-parse-append system, but watch out:
it is easy to overlook the complexity. I might deviate from the intent of the question by
focussing more on architecture than on implementation specifics. I believe this is
necessary because, when building a web-scale crawler, the architecture of the crawler
matters more than the choice of language or framework.

A good starting point to learn about architecture is Web Crawling[1] (highly
suggested) and Crawling the Web[2]. Some of the well-known systems in academia
are Mercator[3], UbiCrawler[4], IRLbot[5] (a single-server crawler)
and MultiCrawler[6].

Architecture:
A bare minimum crawler needs at least these components:

1. HTTP Fetcher: Retrieves web pages from the server.
2. Extractor: Provides minimal support for extracting URLs from a page, such as
anchor links.
3. Duplicate Eliminator: Makes sure the same content is not extracted twice
unintentionally. Think of it as a set-based data structure.
4. URL Frontier: Prioritizes the URLs that have to be fetched and parsed. Think of
it as a priority queue.
5. Datastore: Stores the retrieved pages, URLs and other metadata.
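
The interplay of these five components can be sketched in a few lines
of Java. This is a toy illustration under stated assumptions, not a
production design: the class MiniCrawlerSketch is hypothetical, the
fetcher is passed in as a function (URL to page body, null on failure)
so that no real HTTP client is implied, the frontier is a plain FIFO
queue instead of a priority queue, and links are found with a naive
regular expression where a real crawler would use an HTML parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-machine crawler loop showing how the components interact.
class MiniCrawlerSketch {
    // Extractor: naive pattern for absolute links in href="..." attributes.
    private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    static Map<String, String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>();        // URL frontier (FIFO in this sketch)
        Set<String> seen = new HashSet<>();                 // duplicate eliminator
        Map<String, String> store = new LinkedHashMap<>();  // datastore: URL -> page body

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && store.size() < maxPages) {
            String url = frontier.remove();
            String body = fetcher.apply(url);               // HTTP fetcher (injected)
            if (body == null) {
                continue;                                   // fetch failed, skip this URL
            }
            store.put(url, body);
            for (String link : extractLinks(body)) {
                if (seen.add(link)) {                       // add() is false for known URLs
                    frontier.add(link);
                }
            }
        }
        return store;
    }
}
```

Swapping the ArrayDeque for a priority queue and the HashSet for a
disk-backed or distributed set is where most of the real engineering
effort goes.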

Distributed crawling is more challenging than crawling from a single server because
the nodes have to coordinate on duplicate detection (3) and the URL frontier (4). This
calls for a distributed set implementation and a distributed priority queue. The
approach described in Mercator[3] solves this efficiently by maintaining front-end and
back-end queues. Considering the volume of the data, you have to balance between
disk-based data structures and an in-memory cache. Compared to these data structures,
fetching and link extraction are less problematic, as they are independent operations
and scale well with hardware.
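
One simple way to reduce the coordination effort is to make each node
responsible for a fixed set of hosts, so that the frontier and the
duplicate set for a host live on one node only. The sketch below is an
assumption of mine for illustration and is not taken from Mercator: the
class HostPartitionerSketch is hypothetical, and plain modulo hashing is
used for brevity.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical partitioner: maps every URL of a host to the same crawler node.
class HostPartitionerSketch {

    static int nodeFor(String url, int numNodes) {
        if (numNodes < 1) {
            throw new IllegalArgumentException("numNodes must be positive");
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        // Host names are case-insensitive, so normalize before hashing.
        // floorMod keeps the result in [0, numNodes) for negative hash codes.
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), numNodes);
    }
}
```

A drawback of plain modulo hashing is that changing the number of
nodes reassigns most hosts. Schemes such as consistent hashing limit
that reshuffling, at the cost of a little more bookkeeping.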

Notable problems / tips and tricks:

1. DNS resolving: This minor operation adds up to a big bottleneck when crawling
on a large scale. If not tackled, your system might spend more time
waiting to resolve domain names than fetching and parsing. It is advisable
to maintain local caches to avoid repeated requests.
2. Politeness: Every site has different politeness requirements, and ignoring
them might lead to IP blocks. Expect mails from web admins! [5] You need to
cache the robots.txt file of every domain and follow its rules to avoid getting
blocked. Sites also usually have time-gap requirements, which means
slowing down the fetch rate. If your system is distributed, it is better to route all
requests pertaining to a domain to the same node.
3. Adversaries: Don't expect the web to be nice to your bot! There are
many crawler traps, spam sites and cloaked content. Crawler traps contain large
numbers of automatically generated URLs with potential loops, which leads to the
crawler getting stuck there. You need to smartly separate valid domains from spam
domains. Domain reputation derived from user input could be one such signal. There is
other redundant web content (some of it unintentional) too.
4. Seed and content selection: You need to come up with a good set of seed
domains, and it is desirable to have some mechanism to rank domains/URLs
based on content for effective prioritisation.
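
The time-gap part of the politeness requirement (point 2) can be
sketched as a small per-host gate. This is a minimal illustration under
stated assumptions: the class PolitenessGateSketch is hypothetical, it
applies one fixed delay to every host, and it does not parse robots.txt
at all. The current time is passed in explicitly so the logic can be
tested without waiting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host rate gate: allows at most one fetch per host
// every minDelayMillis milliseconds.
class PolitenessGateSketch {
    private final long minDelayMillis;
    private final Map<String, Long> nextAllowedMillisByHost = new HashMap<>();

    PolitenessGateSketch(long minDelayMillis) {
        if (minDelayMillis < 0) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        this.minDelayMillis = minDelayMillis;
    }

    // Returns true and reserves the next slot if the host may be fetched now;
    // returns false if the caller should re-queue the URL and try again later.
    boolean tryAcquire(String host, long nowMillis) {
        Long nextAllowed = nextAllowedMillisByHost.get(host);
        if (nextAllowed != null && nowMillis < nextAllowed) {
            return false;
        }
        nextAllowedMillisByHost.put(host, nowMillis + minDelayMillis);
        return true;
    }
}
```

Because the gate keeps its state in a local map, it only works if all
requests for a host pass through the same node, which is exactly why the
advice above is to route a domain's requests to one node.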
More problems (research oriented):

1. Freshness/coverage dilemma: Some websites, such as Wikipedia and news sites, are
frequently updated and require you to update your crawled content too. You
need to come up with a strategy to roughly estimate the crawl schedule for
each domain. You also need to wisely balance covering new domains and
updating already covered domains.
2. Deep crawling: This is the most challenging one, and even internet giants like
Google are still working on it. The web content hidden behind HTML forms is
several orders of magnitude larger than the link-based crawlable content on the web.
You need to come up with a smart query generator suited to the particular HTML form
and also a size estimator for the underlying data to be extracted.
3. Focussed crawling: If your target content is narrower than the whole web, you
need to predict domain relevance based on your interests. This ensures
higher content quality and lower computation and storage overhead. Solutions
range from simple whitelisting to advanced classifiers.

Implementation specifics:
The thesis Building blocks of a scalable web crawler[7] details crawling from
an implementation point of view.

Programming language: Any high-level language with a good network library that you
are comfortable with is fine. I personally prefer Python/Java. As your crawler project
grows in terms of code size, it will be hard to manage if you develop in a design-
restricted programming language. While it is possible to build a crawler using just Unix
commands and shell scripts, you might not want to do so for obvious reasons.

Frameworks/libraries: Many frameworks are already suggested in other answers. I
shall summarise here:

1. Apache Nutch and Heritrix (Java): Mature, large scale, configurable.
2. Scrapy (Python): Technically a scraper but can be used to build a crawler.
3. node.io (JavaScript): Scraper. Nascent, but worth considering if you are
ready to live with JavaScript.

Suggestions for scalable distributed crawling:

1. It is better to go for an asynchronous model, given the nature of the problem.
2. Choose a distributed database for data storage, e.g. HBase.
3. A data structure store like Redis is also worth considering for the URL
frontier and the duplicate detector.
Happy Crawling!

[1] Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in
Information Retrieval, 4(3), 175-246.
[2] Pant, G., Srinivasan, P., & Menczer, F. (2004). Crawling the web. In Web Dynamics
(pp. 153-177). Springer Berlin Heidelberg.
[3] Heydon, A., & Najork, M. (1999). Mercator: A scalable, extensible web
crawler. World Wide Web, 2(4), 219-229.
[4] Boldi, P., Codenotti, B., Santini, M., & Vigna, S. (2004). Ubicrawler: A scalable fully
distributed web crawler. Software: Practice and Experience, 34(8), 711-726.
[5] Lee, H. T., Leonard, D., Wang, X., & Loguinov, D. (2009). IRLbot: scaling to 6
billion pages and beyond. ACM Transactions on the Web (TWEB), 3(3), 8.
[6] Harth, A., Umbrich, J., & Decker, S. (2006). Multicrawler: A pipelined architecture
for crawling and indexing semantic web data. In The Semantic Web-ISWC 2006 (pp.
258-271). Springer Berlin Heidelberg.
[7] Seeger, M. (2010). Building blocks of a scalable web crawler. Master's thesis,
Stuttgart Media University.


Jake Cook, Co-founder of Gainlo (gainlo.co), Engineer, Entrepreneur

Written Jun 30, 2016
Building a simple web crawler can be easy since, in essence, you are just issuing HTTP
requests to a website and parsing the responses. However, when you try to scale the
system, there are tons of problems.

Language and framework do matter a lot. The rule of thumb is not to reinvent the
wheel. If there are existing tools that can ease your work, just grab them. If I were to
build a web crawler from scratch, I would choose Python. Some advantages include:

• The language is simple to write. The syntax is concise and it's one of the best
languages for building things fast.
• There are tons of libraries/frameworks that help you build a web crawler. For
instance, Scrapy is a very fast and powerful framework for crawling.
• Just Google “python web crawler” and you're gonna get hundreds or thousands of
results. You don't need to build everything “from scratch” since so many
existing tools and code samples can save you tons of time.
It's also worth noting that one disadvantage of Python is performance. Compared
to languages like C++, Python is relatively slow.

In addition, you should have a clear understanding of how a web crawler works in
essence and what kinds of problems you need to consider. I'd recommend reading the
article Build a Web Crawler, which has a very in-depth analysis of this
topic.
In a nutshell, to crawl a single web page, all we need is to issue an HTTP GET request
to the corresponding URL and parse the response data; this is the core of a crawler.
Starting with a URL pool, we keep adding the URLs found on the pages we have crawled, and
the process continues. A couple of things you should figure out:

• Crawling frequency - For some small websites, it's very likely that their
servers cannot handle very frequent requests. One approach is to follow the
robots.txt of each site.
• Dedup - On a single machine, you can keep the URL pool in memory and
remove duplicate entries. However, things become more complicated in a
distributed system: multiple crawlers may extract the same URL
from different web pages, and they all want to add this URL to the URL pool.
• Parsing - The challenge is that you will always find strange markup, URLs,
etc. in the HTML code, and it's hard to cover all corner cases.
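
One common way to handle the distributed dedup problem described above is to give every URL exactly one owner, so that only one crawler ever decides whether that URL is new. The following is a minimal sketch under that assumption, with a fixed number of crawler nodes; the names `owner_of`, `CrawlerNode`, and `route` are hypothetical and not taken from any library, and the direct method call in `route` stands in for what would be a network message in a real system.

```python
import hashlib


def owner_of(url: str, num_nodes: int) -> int:
    """Map a URL to the index of the node responsible for it.

    Uses SHA-1 rather than the built-in hash(), because hash() of a str
    is randomized per process and would disagree between machines.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class CrawlerNode:
    """One crawler process; it dedups only the URLs it owns."""

    def __init__(self, index: int):
        self.index = index
        self.seen = set()   # every URL this node has ever accepted
        self.pool = []      # URLs waiting to be crawled by this node

    def offer(self, url: str) -> bool:
        """Add url to the pool if it is new; return True if it was added."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.pool.append(url)
        return True


def route(url: str, nodes: list) -> bool:
    """Send a discovered URL to its owner (a network call in a real system)."""
    return nodes[owner_of(url, len(nodes))].offer(url)


nodes = [CrawlerNode(i) for i in range(4)]
first = route("http://example.com/a", nodes)    # accepted by the owning node
second = route("http://example.com/a", nodes)   # rejected: the owner has seen it
```

Because the owner is computed from the URL alone, two crawlers that find the same link on different pages send it to the same node. Changing the number of nodes reshuffles ownership, which a real deployment would have to plan for.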

Ayyathurai N Naveen, Lead Software Developer @ SearchBlox

Written May 3, 2012
I would say you should go with whichever language you are most comfortable with.
Keep in mind that you are reinventing the wheel.

My recommendations are:

C - If you need your crawler to be extremely fast and to consume little
memory. wget is written in C.

Java - Consumes more memory than a C version and takes time to develop.

Most open-source crawlers, such as Nutch,
OpenSearchServer, and crawler4j, are written in Java, so
you can look at their source code to get an idea.

Python/Perl - Easy to develop in, with a lot of libraries. But I don't have a lot
of experience with these languages.

Wikipedia has a great article about web crawlers; please have a look.
http://en.wikipedia.org/wiki/Web...


Jake Kovoor
Written Mar 24
I was actually trying to find a way to do this a few days ago,
and I came across a great, quick, and simple way to create a web crawler using Python
code.

Here's the guide for you :-)

How to Make a Web Crawler in Under 50 Lines of Code (Python) - Saint

I tried the following code a few days ago on Python 3.6.1 (which was the latest
as of 21st March 2017) and it should work for you too.

All you have to do is copy and paste the code into your Python IDE and then call
spider() with a start URL, a word to search for, and a maximum number of pages.

Here's the code.

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse


# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser, which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a method that HTMLParser normally has,
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="http://www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # http://www.saintlad.com/ is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # http://www.saintlad.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links.append(newUrl)

    # This is a new method that we are creating to get links;
    # our spider() function will call it
    def getLinks(self, url):
        self.links = []
        # Remember the base URL, which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or PDFs). The header often carries a
        # charset, e.g. "text/html; charset=utf-8", so we test for the
        # media type instead of comparing the whole string
        if 'text/html' in response.getheader('Content-Type', ''):
            htmlBytes = response.read()
            # Note that feed() handles strings well, but not bytes
            # (a change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []


# And finally here is our spider. It takes in a URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string.
    # Our getLinks method returns the web page
    # (this is useful for searching for the word)
    # and a list of links from that web page
    # (this is useful for where to go next)
    while numberVisited < maxPages and pagesToVisit and not foundWord:
        numberVisited = numberVisited + 1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit.pop(0)
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
            # Add the links that we found on this page to the end of our
            # collection of pages to visit:
            pagesToVisit = pagesToVisit + links
            print(" **Success!**")
        except Exception:
            # Network errors, bad URLs, and undecodable pages all end up here
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")

If you are having any problems, or if you found any errors, just drop me a message and
let's work it out together. :-)

Abhinav M Kulkarni, Data Scientist at Trulia

Updated Nov 12, 2013
While you can go through the Udacity class material pointed to by Rajwinder Singh, do
have a look at the open-source Apache Nutch library. It provides all the functionality of
a typical web crawler, and you can combine it with Apache Solr/Lucene to build an Internet
search engine.

To learn the theory behind how a web crawler is built and functions, you can refer to the
excellent and widely used textbook 'Introduction to Information Retrieval' by
Manning, Raghavan, and Schütze. Specifically, chapter 20, titled 'Web crawling and indexes',
covers the architecture and functioning of a web crawler.


Rajwinder Singh, Engineer. (raj.win.der)

Written May 4, 2012
Here's a starting point: the Udacity CS101 online class, where the students build a search
engine from scratch, including the crawler, using Python.

http://www.udacity.com/overview/...

BTW, the crawler for the original Google search engine was written in Python; it was
written by Scott Hassan.


Chris Heller, Web crawlers were once a passion.

Written May 7, 2014
Here are the basics of how you bootstrap your web crawler from scratch.

Your web crawler is really just a function. Its input is a set of seed URLs (or entry
points), and its output is a set of HTML pages (or results).

Inside this web crawler function is something called the "horizon" which is just a list
containing all unvisited URLs that your crawler intends to visit.

The initial value of the horizon is your set of entry points.

The computation performed by this function is as follows:

1. Continuously remove a URL from the horizon

2. Issue an HTTP GET on the URL
3. Download the contents, add the contents to the results
4. Parse the contents for new URLs and append any unvisited URLs to the
horizon.

Your crawl is complete when the horizon is empty.
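
The four steps above can be sketched as a single Python function. This is a minimal illustration, not a production design: the function name `crawl`, the `fetch` parameter, the regular expression used to find links, and the in-memory `PAGES` dictionary are all assumptions made for the example. Passing the fetcher in as a parameter keeps the sketch runnable without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Deliberately naive link extraction; a real crawler would use an HTML parser.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(seeds, fetch):
    """Crawl from the seed URLs; return a dict mapping URL to page content.

    fetch is any callable that takes a URL and returns its HTML as a str.
    """
    horizon = deque(seeds)   # unvisited URLs the crawler intends to visit
    known = set(seeds)       # every URL that was ever put on the horizon
    results = {}
    while horizon:                           # the crawl ends when this is empty
        url = horizon.popleft()              # 1. remove a URL from the horizon
        html = fetch(url)                    # 2. issue the GET
        results[url] = html                  # 3. add the contents to the results
        for href in HREF_RE.findall(html):   # 4. parse the contents for new URLs
            link = urljoin(url, href)
            if link not in known:
                known.add(link)
                horizon.append(link)
    return results


# Usage with an in-memory "web" so the sketch runs offline.
PAGES = {
    "http://site.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://site.test/a": '<a href="/">home</a>',
    "http://site.test/b": "no links here",
}
results = crawl(["http://site.test/"], lambda url: PAGES.get(url, ""))
```

To crawl real pages, `fetch` could wrap `urllib.request.urlopen`. The `known` set is what makes "unvisited" checkable: without it, the link from `/a` back to the home page would keep the horizon from ever emptying.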

I think that is a sufficient backbone to build off of. Certainly any production crawler
will have much more that needs to be engineered.

Missing features of the above crawler include, but are not limited to:

1. Robots.txt handling
2. Crawl depth limiting
3. Request rate limiting
4. Incremental crawling
5. Non HTML media type support
6. Page deduplication (URL canonicalization)
7. And many more!
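
As an illustration of item 6, here is a minimal sketch of URL canonicalization using only the standard library. The function name `canonicalize` and the specific rules are assumptions chosen for the example: lowercase the scheme and host, drop default ports, drop the fragment, and sort query parameters. Real crawlers apply more rules, and sorting parameters is only safe for sites where parameter order does not matter. The sketch also assumes plain hostnames (no IPv6 literals) and discards any user:password part.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of url for use as a dedup key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # An empty path and "/" name the same resource.
    path = parts.path or "/"
    # Sort query parameters so that ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


a = canonicalize("HTTP://Example.com:80/index.html?b=2&a=1#top")
b = canonicalize("http://example.com/index.html?a=1&b=2")
```

Both calls produce the same string, so a visited-set keyed on the canonical form treats them as one page.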

Preetish Panda, Product, Marketing and Technology

Written Dec 20
Here are the minimum components required for a web crawler:

1. HTTP Fetcher: retrieves the web pages from the target site's servers

2. Dedup: avoids duplicate data

3. Extractor: extracts URLs from the links found on fetched pages

4. URL Queue Manager: queues the URLs to crawl, ordered by priority
5. Database: stores the extracted data for further analysis and application in the business
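
To make components 2 and 4 concrete, here is a minimal sketch of a URL queue that orders URLs by priority and rejects URLs it has already seen. The class name `UrlQueue` and the convention that a lower number means higher priority are assumptions for the example, not part of any particular crawler.

```python
import heapq
import itertools


class UrlQueue:
    """Priority queue of URLs with built-in dedup."""

    def __init__(self):
        self._heap = []
        self._queued = set()                # every URL ever pushed
        self._counter = itertools.count()   # tie-breaker: FIFO within a priority

    def push(self, url, priority):
        """Queue url unless it was queued before; lower priority pops first."""
        if url in self._queued:
            return False
        self._queued.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def pop(self):
        """Remove and return the URL with the lowest priority number."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = UrlQueue()
queue.push("http://a.test/", priority=2)
queue.push("http://b.test/", priority=1)
duplicate_added = queue.push("http://a.test/", priority=0)  # already queued
next_url = queue.pop()
```

In this sketch a duplicate push is ignored even if it carries a better priority; whether to re-prioritize in that case is a design decision the list above leaves open.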

While crawling large scale websites, you need to factor in the following:

1. I/O mechanism

2. Multi-threading architecture

3. Crawl depth setting

4. DNS resolving

5. Robots.txt management

6. Request rate management

7. Support for non-HTML media

8. De-duplication

9. Canonicalization of URL for unique parsing

10. Distributed crawling mechanism

11. Server communication

Apart from that you need to ensure that the choice of programming language is correct
so that you can extract maximum utility from the web scraper. Many prefer Python and
Perl to do most of the heavy lifting in the scraping exercise.

Building a simple crawler

Here are the key steps that would be taken by the crawler:

1. Begin with a list of websites to be crawled

2. For each URL in the list, the crawler issues an HTTP GET request and
retrieves the web page content

3. Parse the HTML content of the page and extract the URLs the crawler
still needs to crawl

4. Update the list with the new URLs and continue crawling

A successful crawler must consider the server load it places on the sites it requests.
As a rule of thumb, limit the crawling frequency for a site to once or twice a day
(a reasonable frequency that will not crash the server).
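
Server load can also be limited per request rather than per day. The sketch below is one possible approach, not a prescription: it reads a site's robots.txt rules with the standard library's `urllib.robotparser` and spaces out requests to the same host. The `PoliteScheduler` class, the 1-second default delay, the user agent name "MyCrawler", and the sample robots.txt content are all assumptions for illustration. The rules are parsed from a string so that the example runs without network access.

```python
import urllib.robotparser
from urllib.parse import urlsplit

# Sample rules; a real crawler would download http://<host>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = urllib.robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

allowed = rules.can_fetch("MyCrawler", "http://site.test/public/page")
blocked = rules.can_fetch("MyCrawler", "http://site.test/private/page")
delay = rules.crawl_delay("MyCrawler")   # seconds requested by the site, or None


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.next_allowed = {}   # host -> timestamp of the next free slot

    def wait_time(self, url, now, delay=None):
        """Return the seconds to wait before fetching url, and book the slot."""
        host = urlsplit(url).netloc
        delay = self.default_delay if delay is None else delay
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + delay
        return ready_at - now


scheduler = PoliteScheduler()
# Timestamps are passed in explicitly; real code could use time.monotonic().
w1 = scheduler.wait_time("http://site.test/a", now=100.0, delay=delay)
w2 = scheduler.wait_time("http://site.test/b", now=101.0, delay=delay)
w3 = scheduler.wait_time("http://other.test/", now=101.0)
```

The second request to `site.test` has to wait because the first one booked the host for 5 seconds, while the request to a different host can go out immediately.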
Peter Jaap, ecommerce | webdeveloper | entrepreneur | Magento evangelist |
Drupalist
Written Aug 7, 2012
I've always built my spiders (in PHP) with either Simple HTML DOM Parser
(http://simplehtmldom.sourceforge...) or Snoopy (http://sourceforge.net/projects/...).
The first is easier to use when traversing large DOM structures (especially if you
are used to the jQuery CSS selector syntax), while Snoopy is a bit more abstract in use.


Jesus Vassa
Written Jul 1, 2015
Building a web crawler from scratch is not an easy task at all. It requires a great deal of
time and dedication, and you might get stuck halfway through due to its
complexity. So, I would suggest hiring experts who can guide you while
you build your web crawler.

You could consider getting in touch with the expert team from this website. They are
the world’s most sophisticated ecommerce platform that provides A-Z of ecommerce
solutions.

Their Easy Data Feed tool is a data extraction software, which is designed to quickly
and easily download inventory, pricing, and product information. These data can be
downloaded into a usable spreadsheet from your drop ship supplier’s online portal
without relying on the drop shipper. Also, they can extract inventory data from any
website, password protected website, API, email or FTP and plug it right into your
store.


Adrian Balcan, I have 3 years experience in web crawling

Written Jan 30, 2016
It's difficult to build a web crawler. We succeeded only after a few months of work, and
we already had 3 years of experience in custom crawling/scraping services.

Now we have a polite web crawler written in Python, which we call TWMBot. You can see
more details on our website: http://thewebminer.com

Please tell us if we can help you with any crawling task.
