A BitTorrent Client in Python 3.5 Markus Eliasson
A BitTorrent Client in Python 3.5 Markus Eliasson
When Python 3.5 was released together with the new module asyncio I was curios to give it a try. Recently I decided to implement
a simple BitTorrent client using asyncio - I have always been interested in peer-to-peer protocols and it seemed like a perfect fit.
The project is named Pieces, all of the source code is available at GitHub and released under the Apache 2 license. Feel free to
learn from it, steal from it, improve it, laugh at it or just ignore it.
I previously posted a short introduction to Python’s async module. If this is your first time looking at asyncio it might be a good
idea to read through that one first.
An introduction to BitTorrent
BitTorrent has been around since 2001 when Bram Cohen authored the first version of the protocol. The big breakthrough was
when sites as The Pirate Bay made it popular to use for downloading pirated material. Streaming sites, such as Netflix, might have
resulted in a decrease of people using BitTorrent for downloading movies. But BitTorrent is still used in a number of different,
legal, solutions where distribution of larger files are important.
BitTorrent is a peer-to-peer protocol, where peers join a swarm of other peers to exchange pieces of data between each other. Each
peer is connected to multiple peers at the same time, and thus downloading or uploading to multiple peers at the same time. This
is great in terms of limiting bandwidth compared to when a file is downloaded from a central server. It is also great for keeping a
file available as it does not rely on a single source being online.
There is a .torrent file that regulates how many pieces there is for a given file(s), how these should be exchanged between peers,
as well as how the data integrity of these pieces can be confirmed by clients.
While going through the implementation it might be good to have read, or to have another tab open with the Unofficial
BitTorrent Specification. This is without a doubt the best source of information on the BitTorrent protocol. The official
specification is vague and lacks certain details so the unofficial is the one you want to study.
Things like:
Bencoding supports four different data types, dictionaries, lists, integers and strings - it is fairly easy translate to Python’s object
literals or JSON.
Below is bencoding described in Augmented Backus-Naur Form courtesy of the Haskell library.
In pieces the encoding and decoding of bencoded data is implemented in the pieces.bencoding module (source code).
Here are a few examples decoding bencoded data into a Python representation using that module.
Likewise, a Python object structure can be encoded into a bencoded byte string using the same module.
>>> Encoder(123).encode()
b'i123e'
The parser implementation is pretty straight forward, no asyncio is used here though, not even reading the .torrent from disk.
Using the parser from pieces.bencoding , let’s open the .torrent for the popular Linux distribution Ubuntu:
Here you can read see some of the meta-data such as the name of the destination file (ubuntu-16.04-desktop-amd64.iso) and the
total size in bytes (1485881344).
Notice how the keys used in the OrderedDict are binary strings. Bencoding is a binary protocol, and using UTF-8 strings as keys
will not work!
A wrapper class pieces.torrent.Torrent exposing these properties is implemented abstracting the binary strings, and other
details away from the rest of the client. This class only implements the attributes used in pieces client.
I will not go through which attributes that is available, instead the rest of this article will refer back to attributes found in the
.torrent / meta-info were used.
Parameter Description
info_hash The SHA1 hash of the info dict found in the .torrent
peer_id A unique ID generated for this client
uploaded The total number of bytes uploaded
downloaded The total number of bytes downloaded
left The number of bytes left to download for this client
port The TCP port this client listens on
compact Whether or not the client accepts a compacted list of peers or not
The peer_id needs to be exactly 20 bytes, and there are two major conventions used on how to generate this ID. Pieces follows the
Azureus-style convention generating peer id like:
d8:completei3651e10:incompletei385e8:intervali1800e5:peers300:
£¬%ËÌyOkÝ.ê@_<K+Ô\Ý Ámb^TnÈÕ^AËO*ÈÕ1*ÈÕ>¥³ÈÕBä)ðþ¸ÐÞ¦Ô/ãÈÕÈuÉæÈÕ
...
The response data is truncated since it contains binary data that screws up the Markdown formatting.
- interval - The interval in seconds until the client should make a new announce call to the tracker.
- peers - The list of peers is a binary string with a length of multiple of 6 bytes. Where each peer consist of a 4 byte IP address
and a 2 byte port number (since we are using the compact format).
So, a successful announce call made to the tracker, gives you a list of peers to connect to. This might not be all available peers in
this swarm, only the peers the tracker assigned your client to connect. A subsequent call to the tracker might result in another list
of peers.
Async HTTP
Python does not come with a built-in support for async HTTP and my beloved requests library does not implement asyncio
either. Scouting around the Internet it looks like most use aiohttp, which implement both a HTTP client and server.
Pieces use aiohttp in the pieces.tracker.Tracker class for making the HTTP request to the tracker announce url. A shortened
version of that code is this:
The method is declared using async and uses the new asynchronous context manager async with to allow being suspended
while the HTTP call is being made. Given a successful response, this method will be suspended again while reading the binary
response data await response.read() . Finally the response data is wrapped in a TrackerResponse instance containing the list of
peers, alternative an error message.
The result of using aiohttp is that our event loop is free to schedule other work while we have an outstanding request to the
tracker.
The loop
Everything up to this point could really have been made synchronously, but now that we are about to connect to multiple peers
we need to go asynchronous.
The main function in pieces.cli is responsible for setting up the asyncio event loop. If we get rid of some argparse and error
handling details it would look something like this (see cli.py for the full details).
import asyncio
loop = asyncio.get_event_loop()
client = TorrentClient(Torrent(args.torrent))
task = loop.create_task(client.start())
try:
loop.run_until_complete(task)
except CancelledError:
logging.warning('Event loop was canceled')
We start off by getting the default event loop for this thread. Then we construct the TorrentClient with the given Torrent (meta-
info). This will parse the .torrent file and validate everything is ok.
Calling the async method client.start() and wrapping that in a asyncio.Future and later adding that future and instructing the
event loop to keep running until that task is complete.
Is that it? No, not really - we have our own loop (not event loop) implemented in the pieces.client.TorrentClient that sets up the
peer connections, schedules the announce call, etc.
TorrentClient is something like a work coordinator, it starts by creating a async.Queue which will hold the list of available peers
that can be connected to.
Then it constructs N number of pieces.protocol.PeerConnection which will consume peers from off the queue. These
PeerConnection instances will wait ( await ) until there is a peer available in the Queue for one of them to connect to (not blocking).
Since the queue is empty to begin with, no PeerConnection will do any real work until we populate it with peers it can connect to.
This is done in a loop inside of TorrentClient.start .
while True:
if self.piece_manager.complete:
break
if self.abort:
break
current = time.time()
if (not previous) or (previous + interval < current):
response = await self.tracker.connect(
first=previous if previous else False,
uploaded=self.piece_manager.bytes_uploaded,
downloaded=self.piece_manager.bytes_downloaded)
if response:
previous = current
interval = response.interval
self._empty_queue()
for peer in response.peers:
self.available_peers.put_nowait(peer)
else:
await asyncio.sleep(5)
self.stop()
So, each time an announce call is made to the tracker, the list of peers to connect to is reset, and if no peers are retrieved, no
PeerConnection will run. This goes on until the download is complete or aborted.
First, lets go through the different parts of the peer protocol, and then go through how it is all implemented.
Handshake
The first message sent needs to be a Handshake message, and it is the connecting client that is responsible for initiating this.
Immediately after sending the Handshake, our client should receive a Handshake message sent from the remote peer.
If the info_hash does not match the torrent we are about to download, we close the connection.
Immediately after the Handshake, the remote peer may send a BitField message. The BitField message serves to inform the
client on which pieces the remote peer have. Pieces support receiving a BitField message, and most BitTorrent clients seems to
send it - but since pieces currently does not support seeding, it is never sent, only received.
The BitField message payload contains a sequence of bytes that when read binary each bit will represent one piece. If the bit is 1
that means that the peer have the piece with that index, while 0 means that the peer lacks that piece. I.e. Each byte in the payload
represent up to 8 pieces with any spare bits set to 0 .
Each client starts in the state choked and not interested. That means that the client is not allowed to request pieces from the remote
peer, nor do we have intent of being interested.
- Choked A choked peer is not allowed to request any pieces from the other peer.
- Unchoked A unchoked peer is allowed to request pieces from the other peer.
- Interested Indicates that a peer is interested in requesting pieces.
- Not interested Indicates that the peer is not interested in requesting pieces.
Consider Choked and Unchoked to be rules and Interested and Not Interested to be intents between two peers.
After the handshake we send an Interested message to the remote peer, telling that we would like to get unchoked in order to
start requesting pieces.
Until the client receives an Unchoke message - it may not request a piece from its remote peer - the PeerConnection will be choked
(passive) until either unchoked or disconnected.
The following sequence of messages is what we are aiming for when setting up a PeerConnection :
Handshake
client --------------> peer We are initiating the handshake
Handshake
client <-------------- peer Comparing the info_hash with our hash
BitField
client <-------------- peer Might be receiving the BitField
Interested
client --------------> peer Let peer know we want to download
Unchoke
client <-------------- peer Peer allows us to start requesting pieces
Requesting pieces
As soon as the client gets into a unchoked state it will start requesting pieces from the connected peer. The details surrounding
which piece to request is detailed later, in Managing the pieces.
If we know that the other peer have a given piece, we can send a Request message asking the remote peer to send us data for the
specified piece. If the peer complies it will send us a corresponding Piece message where the message payload is the raw data.
This client will only ever have one outstanding Request per peer and politely wait for a Piece message until taking the next
action. Since connections to multiple peers are open concurrently, the client will have multiple Requests outstanding but only one
per connection.
If, for some reason, the client do not want a piece anymore, it can send a Cancel message to the remote peer to cancel any
previously sent Request .
Other messages
Have
The remote peer can at any point in time send us a Have message. This is done when the remote peer have received a piece and
makes that piece available for its connected peers to download.
KeepAlive
The KeepAlive message can be sent at anytime in either direction. The message does not hold any payload.
Implementation
The PeerConnection opens a TCP connection to a remote peer using asyncio.open_connection to asynchronously open a TCP
connection that returns a tuple of StreamReader and a StreamWriter . Given that the connection was created successfully, the
PeerConnection will send and receive a Handshake message.
Once a handshake is made, the PeerConnection will use an asynchronous iterator to return a stream of PeerMessages and take the
appropriate action.
Using an async iterator separates the PeerConnection from the details on how to read from sockets and how to parse the
BitTorrent binary protocol. The PeerConnection can focus on the semantics regarding the protocol - such as managing the peer
state, receiving the pieces, closing the connection.
An asynchronous iterator is a class that implements the methods __aiter__ and __anext__ which is just async versions of
Python’s standard iterators that have implements the methods, __iter__ and next .
Upon iterating (calling next) the PeerStreamIterator will read data from the StreamReader and if enough data is available try to
parse and return a valid PeerMessage .
The BitTorrent protocol uses messages with variable length, where all messages takes the form:
<length><id><payload>
So as soon as the buffer have enough data for the next message it will be parsed and returned from the iterator.
All messages are decoded using Python’s module struct which contains functions to convert to and from Pythons values and C
structs. Struct use compact strings as descriptors on what to convert, e.g. >Ib reads as “Big-Endian, 4 byte unsigned integer, 1 byte
character.
This makes it easy to create unit tests to encode and decode messages. Let’s have a look on the tests for the Have message:
class HaveMessageTests(unittest.TestCase):
def test_can_construct_have(self):
have = Have(33)
self.assertEqual(
have.encode(),
b"\x00\x00\x00\x05\x04\x00\x00\x00!")
def test_can_parse_have(self):
have = Have.decode(b"\x00\x00\x00\x05\x04\x00\x00\x00!")
self.assertEqual(33, have.index)
From the raw binary string we can tell that the Have message have a length of 5 bytes \x00\x00\x00\x05 an id of value 4 \x04 and
the payload is 33 \x00\x00\x00! .
Since the message length is 5 and ID only use a single byte we know that we have four bytes to interpret as the payload value.
Using struct.unpack we can easily convert it to a python integer like:
That is basically it regarding the protocol, all messages follow the same procedure and the iterator keeps reading from the socket
until it gets disconnected. See the source code for details on all messages.
So far we have only discussed pieces - pieces of data being exchanged by two peers. It turns out that pieces is not the entire truth,
there is one more concept - blocks. If you have looked through any of the source code you might have seen code refering to blocks,
so lets go through what a piece really is.
A piece is, unsurprisingly, a partial piece of the torrents data. A torrent’s data is split into N number of pieces of equal size (except
the last piece in a torrent, which might be of smaller size than the others). The piece length is specified in the .torrent file.
Typically pieces are of sizes 512 kB or less, and should be a power of 2.
Pieces are still too big to be shared efficiently between peers, so pieces are further divided into something referred to as blocks.
Blocks is the chunks of data that is actually requested between peers, but pieces are still used to indicate which peer that have
which pieces. If only blocks should have been used it would increase the overhead in the protocol greatly (resulting in longer
BitFields, more Have message and larger .torrent files).
A block is 2^14 (16384) bytes in size, except the final block that most likely will be of a smaller size.
piece 0:
block 0: 16 384 bytes (2^14)
block 1: 16 384 bytes
block 2: 16 384 bytes
= 49 152 bytes
piece 1:
block 0: 16 384 bytes
block 1: 16 384 bytes
block 2: 16 384 bytes
= 49 152 bytes
piece 2:
block 0: 16 384 bytes
block 1: 16 384 bytes
block 2: 4 096 bytes
= 36 864 bytes
Exchanging these blocks between peers is basically what BitTorrent is about. Once all blocks for a piece is done, that piece is
complete and can be shared with other peers (the Have message is sent to connected peers). And once all pieces are complete the
peer transform from a downloader to only be a seeder.
1. The official specification refer to both pieces and blocks as just pieces which is quite confusing. The unofficial specification and
others seem to have agreed upon using the term block for the smaller piece which is what we will use as well.
2. The official specification is stating another block size that what we use. Reading the unofficial specification, it seems that 2^14
bytes is what is agreed among implementers - regardless of the official specification.
The implementation
When a PeerConnection successfully handshakes with another peer and receives a BitField message it will inform the
PieceManager which peer ( peer_id ) that have which pieces. This information will be updated on any received Have message as
well. Using this information, the PeerManager knows the collective state on which pieces that are available from which peers.
When the first PeerConnection goes into a unchoked state it will request the next block from its peer. The next block is determined
by calling the method PieceManager.next_request .
The next_request implements a very simple strategy on which piece to request next.
1. When the PieceManager is constructed all pieces and blocks are pre-constructed based on the piece length from the .torrent
meta-info
2. All pieces are put in a missing list
3. When next_request is called, the manager will do one of:
- Re-request any previously requested block that has timed-out
- Requst the next block in an ongoing piece
- Request the first block in the next missing piece
This way the blocks and pieces will be requsted in order. However, multiple pieces might be ongoing based on which piece a client
have.
Since pieces aims to be a simple client, no effort have been made on implementing a smart or efficient strategy for which pieces to
request. A better solution would be to request the rarest piece first, which would make the entire swarm healthier as well.
Whenever a block is received from a peer, it is stored (in memory) by the PieceManager. When all blocks for a piece is retrieved, a
SHA1 hash is made on the piece. This hash is compared to the SHA1 hashes include in the .torrent info dict - if it matches the
piece is written to disk.
When all pieces are accounted for (matching hashes) the torrent is considered to be complete, which stops the TorrentClient
closing any open TCP connection and as a result the program exits with a message that the torrent is downloaded.
Future work
Seeding is not yet implemented, but it should not be that hard to implement. What is needed is something along the lines of this:
Whenever a peer is connected to, we should send a BitField message to the remote peer indicating which pieces we have.
Whenever a new piece is received (and correctness of hash is confirmed), each PeerConnection should send a Have message
to its remote peer to indicate the new piece that can be shared.
In order to do this the PieceManager needs to be extended to return a list of 0 and 1 for the pieces we have. And the TorrentClient
to tell the PeerConnection to send a Have to its remote peer. Both BitField and Have messages should support encoding of these
messages.
Having seeding implemented would make Pieces a good citizen, supporting both downloading and uploading of data within the
swarm.
Additional features that probably can be added without too much effort is:
Multi-file torrent, will hit PieceManager , since Pieces and Blocks might span over multiple files, it affects how files are
persisted (i.e. a single block might contain data for more than one file).
Resume a download, by seeing what parts of the file(s) are already downloaded (verified by making SHA1 hashes).
Summary
It was real fun to implement a BitTorrent client, having to handle binary protocols and networking was great to balance all that
recent web development I have been doing.
Python continues to be one of my favourite programming language. Handling binary data was a breeze given the struct module
and the recent addition asyncio feels very pythonic. Using async iterator to implement the protocol turned out to be a good fit as
well.
Hopefully this article inspired you to write a BitTorrent client of your own, or to extend pieces in some way. If you spot any error
in the article or the source code, feel free to open an issue over at GitHub.
Python BitTorrent
A thorough technical lead with a passion for producing valuable and clean code. Tends to occasionally blog about building software and can't seem to make up
his mind on which programming language to use next.
Consultant at
Markus Eliasson.