Attaching Semantic Metadata To Cryptocurrency Transactions
1 Introduction
[Figure 1. Diagram labels: Address 1xf, Address 3rx, Combinations, Bitcoin Network, Address 2xf, Transaction fee.]
[2] summarizes it as follows: small payloads with high value need to be publicly
broadcast and permanently recorded in an asynchronous, pay-as-you-go way.
Bitcoin adopters started to devise “creative ways” of realizing these use cases
by making use of the elements at their disposal, that is, encoding data into
addresses, values and lock scripts (see [3] for a compilation of techniques). After
a long internal debate, the Bitcoin community reached consensus on letting users
attach data to a transaction through a special kind of output, commonly known
as OP_RETURN. OP_RETURN is a single instruction of the Script language that
implements a lock script that always returns False. As such, any value container
locked with OP_RETURN cannot be used as input in any other transaction. In our
formalisation, we define this as a special output. Figure 2 extends Figure 1 to
include a data output. Note that the data output lock script only prevents its
reuse as input to another transaction; anyone can read the data.

[Figure 2. Diagram labels: Input, Output, Address 1xf, Address 3rx, Combinations, Bitcoin Network, Address 2xf, Transaction fee, Data Output.]
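To make the notion of a data output more concrete, the following is a minimal Python sketch of how the lock script of such an output could be assembled. It assumes the standard Script push-data encoding (a single length byte for payloads of up to 75 bytes, OP_PUSHDATA1 for payloads of up to 255 bytes); the function name build_data_output_script is hypothetical and not part of any Bitcoin library.

```python
OP_RETURN = 0x6A     # Script opcode that makes the output unspendable
OP_PUSHDATA1 = 0x4C  # opcode announcing a one-byte length prefix


def build_data_output_script(payload: bytes) -> bytes:
    """Return the lock script of a data output carrying `payload` (illustrative)."""
    if len(payload) <= 75:
        # Short payloads are pushed with a single length byte.
        push = bytes([len(payload)])
    elif len(payload) <= 255:
        # Longer payloads need OP_PUSHDATA1 followed by the length.
        push = bytes([OP_PUSHDATA1, len(payload)])
    else:
        raise ValueError("payload too large for this sketch")
    return bytes([OP_RETURN]) + push + payload


script = build_data_output_script(b"hello")
print(script.hex())  # 6a0568656c6c6f
```

Under this encoding, an 80-byte payload yields a script of 83 bytes in total (one byte for OP_RETURN, two for the push prefix).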
The Bitcoin protocol limits by default the number of data outputs to one
per transaction. However, miners may choose to ignore transactions that include
data outputs, or agree to validate transactions with more than one data output
(up to the maximum limit of 11 outputs). For the rest of the paper, we assume
that all miners implement the default behaviour. A recent study [1] analyses the
metadata embedded in the blockchain using the OP_RETURN code, finding
that 1% of transactions make use of it, representing 0.3% of the size of the
ledger.
In general, there are two strategies for using the small amount of data allowed
per transaction. The first, which we call max-compression, is to compress the
data as much as possible, ideally so that it fits in the transaction being
described. If this is not possible, the compressed data is split into several
chunks and additional transactions are created (for example from the sender to
itself, to minimise the currency cost) to carry the rest of the chunks. The
sender keeps references to the set of transactions carrying data chunks, so that
the data can be reconstructed from the payloads of the different transactions,
and shares these references when necessary.
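The max-compression strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib stands in for whichever compressor is chosen, 83 bytes is the default per-transaction figure used in this paper, and the function names are hypothetical. Creating and broadcasting the carrier transactions is out of scope here.

```python
import zlib
from typing import List

MAX_PAYLOAD = 83  # bytes available per data output (default figure in this paper)


def split_into_chunks(document: bytes, size: int = MAX_PAYLOAD) -> List[bytes]:
    """Compress the document and cut it into payload-sized chunks."""
    compressed = zlib.compress(document, 9)
    return [compressed[i:i + size] for i in range(0, len(compressed), size)]


def reassemble(chunks: List[bytes]) -> bytes:
    """Rebuild the document from the chunks, given in their original order."""
    return zlib.decompress(b"".join(chunks))


document = b"<http://example.com/s> <http://example.com/p> <http://example.com/o> .\n" * 20
chunks = split_into_chunks(document)
print(len(chunks), "transaction(s) needed")
assert reassemble(chunks) == document
```

The first chunk would travel in the transaction being described; every further chunk requires an extra transaction, and the sender has to remember their order.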
The second strategy, which we call hash-out, uses a single data output to
store a hash of the data. The hash is then used both as a key to retrieve the
document from an external store, and to verify that the data has not been
altered by the manager of the external store. Several companies that offer
services on top of Bitcoin use either strategy to link transactions to several
kinds of objects: short string messages (Eternity Wall), PDF documents
(CoinSpark), or other abstract digital assets that use Bitcoin as an underlying
transaction layer (Open Assets, CounterParty).
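The hash-out strategy can be sketched as follows. The sketch assumes SHA-256 as the hash function (the text does not prescribe one) and uses an in-memory dictionary as a stand-in for the external store; all names are hypothetical.

```python
import hashlib

external_store = {}  # stand-in for an external document store


def hash_out(document: bytes) -> bytes:
    """Store the document externally and return the hash to embed on-chain."""
    digest = hashlib.sha256(document).digest()  # 32 bytes, fits in one data output
    external_store[digest] = document
    return digest


def retrieve_and_verify(digest: bytes) -> bytes:
    """Fetch the document by its hash and check it has not been altered."""
    document = external_store[digest]
    if hashlib.sha256(document).digest() != digest:
        raise ValueError("stored document does not match the on-chain hash")
    return document


on_chain_payload = hash_out(b"<rdf:RDF> ... </rdf:RDF>")
assert retrieve_and_verify(on_chain_payload) == b"<rdf:RDF> ... </rdf:RDF>"
```

If the external store returns altered data, the check fails: the hash certifies the corruption but cannot repair it.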
Table 1 compares both strategies. Those who choose to maximize compression
can be sure that data outputs have the same guarantees as the transactions they
are attached to, but are either limited to 83 bytes or forced to pay a
transaction fee for every extra chunk required. Those who choose to hash out
have space bounded only by the limit of the external storage (certainly more
than 83 bytes), but only the hash has the same guarantees as the transaction.
If the external storage fails and the data gets corrupted, there is no way to
recover it; one can only use the hash to certify the corruption.
Table 1. Max-compression versus hash-out.

               Max-Compress                       Hash-Out
Write-Latency  Time to confirm all transactions   Time to confirm transaction
               carrying a chunk of data           being described
Write-Cost     Sum of fees of all transactions    Fee of transaction
               carrying a chunk of data           being described
Integrity      Same as transaction                Same as transaction
Persistence    Same as transaction                Same as transaction for hash;
                                                  same as external storage for data
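The write-cost row of the comparison can be turned into a small calculation. The sketch below assumes a flat fee per transaction (real fees depend on transaction size and network conditions) and the 83-byte default payload; the function names and the fee unit are illustrative only.

```python
def max_compress_cost(compressed_size: int, fee_per_tx: int, capacity: int = 83):
    """Return (number of transactions, total fee) for max-compression."""
    # Ceiling division: every started chunk needs its own transaction.
    n_transactions = max(1, -(-compressed_size // capacity))
    return n_transactions, n_transactions * fee_per_tx


def hash_out_cost(fee_per_tx: int):
    """Hash-out always needs exactly one transaction, whatever the data size."""
    return 1, fee_per_tx


print(max_compress_cost(200, fee_per_tx=10))  # (3, 30)
print(hash_out_cost(fee_per_tx=10))           # (1, 10)
```

With these assumptions, max-compression only matches the cost of hash-out when the compressed document fits in a single data output.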
Several compression algorithms have been developed for RDF data, but none of
them for the cryptocurrency transaction scenario. RDF compression algorithms
can be divided into two categories according to their purpose: IoT devices and
Publication and Exchange. Compression for IoT devices is motivated by the
semantization of the Internet of Things, to make devices compatible with the
Web of Data. As IoT devices have limited computing, communication, memory and
energy resources, a whole corpus of research has been devoted to the most
appropriate data formats, processing algorithms and protocols to manage
Semantic Data for IoT (see [12] for an overview).
In the Publication and Exchange scenario, semantic data publishers are looking
to optimize the way in which they archive, store and serve Semantic Data.
As opposed to the IoT device scenario, publishers have much more computing,
communication, memory and energy resources available, but need to serve large
amounts of data to a large number of clients. The focus in this scenario is the
binary representation of Big Semantic Data, the development of streamable
formats that improve data transfer between publishers and consumers, and the
provision of query interfaces on top of these compressed formats.
To provide insight into what the most appropriate compression algorithm for
Bitcoin transactions should be, we compare side by side the scenarios of IoT
devices, Publication and Exchange, and Bitcoin transactions across the key
parameters of both use cases.
5 Experimental Study
Listing 1.1. The complete RDF/XML document used, referred to as ‘Full Document’
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:prov="http://www.w3.org/ns/prov#">
  <prov:Activity rdf:about="http://bitcoin.org/56754644">
    <prov:wasDerivedFrom>
      <prov:Entity rdf:about="http://bitcoin.org/41565751"/>
    </prov:wasDerivedFrom>
    <prov:wasStartedBy>
      <prov:Agent rdf:about="http://example.com/bob-from-finance"/>
    </prov:wasStartedBy>
    <prov:wasAssociatedWith>
      <prov:Agent rdf:about="http://example.com/alice-from-finance"/>
    </prov:wasAssociatedWith>
    <prov:wasInformedBy>
      <prov:Activity rdf:about="http://example.com/procurement-ticket"/>
    </prov:wasInformedBy>
    <rdf:type rdf:resource="http://www.w3.org/ns/prov#Entity"/>
  </prov:Activity>
</rdf:RDF>
Listing 1.2. Stripped down document with one piece of information, and a seeAlso
reference. Referred to as ‘Small Document’
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:prov="http://www.w3.org/ns/prov#"
    xml:base="http://example.com/">
  <prov:Activity rdf:about="http://bitcoin.org/56754644">
    <prov:wasInformedBy>
      <prov:Activity rdf:about="/procurement-ticket"/>
    </prov:wasInformedBy>
    <rdfs:seeAlso rdf:resource="http://bitcoin.org/41565751"/>
  </prov:Activity>
</rdf:RDF>
HDT [5] is a binary representation designed for the Publish/Exchange use case.
HDT decomposes an RDF dataset into three parts: header, dictionary and triples.
The header holds metadata describing the HDT semantic dataset using plain RDF.
It acts as an entry point for the consumer, who can have an initial idea of key
properties of the content even before retrieving the whole dataset. The main
motivation behind HDT is the ability to provide a simple query interface to
large datasets that otherwise wouldn’t fit in memory; as such, we expected it
not to perform well in this use case. For the experiment, we used the HDT CPP
Docker container with version 1.1.1, commit 421165e.
SHDT [8] is a simplified version of HDT, adapted for working on IoT devices,
that improves the memory and energy footprint of HDT in exchange for a lower
compression ratio. As is the case with HDT, this tradeoff is in principle not
appropriate for our scenario. The implementation is part of the Wiselib library
(footnote 11), which targets embedded devices; we encountered difficulties
compiling it for a PC, so this algorithm was not run.

11 https://github.com/ibr-alg/wiselib
RDF4J Binary RDF (formerly Sesame) (footnote 12) was the binary encoding
algorithm of the Sesame library, which is now continued as RDF4J (footnote 13).
We used the Java RDF4J library, version 2.2.0.
ERI [4] aims at improving over RDSZ by exploiting structural information that
is known beforehand between publisher and consumer. ERI and RDSZ have
similar performance, one being better than the other depending on the
underlying distribution of predicates and entities of the input RDF dataset. We
used the prototype release of the GitHub repository (footnote 15), at commit
a99ff03, without additional configuration. Both RDSZ and ERI had only been
tested with large input datasets; therefore, we did not know what to expect in
our setup.
EXI for RDF [9] leverages W3C’s Efficient XML Interchange (EXI) format [11]
to achieve an efficient binary representation of RDF from its XML
representation. EXI uses a grammar-driven approach to represent XML-based data
in an efficient binary form and vice versa. The grammar is derived from a given
XML Schema, where each defined complex type is represented as a deterministic
finite automaton. [9] explores the use of two types of grammars: a generic one,
which allows the encoding of RDF that uses several vocabularies but with a
limited compression ratio, and concrete ones. The generic grammars follow EXI by
using string tables to map unknown elements and attributes, as well as strings,
to an ID, which is then managed to ensure consistency with the repository. This
store allows a triple to be represented in a compact form based on these IDs.
Using this format allows compressed RDF data to be queried by the client,
without the client having pre-existing knowledge of the RDF graph [9]. The
concrete grammars remove the need for this, since the available elements are
almost entirely known from the schema and can be defined accordingly, removing
the need for storage and processing of strings [9]. EXI with a concrete grammar
was reported to compress 20 triples in 78 bytes, orders of magnitude better
than Thrift. As such, we expected it to be the best performer in our setup.

12 http://rdf4j.org/
13 http://docs.rdf4j.org/rdf4j-binary/
14 https://bitbucket.org/norbertofdz/rdsz
15 https://github.com/webdata/ERI
16 https://afs.github.io/rdf-thrift/rdf-binary-thrift.html
We used the EXIficient GUI obtained on 24 July 2017. The GUI provides
the ability to create grammars from XML Schemas. We re-used the generic
schema of the uRDF store. To generate a concrete grammar, we gave the GUI
the XML Schema representation of the PROV ontology as input.
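The string-table idea used by the generic grammars can be illustrated with a toy dictionary encoder. This is not the EXI algorithm or the EXIficient API, only a sketch of mapping strings to integer IDs so that a triple becomes a tuple of small numbers; the class name StringTable is hypothetical.

```python
class StringTable:
    """Toy string table: assigns consecutive integer IDs to unseen strings."""

    def __init__(self):
        self._ids = {}       # string -> ID
        self._strings = []   # ID -> string

    def encode(self, value: str) -> int:
        if value not in self._ids:
            self._ids[value] = len(self._strings)
            self._strings.append(value)
        return self._ids[value]

    def decode(self, ident: int) -> str:
        return self._strings[ident]


table = StringTable()
triples = [
    ("http://bitcoin.org/56754644",
     "http://www.w3.org/ns/prov#wasInformedBy",
     "http://example.com/procurement-ticket"),
    ("http://bitcoin.org/56754644",
     "http://www.w3.org/2000/01/rdf-schema#seeAlso",
     "http://bitcoin.org/41565751"),
]
encoded = [tuple(table.encode(term) for term in triple) for triple in triples]
print(encoded)  # [(0, 1, 2), (0, 3, 4)]
```

The table itself still has to be shipped or agreed upon in advance, which is the cost that a concrete grammar avoids by fixing the vocabulary in the schema.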
Table 3. The size of the input document in different RDF formats (bytes)
– The use case for the EXI format is focused on sensors, and is heavily biased
in favour of numerical data representing sensor measures.
– The concrete grammar in the original work was built more comprehensively than
we were able to with the PROV XML schema. Possibly, by taking a subset of the
ontology or including stricter rules, we might increase compression efficiency.
Regarding the first item, some preliminary testing indicates that this is the
case. By changing the rdf:about values to numbers, that is, crafting a document
whose triples have literal numeric values as objects, the compression ratio
improved from 44% of the size of the XML to 19% of the size of the XML (down
to 120 bytes). This indicates that in this setup, where every byte counts, the
characteristics of the input document become critical.
The good performance of RDSZ came as a surprise to us. RDSZ was not
compared against EXI in [9] because it was deemed unsuitable for embedded
devices, owing to the loss of the RDF triple structure and the use of the
energy-inefficient Zlib. However, as energy is not a factor in our setup, our
results suggest that sacrificing triple structure is beneficial in our scenario.
The good result of the naive gzip approach seems to reinforce this hypothesis.
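For reference, the naive baseline amounts to very little code. The sketch below gzips an N-Triples serialization of the Small Document; the N-Triples text was derived by hand from Listing 1.2 for illustration, and Python's gzip module may not produce exactly the same size as the tool used in the experiments, so no byte count is claimed here.

```python
import gzip

# Hand-derived N-Triples for the Small Document (4 triples); an assumption
# made for illustration, not the exact file used in the experiments.
lines = [
    "<http://bitcoin.org/56754644> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> .",
    "<http://bitcoin.org/56754644> <http://www.w3.org/ns/prov#wasInformedBy> <http://example.com/procurement-ticket> .",
    "<http://example.com/procurement-ticket> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> .",
    "<http://bitcoin.org/56754644> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://bitcoin.org/41565751> .",
]
NTRIPLES = ("\n".join(lines) + "\n").encode("ascii")

compressed = gzip.compress(NTRIPLES, compresslevel=9)
fits = len(compressed) <= 83  # default per-transaction figure used in this paper

print(len(NTRIPLES), "bytes ->", len(compressed), "bytes")
print("fits in one data output:", fits)
```

Because gzip ignores the triple structure entirely, it serves as a lower bar that any RDF-aware algorithm should beat to be worthwhile in this scenario.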
In this paper we have studied the problem of attaching RDF metadata to
transactions in cryptocurrency blockchains, so that metadata and transaction are
stored in the same blockchain, with special emphasis on the Bitcoin family,
where each transaction can carry by default 83 bytes of data. The problem
is related to the RDF compression problems motivated by the Publish/Exchange
of Big Semantic Data collections and by Semantic Data management in IoT
devices. We compared the key dimensions considered for the design of those
algorithms with the Bitcoin transaction scenario, uncovering that it is not
completely aligned with either of the other two. In Bitcoin transactions, the
critical aspect is the compression ratio on a small input, with factors like
energy or the provision of a query interface being irrelevant.
Finally, we tested seven state-of-the-art RDF compression algorithms on two
sample documents describing the provenance of a transaction in 10 and 4 triples
respectively, to test how well they generalize to the Bitcoin scenario. The
results are largely negative, as most approaches did not improve over the naive
approach of compressing the N-Triples representation with gzip. Only RDSZ and
EXI with a concrete grammar were able to improve on it for the document with 4
triples. Our results also suggest that, in the current state of the art, only a
very limited class of RDF documents could be attached to a Bitcoin transaction
without requiring additional transactions. We believe this motivates research
into new RDF compression algorithms specifically tailored to this use case.
As future work, besides the aforementioned development of specifically
tailored RDF compression algorithms, we consider of interest the study of which
classes of RDF documents are better compressed by which algorithm, and
the optimal way to split and compress a document to minimize the number of
required extra transactions. An intelligent client would then be able to
combine different approaches depending on the structure of the document to be
attached to the transaction. Finally, another interesting direction is repeating
this same analysis for the case of Smart Contract blockchains like Ethereum,
where additional data registers are available and it is possible to create a
server-side query/update interface.
Acknowledgements We thank Javier D. Fernández for making the code
of ERI available on GitHub. We thank Victor Charpenay for kindly answering
inquiries about the EXI format and the EXIficient GUI.
References
1. Bartoletti, M., Pompianu, L.: An analysis of Bitcoin OP_RETURN metadata (2017)
2. Coin Sciences Ltd: Metadata in the Blockchain: The OP_RETURN Explosion,
https://www.slideshare.net/coinspark/bitcoin-2-and-opreturns-the-blockchain-as-tcpip
3. Colored Coins Team: Data storage on the blockchain,
https://github.com/Colored-Coins/Colored-Coins-Protocol-Specification/wiki/Data-Storage-Methods
4. Fernández, J.D., Llaves, A., Corcho, O.: Efficient RDF Interchange (ERI) Format
for RDF Data Streams. In: The Semantic Web – ISWC 2014. pp. 244–259. Lecture
Notes in Computer Science, Springer, Cham (Oct 2014),
https://link.springer.com/chapter/10.1007/978-3-319-11915-1_16
5. Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.:
Binary RDF representation for publication and exchange (HDT). Web Semantics:
Science, Services and Agents on the World Wide Web 19, 22–41 (Mar 2013),
http://www.sciencedirect.com/science/article/pii/S1570826813000036
6. Fernández, N., Arias, J., Sánchez, L., Fuentes-Lorenzo, D., Corcho, O.: RDSZ:
An Approach for Lossless RDF Stream Compression. In: The Semantic
Web: Trends and Challenges. pp. 52–67. Lecture Notes in Computer Science,
Springer, Cham (May 2014),
https://link.springer.com/chapter/10.1007/978-3-319-07443-6_5
7. Garay, J., Kiayias, A., Leonardos, N.: The Bitcoin Backbone Protocol: Analysis
and Applications. In: Advances in Cryptology - EUROCRYPT 2015. pp. 281–310.
Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (Apr 2015),
https://link.springer.com/chapter/10.1007/978-3-662-46803-6_10
8. Hasemann, H., Kröller, A., Pagel, M.: RDF provisioning for the Internet of Things.
In: 2012 3rd IEEE International Conference on the Internet of Things. pp. 143–150
(Oct 2012)
9. Käbisch, S., Peintner, D., Anicic, D.: Standardized and Efficient RDF Encoding
for Constrained Embedded Networks. In: The Semantic Web. Latest Advances
and New Domains. pp. 437–452. Lecture Notes in Computer Science,
Springer, Cham (May 2015),
https://link.springer.com/chapter/10.1007/978-3-319-18818-8_27
10. Nakamoto, S.: Bitcoin: A peer-to-peer electronic cash system. Tech. rep. (2008)
11. Schneider, J., Kamiya, T., Peintner, D., Kyusakov, R.: Efficient XML
Interchange (EXI) Format 1.0. Tech. rep., W3C (2014),
https://www.w3.org/TR/2014/REC-exi-20140211/
12. Su, X., Riekki, J., Nurminen, J.K., Nieminen, J., Koskimies, M.: Adding
semantics to internet of things. Concurrency and Computation: Practice and
Experience 27(8), 1844–1860 (Jun 2015),
http://onlinelibrary.wiley.com/doi/10.1002/cpe.3203/abstract
13. Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L.,
De Meester, B., Haesendonck, G., Colpaert, P.: Triple Pattern Fragments: A
low-cost knowledge graph interface for the Web. Web Semantics: Science,
Services and Agents on the World Wide Web 37, 184–206 (Mar 2016),
http://www.sciencedirect.com/science/article/pii/S1570826816000214