IBM InfoSphere Streams Version 2.0.0.4

IBM Streams Processing Language Introductory Tutorial

Note: Before using this information and the product it supports, read the general information under Notices.

Edition Notice

This document contains proprietary information of IBM. It is provided under a license agreement and is protected by copyright law. The information contained in this publication does not include any product warranties, and any statements provided in this manual should not be interpreted as such.

You can order IBM publications online or through your local IBM representative.

- To order publications online, go to the IBM Publications Center at www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
- To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at www.ibm.com/planetwide

When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

Copyright IBM Corporation 2011, 2012. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Summary of changes
This topic describes updates to this documentation for IBM InfoSphere Streams Version 2.0 (all releases).

Updates for Version 2.0.0.4 (Version 2.0, Fix Pack 4)


This guide was not updated for Version 2.0.0.4.

Updates for Version 2.0.0.3 (Version 2.0, Fix Pack 3)


This guide was not updated for Version 2.0.0.3.

Updates for Version 2.0.0.2 (Version 2.0, Fix Pack 2)


This guide was not updated for Version 2.0.0.2.

Updates for Version 2.0.0.1 (Version 2.0, Fix Pack 1)


This guide was not updated for Version 2.0.0.1.

Copyright IBM Corp. 2011, 2012

IBM InfoSphere Streams Version 2.0.0.4: IBM Streams Processing Language Introductory Tutorial

Abstract
This document is an introductory tutorial to the IBM Streams Processing Language (SPL), the programming language for IBM InfoSphere Streams. If you are new to SPL, and want to learn it, this is a good document to read first.

Contents

Summary of changes
Abstract
Chapter 1. Getting started
Chapter 2. Stream processing
Chapter 3. Types and functions
Chapter 4. Composite operators
Chapter 5. Primitive operators
Chapter 6. Next steps
Notices
Index

Chapter 1. Getting started


The best way to learn a new programming language is to write programs in it. But to write programs, you need to know how to compile and run programs, because you will want to frequently test whether what you wrote does what you wanted. Therefore, we will start this tutorial with a very simple program, whose purpose is less to illustrate language features than to try out the compiler. Here is the code:
composite HelloWorld {                             //1
  graph                                            //2
    stream<rstring message> Hi = Beacon() {        //3
      param iterations : 1u;                       //4
      output Hi : message = "Hello, world!";       //5
    }                                              //6
    () as Sink = Custom(Hi) {                      //7
      logic onTuple Hi : printStringLn(message);   //8
    }                                              //9
}                                                  //10

Let's defer the discussion of how this code works, and instead focus on getting it compiled. You will get the most out of this tutorial if you try out things as you go along. Therefore, the tutorial frequently has instructions for you like the following:

- Make sure you are on a machine that has IBM InfoSphere Streams installed, and thus has the SPL compiler, sc, available. For more information about installing InfoSphere Streams, see the IBM InfoSphere Streams: Installation and Administration Guide.
- Create a directory called HelloWorld on that machine.
- Create a file in that directory called HelloWorld.spl, enter the above program text in it, and save it.
- Make sure you are in that directory, and run the compiler by entering sc -T -M HelloWorld.

The -T flag creates a standalone executable, that is, a program that can run as a single process on a single machine, without requiring a running InfoSphere Streams instance. The -M HelloWorld command-line option specifies that the main composite is called HelloWorld. Each SPL program has one main composite operator. A composite operator is an operator that encapsulates a stream graph, and the stream graph of a main composite can be run as a program.

If you ran the compiler as recommended, and there were no compiler errors, then it created the executable file ./output/bin/standalone. Run the executable file. It should print Hello, world! to the console.

Now we will discuss how the code works. Line 1 declares a composite operator: composite HelloWorld { ... }. Line 2 starts a graph clause, which means that Lines 3-9 describe a stream graph. The graph consists of two operator invocations. Line 3 is the head of the first operator invocation: stream<rstring message> Hi = Beacon() invokes operator Beacon to produce a stream Hi whose tuples have one attribute rstring message. Line 7 is the head of the second operator invocation: () as Sink = Custom(Hi) invokes operator Custom, which reads from stream Hi. The () as Sink part indicates that this operator invocation produces no stream (the empty output list ()), and has the name Sink. Here is a visual representation of this stream graph:

Figure 1. Stream graph of the HelloWorld program. [Diagram: Beacon -> Hi -> Sink]

The operator invocations are shown as circles, and the stream is shown as an arrow. The operator invocations are decorated at the bottom right with little scratch-paper icons that indicate internal state: both Beacon and Sink are stateful in this program.

The Beacon operator produces data. In this invocation, Line 4, param iterations : 1u;, tells it to produce just one tuple; the u suffix on the number 1 makes it an unsigned integer, since it would not make sense to have a negative number of iterations. In SPL, users or library writers define operators (like Beacon) and their parameters (like iterations) using a common framework; they are not built into the language. Line 5, output Hi : message = "Hello, world!";, assigns the string "Hello, world!" to attribute message of output stream Hi.

Moving on to the second operator invocation, the Custom operator provides a clean slate for custom user logic. Line 8, logic onTuple Hi : printStringLn(message), specifies that upon arrival of a tuple on stream Hi, the program should print the string attribute message from the tuple, followed by a newline character \n.

At this point, you have compiled and run a first SPL program, and you understand what it does. This program only illustrates a tiny fraction of SPL, but before we move on to more interesting examples, we will take a look at the compiled code. Besides the standalone executable, the compiler also generated several other artifacts. Recall that we started out from just one directory HelloWorld and with just one file HelloWorld.spl. If you look at the directory after compiling, you will find something like the following:
/+ HelloWorld
   /+ HelloWorld.spl       # SPL source code
   /* toolkit.xml          # toolkit index
   /* data                 # directory for data read/written by the program
   /* output               # directory for artifacts generated by the compiler
      /* HelloWorld.adl    # ADL (application description language) file
      /* bin               # compiled binaries
         /* standalone     # the standalone executable from earlier
      /* src               # generated C++ source code
         /* operator       # source code for operator invocations
         /* pe             # source code for PEs (processing elements)
         /* standalone     # source code for the standalone file
         /* type           # source code for types

In this listing, authored files (files written by hand) are annotated with /+ and generated files (files written automatically by the compiler) are annotated with /*. For now, we do not need to cover all the generated artifacts in detail, but you are encouraged to look at a few of them to get a feeling for what they look like.

The purpose of this tutorial is to provide an introduction to SPL. To focus on the essentials, it intentionally omits details that you do not immediately need to know, but can look up at your leisure in the more complete and precise reference documentation.

This section gave an example for using sc, the SPL compiler. For more information about using the SPL compiler, see the IBM Streams Processing Language Compiler Usage Reference. This section also used the toolkit operators Beacon and Custom, and the toolkit function printStringLn. For more information about library operators and functions, see the IBM Streams Processing Language Standard Toolkit Reference.
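The data flow of the HelloWorld graph can also be mimicked outside SPL. The following C++ sketch is only an illustration and not how InfoSphere Streams implements operators: HiTuple, beacon, and runHelloWorld are hypothetical names, and a plain callback stands in for the stream Hi that connects Beacon to the Custom operator.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a tuple on stream Hi with one attribute, rstring message.
struct HiTuple {
    std::string message;
};

// Hypothetical model of the Beacon invocation: emits `iterations` tuples to the
// downstream callback, each with message assigned as in the output clause.
void beacon(std::uint32_t iterations,
            const std::function<void(const HiTuple&)>& onTuple) {
    for (std::uint32_t k = 0; k < iterations; ++k) {
        onTuple(HiTuple{"Hello, world!"});
    }
}

// Wires the two "operators" together and returns the lines the sink would
// print, one vector entry per printed line.
std::vector<std::string> runHelloWorld(std::uint32_t iterations) {
    std::vector<std::string> printed;
    beacon(iterations, [&printed](const HiTuple& t) {
        printed.push_back(t.message);  // plays the role of printStringLn(message)
    });
    return printed;
}
```

Calling runHelloWorld(1u) yields a single entry, mirroring param iterations : 1u. The unsigned parameter type reflects the same reasoning as the u suffix in SPL: a negative iteration count is not representable.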

Chapter 2. Stream processing


As the name implies, SPL is a language for processing data streams. But the HelloWorld example from the previous section hardly qualifies as stream processing, since there was only a single stream with a single tuple in that program. This section introduces a more idiomatic example that processes streams of a-priori unknown length, using a graph of operator invocations that have pipeline parallelism. The purpose of the program is to list a file, prefix each line with a line number, and write the result to another file. It accomplishes this with the following stream graph:

Figure 2. Stream graph of the NumberedCat program. [Diagram: FileSource -> Lines -> Functor -> Numbered -> Sink]

A stream is a (possibly infinite) sequence of tuples; in the example, Lines and Numbered are streams. A tuple is a data item on a stream. In the example, the stream Lines transports one tuple for each line in the input file. An operator is a reusable stream transformer: each operator invocation transforms some input streams into some output streams. The place where a stream connects to an operator is called a port. Many operators have one input port and one output port (like Functor in the example), but operators can also have zero input ports (FileSource), zero output ports (FileSink), or multiple input or output ports (which we will see in later examples). But back to the line-numbering program. We will call it NumberedCat as an homage to the Unix cat utility that, given the right command-line options, performs the same task. Here is the code:
composite NumberedCat {                                           //1
  graph                                                           //2
    stream<rstring contents> Lines = FileSource() {               //3
      param format : line;                                        //4
            file   : getSubmissionTimeValue("file");              //5
    }                                                             //6
    stream<rstring contents> Numbered = Functor(Lines) {          //7
      logic  state : mutable int32 i = 0;                         //8
             onTuple Lines : i++;                                 //9
      output Numbered : contents = (rstring)i + " " + contents;   //10
    }                                                             //11
    () as Sink = FileSink(Numbered) {                             //12
      param file   : "result.txt";                                //13
            format : line;                                        //14
    }                                                             //15
}                                                                 //16

As in the previous example, there is a composite operator definition with a graph clause that contains operator invocations. The invocation of FileSource in Lines 3-6 reads one line at a time (param format : line), from a file specified at submission time (param file : getSubmissionTimeValue("file")). In a little bit, we will see how to supply the file name at submission time.

The invocation of Functor in Lines 7-11 maintains a state variable mutable int32 i = 0, which it increments each time a tuple arrives (onTuple Lines : i++). SPL variables are immutable by default, so without the mutable modifier, the compiler would have prevented us from incrementing i. The output clause output Numbered : contents = (rstring)i + " " + contents assigns the contents attribute of the output stream by casting the line number i to a string, (rstring)i, and concatenating it with the contents attribute of the input stream. As the example shows, an output clause has assignments where the left-hand side is an attribute of the output stream, whereas attribute names in the right-hand side belong to input streams. Finally, the invocation of FileSink on Lines 12-15 writes the results to a file named result.txt.

You should try out the following:

- Create a directory called NumberedCat.
- Put the example program in a file NumberedCat/NumberedCat.spl.
- Compile it to a stand-alone executable with sc -T -M NumberedCat.
- Put the following text in a file NumberedCat/data/catFood.txt: The Unix utility "cat" is so called because it can con"cat"enate files. Our program behaves like "cat -n", listing one file and numbering lines.
- When we run the program, we need to supply the input file name as a submission-time value. The FileSource operator expects a file name that is relative to the NumberedCat/data directory. Therefore, run the program with ./output/bin/standalone file="catFood.txt".
- Look at the NumberedCat/data directory. If everything went fine, then the program created a file called result.txt that contains the numbered lines of catFood.txt.

So far, we have run all our programs in stand-alone mode. That is common during testing and debugging. But a major strength of InfoSphere Streams is that it can run programs on a cluster of workstations. To do this, we need to compile without the -T,--standalone-application option, and then create an instance of the runtime into which we submit the job. Please try the following sequence of commands:
sc -M NumberedCat                                                 # compile
streamtool mkinstance --template developer                        # make a runtime instance
streamtool startinstance                                          # start the runtime instance
streamtool submitjob -P file=catFood.txt output/NumberedCat.adl   # submit the job
streamtool lsjobs                                                 # list running jobs
# wait until data/result.txt contains the numbered lines of data/catFood.txt
streamtool canceljob 0                                            # cancel the job
streamtool stopinstance                                           # stop the runtime instance
streamtool rminstance                                             # remove the runtime instance

If everything went well, this accomplished the same result as running the program stand-alone. If anything went wrong, consult your system administrator, or try to diagnose the problem yourself by using the streamtool getlog/viewlog commands. As mentioned before, the best way to learn a language is to write and run programs in it, so now is a good time to ensure that you have the right setup to do that. Note how the streamtool submitjob command accepts submission-time values with the -P option, and uses the .adl file (application description language) to figure out which operators to submit.

This section illustrated the flavor of SPL as a streaming language, and gave you a taste for how to run programs on an instance of the IBM InfoSphere Streams distributed runtime. We saw three new standard toolkit operators: FileSource, Functor, and FileSink. For more information about standard toolkit operators, see the IBM Streams Processing Language Standard Toolkit Reference. To learn more about working with the distributed runtime, type streamtool man, which contains a plethora of information about commands like submitjob and family. To learn more about SPL, see the IBM Streams Processing Language Specification.
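The stateful Functor at the heart of NumberedCat can be approximated in a few lines of C++. This is a hedged sketch rather than the Streams implementation: NumberingFunctor and numberLines are hypothetical names, a std::vector stands in for the streams Lines and Numbered, and the sketch assumes that the onTuple logic runs before the output assignment, which is what numbering from 1 in the manner of cat -n implies.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of the Functor invocation: keeps the counter i as
// operator state and prefixes each incoming line with its line number.
class NumberingFunctor {
public:
    // Called once per tuple on the input; returns the contents attribute
    // of the corresponding output tuple.
    std::string onTuple(const std::string& contents) {
        ++i_;                                        // logic onTuple Lines : i++
        return std::to_string(i_) + " " + contents;  // (rstring)i + " " + contents
    }

private:
    std::int32_t i_ = 0;  // state : mutable int32 i = 0
};

// Feeds a finite list of lines through one functor instance, in order.
std::vector<std::string> numberLines(const std::vector<std::string>& lines) {
    NumberingFunctor functor;
    std::vector<std::string> numbered;
    for (const std::string& line : lines) {
        numbered.push_back(functor.onTuple(line));
    }
    return numbered;
}
```

For the two input lines "first" and "second", numberLines returns "1 first" and "2 second". The counter lives in the functor object, not in the loop, which mirrors how the state clause outlives individual tuples.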

Chapter 3. Types and functions


One of the most important goals of programming languages is to enable reuse and improve readability. Languages support this by allowing you to define your own types and functions. User-defined types and functions foster reuse, because they can be defined once and used multiple times. User-defined types and functions foster readability, because they give a descriptive name to a concept and unclutter the code that uses it. To illustrate this, we will develop a simple streaming application that counts the lines and words in a file. The WordCount program consists of the following stream graph:

Figure 3. Stream graph of the WordCount program. [Diagram: FileSource -> Data -> Functor -> OneLine -> Counter]

The FileSource operator invocation reads a file, sending lines on the Data stream. The Functor operator invocation counts the lines and words for each individual line of Data, sending the statistics on the OneLine stream. Unlike the invocation of the Functor operator in Chapter 2, "Stream processing", this invocation of the Functor is stateless; it has no side-effects or dependencies between tuples. Finally, the Counter operator invocation aggregates the statistics for all lines in the file, and prints them at the end.

Before we look at the main composite operator, let's define some helpers. We will use a type LineStat for the statistics about a line; a function countWords(rstring line) to count the words in a line; and a function addM(mutable LineStat x, LineStat y) to add two LineStat values and store the result in x. Here is the definition of these helpers:
type LineStat = tuple<int32 lines, int32 words>;   //1
int32 countWords(rstring line) {                   //2
  return size(tokenize(line, " \t", false));       //3
}                                                  //4
void addM(mutable LineStat x, LineStat y) {        //5
  x.lines += y.lines;                              //6
  x.words += y.words;                              //7
}                                                  //8

You can put this code in a file called WordCount/Helpers.spl. Line 1 defines type LineStat to be a tuple with two attributes for counting lines and words. Lines 2-4 define function countWords by using the standard toolkit function tokenize to split the line on spaces and tabs (" \t"), and then using the standard toolkit function size to count the resulting fragments. Lines 5-8 define function addM. As mentioned previously, SPL variables are immutable by default, so we had to explicitly declare parameter x as mutable to enable the function to add values to its attributes. Having the mutable modifier in the signature of the function makes it clear to the user what kind of side-effects the function might have, and the compiler can also use this information for optimization. Now we are ready to define the main composite operator. You can put the following code in a file called WordCount/WordCount.spl.
composite WordCount {                                                  //1
  graph                                                                //2
    stream<rstring line> Data = FileSource() {                         //3
      param file   : getSubmissionTimeValue("file");                   //4
            format : line;                                             //5
    }                                                                  //6
    stream<LineStat> OneLine = Functor(Data) {                         //7
      output OneLine : lines = 1, words = countWords(line);            //8
    }                                                                  //9
    () as Counter = Custom(OneLine) {                                  //10
      logic state : mutable LineStat sum = { lines = 0, words = 0 };   //11
            onTuple OneLine : addM(sum, OneLine);                      //12
            onPunct OneLine : if (currentPunct() == Sys.FinalMarker)   //13
                                println(sum);                          //14
    }                                                                  //15
}                                                                      //16

By this point in the tutorial, you should be able to read and understand much of this code. Note how type LineStat is used both in Line 7 as a schema for stream OneLine, and in Line 11 as a type for variable sum. Line 12 adds the statistics from the newest tuple in stream OneLine into the accumulator variable sum by using the helper function addM defined before.

Lines 13-14 illustrate punctuation handling, which is a new feature that we have not seen before. A punctuation is a control signal that appears interleaved with the tuples on a stream. The logic onPunct OneLine clause gets triggered each time a punctuation arrives on stream OneLine. If the punctuation is Sys.FinalMarker, that indicates that the end of the stream has been reached. In our example, the FileSource operator sends a FinalMarker at the end of the file, and the Functor operator forwards it after sending statistics for the last line.

Compile and run the program as a standalone application, as you learned in the previous sections. You will need to provide an input file in the data directory, and provide the file name as a submission-time value on the command line of the standalone application. The program should print the total statistics to the console.

When you learn a new programming language and start writing programs in it, you are bound to encounter error messages. These can be baffling, because you thought your program was fine, yet the compiler objected to something in it. Therefore, a good exercise when learning a language is to make some intentional errors, and familiarize yourself with the error messages. That way, when you see the same errors again "by accident", you will already be somewhat familiar with them.

So let's inject an error into the example program. Go to file WordCount/Helpers.spl, and remove the mutable modifier from the signature of function addM. In other words, Line 5 should read void addM(LineStat x, LineStat y). Recompile by running sc -T -M WordCount. You should get something like the following:
Helpers.spl:6:11: CDISP0378E ERROR: The operand modified by += must be mutable.
Helpers.spl:7:11: CDISP0378E ERROR: The operand modified by += must be mutable.

The compiler complains because the += operator tries to modify the parameter x, but it has not been declared as mutable.

In this section, you saw how to define your own types and functions, which enables reuse and improves readability. For more information about defining your own types and functions, see the IBM Streams Processing Language Specification. Types and functions form a sub-language that you can easily learn without any other materials as prerequisites. Rather than depending on other features, they serve as the foundation for more advanced language features like the ones we will cover in the remaining sections of this tutorial.
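For readers who know C++, the helpers from this section have a close analogue there. The sketch below is an approximation under stated assumptions, not a translation that the SPL compiler produces: it assumes that the third argument false of tokenize means that empty tokens are dropped, it hand-rolls the splitting instead of calling any toolkit function, and it uses a non-const reference to play the role of the mutable modifier.

```cpp
#include <cstdint>
#include <string>

// Analogue of: type LineStat = tuple<int32 lines, int32 words>;
struct LineStat {
    std::int32_t lines;
    std::int32_t words;
};

// Counts maximal runs of characters that are neither space nor tab, which
// matches splitting on " \t" and discarding empty tokens (an assumption
// about what the false argument to tokenize means).
std::int32_t countWords(const std::string& line) {
    std::int32_t words = 0;
    bool inWord = false;
    for (char c : line) {
        const bool isDelimiter = (c == ' ' || c == '\t');
        if (!isDelimiter && !inWord) {
            ++words;  // first character of a new word
        }
        inWord = !isDelimiter;
    }
    return words;
}

// x is a non-const reference, so the function may modify it, like a mutable
// parameter in SPL; y is const and therefore read-only.
void addM(LineStat& x, const LineStat& y) {
    x.lines += y.lines;
    x.words += y.words;
}
```

Declaring x as const LineStat& would make a C++ compiler reject the two += statements, which is comparable to the CDISP0378E errors that the SPL compiler reports when the mutable modifier is removed.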

Chapter 4. Composite operators


Just like user-defined types and functions, user-defined composite operators help code reuse and readability. A composite operator encapsulates a stream sub-graph, which can then be used in different contexts. Each of the examples so far had a main composite operator, which encapsulates the stream graph that forms the whole program. Main composite operators are self-contained in the sense that their stream graph has no output or input ports, and they have no mandatory parameters.

In this section, we will instead look at a composite operator that has both ports and parameters. The operator reads a stream from input port In, removes duplicate consecutive lines, and writes the result to output port Out. We call the operator Uniq as an homage to the Unix uniq utility, which performs the same task. Internally, the Uniq operator uses a Custom operator to implement its functionality. Here is a diagram of the stream graph:

Figure 4. Stream graph of the body of the Uniq operator. [Diagram: composite Uniq containing In -> Custom -> Out]

To make things more interesting, the Uniq operator has a parameter type $key, which is the type containing the subset of attributes of the input tuple that are used to determine uniqueness. If two consecutive tuples are identical for these attributes, the second one is dropped even if it differs in some other attributes. The following code implements operator Uniq:
namespace my.util;                                //1
public composite Uniq(output Out; input In) {     //2
  param                                           //3
    type $key;                                    //4
  graph                                           //5
    stream<In> Out = Custom(In) {                 //6
      logic state : {                             //7
              mutable boolean first = true;       //8
              mutable $key prev;                  //9
            }                                     //10
            onTuple In : {                        //11
              $key curr = ($key)In;               //12
              if (first || prev != curr) {        //13
                submit(In, Out);                  //14
                first = false;                    //15
                prev = curr;                      //16
              }                                   //17
            }                                     //18
    }                                             //19
}                                                 //20

Line 1, namespace my.util, specifies a namespace for the operator. That means that the operator's full name is really my.util::Uniq. You should put the above source code in a file Uniq/my.util/Uniq.spl. Line 2, public composite Uniq(output Out; input In), specifies that the operator is public, meaning it can be used from other namespaces; and that it has one output port Out and one input port In. Lines 3 and 4 declare the mandatory formal parameter $key, which is a type. Line 12, $key curr = ($key)In;, declares a local variable curr of type $key, and initializes it with the expression ($key)In, which takes the current tuple from input stream In and casts it to type $key; in other words, it drops any attributes that are not relevant for the comparison with the previous tuple. We have to consider one special case: for the very first tuple, there is no previous tuple, so we always treat it as unique.

Now that we have defined our own operator my.util::Uniq, we need to test it. To do that, we will generate a stream All of tuples that have some duplicates, and send them through the Uniq operator to get the stream Some of unique tuples. We will print both All and Some so we can inspect whether the operator actually worked as expected. The stream graph for the test driver is:

Figure 5. Stream graph of the test driver for the Uniq operator. [Diagram: Beacon -> All; All -> Uniq -> Some -> PrintSome; All -> PrintAll]

Note that as far as the driver is concerned, Uniq is just an ordinary operator, whose invocation can serve as a vertex in a stream graph just like any of the other operators we have used before. Note also that a single stream from a single output port, like All in the example, can be used as the input to multiple operators; in this case, all tuples are duplicated, once for each recipient. The following code implements the test driver:
use my.util::Uniq;                                                       //1
composite Main {                                                         //2
  type                                                                   //3
    KeyType = tuple<int32 j>;                                            //4
  graph                                                                  //5
    stream<int32 i, int32 j> All = Beacon() {                            //6
      logic state : mutable int32 n = 0;                                 //7
      param iterations : 10u;                                            //8
      output All : i = ++n, j = n / 3;                                   //9
    }                                                                    //10
    stream<All> Some = Uniq(All) {                                       //11
      param key : KeyType;                                               //12
    }                                                                    //13
    () as PrintAll = Custom(All) {                                       //14
      logic onTuple All : printString("All" + (rstring)All + "\n");      //15
    }                                                                    //16
    () as PrintSome = Custom(Some) {                                     //17
      logic onTuple Some : printString("Some" + (rstring)Some + "\n");   //18
    }                                                                    //19
}                                                                        //20

Note how Lines 11-13 invoke our operator Uniq, passing an actual parameter param key : KeyType, which indicates that only attribute j is to be used in the uniqueness test. Put this code into a file Uniq/Main.spl, and run sc -T -M Main to compile it as a stand-alone application. Now run ./output/bin/standalone. You should see the following output:
All {i=1,j=0}
Some {i=1,j=0}
All {i=2,j=0}
All {i=3,j=1}
Some {i=3,j=1}
All {i=4,j=1}
All {i=5,j=1}
All {i=6,j=2}
Some {i=6,j=2}
All {i=7,j=2}
All {i=8,j=2}
All {i=9,j=3}
Some {i=9,j=3}
All {i=10,j=3}

If you look just at All lines, you see that the i attribute just counts up iterations from 1 to 10, while the j attribute is always i/3 rounded down to the nearest integer. Since we used type tuple<int32 j> as the uniqueness key, a tuple is considered unique only when j changes, and therefore, Some lines show only the first tuple for each value of j.

In this section, you have seen how to define your own composite operators to encapsulate useful reusable functionality. You have also seen how the Beacon operator from the standard toolkit can serve as a useful workload generator for testing. We recommend that you test your own operators with test drivers like the one shown in this example. Besides helping you to iron out bugs during development, drivers like these are also useful to keep around later for regression testing.

SPL composite operators are more powerful than the example in this section illustrates. They can encapsulate not just a single operator, but a whole graph; they can have multiple output and input ports; and they can have more parameters, of different kinds besides types. For more information about composite operators, see the IBM Streams Processing Language Specification.
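The behavior of the test driver can be cross-checked with a small C++ model. This is a sketch for illustration only: AllTuple, generateAll, and uniqByJ are hypothetical names, vectors stand in for the streams All and Some, and the key is hard-wired to attribute j instead of being a type parameter like $key.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for tuples on stream All: tuple<int32 i, int32 j>.
struct AllTuple {
    std::int32_t i;
    std::int32_t j;
};

// Models the Beacon invocation: i = ++n, j = n / 3 (integer division).
std::vector<AllTuple> generateAll(std::uint32_t iterations) {
    std::vector<AllTuple> all;
    std::int32_t n = 0;
    for (std::uint32_t k = 0; k < iterations; ++k) {
        ++n;
        all.push_back(AllTuple{n, n / 3});
    }
    return all;
}

// Models Uniq with KeyType = tuple<int32 j>: forwards a tuple only if it is
// the first one or its key differs from the key of the previously forwarded one.
std::vector<AllTuple> uniqByJ(const std::vector<AllTuple>& in) {
    std::vector<AllTuple> out;
    bool first = true;      // mutable boolean first = true
    std::int32_t prev = 0;  // mutable $key prev (value unused while first is true)
    for (const AllTuple& t : in) {
        const std::int32_t curr = t.j;  // ($key)In keeps only the key attribute
        if (first || prev != curr) {
            out.push_back(t);  // submit(In, Out)
            first = false;
            prev = curr;
        }
    }
    return out;
}
```

Running uniqByJ(generateAll(10u)) keeps the tuples with i equal to 1, 3, 6, and 9, which agrees with the Some lines in the output shown earlier.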

Chapter 5. Primitive operators


Recall that an operator is a reusable stream transformer, and a composite operator encapsulates a stream graph. If all operators were composite, we would have a chicken-and-egg problem; therefore, SPL also has primitive operators, which encapsulate code in a native language. This is usually a more traditional, von-Neumann language such as Java or C++. In this section, we will develop a primitive operator RoundRobinSplit in C++, but IBM InfoSphere Streams also enables you to write primitive operators in Java. If you are not a C++ programmer, or if you anticipate that you will mostly use operators from the standard toolkit or other toolkits, you can skip this section. We will start this presentation from the familiar, by giving an example of invoking RoundRobinSplit from SPL code. We will implement the following stream graph:

Figure 6. Stream graph of the test driver for the RoundRobinSplit operator. [Diagram: Beacon -> Input -> RRSplit; RRSplit -> A0 -> Functor -> B0 -> Pair; RRSplit -> A1 -> Functor -> B1 -> Pair; Pair -> Output -> Writer]

Graphs like these are called split-joins, and are a common cause of non-determinism in streaming applications, because data may be processed at different speeds along the different paths. However, some applications require deterministic behavior, which is also useful for testing purposes. Our new RoundRobinSplit operator, together with the Pair operator from the standard toolkit, provides a simple way to implement a deterministic split-join without giving up much of the performance advantage afforded by the parallelism in the middle portion of the stream graph. Specifically, RoundRobinSplit deterministically alternates between sending data to each of its output ports, and Pair deterministically alternates between receiving data from each of its input ports. Here is the code for this stream graph:
use my.util::RoundRobinSplit;                                 //1
composite Main {                                              //2
  graph                                                       //3
    stream<int32 count> Input = Beacon() {                    //4
      logic state : mutable int32 n = 0;                      //5
      param iterations : 10u;                                 //6
      output Input : count = n++;                             //7
    }                                                         //8
    (stream<int32 count> A0; stream<int32 count> A1) = RoundRobinSplit(Input) {   //9
      param batch : 2u;                                       //10
    }                                                         //11
    stream<int32 count, int32 path> B0 = Functor(A0) {        //12
      output B0 : path = 0;                                   //13
    }                                                         //14
    stream<int32 count, int32 path> B1 = Functor(A1) {        //15
      output B1 : path = 1;                                   //16
    }                                                         //17
    stream<int32 count, int32 path> Output = Pair(B0; B1) {}  //18
    () as Writer = FileSink(Output) {                         //19
      param file  : "/dev/stdout";                            //20
            flush : 1u;                                       //21
    }                                                         //22
}                                                             //23

Line 9, (stream<int32 count> A0; stream<int32 count> A1) = RoundRobinSplit(Input), invokes operator RoundRobinSplit to produce two output streams A0 and A1. The operator takes a parameter param batch : 2u that indicates that it alternates after every two tuples. Line 18 invokes operator Pair on two input streams B0 and B1, with the code stream<int32 count, int32 path> Output = Pair(B0; B1).

For now, put this code into a file RoundRobinSplit/Main.spl. However, don't try to compile it yet; we need to implement the operator RoundRobinSplit first. Create a directory RoundRobinSplit/my.util/RoundRobinSplit, and change into that directory. Now, run spl-make-operator --kind c++. That will generate several skeleton files for you, including an operator model RoundRobinSplit.xml and two code generation templates (.cgt files), one for a header file RoundRobinSplit_h.cgt and one for a C++ implementation file RoundRobinSplit_cpp.cgt. When you write more sophisticated primitive operators, you will often need to edit the XML operator model, but in this case, the operator is simple enough that you do not need to change the operator model at all.

Open the header file code generation template RoundRobinSplit_h.cgt. You will see a class definition with several method declarations. Remove most methods except for the constructor and process(Tuple & tuple, uint32_t port). Add two instance fields Mutex _mutex and uint32_t _count. You should end up with the following code in RoundRobinSplit_h.cgt:
#pragma SPL_NON_GENERIC_OPERATOR_HEADER_PROLOGUE
class MY_OPERATOR : public MY_BASE_OPERATOR {
public:
  MY_OPERATOR();
  void process(Tuple & tuple, uint32_t port);
private:
  Mutex _mutex;
  uint32_t _count;
};
#pragma SPL_NON_GENERIC_OPERATOR_HEADER_EPILOGUE

Next, open the C++ implementation file code generation template RoundRobinSplit_cpp.cgt. Remove most methods except for the constructor and process(Tuple & tuple, uint32_t port). Implement these methods as shown in the following listing of RoundRobinSplit_cpp.cgt:
#pragma SPL_NON_GENERIC_OPERATOR_IMPLEMENTATION_PROLOGUE      //1
MY_OPERATOR::MY_OPERATOR() : _count(0) {}                     //2
void MY_OPERATOR::process(Tuple & tuple, uint32_t port) {     //3
  uint32_t const nOutputs = getNumberOfOutputPorts();         //4
  uint32_t const batchSize = getParameter("batch");           //5
  AutoPortMutex apm(_mutex, *this);                           //6
  uint32 outputPort = (_count / batchSize) % nOutputs;        //7
  _count = (_count + 1) % (batchSize * nOutputs);             //8
  assert(outputPort < nOutputs);                              //9
  submit(tuple, outputPort);                                  //10
}                                                             //11
#pragma SPL_NON_GENERIC_OPERATOR_IMPLEMENTATION_EPILOGUE      //12
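The port selection and locking in the listing above can also be tried outside the Streams runtime. The following is a minimal sketch in standard C++, not part of the generated operator code: the class name RoundRobinCounter and the method nextPort are hypothetical, and std::mutex with std::lock_guard stands in for the SPL Mutex and AutoPortMutex types, on the assumption that both simply hold the lock until the end of the enclosing scope.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the operator's port-selection state.
// This is plain standard C++, not the SPL runtime API.
class RoundRobinCounter {
public:
  RoundRobinCounter(std::uint32_t batchSize, std::uint32_t nOutputs)
      : _batchSize(batchSize), _nOutputs(nOutputs), _count(0) {
    assert(batchSize > 0 && nOutputs > 0);
  }

  // Returns the output port for the next tuple and advances the counter.
  std::uint32_t nextPort() {
    // Held until the function returns, so the read and the write of
    // _count cannot be interleaved with another thread's call.
    std::lock_guard<std::mutex> lock(_mutex);
    std::uint32_t const port = (_count / _batchSize) % _nOutputs;
    _count = (_count + 1) % (_batchSize * _nOutputs);
    return port;
  }

private:
  std::mutex _mutex;
  std::uint32_t const _batchSize;
  std::uint32_t const _nOutputs;
  std::uint32_t _count;
};
```

With a batch size of 2 and two outputs, successive calls to nextPort() return 0, 0, 1, 1 and then repeat, because _count wraps around after batchSize * nOutputs tuples.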

The constructor just initializes the _count instance variable to zero. The process method queries the runtime APIs for the number of output ports (Line 4) and the batch size parameter (Line 5); acquires the mutex to guard against concurrent manipulation of the _count instance variable (Line 6); determines the output port (Line 7); updates _count (Line 8); and submits the input tuple to the appropriate output port (Line 10).

The mutex is necessary because without it, if there are two threads T1 and T2, then T1's invocation of process might be interrupted in the middle of Line 8, after reading the old value of _count but before writing the new value; then T2 might call process and update _count; and finally, T1 might resume and overwrite T2's update to _count.

Now, we are finally ready to compile the application. Change to the RoundRobinSplit directory and run the SPL compiler with sc -T -M Main. The SPL compiler will invoke the C++ compiler to compile the instance of the RoundRobinSplit operator: the sources are the files A0.cpp and A0.h in directory RoundRobinSplit/output/src/operator, and the object file is RoundRobinSplit/output/build/operator/A0.o.

Run the application by changing to directory RoundRobinSplit and executing ./output/bin/standalone. You will get the following output:
0,0
2,1
1,0
3,1
4,0
6,1
5,0
7,1

Each line shows the count and path attributes separated by a comma. Since the split uses a batch size of two but the join uses a batch size of one, the counts (left column) have a progression of 0,2,1,3,4,6,5,7, whereas the paths (right column) just alternate between 0 and 1. This output is deterministically repeatable, independent of the processing speed of the two paths.

It is instructive to introduce an error in the C++ code to see what happens. If we change the call on Line 10 of RoundRobinSplit_cpp.cgt to submit(outputPort, tuple), the C++ compiler reports an error message with the correct file name and line number:
my.util/RoundRobinSplit/RoundRobinSplit_cpp.cgt:10: error: no matching function for call to 'SPL::_Operator::A0::submit(SPL::uint32&, SPL::Tuple&)
note: candidates are: virtual void SPL::Operator::submit(SPL::Tuple&, uint32_t)
note:                 virtual void SPL::Operator::submit(const SPL::Tuple&, uint32_t)
note:                 void SPL::Operator::submit(const SPL::Punctuation&, uint32_t)
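To see where the 0,2,1,3,4,6,5,7 progression in the application output comes from, the whole graph can be simulated in a few lines of standard C++. This is only a sketch under stated assumptions: the function name simulate is hypothetical, the split reuses the arithmetic from RoundRobinSplit_cpp.cgt, and Pair is assumed to emit one tuple from each input, in port order, as soon as every input has a tuple waiting.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <sstream>
#include <string>
#include <vector>

// Simulates Beacon -> RoundRobinSplit -> Functor -> Pair -> FileSink for the
// two-path example graph and returns the text the sink would print.
std::string simulate(std::uint32_t iterations, std::uint32_t batchSize) {
  assert(batchSize > 0);
  std::uint32_t const nOutputs = 2;
  std::vector<std::deque<std::uint32_t>> pending(nOutputs);  // one queue per path
  std::ostringstream out;
  std::uint32_t count = 0;

  for (std::uint32_t n = 0; n < iterations; ++n) {
    // Split: same port computation as the operator's process method.
    std::uint32_t const port = (count / batchSize) % nOutputs;
    count = (count + 1) % (batchSize * nOutputs);
    pending[port].push_back(n);

    // Pair (assumed behavior): while every path has a tuple waiting,
    // emit one tuple per path in port order. The path attribute is the
    // index of the path, as set by the two Functor invocations.
    while (!pending[0].empty() && !pending[1].empty()) {
      for (std::uint32_t p = 0; p < nOutputs; ++p) {
        out << pending[p].front() << ',' << p << '\n';
        pending[p].pop_front();
      }
    }
  }
  return out.str();
}
```

Calling simulate(10, 2) yields the same eight lines as the application. The last two Beacon tuples, with counts 8 and 9, both go to the first path and are never paired, which is why ten iterations produce only eight output lines.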

This section barely scratched the surface of developing primitive operators in SPL. There is a rich API for generating specialized code for performance, and for compile-time error checking on things like the number and types of ports. For more information about developing primitive operators, see the IBM Streams Processing Language Toolkit Development Reference. You may also want to take a look at the IBM Streams Processing Language Operator Model Reference to learn about the XML file for the primitive operator.


Chapter 6. Next steps


This tutorial has introduced many topics briefly. Along the way, you saw several links to more detailed documentation. A good way to continue learning the language is to write programs of your own. Table 1 lists suggestions for topics you may want to study further, along with exercises that can help cement your SPL skills.
Table 1. Topics and exercises for further study

Topic: Expression language
Documentation: IBM Streams Processing Language Specification.
Exercise: Write a program to reverse the lines of a small file.

Topic: Type system
Documentation: IBM Streams Processing Language Specification.
Exercise: Create a histogram of the length of lines in a file.

Topic: Other operators/functions in the standard toolkit
Documentation: IBM Streams Processing Language Standard Toolkit Reference. Documentation for SPL standard toolkit types and functions is located in the $STREAMS_INSTALL/doc/spl/standard-toolkit/builtinfunctions-and-types directory.
Exercise: Merge two sorted streams such that the output is also sorted.

Topic: Windows
Documentation: IBM Streams Processing Language Specification.
Exercise: Sort a file five lines at a time.

Topic: Configs
Documentation: IBM Streams Processing Language Config Reference and IBM Streams Processing Language Specification.
Exercise: Change the logLevel config and look at the log files to see what happens.

Topic: C++ primitive operators
Documentation: IBM Streams Processing Language Toolkit Development Reference and IBM Streams Processing Language Operator Model Reference.
Exercise: Write a C++ primitive operator that extracts groups matched by subexpressions of a regexp.

Topic: Java primitive operators
Documentation: IBM Streams Processing Language Toolkit Development Reference and IBM Streams Processing Language Operator Model Reference.
Exercise: Write a Java primitive operator that extracts groups matched by subexpressions of a regexp.

Topic: Writing native functions
Documentation: IBM Streams Processing Language Toolkit Development Reference.
Exercise: Turn a map into a list of (key,value) tuples.

Topic: Dynamic application composition
Documentation: IBM Streams Processing Language Specification.
Exercise: Run the SchemaSharing sample that ships with SPL.

Topic: Streams debugger
Documentation: IBM Streams Processing Language Streams Debugger Reference.
Exercise: Run the NumberedCat program and interactively drop a tuple.


Notices
This information was developed for products and services offered in the U.S.A. Information about non-IBM products is based on information available at the time of first publication of this document and is subject to change. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to: Intellectual Property Licensing Legal and Intellectual Property Law IBM Japan Ltd. 1623-14, Shimotsuruma, Yamato-shi Kanagawa 242-8502 Japan The following paragraph does not apply to the United Kingdom or any other country/region where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. 
Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.


Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information that has been exchanged, should contact: IBM Canada Limited Office of the Lab Director 8200 Warden Avenue Markham, Ontario L6G 1C7 CANADA Such information may be available, subject to appropriate terms and conditions, including, in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement, or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems, and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. 
IBM has not tested those products and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information may contain examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious, and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs, in source language, which illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided AS IS, without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs. Each copy or any portion of these sample programs or any derivative work must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.

Trademarks
IBM, the IBM logo, ibm.com and InfoSphere are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.

The following terms are trademarks or registered trademarks of other companies:
• Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
• Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
• UNIX is a registered trademark of The Open Group in the United States and other countries.
• Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Other product and service names might be trademarks of IBM or other companies.


