IBM Streams Processing Language Introductory Tutorial
Note Before using this information and the product it supports, read the general information under Notices on page 19.
Edition Notice
This document contains proprietary information of IBM. It is provided under a license agreement and is protected by copyright law. The information contained in this publication does not include any product warranties, and any statements provided in this manual should not be interpreted as such.
You can order IBM publications online or through your local IBM representative.
- To order publications online, go to the IBM Publications Center at www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
- To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at www.ibm.com/planetwide
When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.
Copyright IBM Corporation 2011, 2012. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Summary of changes
This topic describes updates to this documentation for IBM InfoSphere Streams Version 2.0 (all releases).
IBM InfoSphere Streams Version 2.0.0.4: IBM Streams Processing Language Introductory Tutorial
Abstract
This document is an introductory tutorial to the IBM Streams Processing Language (SPL), the programming language for IBM InfoSphere Streams. If you are new to SPL, and want to learn it, this is a good document to read first.
Contents
Summary of changes
Abstract
Chapter 1. Getting started
Chapter 2. Stream processing
Chapter 3. Types and functions
Chapter 4. Composite operators
Chapter 5. Primitive operators
Chapter 6. Next steps
Notices
Index
Let's defer the discussion of how this code works, and instead focus on getting it compiled. You will get the most out of this tutorial if you try out things as you go along. Therefore, the tutorial frequently has instructions for you like the following:
- Make sure you are on a machine that has IBM InfoSphere Streams installed, and thus, has the SPL compiler, sc, available. For more information about installing InfoSphere Streams, see the IBM InfoSphere Streams: Installation and Administration Guide.
- Create a directory called HelloWorld on that machine.
- Create a file in that directory called HelloWorld.spl, and enter the above program text in it, then save it.
- Make sure you are in that directory, and run the compiler, by entering sc -T -M HelloWorld.
The -T flag creates a standalone executable, that is, a program that can run as a single process on a single machine, without requiring a running InfoSphere Streams instance. The -M HelloWorld command-line option specifies that the main composite is called HelloWorld. Each SPL program has one main composite operator. A composite operator is an operator that encapsulates a stream graph, and the stream graph of a main composite can be run as a program.
If you ran the compiler as recommended, and there were no compiler errors, then it created the executable file ./output/bin/standalone. Run the executable file. It should print Hello, world! to the console.
Now we will discuss how the code works. Line 1 declares a composite operator: composite HelloWorld { ... }. Line 2 starts a graph clause, which means that Lines 3-9 describe a stream graph. The graph consists of two operator invocations. Line 3 is the head of the first operator invocation: stream<rstring message> Hi = Beacon() invokes operator Beacon to produce a stream Hi whose tuples have one attribute rstring message. Line 7 is the head of the second operator invocation: () as Sink = Custom(Hi) invokes operator Custom, which reads from stream Hi. The () as Sink part indicates that this operator invocation produces no stream (written as ()), and has the name Sink. Here is a visual representation of this stream graph:
[Figure: stream graph of the HelloWorld program: Beacon -> (stream Hi) -> Sink]
The operator invocations are shown as circles, and the stream is shown as an arrow. The operator invocations are decorated at the bottom right with little scratch-paper icons that indicate internal state: both Beacon and Sink are stateful in this program. The Beacon operator produces data. In this invocation, Line 4, param iterations : 1u;, tells it to produce just one tuple; the u suffix on the number 1 makes it an unsigned integer, since it would not make sense to have a negative number of iterations. In SPL, users or library writers define operators (like Beacon) and their parameters (like iterations) using a common framework; they are not built into the language. Line 5, output Hi : message = "Hello, world!";, assigns the string "Hello, world!" to attribute message of output stream Hi. Moving on to the second operator invocation, the Custom operator provides a clean slate for custom user logic. Line 8, logic onTuple Hi : printStringLn(message), specifies that upon arrival of a tuple on stream Hi, the program should print the string attribute message from the tuple, followed by a newline character \n. At this point, you have compiled and run a first SPL program, and you understand what it does. This program only illustrates a tiny fraction of SPL, but before we move on to more interesting examples, we will take a look at the compiled code. Besides the standalone executable, the compiler also generated several other artifacts. Recall that we started out from just one directory HelloWorld and with just one file HelloWorld.spl. If you look at the directory after compiling, you will find something like the following:
/+ HelloWorld
   /+ HelloWorld.spl        # SPL source code
   /* toolkit.xml           # toolkit index
   /* data                  # directory for data read/written by the program
   /* output                # directory for artifacts generated by the compiler
      /* HelloWorld.adl     # ADL (application description language) file
      /* bin                # compiled binaries
         /* standalone      # the standalone executable from earlier
      /* src                # generated C++ source code
         /* operator        # source code for operator invocations
         /* pe              # source code for PEs (processing elements)
         /* standalone      # source code for the standalone file
         /* type            # source code for types
In this listing, authored files (files written by hand) are annotated with /+ and generated files (files written automatically by the compiler) are annotated with /*. For now, we do not need to cover all the generated artifacts in detail, but you are encouraged to look at a few of them to get a feeling for what they look like. The purpose of this tutorial is to provide an introduction to SPL. To focus on the essentials, it intentionally omits details that you do not immediately need to know, but can look up at your leisure in the more complete and precise reference documentation. This section gave an example of using sc, the SPL compiler. For more information about using the SPL compiler, see the IBM Streams Processing Language Compiler Usage Reference. This section also used the toolkit operators Beacon and Custom, and the toolkit function printStringLn. For more information about library operators and functions, see the IBM Streams Processing Language Standard Toolkit Reference.
[Figure: stream graph of the line-numbering program: FileSource -> (stream Lines) -> Functor -> (stream Numbered) -> Sink]
A stream is a (possibly infinite) sequence of tuples; in the example, Lines and Numbered are streams. A tuple is a data item on a stream. In the example, the stream Lines transports one tuple for each line in the input file. An operator is a reusable stream transformer: each operator invocation transforms some input streams into some output streams. The place where a stream connects to an operator is called a port. Many operators have one input port and one output port (like Functor in the example), but operators can also have zero input ports (FileSource), zero output ports (FileSink), or multiple input or output ports (which we will see in later examples). But back to the line-numbering program. We will call it NumberedCat as an homage to the Unix cat utility that, given the right command-line options, performs the same task. Here is the code:
composite NumberedCat {                                           //1
  graph                                                           //2
    stream<rstring contents> Lines = FileSource() {               //3
      param format : line;                                        //4
            file   : getSubmissionTimeValue("file");              //5
    }                                                             //6
    stream<rstring contents> Numbered = Functor(Lines) {          //7
      logic  state : mutable int32 i = 0;                         //8
             onTuple Lines : i++;                                 //9
      output Numbered : contents = (rstring)i + " " + contents;   //10
    }                                                             //11
    () as Sink = FileSink(Numbered) {                             //12
      param file   : "result.txt";                                //13
            format : line;                                        //14
    }                                                             //15
}                                                                 //16
Like in the previous example, there is a composite operator definition with a graph clause that contains operator invocations. The invocation of FileSource in Lines 3-6 reads one line at a time (param format : line), from a file specified at submission-time (param file : getSubmissionTimeValue("file")). In a little bit, we will see how to supply the file name at submission time. The invocation of Functor in Lines 7-11 maintains a state variable mutable int32 i = 0 which it increments each time a tuple arrives (onTuple Lines : i++). SPL variables are
immutable by default, so without the mutable modifier, the compiler would have prevented us from incrementing i++. The output clause output Numbered : contents = (rstring)i + " " + contents assigns the contents attribute of the output stream by casting the line number i to a string (rstring)i, and concatenating it with the contents attribute of the input stream. As the example shows, an output clause has assignments where the left-hand side is an attribute of the output stream, whereas attribute names in the right-hand side belong to input streams. Finally, the invocation of FileSink on Lines 12-15 writes the results to a file named result.txt. You should try out the following. Create a directory called NumberedCat. Put the example program in a file NumberedCat/NumberedCat.spl. Compile it to a stand-alone executable with sc -T -M NumberedCat. Put the following text in a file NumberedCat/data/catFood.txt: The Unix utility "cat" is so called because it can con"cat"enate files. Our program behaves like "cat -n", listing one file and numbering lines. When we run the program, we need to supply the input file name as a submission-time value. The FileSource operator expects a file name that is relative to the NumberedCat/data directory. Therefore, we run the program with ./output/bin/standalone file="catFood.txt". Look at the NumberedCat/data directory. If everything went fine, then the program created a file called result.txt that contains the numbered lines of catFood.txt. So far, we have run all our programs in stand-alone mode. That is common during testing and debugging. But a major strength of InfoSphere Streams is that it can run programs on a cluster of workstations. To do this, we need to compile without the -T,--standalone-application option, and then create an instance of the runtime into which we submit the job. Please try the following sequence of commands:
sc -M NumberedCat                                                # compile
streamtool mkinstance --template developer                       # make a runtime instance
streamtool startinstance                                         # start the runtime instance
streamtool submitjob -P file=catFood.txt output/NumberedCat.adl  # submit the job
streamtool lsjobs                                                # list running jobs
# wait until data/result.txt contains the numbered lines of data/catFood.txt
streamtool canceljob 0                                           # cancel the job
streamtool stopinstance                                          # stop the runtime instance
streamtool rminstance                                            # remove the runtime instance
If everything went well, this accomplished the same result as running the program stand-alone. If anything went wrong, consult your system administrator, or try to diagnose the problem yourself by using the streamtool getlog/viewlog commands. As mentioned before, the best way to learn a language is to write and run programs in it, so now is a good time to ensure that you have the right setup to do that. Note how the streamtool submitjob command accepts submission-time values with the -P option, and uses the .adl file (application description language) to figure out which operators to submit. This section illustrated the flavor of SPL as a streaming language, and gave you a taste for how to run programs on an instance of the IBM InfoSphere Streams distributed runtime. We saw three new standard toolkit operators FileSource, Functor, and FileSink. For more information about standard toolkit operators, see the IBM Streams Processing Language Standard Toolkit Reference. To learn more about working with the distributed runtime, type streamtool man, which contains a
plethora of information about commands like submitjob and family. To learn more about SPL, see the IBM Streams Processing Language Specification.
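The per-tuple behaviour of the Functor invocation in NumberedCat can also be pictured outside SPL. The following C++ sketch is not generated code and does not use any InfoSphere Streams API; the names LineNumberer and numberLines are hypothetical and exist only for this illustration. It assumes, as the program's cat -n behaviour suggests, that the counter is incremented before the output attribute is assigned, so the first line is numbered 1.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for the Functor invocation. The member i_ plays the role of the
// SPL state variable "mutable int32 i". (Hypothetical class, not SPL API.)
class LineNumberer {
public:
    // Called once per incoming line, like the "onTuple Lines" clause
    // followed by the output assignment.
    std::string onTuple(const std::string& contents) {
        ++i_;                                        // i++
        return std::to_string(i_) + " " + contents;  // (rstring)i + " " + contents
    }

private:
    std::int32_t i_ = 0;
};

// Hypothetical helper: pushes every line through one LineNumberer, in order,
// the way the stream Lines feeds the Functor one tuple at a time.
std::vector<std::string> numberLines(const std::vector<std::string>& lines) {
    LineNumberer functor;
    std::vector<std::string> numbered;
    numbered.reserve(lines.size());
    for (const std::string& line : lines) {
        numbered.push_back(functor.onTuple(line));
    }
    assert(numbered.size() == lines.size());  // one output tuple per input tuple
    return numbered;
}
```

Because the counter lives in the object rather than in a local variable, it survives from one call to the next, which is the property that makes the SPL invocation stateful.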
[Figure: stream graph of the WordCount program: FileSource -> (stream Data) -> Functor -> (stream OneLine) -> Counter]
The FileSource operator invocation reads a file, sending lines on the Data stream. The Functor operator invocation counts the lines and words for each individual line of Data, sending the statistics on the OneLine stream. Unlike the invocation of the Functor operator in Chapter 2, Stream processing, on page 3, this invocation of the Functor is stateless; it has no side-effects or dependencies between tuples. Finally, the Counter operator invocation aggregates the statistics for all lines in the file, and prints them at the end. Before we look at the main composite operator, let's define some helpers. We will use a type LineStat for the statistics about a line; a function countWords(rstring line) to count the words in a line; and a function addM(mutable LineStat x, LineStat y) to add two LineStat values and store the result in x. Here is the definition of these helpers:
type LineStat = tuple<int32 lines, int32 words>;   //1
int32 countWords(rstring line) {                   //2
  return size(tokenize(line, " \t", false));       //3
}                                                  //4
void addM(mutable LineStat x, LineStat y) {        //5
  x.lines += y.lines;                              //6
  x.words += y.words;                              //7
}                                                  //8
You can put this code in a file called WordCount/Helpers.spl. Line 1 defines type LineStat to be a tuple with two attributes for counting lines and words. Lines 2-4 define function countWords by using the standard toolkit function tokenize to split the line on spaces and tabs (" \t"), and then using the standard toolkit function size to count the resulting fragments. Lines 5-8 define function addM. As mentioned previously, SPL variables are immutable by default, so we had to explicitly declare parameter x as mutable to enable the function to add values to its attributes. Having the mutable modifier in the signature of the function makes it clear to the user what kind of side-effects the function might have, and the compiler can also use this information for optimization. Now we are ready to define the main composite operator. You can put the following code in a file called WordCount/WordCount.spl.
composite WordCount {                                                   //1
  graph                                                                 //2
    stream<rstring line> Data = FileSource() {                          //3
      param file   : getSubmissionTimeValue("file");                    //4
            format : line;                                              //5
    }                                                                   //6
    stream<LineStat> OneLine = Functor(Data) {                          //7
      output OneLine : lines = 1, words = countWords(line);             //8
    }                                                                   //9
    () as Counter = Custom(OneLine) {                                   //10
      logic state : mutable LineStat sum = { lines = 0, words = 0 };    //11
            onTuple OneLine : addM(sum, OneLine);                       //12
            onPunct OneLine : if (currentPunct() == Sys.FinalMarker)    //13
                                println(sum);                           //14
    }                                                                   //15
}                                                                       //16
By this point in the tutorial, you should be able to read and understand much of this code. Note how type LineStat is used both in Line 7 as a schema for stream OneLine, and in Line 11 as a type for variable sum. Line 12 adds the statistics from the newest tuple in stream OneLine into the accumulator variable sum by using the helper function addM defined before. Lines 13-14 illustrate punctuation-handling, which is a new feature that we have not seen before. A punctuation is a control signal that appears interleaved with the tuples on a stream. The logic onPunct OneLine clause gets triggered each time a punctuation arrives on stream OneLine. If the punctuation is Sys.FinalMarker, that indicates that the end of the stream has been reached. In our example, the FileSource operator sends a FinalMarker at the end of the file, and the Functor operator forwards it after sending statistics for the last line. Compile and run the program as a standalone application, as you learned in the previous sections. You will need to provide an input file in the data directory, and provide the file name as a submission-time value on the command-line of the standalone application. The program should print the total statistics to the console. When you learn a new programming language and start writing programs in it, you are bound to encounter error messages. These can be baffling, because you thought your program was fine, yet the compiler objected to something in it. Therefore, a good exercise when learning a language is to make some intentional errors, and familiarize yourself with the error messages. That way, when you see the same errors again "by accident", you will already be somewhat familiar with them. So let's inject an error into the example program. Go to file WordCount/Helpers.spl, and remove the mutable modifier from the signature of function addM. In other words, Line 5 should read void addM(LineStat x, LineStat y). Recompile by doing sc -T -M WordCount. 
You should get something like the following:
Helpers.spl:6:11: CDISP0378E ERROR: The operand modified by += must be mutable.
Helpers.spl:7:11: CDISP0378E ERROR: The operand modified by += must be mutable.
The compiler complains because the += operator tries to modify the parameter x, but it has not been declared as mutable. In this section, you saw how to define your own types and functions, which enables reuse and improves readability. For more information about defining your own types and functions, see the IBM Streams Processing Language Specification. Types and functions form a sub-language that you can easily learn without any other materials as prerequisites. On the contrary, they serve as the foundation for more advanced language features like the ones we will cover in the remaining sections of this tutorial.
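For readers who know C++ better than SPL, the helpers from this section can be mirrored in a few lines of plain C++. This is only a sketch for comparison, not code produced by the SPL compiler. It assumes that the false argument passed to tokenize means that empty tokens are discarded, so that consecutive delimiters do not produce extra words; countFile is a hypothetical driver that stands in for the Counter invocation, with "end of the input vector" playing the role of the final punctuation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Counterpart of "type LineStat = tuple<int32 lines, int32 words>".
struct LineStat {
    std::int32_t lines = 0;
    std::int32_t words = 0;
};

// Counts maximal runs of characters other than space and tab. Under the
// assumption stated above, this equals size(tokenize(line, " \t", false)).
std::int32_t countWords(const std::string& line) {
    std::int32_t words = 0;
    bool inWord = false;
    for (char c : line) {
        const bool isDelimiter = (c == ' ' || c == '\t');
        if (!isDelimiter && !inWord) {
            ++words;  // first character of a new word
        }
        inWord = !isDelimiter;
    }
    return words;
}

// x is a non-const reference, so the signature shows that x is modified;
// this mirrors the "mutable" modifier on the first parameter of addM.
void addM(LineStat& x, const LineStat& y) {
    x.lines += y.lines;
    x.words += y.words;
}

// Hypothetical driver: builds one LineStat per line (like the Functor) and
// accumulates it (like onTuple in Counter); the total is returned once the
// input is exhausted (like the onPunct clause for the final marker).
LineStat countFile(const std::vector<std::string>& lines) {
    LineStat sum;
    for (const std::string& line : lines) {
        LineStat oneLine;
        oneLine.lines = 1;
        oneLine.words = countWords(line);
        addM(sum, oneLine);
    }
    assert(sum.lines == static_cast<std::int32_t>(lines.size()));
    return sum;
}
```

Declaring the second parameter of addM as const gives the C++ compiler the same kind of check that the SPL compiler performed in the error-injection exercise: an attempt to modify y would be rejected at compile time.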
[Figure: the Uniq composite operator: input port In -> Custom -> output port Out]
To make things more interesting, the Uniq operator has a parameter type $key, which is the type containing the subset of attributes of the input tuple that are used to determine uniqueness. If two consecutive tuples are identical for these attributes, the second one is dropped even if it differs in some other attributes. The following code implements operator Uniq:
namespace my.util;                                //1
public composite Uniq(output Out; input In) {     //2
  param                                           //3
    type $key;                                    //4
  graph                                           //5
    stream<In> Out = Custom(In) {                 //6
      logic state : {                             //7
              mutable boolean first = true;       //8
              mutable $key prev;                  //9
            }                                     //10
            onTuple In : {                        //11
              $key curr = ($key)In;               //12
              if (first || prev != curr) {        //13
                submit(In, Out);                  //14
                first = false;                    //15
                prev = curr;                      //16
              }                                   //17
            }                                     //18
    }                                             //19
}                                                 //20
Line 1, namespace my.util, specifies a namespace for the operator. That means that the operator's full name is really my.util::Uniq. You should put the above source code in a file Uniq/my.util/Uniq.spl. Line 2, public composite Uniq(output Out; input In), specifies that the operator is public, meaning it can be used from other namespaces; and that it has one output port Out and one input port In. Lines 3 and 4 declare the mandatory formal parameter $key, which is a type. Line 12, $key curr = ($key)In;, declares a local variable curr of type $key, and initializes it with
the expression ($key)In, which takes the current tuple from input stream In and casts it to type $key, in other words, drops any attributes that are not relevant for the comparison with the previous tuple. We have to consider one special case: for the very first tuple, there is no previous tuple, so we always treat it as unique. Now that we have defined our own operator my.util::Uniq, we need to test it. To do that, we will generate a stream All of tuples that have some duplicates, and send them through the Uniq operator to get the stream Some of unique tuples. We will print both All and Some so we can inspect whether the operator actually worked as expected. The stream graph for the test driver is:
[Figure: Beacon -> (stream All) -> Uniq -> (stream Some) -> PrintSome; stream All also feeds PrintAll]
Figure 5. Stream graph of the test driver for the Uniq operator.
Note that as far as the driver is concerned, Uniq is just an ordinary operator, whose invocation can serve as a vertex in a stream graph just like any of the other operators we have used before. Note also that a single stream from a single output port, like All in the example, can be used as the input to multiple operators; in this case, all tuples are duplicated, once for each recipient. The following code implements the test driver:
use my.util::Uniq;                                                           //1
composite Main {                                                             //2
  type                                                                       //3
    KeyType = tuple<int32 j>;                                                //4
  graph                                                                      //5
    stream<int32 i, int32 j> All = Beacon() {                                //6
      logic  state : mutable int32 n = 0;                                    //7
      param  iterations : 10u;                                               //8
      output All : i = ++n, j = n / 3;                                       //9
    }                                                                        //10
    stream<All> Some = Uniq(All) {                                           //11
      param key : KeyType;                                                   //12
    }                                                                        //13
    () as PrintAll = Custom(All) {                                           //14
      logic onTuple All : printString("All" + (rstring)All + "\n");          //15
    }                                                                        //16
    () as PrintSome = Custom(Some) {                                         //17
      logic onTuple Some : printString("Some" + (rstring)Some + "\n");       //18
    }                                                                        //19
}                                                                            //20
Note how Lines 11-13 invoke our operator Uniq, passing an actual parameter param key : KeyType, which indicates that only attribute j is to be used in the uniqueness test. Put this code into a file Uniq/Main.spl, and run sc -T -M Main to compile it as a stand-alone application. Now run ./output/bin/standalone. You should see the following output:
All {i=1,j=0}
Some {i=1,j=0}
All {i=2,j=0}
All {i=3,j=1}
Some {i=3,j=1}
All {i=4,j=1}
All {i=5,j=1}
All {i=6,j=2}
Some {i=6,j=2}
All {i=7,j=2}
All {i=8,j=2}
All {i=9,j=3}
Some {i=9,j=3}
All {i=10,j=3}
If you look just at All lines, you see that the i attribute just counts up iterations from 1 to 10, while the j attribute is always i/3 rounded down to the nearest integer. Since we used type tuple<int32 j> as the uniqueness key, only every third tuple is considered unique, and therefore, Some lines show only every third tuple. In this section, you have seen how to define your own composite operators to encapsulate useful reusable functionality. You have also seen how the Beacon operator from the standard toolkit can serve as a useful workload generator for testing. We recommend that you test your own operators with test drivers like the one shown in this example. Besides helping you to iron out bugs during development, drivers like these are also useful to keep around later for regression testing. SPL composite operators are more powerful than the example in this section illustrates. They can encapsulate not just a single operator, but a whole graph; they can have multiple output and input ports; and they can have more parameters, of different kinds besides types. For more information about composite operators, see the IBM Streams Processing Language Specification.
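The filtering rule inside Uniq is small enough to restate in plain C++, which can help when checking the expected output by hand. The sketch below is not SPL and uses no Streams API; Record, uniqByJ, and makeWorkload are hypothetical names, and the key is fixed to attribute j instead of being a type parameter as in the SPL operator.

```cpp
#include <cassert>
#include <vector>

// Stands in for the tuple type tuple<int32 i, int32 j>.
struct Record {
    int i;
    int j;
};

// Keeps a record only if its key (attribute j) differs from the key of the
// previous record. The first record has no predecessor and is always kept.
std::vector<Record> uniqByJ(const std::vector<Record>& in) {
    std::vector<Record> out;
    bool first = true;
    int prev = 0;
    for (const Record& r : in) {
        const int curr = r.j;  // plays the role of ($key)In
        if (first || prev != curr) {
            out.push_back(r);  // submit(In, Out)
            first = false;
            prev = curr;
        }
    }
    return out;
}

// Hypothetical workload mirroring the Beacon invocation in the test driver:
// i counts from 1 to n, and j is i / 3 using integer division.
std::vector<Record> makeWorkload(int n) {
    assert(n >= 0);
    std::vector<Record> all;
    for (int i = 1; i <= n; ++i) {
        all.push_back({i, i / 3});
    }
    return all;
}
```

Calling uniqByJ(makeWorkload(10)) keeps the records with i equal to 1, 3, 6, and 9, which are the tuples that appear on the Some lines of the output shown earlier.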
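[Figure: Beacon -> (stream Input) -> RRSplit; RRSplit -> (stream A0) -> Functor -> (stream B0) -> Pair; RRSplit -> (stream A1) -> Functor -> (stream B1) -> Pair; Pair -> (stream Output) -> Writer]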
Figure 6. Stream graph of the test driver for the RoundRobinSplit operator.
Graphs like these are called split-joins, and are a common cause of non-determinism in streaming applications, because data may be processed at different speeds along the different paths. However, some applications require deterministic behavior, which is also useful for testing purposes. Our new RoundRobinSplit operator, together with the Pair operator from the standard library, provides a simple way to implement a deterministic split-join without giving up much of the performance advantage afforded by the parallelism in the middle portion of the stream graph. Specifically, RoundRobinSplit deterministically alternates between sending data to each of its output ports, and Pair deterministically alternates between receiving data from each of its input ports. Here is the code for this stream graph:
use my.util::RoundRobinSplit;                                                   //1
composite Main {                                                                //2
  graph                                                                         //3
    stream<int32 count> Input = Beacon() {                                      //4
      logic  state : mutable int32 n = 0;                                       //5
      param  iterations : 10u;                                                  //6
      output Input : count = n++;                                               //7
    }                                                                           //8
    (stream<int32 count> A0; stream<int32 count> A1) = RoundRobinSplit(Input) { //9
      param batch : 2u;                                                         //10
    }                                                                           //11
    stream<int32 count, int32 path> B0 = Functor(A0) {                          //12
      output B0 : path = 0;                                                     //13
    }                                                                           //14
    stream<int32 count, int32 path> B1 = Functor(A1) {                          //15
      output B1 : path = 1;                                                     //16
    }                                                                           //17
    stream<int32 count, int32 path> Output = Pair(B0; B1) {}                    //18
    () as Writer = FileSink(Output) {                                           //19
      param file  : "/dev/stdout";                                              //20
            flush : 1u;                                                         //21
    }                                                                           //22
}                                                                               //23
Line 9, (stream<int32 count> A0; stream<int32 count> A1) = RoundRobinSplit(Input), invokes operator RoundRobinSplit to produce two output streams A0 and A1. The operator takes a parameter param batch : 2u that indicates that it alternates after every two tuples. Line 18 invokes operator Pair on two input streams B0 and B1, with the code stream<int32 count, int32 path> Output = Pair(B0; B1). For now, put this code into a file RoundRobinSplit/Main.spl. However, don't try to compile it yet; we need to implement the operator RoundRobinSplit first. Create a directory RoundRobinSplit/my.util/RoundRobinSplit, and change into that directory. Now, run spl-make-operator --kind c++. That will generate several skeleton files for you, including an operator model RoundRobinSplit.xml and two code generation templates (.cgt files), one for a header file RoundRobinSplit_h.cgt and one for a C++ implementation file RoundRobinSplit_cpp.cgt. When you write more sophisticated primitive operators, you will often need to edit the XML operator model, but in this case, the operator is simple enough so you do not need to change the operator model at all. Open the header file code generation template RoundRobinSplit_h.cgt. You will see a class definition with several method declarations. Remove most methods except for the constructor and process(Tuple & tuple, uint32_t port). Add two instance fields Mutex _mutex and uint32_t _count. You should end up with the following code in RoundRobinSplit_h.cgt:
#pragma SPL_NON_GENERIC_OPERATOR_HEADER_PROLOGUE      //1
class MY_OPERATOR : public MY_BASE_OPERATOR {         //2
public:                                               //3
  MY_OPERATOR();                                      //4
  void process(Tuple & tuple, uint32_t port);         //5
private:                                              //6
  Mutex _mutex;                                       //7
  uint32_t _count;                                    //8
};                                                    //9
#pragma SPL_NON_GENERIC_OPERATOR_HEADER_EPILOGUE      //10
Next, open the C++ implementation file code generation template RoundRobinSplit_cpp.cgt. Remove most methods except for the constructor and process(Tuple & tuple, uint32_t port). Implement these methods as shown in the following listing of RoundRobinSplit_cpp.cgt:
#pragma SPL_NON_GENERIC_OPERATOR_IMPLEMENTATION_PROLOGUE       //1
MY_OPERATOR::MY_OPERATOR() : _count(0) {}                      //2
void MY_OPERATOR::process(Tuple & tuple, uint32_t port) {      //3
  uint32_t const nOutputs = getNumberOfOutputPorts();          //4
  uint32_t const batchSize = getParameter("batch");            //5
  AutoPortMutex apm(_mutex, *this);                            //6
  uint32 outputPort = (_count / batchSize) % nOutputs;         //7
  _count = (_count + 1) % (batchSize * nOutputs);              //8
  assert(outputPort < nOutputs);                               //9
  submit(tuple, outputPort);                                   //10
}                                                              //11
#pragma SPL_NON_GENERIC_OPERATOR_IMPLEMENTATION_EPILOGUE       //12
The constructor just initializes the _count instance variable to zero. The process method queries the runtime APIs for the number of output ports (Line 4) and the batch size parameter (Line 5); acquires the mutex to guard against concurrent
manipulation of the _count instance variable (Line 6); determines the output port (Line 7), updates _count (Line 8), and submits the input tuple to the appropriate output port (Line 10). The mutex is necessary because without it, if there are two threads T1 and T2, then T1's invocation of process might be interrupted in the middle of Line 8, after reading the old value of _count but before writing the new value; then T2 might call process and update _count; and finally, T1 might resume and overwrite T2's update to _count.
Now, we are finally ready to compile the application. Change to the RoundRobinSplit directory and run the SPL compiler with sc -T -M Main. The SPL compiler will invoke the C++ compiler to compile the instance of the RoundRobinSplit operator: the sources are the files A0.cpp and A0.h in directory RoundRobinSplit/output/src/operator, and the object file is RoundRobinSplit/output/build/operator/A0.o. Run the application by changing to directory RoundRobinSplit and executing ./output/bin/standalone. You will get the following output:
0,0
2,1
1,0
3,1
4,0
6,1
5,0
7,1
Each line shows the count and path attributes separated by a comma. Since the split uses a batch size of two but the join uses a batch size of one, the counts (left column) have a progression of 0,2,1,3,4,6,5,7 whereas the paths (right column) just alternate between 0,1,0,1,0,1,0,1. This output is deterministically repeatable, independent of the processing speed of the two paths. It is instructive to introduce an error in the C++ code to see what happens. If we change the call on Line 10 of RoundRobinSplit_cpp.cgt to submit(outputPort, tuple), the C++ compiler reports an error message with the correct file name and line number:
my.util/RoundRobinSplit/RoundRobinSplit_cpp.cgt:10: error: no matching function for call to 'SPL::_Operator::A0::submit(SPL::uint32&, SPL::Tuple&)'
note: candidates are: virtual void SPL::Operator::submit(SPL::Tuple&, uint32_t)
note:                 virtual void SPL::Operator::submit(const SPL::Tuple&, uint32_t)
note:                 void SPL::Operator::submit(const SPL::Punctuation&, uint32_t)
This section barely scratched the surface of developing primitive operators in SPL. There is a rich API for generating specialized code for performance, and for compile-time error checking on things like the number and types of ports. For more information about developing primitive operators, see the IBM Streams Processing Language Toolkit Development Reference. You may also want to take a look at the IBM Streams Processing Language Operator Model Reference to learn about the XML file for the primitive operator.
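The port-selection arithmetic in process and the resulting output order can be checked without a Streams installation. The following single-threaded C++ sketch reuses the two expressions from RoundRobinSplit_cpp.cgt but nothing else from the SPL runtime; RoundRobinRouter and simulate are hypothetical names. The pairing step assumes that Pair emits one tuple from each input port in turn and never emits a tuple that has no partner on the other port, which is consistent with the eight output lines shown above for ten input tuples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Same arithmetic as process() in the operator, minus the mutex: this sketch
// is single-threaded, so there is no concurrent access to count_.
class RoundRobinRouter {
public:
    RoundRobinRouter(std::uint32_t nOutputs, std::uint32_t batchSize)
        : nOutputs_(nOutputs), batchSize_(batchSize), count_(0) {
        assert(nOutputs > 0 && batchSize > 0);
    }

    // Returns the output port for the next tuple and advances the counter.
    std::uint32_t nextPort() {
        const std::uint32_t port = (count_ / batchSize_) % nOutputs_;
        count_ = (count_ + 1) % (batchSize_ * nOutputs_);
        assert(port < nOutputs_);
        return port;
    }

private:
    std::uint32_t nOutputs_;
    std::uint32_t batchSize_;
    std::uint32_t count_;
};

// Hypothetical simulation of the test driver with two paths: split the
// counts 0..iterations-1, tag each with its path, then pair them up.
// Each result element is (count, path).
std::vector<std::pair<int, int>> simulate(int iterations, std::uint32_t batch) {
    RoundRobinRouter router(2, batch);
    std::vector<int> paths[2];
    for (int n = 0; n < iterations; ++n) {
        paths[router.nextPort()].push_back(n);
    }
    // Assumed Pair behaviour: alternate between the ports, stopping when
    // either side has no tuple left to contribute.
    std::vector<std::pair<int, int>> output;
    const std::size_t pairs = std::min(paths[0].size(), paths[1].size());
    for (std::size_t k = 0; k < pairs; ++k) {
        output.push_back({paths[0][k], 0});
        output.push_back({paths[1][k], 1});
    }
    return output;
}
```

With ten iterations and a batch size of two, path 0 receives the counts 0, 1, 4, 5, 8, 9 and path 1 receives 2, 3, 6, 7, so only four pairs can be formed. The counts 8 and 9 stay unpaired, which is why the listing above stops at 7.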
Other operators/functions in the standard toolkit: IBM Streams Processing Language Standard Toolkit Reference. Documentation for SPL standard toolkit types and functions is located in the $STREAMS_INSTALL/doc/spl/standard-toolkit/builtinfunctions-and-types directory.
Windows: IBM Streams Processing Language Specification.
Configs: IBM Streams Processing Language Config Reference and IBM Streams Processing Language Specification.
IBM Streams Processing Language Toolkit Development Reference and IBM Streams Processing Language Operator Model Reference.
IBM Streams Processing Language Toolkit Development Reference and IBM Streams Processing Language Operator Model Reference.
IBM Streams Processing Language Toolkit Development Reference.
IBM Streams Processing Language Specification.
IBM Streams Processing Language Streams Debugger Reference.
Sort a file five lines at a time.
Change the logLevel config and look at the log files to see what happens.
Write a C++ primitive operator that extracts groups matched by subexpressions of a regexp.
Write a Java primitive operator that extracts groups matched by subexpressions of a regexp.
Turn a map into a list of (key,value) tuples.
Run the SchemaSharing sample that ships with SPL.
Run the NumberedCat program and interactively drop a tuple.
Notices
This information was developed for products and services offered in the U.S.A. Information about non-IBM products is based on information available at the time of first publication of this document and is subject to change.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan

The following paragraph does not apply to the United Kingdom or any other country/region where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information that has been exchanged, should contact:

IBM Canada Limited
Office of the Lab Director
8200 Warden Avenue
Markham, Ontario L6G 1C7
CANADA

Such information may be available, subject to appropriate terms and conditions, including, in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement, or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems, and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information may contain examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious, and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs, in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided AS IS, without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work must include a copyright notice as follows:

(your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.
Trademarks
IBM, the IBM logo, ibm.com and InfoSphere are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.

The following terms are trademarks or registered trademarks of other companies:

- Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
- Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
- UNIX is a registered trademark of The Open Group in the United States and other countries.
- Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Other product and service names might be trademarks of IBM or other companies.