pipeline streaming #88

Closed
gty929 wants to merge 8 commits into bitfield:master from gty929:async

Conversation

gty929 commented Nov 26, 2021

This PR is an attempt to tackle issue #34. I used io.Pipe() in EachLine() to implement streaming, which is simple and efficient.

As an example, the following program outputs 'toc' five times, at one-second intervals:

script.Slice(make([]string, 5)).ExecForEach("bash -c 'echo tic; sleep 1'").Replace("tic", "toc").Stdout()
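
For reference, the mechanism is roughly the following. This is a minimal sketch of the io.Pipe() approach rather than the PR's actual diff, and streamEachLine is an illustrative name:

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// streamEachLine applies fn to each line of r and returns a reader
// immediately; a goroutine feeds results through an io.Pipe, so the next
// stage can consume output before this stage has finished reading its input.
func streamEachLine(r io.Reader, fn func(line string, w io.Writer)) io.Reader {
	pr, pw := io.Pipe()
	go func() {
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			fn(scanner.Text(), pw)
		}
		// a nil error closes the pipe cleanly; a non-nil error is
		// surfaced to whoever reads from the other end
		pw.CloseWithError(scanner.Err())
	}()
	return pr
}

func main() {
	src := strings.NewReader("tic\ntic\ntic\n")
	out := streamEachLine(src, func(line string, w io.Writer) {
		fmt.Fprintln(w, strings.ReplaceAll(line, "tic", "toc"))
	})
	io.Copy(os.Stdout, out) // each "toc" appears as soon as it is produced
}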

Unfortunately, the change is not backward compatible. A user must append a sink method at the end of the pipeline if they want to wait until the program finishes. For this reason, I added a Wait() method and slightly modified the test for ExecForEach(). (The code passes all other tests.) Another problem is that the strings.Builder in EachLine() cannot be reset, since the stream is append-only.
Currently, Error() and ExitStatus() return the intermediate status of a running pipe, since they are not sink methods.

One possible solution is to add a bool flag to the Pipe struct to enable or disable streaming (disabled by default, with a Stream() source method to enable it). Should I go ahead?

bitfield (Owner) commented

Hey @gty929! This looks interesting!

I'm still not sure what problem this is intended to solve, though. Perhaps you could give an example? (I presume that you don't actually need to write a script that prints toc at one-second intervals.)

gty929 commented Nov 26, 2021

Hi @bitfield! I used the example above just to show that streaming works. There have already been quite a few discussions in issues #34, #59, and #78 about the use cases, and about why we shouldn't leave out the streaming functionality.

E.g., as @besi1z and @xxxserxxx mentioned, it would be great if the output of Exec() could be displayed in real time. As for me, I hate it when a long-running command produces nothing on the screen, since I have no idea when the program will finish or whether something is stuck.

The second reason is time and memory efficiency, as @posener and @kepkin mentioned. By making the pipeline asynchronous, the output of one stage can be consumed by the next stage as soon as it is produced. For example, to find all the TODOs in 10,000 files, it's really not necessary to read all the files into memory.

By the way, I just rewrote Exec() so that it supports streaming as well. You can check it with @besi1z's example:

script.Exec("ping -c 100 www.google.com").Stdout()

If this effort is worthwhile, I will write more integration tests for the streaming feature.

bitfield (Owner) commented

> For example, to find all the TODOs in 10,000 files, it's really not necessary to read all the files into memory.

Indeed, and we don't! If you look at the example grep implementation, we have:

	script.Stdin().Match(os.Args[1]).Stdout()

The Match filter reads one line at a time and checks it for a match, so no matter how much data is supplied, we never hold more than a line at a time in memory.
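
The pattern is roughly the following sketch (illustrative, not script's actual Match source):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// one line in memory at a time, no matter how large the input is
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, os.Args[1]) {
			fmt.Println(line)
		}
	}
}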

Regarding your other point, if you want to run a command like ping -c 100 www.google.com, that sounds like something that's best done by running the ping command directly. There's no value in simply executing that command from a Go program and sending its results to the standard output, so I don't think the ability to do that justifies the extra complexity that this PR would introduce.

gty929 commented Dec 1, 2021

I noticed that there is already a "find TODO" example in the library. The program prints the filename and the line number before each line, so the implementation is a bit more complex than the grep example.

package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"

	"github.com/bitfield/script"
)

func main() {
	listPath := "."
	if len(os.Args) > 1 {
		listPath = os.Args[1]
	}
	// filter out hidden directories and files
	filterFiles := regexp.MustCompile(`^\..*|/\.`)
	files := script.FindFiles(listPath).RejectRegexp(filterFiles)
	// for each file, scan its lines and prefix matches with filename:line
	content := files.EachLine(func(filePath string, builderFile *strings.Builder) {
		p := script.File(filePath)
		lineNumber := 1
		p.EachLine(func(str string, build *strings.Builder) {
			if strings.Contains(str, "todo") {
				builderFile.WriteString(fmt.Sprintf("%s:%d %s\n", filePath, lineNumber, strings.TrimSpace(str)))
			}
			lineNumber++
		})
	})
	content.Stdout()
}

Even though the program does not read in the file contents all at once, it still accumulates all the output before sending it to Stdout, which I believe is not ideal. For example, a user may run a search program like this on a large project or file system, and lose patience if no result is printed for a long time. Worse, if the user makes a mistake, e.g., writes the wrong regular expression in the if statement, they won't notice until all the files have been scanned. (In the worst case, if the expression matches every line, there might be an out-of-memory problem.)

I just wrote another example for my proposed streaming functionality. For multi-threaded programs, developers usually need to run a test suite many times to detect concurrency bugs. With the -count flag, go test can be run multiple times automatically. A problem, however, is that the command does not output anything unless it encounters a failure, so it's hard for developers to track progress. (Adding the -v flag, on the other hand, makes the program output too much information.) The following example solves this problem by printing progress information after each round of testing.

// This program runs the script library's tests 50 times and prints progress nicely.
package main

import (
	"fmt"

	"github.com/bitfield/script"
)

func main() {
	round, step := 10, 5
	progressInfo := make([]string, round)
	for i := 0; i < round; i++ {
		progressInfo[i] = fmt.Sprintf("------ Done %v / %v ------", (i+1)*step, round*step)
	}
	cmd := fmt.Sprintf("bash -c 'go test -count %v github.com/bitfield/script; echo {{.}}'", step)
	// with Stream(), the program can print to stdout in real time
	script.Slice(progressInfo).Stream().ExecForEach(cmd).Stdout()
}

Here's a sample output, which is printed out line by line:

ok      github.com/bitfield/script      1.538s
------ Done 5 / 50 ------
ok      github.com/bitfield/script      1.479s
------ Done 10 / 50 ------
ok      github.com/bitfield/script      1.440s
------ Done 15 / 50 ------
...

Obviously, we cannot write a program like this in script without the streaming feature.

I agree that we should avoid unnecessary complexity, but I do think this feature is worth the effort. As I mentioned above, at least four other people have requested it or have tried to implement it on their own.

To keep the library backward compatible, we just need to add two more public functions: a Stream() function that makes the subsequent pipe stages execute in streaming mode, as demonstrated above, and a Synchronize() function, which stops the streaming. For clients, the learning cost of the streaming feature is pretty low (and they can simply choose not to use it). In fact, a client who is familiar with pipes in UNIX systems may find the streaming version more natural.
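
To illustrate, usage might look like this (Stream() and Synchronize() are the proposed functions, not part of the current library, and the filenames are made up):

// streaming from Stream() onward: each matching line prints as soon as it is read
script.File("server.log").Stream().Match("ERROR").Stdout()

// Synchronize() drains the stream, restoring the current synchronous
// semantics for the stages that follow it
script.File("server.log").Stream().Match("ERROR").Synchronize().CountLines()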

From the developers' perspective, the cost of this change is almost entirely one-off. One need not worry about streaming when writing source and sink functions. For filter functions, no change is necessary either, as long as the filter returns p.EachLine(...) directly. For the remaining few filters, contributors just need to remember to set the "streaming" flag when a new pipe is created.

bitfield (Owner) commented Dec 1, 2021

Yes, this sounds good! Why would we need Stream/Synchronize functions, though? Why not make all pipes streaming by default?

gty929 commented Dec 1, 2021

If we use streaming by default, there's no way to make everything backward compatible.

First, by the definition of streaming, a function cannot perform a full error check on its previous stage before it starts executing. The best assurance we can offer is that each stage sets the error field before closing the writer of its io.Pipe(), so that the error is caught by the next stage and propagates to the end of the pipeline (see commit 886d0bb).
If the program is read-only (e.g., counting lines), that's not a big deal, but if the pipeline contains a sensitive write operation, a user may want to ensure that each stage executes correctly before moving on to the next. In that case, a Synchronize() method will be helpful.
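
Here is a minimal, self-contained sketch of that propagation mechanism, using plain io.Pipe() rather than the commit's exact code:

package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	pr, pw := io.Pipe()
	go func() {
		fmt.Fprintln(pw, "partial output")
		// closing the writer with a non-nil error hands the error
		// to the reading side of the pipe
		pw.CloseWithError(errors.New("stage failed"))
	}()
	// the next stage consumes data until the error surfaces
	if _, err := io.Copy(io.Discard, pr); err != nil {
		fmt.Println("downstream saw:", err) // downstream saw: stage failed
	}
}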

Second, there's no way for a filter function to know whether it is the last stage in a pipeline. Ideally, a user should end a pipeline with a sink function (which always synchronizes the pipeline), but that's not always the case. For example, a user may call the EachLine() method just to update a variable in the program, and hence ignore the method's return value. If streaming were the default, EachLine() would return immediately, and the Go variable would likely not have been updated yet. On the other hand, such asynchronous behavior could be desirable in some cases, e.g., a user may want to split a long pipeline into several pieces while keeping the streaming property. Therefore, I think it's better to leave the choice to the users.
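
For example, here is a sketch of that pitfall, assuming a hypothetical streaming-by-default EachLine() (big.txt is made up):

var count int
script.File("big.txt").EachLine(func(line string, out *strings.Builder) {
	count++
})
// if EachLine() returned immediately, the pipeline might still be running
// here, so this could print a partial (or even zero) count
fmt.Println(count)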

Finally, as @posener mentioned in issue #34, the behavior of Exec and EachLine changes slightly when streaming is turned on. In Exec, we can no longer use CombinedOutput for streaming as before; in my current implementation, stderr is appended after stdout via io.MultiReader, so the order of the output lines may change. (I think in Unix, only stdout is piped; it's indeed a bit strange to combine stdout and stderr together in a pipeline.)
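
A small standalone sketch of why the ordering changes under the io.MultiReader approach (this uses os/exec directly, not the PR's code):

package main

import (
	"io"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("bash", "-c", "echo to-stdout; echo to-stderr >&2")
	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	// MultiReader reads stderr only after stdout reaches EOF, so stderr
	// lines always sort after stdout lines, unlike with CombinedOutput
	combined := io.MultiReader(stdout, stderr)
	cmd.Start()
	io.Copy(os.Stdout, combined)
	cmd.Wait()
}
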
For EachLine, the effect of resetting the string builder is also different. In the non-streaming implementation, the whole output of the function is cleared when the string builder is reset. That is not acceptable in the streaming scenario, since the previous output has already been sent to the next stage; as a result, only the output corresponding to the current line gets cleared.

I don't think adding these two functions is a heavy burden on the library. After all, it's better than creating a new project called 'script-async' or whatever. If a user really wants streaming by default, they can fork the library and change the initial value of the asynchronous flag in NewPipe().

posener commented Dec 1, 2021

@gty929 as for the other project you suggest: I already created https://github.com/posener/script

bitfield (Owner) commented Dec 4, 2021

Yes, I see what you mean. Streaming is fundamentally a different model from the 'error-safe reader' that script implements, as described in chapter 10 of The Power of Go: Tools. It doesn't fit with everything else, and you'd really need a new type, StreamPipe or something like that, to avoid confusion with the existing semantics.

I think you've convinced me that the cost of adding this functionality would just be too high; closing accordingly, with thanks for the work you've put in to prove the concept.

@posener, it's quite up to you of course, but it might be worth thinking about giving your library a different name, since it is effectively a different library now, rather than a modified fork of this one.
