
[1/4] - protofsm: add new package for driving generic protocol FSMs #8337

Open

wants to merge 11 commits into master from protofsm

Conversation


@Roasbeef Roasbeef commented Jan 3, 2024

In this PR, we create a new package, protofsm, which is intended to
abstract away something we've done dozens of times in the daemon:
create a new event-driven protocol FSM. One example of this is the co-op
close state machine, and also the channel state machine itself.

This package picks out the common themes of:

  • clear states and transitions between them
  • calling out to special daemon adapters for I/O such as transaction
    broadcast or sending a message to a peer
  • cleaning up after state machine execution
  • notifying relevant callers of updates to the state machine

The goal of this PR is that devs can now implement a state machine
based off of this primary interface:

// State defines an abstract state along with its state transition function,
// which takes as input an event and an environment, and returns a state
// transition (the next state, and a set of events to emit). A state can also
// be terminal or not; a terminal state causes state machine execution to halt.
type State[Event any, Env Environment] interface {
	// ProcessEvent takes an event and an environment, and returns a new
	// state transition. This will be iteratively called until either a
	// terminal state is reached, or no further internal events are
	// emitted.
	ProcessEvent(event Event, env Env) (*StateTransition[Event, Env], error)

	// IsTerminal returns true if this state is terminal, and false otherwise.
	IsTerminal() bool
}

Their focus is only on each state transition, rather than all the
boilerplate involved (processing new events, advancing to completion,
doing I/O, etc.).

Instead, they just make their states, then create the state machine
given the starting state and env. The only other custom component needed
is something capable of mapping wire messages or other events from the
"outside world" into the domain of the state machine.

The set of types is based on a pseudo sum type system wherein you
declare an interface, make the sole method private, then create other
instances based on that interface. This restricts the types accepted at
call sites (they must take that interface), and with some tooling,
exhaustive matching can also be enforced via a linter.
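
To make the sealed-interface pattern concrete, here's a minimal sketch (the
event names are made up for illustration, they aren't taken from this PR):

// CloseEvent is the "sum type": the single unexported method means only
// types declared in this package can satisfy the interface, so the set of
// variants is closed, and call sites must accept this interface type.
type CloseEvent interface {
	closeEvent()
}

// OfferReceived and OfferAccepted are two variants of the sum, each carrying
// only the data relevant to that case.
type OfferReceived struct {
	FeeSats int64
}

type OfferAccepted struct{}

func (*OfferReceived) closeEvent() {}
func (*OfferAccepted) closeEvent() {}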

The best way to get the hang of the pattern proposed here is to check out
the tests. They make a mock state machine, and then use the new executor
to drive it to completion. You'll also get a view of how the code will
actually look, with the focus being on the input event, current state,
and output transition (which can also emit events to drive the machine
forward).
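
As a rough sketch of the shape a state takes against this interface (the
event and environment types here are hypothetical and only meant to show the
structure; see the actual tests for the real mock, and assume closeEnv
implements Environment):

// awaitingAck is a non-terminal state: it advances to the closed state once
// it sees an OfferAccepted event, and otherwise stays put.
type awaitingAck struct{}

func (a *awaitingAck) ProcessEvent(event CloseEvent, env *closeEnv) (
	*StateTransition[CloseEvent, *closeEnv], error) {

	switch event.(type) {
	case *OfferAccepted:
		// Move to the terminal state. No internal or daemon events
		// are emitted, so the executor halts after this transition.
		return &StateTransition[CloseEvent, *closeEnv]{
			NextState: &closed{},
		}, nil

	default:
		// Any other event leaves us in the same state.
		return &StateTransition[CloseEvent, *closeEnv]{
			NextState: a,
		}, nil
	}
}

func (a *awaitingAck) IsTerminal() bool { return false }

// closed is the terminal state, so the executor stops driving the machine
// once it's reached.
type closed struct{}

func (c *closed) ProcessEvent(_ CloseEvent, _ *closeEnv) (
	*StateTransition[CloseEvent, *closeEnv], error) {

	return &StateTransition[CloseEvent, *closeEnv]{NextState: c}, nil
}

func (c *closed) IsTerminal() bool { return true }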


github-actions bot commented Jan 3, 2024

Pull reviewers stats

Stats of the last 30 days for lnd:

User             Total reviews   Time to review   Total comments
yyforyongyu      6               3d 19h 57m       3
Roasbeef         5               1d 18h 46m       18
guggero          4               1d 7h 51m        3
bhandras         2               3h 27m           0
calvinrzachman   1               6m               1
ziggie1984       1               20h 14m          0

@yyforyongyu yyforyongyu (Collaborator) left a comment

Really like the uniform StateMachine 🤩 I think the blockbeat from #7951 can even fit in this picture: there exists a set of universal events, such as a new block event, and we force every state machine to process it. My main question is whether we could stop generalizing at ProcessEvent, and leave the implementations of executeDaemonEvent to specific subsystems? This naturally leads to the question of whether we need this DaemonAdapters interface, as it seems it's not common functionality shared by all subsystems.

Still need to think it through, but a few ideas:

  • we could make StateMachine an interface, and maybe add something like BaseMachine that has the minimal methods such as driveMachine.
  • I like that State is an interface, which makes writing the tests much easier. It's just that the name is a bit confusing I guess, as it's sort of like a processor, and each state has its own processor.

My understanding of the design is: an event-driven machine that's pipelined with state processors. The machine doesn't care about the specifics of the event; instead, it's the state processor's responsibility to handle the event and produce a new state. I think we could stop here without distinguishing internal vs external events, and apply it to a few subsystems to see its effect.

// executeDaemonEvent executes a daemon event, which is a special type of event
// that can be emitted as part of the state transition function of the state
// machine. An error is returned if the type of event is unknown.
func (s *StateMachine[Event, Env]) executeDaemonEvent(event DaemonEvent) error {
Collaborator

feels like it's leaking the implementation details from other subsystems

Member Author

I think you'll have a better idea of the interaction once the new co-op close stuff is up, but the general idea is that:

  • All the state machine state transitions are pure functions
  • They emit events for the executor (prob should rename this struct slightly) to apply themselves
  • Something needs to be aware of the boundary between the pure state machine, and the daemon execution env it runs in
  • This thing handles that role of knowing all the global I/O or daemon actions to execute itself, and potentially emit an event back into the state machine (post execution hook)

Otherwise, what do you think should be handling the I/O between the daemon and the state machine?

Collaborator

Maybe it could be hidden behind currentState.ProcessEvent? Since it generates the transition, it might as well act on the new transition, like broadcast or send a message.

Collaborator

I think putting more things behind ProcessEvent would negatively impact testability. With this construction we can test the state transitions themselves in a pure environment and then wire up the execution of the generated events separately.

Member Author

Maybe it could be hidden behind currentState.ProcessEvent? Since it generates the transition, it might as well act on the new transition, like broadcast or send a message.

So the idea is that the actual state transitions never need to concern themselves with any of these details. They just emit the event, then wait for w/e new event to be sent in. There's no leakage of implementation details at this StateMachine level, as we'll pass in a concrete implementation based on lnd later; here's an idea of what that looks like: ce75ef8.

This is meant to be universal, just like the POSIX interface we all know and love today. In this case, our processes are these FSMs, and the syscalls are the ways to interact with the chain or daemon.


Roasbeef commented Jan 4, 2024

we could make StateMachine an interface,

Why do you think this should be an interface? The goal here is to provide a generic implementation that can drive any FSM, which is defined from that starting/initial state, and all the state transition functions. If you look at the test, it takes that mock state machine, and is able to drive that with the shared semantics of: terminal states, clean up functions, pure state transitions that emit any side effects as events, etc.

and leave the implementations of executeDaemonEvent to specific subsystems

The goal of those was to implement all the side effects we'd ever need in a single place. The daemon events added were just the ones I needed to implement the new co-op close state machine nearly from scratch. I think if we look at all the state machines we've written in the codebase, maybe there's ~10 daemon level adapters that are used continuously. One that's missing right now is requesting to be notified of something confirming.

@ProofOfKeags ProofOfKeags (Collaborator) left a comment

I did a no-nit high level review here. My biggest squint was around the SendWhen impure pseudo-predicate. Not gonna lie, I don't like it. However, I suspect that the reason you went this route is that making it pure would require hooks into the state changes of surrounding subsystems in ways that would require significant changes to the overall LND codebase before this could be inserted.

That said, there still may be no way around it. The main concern here is that the polling approach may miss the opportunities it needs to send the message out. The example here is OnCommit/OnFlush where we poll and still owe a commitment so we can't send, but then we do the commit and immediately follow up with another state change, thereby re-falsifying the SendWhen predicate before the next poll cycle.

In the case of shutdown and the coop close negotiations, this technically violates the spec. Idk what the practical consequences of that would be (they may be benign), but unless we can synchronize directly into the channel update lifecycle, we can't really be spec compliant.

On the other hand, you could make the argument that it isn't the state machine's responsibility to understand when a message should be synchronized into the message stream at all. Its job is simply to generate the response, and the caller would queue it for sending at the next possible opportunity. This is the approach I took with the coop close v1: the ChanCloser is completely unaware of how the messages are dispatched, it just knows what to send, not when or how.


// SendPredicate is a function that returns true if the target message should
// sent.
type SendPredicate = func() bool
Collaborator

I find myself very suspicious of this type of predicate construction. I would like for the function to take an argument to formally make it a predicate, and one that is ideally pure.

I haven't finished tackling the rest of the PR yet but I'm looking for opportunities to make this a reality in a way that cleans up the model.

Member Author

Sure, I guess we could call it something like a BoolCallback? Or just a ContinuationFunc?

I think this is useful for areas where we don't have the new hook concept. Eg: something waiting for a channel to be added to the graph before it acts.

In the RBF closer, I use this to hook into the case where no dangling updates exist (can send `shutdown`): https://github.com/lightningnetwork/lnd/blob/43386d5643f961f948bf95513933c7d5a72fc74e/peer/chan_observer.go#L46-L51

Alternatively, we can use the hooks to send an event into the state machine once the state has been achieved. Then we have a new transition to just handle that event (send the shutdown). I slightly prefer it as is though, as the code reads as more imperative. Otherwise, the state machine would need more knowledge of hooks, and the ability of the hook to do things like emit a daemon event, wherein the only way to emit that today is as a return value (the state transitions don't have a handle on the thing executing them, they're effectively sandboxed).

Collaborator

TrafficLight? Trigger?

We can leave as is. I just don't know about that poll-cycle issue.

Collaborator

I think the best way to handle this is to make the "true yielding" calls into events. This removes the need for time-based poll loops.

Comment on lines +61 to +73
type State[Event any, Env Environment] interface {
// ProcessEvent takes an event and an environment, and returns a new
// state transition. This will be iteratively called until either a
// terminal state is reached, or no further internal events are
// emitted.
ProcessEvent(event Event, env Env) (*StateTransition[Event, Env], error)
Collaborator

One of the things I'm noticing about this type of construction is that it forces the Event type that the StateMachine consumes to be the same type as the Event type that it produces. I don't think this is necessary. Does go require all of the Type Variables in the ProcessEvent function to be scoped to the interface, or can you introduce new tyvars on the method itself?

Member Author

One of the things I'm noticing about this type of construction is that it forces the Event type that the StateMachine consumes to be the same type as the Event type that it produces. I don't think this is necessary

Good observation. I don't think it's necessary, but it felt natural in that if the state machine is defined by the type of events it accepts and the env, then most of the time, you want to also return something of that very same type.

I think in the future if we want the ability for one state machine to turn into another then, we can add in a new type and a code path to handle the switch over. I needed to do something similar to this, but I was able to just make a new composite state:

Does go require all of the Type Variables in the ProcessEvent function to be scoped to the interface, or can you introduce new tyvars on the method itself?

Current limitation is that you need to scope it all on the interface. You can't have new type params on methods.
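
For reference, this is the shape Go rejects today, along with the workaround of hoisting a second event type up to the interface (a hypothetical variant, not this PR's definition), which is also one way the consumed and produced event types could be decoupled:

// Not allowed: methods (including interface methods) can't declare their
// own type parameters, so this fails to compile:
//
//	type State[InEvent any, Env Environment] interface {
//		ProcessEvent[OutEvent any](event InEvent, env Env) (
//			*StateTransition[OutEvent, Env], error)
//	}
//
// Allowed: hoist the output event type up to the interface declaration.
type State[InEvent, OutEvent any, Env Environment] interface {
	ProcessEvent(event InEvent, env Env) (*StateTransition[OutEvent, Env], error)

	IsTerminal() bool
}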

Collaborator

I don't think it's necessary, but it felt natural in that if the state machine is defined by the type of events it accepts and the env, then most of the time, you want to also return something of that very same type.

It's natural but will limit composability of machines which can be useful.

Collaborator

Current limitation is that you need to scope it all on the interface. You can't have new type params on methods.

Boo. k.

Collaborator

I think for the longevity of this library that we need to decouple events that we consume from the events we produce.

Comment on lines 66 to 77
ProcessEvent(event Event, env Env) (*StateTransition[Event, Env], error)

// IsTerminal returns true if this state is terminal, and false otherwise.
IsTerminal() bool
Collaborator

An opportunity for a "law" here is that IsTerminal = true ==> ProcessEvent = nop

Member Author

Totally. Here's an instance of that in the RBF coop PR: https://github.com/lightningnetwork/lnd/blob/43386d5643f961f948bf95513933c7d5a72fc74e/lnwallet/chancloser/rbf_coop_transitions.go#L917-L936

Tooling-wise, I think the best way for us to enforce this would be at the unit test level. I can think of some more involved mechanisms, like registering all types in a global map to then run a generic unit test against, but I'm not so sure we should reach for that at this stage.
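
A sketch of what such a test could look like (state and event names are the hypothetical ones from the description above, and require is the usual testify helper):

// TestTerminalStatesAreNoOps asserts the "law" above: any state that
// reports IsTerminal() == true must not transition elsewhere or emit new
// events.
func TestTerminalStatesAreNoOps(t *testing.T) {
	// Enumerate every state of the machine under test by hand (or via a
	// registry, if we ever add one).
	allStates := []State[CloseEvent, *closeEnv]{
		&awaitingAck{},
		&closed{},
	}

	for _, state := range allStates {
		if !state.IsTerminal() {
			continue
		}

		trans, err := state.ProcessEvent(&OfferAccepted{}, &closeEnv{})
		require.NoError(t, err)

		// A terminal state should stay put and emit nothing new.
		require.Equal(t, state, trans.NextState)
		require.False(t, trans.NewEvents.IsSome())
	}
}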

Collaborator

Yep. Unit tests would be the way to ensure this.


Comment on lines 276 to 448
// If this is a disable channel event, then we'll disable the channel.
// This is usually done for things like co-op closes.
case *DisableChannelEvent:
err := s.daemon.DisableChannel(daemonEvent.ChanPoint)
if err != nil {
return fmt.Errorf("unable to disable channel: %w", err)
}

return nil
Collaborator

Are all of these state machines guaranteed to be attached to a particular channel id? This Event feels less general than the other two, but I don't have a concrete argument for why that's the case.

Member Author

Are all of these state machines guaranteed to be attached to a particular channel id? This Event feels less general than the other two, but I don't have a concrete argument for why that's the case.

Yeah good point. In terms of usage, I guess this could become a new interface that we accept as part of the Environment. This one felt more syscall-y to me though, as the other interfaces don't affect/mutate the outside world, they just want to examine an attribute (no dangling updates), or do something pure like sign a signature.

Collaborator

Yeah maybe we just don't process these events if they are emitted by a state machine that doesn't have an association with an active channel. I agree it could be syscall-y but I am somewhat convinced that the security of said OS should include not being able to disable other channels...

Comment on lines +243 to +389
// Otherwise, this has a SendWhen predicate, so we'll need
// launch a goroutine to poll the SendWhen, then send only once
// the predicate is true.
Collaborator

Is there a way to avoid the polling approach? i.e. can we subscribe to the state changes we are actually interested in monitoring and make the predicate operate on a before/after pairing? Would that be too onerous?

I think the poll-an-opaque-boolean-returning-fn approach is potentially troublesome since it relies on impurity. If it's not prohibitively difficult I'd suggest we find a way to make SendWhen's predicate pure, and actually feed it the state changes it's monitoring rather than time based polling of state, but I do recognize that may not be an easy lift.

Member Author

Yeah, agreed: the polling (as you mentioned) has the downside of lagging the first instance where the new state/predicate is present, and in the case of something that can revert (flip on, then off) it may miss the opportunity altogether.

If it's not prohibitively difficult I'd suggest we find a way to make SendWhen's predicate pure, and actually feed it the state changes it's monitoring rather than time based polling of state, but I do recognize that may not be an easy lift.

Hmm, yeah I'm not sure what we would pass in here. With the code as is in the RBF closer PR, it could pass in the ChanStateObserver (or w/e it's called now), but then that would require the FSM executor to be able to extract attributes from the env, whereas rn it's based on an interface.

Collaborator

Yeah it's very unclear what you would want to send I agree. Ideally it's the state we are polling that says "coast is clear" and send in a new copy of it on each change. Generally I think of the relevant state here as channel state (the state of balances and commitment transactions) and link state (the state associated with where in the protocol trace we are with our peer). LightningChannel captures the channel state well but we don't really have a good tracker of link state since our link is busted up across multiple different data structures with the chan closers and funding manager. Making the desired change here may require us finishing the ChannelLifecycle refactor. Maybe we can try out a polling answer and just see what happens? It can mean untimely shutdown responses though.


// ExternalEvent is an optional external event that is to be sent to
// the daemon for dispatch. Usually, this is some form of I/O.
ExternalEvents fn.Option[DaemonEventSet]
Collaborator

Is there a material difference in semantics between None and Some({}) here?

Member Author

I think setting Some({}) would be considered a bug, or quirky logic. This should only be set to Some when the FSM has some I/O that it wants to perform. Similar to the whole nil vs []byte{} thing with slices.

Collaborator

Can we just make it a set then, since sets can be empty?

Collaborator

Still relevant.

Comment on lines 330 to 620
// With the event processed, we'll process any
// new daemon events that were emitted as part
// of this new state transition.
err := fn.MapOptionZero(events.ExternalEvents, func(dEvents DaemonEventSet) error {
for _, dEvent := range dEvents {
err := s.executeDaemonEvent(dEvent)
if err != nil {
return err
}
}

return nil
})
if err != nil {
return err
}

// Next, we'll add any new emitted events to
// our event queue.
events.InternalEvent.WhenSome(func(inEvent Event) {
eventQueue.Enqueue(inEvent)
})

return nil
})
if err != nil {
return err
}
Collaborator

If I'm reading this correctly it means that external events will always synchronize before internal ones. I believe this is OK (even preferred), but wanted to double check that this is what we want. Is there ever a situation in which we'd want to emit both but have them synchronize the other direction?

I think Internal events are a way to buy us "cut-through" where we don't rely on synthetic events from the surrounding environment to drive ourselves forward and we opportunistically drive ourselves as far forward as possible in any given moment. Under this interpretation I think this is indeed what we want.

Member Author

Is there ever a situation in which we'd want to emit both but have them synchronize the other direction?

Good q....my mental model here is that the executor context switches to execute all the syscalls, then resumes execution of the state machine with any emitted internal events. Re the opposite ordering, I guess this is sort of a system call interface the executor has with the FSM: what's the execution order of emitted events? One could likely devise state machines where if you flip the ordering you may end up with incorrect (?) behavior.

Member Author

I think Internal events are a way to buy us "cut-through" where we don't rely on synthetic events from the surrounding environment to drive ourselves forward and we opportunistically drive ourselves as far forward as possible in any given moment.

Yep, I think for the most part, you can elide emitting an internal event just by doing even more within a given state transition. I like them though from the PoV of minimal state transitions, and also domain modeling as well.

coderabbitai bot commented Jan 24, 2024

Review skipped: auto reviews are limited to specific labels (llm-review).


Roasbeef commented Feb 6, 2024

PTAL.

@Roasbeef Roasbeef force-pushed the protofsm branch 2 times, most recently from e0265c1 to 057c481 on February 7, 2024 00:16
@ProofOfKeags ProofOfKeags (Collaborator) left a comment

Main thing is I think we want to make state machines not able to "throw a disable" to another channel.



@Roasbeef (Member Author) commented

Pushed up a new set of commits with some bug fixes and some additional functionality that came in handy when starting to hook up the new RBF coop close state machine to the peer struct.


Roasbeef commented Mar 6, 2024

Updated the branch to remove DisableChannel as a syscall.

@ProofOfKeags ProofOfKeags (Collaborator) left a comment

I have a bunch of comments peppered throughout. However, the biggest squint I have is the fact that the StateMachines we define using this approach are always structured as their own CSP and so any time we try to compose them together we will have concurrency considerations. This makes it harder for us to fix the existing races we have between the chancloser machinery and the link. For that reason, I think we need to engineer into this approach a way to mark an Event as fully processed to the world outside of the StateMachine.

Concretely, SendEvent needs to return something approximating a sync primitive that will wake/unlock when the event is fully ack'ed.
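
A hedged sketch of what that could look like (the events/quit fields and the eventWithAck type are assumptions, not the current code): SendEvent, or a variant of it, hands back a channel that the executor closes once the event, and any internal events it spawned, have been fully processed.

// eventWithAck pairs an incoming event with a channel the executor closes
// once the event (and any internal events it produced) has been processed.
type eventWithAck[Event any] struct {
	event Event
	done  chan struct{}
}

// SendEventSync enqueues the event and returns a channel the caller can
// block on to know the event was fully ack'ed by the state machine.
func (s *StateMachine[Event, Env]) SendEventSync(event Event) <-chan struct{} {
	done := make(chan struct{})

	select {
	case s.events <- eventWithAck[Event]{event: event, done: done}:
	case <-s.quit:
		// The executor is shutting down; unblock the caller anyway.
		close(done)
	}

	return done
}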

Comment on lines +101 to +112
// RegisterSpendNtfn registers an intent to be notified once the target
// outpoint is successfully spent within a transaction. The script that
// the outpoint creates must also be specified. This allows this
// interface to be implemented by BIP 158-like filtering.
RegisterSpendNtfn(outpoint *wire.OutPoint, pkScript []byte,
heightHint uint32) (*chainntnfs.SpendEvent, error)
Collaborator

When you say "that the outpoint creates" are you referring to the script of the outpoint being spent? or referring to the script of ... one of (?) ... the outpoints that are created by the tx that spends the specified outpoint?

s.wg.Add(1)
go func() {
defer s.wg.Done()
for {
Collaborator

If I'm reading this right, this for loop is unnecessary

s.wg.Add(1)
go func() {
defer s.wg.Done()
for {
Collaborator

same here. I'm on a quest to make for loops an endangered species.

Comment on lines 478 to 665
// Before we start, if we have an init daemon event specified, then
// we'll handle that now.
err := fn.MapOptionZ(s.initEvent, func(event DaemonEvent) error {
return s.executeDaemonEvent(event)
})
if err != nil {
log.Errorf("unable to execute init event: %w", err)
return
}
Collaborator

As a matter of taste, I think these single run actions should be run in the main body of the start method rather than the prelude of the goroutine method. Thoughts?

initialState State[Event, Env],
env Env) StateMachine[Event, Env] {
initialState State[Event, Env], env Env,
initEvent fn.Option[DaemonEvent]) StateMachine[Event, Env] {
Collaborator

Maybe this is a dumb question, but why make the caller responsible for setting this. IIUC, the caller is saying "Hey can you tell me to do the thing immediately, then I'll do the thing". Why not just do the thing? Or alternatively, why have the caller tell the StateMachine "tell me to do the thing". The state machine could just know that it needs to do the thing.

Comment on lines 226 to 290
// SendMessage attempts to send a wire message to the state machine. If the
// message can be mapped using the default message mapper, then true is
// returned indicating that the message was processed. Otherwise, false is
// returned.
func (s *StateMachine[Event, Env]) SendMessage(msg lnwire.Message) bool {
// If we have no message mapper, then return false as we can't process
// this message.
if !s.cfg.MsgMapper.IsSome() {
return false
}

// Otherwise, try to map the message using the default message mapper.
// If we can't extract an event, then we'll return false to indicate
// that the message wasn't processed.
var processed bool
s.cfg.MsgMapper.WhenSome(func(mapper MsgMapper[Event]) {
event := mapper.MapMsg(msg)

event.WhenSome(func(event Event) {
s.SendEvent(event)

processed = true
})
})

return processed
}
Collaborator

This is also simplified by going for the const(None) approach.

Comment on lines +56 to +59
// Name returns the name of the environment. This is used to uniquely
// identify the environment of related state machines.
Name() string
Collaborator

I suggest naming this "Id" instead of "Name" since it will build intuition for posterity that it carries semantic meaning and must be unique for it to function as expected.

Comment on lines +40 to +45
// newLogClosure returns a new closure over a function that returns a string
// which itself provides a Stringer interface so that it can be used with the
// logging system.
func newLogClosure(c func() string) logClosure {
return logClosure(c)
}
Collaborator

Do we really need this? can't we just use the type conversion function generated by the typedef a few lines above?

Comment on lines +675 to +685
// An error occurred, so we'll tear down the
// entire state machine as we can't proceed.
go s.Stop()
Collaborator

Do we have plans to bring it back up? This seems like a pretty important thing to know both for implementers of this interface, as well as consumers.

Comment on lines +71 to +73
// SpendMapper is a function that's used to map a spend notification to a
// custom state machine event.
type SpendMapper[Event any] func(*chainntnfs.SpendDetail) Event
Collaborator

So unlike the MsgMapper, it seems that this will always be guaranteed to produce a valid event from a Spend. Can you elaborate on why this difference in choices makes sense. The incongruence bothers me. I also think we can easily fix it via the const approach as I specified earlier, but I am curious about why you made two different choices in the first place.

@Roasbeef Roasbeef changed the title protofsm: add new package for driving generic protocol FSMs [1/4] - protofsm: add new package for driving generic protocol FSMs Mar 8, 2024
@saubyk saubyk modified the milestones: v0.18.0, v0.18.1 Mar 21, 2024
@morehouse (Collaborator) commented

On initial look, I'm not excited about this change.

I find the event-driven pattern less readable, with code blocks like this:

switch event.(type) {
	case *IncomingStfu:
		stfu := lnwire.Stfu{
			ChanID:    env.cid,
			Initiator: false,
		}
		send := protofsm.SendMsgEvent[Events]{
			Msgs:       []lnwire.Message{&stfu},
			TargetPeer: env.key,
			SendWhen:   fn.Some(env.canSend),
			PostSendEvent: fn.Some(
				Events(&gotoQuiescent{}), // gross
			),
		}

		return &protofsm.StateTransition[Events, *Env]{
			NextState: &Live{},
			NewEvents: fn.Some(
				protofsm.EmittedEvent[Events]{
					ExternalEvents: fn.Some(
						protofsm.DaemonEventSet{&send},
					),
				},
			),
		}, nil
	case *Initiate:
		stfu := lnwire.Stfu{
			ChanID:    env.cid,
			Initiator: true,
		}
		send := protofsm.SendMsgEvent[Events]{
			Msgs:       []lnwire.Message{&stfu},
			TargetPeer: env.key,
			SendWhen:   fn.Some(env.canSend),
			PostSendEvent: fn.Some(
				Events(&gotoAwaitingStfu{}), // gross
			),
		}

		return &protofsm.StateTransition[Events, *Env]{
			NextState: &Live{},
			NewEvents: fn.Some(
				protofsm.EmittedEvent[Events]{
					ExternalEvents: fn.Some(
						protofsm.DaemonEventSet{&send},
					),
				},
			),
		}, nil
	case *gotoAwaitingStfu:
		return &protofsm.StateTransition[Events, *Env]{
			NextState: &AwaitingStfu{},
			NewEvents: fn.None[protofsm.EmittedEvent[Events]](),
		}, nil
	case *gotoQuiescent:
		return &protofsm.StateTransition[Events, *Env]{
			NextState: &Quiescent{},
			NewEvents: fn.None[protofsm.EmittedEvent[Events]](),
		}, nil
	default:
		panic("impossible: invalid QuiescerEvent")
	}
}

instead of readable, equivalent code, something like this:

func (q *quiescer) sendStfu() error {
  stfu := lnwire.Stfu{
    ChanID:    env.cid,
    Initiator: q.state == Initiate,
  }
  if err := sendMsg(stfu); err != nil {
    return err
  }

  switch q.state {
    case IncomingStfu: q.state = Quiescent
    case Initiate:     q.state = AwaitingStfu
    default:           return fmt.Errorf("Invalid state change")
  }

  return nil
} 

I also find it more difficult to trace the flow of a program written with protofsm, with states and events and transitions being passed all over the place. I fear that debugging code written in this style may be much more difficult.

I find it quite confusing to think about what is executing at any given time. It seems each protofsm gets its own goroutine and daemon events also get their own goroutines. And the concurrency behavior is hidden from the protofsm user, which seems a disaster just waiting to happen.

Maybe I'm slower than others, but I've been trying to grok protofsm for a day now and I'm still not confident I fully grasp the intricacies. If I had to write or modify code in this style, I would not be confident that my code was bug-free.


Roasbeef commented Apr 4, 2024

I also find it more difficult to trace the flow of a program written with protofsm, with states and events and transitions being passed all over the place. I fear that debugging code written in this style may be much more difficult.

I think the exact opposite is the case. With the framework as is, you have a standardized way of handling new state transitions, and you're forced to only maintain state within the protocol state definition, instead of a large struct with many variables that are only conditionally set if a certain state is present. You can examine a single state transition at a time, which clearly enumerates all its inputs and outputs.

You also don't need to re-write the very same executor loop (take in message, select on quit channel, apply state, loop again) that we've implemented several times over in the codebase. You just write your state transitions, and hand it off for handling.

Re debugging, my experience of debugging the rbf-coop state machine was pretty straight forward. The only state you need to wrangle with is the state in the protocol state. There's no concurrency within the state machine either, you're forced to implement everything with serial execution. You write unit tests for a given state transition, and can even employ property based testing to assert invariants re inputs/outputs.

I find it quite confusing to think about what is executing at any given time. It seems each protofsm gets its own goroutine and daemon events also get their own goroutines. And the concurrency behavior is hidden from the protofsm user, which seems a disaster just waiting to happen.

For a given state machine, everything is executed serially (we can also make it fully blocking, but nothing works like that today, since you don't want to block wire message ingestion). You define the transitions, then a generic executor handles mapping a wire message to a protocol state (just one example) to apply directly. The daemon events executed async are the very same ones that you'd normally spawn a goroutine to funnel a response into a channel (waiting for a spend/confirmation, etc). Transaction broadcast and wire message sending are synchronous.


Roasbeef commented Apr 4, 2024

Haven't dived deep into that PR yet, but looking at the example, the top two transitions to Live don't look necessary, and they can just go directly to AwaitingStfu. I think with that, you have a more accurate comparison:

  • s(Live, IncomingStfu) -> Quiescent.
  • s(Live, Initiate) -> AwaitingStfu.

So just two switch cases. There def is a bit more line noise going on there due to Go's lack of basic type inference, but you can make some helper funcs to handle the defs.

Even with that they don't look quite equivalent, as one wants to wait on a certain state to send the message, while the other would unconditionally send it. As mentioned above, to compare directly, you'd also need to implement the executor/event loop for the second version; IIRC that hadn't yet been done.
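
To make the comparison concrete, a hedged sketch of what those two cases could collapse to, with a couple of hypothetical helpers (sendAndGoto, newStfuSend) soaking up the StateTransition/SendMsgEvent boilerplate, and assuming these cases live in the Live state's ProcessEvent:

// sendAndGoto wraps a single SendMsgEvent into the StateTransition
// boilerplate from the snippet above.
func sendAndGoto(next protofsm.State[Events, *Env],
	send *protofsm.SendMsgEvent[Events]) *protofsm.StateTransition[Events, *Env] {

	return &protofsm.StateTransition[Events, *Env]{
		NextState: next,
		NewEvents: fn.Some(protofsm.EmittedEvent[Events]{
			ExternalEvents: fn.Some(protofsm.DaemonEventSet{send}),
		}),
	}
}

// With the intermediate goto* events dropped, the switch collapses to two
// cases; newStfuSend would build the SendMsgEvent for the given initiator
// flag.
func (l *Live) ProcessEvent(event Events, env *Env) (
	*protofsm.StateTransition[Events, *Env], error) {

	switch event.(type) {
	case *IncomingStfu:
		return sendAndGoto(&Quiescent{}, newStfuSend(env, false)), nil

	case *Initiate:
		return sendAndGoto(&AwaitingStfu{}, newStfuSend(env, true)), nil

	default:
		panic("impossible: invalid QuiescerEvent")
	}
}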

@ProofOfKeags (Collaborator) commented

I think all of the comments here that @Roasbeef makes about protofsm in general are correct. I also think that the quiescence protofsm implementation exaggerates its costs and understates its benefits. I made certain choices in the quiescence implementation in order to bind the state transition itself to when the message actually gets sent, as opposed to when it gets staged to send. This may not be necessary -- it may not even be good!

The main benefit that is understated here is that it is often the case that a state machine is best expressed as a sum of products. Product types are very easily expressed in Go via structs. Sums, on the other hand, are another story. The sealed interface pattern helps us model it better and makes it such that we can structurally guarantee the presence or absence of the associated state parameters with the state itself, rather than having a swiss cheese block of potentially valid or invalid pointers depending on the state selector row. It also allows us to explicitly enumerate the valid state transitions away from a particular state in a way that is very well organized and isolated. Could this be accomplished in another way? Yes. Is it better to do another way? I'm not so sure.

So while it was a very simple state machine to implement, quiescence is probably not an illustrative example of the leverage that protofsm can provide. The essential tradeoff being made is that protofsm adds a close to fixed overhead in terms of the naturality of expressing the state machine, and its benefits compound as the state machine itself gets bigger.

@saubyk saubyk added the P1 MUST be fixed or reviewed label Jun 25, 2024
@saubyk saubyk modified the milestones: v0.18.3, v0.19.0 Aug 1, 2024

Roasbeef commented Aug 2, 2024

Rebased to get a fresh CI run going.

@ProofOfKeags ProofOfKeags (Collaborator) left a comment

Biggest changes I'd like to see are terminology changes regarding the StateMachine and State. I think from reviewing the code it is clear that the currently named StateMachine is really an Executor and that the State is really a StateMachine.

This is important because I think the (current) State abstraction will be nice to use in the context of embedding one automata within another.

The second major change is to ensure that the executor only has one thread and we can predict the ordering of events. As it stands right now the SendWhen predicate has its own separate ticker that operates independently of the (current) StateMachine's event loop. I think instead we want to have the event loop call the sendWhen predicates predictably in the same cycle that it calls the main driver.

Comment on lines +31 to +45
// logClosure is used to provide a closure over expensive logging operations
// so they aren't performed when the logging level doesn't warrant it.
type logClosure func() string

// String invokes the underlying function and returns the result.
func (c logClosure) String() string {
return c()
}

// newLogClosure returns a new closure over a function that returns a string
// which itself provides a Stringer interface so that it can be used with the
// logging system.
func newLogClosure(c func() string) logClosure {
return logClosure(c)
}
Collaborator

Since this PR was put in, we have added an LND-wide version of this: https://github.com/lightningnetwork/lnd/blob/master/lnutils/log.go

Comment on lines +167 to +186
// MsgMapper is an optional message mapper that can be used to map
// normal wire messages into FSM events.
MsgMapper fn.Option[MsgMapper[Event]]
Collaborator

This is still relevant and we now have it in fn




Comment on lines +221 to +250
return fn.MapOptionZ(cfgMapper, func(mapper MsgMapper[Event]) bool {
return mapper.MapMsg(msg).IsSome()
})
Collaborator

Yeah again I think that creating a message mapper that returns None for all inputs handily solves this issue. Should be easy to do with the new Const function.
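
A sketch of that idea (the concrete type name is made up; only the MsgMapper shape is taken from this PR):

// noneMapper is a MsgMapper that maps every wire message to None. Using it
// as the default collapses the "no mapper" and "mapper that matches nothing"
// cases, so the config no longer needs to wrap the mapper in an Option.
type noneMapper[Event any] struct{}

func (noneMapper[Event]) MapMsg(_ lnwire.Message) fn.Option[Event] {
	return fn.None[Event]()
}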



protofsm/log.go

// The default amount of logging is none.
func init() {
UseLogger(build.NewSubLogger("PRCL", nil))
Collaborator

Do we want a logger at the library level?

In this commit, we add an optional daemon event that can be specified to
dispatch during init. This is useful for instances where before we
start, we want to make sure we have a registered spend/conf notification
before normal operation starts.

We also add new unit tests to cover this, and the prior spend/conf event
additions.

In this commit, we add the ability for the state machine to consume wire
messages. This'll allow the creation of a new generic message router
that takes the place of the current peer `readHandler` in an upcoming
commit.

This'll be used later to uniquely identify state machines for
routing/dispatch purposes.

We'll use this to be able to signal to a caller that a critical error
occurred during the state transition.

Adding this makes a state machine easier to unit test, as the caller can
specify a custom polling interval.

In this commit, we add the SpendMapper which allows callers to create
custom spent events. Before this commit, the caller would be able to
have an event sent to them in the case a spend happens, but that event
wouldn't have any of the relevant spend details.

With this new addition, the caller can specify how to take a generic
spend event, and transform it into the state machine specific spend
event.

In this commit, we update the execution logic to allow multiple internal
events to be emitted. This is useful to handle potential out of order
state transitions, as they can be cached, then emitted once the relevant
pre-conditions have been met.
@lightninglabs-deploy

@yyforyongyu: review reminder
@Crypt-iQ: review reminder
@morehouse: review reminder
@Roasbeef, remember to re-request review from reviewers when ready
