channeldb: add reject and channel caches #2847

Merged

Conversation

@cfromknecht (Contributor) commented on Mar 28, 2019

This PR adds two caches housed within the channeldb to optimize two existing hot spots related to gossip traffic.

Reject Cache

The first is dubbed a reject cache, whose entries contain a small amount of information critical to determining if we should spend resources validating a particular channel announcement or channel update. This improves the performance of HasChannelEdge, which is a subroutine of KnownEdge and IsStaleEdgePolicy.

Each entry in the reject cache stores the following:

type rejectCacheEntry struct {
    upd1Time int64
    upd2Time int64
    flags    rejectFlags
}

where flags is a packed bitfield containing, for now, the exists and isZombie booleans. We store the time as a Unix integer rather than a time.Time directly, since time.Time's internal pointer to timezone info would force the garbage collector to traverse these long-lived entries. The time.Time values are reconstructed from the integral values during calls to HasChannelEdge.
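
As a concrete sketch of the packing (the helper names here are assumptions for illustration and may not match the PR's exact code), the bitfield and its accessors could look like:

    // rejectFlags is a compact bitfield carrying the booleans needed to
    // decide whether to reject a channel announcement or update.
    type rejectFlags uint8

    const (
        // rejectFlagExists signals that the channel is known and open.
        rejectFlagExists rejectFlags = 1 << 0

        // rejectFlagZombie signals that the channel was marked a zombie.
        rejectFlagZombie rejectFlags = 1 << 1
    )

    // packRejectFlags combines the two booleans into a single byte.
    func packRejectFlags(exists, isZombie bool) rejectFlags {
        var f rejectFlags
        if exists {
            f |= rejectFlagExists
        }
        if isZombie {
            f |= rejectFlagZombie
        }
        return f
    }

    // unpack recovers the original booleans from the bitfield.
    func (f rejectFlags) unpack() (exists, isZombie bool) {
        return f&rejectFlagExists != 0, f&rejectFlagZombie != 0
    }

The cached upd1Time and upd2Time are turned back into time.Time values with time.Unix(entry.upd1Time, 0) and time.Unix(entry.upd2Time, 0) when answering HasChannelEdge.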

The addition of the reject cache greatly improves LND's ability to efficiently filter gossip traffic, and results in a significantly lower number of database accesses for this purpose. Users should notice the gossip syncers terminating much more quickly (due to the absence of db hits), resulting in snappier connections overall.

Channel Cache

The second is a channel cache, which caches ChannelEdge values and is used to reduce memory allocations stemming from ChanUpdatesInHorizon. A ChannelEdge has the following structure:

type ChannelEdge struct {
    Info    *ChannelEdgeInfo
    Policy1 *ChannelEdgePolicy
    Policy2 *ChannelEdgePolicy
}

Currently, each call to ChanUpdatesInHorizon will seek and deserialize all ChannelEdge values in the requested range. When connected to a large number of peers, this can result in an excessive amount of memory that must be 1) allocated, and 2) cleaned up by the garbage collector. The values are intended to be read-only, and are discarded as soon as the relevant information is written out on the wire to the peers.

As a result, the channel cache can greatly reduce the amount of wasted allocations, especially if a large percentage of the requested range is held in memory or peers request similar time slices of the graph.
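
For illustration, the shape of such a cache (type and method names here are assumptions rather than the PR's exact API) can be little more than a bounded map keyed by channel ID; the insert path, including eviction, is sketched under Eviction below:

    // channelCache is a bounded, in-memory map of ChannelEdge values keyed
    // by channel ID, letting ChanUpdatesInHorizon serve hot ranges without
    // re-deserializing edges from disk.
    type channelCache struct {
        n        int                    // maximum number of entries
        channels map[uint64]ChannelEdge // chanID -> cached edge
    }

    // newChannelCache creates a channel cache holding at most n entries.
    func newChannelCache(n int) *channelCache {
        return &channelCache{
            n:        n,
            channels: make(map[uint64]ChannelEdge),
        }
    }

    // get returns the cached edge for chanID, if present.
    func (c *channelCache) get(chanID uint64) (ChannelEdge, bool) {
        channel, ok := c.channels[chanID]
        return channel, ok
    }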

Eviction

Both caches employ randomized eviction when inserting an element would cause the cache to exceed its configured capacity. The rationale stems from the fact that the access pattern for these caches is dictated entirely by our peers. Assuming the entire working set cannot fit in memory, a deterministic caching strategy would ease a malicious peer's ability to craft access patterns that incur a worst-case hit frequency (close to or equal to 0%). The resulting effect would be equivalent to having no cache at all, forcing us to hit disk for each rejection check and/or deserialize each requested ChannelEdge. The randomized eviction strategy thus provides some level of DoS protection, while also being simple and efficient to implement in Go (because map iteration is randomized by default).
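
Concretely, the insert path for the channel cache sketched above might look like the following (again illustrative rather than the PR's exact code), leaning on Go's randomized map iteration to pick a victim:

    // insert adds channel to the cache. If the cache is at capacity, a
    // pseudo-random entry is evicted first: ranging over a Go map visits
    // keys in a randomized order, so taking the first key yields a random
    // victim with no extra bookkeeping.
    func (c *channelCache) insert(chanID uint64, channel ChannelEdge) {
        // Overwriting an existing entry cannot grow the cache.
        if _, ok := c.channels[chanID]; ok {
            c.channels[chanID] = channel
            return
        }

        // Evict an arbitrary entry if we're at capacity.
        if len(c.channels) >= c.n {
            for evictedID := range c.channels {
                delete(c.channels, evictedID)
                break
            }
        }

        c.channels[chanID] = channel
    }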

Lazy Consistency

For some cases, keeping the cache in sync with the on-disk state requires reading and deserializing extra data from the db that is not deducible from the inputs. However, at the time the entry is modified, it's not certain that the entry will be accessed again, meaning that extraneous allocations and deserializations may be performed, even though the entry could be evicted before that data is ever used.

For this reason, both the reject and channel caches remove entries from the cache whenever an operation dirties them, and then lazily load them on the next access. The lone exception is UpdateChannelPolicy, which, being the most frequently used operation, writes the updated info through to the caches if the entries are already present. If the entries are not present, they are lazily loaded on the next access for the reasons stated above.

There are other places where we could add write-through behavior, though removing the entry is by far the safest alternative. We can add write-through elsewhere if those paths prove to be a bottleneck.
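
Putting the two rules together, the tail of a mutating method such as UpdateChannelPolicy roughly follows the pattern below (a sketch only; the isUpdate1 flag and exact field names are assumptions made for illustration):

    // After the policy update has been committed to disk, patch the reject
    // cache entry only if it is already resident, and invalidate the channel
    // cache entry so it is lazily reloaded on the next read.
    if entry, ok := c.rejectCache.get(edge.ChannelID); ok {
        // Only the corresponding update timestamp changes; existence and
        // zombie state are untouched by a policy update.
        if isUpdate1 {
            entry.upd1Time = edge.LastUpdate.Unix()
        } else {
            entry.upd2Time = edge.LastUpdate.Unix()
        }
        c.rejectCache.insert(edge.ChannelID, entry)
    }

    c.chanCache.remove(edge.ChannelID)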

CLI Configuration

Each cache can be configured by way of the new lncfg.Caches subconfig, allowing users to set the maximum number of cache entries for both the reject and channel caches. The configuration options appear as:

caches:
      --caches.reject-cache-size=    Maximum number of entries contained in the reject cache, which is used to speed up filtering of new channel announcements and channel updates from peers. Each entry requires 25 bytes. (default: 50000)
      --caches.channel-cache-size=   Maximum number of entries contained in the channel cache, which is used to reduce memory allocations from gossip queries from peers. Each entry requires 2Kb. (default: 20000)

At the default values provided, the reject cache occupies roughly 1.2MB (50,000 entries × 25 bytes) and easily holds an entry for today's entire graph. The channel cache occupies about 40MB (20,000 entries × 2KB), holding roughly half of the channels in memory. The majority of peers query for values near the tip of the graph, allowing gossip queries to be satisfied almost entirely from the in-memory contents.

There are certain peers that request a full dump of all known channels, which will require going to disk (unless of course the channel cache is configured to hold the entire graph in memory). AFAIK, CL currently queries for the entire range on each connection, and after #2740 LND nodes will begin doing so roughly once every six hours to a random peer to ensure that any holes in its routing table are filled. Inherently no caching algorithm can be optimal under such circumstances, though empirically, the channel cache quickly converges back to having most of the elements at tip for subsequent queries on the hot spots.

I'm open to other opinions on the default cache sizes, please discuss if people have others they prefer!

Depends on:

@Roasbeef added this to the 0.6 milestone on Mar 28, 2019

@Roasbeef (Member) commented:

Now that the dependent PR has been merged, this can be rebased!

@Roasbeef (Member) left a comment:

Yay caching! I've been running this on a few of my mainnet nodes and have noticed that combined with the gossip sync PR, it pretty much does away with the large memory burst a node can see today on restart.

No major comments, other than pointing out a future direction that we may want to consider where we update the cache with a new entry rather than removing the entry from the cache. Eventually, we can also extend this channel cache to be used in things like path finding or computing DescribeGraph, etc.

// rejectFlagExists is a flag indicating whether the channel exists,
// i.e. the channel is open and has a recent channel update. If this
// flag is not set, the channel is either a zombie or unknown.
rejectFlagExists = 1 << 0

Member comment:

Seems like a safe opportunity to use iota, and also declare these to be typed as rejectFlags rather than being a uint8.

return err
}

c.rejectCache.remove(edge.ChannelID)

Member comment:

Alternatively, we can insert the new copy of the edge policy into the cache. I see no issue in delaying this to a distinct change though once we see how this fares in the wild once most of the network has updated.

@wpaulino (Contributor) left a comment:

> There are certain peers that request a full dump of all known channels, which will require going to disk (unless of course the channel cache is configured to hold the entire graph in memory). AFAIK, CL currently queries for the entire range on each connection, and after #2740 LND nodes will begin doing so roughly once every six hours to a random peer to ensure that any holes in its routing table are filled.

This won't be too bad since we'll only request all the channel_ids we don't know of that the remote peer does. After this is done once, the cost of a historical sync should be pretty negligible as long as we don't go offline for a long period of time.

@Roasbeef (Member) left a comment:

LGTM 🍭

Have been testing this out on my node over the past few days, and it's significantly helped with the initial allocation burst when connecting to peers for the first time. I anticipate we'll also get a lot of feedback from node operators during the RC cycle, which can be used to modify the cache write policies or default sizes. Needs a rebase to remove the Darwin travis timing commits!

@halseth (Contributor) left a comment:

🔥

})
if err != nil {
return err
}

c.rejectCache.remove(edge.ChannelID)
c.chanCache.remove(edge.ChannelID)
if entry, ok := c.rejectCache.get(edge.ChannelID); ok {

Contributor comment:

Better to just remove IMO, we have no guarantees that the updated policy doesn't change the value of the rejectFlags.

Author (@cfromknecht) replied:

The reject flags are only changed with respect to channel existence/zombie pruning; updating the edge policy shouldn't modify rejectFlags.

@cfromknecht (Author) commented:

@Roasbeef @wpaulino @halseth latest version is up and comments addressed, ptal

This commit introduces the Validator interface, which
is intended to be implemented by any sub configs. It
specifies a Validate() error method that should fail
if a sub configuration contains any invalid or insane
parameters.

In addition, a package-level Validate method can be
used to check a variadic number of sub configs
implementing the Validator interface. This allows the
primary config struct to be extended via targeted
and/or specialized sub configs, and validate all of
them in sequence without bloating the main package
with the actual validation logic.
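
A minimal sketch of that interface and its package-level helper (assuming the package is named lncfg, as the subconfig above suggests) could look like:

    package lncfg

    // Validator is implemented by sub configs that know how to sanity
    // check their own parameters.
    type Validator interface {
        // Validate returns an error if the sub config contains any
        // invalid or insane parameters.
        Validate() error
    }

    // Validate checks each of the given sub configs in sequence and
    // returns the first validation error encountered, if any.
    func Validate(validators ...Validator) error {
        for _, v := range validators {
            if err := v.Validate(); err != nil {
                return err
            }
        }
        return nil
    }

The lncfg.Caches subconfig from the CLI section could then implement Validator to reject nonsensical cache sizes.
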
@cfromknecht added the database, optimization, and gossip labels on Apr 2, 2019

@wpaulino (Contributor) left a comment:

LGTM 🛰

@Roasbeef merged commit 1dc1e85 into lightningnetwork:master on Apr 3, 2019
@cfromknecht deleted the reject-and-channel-cache branch on April 3, 2019 02:14