

NIFI-6510 - Analytics framework#3681

Merged
ai-christianson merged 45 commits into master from analytics-framework
Sep 9, 2019

@ai-christianson
Contributor

@ai-christianson ai-christianson commented Aug 29, 2019

Description of PR

Currently NiFi has lots of metrics available for areas including JVM and flow component usage (via component status), as well as provenance data, which NiFi makes available either through the UI or through reporting tasks (for consumption by other systems). Past discussions in the community cite users shipping this data to applications such as Prometheus, ELK stacks, or Ambari Metrics for further analysis, in order to capture and review performance issues, detect anomalies, and send alerts or notifications. These systems are efficient at capturing and helping to analyze these metrics; however, doing so requires customization work and knowledge of NiFi operations to provide meaningful analytics within a flow context.

In speaking with Matt Burgess and Andy Christianson on this topic, we feel that there is an opportunity to introduce an analytics framework that could provide users with reasonable predictions of key performance indicators for flows, such as back pressure and flow rate, to help administrators improve operational management of NiFi clusters. This framework could offer several key features:

- Provide a flexible internal analytics engine and model API that supports adding new onboard models or enhancing existing ones
- Support integration of remote or cloud based ML models
- Support both traditional and online (incremental) learning methods
- Provide support for model caching (perhaps later inclusion into a model repository or registry)
- UI enhancements to display prediction information either in existing summary data, new data visualizations, or directly within the flow/canvas (where applicable)
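As a rough illustration of the "model API" idea above, the Java sketch below shows one possible shape for a pluggable model contract with scoring support and incremental (online) learning. The names `PredictionModel`, `learn`, `predict`, `getScores`, and `RunningMeanModel` are hypothetical and are not taken from this PR; a remote or cloud-backed model could, in principle, implement the same interface by delegating to a service.

```java
import java.util.Map;

// Hypothetical model contract; these names are illustrative, not NiFi's actual interfaces.
interface PredictionModel {
    // Fit (or incrementally update) the model from feature rows and their observed target values.
    void learn(double[][] features, double[] targets);

    // Predict the target value for a single feature row.
    double predict(double[] features);

    // Goodness-of-fit scores keyed by name (e.g. "rSquared") so callers can decide whether to trust predict().
    Map<String, Double> getScores();
}

// Trivial baseline: always predicts the mean of every target seen so far.
class RunningMeanModel implements PredictionModel {
    private double sum;
    private long count;

    @Override
    public void learn(double[][] features, double[] targets) {
        // Incremental: each call adds to running totals instead of retraining from scratch.
        for (double t : targets) {
            sum += t;
            count++;
        }
    }

    @Override
    public double predict(double[] features) {
        return count == 0 ? Double.NaN : sum / count;
    }

    @Override
    public Map<String, Double> getScores() {
        // Predicting the training mean gives R-squared = 0 on that data (assuming targets are not all identical).
        return Map.of("rSquared", count == 0 ? Double.NaN : 0.0);
    }
}
```

Keeping scoring on the interface lets the caller, rather than each model, decide when a prediction is good enough to show.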

For an initial target, we thought that back pressure prediction would be a good starting point for this initiative, given that back pressure is a key indicator of flow performance and many of the metrics currently available would provide enough data points to build a reasonably performing model. We have some ideas on how this could be achieved; however, we wanted to discuss this more with the community to get thoughts on tackling this work, especially if there are specific use cases or other factors that should be considered.

This closes NIFI-6510.

PR Testing and Validation Notes:

When analytics is enabled (via nifi.analytics.predict.enabled in nifi.properties), back pressure predictions should appear as part of the connection/queue tooltip. The tooltip should include two predictions:

  • Predicted queue size in the next interval (the interval is configured by nifi.analytics.predict.interval in nifi.properties)
  • Estimated time until back pressure is encountered

These predictions are available in the context of queue object count and queue content size.
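A minimal sketch of how both predictions could be derived from periodic snapshots, assuming a single-variable least-squares line of queued count against snapshot time. The commit history mentions `SimpleRegression` and ordinary least squares models; this example hand-rolls the fit instead and is not the PR's code. `QueueTrend`, `predictAt`, `millisUntil`, and the use of -1 for "no prediction" are assumptions. The same approach would apply to queued bytes.

```java
// Hypothetical sketch: least-squares fit of queued count (y) against snapshot time in millis (x).
class QueueTrend {
    private final double slope;
    private final double intercept;

    QueueTrend(double[] timesMillis, double[] queuedCounts) {
        int n = timesMillis.length;
        if (n < 2 || n != queuedCounts.length) {
            throw new IllegalArgumentException("need at least 2 paired samples");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += timesMillis[i];
            meanY += queuedCounts[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = timesMillis[i] - meanX;
            sxy += dx * (queuedCounts[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("timestamps must not all be equal");
        }
        slope = sxy / sxx;
        intercept = meanY - slope * meanX;
    }

    // Predicted queue size at a future time, e.g. now + prediction interval.
    double predictAt(double timeMillis) {
        return intercept + slope * timeMillis;
    }

    // Millis from nowMillis until the fitted line reaches threshold; -1 if it never will (flat or draining).
    long millisUntil(double threshold, long nowMillis) {
        if (predictAt(nowMillis) >= threshold) {
            return 0;
        }
        if (slope <= 0) {
            return -1;
        }
        double hitTime = (threshold - intercept) / slope;
        return Math.max(0, Math.round(hitTime - nowMillis));
    }
}
```

For example, a queue growing by 100 objects per minute from 100 objects predicts about 400 objects one minute after the third snapshot.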

[Screenshot, 2019-09-03 2:57:11 PM]

Values can also be viewed in the NiFi Summary:

[Screenshot, 2019-09-03 3:09:17 PM]
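The commit log mentions showing the minimum estimated time to back pressure in the connection summary table, fixing the display when only one of the predictions is known, checking for a -1 value, and a `minNonNegative` method. A hedged sketch of that kind of helper, assuming -1 means "unknown" (the class name `BackpressureEstimates` is hypothetical):

```java
// Hypothetical helper: smallest non-negative value, or -1 when no estimate is known.
final class BackpressureEstimates {
    private BackpressureEstimates() {}

    static long minNonNegative(long... values) {
        long min = -1;
        for (long v : values) {
            if (v < 0) {
                continue; // -1 marks "no prediction available" in this sketch
            }
            if (min < 0 || v < min) {
                min = v;
            }
        }
        return min;
    }
}
```

Called with the count-based and size-based estimates, this returns whichever is sooner and still yields a value when only one of them is known.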

With default settings, predictions take about 2-3 minutes to appear, since snapshots are only taken once per minute. The Administration Guide updates in this PR describe how to configure this and the other analytics properties if the timing needs adjusting.
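For orientation, here is a small Java sketch that reads the two properties named above from `nifi.properties`-style text with `java.util.Properties`. The class `AnalyticsSettings`, the duration format (e.g. `2 mins`), and the `3 mins` fallback are assumptions for illustration; NiFi reads these properties through its own configuration code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for the two analytics settings named in the testing notes.
final class AnalyticsSettings {
    final boolean enabled;
    final long predictionIntervalMillis;

    AnalyticsSettings(Properties props) {
        // Analytics stay off unless explicitly enabled (the PR makes "disabled" the default).
        enabled = Boolean.parseBoolean(props.getProperty("nifi.analytics.predict.enabled", "false").trim());
        // "3 mins" is an assumed fallback for this sketch.
        predictionIntervalMillis = parseDuration(props.getProperty("nifi.analytics.predict.interval", "3 mins"));
    }

    // Hand-rolled parser for values such as "30 secs", "2 mins" or "1 hour" (assumed format).
    static long parseDuration(String value) {
        String[] parts = value.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected '<number> <unit>': " + value);
        }
        long amount = Long.parseLong(parts[0]);
        String unit = parts[1].toLowerCase();
        if (unit.startsWith("sec")) return TimeUnit.SECONDS.toMillis(amount);
        if (unit.startsWith("min")) return TimeUnit.MINUTES.toMillis(amount);
        if (unit.startsWith("hour") || unit.startsWith("hr")) return TimeUnit.HOURS.toMillis(amount);
        throw new IllegalArgumentException("unsupported unit: " + parts[1]);
    }

    // Convenience for loading properties-file text, e.g. in tests.
    static AnalyticsSettings fromText(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new AnalyticsSettings(props);
    }
}
```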

Suggested tests include:

  • Enabling/disabling the feature
  • Increasing/decreasing time intervals
  • Prediction assessment: confirm that the predictions are reasonable, e.g. is back pressure experienced at or near the predicted times?
  • Filling and quickly emptying queues: this should cause predictions to temporarily halt as the model becomes less reliable (sharp changes in the samples cause high variance)
  • Changing score thresholds: a higher score threshold means NiFi only uses models with less error, but such models may be hard to obtain from the available samples, so predictions may be available less often. A lower threshold means NiFi may provide more predictions, but with a higher level of error
  • Invalid values for properties (e.g. invalid model implementation)
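The score-threshold behavior above can be sketched as a simple gate on the coefficient of determination, R-squared = 1 - SS_res / SS_tot. The commits mention an R-squared check and a configurable score threshold; the class and method names here (`ScoreGate`, `rSquared`, `usable`) are hypothetical. A queue that fills and then empties quickly fits a straight line poorly, which lowers R-squared and can push it below the threshold.

```java
// Hypothetical gate: surface a prediction only when the model's R-squared meets the threshold.
final class ScoreGate {
    private ScoreGate() {}

    // R^2 = 1 - SS_res / SS_tot; NaN when SS_tot is 0 (all observations identical).
    static double rSquared(double[] observed, double[] fitted) {
        if (observed.length == 0 || observed.length != fitted.length) {
            throw new IllegalArgumentException("need equally sized, non-empty arrays");
        }
        double mean = 0;
        for (double y : observed) {
            mean += y;
        }
        mean /= observed.length;
        double ssRes = 0;
        double ssTot = 0;
        for (int i = 0; i < observed.length; i++) {
            double residual = observed[i] - fitted[i];
            double deviation = observed[i] - mean;
            ssRes += residual * residual;
            ssTot += deviation * deviation;
        }
        return ssTot == 0 ? Double.NaN : 1 - ssRes / ssTot;
    }

    // An undefined (NaN) score never passes, so the UI shows "no prediction" rather than a misleading one.
    static boolean usable(double score, double threshold) {
        return !Double.isNaN(score) && score >= threshold;
    }
}
```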

For all changes:

  • [X] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

  • [X] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • Have you verified that the full build is successful on both JDK 8 and JDK 11?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

ai-christianson and others added 28 commits August 28, 2019 16:20
(cherry picked from commit e013b91)

DFA-9 - updated logging and corrected logic for checking if not in backpressure

(cherry picked from commit a1f8e70)
(cherry picked from commit 050e0fc)

(cherry picked from commit 9fd365f)
…entAccess, updated interface to use connection id

(cherry picked from commit 14854ff)

DFA-9 - reduced snapshot interval to 1 minute

(cherry picked from commit 36abb0a)
… just use ConnectionStatusAnalytics directly
…aces as we can just use ConnectionStatusAnalytics directly"

This reverts commit 5b9fead.
* DFA-9 - Initial refactor for Status Analytics - created additional interfaces for models, refactored callers to use StatusAnalytics objects with connection context. Implemented SimpleRegression model.

DFA-9 - added logging

* DFA-9 - relocated query window to CSA from model, adding the prediction percentages and time interval

* DFA-9 - checkstyle fixes
…lso changes to properly reflect when predictions can be made vs not.

(cherry picked from commit 6fae058)
(cherry picked from commit 6d7a13b)
…d variable names

(cherry picked from commit 58c7c81)
(cherry picked from commit b6e35ac)
Updates to support multiple variables for features, clearing cached regression model based on r-squared values

Added ordinary least squares model, which truly uses multivariable regression. Refactor of interfaces to include more general interface for variate models (that include scoring support).

Ratcheck fixes

Added test for SimpleRegression. Minor fix for OLS model

fixed test errors

fixed checkstyle errors

(cherry picked from commit fab411b)
… object. Also allow configurable model from nifi.properties

NIFI-6566 - changes to allow scoring configurations for model in nifi.properties

NIFI-6566 - added default implementation value to NiFiProperties

NIFI-6566 - correction to default variable name in NiFiProperties, removed unnecessary init method from ConnectionStatusAnalytics

Signed-off-by: Matthew Burgess <mattyb149@apache.org>

This closes #3663
…ns. Added check in ConnectionStatusAnalytics to confirm expected model by type
This closes NIFI-6586

Signed-off-by: Andrew I. Christianson <andy@andyic.org>
…the UI

* Add multi-line tooltips with detail for connection queue back pressure graphics.
* Add estimated time to back pressure to connections summary table.
* Add back pressure prediction ticks.
* add moment.js to format predicted time to back pressure
* tweak summary table headings to match data displayed. re-order connection summary columns
…the connection summary table. Also added a js doc comment.
@YolandaMDavis YolandaMDavis changed the title NIFI-6510 - Analytics framework NIFI-6510 [WIP] - Analytics framework Aug 30, 2019
@YolandaMDavis
Contributor

Relabeling this PR as WIP due to a newer commit that supports enabling/disabling the feature (with disabled being the default). @rfellows this will require a UI update so that users can tell whether predictions are enabled vs. available. Once that commit is available, I'll remove WIP from the title.

@YolandaMDavis YolandaMDavis changed the title NIFI-6510 [WIP] - Analytics framework NIFI-6510 - Analytics framework Sep 3, 2019
Contributor

@markap14 markap14 left a comment


Thanks for submitting the Pull Request! It's clear that a huge amount of work has gone into this - some very cool stuff here! I provided quite a bit of feedback. It mostly revolves around coding style and keeping the API clean. I did notice, however, in the DTO Factory that we are not checking READ permissions on the connection, its source, and its destination. We do need to fix that. I commented more specifically inline, but as it is, it may be exposing the names of components that the user does not have permissions to see.

```java
private String destinationId;
private String destinationName;
private String backPressureDataSizeThreshold;
private Boolean predictionsAvailable;
```
Contributor


Any reason to use a Boolean object here rather than a boolean? Doesn't seem like something that would be unset.

```java
private long outputBytes;
private int maxQueuedCount;
private long maxQueuedBytes;
private long predictionIntervalMillis;
```
Contributor


Not necessarily a 'requirement', but there are quite a few fields being added here related to predictions. I wonder if it makes sense to have a BackpressurePrediction object instead that houses these? This would be particularly helpful if additional fields are added later, either related to back pressure prediction or to some other kind of analysis.

Contributor


Originally there was only one field to be added to ConnectionStatus and any others were envisioned to go in a separate analytics endpoint. Now that we've got many fields to make available via this object, I agree it's probably better to have a holder object.

```java
private String queuedCount;
private Integer percentUseCount;
private Integer percentUseBytes;
private Long predictedMillisUntilCountBackpressure = 0L;
```
Contributor


As with the ConnectionStatus object above, it may make sense to break this out into a separate BackpressurePredictionDTO object.

Contributor Author


@markap14 we can do that. Do you see any issues with embedding the BackpressurePredictionDTO inside the ConnectionStatusDTO? We're considering that option because then the data would be available within the same request (less UI/API impact). It would be set to null if analytics are disabled.

@YolandaMDavis @mattyb149 @rfellows
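One possible shape for the embedded holder object discussed here, as a sketch only: `predictionIntervalMillis` and `predictedMillisUntilCountBackpressure` appear in the diff above, while `predictedMillisUntilBytesBackpressure`, the `predictions` field name, and `hasPredictions()` are assumptions. The nested object is left `null` when analytics are disabled, so clients can test for its presence instead of reading a separate flag.

```java
// Hypothetical prediction holder, embedded in the status DTO so it arrives in the same request.
class BackpressurePredictionDTO {
    long predictionIntervalMillis;
    long predictedMillisUntilCountBackpressure;
    long predictedMillisUntilBytesBackpressure; // assumed counterpart for queued bytes
}

// Simplified stand-in for the connection status DTO; only the fields needed for the illustration.
class ConnectionStatusDTO {
    String id;
    // null when analytics are disabled or no prediction is available
    BackpressurePredictionDTO predictions;

    boolean hasPredictions() {
        return predictions != null;
    }
}
```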

Contributor


Seems fair to me @achristianson


```java
if (statusAnalyticsEngine != null) {
    StatusAnalytics statusAnalytics = statusAnalyticsEngine.getStatusAnalytics(conn.getIdentifier());
    if (statusAnalytics != null) {
```
Contributor


If this evaluates to false should we be calling connStatus.setPredictionsAvailable(false);? Or perhaps set it to false above and allow only this branch to override it to true. As-is, we explicitly set it to false if engine is null but leave it unset if statusAnalytics is null.
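A hedged sketch of the "default to false, let only one branch set true" pattern suggested above. `StatusAnalyticsEngine.getStatusAnalytics(...)` mirrors the call in the diff, but these interfaces are simplified stand-ins, and `hasPrediction()` and `PredictionFlag` are hypothetical. Returning a primitive `boolean`, as the earlier review comment suggested, also rules out an unset (`null`) value.

```java
// Simplified stand-ins for the engine/analytics types in the diff; hasPrediction() is hypothetical.
interface StatusAnalytics {
    boolean hasPrediction();
}

interface StatusAnalyticsEngine {
    StatusAnalytics getStatusAnalytics(String connectionId);
}

final class PredictionFlag {
    private PredictionFlag() {}

    // Start from false; only the fully successful branch may flip it, so a missing engine
    // and a missing per-connection analytics object produce the same answer.
    static boolean predictionsAvailable(StatusAnalyticsEngine engine, String connectionId) {
        boolean available = false;
        if (engine != null) {
            StatusAnalytics analytics = engine.getStatusAnalytics(connectionId);
            if (analytics != null) {
                available = analytics.hasPrediction();
            }
        }
        return available;
    }
}
```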

Contributor


@achristianson I think your recent commit resolved this, is that correct?

Contributor Author


I believe so, yes.


```java
connectionStatisticsDTO.setGroupId(connection.getProcessGroup().getIdentifier());
connectionStatisticsDTO.setId(connection.getIdentifier());
connectionStatisticsDTO.setName(connection.getName());
```
Contributor


The authorization that is checked for this endpoint is the authority to READ the flow, which I think is fine for determining whether or not statistics can be retrieved. However, this data model contains the name of the Connection, the name of the Source, and the name of the Destination. In order to include these, the user's permissions must first be checked to see if they have READ permissions to the Connection, the Source, and the Destination, respectively. If the user does not have READ permissions to the Source, for instance, the name of the Source needs to be set to its ID. The same applies to the name of the Connection and the name of the Destination.
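The rule described above reduces to a small helper. This is a sketch with a hypothetical `ReadCheck` interface standing in for NiFi's real authorization API, and `NameMasking.displayName` is an illustrative name; it would be applied separately to the connection, source, and destination names.

```java
// Hypothetical stand-in for a per-component READ permission check.
interface ReadCheck {
    boolean canRead(String componentId);
}

final class NameMasking {
    private NameMasking() {}

    // Return the component's name if the user may READ it; otherwise fall back to its ID.
    static String displayName(ReadCheck user, String componentId, String componentName) {
        return user.canRead(componentId) ? componentName : componentId;
    }
}
```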

Contributor


I pretty much copied this part from createConnectionStatusDto() above, which doesn't appear to have those authorizations either but does seem to include connection information. I assume we should make the same change there?

Contributor


Actually, digging into this a bit more, I think it is okay in terms of authorization. Because of the way that ConnectionStatus gets generated, the authorization is done when the Status object is generated. Typically, we handle authorization at the DTO level instead. I'm not sure off the top of my head why it's done that way in this case; perhaps it could be revisited in the future.
That said, looking at this, I don't think we actually need the name of the connection, or the source component's name or id, or the destination's name or id. We really just need the ID, maybe the group ID, refresh time, and the predictions. We should avoid including any extra information that is not needed, because it adds to the overhead of processing (bandwidth, CPU for serializing/deserializing/compressing/decompressing, etc.) We can always add things in later if we need them. But once we release with this, we can't really take them out because it would be a breaking change in terms of backward compatibility.


```java
snapshot.setId(connection.getIdentifier());
snapshot.setGroupId(connection.getProcessGroup().getIdentifier());
snapshot.setName(connection.getName());
```
Contributor


The same applies here for the name of the connection, the name of the source, and the name of the destination.

@ai-christianson
Contributor Author

Extracted out predictions into their own ConnectionStatusPredictionsSnapshotDTO. @rfellows the UI field is now just predictions and will be null if predictions are not available. The fields within predictions are exactly the same, just encapsulated within their own object now.

@markap14 @YolandaMDavis @mattyb149

@markap14
Contributor

markap14 commented Sep 9, 2019

Thanks guys. All of my major concerns are addressed at this point, I think. I'm a +1 to merge.

@ai-christianson ai-christianson merged commit 8a8b9c1 into master Sep 9, 2019
szaboferee pushed a commit to szaboferee/nifi that referenced this pull request Oct 7, 2019
* NIFI-6510 Implement initial analytic engine

* NIFI-6510 Implemented basic linear regression model for queue counts

* NIFI-6510 Initial analytics REST endpoint and supporting objects

* NIFI-6510 Connect the dots for StatusAnalytics -> API

* NIFI-6510 Added poc engine with prediction model caching

(cherry picked from commit e013b91)

DFA-9 - updated logging and corrected logic for checking if not in backpressure

(cherry picked from commit a1f8e70)

* NIFI-6510 Updated objects and interfaces to reflect 4 prediction metrics

(cherry picked from commit 050e0fc)

(cherry picked from commit 9fd365f)

* NIFI-6510 adjustments for interface updates, added call to StandardEventAccess, updated interface to use connection id

(cherry picked from commit 14854ff)

DFA-9 - reduced snapshot interval to 1 minute

(cherry picked from commit 36abb0a)

* NIFI-6510 Split StatusAnalytics interface into Engine and per-Connection versions

* NIFI-6510 Remove redundant connection prediction interfaces as we can just use ConnectionStatusAnalytics directly

* NIFI-6510 Revert "DFA-9 Remove redundant connection prediction interfaces as we can just use ConnectionStatusAnalytics directly"

This reverts commit 5b9fead.

* NIFI-6510 Added prediction fields for use by UI, still need to be populated

* NIFI-6510 Analytics Framework Introduction (apache#10)

* DFA-9 - Initial refactor for Status Analytics - created additional interfaces for models, refactored callers to use StatusAnalytics objects with connection context. Implemented SimpleRegression model.

DFA-9 - added logging

* DFA-9 - relocated query window to CSA from model, adding the prediction percentages and time interval

* DFA-9 - checkstyle fixes

* NIFI-6510 Add prediction percent values and predicted interval seconds

(cherry picked from commit e60015d)

* NIFI-6510 Changes to inject flowManager instead of flow controller, also changes to properly reflect when predictions can be made vs not.

(cherry picked from commit 6fae058)

* NIFI-6510 Added tests for engine

(cherry picked from commit 6d7a13b)

* NIFI-6150 Added tests for connection status analytics class, corrected variable names

(cherry picked from commit 58c7c81)

* NIFI-6150 Make checkstyle happy

(cherry picked from commit b6e35ac)

* NIFI-6150 Fixed NaN check and refactored time prediction. Switched to use non caching engine for testing

* NIFI-6510 Fixed checkstyle issue in TestConnectionStatusAnalytics

* NIFI-6510 Adjusted interval and incorporated R-squared check

Updates to support multiple variables for features, clearing cached regression model based on r-squared values

Added ordinary least squares model, which truly uses multivariable regression. Refactor of interfaces to include more general interface for variate models (that include scoring support).

Ratcheck fixes

Added test for SimpleRegression. Minor fix for OLS model

fixed test errors

fixed checkstyle errors

(cherry picked from commit fab411b)

* NIFI-6510 Added property to nifi.properties - Prediction Interval for connection status analytics (apache#11)

* NIFI-6566 - Refactor to decouple model instance from status analytics object. Also allow configurable model from nifi.properties

NIFI-6566 - changes to allow scoring configurations for model in nifi.properties

NIFI-6566 - added default implementation value to NiFiProperties

NIFI-6566 - correction to default variable name in NiFiProperties, removed unnecessary init method from ConnectionStatusAnalytics

Signed-off-by: Matthew Burgess <mattyb149@apache.org>

This closes apache#3663

* NIFI-6585 - Refactored tests to use mocked models and extract functions.  Added check in ConnectionStatusAnalytics to confirm expected model by type

* NIFI-6586 - documentation and comments

This closes NIFI-6586

Signed-off-by: Andrew I. Christianson <andy@andyic.org>

* NIFI-6568 - Surface time-to-back-pressure and initial predictions in the UI
* Add multi-line tooltips with detail for connection queue back pressure graphics.
* Add estimated time to back pressure to connections summary table.
* Add back pressure prediction ticks.
* add moment.js to format predicted time to back pressure
* tweak summary table headings to match data displayed. re-order connection summary columns

* NIFI-6568 - Properly sort the min estimated time to back pressure in the connection summary table. Also added a js doc comment.

* NIFI-6510 - add an enable/disable property for analytics

* NIFI-6510 - documentation updates for enable/disable property

* NIFI-6510 - UI: handle the scenario where backpressure predictions are disabled (apache#3685)

* NIFI-6510 - admin guide updates to further describe model functionality

* NIFI-6510 - code quality fixes (if statement and constructor)

* NIFI-6510 - log warnings when properties could not be retrieved. fixed incorrect property retrieval for score threshold

* NIFI-6510 Extract out predictions into their own DTO

* NIFI-6510 Optimize imports

* NIFI-6510 Fix formatting

* NIFI-6510 Optimize imports

* NIFI-6510 Optimize imports

* NIFI-6510 - Notice updates for Commons math and Caffeine

* NIFI-6510 - UI updates to account for minor API changes for back pressure predictions (apache#3697)

* NIFI-6510 - Fix issue displaying estimated time to back pressure in connection summary table when only one of the predictions is known.

Signed-off-by: Matthew Burgess <mattyb149@apache.org>

This closes apache#3705

* NIFI-6510 Rip out useless members

* NIFI-6510 - dto updates to check for -1 value

* NIFI-6510 - checkstyle fix

* NIFI-6510 - rolled back last change and applied minNonNegative method

* NIFI-6510 Rip out useless members
patricker pushed a commit to patricker/nifi that referenced this pull request Jan 22, 2020
@exceptionfactory exceptionfactory deleted the analytics-framework branch March 16, 2023 16:35