Add DSS plugin for MLflow #191

Conversation

vthorey
Contributor

@vthorey vthorey commented Nov 26, 2021

We follow the setup proposed in MLflow Tracking scenario 4 for tracking experiments: a remote backend server, which is implemented directly in the DSS public API (in https://github.com/dataiku/dip/tree/feature/mlflow-experiment-tracking), and an artifact host, which is a DSS managed folder.

To communicate with the DSS public API and use a managed folder as the artifact store, this PR defines a new MLflow plugin (see the MLflow plugins documentation). What it does:

  1. Add authentication and the project key to MLflow client requests.
    Settings are passed to the MLflow client through the environment variables "DSS_MLFLOW_HEADER", "DSS_MLFLOW_TOKEN" and "DSS_MLFLOW_PROJECTKEY". Their contents are simply added to the headers of MLflow client requests.

  2. Add a connector to handle managed folders for MLflow artifacts.
    MLflow provides connectors for many backends: S3, FileSystem, Databricks, etc. Implementing this one means creating a new connector for managed folders that uses the dataikuapi client's methods to manipulate them.
    In MLflow, to log an artifact, the MLflow client asks the backend for a URI and then uploads the artifact to that URI itself. So the MLflow client needs to be able to upload, download, list, delete, etc. in a managed folder. In the plugin, the MLflow client spawns an instance of the dataikuapi client to perform these operations.
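
The two parts above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the method names mirror MLflow's `RequestHeaderProvider` (`in_context`, `request_headers`) and `ArtifactRepository` (`log_artifact`, `list_artifacts`, `delete_artifacts`) interfaces, while `InMemoryFolder`, `download_file`, the `X-Example-Project-Key` header name, and the reading of `DSS_MLFLOW_HEADER` as a header name are all hypothetical. A real plugin would subclass the MLflow base classes and call the dataikuapi client instead of the in-memory stand-in.

```python
import os
import posixpath


class HeaderProviderSketch:
    """Same two methods as MLflow's RequestHeaderProvider interface."""

    # Hypothetical header name for the project key; not taken from the PR.
    PROJECT_KEY_HEADER = "X-Example-Project-Key"

    def in_context(self):
        # Contribute headers only when all three variables are set.
        required = ("DSS_MLFLOW_HEADER", "DSS_MLFLOW_TOKEN", "DSS_MLFLOW_PROJECTKEY")
        return all(name in os.environ for name in required)

    def request_headers(self):
        # Assumption: DSS_MLFLOW_HEADER names the auth header and
        # DSS_MLFLOW_TOKEN holds its value.
        return {
            os.environ["DSS_MLFLOW_HEADER"]: os.environ["DSS_MLFLOW_TOKEN"],
            self.PROJECT_KEY_HEADER: os.environ["DSS_MLFLOW_PROJECTKEY"],
        }


class InMemoryFolder:
    """Hypothetical stand-in for a DSS managed folder (not the dataikuapi API)."""

    def __init__(self):
        self._files = {}

    def put(self, path, data):
        self._files[path] = data

    def get(self, path):
        return self._files[path]

    def list(self, prefix=""):
        return sorted(p for p in self._files if p.startswith(prefix))

    def delete(self, path):
        del self._files[path]


class ManagedFolderArtifactRepositorySketch:
    """Upload, list, download and delete artifacts stored in a folder."""

    def __init__(self, folder, root="artifacts"):
        self.folder = folder
        self.root = root

    def _remote_path(self, *parts):
        # Skip empty or None parts so that artifact_path stays optional.
        return posixpath.join(self.root, *[p for p in parts if p])

    def log_artifact(self, local_file, artifact_path=None):
        remote = self._remote_path(artifact_path, os.path.basename(local_file))
        with open(local_file, "rb") as f:
            self.folder.put(remote, f.read())

    def list_artifacts(self, path=None):
        # Simplified: returns relative paths of all files under the prefix,
        # where MLflow itself returns FileInfo objects.
        prefix = self._remote_path(path) + "/"
        return [posixpath.relpath(p, self.root) for p in self.folder.list(prefix)]

    def download_file(self, artifact_path, local_dir):
        os.makedirs(local_dir, exist_ok=True)
        local_path = os.path.join(local_dir, posixpath.basename(artifact_path))
        with open(local_path, "wb") as f:
            f.write(self.folder.get(self._remote_path(artifact_path)))
        return local_path

    def delete_artifacts(self, artifact_path=None):
        prefix = self._remote_path(artifact_path) + "/"
        for remote in self.folder.list(prefix):
            self.folder.delete(remote)
```

Keeping the folder access behind a small object makes it possible to exercise the repository logic without a running DSS instance.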

Note on the implementation:
To avoid loading the plugin for every user who installs dataiku-api-client-python, the plugin's entry points are added dynamically by the "load_dss_mlflow_plugin" function instead of being declared in setup.py.
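
One possible way to add entry points at run time is sketched below. It is not the body of `load_dss_mlflow_plugin`, which is not shown here: the sketch writes a throwaway `.dist-info` directory containing an `entry_points.txt` file, puts its parent on `sys.path`, and checks that the standard library's `importlib.metadata` can see the entries. The group names `mlflow.request_header_provider` and `mlflow.artifact_repository` are the ones MLflow scans for plugins; the distribution name, module paths, class names and the `example-managed-folder` scheme are hypothetical placeholders.

```python
import os
import sys
import tempfile
from importlib import metadata

# Entry-point groups that MLflow scans for plugins.
HEADER_GROUP = "mlflow.request_header_provider"
ARTIFACT_GROUP = "mlflow.artifact_repository"


def register_entry_points_sketch():
    """Write a throwaway dist-info directory and expose it through sys.path."""
    # mkdtemp gives the directory a randomized name.
    base_dir = tempfile.mkdtemp(prefix="example-mlflow-plugin-")
    dist_info = os.path.join(base_dir, "example_dss_mlflow_plugin-0.0.0.dist-info")
    os.makedirs(dist_info)
    with open(os.path.join(dist_info, "METADATA"), "w") as f:
        f.write("Metadata-Version: 2.1\nName: example-dss-mlflow-plugin\nVersion: 0.0.0\n")
    with open(os.path.join(dist_info, "entry_points.txt"), "w") as f:
        # Module and class names below are hypothetical placeholders.
        f.write(
            f"[{HEADER_GROUP}]\n"
            "unused = example_plugin.header_provider:ExampleHeaderProvider\n"
            f"[{ARTIFACT_GROUP}]\n"
            "example-managed-folder = example_plugin.artifact_repository:ExampleArtifactRepository\n"
        )
    sys.path.insert(0, base_dir)
    return base_dir


def find_entry_points(group):
    """Return the entry points of one group, across Python versions."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        return list(eps.select(group=group))
    return list(eps.get(group, []))  # Python 3.8 and 3.9
```

Returning the directory path lets the caller remove it from `sys.path` and delete it once the plugin is no longer needed.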

The plugin can be used together with the PR https://github.com/dataiku/dip/pull/14442. Here is a code sample:

import dataiku

client = dataiku.api_client()
# Use the mlflow handle provided by setup_mlflow for this project
with client.setup_mlflow("projectkey") as mlflow:
    experiment_name = "My experiment"
    mlflow.set_experiment(experiment_name)

@vthorey vthorey added this to the V 11.0.0 milestone Nov 26, 2021
@vthorey vthorey requested a review from lpenet November 26, 2021 10:52
@shortcut-integration

This pull request has been linked to Shortcut Story #75434: Write a mlflow.request_header_provider plugin for MLflow.

@vthorey vthorey marked this pull request as ready for review November 26, 2021 11:06
@vthorey vthorey added the team-mielpops Team MieL pOps label Nov 26, 2021
@vthorey vthorey self-assigned this Nov 26, 2021
@lpenet
Contributor

lpenet commented Nov 30, 2021

Can you please:

  • create a feature/mlflow-experiment-tracking branch for this repo
  • base this PR on it?

@vthorey vthorey changed the base branch from master to feature/mlflow-experiment-tracking November 30, 2021 16:17
Contributor

@lpenet lpenet left a comment

In load_dss_mlflow_plugin(), having a fixed name will be an issue:

  • if we have multiple API clients running on the same host
  • even more so if we change the content of the generated file in a later version

So, I would:

  • randomize the directory name a bit
  • give the user the opportunity to clean up at the end.

I would probably return the full path of the created directory here, and have setup_mlflow in DSSClient return an object containing the variables that were set and this path.

In addition to this, I would add a context manager, so that we can write something like:

with client.setup_mlflow("pouet") as pouet:
    ...

This would automatically unset the environment variables and clean up the directory at the end.
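
A minimal sketch of such a context manager, using only the standard library, could look like this. It is an illustration of the suggestion, not the eventual `setup_mlflow`: the function name and its parameters are hypothetical, and it yields the temporary directory path rather than an `mlflow` handle.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def mlflow_env_sketch(project_key, header_name, token):
    """Set the plugin's environment variables, then undo everything on exit."""
    new_values = {
        "DSS_MLFLOW_HEADER": header_name,
        "DSS_MLFLOW_TOKEN": token,
        "DSS_MLFLOW_PROJECTKEY": project_key,
    }
    # Remember previous values so that they can be restored afterwards.
    previous = {name: os.environ.get(name) for name in new_values}
    # mkdtemp randomizes the directory name, so concurrent clients do not clash.
    plugin_dir = tempfile.mkdtemp(prefix="dss-mlflow-")
    os.environ.update(new_values)
    try:
        yield plugin_dir
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old
        shutil.rmtree(plugin_dir, ignore_errors=True)
```

Because the cleanup sits in a `finally` clause, the variables are restored and the directory is removed even if the body of the `with` block raises.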

@vthorey vthorey requested a review from lpenet December 2, 2021 13:48
@lpenet lpenet requested review from lpenet and instanceofme December 3, 2021 07:49
Co-authored-by: Ludovic Pénet <[email protected]>
Contributor

@lpenet lpenet left a comment

LGTM + tested. Note: merging into the feature branch.

@lpenet lpenet merged commit 67cf94b into feature/mlflow-experiment-tracking Dec 9, 2021