
druid ingest executor #5234 (Draft)

rohithreddykota wants to merge 4 commits into main
Conversation

rohithreddykota (Contributor)

Summary

This PR introduces a druidIndexExecutor, which executes Druid index tasks using configuration from YAML model files. The executor ingests data from an object store into Druid: it calculates the intervals that need indexing, generates the index spec JSON dynamically, and runs the ingestion based on the model's input and output configurations.

Example:

kind: model

refresh:
  cron: 2 * * * *

connector: gcs
path: gs://hitech.rilldata.com/data-export/etl/bids/monthly/
pattern: 'yyyy=2006/mm=01/dd=02/HH=15'
gran: 1h
format: parquet
file_pattern: '.*\.parquet'
retry_period: 15m
max_retries: 3

incremental: true
output:
  connector: druid
  dataSource: demand_log_qa
  initial_look_back_period: 3h
  period_before: 1h
  quiet_period: 1h
  catchup: false
  max_work: 2h # maximum interval to index at a time
  coordinator_url: https://druid.ws1.hitech.rilldata.com/druid/coordinator/v1/datasources
  datasource_name: demand_log
  spec_json: >
    {
      "type": "index_parallel",
      "spec": {
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "google",
            "prefixes": {{ .prefixes | toJson }},
          },
          "inputFormat": {
            "type": "csv",
            "findColumnsFromHeader": true
          }
        },
        "tuningConfig": {
          "type": "index_parallel",
          "partitionsSpec": {
            "type": "dynamic"
          }
        },
        "dataSchema": {
          "dataSource": "%s",
          "timestampSpec": {
            "column": "timestamp",
            "format": "iso"
          },
          "transformSpec": {},
          "dimensionsSpec": {
            "dimensions": [
              {"type": "long", "name": "id"},
              "publisher",
              "domain",
              {"type": "double", "name": "bid_price"}
            ]
          },
          "granularitySpec": {
            "queryGranularity": "none",
            "rollup": false,
            "segmentGranularity": "day",
            "intervals": {{ .intervals | toJson }}
          }
        }
      }
    }
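
For orientation, a rough sketch of the control flow described above. The `Execute` signature and all helper names here are hypothetical and not the code in this PR:

```go
// Hypothetical sketch of the druidIndexExecutor flow; helper names are invented.
func (e *druidIndexExecutor) Execute(ctx context.Context) error {
	// 1. Decode the model's input (object store) and output (Druid) properties.
	inputProps, outputProps, err := e.decodeProperties() // hypothetical helper
	if err != nil {
		return err
	}

	// 2. Work out which time intervals still need indexing, based on
	//    previous_execution_time / previous_interval_end_time and the
	//    look-back, quiet-period and max_work settings.
	intervals, prefixes, err := e.resolvePendingWork(ctx, inputProps, outputProps) // hypothetical helper
	if err != nil {
		return err
	}

	// 3. Render the spec_json template with the prefixes and intervals for this run.
	spec, err := e.renderSpec(outputProps, prefixes, intervals) // hypothetical helper
	if err != nil {
		return err
	}

	// 4. Submit the native index task to Druid and wait for completion,
	//    retrying within retry_period / max_retries.
	return e.submitAndAwait(ctx, spec) // hypothetical helper
}
```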

@rohithreddykota marked this pull request as ready for review July 10, 2024 06:29
@rohithreddykota marked this pull request as draft July 10, 2024 14:28
@begelundmuller (Contributor) left a comment

Some high-level questions:

  1. I didn't go deep on the time manipulation logic here, but it seems like it's basically trying to emulate splits/partitions for incremental ingestion? If yes, it would be better to support that with native splits or something similar – and then the actual input/output properties could just template in values from the split currently being executed. Does that make sense? And do you have any inputs or new discoveries for things to think about here?
  2. Did you look into using Druid's INSERT or REPLACE SQL commands instead? I know it's less mature, but would be sweet if we could get away with only supporting the SQL interface.

@@ -256,12 +257,19 @@ func (c *connection) AsObjectStore() (drivers.ObjectStore, bool) {

// AsModelExecutor implements drivers.Handle.
func (c *connection) AsModelExecutor(instanceID string, opts *drivers.ModelExecutorOptions) (drivers.ModelExecutor, bool) {
if opts.OutputHandle == c && opts.InputConnector == "gcs" {

Connector names can be aliased (e.g. if you connect with two different service accounts for different buckets). So instead of checking InputConnector it should check InputHandle.Driver() instead.
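
A minimal sketch of the suggested check (the executor construction on the right-hand side is hypothetical):

```go
// Match on the underlying driver rather than the connector alias.
if opts.OutputHandle == c && opts.InputHandle.Driver() == "gcs" {
	return &druidIndexExecutor{c: c, opts: opts}, true // hypothetical struct fields
}
```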

db *sqlx.DB
config *configProperties
logger *zap.Logger
instanceID string

This is used, but not assigned anywhere
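
A minimal sketch of the fix, assuming the connection struct is built in the driver's Open implementation (the surrounding variable names are hypothetical):

```go
conn := &connection{
	db:         db,
	config:     conf,
	logger:     logger,
	instanceID: instanceID, // currently never set, so downstream reads get ""
}
```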

Comment on lines 101 to 102
fmt.Println("==>inputProperties", inputProperties)
fmt.Println("==>outputProperties", outputProperties)

Use e.connection.logger
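
A sketch of the suggested replacement, assuming the executor reaches the logger via e.connection as in the comment above:

```go
e.connection.logger.Debug("druid index executor properties",
	zap.Any("input_properties", inputProperties),
	zap.Any("output_properties", outputProperties),
)
```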

Comment on lines +73 to +74
PreviousExecutionTime string `mapstructure:"previous_execution_time"`
PreviousIntervalEndTime string `mapstructure:"previous_interval_end_time"`

Could it use time.Time (it serializes/deserializes naturally in JSON)?
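
A sketch of what that change could look like. Note that encoding/json handles time.Time natively, while mapstructure needs a decode hook such as mapstructure.StringToTimeHookFunc(time.RFC3339) to decode a string into time.Time:

```go
PreviousExecutionTime   time.Time `mapstructure:"previous_execution_time"`
PreviousIntervalEndTime time.Time `mapstructure:"previous_interval_end_time"`
```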

Comment on lines +12 to +20
type ModelInputProperties struct {
Path string `mapstructure:"path"`
Pattern string `mapstructure:"pattern"`
Granularity string `mapstructure:"gran"`
Format string `mapstructure:"format"`
FilePattern string `mapstructure:"file_pattern"`
RetriesPeriod string `mapstructure:"retry_period"`
MaxRetries int `mapstructure:"max_retries"`
}

So these are actually properties for an object store connector, not for the Druid connector. Ideally the name druid.ModelInputProperties would be reserved for input properties for Druid (i.e. models where Druid is the input and something else is the output – obviously we don't support that now, but could make sense in the future for e.g. models that export to S3).

Ideally the input properties for GCS would be defined in gcs.ModelInputProperties (or maybe drivers.ObjectStoreModelInputProperties if shared across multiple object store drivers), but I realize these are quite specific to this Druid driver. So maybe having it be an internal gcsModelInputProperties would make more sense.
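
A sketch of the suggested rename, keeping the object-store properties internal to the druid driver (field names copied from the current struct):

```go
// gcsModelInputProperties holds input properties for models that read from GCS
// into Druid; unexported so druid.ModelInputProperties stays reserved for
// future models where Druid is the input.
type gcsModelInputProperties struct {
	Path          string `mapstructure:"path"`
	Pattern       string `mapstructure:"pattern"`
	Granularity   string `mapstructure:"gran"`
	Format        string `mapstructure:"format"`
	FilePattern   string `mapstructure:"file_pattern"`
	RetriesPeriod string `mapstructure:"retry_period"`
	MaxRetries    int    `mapstructure:"max_retries"`
}
```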

@rohithreddykota (Contributor, Author)

> Some high-level questions:
>
>   1. I didn't go deep on the time manipulation logic here, but it seems like it's basically trying to emulate splits/partitions for incremental ingestion? If yes, it would be better to support that with native splits or something similar – and then the actual input/output properties could just template in values from the split currently being executed. Does that make sense? And do you have any inputs or new discoveries for things to think about here?
>   2. Did you look into using Druid's INSERT or REPLACE SQL commands instead? I know it's less mature, but would be sweet if we could get away with only supporting the SQL interface.

  1. You are right, it is essentially emulating splits for incremental ingestion. Having native splits would definitely make things a lot easier; from my understanding, splits should be able to serve all of these options with a single query.
  2. My initial idea was to add support for both SQL-based ingestion and spec-JSON-based ingestion. The reason I want to add spec-JSON-based ingestion is that it lets me port the existing implementations into Rill Cloud; another reason is the limitations of SQL-based ingestion.
