
druid ingest executor #5234 (Draft)

rohithreddykota wants to merge 4 commits into main
Conversation

rohithreddykota (Contributor)

Summary

This PR introduces a druidIndexExecutor, which executes Druid index tasks using configuration from YAML model files. The executor ingests data from an object store into Druid: it calculates the intervals that need indexing, generates the index spec JSON dynamically, and runs the ingestion based on the model's input and output configurations.

Example:

kind: model

refresh:
  cron: 2 * * * *

connector: gcs
path: gs://hitech.rilldata.com/data-export/etl/bids/monthly/
pattern: 'yyyy=2006/mm=01/dd=02/HH=15'
gran: 1h
format: parquet
file_pattern: '.*\.parquet'
retry_period: 15m
max_retries: 3

incremental: true
output:
  connector: druid
  dataSource: demand_log_qa
  initial_look_back_period: 3h
  period_before: 1h
  quiet_period: 1h
  catchup: false
  max_work: 2h # maximum interval to index at a time
  coordinator_url: https://druid.ws1.hitech.rilldata.com/druid/coordinator/v1/datasources
  datasource_name: demand_log
  spec_json: >
    {
      "type": "index_parallel",
      "spec": {
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "google",
            "prefixes": {{ .prefixes | toJson }},
          },
          "inputFormat": {
            "type": "csv",
            "findColumnsFromHeader": true
          }
        },
        "tuningConfig": {
          "type": "index_parallel",
          "partitionsSpec": {
            "type": "dynamic"
          }
        },
        "dataSchema": {
          "dataSource": "%s",
          "timestampSpec": {
            "column": "timestamp",
            "format": "iso"
          },
          "transformSpec": {},
          "dimensionsSpec": {
            "dimensions": [
              {"type": "long", "name": "id"},
              "publisher",
              "domain",
              {"type": "double", "name": "bid_price"}
            ]
          },
          "granularitySpec": {
            "queryGranularity": "none",
            "rollup": false,
            "segmentGranularity": "day",
            "intervals": {{ .intervals | toJson }}
          }
        }
      }
    }
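
For orientation, a rough sketch of the control flow described above. The `Execute` signature and all helper names here are hypothetical and not the code in this PR:

```go
// Hypothetical sketch of the druidIndexExecutor flow; helper names are invented.
func (e *druidIndexExecutor) Execute(ctx context.Context) error {
	// 1. Decode the model's input (object store) and output (Druid) properties.
	inputProps, outputProps, err := e.decodeProperties() // hypothetical helper
	if err != nil {
		return err
	}

	// 2. Work out which time intervals still need indexing, based on
	//    previous_execution_time / previous_interval_end_time and the
	//    look-back, quiet-period and max_work settings.
	intervals, prefixes, err := e.resolvePendingWork(ctx, inputProps, outputProps) // hypothetical helper
	if err != nil {
		return err
	}

	// 3. Render the spec_json template with the prefixes and intervals for this run.
	spec, err := e.renderSpec(outputProps, prefixes, intervals) // hypothetical helper
	if err != nil {
		return err
	}

	// 4. Submit the native index task to Druid and wait for completion,
	//    retrying within retry_period / max_retries.
	return e.submitAndAwait(ctx, spec) // hypothetical helper
}
```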

@rohithreddykota marked this pull request as ready for review July 10, 2024 06:29
@rohithreddykota marked this pull request as draft July 10, 2024 14:28
@begelundmuller (Contributor) left a comment

Some high-level questions:

  1. I didn't go deep on the time manipulation logic here, but it seems like it's basically trying to emulate splits/partitions for incremental ingestion? If yes, it would be better to support that with native splits or something similar – and then the actual input/output properties could just template in values from the split currently being executed. Does that make sense? And do you have any inputs or new discoveries for things to think about here?
  2. Did you look into using Druid's INSERT or REPLACE SQL commands instead? I know it's less mature, but would be sweet if we could get away with only supporting the SQL interface.

@@ -256,12 +257,19 @@ func (c *connection) AsObjectStore() (drivers.ObjectStore, bool) {

// AsModelExecutor implements drivers.Handle.
func (c *connection) AsModelExecutor(instanceID string, opts *drivers.ModelExecutorOptions) (drivers.ModelExecutor, bool) {
if opts.OutputHandle == c && opts.InputConnector == "gcs" {

Connector names can be aliased (e.g. if you connect with two different service accounts for different buckets). So instead of checking InputConnector it should check InputHandle.Driver() instead.
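
A minimal sketch of the suggested check (the executor construction on the right-hand side is hypothetical):

```go
// Match on the underlying driver rather than the connector alias.
if opts.OutputHandle == c && opts.InputHandle.Driver() == "gcs" {
	return &druidIndexExecutor{c: c, opts: opts}, true // hypothetical struct fields
}
```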

db *sqlx.DB
config *configProperties
logger *zap.Logger
instanceID string

This is used, but not assigned anywhere
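
A minimal sketch of the fix, assuming the connection struct is built in the driver's Open implementation (the surrounding variable names are hypothetical):

```go
conn := &connection{
	db:         db,
	config:     conf,
	logger:     logger,
	instanceID: instanceID, // currently never set, so downstream reads get ""
}
```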

Comment on lines 101 to 102
fmt.Println("==>inputProperties", inputProperties)
fmt.Println("==>outputProperties", outputProperties)

Use e.connection.logger
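
A sketch of the suggested replacement, assuming the executor reaches the logger via e.connection as in the comment above:

```go
e.connection.logger.Debug("druid index executor properties",
	zap.Any("input_properties", inputProperties),
	zap.Any("output_properties", outputProperties),
)
```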

Comment on lines +73 to +74
PreviousExecutionTime string `mapstructure:"previous_execution_time"`
PreviousIntervalEndTime string `mapstructure:"previous_interval_end_time"`

Could it use time.Time (it serializes/deserializes naturally in JSON)?
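
A sketch of what that change could look like. Note that encoding/json handles time.Time natively, while mapstructure needs a decode hook such as mapstructure.StringToTimeHookFunc(time.RFC3339) to decode a string into time.Time:

```go
PreviousExecutionTime   time.Time `mapstructure:"previous_execution_time"`
PreviousIntervalEndTime time.Time `mapstructure:"previous_interval_end_time"`
```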

Comment on lines +12 to +20
type ModelInputProperties struct {
Path string `mapstructure:"path"`
Pattern string `mapstructure:"pattern"`
Granularity string `mapstructure:"gran"`
Format string `mapstructure:"format"`
FilePattern string `mapstructure:"file_pattern"`
RetriesPeriod string `mapstructure:"retry_period"`
MaxRetries int `mapstructure:"max_retries"`
}

So these are actually properties for an object store connector, not for the Druid connector. Ideally the name druid.ModelInputProperties would be reserved for input properties for Druid (i.e. models where Druid is the input and something else is the output – obviously we don't support that now, but could make sense in the future for e.g. models that export to S3).

Ideally the input properties for GCS would be defined in gcs.ModelInputProperties (or maybe drivers.ObjectStoreModelInputProperties if shared across multiple object store drivers), but I realize these are quite specific to this Druid driver. So maybe having it be an internal gcsModelInputProperties would make more sense.
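
A sketch of the suggested rename, keeping the object-store properties internal to the druid driver (field names copied from the current struct):

```go
// gcsModelInputProperties holds input properties for models that read from GCS
// into Druid; unexported so druid.ModelInputProperties stays reserved for
// future models where Druid is the input.
type gcsModelInputProperties struct {
	Path          string `mapstructure:"path"`
	Pattern       string `mapstructure:"pattern"`
	Granularity   string `mapstructure:"gran"`
	Format        string `mapstructure:"format"`
	FilePattern   string `mapstructure:"file_pattern"`
	RetriesPeriod string `mapstructure:"retry_period"`
	MaxRetries    int    `mapstructure:"max_retries"`
}
```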

@rohithreddykota (Contributor, Author)

> Some high-level questions:
>
>   1. I didn't go deep on the time manipulation logic here, but it seems like it's basically trying to emulate splits/partitions for incremental ingestion? If yes, it would be better to support that with native splits or something similar – and then the actual input/output properties could just template in values from the split currently being executed. Does that make sense? And do you have any inputs or new discoveries for things to think about here?
>   2. Did you look into using Druid's INSERT or REPLACE SQL commands instead? I know it's less mature, but would be sweet if we could get away with only supporting the SQL interface.

  1. You are right, it is essentially emulating splits for incremental ingestion. Having native splits would definitely make things a lot easier; from my understanding, splits should be able to serve all of these options with a single query.
  2. My initial idea was to add support for both SQL-based ingestion and spec-JSON-based ingestion. The reason I want to add spec-JSON-based ingestion is that it lets me port the existing implementations into Rill Cloud; another reason is the limitations of SQL-based ingestion.
