Improve connectors/globs for large datasets #1616

Closed
6 of 10 tasks
begelundmuller opened this issue Jan 17, 2023 · 0 comments · Fixed by #1647
Labels
Team:Platform Platform Working Group

begelundmuller commented Jan 17, 2023

  • Add support for ingesting only a subset/sample of data through an extract config option
  • Iteratively ingest files into DuckDB instead of downloading all before ingestion
  • Add a timeout: [seconds integer] config key for sources
    • When the timeout elapses, source ingestion should abort, perform any relevant cleanup, and then return an error
  • Add a hive_partitioning: true config option to parse Hive partitions into columns
    • This also applies to regular (non-glob) connectors
    • This can be achieved for DuckDB just by passing HIVE_PARTITIONING=1 to read_parquet
  • Add docs for new config options
  • Do thorough QA to ensure partial Parquet downloads never lead to degraded performance (versus full download)
  • Address code review(s)
  • With product input, decide on default config and confirm YAML format for partial ingest
  • (Pending) Add support for a latest strategy that lists all files and sorts by last updated timestamp
  • (Pending) Add head files of tail folder support
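
To make the hive_partitioning item concrete, here is a minimal sketch of what parsing Hive partitions into columns means for a single file path. It is an illustration only, not the connector's implementation: the function name parse_hive_partitions and the example path are hypothetical, and in DuckDB the same effect comes from passing HIVE_PARTITIONING=1 to read_parquet as noted above.

```python
def parse_hive_partitions(path: str) -> dict[str, str]:
    """Return the key=value directory segments of a path as column values.

    Hypothetical helper for illustration; values are kept as strings.
    """
    partitions: dict[str, str] = {}
    # Only directory segments are inspected; the file name (last segment) is skipped.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical path with two partition levels.
columns = parse_hive_partitions("s3://bucket/events/year=2023/month=01/part-0.parquet")
```

Each file under such a path would then expose year and month as extra columns alongside the data stored in the file itself.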
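
The iterative-ingestion and timeout items can be sketched together. The following is one possible shape, under stated assumptions: ingest_one and cleanup are hypothetical callbacks supplied by the caller, IngestTimeoutError is a made-up error type, and the deadline is only checked between files, so a single slow file is not interrupted mid-way.

```python
import time
from typing import Callable, Iterable


class IngestTimeoutError(Exception):
    """Hypothetical error returned when a source exceeds its timeout."""


def ingest_with_timeout(
    files: Iterable[str],
    ingest_one: Callable[[str], None],
    cleanup: Callable[[list[str]], None],
    timeout_seconds: float,
) -> list[str]:
    """Ingest files one at a time, aborting once the deadline has passed."""
    deadline = time.monotonic() + timeout_seconds
    ingested: list[str] = []
    for path in files:
        # Check the deadline before starting each file.
        if time.monotonic() >= deadline:
            # Let the caller undo partial work, then surface an error.
            cleanup(ingested)
            raise IngestTimeoutError(
                f"source ingestion exceeded {timeout_seconds}s"
            )
        ingest_one(path)
        ingested.append(path)
    return ingested
```

Processing files one by one also means nothing has to be fully downloaded before ingestion starts, which is the point of the second task in the list.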
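
For the pending latest strategy, a rough sketch of "list all files and sort by last updated timestamp" might look like this. RemoteFile, select_latest, and the limit parameter are assumptions made for illustration; the issue does not specify how many files the strategy would keep or how the listing is obtained.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RemoteFile:
    """Hypothetical listing entry: an object path plus its last-updated time."""
    path: str
    updated_at: datetime


def select_latest(files: list[RemoteFile], limit: int) -> list[RemoteFile]:
    """Return up to `limit` files, most recently updated first."""
    ordered = sorted(files, key=lambda f: f.updated_at, reverse=True)
    return ordered[:limit]
```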
@begelundmuller begelundmuller changed the title Improve glob user experience Improve connectors and globs for large datasets Jan 17, 2023
@begelundmuller begelundmuller changed the title Improve connectors and globs for large datasets Improve connectors/globs for large datasets Jan 17, 2023
@begelundmuller begelundmuller added the Team:Platform Platform Working Group label Jan 17, 2023
@begelundmuller begelundmuller linked a pull request Jan 27, 2023 that will close this issue
2 participants