Data loading: Difference between revisions

From Wikipedia, the free encyclopedia
Ta bu shi da yu (talk | contribs)
Redirected page to Extract, transform, load
Created by translating the page "Datalasting"
'''Data loading''', or simply '''loading''', is a part of [[data processing]] where [[data]] is moved between two systems so that it ends up in a [[Staging (data)|staging area]] on the target system.
#REDIRECT [[Extract, transform, load]]

With the traditional [[extract, transform and load]] (ETL) method, the load job is the last step, and the data that is loaded has already been transformed. With the alternative method [[extract, load and transform]] (ELT), the loading job is the middle step, and the extracted data is loaded in its original format for [[Data transformation (computing)|data transformation]] in the target system.

Traditionally, loading jobs on large systems have taken a long time, and have typically been run at night, outside the company's business hours.

== Purpose ==
The two main goals of data loading are to make fresher data available in the target system, and to load quickly enough that the data can be updated frequently. For a full data refresh, loading can be made faster by turning off [[referential integrity]] checks, secondary indexes and [[Logging (computing)|logging]] during the load, but this is usually not allowed with incremental update or trickle feed.

== Types ==
Data loading can be done by complete update (immediate), by incremental loading and updating (immediate), or by trickle feed (deferred). The choice of technique may depend on the amount of data that is updated, changed or added, and on how up-to-date the data must be. The type of data delivered by the source system, and whether the historical data it delivers can be trusted, are also important factors.

=== Full refresh ===
Full data refresh means that existing data in the target table is deleted first. All data from the source is then loaded into the target table, new indexes are created in the target table, and new [[Measure (data warehouse)|measures]] are calculated for the updated table.

Full refresh is easy to implement, but it involves moving large amounts of data, which can take a long time, and it can make it challenging to keep historical data.<ref name=":0">{{Cite web |date=2022-04-14 |title=Incremental Data Load vs Full Load ETL: 4 Critical Differences - Learn {{!}} Hevo |url=https://hevodata.com/learn/incremental-data-load-vs-full-load/ |access-date=2023-02-18 |language=en-US}}</ref>

=== Incremental update ===
{{Main|Change data capture}}
Incremental update or incremental refresh means that only new or updated data is retrieved from the source system.<ref>{{Cite web |title=Incremental Loading |url=https://help.pyramidanalytics.com/content/root/MainClient/apps/Model/Model%20Pro/Data%20Flow/IncrementalLoading.htm |access-date=2023-02-18}}</ref><ref>{{Cite web |last=Mitchell |first=Tim |date=2020-07-23 |title=The What, Why, When, and How of Incremental Loads |url=https://www.timmitchell.net/post/2020/07/23/incremental-loads/ |access-date=2023-02-18 |language=en-US}}</ref> New records are then added to the existing data in the target system, and existing records that have changed in the source are updated. The indexes and statistics are updated accordingly. Incremental update can make loading faster and make it easier to keep track of history, but it can be demanding to set up and maintain.<ref name=":0" />

=== Trickle feed ===
Trickle feed or trickle loading means that when the source system is updated, the corresponding changes are applied to the target system almost immediately.<ref>{{Cite encyclopedia |last=Zuters |first=Janis |editor-last1=Grabis |editor-first1=Janis |editor-last2=Kirikova |editor-first2=Marite |title=Near Real-Time Data Warehousing with Multi-stage Trickle and Flip |encyclopedia=Perspectives in Business Informatics Research |publisher=Springer Berlin Heidelberg |location=Berlin, Heidelberg |url=http://link.springer.com/10.1007/978-3-642-24511-4_6 |date=2011 |volume=90 |pages=73–82 |doi=10.1007/978-3-642-24511-4_6 |isbn=978-3-642-24510-7}}</ref><ref>{{Cite web |title=Trickle Loading Data |url=https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/TrickleLoading/TrickleLoadingData.htm |access-date=2023-02-18}}</ref>

== Loading to systems that are in use ==
{{main|Real-time computing}}
When loading data into a system that is currently in use by users or other systems, one must decide when the system should be updated and how to handle tables that are being read or written while the update runs. One possible solution is to make use of [[Shadow table|shadow tables]].<ref>{{Cite web |title=Create shadow tables for synchronization - Data Management - Alibaba Cloud Documentation Center |url=https://www.alibabacloud.com/help/en/data-management-service/latest/synchronize-shadow-tables |access-date=2023-02-18}}</ref><ref>{{Cite web |date=2015-08-10 |title=Shadow tables |url=https://www.ibm.com/docs/en/db2/10.5?topic=tables-shadow |access-date=2023-02-18 |language=en-us}}</ref>

== See also ==

* [[Database]]
* [[Data warehouse]]

== References ==

[[Category:Extract, transform, load tools]]
[[Category:Data warehousing]]

Revision as of 12:09, 10 February 2024

Data loading, or simply loading, is a part of data processing where data is moved between two systems so that it ends up in a staging area on the target system.

With the traditional extract, transform and load (ETL) method, the load job is the last step, and the data that is loaded has already been transformed. With the alternative method extract, load and transform (ELT), the loading job is the middle step, and the extracted data is loaded in its original format for data transformation in the target system.
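
The difference in step order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target is an in-memory SQLite database, and `RAW_ROWS`, `to_cents`, `run_etl`, `run_elt` and the table names are hypothetical names chosen for the example, not part of any particular tool.

```python
import sqlite3

RAW_ROWS = [("alice", "12.50"), ("bob", "7.25")]  # hypothetical extract output


def to_cents(amount: str) -> int:
    """Transform step: parse a decimal string into integer cents."""
    return int(round(float(amount) * 100))


def run_etl(conn):
    """ETL: transform in the pipeline, then load the finished rows last."""
    conn.execute("CREATE TABLE sales (customer TEXT, cents INTEGER)")
    transformed = [(name, to_cents(amount)) for name, amount in RAW_ROWS]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", transformed)


def run_elt(conn):
    """ELT: load the raw rows first, then transform inside the target."""
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT customer, "
        "CAST(ROUND(CAST(amount AS REAL) * 100) AS INTEGER) AS cents "
        "FROM raw_sales"
    )


def sales(conn):
    """Return the final table contents in a stable order."""
    return conn.execute(
        "SELECT customer, cents FROM sales ORDER BY customer"
    ).fetchall()
```

Both functions end with the same `sales` table; what differs is whether the loaded rows are already transformed (ETL) or still in their original format (ELT).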

Traditionally, loading jobs on large systems have taken a long time, and have typically been run at night, outside the company's business hours.

Purpose

The two main goals of data loading are to make fresher data available in the target system, and to load quickly enough that the data can be updated frequently. For a full data refresh, loading can be made faster by turning off referential integrity checks, secondary indexes and logging during the load, but this is usually not allowed with incremental update or trickle feed.

Types

Data loading can be done by complete update (immediate), by incremental loading and updating (immediate), or by trickle feed (deferred). The choice of technique may depend on the amount of data that is updated, changed or added, and on how up-to-date the data must be. The type of data delivered by the source system, and whether the historical data it delivers can be trusted, are also important factors.

Full refresh

Full data refresh means that existing data in the target table is deleted first. All data from the source is then loaded into the target table, new indexes are created in the target table, and new measures are calculated for the updated table.

Full refresh is easy to implement, but it involves moving large amounts of data, which can take a long time, and it can make it challenging to keep historical data.[1]
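
A minimal sketch of a full refresh, assuming an SQLite target and hypothetical table names (`orders`, `order_totals`) and index name (`idx_orders_customer`). It follows the sequence described above: delete the existing rows, load everything from the source, recreate the index, and recompute a measure. Dropping the index before the bulk insert mirrors the idea of turning off secondary indexes during a full load.

```python
import sqlite3


def make_target():
    """Create an empty target database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)"
    )
    conn.execute("CREATE TABLE order_totals (customer TEXT, total INTEGER)")
    return conn


def full_refresh(conn, source_rows):
    """Replace the whole target table, then rebuild its index and a measure."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM orders")
        # Drop the secondary index so the bulk insert does not maintain it.
        conn.execute("DROP INDEX IF EXISTS idx_orders_customer")
        conn.executemany(
            "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
            source_rows,
        )
        conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
        # Recompute the derived measure from the freshly loaded rows.
        conn.execute("DELETE FROM order_totals")
        conn.execute(
            "INSERT INTO order_totals "
            "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
        )
```

Calling `full_refresh` twice with different source rows leaves only the second set in the target, which is also why history is hard to keep with this approach.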

Incremental update

Incremental update or incremental refresh means that only new or updated data is retrieved from the source system.[2][3] New records are then added to the existing data in the target system, and existing records that have changed in the source are updated. The indexes and statistics are updated accordingly. Incremental update can make loading faster and make it easier to keep track of history, but it can be demanding to set up and maintain.[1]
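
One common way to find the new or updated rows is a "high-water mark" on a last-modified column. The sketch below assumes that approach; the `customers` table, its `updated_at` column and the function names are hypothetical, and both source and target are SQLite databases for the sake of a runnable example.

```python
import sqlite3


def make_db():
    """Create a database with the hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    return conn


def incremental_load(source, target, last_loaded):
    """Copy only rows changed after `last_loaded`; return the new watermark."""
    changed = source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_loaded,),
    ).fetchall()
    with target:
        # INSERT OR REPLACE adds new ids and overwrites rows that already exist.
        target.executemany(
            "INSERT OR REPLACE INTO customers (id, name, updated_at) VALUES (?, ?, ?)",
            changed,
        )
    # Rows are ordered by updated_at, so the last one carries the new watermark.
    return changed[-1][2] if changed else last_loaded
```

The caller stores the returned watermark and passes it to the next run. Rows deleted in the source are not detected by this scheme, which is one reason incremental loading takes more effort to set up than a full refresh.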

Trickle feed

Trickle feed or trickle loading means that when the source system is updated, the corresponding changes are applied to the target system almost immediately.[4][5]
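
As a rough illustration, a trickle feed can be modelled as a consumer that applies each change to the target as soon as it arrives, committing after every change instead of waiting for a batch. The change format `(op, key, value)`, the `stock` table and the function names below are assumptions made for this sketch; a real feed would read from a message queue or a change log rather than an in-process `queue.Queue`.

```python
import queue
import sqlite3


def apply_change(conn, change):
    """Apply one source change to the target and commit immediately."""
    op, key, value = change
    with conn:
        if op == "delete":
            conn.execute("DELETE FROM stock WHERE sku = ?", (key,))
        else:  # "upsert": insert a new row or overwrite the existing one
            conn.execute(
                "INSERT OR REPLACE INTO stock (sku, qty) VALUES (?, ?)",
                (key, value),
            )


def drain(conn, changes):
    """Apply every change currently queued and return how many were applied.

    A long-running feed would block on changes.get() in a loop instead.
    """
    applied = 0
    while True:
        try:
            change = changes.get_nowait()
        except queue.Empty:
            return applied
        apply_change(conn, change)
        applied += 1
```

Because each change is committed on its own, the target lags the source only by the time it takes to deliver and apply one change.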

Loading to systems that are in use

When loading data into a system that is currently in use by users or other systems, one must decide when the system should be updated and how to handle tables that are being read or written while the update runs. One possible solution is to make use of shadow tables.[6][7]
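
The term "shadow table" is used differently by different vendors. The sketch below assumes one generic interpretation, a load-then-swap pattern: the new data is loaded into a separate copy of the table while readers keep using the live one, and the copy is then swapped in with a short rename transaction. The table names and functions are hypothetical, and SQLite stands in for the target system.

```python
import sqlite3


def open_target():
    """Open a target in autocommit mode so transactions are explicit below."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE prices (item TEXT PRIMARY KEY, price INTEGER)")
    return conn


def load_via_shadow(conn, rows):
    """Fill a shadow copy of `prices`, then swap it in with two renames."""
    conn.execute("DROP TABLE IF EXISTS prices_shadow")
    conn.execute(
        "CREATE TABLE prices_shadow (item TEXT PRIMARY KEY, price INTEGER)"
    )

    # Phase 1: the slow part. The live `prices` table is not modified here.
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO prices_shadow VALUES (?, ?)", rows)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

    # Phase 2: the fast part. Both renames succeed or fail together.
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE prices RENAME TO prices_old")
        conn.execute("ALTER TABLE prices_shadow RENAME TO prices")
        conn.execute("DROP TABLE prices_old")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The design choice is to keep the time during which the live table is unavailable as short as the rename itself, independent of how long the load takes.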

See also

  * Database
  * Data warehouse

References

  1. ^ a b "Incremental Data Load vs Full Load ETL: 4 Critical Differences - Learn | Hevo". 2022-04-14. Retrieved 2023-02-18.
  2. ^ "Incremental Loading". Retrieved 2023-02-18.
  3. ^ Mitchell, Tim (2020-07-23). "The What, Why, When, and How of Incremental Loads". Retrieved 2023-02-18.
  4. ^ Zuters, Janis (2011). "Near Real-Time Data Warehousing with Multi-stage Trickle and Flip". In Grabis, Janis; Kirikova, Marite (eds.). Perspectives in Business Informatics Research. Vol. 90. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 73–82. doi:10.1007/978-3-642-24511-4_6. ISBN 978-3-642-24510-7.
  5. ^ "Trickle Loading Data". Retrieved 2023-02-18.
  6. ^ "Create shadow tables for synchronization - Data Management - Alibaba Cloud Documentation Center". Retrieved 2023-02-18.
  7. ^ "Shadow tables". 2015-08-10. Retrieved 2023-02-18.