Domain 2.0: Data Mining
31. Data integration combines business and technical processes for collating
data from different sources into valuable and meaningful datasets.
32. Extract, transform, load (ETL) enables data engineers to extract data
from multiple source systems, transform the raw data into a more
usable/workable dataset, and finally load the data into a storage system so
end users can access meaningful data in reports or dashboards.
33. Extract, load, transform (ELT) enables data engineers to extract the data
from data sources, load it into the target datastore, and transform it as the queries
are executed to get insights in reports or dashboards.
34. Delta loading refers to the process of extracting the delta, or difference in
the data compared to what was previously extracted as part of the ETL
process.
37. Web scraping, also known as web data extraction or web harvesting, is a
method used to extract data from websites.
42. Data merging simplifies data analysis by merging multiple datasets into
one larger dataset.
43. Data blending brings together data from multiple sources that may be
very dissimilar.
44. Duplicate data can lead to multiple entities with the same data values
being created in the database/warehouse.
48. Data redundancy occurs when the same datasets are stored in multiple
data sources.
51. Many data functions are available to help collate or get focused insights
from data. Some examples are aggregate functions, logical functions,
sorting, and filtering.
52. Missing data is one of the key issues with data accuracy and consistency.
55. Invalid data refers to values that were initially generated inaccurately.
56. Non-parametric data is data that does not fit a well-defined or well-stated
distribution.
57. Data type validation ensures that data has the correct data type before it
is leveraged at the destination system.
58. An execution plan works behind the scenes to ensure that a query gets
all the needed resources and is executed; it outlines the steps for execution
of the query from start through output.
61. A B-tree is formed of nodes where the tree starts at a root that has no
parent node and the other nodes in the tree each have one parent node,
which might or might not have child nodes.
62. A clustered index determines the order in which records in the table are
physically stored, whereas a non-clustered index keeps the index in one place
and the records in another, with the index acting like a pointer to the data.
63. Temporary tables offer workspace for transitional results when processing
data.
64. There are two types of temporary tables that you can create in Microsoft
SQL Server: global and local.
66. Data subsetting can be performed by using two methods: data sharding
and data partitioning. Data sharding involves creating logical horizontal
partitions in a database to quickly access the data of interest. Partitioning
involves creating logical vertical partitions in a database.
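
Key point 34 describes delta loading as extracting only what changed since the previous extraction. The following is a minimal sketch of that idea, assuming each source row carries a change timestamp. The column name updated_at, the sample rows, and the function extract_delta are hypothetical and used only for illustration.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
SOURCE_ROWS = [
    {"id": 1, "name": "alpha", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "beta", "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "name": "gamma", "updated_at": datetime(2024, 1, 9)},
]


def extract_delta(rows, last_extracted_at):
    """Return only the rows changed after the previous extraction."""
    return [row for row in rows if row["updated_at"] > last_extracted_at]


# The watermark records the point up to which the previous run extracted data.
watermark = datetime(2024, 1, 4)
delta = extract_delta(SOURCE_ROWS, watermark)
delta_ids = [row["id"] for row in delta]
print(delta_ids)

# Advance the watermark so the next run skips what was just extracted.
new_watermark = max(row["updated_at"] for row in delta)
```

A timestamp watermark is only one way to detect the delta; other change-tracking approaches would fit the same definition.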
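
Key points 42 and 44 relate data merging to the risk of duplicate data. The sketch below merges two small datasets into one larger dataset and then removes exact duplicate rows. The sample data and the helper names merge_datasets and drop_duplicates are assumptions made for this example, and it treats only fully identical rows as duplicates.

```python
# Two hypothetical datasets that share the same columns: (id, email).
customers_a = [(1, "ann@example.com"), (2, "bob@example.com")]
customers_b = [(2, "bob@example.com"), (3, "cy@example.com")]


def merge_datasets(*datasets):
    """Merge several datasets into one larger dataset by appending rows."""
    merged_rows = []
    for dataset in datasets:
        merged_rows.extend(dataset)
    return merged_rows


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique_rows.append(row)
    return unique_rows


merged = merge_datasets(customers_a, customers_b)
deduplicated = drop_duplicates(merged)
print(len(merged), len(deduplicated))
```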
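
Key point 57 says data type validation checks that data has the correct type before the destination system uses it. One possible sketch, assuming records arrive as dictionaries and the expected types are known in advance, is shown below. The SCHEMA mapping and the validate_types function are hypothetical names.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "price": float, "sku": str}


def validate_types(record, schema):
    """Return the names of fields that are missing or have the wrong type."""
    return [
        field
        for field, expected_type in schema.items()
        if not isinstance(record.get(field), expected_type)
    ]


good_record = {"id": 7, "price": 9.99, "sku": "A-100"}
bad_record = {"id": "7", "price": 9.99}  # id is a string, sku is missing

good_errors = validate_types(good_record, SCHEMA)
bad_errors = validate_types(bad_record, SCHEMA)
print(good_errors, bad_errors)
```

A record would be loaded into the destination only when the returned list is empty.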
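
Key point 66 contrasts horizontal and vertical subsetting. Using the definitions as stated there, a horizontal split keeps all columns but groups the rows, while a vertical split keeps all rows but only some columns. The sketch below shows both on an in-memory table. The orders data and the function names are hypothetical, and a real database would do this with its own sharding or partitioning features.

```python
# Hypothetical table: each dict is one row.
orders = [
    {"order_id": 1, "region": "east", "amount": 20.0, "notes": "gift"},
    {"order_id": 2, "region": "west", "amount": 35.5, "notes": ""},
    {"order_id": 3, "region": "east", "amount": 12.0, "notes": "rush"},
]


def horizontal_subset(rows, key):
    """Group rows by the value of `key`; every group keeps all columns."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


def vertical_subset(rows, columns):
    """Keep every row but only the named columns."""
    return [{column: row[column] for column in columns} for row in rows]


shards = horizontal_subset(orders, "region")
slim = vertical_subset(orders, ["order_id", "amount"])
print(sorted(shards), slim[0])
```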