Domain 2.0: Data Mining
31. Data integration combines business and technical processes for collating
data from different sources into valuable and meaningful datasets.

32. Extract, transform, load (ETL) enables data engineers to extract data
from multiple source systems, transform the raw data into a more
usable/workable dataset, and finally load the data into a storage system so
end users can access meaningful data in reports or dashboards.
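
A minimal Python sketch of the three ETL stages, assuming a hypothetical sales.csv source file and a local SQLite file as the target datastore (all file, table, and column names here are illustrative):

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a hypothetical CSV source.
raw = pd.read_csv("sales.csv")

# Transform: clean column names, drop incomplete rows, add a derived total.
raw.columns = [c.strip().lower() for c in raw.columns]
clean = raw.dropna(subset=["order_id", "quantity", "unit_price"])
clean = clean.assign(total=clean["quantity"] * clean["unit_price"])

# Load: write the transformed dataset into the target datastore.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_clean", conn, if_exists="replace", index=False)
```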

33. Extract, load, transform (ELT) enables data engineers to extract the data
from data sources, load it into the target datastore, and transform it as
queries are executed to get insights in reports or dashboards.

34. Delta loading refers to the process of extracting the delta, or difference,
between the current data and the data that was previously extracted as part of
the ETL process.
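
A hedged sketch of a delta extract, assuming the source table carries a last_modified timestamp column and the pipeline tracks a watermark from the previous run (table and column names are assumptions):

```python
import sqlite3
import pandas as pd

last_extract = "2024-01-01 00:00:00"  # assumed watermark from the previous run

with sqlite3.connect("source.db") as conn:
    # Only pull rows changed since the previous extraction (the "delta").
    delta = pd.read_sql_query(
        "SELECT * FROM orders WHERE last_modified > ?",
        conn,
        params=(last_extract,),
    )

# The delta rows are then transformed and loaded like any other ETL batch.
print(f"{len(delta)} changed rows to process")
```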

35. An application programming interface (API) provides a programmable
interface for interacting with applications and infrastructure and acts as a
middleware integration layer.

36. APIs enable organizations to selectively share their applications, in terms
of data and functionality, with internal stakeholders (developers and users) as
well as external stakeholders, such as business partners, third-party
developers, and vendors.
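
A minimal sketch of pulling data from a REST API with Python's requests library; the endpoint URL, token, and query parameters below are hypothetical:

```python
import requests

# Hypothetical endpoint; real APIs document their own URLs and auth schemes.
url = "https://api.example.com/v1/customers"
headers = {"Authorization": "Bearer <token>"}

response = requests.get(url, headers=headers, params={"region": "EMEA"}, timeout=30)
response.raise_for_status()          # fail loudly on HTTP errors

customers = response.json()          # most JSON APIs return a list or dict of records
```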

37. Web scraping, also known as web data extraction or web harvesting, is a
method used to extract data from websites.
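
A minimal scraping sketch using requests and BeautifulSoup; the page URL and the .price selector are assumptions, and a real site's terms of service and robots.txt should be checked first:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical product listing page.
html = requests.get("https://example.com/products", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Extract the text of elements assumed to carry a "price" CSS class.
prices = [tag.get_text(strip=True) for tag in soup.select(".price")]
```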

38. Surveys are commonly used to collect data from respondents.

39. Sampling is the process of collecting data from a subdivision/subset of a
given population to get insights that represent the whole population.
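
A minimal pandas sketch of simple random sampling, assuming a hypothetical customers.csv holds the full population:

```python
import pandas as pd

population = pd.read_csv("customers.csv")  # assumed full population

# Random 10% sample of the rows; a fixed seed keeps the sample reproducible.
sample = population.sample(frac=0.10, random_state=42)
```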

40. A derived variable is defined by a parameter or an expression related to
existing variables in a dataset.

41. The process of recoding a variable can be used to transform a current
variable into a different one, based on certain criteria and business
requirements.
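
A short pandas sketch of both ideas on a small made-up dataset: a derived variable computed from existing columns, and an age variable recoded into categories:

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 58], "income": [40000, 72000, 65000]})

# Derived variable: defined by an expression over existing variables.
df["income_per_year_of_age"] = df["income"] / df["age"]

# Recoding: transform the existing variable into a new categorical one.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])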

42. Data merging simplifies data analysis by merging multiple datasets into
one larger dataset.
43. Data blending brings together data from multiple sources that may be
very dissimilar.
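
A minimal pandas sketch of data merging, joining two small made-up datasets on a shared key; blending tools combine similarly dissimilar sources at analysis time rather than physically merging them:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2], "amount": [250, 90]})
crm = pd.DataFrame({"customer_id": [1, 2], "segment": ["enterprise", "retail"]})

# Merge the two datasets on the shared key into one larger dataset.
merged = orders.merge(crm, on="customer_id", how="left")
```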

44. Duplicate data results in multiple entries with the same data values
existing in the database/warehouse.

45. Data appending refers to adding new data elements to an existing
dataset/database.

46. Imputation is helpful for filling in missing values. It can be based on
logical rules, on related observations, on carrying the last observation
forward, or on creating new variable categories.
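
A short pandas sketch of two of these approaches on a made-up series: last observation carried forward, and a simple rule-based fill with the column mean:

```python
import pandas as pd

df = pd.DataFrame({"temp": [21.0, None, 23.5, None, 24.0]})

# Last observation carried forward (LOCF).
df["temp_locf"] = df["temp"].ffill()

# Rule-based imputation: fill missing values with the column mean.
df["temp_mean"] = df["temp"].fillna(df["temp"].mean())
```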

47. Data reduction is a data manipulation technique that is used to minimize
the size of a dataset by aggregating, clustering, or removing any redundant
features.

48. Data redundancy occurs when the same datasets are stored in multiple
data sources.

49. Data manipulation is an important step for business operations and
optimization when working with data and analysis. Data analysts and engineers
manipulate data so that analysis can be performed on cleansed, focused, and
more accurate datasets.

50. Normalization is aimed at removing redundant information from a
database and ensuring that only related data is stored in a table.
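
A small pandas sketch of the idea, splitting a made-up denormalized orders table so that customer details are stored only once:

```python
import pandas as pd

# Denormalized table: customer details repeat on every order row.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 11],
    "customer_name": ["Acme", "Acme", "Globex"],
    "amount": [250, 90, 400],
})

# Split into two related tables so customer data is stored only once.
customers = orders[["customer_id", "customer_name"]].drop_duplicates()
orders_normalized = orders[["order_id", "customer_id", "amount"]]
```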

51. Many data functions are available to help collate or get focused insights
from data. Some examples are aggregate functions, logical functions,
sorting, and filtering.
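
A short pandas sketch of filtering, an aggregate function, and sorting on a made-up sales table:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "amount": [100, 250, 175, 90],
})

high_value = sales[sales["amount"] > 100]         # filtering
totals = sales.groupby("region")["amount"].sum()  # aggregate function
ranked = totals.sort_values(ascending=False)      # sorting
```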

52. Missing data is one of the key issues with data accuracy and consistency.

53. Specification mismatch is caused by data at the source being a mismatch
for data at the destination due to unrecognized symbols, bad data entry,
invalid calculations, or mismatching of units/labels.

54. A data outlier in a dataset is an observation that is inconsistent or very
dissimilar to the remaining information.
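
A short pandas sketch that flags outliers with the common 1.5 × IQR rule of thumb on a made-up series:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is clearly dissimilar

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Flag points more than 1.5 * IQR outside the quartiles.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```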

55. Invalid data refers to values that were initially generated inaccurately.

56. Non-parametric data is data that does not fit a well-defined or well-stated
distribution.
57. Data type validation ensures that data has the correct data type before it
is leveraged at the destination system.

58. An execution plan works behind the scenes to ensure that a query gets
all the needed resources and is executed; it outlines the steps for execution
of the query from start through output.

59. A parameterized query makes it possible to use placeholders for
parameters, where the parameter values are supplied at execution time.
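
A minimal sqlite3 sketch, assuming a hypothetical orders table; the ? placeholder is bound to a value at execution time, which also protects against SQL injection:

```python
import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    # The ? placeholder is filled with the parameter value when the query runs.
    rows = conn.execute(
        "SELECT order_id, amount FROM orders WHERE region = ?",
        ("EMEA",),
    ).fetchall()
```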

60. Indexing speeds up query execution by letting the database engine find
records quickly instead of performing full table scans; a covering index can
even deliver all the columns a query requests directly from the index.
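
A hedged SQLite sketch, assuming a hypothetical orders table in warehouse.db; EXPLAIN QUERY PLAN shows whether the optimizer uses the index instead of a full table scan (compare item 58 on execution plans):

```python
import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    # An index on the filtered column lets the engine avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_orders_region ON orders (region)")

    # Inspect the execution plan to confirm the index is actually used.
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE region = ?", ("EMEA",)
    ).fetchall()
    print(plan)
```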

61. A B-tree is formed of nodes: the tree starts at a root node, which has no
parent, and every other node in the tree has exactly one parent node; any node
may or may not have child nodes.

62. A clustered index sorts the way records in the table are physically stored,
whereas a non-clustered index keeps the index in one place and the records in
another, with the index entries acting as pointers to the data.

63. Temporary tables offer workspace for transitional results when processing
data.

64. There are two types of temporary tables that you can create in Microsoft
SQL Server: global and local.
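
A sketch of the idea using SQLite from Python, where a TEMP table is scoped to the connection; in Microsoft SQL Server the local and global variants are created with the # and ## name prefixes instead (the orders table here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# TEMP tables hold transitional results and exist only for this connection.
# (In SQL Server, #staging would be a local and ##staging a global temp table.)
conn.execute("CREATE TEMP TABLE staging AS SELECT * FROM orders WHERE amount > 100")
count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()

conn.close()  # the temporary table is dropped when the connection closes
```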

65. A subset is a smaller set of data from a larger database or a data
warehouse that allows you to focus on only the relevant information.

66. Data subsetting can be performed by using two methods: data sharding
and data partitioning. Data sharding involves creating logical horizontal
partitions in a database to quickly access the data of interest, while
partitioning involves creating logical vertical partitions in a database.
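
A small pandas sketch of the two styles of subsetting on a made-up table: a horizontal (row-wise) subset, analogous to sharding, and a vertical (column-wise) subset, analogous to partitioning:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region": ["East", "West", "East", "West"],
    "amount": [250, 90, 400, 130],
})

# Horizontal subset: keep only the rows of interest.
east_rows = df[df["region"] == "East"]

# Vertical subset: keep only the columns of interest.
id_and_amount = df[["customer_id", "amount"]]
```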
