Data class notes
Common methods for sharing data include:
File transfer protocols: Use FTP or SFTP to share data files with internal and
external users.
Cloud files: Use scalable file stores from Google, Amazon, or Microsoft to share
files.
Web APIs: Use an API to share data streams in near real time.
Event-based services: Use event management systems to provide immediate access to
data.
Downloads: Share data using web downloads, which can be hyperlinked in page text or
shared as URLs in emails.
Databases: Use a higher-level interface, such as SQL access, to share data.
Data exchange platforms: Use a data exchange platform to share data with a selected
group of accounts.
Blockchain technology: Use a blockchain to record data-sharing transactions, which
adds a layer of security and integrity to data-sharing processes.
Federated learning: Use federated learning to let AI systems train on distributed
datasets from diverse sources without moving the raw data to a central location.
Data sharing can be done within or outside an organization, and can be public or
private. The type of data sharing depends on the nature of the data and the purpose
of the sharing.
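As an illustration of the "Databases" method above, the following is a minimal
sketch of sharing data through SQL access while keeping some columns private. It
uses Python's built-in sqlite3 module; the table name customers, the view name
shared_customers, and the sample rows are hypothetical and are not taken from these
notes.

```python
import sqlite3


def build_shared_view(conn):
    """Create a private table and a view that exposes only shareable columns.

    The table, view, and sample rows are hypothetical.
    """
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, region TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (name, email, region) VALUES (?, ?, ?)",
        [
            ("Ada", "ada@example.com", "EU"),
            ("Grace", "grace@example.com", "US"),
        ],
    )
    # The view leaves out name and email, so consumers querying it
    # see only the id and region columns.
    conn.execute(
        "CREATE VIEW shared_customers AS SELECT id, region FROM customers"
    )
    conn.commit()


def read_shared(conn):
    """Return the rows a data consumer would see through the shared view."""
    cursor = conn.execute("SELECT id, region FROM shared_customers ORDER BY id")
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_shared_view(connection)
    print(read_shared(connection))
```

In a real deployment the view would be combined with database permissions so that
consumers can query the view but not the underlying table. SQLite has no user
accounts, so that part is not shown here.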
Data preparation is the process of transforming raw data into a form that can be
used for analysis and machine learning. It involves several steps, including:
Gathering data: Finding the right data to use, either from an existing data catalog
or by adding new sources.
Assessing data: Getting to know the data and understanding what needs to be done to
make it useful.
Cleaning and validating data: Removing faulty data, filling in gaps, and fixing
mistakes.
Transforming and enriching data: Updating the format or value entries, or adding
related information.
Storing data: Saving the prepared data or sending it to a third-party application.
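The steps above can be sketched as a small pipeline. This is only an illustration
using Python's standard library. The sample CSV, the column names (id, city,
temp_c), and the cleaning choices (drop rows with no id or an unparseable
temperature, fill a missing city with "unknown") are assumptions made for the
example.

```python
import csv
import io
import json

# Hypothetical raw data with typical problems: gaps and a bad number.
RAW_CSV = """id,city,temp_c
1,paris,21.5
2,,19.0
,berlin,18.2
3,ROME,not_a_number
"""


def gather(text):
    """Gathering: read the raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def assess(rows):
    """Assessing: count empty values per column to see what needs fixing."""
    missing = {}
    for row in rows:
        for key, value in row.items():
            if value == "":
                missing[key] = missing.get(key, 0) + 1
    return missing


def clean(rows):
    """Cleaning: drop faulty rows and fill in gaps."""
    cleaned = []
    for row in rows:
        if not row["id"]:
            continue  # a row without an id cannot be used
        try:
            temp = float(row["temp_c"])
        except ValueError:
            continue  # the temperature is not a number
        city = row["city"] or "unknown"  # fill the gap with a placeholder
        cleaned.append({"id": int(row["id"]), "city": city, "temp_c": temp})
    return cleaned


def transform(rows):
    """Transforming and enriching: normalize casing, add a derived column."""
    result = []
    for row in rows:
        temp_f = round(row["temp_c"] * 9 / 5 + 32, 1)
        result.append({**row, "city": row["city"].title(), "temp_f": temp_f})
    return result


def store(rows):
    """Storing: serialize to JSON (a real pipeline would write it somewhere)."""
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    raw = gather(RAW_CSV)
    print("missing values per column:", assess(raw))
    print(store(transform(clean(raw))))
```

Keeping each step in its own function makes it easier to test the steps separately
and to swap one out, for example replacing store with a database write.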
Data preparation can be a lengthy process, but it's essential to ensure that data
is accurate and relevant before it's used for analysis. Some key practices to keep
in mind include:
Using a common format for storing and organizing data, such as CSV, JSON, or XML.
Centralizing data storage in a data warehouse, data lake, or cloud storage.
Defining clear objectives and key metrics to help prioritize efforts.
Using validation techniques, such as checksums, rules, and tests, to ensure data is
correct.
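The last practice can be illustrated with a short sketch that pairs a checksum
with rule-based checks. It uses Python's hashlib module for a SHA-256 checksum. The
record fields (id, city, temp_c) and the specific rules, including the temperature
range, are hypothetical examples rather than rules from these notes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_checksum(data: bytes, expected_hex: str) -> bool:
    """Check that the data matches the checksum published alongside it."""
    return sha256_hex(data) == expected_hex


def validate_record(record: dict) -> list:
    """Apply simple rules to one record and return a list of problems.

    An empty list means the record passed every rule. The rules below
    are hypothetical examples.
    """
    errors = []
    record_id = record.get("id")
    if not isinstance(record_id, int) or record_id <= 0:
        errors.append("id must be a positive integer")
    if not record.get("city"):
        errors.append("city must not be empty")
    temp = record.get("temp_c")
    if not isinstance(temp, (int, float)) or not -90 <= temp <= 60:
        errors.append("temp_c must be a number between -90 and 60")
    return errors


if __name__ == "__main__":
    payload = b"id,city,temp_c\n1,Paris,21.5\n"
    published = sha256_hex(payload)  # normally supplied by the data provider
    print("checksum ok:", verify_checksum(payload, published))
    print(validate_record({"id": 1, "city": "Paris", "temp_c": 21.5}))
    print(validate_record({"id": 0, "city": "", "temp_c": 500}))
```

The checksum detects whether a file changed in transit or storage, while the rules
catch values that arrived intact but are implausible. The two checks complement
each other.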