The document provides an overview of data collection and acquisition, defining key terms such as data, information, and datasets, and discussing the characteristics of big data through the 6 V's: volume, velocity, variety, veracity, value, and variability. It also covers types of data, methods for gathering machine learning datasets, data visualization tools, and the importance of data cleaning along with various techniques for ensuring data quality. Overall, it emphasizes the significance of accurate and reliable data for informed decision-making and effective analysis.

Lecture 2

By Japhet Moise H.
Data Collection and Acquisition

• Description of key terms


1. Data: Data is a collection of facts gathered through observation,
measurement, research, or analysis. It may consist of numbers, names,
figures, or descriptions of things, and is often organized in the form of
graphs, charts, or tables. Simply put, data refers to raw, unprocessed facts.
2. Information: The meaning assigned to data within some context for the
use of that data.
3. Dataset: A collection of data taken from a single source or intended for
a single project.
4. Data warehouse: A system that stores and analyzes data from multiple
sources.
5. Big data: Large and diverse datasets that are huge in volume and grow
rapidly in size over time.
Identification of Sources of Data

1. IoT sensors
2. Cameras
3. Computers
4. Smartphones
5. Social data
6. Transactional data
Description of 6 V's of Big Data
• The 6 V's of Big Data are a framework used to characterize the key
challenges and opportunities associated with large-scale data sets.
• 1. Volume: This refers to the sheer amount of data generated. Big
data sets are typically massive in size, often exceeding terabytes or
even petabytes.
• 2. Velocity: This refers to the speed at which data is generated
and processed. Big data often arrives at a rapid pace, requiring real-
time or near-real-time analysis.
• 3. Variety: This refers to the diversity of data types. Big data can
include structured data (like databases), semi-structured data (like
XML or JSON), and unstructured data (like text, images, and videos).
• 4. Veracity: This refers to the quality and accuracy of the
data. Big data sets can often contain errors, inconsistencies, or
biases that need to be addressed before analysis.
• 5. Value: This refers to the potential benefits that can be
derived from analyzing the data. Big data can provide valuable
insights into business operations, customer behavior, and
market trends.
• 6. Variability: This refers to the dynamic nature of the data
flow: the rate, format, and meaning of incoming data can change
and be inconsistent over time, which complicates processing.
Description of Types of data
• Structured data: Organized in a predefined format (e.g.,
relational databases, spreadsheets).
• Unstructured data: Not organized in a predefined format (e.g.,
text, images, audio).
• Semi-structured data: Partially structured (e.g., XML, JSON).
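The distinction above can be seen in code: a minimal sketch using only the Python standard library, with invented sample records. CSV rows follow one fixed schema (structured), while a JSON document carries its own, possibly nested, fields (semi-structured).

```python
import csv
import io
import json

# Structured data: every row follows the same fixed schema (name, age).
csv_text = "name,age\nAda,36\nAlan,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: JSON embeds its own field names, which may be
# nested or optional from record to record.
json_text = '{"name": "Ada", "skills": ["math", "logic"], "contact": {"city": "London"}}'
record = json.loads(json_text)

print(rows[0]["name"])            # access via the fixed schema -> Ada
print(record["contact"]["city"])  # access via nested keys -> London
```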
Gathering Machine Learning Datasets
• Web scraping: Extracting data from websites using automated
tools.
• APIs: Interacting with APIs to retrieve data from online services.
• Surveys and questionnaires: Gathering data directly from
individuals.
• Sensor data: Collecting data from physical sensors.
• Data purchases: Acquiring data from commercial data providers.
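As one sketch of the web-scraping approach, the standard-library `HTMLParser` below extracts table cells from an HTML page. In practice the HTML would come from an HTTP request (e.g. with `urllib` or `requests`); the hard-coded snippet and its values are invented stand-ins for a downloaded page.

```python
from html.parser import HTMLParser

# Minimal scraper: collect the text of every <td> cell in a page.
class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# Stand-in for HTML fetched over HTTP.
html = "<table><tr><td>Kigali</td><td>1.1M</td></tr></table>"
parser = CellCollector()
parser.feed(html)
print(parser.cells)  # ['Kigali', '1.1M']
```

Real sites require attention to robots.txt, rate limits, and terms of service before scraping at scale.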
Data Visualization tools
1. Tableau: A powerful and user-friendly tool for creating interactive
dashboards and visualizations.
2. Power BI: Microsoft's business intelligence tool with strong
integration with Office products.
3. Qlik Sense: Offers associative exploration, enabling users to
discover relationships between data points.
4. Plotly: A Python library for creating interactive plots and graphs.
5. Matplotlib: A Python library for creating static plots and graphs.
6. Seaborn: A Python library built on top of Matplotlib, offering a
higher-level interface for creating attractive statistical visualizations.
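A minimal Matplotlib sketch of the kind of chart these tools produce; the month labels and record counts are invented, and the `Agg` backend renders off-screen so no display is needed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical monthly record counts for a data collection effort.
months = ["Jan", "Feb", "Mar"]
counts = [120, 95, 140]

fig, ax = plt.subplots()
ax.bar(months, counts)
ax.set_xlabel("Month")
ax.set_ylabel("Records collected")
ax.set_title("Data collection volume")
fig.savefig("volume.png")
```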
Description of Characteristics of Quality Data
1. Accuracy: The degree to which information correctly reflects an
event, location, person, or other entity.
2. Completeness: Data is considered "complete" when it fulfills
expectations of comprehensiveness.
3. Consistency: At many companies, the same information may be
stored in more than one place. If that information matches, it is
considered "consistent."
4. Timeliness: Information is "timely" when it is available right
when it is needed.
5. Validity: Information that conforms to a specific format or
follows business rules. To meet this dimension, you must confirm
that all of your information follows the required format or rules.
6. Uniqueness: "Unique" information means that there is only one
instance of it in a database.
7. Relevance: The extent to which data is useful and meaningful
for a specific purpose.
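Some of these dimensions can be measured directly. The sketch below computes completeness (share of fully populated records) and uniqueness (share of distinct values) for a toy record set; the records themselves are invented for illustration.

```python
# Toy records with one gap and one duplicate value (invented data).
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # incomplete record
    {"id": 3, "email": "a@example.com"},  # email not unique
]

# Completeness: fraction of records with every field populated.
complete = sum(1 for r in records if all(v is not None for v in r.values()))
completeness = complete / len(records)   # 2/3

# Uniqueness: fraction of distinct emails among the populated ones.
emails = [r["email"] for r in records if r["email"] is not None]
uniqueness = len(set(emails)) / len(emails)  # 0.5

print(completeness, uniqueness)
```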
The importance of data cleaning
1. Accuracy and Reliability: Ensures the data you work with is correct
and dependable, which is crucial for making informed decisions.
2. Better Decision-Making: Clean data leads to more accurate insights,
helping organizations make better strategic choices.
3. Efficiency: Streamlines the data analysis process by removing
irrelevant or redundant information, making datasets easier to manage.
4. Enhanced Data Quality: Maintains high data quality, making the data
more useful and valuable for analysis.
5. Compliance and Risk Management: Helps organizations comply with
regulations and manage risks by ensuring data is handled properly.
6. Cost Savings: Prevents errors and reduces costs associated with
correcting mistakes and dealing with poor data quality.
Data cleaning Techniques
1. Removing Duplicates: Identifying and eliminating duplicate
records to prevent redundancy and ensure each entry is unique.
2. Handling Missing Values: Addressing missing data by either
filling in the gaps with appropriate values (imputation) or removing
incomplete records, depending on the context.
3. Standardizing Data: Ensuring consistency in data formats, such
as dates, addresses, and names, to make the data uniform and
easier to analyze.
4. Correcting Errors: Identifying and fixing errors in the data, such
as typos, incorrect values, or inconsistencies.
5. Validating Data: Checking data against predefined rules or
criteria to ensure it meets the required standards and is within
acceptable ranges.
6. Filtering Outliers: Identifying and handling outliers that may
skew the analysis, either by removing them or adjusting their
values.
7. Normalization: Transforming data onto a common scale without
distorting differences in the ranges of values, which is particularly
useful for numerical data.
8. Data Enrichment: Enhancing the dataset by adding relevant
information from external sources to provide more context and
improve analysis.
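Several of these techniques can be combined in a few lines of pandas; the sketch below standardizes formats, removes duplicates, imputes missing values, and applies min-max normalization. The records (and the "Unknown" fill value) are invented for illustration.

```python
import pandas as pd

# Raw records with a duplicate, a missing value, and inconsistent
# formatting (invented data).
df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Eve"],
    "city": ["Kigali", "Kigali", None, "Huye"],
    "score": [50.0, 50.0, 80.0, 100.0],
})

df["name"] = df["name"].str.strip().str.title()  # standardize formats
df = df.drop_duplicates()                        # remove duplicates
df["city"] = df["city"].fillna("Unknown")        # impute missing values

# Min-max normalization: rescale scores onto the [0, 1] range.
lo, hi = df["score"].min(), df["score"].max()
df["score_norm"] = (df["score"] - lo) / (hi - lo)
print(df)
```

Note the order matters: standardizing "alice " to "Alice" first is what lets `drop_duplicates` recognize the duplicate row at all.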
Thank you!!!!
