Lecture 2
By Japhet Moise H.
Data Collection and Acquisition
1. IoT sensors
2. Cameras
3. Computers
4. Smartphones
5. Social data
6. Transactional data
Description of the 6 V's of Big Data
• The 6 V's of Big Data are a framework used to characterize the key
challenges and opportunities associated with large-scale data sets.
• 1. Volume: This refers to the sheer amount of data generated. Big
data sets are typically massive in size, often exceeding terabytes or
even petabytes.
• 2. Velocity: This refers to the speed at which data is generated
and processed. Big data often arrives at a rapid pace, requiring real-
time or near-real-time analysis.
• 3. Variety: This refers to the diversity of data types. Big data can
include structured data (like databases), semi-structured data (like
XML or JSON), and unstructured data (like text, images, and videos).
• 4. Veracity: This refers to the quality and accuracy of the
data. Big data sets can often contain errors, inconsistencies, or
biases that need to be addressed before analysis.
• 5. Value: This refers to the potential benefits that can be
derived from analyzing the data. Big data can provide valuable
insights into business operations, customer behavior, and
market trends.
• 6. Variability: This refers to the dynamic nature of the data flow, where
the rate, format, and meaning of incoming data can change over time, even
within the same dataset.
Description of Types of data
• Structured data: Organized in a predefined format (e.g., databases,
spreadsheets).
• Unstructured data: Not organized in a predefined format (e.g., text,
images, audio).
• Semi-structured data: Partially structured (e.g., XML, JSON).
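As a small illustration of these three types, the Python sketch below builds a tiny structured table with pandas, parses one semi-structured JSON record, and holds a piece of unstructured free text. The field names and values are illustrative assumptions, not data from this lecture.

import json
import pandas as pd

# Structured data: rows and columns with a predefined schema
structured = pd.DataFrame({"customer_id": [1, 2], "amount": [19.99, 5.50]})

# Semi-structured data: JSON with nested and optional fields
record = json.loads('{"customer_id": 1, "tags": ["vip"], "note": null}')

# Unstructured data: free text with no predefined schema
review_text = "The delivery was fast and the product works well."

print(structured.dtypes)          # the schema is explicit
print(record.get("tags", []))     # structure can vary from record to record
print(len(review_text.split()))   # text needs processing before analysis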
Gathering Machine Learning Datasets
Web scraping: Extracting data from websites using automated
tools.
APIs: Interacting with APIs to retrieve data from online services.
Surveys and questionnaires: Gathering data directly from
individuals.
Sensor data: Collecting data from physical sensors.
Data purchases: Acquiring data from commercial data providers.
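To make the API method above concrete, here is a minimal Python sketch that retrieves JSON records with the requests library. The URL, parameters, and response shape are placeholders (assumptions); a real service would normally also require authentication and rate limiting.

import requests

# Minimal sketch: fetch JSON records from a placeholder REST endpoint
url = "https://api.example.com/v1/records"  # placeholder URL, not a real service
response = requests.get(url, params={"limit": 100}, timeout=10)
response.raise_for_status()   # stop early on HTTP errors
records = response.json()     # parse the JSON payload into Python objects
print(f"Retrieved {len(records)} records")

Web scraping follows a similar pattern, except the response is HTML that must be parsed with an HTML parser, and the site's terms of service and robots.txt should be respected.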
Data Visualization tools
1. Tableau: A powerful and user-friendly tool for creating interactive
dashboards and visualizations.
2. Power BI: Microsoft's business intelligence tool with strong
integration with Office products.
3. Qlik Sense: Offers associative exploration, enabling users to
discover relationships between data points.
4. Plotly: A Python library for creating interactive plots and graphs.
5. Matplotlib: A Python library for creating static plots and graphs.
6. Seaborn: A Python library built on top of Matplotlib, offering a
higher-level interface for creating attractive statistical visualizations.
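As a quick illustration of tools 5 and 6, the sketch below draws a statistical scatter plot with Seaborn on top of Matplotlib, using the small "tips" sample dataset that ships with Seaborn (it is downloaded on first use).

import matplotlib.pyplot as plt
import seaborn as sns

# Load a small sample dataset bundled with Seaborn
tips = sns.load_dataset("tips")

# Seaborn offers a high-level statistical interface on top of Matplotlib
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.title("Tip versus total bill")
plt.tight_layout()
plt.show()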
Description of Characteristics of Quality Data
1. Accuracy: The term “accuracy” refers to the degree to which
information correctly reflects an event, location, person, or other
entity.
2. Completeness: Data is considered “complete” when it contains all of the
records and values expected for its purpose.
3. Consistency: At many companies, the same information may be
stored in more than one place. If that information matches, it’s
considered to be “consistent.”
4. Timeliness: Is your information available right when it’s needed?
That data quality dimension is called “timeliness.”
5. Validity: Validity refers to information that conforms to a specific
format or follows business rules. To meet this dimension, you must confirm
that all of your information adheres to those formats or rules.
6. Uniqueness: “Unique” information means that there’s only
one instance of it appearing in a database.
7. Relevance: Relevance refers to the extent to which data is useful and
meaningful for a specific purpose.
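Several of these dimensions can be measured directly on a dataset. The sketch below computes simple completeness, uniqueness, and validity indicators for a hypothetical customer table using pandas; the column names and the e-mail format rule are illustrative assumptions.

import pandas as pd

# Hypothetical customer records (illustrative data only)
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

# Completeness: share of non-missing values in each column
completeness = df.notna().mean()

# Uniqueness: does each customer_id appear only once?
ids_unique = df["customer_id"].is_unique

# Validity: do e-mail values follow a simple format rule?
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

print(completeness)
print("IDs unique:", ids_unique)
print("Valid e-mails:", int(valid_email.sum()), "of", len(df))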
The importance of data cleaning
1. Accuracy and Reliability: Ensures the data you work with is correct
and dependable, which is crucial for making informed decisions.
2. Better Decision-Making: Clean data leads to more accurate insights,
helping organizations make better strategic choices.
3. Efficiency: Streamlines the data analysis process by removing
irrelevant or redundant information, making datasets easier to manage.
4. Enhanced Data Quality: Maintains high data quality, making the data
more useful and valuable for analysis.
5. Compliance and Risk Management: Helps organizations comply with
regulations and manage risks by ensuring data is handled properly.
6. Cost Savings: Prevents errors and reduces costs associated with
correcting mistakes and dealing with poor data quality.
Data cleaning Techniques
1. Removing Duplicates: Identifying and eliminating duplicate
records to prevent redundancy and ensure each entry is unique.
2. Handling Missing Values: Addressing missing data by either
filling in the gaps with appropriate values (imputation) or removing
incomplete records, depending on the context.
3. Standardizing Data: Ensuring consistency in data formats, such
as dates, addresses, and names, to make the data uniform and
easier to analyze.
4. Correcting Errors: Identifying and fixing errors in the data, such
as typos, incorrect values, or inconsistencies.
5. Validating Data: Checking data against predefined rules or
criteria to ensure it meets the required standards and is within
acceptable ranges.
6. Filtering Outliers: Identifying and handling outliers that may
skew the analysis, either by removing them or adjusting their
values.
7. Normalization: Transforming data into a common scale without
distorting differences in the ranges of values, which is particularly
useful for numerical data.
8. Data Enrichment: Enhancing the dataset by adding relevant
information from external sources to provide more context and
improve analysis.
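The sketch below applies a few of these techniques (removing duplicates, imputing missing values, filtering outliers, and min-max normalization) with pandas; the column names, values, and the three-standard-deviation threshold are illustrative assumptions.

import pandas as pd

# Hypothetical raw dataset (illustrative values only)
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "amount":   [10.0, 10.0, None, 25.0, 10_000.0],
})

# Removing duplicates (technique 1)
df = df.drop_duplicates()

# Handling missing values (technique 2): impute with the column median
df["amount"] = df["amount"].fillna(df["amount"].median())

# Filtering outliers (technique 6): keep values within 3 standard deviations
mean, std = df["amount"].mean(), df["amount"].std()
df = df[(df["amount"] - mean).abs() <= 3 * std]

# Normalization (technique 7): min-max scale to the [0, 1] range
lo, hi = df["amount"].min(), df["amount"].max()
df["amount_scaled"] = (df["amount"] - lo) / (hi - lo)

print(df)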
Thank you!!!!