Lecture 2
By Japhet Moise H.
Data Collection and Acquisition
1. IoT sensors
2. Cameras
3. Computers
4. Smartphones
5. Social data
6. Transactional data
Description of the 6 V's of Big Data
• The 6 V's of Big Data are a framework used to characterize the key
challenges and opportunities associated with large-scale data sets.
• 1. Volume: This refers to the sheer amount of data generated. Big
data sets are typically massive in size, often exceeding terabytes or
even petabytes.
• 2. Velocity: This refers to the speed at which data is generated
and processed. Big data often arrives at a rapid pace, requiring real-
time or near-real-time analysis.
• 3. Variety: This refers to the diversity of data types. Big data can
include structured data (like databases), semi-structured data (like
XML or JSON), and unstructured data (like text, images, and videos).
• 4. Veracity: This refers to the quality and accuracy of the
data. Big data sets can often contain errors, inconsistencies, or
biases that need to be addressed before analysis.
• 5. Value: This refers to the potential benefits that can be
derived from analyzing the data. Big data can provide valuable
insights into business operations, customer behavior, and
market trends.
• 6. Variability: This refers to the dynamic nature of the data flow, where
the rate, format, and meaning of incoming data can change over time, even
within the same dataset.
Description of Types of data
• Structured data: Organized in a predefined format (e.g., databases,
spreadsheets).
• Unstructured data: Not organized in a predefined format (e.g., text,
images, audio).
• Semi-structured data: Partially structured (e.g., XML, JSON).
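As a small illustration of these three types, the Python sketch below builds a tiny structured table with pandas, parses one semi-structured JSON record, and holds a piece of unstructured free text. The field names and values are illustrative assumptions, not data from this lecture.

import json
import pandas as pd

# Structured data: rows and columns with a predefined schema
structured = pd.DataFrame({"customer_id": [1, 2], "amount": [19.99, 5.50]})

# Semi-structured data: JSON with nested and optional fields
record = json.loads('{"customer_id": 1, "tags": ["vip"], "note": null}')

# Unstructured data: free text with no predefined schema
review_text = "The delivery was fast and the product works well."

print(structured.dtypes)          # the schema is explicit
print(record.get("tags", []))     # structure can vary from record to record
print(len(review_text.split()))   # text needs processing before analysis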
Gathering Machine Learning Datasets
Web scraping: Extracting data from websites using automated
tools.
APIs: Interacting with APIs to retrieve data from online services.
Surveys and questionnaires: Gathering data directly from
individuals.
Sensor data: Collecting data from physical sensors.
Data purchases: Acquiring data from commercial data providers.
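To make the API method above concrete, here is a minimal Python sketch that retrieves JSON records with the requests library. The URL, parameters, and response shape are placeholders (assumptions); a real service would normally also require authentication and rate limiting.

import requests

# Minimal sketch: fetch JSON records from a placeholder REST endpoint
url = "https://api.example.com/v1/records"  # placeholder URL, not a real service
response = requests.get(url, params={"limit": 100}, timeout=10)
response.raise_for_status()   # stop early on HTTP errors
records = response.json()     # parse the JSON payload into Python objects
print(f"Retrieved {len(records)} records")

Web scraping follows a similar pattern, except the response is HTML that must be parsed with an HTML parser, and the site's terms of service and robots.txt should be respected.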
Data Visualization tools
1. Tableau: A powerful and user-friendly tool for creating interactive
dashboards and visualizations.
2. Power BI: Microsoft's business intelligence tool with strong
integration with Office products.
3. Qlik Sense: Offers associative exploration, enabling users to
discover relationships between data points.
4. Plotly: A Python library for creating interactive plots and graphs.
5. Matplotlib: A Python library for creating static plots and graphs.
6. Seaborn: A Python library built on top of Matplotlib, offering a
higher-level interface for creating attractive statistical visualizations.
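As a quick illustration of tools 5 and 6, the sketch below draws a statistical scatter plot with Seaborn on top of Matplotlib, using the small "tips" sample dataset that ships with Seaborn (it is downloaded on first use).

import matplotlib.pyplot as plt
import seaborn as sns

# Load a small sample dataset bundled with Seaborn
tips = sns.load_dataset("tips")

# Seaborn offers a high-level statistical interface on top of Matplotlib
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.title("Tip versus total bill")
plt.tight_layout()
plt.show()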
Description of Characteristics of Quality Data
1. Accuracy: The term “accuracy” refers to the degree to which
information correctly reflects an event, location, person, or other
entity.
2. Completeness: Data is considered “complete” when it contains all of the
records and values expected for its purpose.
3. Consistency: At many companies, the same information may be
stored in more than one place. If that information matches, it’s
considered to be “consistent.”
4. Timeliness: Is your information available right when it’s needed?
That data quality dimension is called “timeliness.”
5. Validity: Validity refers to information that conforms to a specific
format or follows business rules. To meet this dimension, you must confirm
that all of your information adheres to those formats or rules.
6. Uniqueness: “Unique” information means that there’s only
one instance of it appearing in a database.
7. Relevance: Relevance refers to the extent to which data is useful and
meaningful for a specific purpose.
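Several of these dimensions can be measured directly on a dataset. The sketch below computes simple completeness, uniqueness, and validity indicators for a hypothetical customer table using pandas; the column names and the e-mail format rule are illustrative assumptions.

import pandas as pd

# Hypothetical customer records (illustrative data only)
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

# Completeness: share of non-missing values in each column
completeness = df.notna().mean()

# Uniqueness: does each customer_id appear only once?
ids_unique = df["customer_id"].is_unique

# Validity: do e-mail values follow a simple format rule?
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

print(completeness)
print("IDs unique:", ids_unique)
print("Valid e-mails:", int(valid_email.sum()), "of", len(df))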
The importance of data cleaning
1. Accuracy and Reliability: Ensures the data you work with is correct
and dependable, which is crucial for making informed decisions.
2. Better Decision-Making: Clean data leads to more accurate insights,
helping organizations make better strategic choices.
3. Efficiency: Streamlines the data analysis process by removing
irrelevant or redundant information, making datasets easier to manage.
4. Enhanced Data Quality: Maintains high data quality, making the data
more useful and valuable for analysis.
5. Compliance and Risk Management: Helps organizations comply with
regulations and manage risks by ensuring data is handled properly.
6. Cost Savings: Prevents errors and reduces costs associated with
correcting mistakes and dealing with poor data quality.
Data cleaning Techniques
1. Removing Duplicates: Identifying and eliminating duplicate
records to prevent redundancy and ensure each entry is unique.
2. Handling Missing Values: Addressing missing data by either
filling in the gaps with appropriate values (imputation) or removing
incomplete records, depending on the context.
3. Standardizing Data: Ensuring consistency in data formats, such
as dates, addresses, and names, to make the data uniform and
easier to analyze.
4. Correcting Errors: Identifying and fixing errors in the data, such
as typos, incorrect values, or inconsistencies.
5. Validating Data: Checking data against predefined rules or
criteria to ensure it meets the required standards and is within
acceptable ranges.
6. Filtering Outliers: Identifying and handling outliers that may
skew the analysis, either by removing them or adjusting their
values.
7. Normalization: Transforming data into a common scale without
distorting differences in the ranges of values, which is particularly
useful for numerical data.
8. Data Enrichment: Enhancing the dataset by adding relevant
information from external sources to provide more context and
improve analysis.
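The sketch below applies a few of these techniques (removing duplicates, imputing missing values, filtering outliers, and min-max normalization) with pandas; the column names, values, and the three-standard-deviation threshold are illustrative assumptions.

import pandas as pd

# Hypothetical raw dataset (illustrative values only)
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "amount":   [10.0, 10.0, None, 25.0, 10_000.0],
})

# Removing duplicates (technique 1)
df = df.drop_duplicates()

# Handling missing values (technique 2): impute with the column median
df["amount"] = df["amount"].fillna(df["amount"].median())

# Filtering outliers (technique 6): keep values within 3 standard deviations
mean, std = df["amount"].mean(), df["amount"].std()
df = df[(df["amount"] - mean).abs() <= 3 * std]

# Normalization (technique 7): min-max scale to the [0, 1] range
lo, hi = df["amount"].min(), df["amount"].max()
df["amount_scaled"] = (df["amount"] - lo) / (hi - lo)

print(df)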
Thank you!!!!