It125 Finals

The document discusses key concepts in data analytics including data, data analysis, data mining, data visualization, machine learning, artificial intelligence, and big data. It provides definitions for data types, databases, data warehousing, data lakes, ETL processes, and business intelligence. The document also discusses the relationships between computer science, data science, and the Internet of Things.

Uploaded by Dana Cortez

3.2 WORKING WITH DATA AND INFORMATION

BASIC TERMINOLOGIES OF DATA ANALYTICS

1. Data is a collection of raw facts, such as numbers, words, measurements, observations, or just descriptions of things, stored in a computer system. Data are measurements or observations that are collected as a source of information. This is equivalent to a cell value in a spreadsheet. Data can be structured (organized in a tabular format) or unstructured (e.g., text, images, videos).
2. A data unit is a group of related data held within the same structure. It is one entity (such as a person or business) in the population being studied, about which data is collected (person's name, person's date of birth, person's address, etc.). A data unit is also referred to as a unit record or record in a database. This is equivalent to a row in a spreadsheet.
3. A data item is a characteristic (or attribute) of a data unit which is measured or counted, such as height, country of birth, or income. Data items are the components that provide structure for a table. A data item is also referred to as a field in a database. This is equivalent to a column in a spreadsheet.
4. A dataset is a related set or collection of data and information that is composed of separate elements but can be manipulated as a unit by a computer. This set is normally presented in a tabular pattern. It is also known as a table in a database management system. This is equivalent to a worksheet in a spreadsheet.
5. A database is an organized collection of structured information, or data, typically stored electronically in a computer system. It is a collection of related datasets or tables. A database is designed to efficiently manage, retrieve, and manipulate large volumes of data. Databases are central to storing and managing information for various applications, ranging from simple personal lists to complex enterprise systems. This is equivalent to an Excel file with several worksheets in a spreadsheet.
6. A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially data analytics. Data warehousing integrates data and information collected from various sources into one comprehensive database. A data warehouse is a group of data specific to the entire organization, not only to a particular group of users.
7. A data lake is a centralized repository that allows organizations to store all their structured and unstructured data at any scale. Unlike traditional data warehouses, which require data to be processed and structured before storage, a data lake allows raw data to be stored in its native format until it's needed. This means that data can be ingested from various sources without the need for a pre-defined schema or organization.
8. ETL (Extract, Transform, Load) is the process of extracting data from various sources, transforming it into a consistent format, and loading it into a data warehouse or another storage system for analysis. ETL is a process used in data integration and data warehousing to gather data from various sources, transform it into a consistent format, and then load it into a target destination, typically a data warehouse, database, or data lake.
9. Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. It involves applying various techniques and methods to interpret and make sense of data, extract insights, and derive meaningful conclusions.
10. Data analytics is the process of examining, cleaning, transforming, and interpreting data to uncover meaningful insights, draw conclusions, and support decision-making. It involves the use of various techniques and tools to analyze large and diverse datasets, with the goal of extracting valuable information and understanding patterns, trends, and relationships within the data.
11. Business intelligence (BI) primarily focuses on gathering, storing, and analyzing historical data to provide insights, in the form of data visualization, reporting, and querying, into past and current performance. The ultimate goal is to drive better business decisions that enable organizations to increase revenue, improve operational efficiency, and gain competitive advantages over business rivals.
12. Business analytics is the practice of using data analysis and statistical methods to derive insights and make data-driven decisions within a business or organization. Business analytics includes predictive and prescriptive analytics, which focus on forecasting future outcomes and recommending actions to improve future performance. BA often involves looking ahead and making proactive decisions based on predictive models and insights.
13. Data visualization is the graphical representation of data to facilitate understanding, analysis, and interpretation. It involves presenting data in visual formats such as charts, graphs, maps, and dashboards to communicate complex information clearly and effectively. Data visualization helps users gain insights, identify trends, detect patterns, and make informed decisions by visually exploring and interacting with data.
14. Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information. The primary purpose of data mining is to discover patterns, correlations, and insights within large datasets. Data mining often involves analyzing historical data to identify patterns and trends that can be used to make predictions about future outcomes.
15. Computer science is both a theoretical and practical discipline, with roots in mathematics, engineering, and information theory. It involves the systematic study of algorithms for processing, storing, and transmitting information, as well as the development of software and hardware systems to implement these algorithms effectively.
16. Big data is a combination of structured, semi-structured, and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling, and other advanced analytics applications.
17. Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. Artificial intelligence encompasses a wide range of techniques, algorithms, and methodologies that enable machines to perform tasks that typically require human intelligence. AI aims to create systems that can perceive their environment, reason, learn from experience, and interact with humans in natural ways.
18. Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. Machine learning focuses on the development of algorithms and statistical models that enable computers to learn and improve their performance on a specific task without being explicitly programmed.
19. Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data. These insights can be used to guide decision-making and strategic planning.
20. The Internet of Things (IoT) describes the network of physical objects ("things") that are embedded with sensors, software, and other technologies for the purpose of connecting and exchanging data with other devices and systems over the internet. IoT can also make use of artificial intelligence (AI) and machine learning to aid in making data-collecting processes easier and more dynamic.

3.1 DATA ANALYSIS

TYPES OF DATA

1. There are two types of data: Quantitative and Qualitative data. In computer programming, these are called numeric and non-numeric.
2. Quantitative data (Quantity) or Numerical Data refers to any information that can be quantified, counted or measured, and given a numerical value. Quantitative data are measures of values or counts and are expressed as numbers.
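The ETL process defined in item 8 can be sketched in a few lines of Python. Everything here is illustrative: the two sources, their field names, and the sales table are invented, and an in-memory SQLite database stands in for the data warehouse.

```python
# A minimal sketch of the Extract-Transform-Load (ETL) idea.
# The source records, field names, and target table are all invented
# for illustration; real pipelines use dedicated tools and schemas.
import sqlite3

# Extract: raw records pulled from two hypothetical sources with
# inconsistent formats.
source_a = [{"name": "Ana", "sales": "1,200"}, {"name": "Ben", "sales": "950"}]
source_b = [("Carla", 1100.0), ("Dan", 780.5)]

# Transform: normalize both sources into one consistent (name, amount) format.
rows = []
for rec in source_a:
    rows.append((rec["name"], float(rec["sales"].replace(",", ""))))
for name, sales in source_b:
    rows.append((name, float(sales)))

# Load: write the cleaned rows into a target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # -> 4030.5
```

Once loaded, the warehouse table can be queried with ordinary SQL, which is the point of transforming everything into one consistent format first.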
These are data about numeric variables (e.g. how many, how much, or how often).
3. Qualitative or Categorical Data is data that can't be measured or counted in the form of numbers. These are descriptive in nature, expressed in terms of language rather than numerical values.
4. These are data about categorical variables (e.g. what type). Qualitative data are measures of 'types' and may be represented by a name, symbol, or a number code.
5. It is important to identify whether the data are quantitative or qualitative as this affects the results (conclusions or decisions) that can be produced.
6. It's hard to conduct a successful data analysis without qualitative and quantitative data. They both have their advantages and disadvantages and often complement each other.

EXAMPLES OF QUANTITATIVE DATA
[Figure: examples of quantitative data]

EXAMPLES OF QUALITATIVE DATA
[Figure: examples of qualitative data]

QUALITATIVE VS. QUANTITATIVE
[Figure: comparison of qualitative and quantitative data]

CATEGORIES OF DATA

1. Data are further classified into four categories: Nominal data, Ordinal data, Discrete data, and Continuous data.
2. The Qualitative data are further classified into two parts: Nominal data and Ordinal data.
3. The Quantitative data are further classified into two parts: Discrete data and Continuous data.

NOMINAL DATA

1. Nominal Data is used to label variables without any order or quantitative value.
2. The color of hair can be considered nominal data, as one color can't be compared with another color.
3. The name "nominal" comes from the Latin word "nomen," which means "name." With the help of nominal data, we can't do any numerical tasks or give any order to sort the data.
4. These data don't have any meaningful order; their values are distributed into distinct categories.
5. Examples of Nominal Data:
● Color of hair (Blonde, Red, Brown, Black, etc.)
● Marital status (Single, Widowed, Married)
● Nationality (Indian, German, American)
● Gender (Male, Female, Others)
● Eye Color (Black, Brown, etc.)

ORDINAL DATA

1. Ordinal data have a natural ordering, where a number is present in some kind of order by its position on the scale.
2. Ordinal data are used for observations like customer satisfaction, happiness, etc., but we can't do any arithmetical tasks on them.
3. Ordinal data is qualitative data for which the values have some kind of relative position.
4. These kinds of data can be considered "in-between" qualitative and quantitative data. Ordinal data only shows sequences and cannot be used for statistical analysis.
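The distinction between nominal and ordinal data can be shown with a short Python sketch. The hair colors and letter grades mirror the examples in these notes, while the numeric rank mapping is an assumption added purely for the demonstration.

```python
# Nominal vs. ordinal data: a small illustration.
hair_colors = ["Brown", "Blonde", "Black"]  # nominal: no meaningful order

# Ordinal: letter grades have a relative position, so we can rank them
# with an explicit scale, even though arithmetic on them is meaningless.
grade_rank = {"A": 4, "B": 3, "C": 2, "D": 1}  # assumed scale for the sketch
grades = ["C", "A", "D", "B"]
best_first = sorted(grades, key=lambda g: grade_rank[g], reverse=True)
print(best_first)  # -> ['A', 'B', 'C', 'D']
```

Sorting the nominal list alphabetically would be possible but meaningless; only the ordinal values carry a position on a scale.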
5. Compared to nominal data, ordinal data have some kind of order that is not present in nominal data.
6. Examples of Ordinal Data:
● Feedback, experience, or satisfaction on a scale of 1 to 10
● Letter grades in an exam (A, B, C, D, etc.)
● Ranking of people in a competition (First, Second, Third, etc.)
● Economic status (High, Medium, and Low)
● Education level (Higher, Secondary, Primary)

NOMINAL VS ORDINAL DATA

NOMINAL: Can't be quantified, nor do they have any intrinsic ordering. ORDINAL: Gives some kind of sequential order by position on the scale.
NOMINAL: Is qualitative or categorical data. ORDINAL: Said to be "in-between" qualitative and quantitative data.
NOMINAL: Does not provide any quantitative value, nor can we perform any arithmetical operation. ORDINAL: Provides sequence and can assign numbers to ordinal data, but cannot perform arithmetical operations.
NOMINAL: Cannot be used to compare with one another. ORDINAL: Can help to compare one item with another by ranking or ordering.
NOMINAL: Eye color, housing style, gender, hair color, religion, marital status, etc. ORDINAL: Economic status, customer satisfaction, education level, letter grades.

DISCRETE DATA

1. Discrete data contain values that fall under integers or whole numbers. The term discrete means distinct or separate.
2. The total number of students in a class is an example of discrete data.
3. These data can't be broken into decimal or fraction values.
4. Discrete data are countable and have finite values; their subdivision is not possible. These data are represented mainly by a bar graph, number line, or frequency table.
5. Examples of Discrete Data:
● Total number of students present in a class
● Cost of a cell phone
● Number of employees in a company
● Total number of players who participated in a competition
● Days in a week

CONTINUOUS DATA

1. Continuous data are in the form of fractional numbers or real numbers. It can be the version of an Android phone, the height of a person, the length of an object, etc.
2. Continuous data represents information that can be divided into smaller levels. A continuous variable can take any value within a range.
3. The key difference between discrete and continuous data is that discrete data contains integers or whole numbers while continuous data contains numeric values with fractional parts.
4. Continuous data stores fractional numbers to record different types of data such as temperature, height, width, time, speed, etc.
5. Examples of Continuous Data:
● Height of a person
● Speed of a vehicle
● "Time taken" to finish the work
● Wi-Fi frequency
● Market share price

DIFFERENCE BETWEEN DISCRETE AND CONTINUOUS

DISCRETE: Are countable and finite; they are whole numbers or integers. CONTINUOUS: Are measurable; they are in the form of fractions or decimals.
DISCRETE: Represented mainly by bar graphs. CONTINUOUS: Represented in the form of a histogram.
DISCRETE: Values that cannot be divided into subdivisions and smaller pieces. CONTINUOUS: Values that can be divided into subdivisions and smaller pieces.
DISCRETE: Have spaces between the values. CONTINUOUS: In the form of continuous sequences.
DISCRETE: Total number of students in class, number of days in a week, size of a shoe, etc. CONTINUOUS: Temperature of a room, the weight of a person, length of an object.

3.3 INTRODUCTION TO DATA ANALYTICS

TYPES OF DATA ANALYTICS

Various approaches to data analytics include looking at what happened (descriptive analytics), why something happened (diagnostic analytics), what is going to happen (predictive analytics), or what should be done next (prescriptive analytics).

TYPES OF DATA ANALYTICS AND QUESTIONS ANSWERED
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: What should we do?
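The discrete/continuous distinction above can be sketched in a few lines of Python, under the simplifying assumption that whole numbers stand for counts; the sample measurements are invented for illustration.

```python
# Discrete vs. continuous values: discrete data are countable whole
# numbers, continuous data can take fractional values within a range.
# The sample values below are invented for illustration.
samples = {
    "students_in_class": 32,   # discrete: countable, no fractions
    "days_in_week": 7,         # discrete
    "height_cm": 172.5,        # continuous: fractional values allowed
    "room_temp_c": 21.73,      # continuous
}

def kind(value):
    # Treat a whole number as discrete, anything fractional as continuous.
    return "discrete" if float(value).is_integer() else "continuous"

labels = {name: kind(v) for name, v in samples.items()}
print(labels)
```

This is only a heuristic for the sketch: in real statistics the variable's meaning, not its stored type, decides whether it is discrete or continuous.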
DESCRIPTIVE ANALYTICS

1. Descriptive analytics is the examination of data or content to answer the question "What happened?" (or "What is happening?"), characterized by traditional business intelligence (BI) and visualizations such as pie charts, bar charts, line graphs, tables, or generated narratives.
2. Descriptive analytics involves analyzing historical data to understand what happened in the past. It focuses on summarizing and visualizing data to provide insights into trends, patterns, and relationships.
3. Descriptive analytics examines what happened in the past. You're utilizing descriptive analytics when you examine past data sets for patterns and trends. This is the core of most businesses' analytics because it answers important questions like how much you sold and whether you hit specific goals. It's easy to understand, even for non-data analysts.
4. Descriptive analytics functions by identifying what metrics you want to measure, collecting that data, and analyzing it. It turns the stream of facts your business has collected into information you can act on, plan around, and measure.
5. Once descriptive analytics is done, it's up to your team to ask how or why those trends occurred, brainstorm and develop possible responses or solutions, and choose how to move forward.
6. Examples: generating reports, creating dashboards, and using data visualization techniques like charts and graphs to present historical data. Other examples include annual revenue reports, survey response summaries, and year-over-year sales reports.

DIAGNOSTIC ANALYTICS

1. Diagnostic analytics helps explain why things happened the way they did. It's a more complex version of descriptive analytics, extending beyond what happened to why it happened.
2. Diagnostic analytics involves digging deeper into historical data to understand why certain events occurred. It focuses on identifying the root causes or factors behind specific outcomes or trends observed in descriptive analytics.
3. Diagnostic analytics identifies trends or patterns in the past and then goes a step further to explain why the trends occurred the way they did. It's a logical step after descriptive analytics because it answers questions like why a certain amount was sold or why Q1 targets were hit.
4. Diagnostic analytics is also a useful tool for businesses that want more confidence to duplicate good outcomes and avoid negative ones. Descriptive analytics can tell you what happened, but then it is up to your team to figure out what to do with that data.
5. Diagnostic analytics applies data to figure out why something happened so you can develop better strategies without so much trial and error. The main flaw of diagnostic analytics is its limitation in providing actionable observations about the future, since it focuses on past occurrences.
6. Understanding the causal relationships and sequences may be enough for some businesses, but it may not provide sufficient answers for others. For the latter, managing big data will likely require more advanced analytics solutions, and you might have to implement additional tools (venturing into predictive or prescriptive analytics) to find meaningful insights.
7. Diagnostic analytics answers the following questions:
● Why did year-over-year sales go up?
● Why did a certain product perform above expectations?
● Why did we lose customers in Q3?
8. Examples of diagnostic analytics include conducting root cause analysis, performing variance analysis, and using techniques like drill-down and data discovery to investigate data anomalies.

PREDICTIVE ANALYTICS

1. Predictive analytics aims to predict likely outcomes and make educated forecasts using historical data. Simply put, it seeks to answer the question "What will happen?". Predictive analytics involves using historical data to predict future outcomes or trends. It focuses on forecasting and making informed predictions based on patterns and relationships identified in historical data.
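As a toy illustration of this "What will happen?" question, the sketch below fits a least-squares line to invented quarterly sales figures and extends the trend one quarter ahead; real forecasting would use proper statistical or machine learning tooling.

```python
# Predictive analytics in miniature: extend a historical trend into the
# future with a least-squares line. The sales figures are invented.
quarters = [1, 2, 3, 4]
sales = [100.0, 110.0, 125.0, 135.0]

# Ordinary least squares for a single predictor, computed by hand.
n = len(quarters)
mean_x = sum(quarters) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(quarters, sales)) \
        / sum((x - mean_x) ** 2 for x in quarters)
intercept = mean_y - slope * mean_x

# Project the fitted trend to the next quarter.
forecast_q5 = intercept + slope * 5
print(round(forecast_q5, 1))  # -> 147.5
```

The forecast is only as good as the history it extrapolates, which is exactly the limitation items 5-7 below describe.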
2. Predictive analytics extends trends into the future to see possible outcomes. This is a more complex version of data analytics because it uses probabilities for predictions instead of simply interpreting existing facts.
3. Use predictive analytics by first identifying what you want to predict and then bringing existing data together to project possibilities to a particular date. Statistical modeling or machine learning are commonly used with predictive analytics. This is how you answer planning questions such as how much you might sell or whether you're on track to hit your Q4 targets.
4. A business is in a better position to set realistic goals and avoid risks if it uses data to create a list of likely outcomes. Predictive analytics can keep your team or the company as a whole aligned on the same strategic vision.
5. The primary challenge with predictive analytics is that the insights it generates are limited to the data. First, that means that smaller or incomplete data sets will not yield predictions as accurate as larger data sets might.
6. Getting good business intelligence (BI) from predictive analytics requires sufficient data, but what counts as "sufficient" depends on the industry, business, audience, and the use case.
7. Additionally, the challenge of predictive analytics being restricted to the data simply means that even the best algorithms with the biggest data sets can't weigh intangible or distinctly human factors. A sudden economic shift or even a change in the weather can affect spending, but a predictive analytics model can't account for those variables.
8. Examples of predictive analytics include:
● Ecommerce businesses that use a customer's browsing and purchasing history to make product recommendations.
● Financial organizations that need help determining whether a customer is likely to pay their credit card bill on time.
● Marketers who analyze data to determine the likelihood that new customers will respond favorably to a given campaign or product offering.
● Building predictive models, using statistical techniques like regression analysis and machine learning algorithms to forecast future sales, demand, or customer behavior.

PRESCRIPTIVE ANALYTICS

1. Prescriptive analytics is the use of advanced processes and tools to analyze data and content to recommend the optimal course of action or strategy moving forward. Simply put, it seeks to answer the question "What should we do?". Prescriptive analytics involves recommending actions or decisions based on insights derived from descriptive, diagnostic, and predictive analytics. It focuses on providing actionable recommendations to optimize outcomes or achieve specific objectives.
2. Ultimately, prescriptive analytics helps you make better decisions about what your next course of action should be. This can involve any aspect of your business, such as increasing revenue, reducing customer churn, preventing fraud, and increasing efficiency.
3. Prescriptive analytics uses data from a variety of sources, including statistics, machine learning, and data mining, to identify possible future outcomes and show the best option.
4. Prescriptive analytics is the most advanced of the four types because it provides actionable insights instead of raw data. This methodology is how you determine what should happen, not just what could happen. Using prescriptive analytics enables you to not only envision future outcomes but to understand why they will happen.
5. Prescriptive analytics can also predict the effect of future decisions, including the ripple effects those decisions can have on different parts of the business. And it does this in whatever order the decisions may occur.
6. Prescriptive analytics is a complex process that involves many variables and tools like algorithms, machine learning, and big data. Proper data infrastructures need to be established, or this type of analytics could be a challenge to manage.
7. The most common issue with prescriptive analytics is that it requires a lot of data to
produce useful results, but a large amount of data isn't always available. This type of analytics could easily become inaccessible for most.
8. Examples of prescriptive analytics include:
● Calculating client risk in the insurance industry to determine what plans and rates an account should be offered.
● Discovering what features to include in a new product to ensure its success in the market, possibly by analyzing data like customer surveys and market research to identify which features are most desirable for customers and prospects.
● Identifying tactics to optimize patient care in healthcare, like assessing the risk of developing specific health problems in the future and targeting treatment decisions to reduce those risks.
● Implementing decision support systems, using optimization algorithms, and recommending courses of action based on predictive models to improve business processes or strategies.

JOBS RELATED TO DATA ANALYTICS

1. Data Analyst: Data analysts are responsible for collecting, processing, and analyzing data to uncover insights and trends that can inform business decisions. They often work with databases, spreadsheets, statistical software, and data visualization tools to analyze data and present findings to stakeholders.
2. Business Analyst: Business analysts focus on understanding business processes, identifying opportunities for improvement, and making data-driven recommendations to enhance business performance. They use data analysis techniques to assess market trends, customer behavior, and operational efficiency.
3. Data Engineer: Data engineers are responsible for designing, building, and maintaining data pipelines and infrastructure to support data analytics initiatives. They work with big data technologies like Hadoop, Spark, and Kafka to ingest, transform, and store large volumes of data for analysis.
4. Marketing Analyst: Marketing analysts analyze marketing data to measure the effectiveness of marketing campaigns, understand customer behavior, and identify opportunities for targeting and segmentation. They use data analytics techniques to optimize marketing strategies and drive business growth.
5. Quantitative Analyst (Quant): Quants use mathematical and statistical techniques to analyze financial data and develop quantitative models for trading, risk management, and investment strategies. They often work in the finance industry and require strong analytical and programming skills.
6. Data Architect: Data architects design and oversee the structure and organization of data systems and databases to ensure they meet the needs of an organization's data analytics initiatives. They collaborate with data engineers and analysts to design data models, schemas, and architectures that support data analysis and reporting.
7. Data Visualization Specialist: Data visualization specialists design and create visual representations of data, such as charts, graphs, and dashboards, to communicate insights effectively to stakeholders. They use data visualization tools like Tableau, Power BI, and D3.js to create interactive and informative visualizations.
8. Machine Learning Engineer: Machine learning engineers develop and deploy machine learning models and algorithms to solve complex problems and make predictions based on data. They work closely with data scientists and software engineers to build and optimize machine learning pipelines and algorithms.
9. Data Scientist: Data scientists use advanced statistical and machine learning techniques to analyze complex datasets and extract valuable insights. They often work with big data technologies, programming languages like Python and R, and machine learning frameworks to develop predictive models and algorithms.

These are just a few examples of job roles related to data analytics. The field of data analytics is constantly evolving, and new job roles and
specialties continue to emerge as organizations increasingly rely on data-driven insights to inform decision-making and drive innovation.

3.4 STRUCTURED AND UNSTRUCTURED DATA

Data can come in various variants, like structured and unstructured data. Structured data is highly organized and formatted so that it's easily searchable in relational databases. Unstructured data has no predefined format or organization, making it much more difficult to collect, process, and analyze.

STRUCTURED DATA: Organized information. UNSTRUCTURED DATA: Diverse structure for information.
STRUCTURED DATA: Requires less storage. UNSTRUCTURED DATA: Requires more storage.
STRUCTURED DATA: Easier to manage and protect with legacy systems and solutions. UNSTRUCTURED DATA: More difficult to manage and protect with legacy systems and solutions.
STRUCTURED DATA: Can be displayed in rows, columns, and relational databases. UNSTRUCTURED DATA: Cannot be displayed in rows, columns, and relational databases.
STRUCTURED DATA: Estimated 20% of enterprise data (Gartner). UNSTRUCTURED DATA: Estimated 80% of enterprise data (Gartner).
STRUCTURED DATA: Numbers, dates, strings. UNSTRUCTURED DATA: Images, audio, video, word processing files, emails, text files.
STRUCTURED DATA examples: ZIP codes, phone numbers, email addresses, ATM activity, inventory control, student fee payment databases, airline reservation and ticketing. UNSTRUCTURED DATA examples: Text files, email, social media, websites, mobile data, satellite imagery, scientific data, digital surveillance, and sensor data.

STRUCTURED DATA

1. Structured data is highly organized and formatted so that it's easily searchable in relational databases.
2. Structured data is more finite and sorted into data arrays, while unstructured data is scattered and variable. Structured data adheres to a predefined data model; thus, it is easy to analyze.
3. Structured data relies on the existence of a data model, a specification for how data can be organized, processed, and interpreted. Structured data adheres to the table format: the relationship between rows and columns. Excel files and SQL databases are two prominent examples of structured data. Both consist of structured rows and columns which can be easily ordered and categorized.
4. Structured data is commonly stored in data warehouses and unstructured data is stored in data lakes (storage). Both have cloud-use potential, but structured data allows for less storage space and unstructured data requires more.
5. Structured data is regarded as the most 'traditional' type of data storage. This is because the oldest implementations of relational DBMS were capable of storing, processing, and accessing structured data. In an RDBMS, fields store length-delimited data like phone numbers, Social Security numbers, or ZIP codes, and records even contain text strings of variable length like names, making it a simple matter to search.
6. Structured data consists of clearly defined data types with patterns that make them easily searchable, while unstructured data ("everything else") is composed of data that is usually not as easily searchable, including formats like audio, video, and social media postings.
7. Structured data analytics is a mature process and technology, whereas unstructured data analytics is a developing industry with a lot of new investment in research and development.

BENEFITS OF USING STRUCTURED DATA

1. Easy to Use
● Business users who understand what the subject matter of the data is and how it is related to their infrastructure can easily understand how to structure their data.
● Tools such as Excel or Google Sheets make structured data easy, or more advanced users can lean further into SQL or business intelligence tools.
2. Convenient Storage
● Because structured data is organized, it is commonly stored in data centers for easy access to the data.
● The data warehouses hold their own space for businesses that choose to use them.
3. Instant Usability
● Structured data is organized, making it easy for a company to find exactly what it is looking for.
● With this method, a company can begin using the data instantly.
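The row-and-column searchability described above can be demonstrated with Python's built-in sqlite3 module; the student table and its records are hypothetical, invented for this sketch.

```python
# Structured data in practice: rows and columns in a relational table
# make records trivially searchable. The students table and its
# records are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, zip_code TEXT, fee_paid REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ana", "1000", 1500.0), ("Ben", "1100", 0.0), ("Carla", "1000", 750.0)],
)

# Because the schema is predefined, a query can filter on any column.
paid = conn.execute(
    "SELECT name FROM students WHERE fee_paid > 0 ORDER BY name"
).fetchall()
print(paid)  # -> [('Ana',), ('Carla',)]
```

An equivalent search over unstructured data (say, scanned fee receipts) would first require extracting and normalizing the fields, which is exactly the processing cost the notes contrast below.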
DISADVANTAGES OF USING STRUCTURED DATA

1. Limitations On Use
● Due to the organization style of structured data, it is more difficult to have flexibility or varied use cases.
● Structured data can only be used for its intended purpose. This limits its flexibility and use cases.
2. Limited Storage
● Structured data is stored in specific spaces of data warehouses.
● While accessing the data is easy, scalability can be difficult.
● Changes within data warehouses can become hard to manage.
● Using cloud data centers helps with the storage problems.
3. High Overhead
● Data centers or other storage for structured data can become expensive and be part of the structured data ordeal.
● Any change in requirements means updating all of that structured data to meet the new needs. This results in massive expenditure of resources.
● Again, cloud data centers are recommended, but the storage can still require significant work to keep the data maintained properly.

UNSTRUCTURED DATA

1. Unstructured data is data stored in its native format and not processed until used, which is known as schema-on-read.
2. Unstructured data comes in a myriad of formats. In recent times, the ability to access and analyze unstructured data has expanded tremendously, with several emerging technologies and software coming onto the market that can store different forms of unstructured data.
3. Unstructured data has an internal structure but is not structured via predefined data models or schemas. It may be textual or non-textual, and human-generated or machine-generated.

Human-generated unstructured data includes:
● Text Files: Word processing, presentations, emails, and logs.
● Email: The message field is largely text, and email has some internal structure thanks to its metadata (e.g., the visible "to", "from", "date/time", and "subject" entered to send an email), but it also mixes in unstructured data via the message body. For this reason, email is also referred to as semi-structured data.
● Social Media: Data from Facebook, Twitter, and LinkedIn.
● Websites: YouTube, Instagram, and photo sharing sites.
● Mobile Data: Text messages and locations.
● Communications: Chat, IM, phone recordings, and collaboration software.
● Media: MP3, digital photos, audio recordings and video files.
● Business Applications: Microsoft Office documents, PDFs and productivity applications.

Machine-generated unstructured data includes:
● Satellite Imagery: Weather data, landforms, and military movements.
● Scientific Data: Oil and gas exploration,
file formats, including email, social media space exploration, seismic imagery, and
posts, presentations, chats, loT sensor data, atmospheric data.
and satellite imagery. ● Digital Surveillance: Surveillance photos
3. Unstructured data are those data that and video, cctv Sensor Data: Traffic,
have no predetermined data model. weather, and oceanographic sensors.
4. Usually, unstructured data is text-heavy, BENEFITS OF UNSTRUCTURED DATA
but may also include data like dates,
1. Limitless Use
numbers, and statistics. This leads to
● Use cases for unstructured data are
inconsistencies and contradictions that
significantly larger than structured
make it hard to comprehend conventional
data due to its flexibility.
systems as opposed to data stored in
● From social media posts to
structured databases
scientific data, unstructured data
5. Unstructured data may include audio,
gives companies the flexibility to
video, or No-SQL databases. In recent
use the data how they want.
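The structured/unstructured contrast above can be sketched in a few lines of code. The snippet below is an illustrative sketch rather than part of the source material: the table, names, and values are invented, and only Python's built-in sqlite3 module is used. Structured rows answer an SQL query instantly, while unstructured text must be processed ad hoc before it yields anything.

```python
import sqlite3

# Structured data: a fixed schema of rows and columns, queryable with SQL.
# (Table name and values are invented for illustration.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Ana", "Manila"), (2, "Ben", "Cebu")])
rows = conn.execute("SELECT name FROM customers WHERE city = 'Cebu'").fetchall()
print(rows)  # instantly usable: [('Ben',)]

# Unstructured data: free text with no schema; it must be processed before use
# (here, a crude keyword scan stands in for real text analytics).
email_body = "Hi team, the shipment arrives Friday. Please contact Ana for details."
mentions_ana = "Ana" in email_body
print(mentions_ana)  # True
```

The same split holds at scale: spreadsheet, SQL, and BI tooling on the structured side, and text-processing or machine-learning pipelines on the unstructured side.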
2. Greater Insights
● When a company has more unstructured data than structured data, there is more data to work with.
● Unstructured data may be difficult to analyze, but through processing, a company can benefit from the data.
3. Low Overhead
● Because of the ability to store unstructured data in data lakes, a business can save money with how it chooses to store the data.

DISADVANTAGES OF UNSTRUCTURED DATA

1. Hard To Analyze
● If a company uses unstructured data, it is more difficult to take the raw data and analyze it despite its flexibility.
● Users require a proficient background in data science and machine learning to prepare, analyze, and integrate it with machine learning algorithms.
2. Data Analytic Tools
● Unstructured data cannot be managed by business tools.
● Its inconsistent nature makes it more difficult to work with than structured data.
● Currently, there aren't many tools that can manipulate unstructured data apart from cloud commodity servers and open-source NoSQL DBMS.
3. Numerous Formats
● Unstructured data comes in many different forms, such as medical records, social media posts, and emails.
● This information may be challenging to analyze.
4. Less Secured
● Unstructured data often rests on less authenticated and encrypted shared servers, which are more prone to ransomware and cyber attacks.

SEMI-STRUCTURED DATA

1. Semi-structured data is a type of structured data that lies midway between structured and unstructured data.
2. Semi-structured data doesn't have a specific relational or tabular data model but includes tags and semantic markers that scale data into records and fields in a dataset. Common examples of semi-structured data are JSON and XML.
3. Semi-structured data is more complex than structured data but less complex than unstructured data.
4. Semi-structured data is also relatively easier to store than unstructured data, bridging the gap between the two data types. An XML sitemap contains page information for a website. It embeds URLs, domain scores, do-follow pages, and meta tags.
5. Email is another common example of a semi-structured data type. Although more advanced analysis tools are necessary for thread tracking, near-dedupe, and concept searching, email's native metadata enables classification and keyword searching without any additional tools.

COMPARISON OF DATA VARIANTS

What is it?
● Structured: Structured in a spreadsheet-like manner (e.g. in a table).
● Semi-structured: Some degree of organizational structure.
● Unstructured: No predefined organizational form and no specific format.

To put it simply
● Structured: Think of a spreadsheet (Excel) or data in tabular form.
● Semi-structured: Think of a text file with text that has some structure (headers, paragraphs, etc.).
● Unstructured: Essentially anything that is not structured or semi-structured data (a lot).

Example formats
● Structured: Excel spreadsheets, comma-separated values (.csv), relational database tables.
● Semi-structured: HTML files, JavaScript Object Notation (JSON) files, Extensible Markup Language (XML) files.
● Unstructured: Images (jpg, png), videos (mp4, avi), sound files (mp3, wav), plain text files, Word/PDF files.

Characteristics
● Structured: Within the table, entries have the same format and predefined length and follow the same order. It is easily machine-readable and can therefore be analyzed without major preprocessing. It is commonly said that around 20% of the world's data is structured.
● Semi-structured: Tags or other markers separate elements and enforce hierarchies, but the size of elements can vary and their order is not important. It needs some processing before it can be analyzed by a computer. It has gained importance with the emergence of the World Wide Web.
● Unstructured: Data can take any form and thus be stored as any kind of file (formless). Within that file, there is no structure of content. It typically needs major pre-processing before it can be analyzed by a computer, but it is often easily consumable for humans (e.g. pictures, videos, plain texts). Most of the data created today is unstructured.

3.5 DATA QUALITY

LIST OF TERMINOLOGIES IN DATA QUALITY

1. Outlier refers to observations or data points that deviate significantly from the rest of the dataset. These data points are unusual, unexpected, or anomalous compared to the majority of the data and may indicate errors, anomalies, or interesting patterns in the data. Outliers can occur in various types of data, including numerical, categorical, and time-series data.
2. GIGO is an acronym that stands for "Garbage In, Garbage Out". It refers to the principle that the quality of the output produced by data analytics is determined by the quality of the input data provided to it. In other words, if the input data is inaccurate, incomplete, or irrelevant (i.e., "garbage"), then the output produced by the data analytics will also be inaccurate, unreliable, or nonsensical. The quality of the output is determined by the quality of the input.
3. Data Profiling is the process of analyzing and summarizing the structure, content, and quality of data within a dataset. Data profiling helps identify data quality issues and assess the overall quality of the data.
4. Data Cleansing is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability.
5. Data Validation is the process of verifying the integrity and accuracy of data through validation checks and cross-referencing with external sources or known standards.
6. Data Governance is the framework and processes for ensuring the availability, integrity, security, and usability of data within an organization. Data governance establishes policies, standards, and procedures for managing data quality and ensuring compliance with regulations.

SIX PHASES DATA TRANSFORMATION

In data mining and data warehousing, the transformation process generally follows six stages:

1. Discovery
● The first step is to identify and understand data in its original source format with the help of data profiling tools, finding all the sources and data types that need to be transformed. This step helps in understanding how the data needs to be transformed to fit into the desired format.
2. Mapping
● During this phase, analysts determine how individual fields are modified, matched, filtered, joined, and aggregated. This includes determining the current structure and the consequent transformation that is required.
3. Data Extraction
● During this phase, data is moved from a source system to a target system. Extraction may include structured or unstructured sources.
4. Code Generation and Execution
● Once extracted and loaded, transformation needs to occur on the raw data to store it in a format appropriate for business intelligence and analytic use. This is frequently accomplished by analytics engineers, who write programs to transform data. This code is executed daily or hourly to provide timely and appropriate analytic data.
5. Review
● The transformed data is evaluated to ensure the conversion has had the desired results in terms of the format of the data. It must also be noted that not all data will need transformation; at times it can be used as is.
6. Sending
● The final step involves sending data to its target destination. The target might be a data warehouse or other database in a structured format.

TYPES OF DATA TRANSFORMATION

The four transformation processes are:

1. Constructive
a. Changes are made to the data with the intention of building or creating something new, innovative, or valuable.
b. Data items are added, copied, replicated, or aggregated. New data structures, features, or attributes are created from existing data sources.
c. This type of transformation often involves tasks such as data aggregation, summarization, enrichment, or feature engineering.
2. Destructive
a. Data items or data units are trimmed or deleted.
b. It involves removing, filtering, or reducing the size or complexity of datasets.
c. It can involve tasks such as data cleaning, filtering outliers, or downsampling large datasets.
3. Aesthetic
a. This type of transformation may involve tasks such as data visualization, formatting, labeling, or designing interactive dashboards.
b. Aesthetic data transformation aims to enhance the clarity, understandability, and engagement of data visualizations for decision-makers and stakeholders.
c. Certain values are standardized to meet requirements or parameters.
4. Structural
a. Structural data transformation involves reshaping or restructuring the format, layout, or schema of datasets.
b. This includes columns being renamed, moved, and combined.
c. This type of transformation may include tasks such as data pivoting, melting, reshaping, or joining tables.
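To make the four transformation types concrete, here is a small hedged sketch in Python. The sales records and field names are invented (not from the source); each step is labeled with the transformation type it illustrates.

```python
# Invented raw records standing in for extracted source data.
raw = [
    {"item": "pen",    "qty": 3, "price": "10.0"},
    {"item": "pencil", "qty": 4, "price": "5.0"},
    {"item": None,     "qty": 1, "price": "5.0"},  # failed validation
]

# Destructive: trim records that fail validation.
clean = [r for r in raw if r["item"] is not None]

# Structural: rename fields and convert types to match the target schema.
typed = [{"product": r["item"], "qty": r["qty"], "price": float(r["price"])}
         for r in clean]

# Constructive: aggregate a new attribute (revenue) from existing data.
revenue = {}
for r in typed:
    revenue[r["product"]] = revenue.get(r["product"], 0.0) + r["qty"] * r["price"]

# Aesthetic: standardize the presentation of values for reporting.
report = {product: f"{amount:.2f}" for product, amount in revenue.items()}
print(report)  # {'pen': '30.00', 'pencil': '20.00'}
```

In a real pipeline these steps would be the "code generation and execution" phase described above, run on a schedule before the data is sent to its target.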
TYPES OF DIRTY DATA

1. Dirty data, also called unclean data or erroneous data, is data that is in some way faulty: it might contain duplicate, incomplete, outdated, insecure, inaccurate, incorrect, inconsistent, or outlier values.
2. Data can get dirty when it's entered, stored, or used incorrectly. Oftentimes, this comes down to human error or a lack of standardization rules for data entry, but technical issues can also lead to dirty data.
3. Duplicate data refers to records that partially or fully share the same information. They come about when the same information is entered multiple times, sometimes in different formats.
4. Incomplete data, also known as missing data, occurs when you don't have data stored for certain variables or data items. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.
5. Outdated data, also referred to as obsolete data or expired data, is inaccurate not because it was entered incorrectly, but because it used to be accurate and no longer is. It's possible that the data has been replaced by newer data that makes it outdated. A typical example of dirty data that's outdated is a CRM dataset that still lists a customer's old address after they've moved.
6. Insecure data is sensitive data that is not encrypted or access controlled. It's accessible by anyone in your company and - in worst-case scenarios - even by third parties. Insecure data can result from various factors, including inadequate security measures, vulnerabilities in data storage or transmission mechanisms, or malicious activities by unauthorized users.
7. Inaccurate data refers to data that contains errors, discrepancies, or inconsistencies that deviate from the true or expected values. Inaccuracies in data can arise from various sources, including data entry errors, measurement inaccuracies, system glitches, or inconsistencies in data collection processes.
8. Incorrect data is data that falls outside of previously specified parameters. It is easier to prevent when data validation is in place. Incorrect data includes misspellings and other typographical errors, wrong value entries, and syntax errors. An example would be if a customer enters their birthdate using a dropdown menu: your system will likely only allow them to select one out of 12 months, one out of 31 days, and perhaps they also won't be able to select a birth year that would make them older than 130 years.
9. Inconsistent data is also known as data redundancy. It occurs when companies store the same information in different places without syncing that information. Inconsistent data is a situation where there are multiple tables within a database that deal with the same data but may receive it from different inputs. A prime example would be a company storing customer information both in its CRM and in its email marketing tool. It usually stems from poor initial relational database design, wherein information is inefficiently structured and needlessly replicated within the same table.
10. Outlier data refers to data values that differ significantly from other values in your data set. For example, if you see that most student test scores fall between 50 and 80, but one student has scored a 2, this might be considered an outlier. Outliers may be the result of an error, but that's not always the case, so approach with caution when deciding whether or not to remove them.

CHARACTERISTICS OF DATA QUALITY

Determining data quality requires an examination of its characteristics, then weighing those characteristics according to what is most important to your organization and the application(s) for which they will be used.

SIX (6) CHARACTERISTICS OF CLEAN DATA / QUALITY DATA

1. Validity (and Legitimacy)
2. Accuracy (and Precision / Uniformity / Uniqueness)
3. Completeness (and Comprehensiveness)
4. Consistency (and Reliability)
5. Timeliness (and Availability / Accessibility)
6. Relevance (and Coherence)

Data quality characteristics are an essential part of working with datasets, as they can help determine which datasets are reliable enough for use in decision-making processes or business operations. Keeping these key characteristics in mind when dealing with datasets can help ensure trust and reliability, which can ultimately save time by eliminating erroneous results caused by poor data quality management practices.

DATA VALIDITY AND LEGITIMACY

1. Data Validity refers to the degree to which your data conforms to defined business rules or constraints.
2. Data should be collected according to the defined parameters and should conform to the right format and fall within the right range.
3. Accurate and reliable data is critical for businesses to make informed decisions and avoid costly mistakes. However, ensuring data validity can be a complex and time-consuming process.
4. For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to a set of options, and open answers are not permitted. Any answers other than these would not be considered valid or legitimate based on the survey's requirements. Example: Feb 30 is not a valid response for a date entry.
5. Data is legitimate when it is safe, valid, and fit to use. Data must be generated from a credible and reliable source.

DATA ACCURACY AND UNIFORMITY

1. This characteristic refers to the exactness of the data. You need to ensure your data is close to the true values. This also means that the data is error-free and has a reliable and consistent source of information.
2. For example, if two people were trying to measure something like customer satisfaction but had differing interpretations of what constituted satisfaction, they could end up with conflicting results if they weren't using an accurate definition of satisfaction.
3. As another example, accuracy in healthcare might be more important than in another industry (which is to say, inaccurate data in healthcare could have more serious consequences) and, therefore, justifiably worth higher levels of investment.
4. Outliers are values within a dataset that vary greatly from the others - they're either much larger or significantly smaller. Outliers are considered invalid data and must be removed from the dataset.
5. Uniqueness accounts for the amount of duplicate data in a dataset. For example, when reviewing customer data, you should expect that each customer has a unique customer ID.

DATA COMPLETENESS AND COMPREHENSIVENESS

1. Data Completeness refers to the degree to which all required data is supplied and known. It refers to the extent to which a dataset has all the relevant and necessary information for a given purpose.
2. This represents the amount of data that is usable or complete. If there is a high percentage of missing values, it may lead to a biased or misleading analysis if the data is not representative of a typical data sample.
3. Data completeness means that all the necessary fields of information are present and accurate. A complete dataset should not have any missing, duplicated, or irrelevant values that could affect the analysis.
4. Data completeness is important for a variety of reasons. Incomplete data can lead to inaccurate conclusions and decisions, which can have serious implications for both businesses and individuals.
5. For example, incomplete customer records can result in inaccurate customer segmentation, leading to inefficient marketing campaigns that target the wrong people. Incomplete datasets can cause systemic errors due to incorrect relationships between entities in a given dataset.

DATA CONSISTENCY AND RELIABILITY

1. Data consistency is a crucial aspect that ensures the accuracy and reliability of data. High-quality data also requires the data to be consistent across different systems.
2. Data inconsistency occurs when different values of the same data exist in different systems, which can result in incomplete or inaccurate information being presented. This logic can also be applied to relationships between data. For example, the number of employees in a department should not exceed the total number of employees in a company.
3. Data must be consistent within the same dataset and/or across multiple datasets. Data is considered consistent if two or more values in different locations are identical.
4. Many systems in today's environments use and/or collect the same source data. Regardless of what source collected the data or where it resides, it cannot contradict a value residing in a different source or collected by a different system. There must be a stable and steady mechanism that collects and stores the data without contradiction or unwarranted variance.
5. Data reliability means that when information is collected from different sources or over multiple time periods, it should be consistent and produce similar results.

DATA TIMELINESS AND ACCESSIBILITY

1. Data timeliness, as the name implies, refers to how up-to-date information is. Data that is not current or up-to-date can lead to inaccurate results, outdated assumptions, and incorrect decisions. This is especially true in fields such as business intelligence, finance, healthcare, marketing, and analytics.
2. There must be a valid reason to collect the data to justify the effort required, which also means it has to be collected at the right moment in time. Data collected too soon or too late could misrepresent a situation and drive inaccurate decisions.
3. Data timeliness also refers to the availability and accessibility of the selected data. Let's say our sales report is going to be used for weekly employee reviews, but our report is only refreshed once a month.
4. This error in refreshing the data would cause our report to become outdated, which would have serious consequences for employee reviews.
5. The timeliness of information is an important data quality characteristic, because information that isn't timely can lead to people making the wrong decisions.

DATA RELEVANCE AND COHERENCE

1. Data Relevance is another trait of quality data. When collecting information, a data analyst must consider whether the data being assembled is really necessary or relevant for the project.
2. Data relevance is an essential factor in determining the quality of data. It refers to how pertinent the data is to a particular application or business purpose. High-quality data is invariably relevant to its intended use and contains only information that is necessary and appropriate for achieving desired results. For example, when reviewing data related to sales revenue per customer, information such as customer birthdays and other personal information might also be included. By deciding early to exclude the personal information from the data set, the analyst saves themselves from having to review unnecessary information.
3. Without data relevance, data accuracy can suffer from distractions such as additional items that are misleading or without importance to the present context; this can lead to misdirection and suboptimal outcomes.
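The dirty-data types and quality characteristics above lend themselves to simple programmatic checks. The following is an illustrative sketch, not a prescribed method: the records and the outlier threshold are invented, and it flags duplicates, incomplete entries, invalid dates (the "Feb 30" case from the validity discussion), and a naive outlier.

```python
from datetime import date

# Invented records, each deliberately exhibiting one dirty-data issue.
records = [
    {"id": 1, "score": 72, "birthdate": (2000, 2, 29)},
    {"id": 1, "score": 75, "birthdate": (1999, 5, 1)},    # duplicate id
    {"id": 3, "score": None, "birthdate": (1998, 7, 12)}, # incomplete
    {"id": 4, "score": 2, "birthdate": (2001, 2, 30)},    # outlier + invalid date
]

def is_valid_date(y, m, d):
    try:
        date(y, m, d)  # raises ValueError for impossible dates like Feb 30
        return True
    except ValueError:
        return False

ids = [r["id"] for r in records]
duplicates = sorted({i for i in ids if ids.count(i) > 1})           # duplicate data
incomplete = [r["id"] for r in records if r["score"] is None]       # incomplete data
invalid = [r["id"] for r in records if not is_valid_date(*r["birthdate"])]  # validity

scores = [r["score"] for r in records if r["score"] is not None]
mean = sum(scores) / len(scores)
outliers = [s for s in scores if abs(s - mean) > 40]  # naive distance-from-mean rule
print(duplicates, incomplete, invalid, outliers)  # [1] [3] [4] [2]
```

Real pipelines use the same idea with sturdier rules (constraint checks, profiling reports, statistical outlier tests) rather than a fixed threshold.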
DATA QUALITY REVIEWER

1. Data quality determines the usability and trustworthiness of data.
2. The characteristics of quality data are validity, accuracy, completeness, consistency, timeliness, relevance, and reliability.
3. Data quality issues include incomplete, duplicate, outdated, insecure, inaccurate, incorrect, inconsistent, and outlier data.
4. Data Validity refers to the degree to which your data conforms to defined business rules or constraints.
5. Data Accuracy ensures that your data is close to the true values.
6. Data consistency ensures your data is stable within the same data set and/or across multiple data sets. Data consistency occurs when aggregated data is reconciled with detailed data at lower levels of granularity.
7. Data Uniformity refers to the degree to which data is specified using the same unit of measure.
8. The data set must be updated or refreshed to replace obsolete data with newer data.
9. Data duplication, also known as data redundancy, occurs when the same information is entered multiple times, sometimes in different formats. Duplicates can be avoided by implementing record validation checks within a program to ensure that a record does not already exist before it is added to a dataset or database.
10. Outlier data refers to values that differ significantly from the other values in your data set. Outlier data refers to observations or data points that deviate significantly from the rest of the dataset.
11. Insecure data refers to sensitive data that is not encrypted or access controlled.
12. Incomplete data occurs when you don't have data stored for certain variables or data items.
13. Incorrect data can easily be prevented when data validation is in place.
14. Inconsistent data occurs when there are multiple tables within a database that deal with the same data but may receive it from different inputs.
15. Inaccurate data refers to data that contains errors and discrepancies that deviate from the true or expected values.
16. Constructive: the transformation process where data items are added, copied, or replicated.
17. Destructive: the transformation process where data items or records are trimmed or deleted.
18. Aesthetic: the transformation process where certain values are standardized to meet requirements or parameters.
19. Structural: the transformation process which includes columns being renamed, moved, and combined.
20. Data Cleaning is also known as data cleansing and data scrubbing.
21. Garbage-In Garbage-Out, or GIGO, simply means the quality of the output is determined by the quality of the input.
22. Data completeness is likely to be achieved when you make the important fields mandatory in the data entry form and data model.
23. Data timeliness refers to data that is available when it is required.
24. At the discovery stage, data teams work to understand, identify, and find all applicable raw data and data types that need to be transformed. Data discovery includes identifying and understanding data in its original source format with the help of data profiling tools.
25. At the data mapping stage, data teams determine how individual fields are matched, filtered, joined, modified, and aggregated.
26. At the extraction stage, data teams move data from its source system into the staging areas.
27. At the code generation and execution stage, data teams generate code based on the mapping process using a programming language.
28. At the review output stage, the transformed data is evaluated by the data teams to ensure the conversion has had the desired results in terms of the format of the data.
29. At the send-to-target stage, the transformed data is sent to its target destination.
30. Data profiling involves identifying patterns and inconsistencies in data. Data profiling helps identify data quality issues and assess the overall quality of the data.

DATA AND INFORMATION REVIEWER

1. Data serves as the raw material for information. Data undergoes processing to become information. Data isn't sufficient for decision-making, but you can make decisions based on information.
2. Qualitative data refers to the type of data that can't be measured or counted in the form of numbers.
3. Quantitative data refers to the type of data that can be quantified, counted or measured, and given a numerical value.
4. There are four data categories: nominal, ordinal, discrete, and continuous.
5. Nominal data refers to the data category that has no quantitative value and no inherent order or ranking among its categories. Examples of nominal data: color of hair, types of vehicles, and nationality.
6. Ordinal data refers to the data category that possesses a natural order or ranking among its categories. Ordinal data can help to compare one item with another by ranking or ordering.
7. Examples of ordinal data: ranking of winners, letter grades, and educational level.
8. Discrete data refers to the data category that takes on specific, countable values and does not include fractions or decimals. Examples of discrete data: days in a week, number of employees, and number of books.
9. Continuous data refers to the data category whose values can be divided into subdivisions and smaller pieces. Examples of continuous data: average score, height, weight, temperature, and wifi frequency.
10. Not all NUMBERS are numeric. Some numbers, like rank numbers, room numbers, and phone numbers, are non-numeric.

STRUCTURED AND UNSTRUCTURED DATA REVIEWER

1. Structured data is highly organized and formatted so that it's easily searchable in relational databases.
2. Structured data is commonly stored in databases or data warehouses, while unstructured data is stored in data lakes (storage).
3. Scientific data is an example of machine-generated unstructured data, while MP3 audio files are an example of human-generated unstructured data.
4. Structured data requires less storage, while unstructured data requires more storage. Structured data is more secure compared to unstructured data.
5. Structured data can be displayed in rows, columns, and relational databases, while unstructured data does not conform to a specific format and is not easily analyzable using traditional methods.
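Finally, the semi-structured category reviewed earlier (tags and keys, but no fixed table) can be illustrated with JSON. This is a hypothetical sketch using Python's standard json module; the records and values are invented, and the point is that fields may vary per record, so access has to tolerate missing keys rather than assume a schema.

```python
import json

# Semi-structured: keys act as semantic markers, but records need not share
# the same fields or field order (all values here are invented).
payload = """
[
  {"name": "Ana", "email": "ana@example.com"},
  {"email": "ben@example.com", "name": "Ben", "phone": "555-0100"}
]
"""
records = json.loads(payload)

# Unlike a relational column, "phone" may simply be absent, so use .get().
phones = [r.get("phone", "n/a") for r in records]
print(phones)  # ['n/a', '555-0100']
```

The same tolerance for missing or reordered fields applies to XML and HTML, the other semi-structured formats named in the comparison of data variants.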
