
CS213 - 04 - Data Science

The document provides an overview of data types and their representation in data science, discussing structured, semi-structured, and unstructured data. It also outlines the data value chain, detailing stages from data acquisition to usage, and introduces the concept of big data along with its characteristics. Additionally, it defines key terms such as data warehouse, latency, and data lake.


Data Science

CS213 - Introduction to Emerging Technologies

Tsegamlak Molla
Previously

• Data & Information

• The DIKW Pyramid

• Data Quality

• Data Processing


Data Types & Their Representation


Data Type

• Data types can be described from diverse perspectives.

• In computer science and computer programming, for instance, a
data type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.


Data Type

• A data type defines:

• The operations that can be done on the data,

• The meaning of the data,

• The way values of that type can be stored.


Computer Programming Perspective

• Integers (int)
• Is used to store whole numbers, mathematically known as integers

• Booleans (bool)
• Is used to represent values restricted to one of two options: true or false


Computer Programming Perspective

• Characters (char)
• Is used to store a single character

• Floating-point numbers (float)
• Is used to store real numbers

• Alphanumeric strings (string)
• Is used to store a combination of characters and numbers
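A minimal sketch of the types listed above, written in Python purely for illustration. All variable names and values are hypothetical. Note that Python has no separate char type, so a one-character string stands in for it here.

```python
# Illustrative values for each basic type (names and values are made up).
count = 42            # int: a whole number
is_valid = True       # bool: restricted to True or False
grade = "A"           # Python has no char type; a one-character str stands in
temperature = 36.6    # float: an approximation of a real number
student_id = "CS213"  # str: a combination of characters and digits

# The data type decides which operations are meaningful.
total = count + 8                  # integer addition
label = student_id + "-" + grade   # string concatenation

for value in (count, is_valid, grade, temperature, student_id):
    print(type(value).__name__, value)
```

The same `+` symbol adds two integers but joins two strings, which is one way to see that the type determines the operations allowed on a value.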


Data Analytics Perspective

• From a data analytics point of view, it is important to
understand that there are three common data types or structures:
• Structured,

• Semi-structured,

• Unstructured.




Structured Data
• Structured data is data that adheres to a pre-defined data model and
is therefore straightforward to analyze.
• Structured data conforms to a tabular format with a relationship
between the different rows and columns.
• Common examples of structured data are Excel files or SQL
databases.
• Each of these has structured rows and columns that can be sorted.
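
As a small illustration of rows and columns that can be sorted, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and sample rows are assumptions made up for this example.

```python
import sqlite3

# An in-memory SQL database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The pre-defined data model: every row has the same three columns.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO students (id, name, score) VALUES (?, ?, ?)",
    [(1, "Abebe", 78.5), (2, "Sara", 91.0), (3, "Kebede", 64.0)],
)

# Because the structure is known in advance, sorting by a column is a one-line query.
rows = conn.execute("SELECT name, score FROM students ORDER BY score DESC").fetchall()
for name, score in rows:
    print(name, score)

conn.close()
```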



Semi-structured Data

• Semi-structured data is a form of structured data that does not
conform with the formal structure of data models associated
with relational databases or other forms of data tables.

• Nonetheless, it contains tags or other markers to separate
semantic elements and enforce hierarchies of records and fields
within the data.


Semi-structured Data

• Therefore, it is also known as a self-describing structure.

• JSON and XML are common examples of semi-structured data.
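To show what "self-describing" can mean in practice, here is a short sketch that reads the same hypothetical student record from JSON and from XML using only Python's standard library. The field names and values are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The keys (JSON) and tags/attributes (XML) label each value inside the data itself.
json_text = '{"name": "Sara", "courses": [{"code": "CS213", "credits": 3}]}'
xml_text = "<student><name>Sara</name><course code='CS213' credits='3'/></student>"

# JSON: nested objects and arrays form the hierarchy.
record = json.loads(json_text)
first_course = record["courses"][0]["code"]

# XML: nested elements and attributes form the hierarchy.
root = ET.fromstring(xml_text)
xml_name = root.find("name").text
xml_course = root.find("course").get("code")

print(first_course, xml_name, xml_course)
```

Neither format requires a table schema to be declared beforehand, yet both can still be navigated by name.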




Unstructured Data

• Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined
manner.

• Unstructured information is typically text-heavy but may contain
data such as dates, numbers, and facts as well.


Unstructured Data

• This results in irregularities and ambiguities that make it difficult
to understand using traditional programs as compared to data
stored in structured databases.

• Common examples of unstructured data include audio and video
files. NoSQL databases are often used to store such data.
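The sketch below hints at why text-heavy data is harder to process: dates and numbers are present, but a program has to search for them with patterns instead of reading named fields. The sample sentence and the patterns are assumptions chosen for this example, and real free text would need far more robust handling.

```python
import re

# A free-text note with no predefined fields (invented sample).
note = "Met the supplier on 2024-03-15 and agreed to deliver 250 units by 2024-04-01."

# Dates written as YYYY-MM-DD; other date styles would be missed by this pattern.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", note)

# A number that is directly followed by the word "units".
quantities = re.findall(r"\b\d+\b(?= units)", note)

print(dates, quantities)
```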


Metadata – Data about Data

• Metadata is not a separate data structure, but it is one of the
most important elements for big data analysis and big data solutions.

• Metadata is data about data. It provides additional information
about a specific set of data.


Metadata – Data about Data

• In a set of photographs, for example, metadata could describe
when and where the photos were taken.

• The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data.

• For this reason, metadata is frequently used by big data
solutions for initial analysis.
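A minimal sketch of the photograph example: the image contents are never opened, and only the date and location fields are queried. The file names, dates, and locations are hypothetical.

```python
from datetime import date

# Each entry is metadata about a photo, not the photo itself (invented values).
photos = [
    {"file": "img_001.jpg", "taken_on": date(2024, 1, 5), "location": "Addis Ababa"},
    {"file": "img_002.jpg", "taken_on": date(2024, 2, 17), "location": "Bahir Dar"},
    {"file": "img_003.jpg", "taken_on": date(2024, 2, 20), "location": "Addis Ababa"},
]

# An initial analysis that uses only the structured metadata fields.
february_in_addis = [
    p["file"]
    for p in photos
    if p["taken_on"].month == 2 and p["location"] == "Addis Ababa"
]

print(february_in_addis)
```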


Data Value Chain


Data Value Chain

• The data value chain and the data processing lifecycle are two
distinct concepts related to the management and utilization of
data.



Data Value Chain

• The data value chain refers to the end-to-end process of
creating value from data within an organization.

• It encompasses the entire lifecycle of data, starting from its
collection or acquisition to its eventual utilization for
decision-making and generating insights.


Data Value Chain

• In other words, it categorizes all of the various steps required to
transform raw data into useful insights.

• The data value chain typically includes the following stages:


Data Value Chain

1. Data Acquisition

2. Data Analysis

3. Data Curation

4. Data Storage

5. Data Usage

• It describes connections between each step that change
low-value inputs into high-value outputs.
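The five stages can be pictured as functions chained together, each taking the previous stage's output. This is only a toy sketch under strong assumptions: the function names, the sample temperature readings, and what each stage does are all invented for illustration, and real systems are far more involved.

```python
def acquire():
    # Acquisition: gather raw readings and drop empty ones (hypothetical sensor strings).
    raw = ["21.5", "", "22.0", "bad", "23.5"]
    return [r for r in raw if r]

def analyze(values):
    # Analysis: make raw strings usable as numbers; unparsable entries become None.
    parsed = []
    for v in values:
        try:
            parsed.append(float(v))
        except ValueError:
            parsed.append(None)
    return parsed

def curate(values):
    # Curation: keep only entries that meet the quality requirement.
    return [v for v in values if v is not None]

def store(values, storage):
    # Storage: persist the curated values (a dict stands in for a database).
    storage["temperatures"] = list(values)
    return storage

def use(storage):
    # Usage: turn stored data into a figure someone can act on.
    temps = storage["temperatures"]
    return sum(temps) / len(temps)

storage = {}
average = use(store(curate(analyze(acquire())), storage))
print(round(average, 2))
```

The low-value input is a list of messy strings, and the high-value output is a single average that can support a decision.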


Data Acquisition

• It is the process of gathering, filtering, and cleaning data before
it is put in a data warehouse or any other storage solution on
which data analysis can be carried out.

• Data acquisition is one of the major big data challenges in terms
of infrastructure requirements.
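A small sketch of gathering, filtering, and cleaning before storage. The two sources, the e-mail style records, and the validity rule (an entry must contain "@") are assumptions made for this example only.

```python
# Gathering: combine records from two hypothetical sources.
source_a = ["  Alice@Example.com ", "bob@example.com", ""]
source_b = ["BOB@example.com", "not-an-email", "carol@example.com"]
gathered = source_a + source_b

# Filtering: drop entries that cannot be valid under the assumed rule.
filtered = [entry for entry in gathered if "@" in entry]

# Cleaning: normalise whitespace and letter case, then remove duplicates
# while keeping the original order.
normalised = [entry.strip().lower() for entry in filtered]
cleaned = list(dict.fromkeys(normalised))

print(cleaned)
```

Only the `cleaned` list would be handed on to a data warehouse or other storage solution.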


Data Acquisition Infrastructure

• Must deliver low, predictable latency in both capturing data and
in executing queries.

• Must be able to handle very high transaction volumes, often in a
distributed environment.

• Must support flexible and dynamic data structures.


Data Analysis

• It is concerned with making the raw data acquired amenable to
use in decision-making as well as domain-specific usage.


Data Analysis

• Data analysis involves exploring, transforming, and modeling
data with the goal of highlighting relevant data, synthesizing
and extracting useful hidden information with high potential
from a business point of view.

• Related areas include data mining, business intelligence, and
machine learning.
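As a very small example of exploring and transforming data to highlight something useful for a business, the sketch below groups hypothetical sales figures by region and compares their averages. The regions and amounts are invented.

```python
import statistics
from collections import defaultdict

# Raw records: (region, sale amount). Values are made up for illustration.
sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 150.0),
    ("south", 95.0),
    ("east", 60.0),
]

# Transform: group the amounts by region.
by_region = defaultdict(list)
for region, amount in sales:
    by_region[region].append(amount)

# Summarise: the mean sale per region highlights where performance is strongest.
mean_by_region = {region: statistics.mean(amounts) for region, amounts in by_region.items()}
best_region = max(mean_by_region, key=mean_by_region.get)

print(mean_by_region, best_region)
```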


Data Curation

• It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective
usage.


Data Curation

• Data curation is performed by expert curators who are responsible for
improving the accessibility and quality of data.

• Data curators hold the responsibility of ensuring that data are
trustworthy, discoverable, accessible, reusable and fit for their purpose.

• A key trend in the curation of big data is the use of community and
crowdsourcing approaches.
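Part of a curator's work can be supported by automated quality checks. The sketch below flags records that break three assumed rules (a name must be present, an age must lie between 0 and 120, and an id must be unique). The rules, field names, and records are hypothetical and would differ for every dataset.

```python
records = [
    {"id": 1, "name": "Sara", "age": 21},
    {"id": 2, "name": "", "age": 19},
    {"id": 3, "name": "Kebede", "age": -4},
    {"id": 1, "name": "Sara", "age": 21},
]

def quality_issues(record, seen_ids):
    """Return a list of rule violations for one record (assumed rules)."""
    issues = []
    if not record["name"]:
        issues.append("missing name")
    if not 0 <= record["age"] <= 120:
        issues.append("age out of range")
    if record["id"] in seen_ids:
        issues.append("duplicate id")
    return issues

seen = set()
report = {}
for index, record in enumerate(records):
    report[index] = quality_issues(record, seen)
    seen.add(record["id"])

# Records with no issues are considered fit for use.
clean_records = [r for i, r in enumerate(records) if not report[i]]
print(report)
```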


Data Storage

• It is the persistence and management of data in a scalable way
that satisfies the needs of applications that require fast access to
the data.

• Relational database management systems (RDBMSs) have been the
main, and almost the only, solution to the storage paradigm for
nearly 40 years.


Data Storage

• However, RDBMSs built around the ACID (Atomicity, Consistency,
Isolation, and Durability) guarantees for database transactions lack
flexibility with regard to schema changes, and their performance
and fault tolerance suffer when data volumes and complexity grow,
which makes them less suitable for big data scenarios.
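To make one of the ACID guarantees concrete, the sketch below shows atomicity using Python's built-in sqlite3 module: a transfer that fails halfway leaves no partial change behind. The accounts table, balances, and the rule that a balance cannot go negative are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

try:
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if an exception escapes it.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE name = 'b'")
        # This second step violates the CHECK constraint and raises an error.
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'a'")
except sqlite3.IntegrityError:
    pass  # the whole transfer was rolled back, including the first update

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
conn.close()
```

Guarantees like this are valuable, but coordinating them across many machines is part of what becomes costly as data volumes grow.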


Data Storage

[Figure: summary of how the ACID properties can be a bottleneck]


Data Storage

• NoSQL technologies have been designed with the scalability
goal in mind and present a wide range of solutions based on
alternative data models.
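One alternative data model is the document model, where each record carries its own set of fields. The sketch below imitates that idea with a plain Python dictionary. It is not the API of any real NoSQL product, and the keys and fields are invented.

```python
# A toy "document store": keys map to documents, and documents need not share a schema.
documents = {}
documents["u1"] = {"name": "Sara", "email": "sara@example.com"}
documents["u2"] = {"name": "Kebede", "tags": ["admin", "staff"]}

# A new field can appear on one document without altering the others.
documents["u3"] = {"name": "Abebe", "email": "abebe@example.com", "city": "Adama"}

# Queries must tolerate missing fields, which is the price of the flexibility.
with_email = [key for key, doc in documents.items() if "email" in doc]
print(with_email)
```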


Data Usage

• It covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity.


Data Usage

• Data usage in business decision-making can enhance
competitiveness through the reduction of costs, increased
added value, or any other parameter that can be measured
against existing performance criteria.


[Figure: Data Value Chain Summary]


Basic Concepts of Big Data


Big Data

• Big data is a blanket term for the non-traditional strategies and
technologies needed to gather, organize, process, and extract
insights from large datasets.

• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.


Big Data

• A “large dataset” means a dataset too large to reasonably
process or store with traditional tooling or on a single
computer.

• This means that the common scale of big datasets is constantly
shifting and may vary significantly from organization to
organization.
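One common response to data that does not fit comfortably in memory is to process it as a stream, one record at a time, keeping only a small running summary. The sketch below is an assumption-laden illustration of that idea: an in-memory text buffer stands in for a very large file, and the function name is made up.

```python
import io

def running_mean(lines):
    """Compute a mean while holding only two numbers in memory."""
    count = 0
    total = 0.0
    for line in lines:          # lines are read one at a time, never all at once
        line = line.strip()
        if not line:
            continue
        total += float(line)
        count += 1
    return total / count if count else 0.0

# A tiny stand-in for a file that would be too large to load in full.
stream = io.StringIO("10\n20\n30\n40\n")
mean_value = running_mean(stream)
print(mean_value)
```

Streaming alone does not make a single computer sufficient for big data, but it shows why tooling has to change as datasets grow.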


Characteristics of Big Data

• Big data is characterized by the 3 Vs and more:

• Volume
• Large amounts of data (zettabytes, massive datasets)

• Velocity
• Data is live streaming or in motion


Characteristics of Big Data

• Variety
• Data comes in many different forms from diverse sources

More V’s:

• Veracity
• Can we trust the data?

• How accurate and credible is it?




Find out which other ‘V’s are considered


Some Key Terms


Data Warehouse

• Is a central repository of information that can be analyzed to
make more informed decisions.

• Data flows into a data warehouse from transactional systems,
relational databases, and other sources, typically on a regular
frequency.


Latency

• The delay before a transfer of data begins following an
instruction for its transfer.

• The time it takes for a packet of data to travel from a source to a
destination.
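Latency can be observed by timing how long an operation takes to respond. In the sketch below a short sleep stands in for a real data transfer. The function names and the 10 ms delay are assumptions for illustration.

```python
import time

def measure_latency(operation):
    """Run an operation and return its result with the elapsed time in seconds."""
    start = time.perf_counter()
    result = operation()
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_transfer():
    # Simulated delay before the data arrives (not a real network call).
    time.sleep(0.01)
    return "payload"

result, latency_seconds = measure_latency(fake_transfer)
print(f"received {result!r} after {latency_seconds * 1000:.1f} ms")
```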


Data Lake

• Is a centralized repository designed to store, process, and secure
large amounts of structured, semi-structured, and unstructured
data.

• It can store data in its native format and process any variety of
it, ignoring size limits.


Any Questions?



Extra Content

• List and describe each technology or tool used in the big data
life cycle.



Next Time
Distributed Systems (Clustered Computing)
