
II. DATA SCIENCE
OVERVIEW OF DATA SCIENCE
• Activity 2.1 - Define:
• Data science?
• Data and Information
• Big data?
• What is the role of data in emerging technologies?
• Data Science is a multi-disciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Much more than just analyzing data.
• Offers a range of roles and requires a range of skills (mathematical, programming, analytical, …)
OVERVIEW OF DATA SCIENCE …
• Example:
• Consider data involved in buying a box of cereal from the store or supermarket:
• Your data here is the planned purchase, written down somewhere (e.g., a shopping list)
• When you get to the store, you use that piece of data to remind yourself about what
you need to buy and pick it up and put it in your cart.
• At checkout, the cashier scans the barcode on your box and the cash register logs the
price.
• Back in the warehouse, a computer informs the stock manager that it is time to order
this item from the distributor, because your purchase took the last box in the store.
• You may have a coupon for your purchase and the cashier scans that too, giving you a
predetermined discount.
OVERVIEW OF DATA SCIENCE …
• Example:
• At the end of the week, a report of all the scanned manufacturer coupons gets uploaded
to the cereal company so they can issue a reimbursement to the grocery store for all of
the coupon discounts they have handed out to customers.
• Finally, at the end of the month, a store manager looks at a colorful collection of pie
charts showing all the different kinds of cereal that were sold and, on the basis of strong
sales of cereals, decides to offer more varieties of these on the store’s limited shelf
space next month.
• So, the small piece of information in your notebook ended up in many different places
• Notably on the desk of a manager as an aid to decision making.
• The data went through many transformations.
OVERVIEW OF DATA SCIENCE …
• Example …
• In addition to the computers where the data might have stopped by or stayed on for
the long term, lots of other pieces of hardware—such as the barcode scanner—were
involved in collecting, manipulating, transmitting, and storing the data.
• In addition, many different pieces of software were used to organize, aggregate,
visualize, and present the data.
• Finally, many different human systems were involved in working with the data.
• People decided which systems to buy and install, who should get access to what kinds
of data, and what would happen to the data after its immediate purpose was fulfilled.
• Data science has evolved into one of the most promising and in-demand career paths.
• Professionals use advanced techniques for analyzing large volumes of data.
• They are also skilled in communicating results to their non-technical counterparts.
OVERVIEW OF DATA SCIENCE …
• Skills important for data science:
• Statistics
• Linear algebra
• Programming knowledge with a focus on data warehousing, data mining, and data modeling
OVERVIEW OF DATA SCIENCE …
• Activity 2.2
• Describe in some detail the main disciplines that contribute to data science.
• Write a small report on the role of data scientists.
DATA VS INFORMATION
• Data: a representation of facts, concepts, or instructions in a formalized manner, which
should be suitable for communication, interpretation, or processing by humans or
electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented as groups of non-random symbols in the form of text, images, voice, or videos
representing quantities, actions, and objects.
• Information is the processed/interpreted data on which decisions and actions are based.
• It is data that has been processed into a form that is meaningful to the recipient and is of
real or perceived value in the current or prospective actions or decisions of the recipient.
• It is interpreted data; created from organized, structured, and processed data in a
particular context.
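• To make the distinction concrete, below is a minimal Python sketch (with illustrative values) contrasting raw data with the information derived from it once it is processed in a particular context:

```python
# Data: unprocessed facts and figures (daily sales amounts, no context).
daily_sales = [120, 95, 143, 110, 180, 75, 160]

# Information: the same data processed into a form that is meaningful to
# the recipient and supports a decision.
average = sum(daily_sales) / len(daily_sales)
best_day = daily_sales.index(max(daily_sales))  # 0 = Monday in this sketch

print(f"Average daily sales: {average:.2f}")
print(f"Best-selling day of the week (0 = Monday): {best_day}")
```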
DATA PROCESSING CYCLE
• Data processing: is the restructuring or reordering of data by people or machines to
increase its usefulness and add value for a particular purpose.
• Consists of the following basic steps: input, processing, and output, in that order.

• Input − input data is prepared in some convenient form for processing.


• The form will depend on the processing machine. For example, when electronic computers are used,
the input data can be recorded on any one of several types of input media, such as magnetic
disks, tapes, and so on.
DATA PROCESSING CYCLE
• Processing - input data is changed to produce data in a more useful form.
• For example, pay-checks can be calculated from the time cards, or a summary of sales for the month
can be calculated from the sales orders.

• Output − the result of the preceding processing step is collected.


• The particular form of the output data depends on the use of the data. For example, output data may be
pay-checks for employees.
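• The cycle can be sketched in a few lines of Python, reusing the pay-check example above (all names, hours, and rates are hypothetical):

```python
# Input: data prepared in a convenient form for processing.
time_cards = {"alice": 40, "bob": 35}   # hours worked per employee
hourly_rate = 15.0                      # hypothetical flat rate

# Processing: input data is changed into a more useful form.
pay_checks = {name: hours * hourly_rate for name, hours in time_cards.items()}

# Output: the result of the preceding processing step is collected.
for name, amount in pay_checks.items():
    print(f"Pay-check for {name}: {amount:.2f}")
```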

• Activity 2.3
• Discuss the main differences between data and information with examples.
• Can we process data manually using a pencil and paper? Discuss the differences with
data processing using the computer.

DATA TYPES AND THEIR REPRESENTATION
• Data types can be described from diverse perspectives.
1. Computer science and programming perspective:
• A data type is an attribute of data that tells the compiler or interpreter how the
programmer intends to use the data.
• Almost all programming languages explicitly include the notion of data type, though
different languages may use different terminology.
• Common data types include:
• Integers: store whole numbers
• Booleans: store one of two values: true or false
• Characters: store a single character (numeric, alphabetic, symbol, …)
• Floating-point numbers: store real numbers
• Alphanumeric strings: store a combination of characters and numbers.
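• The same data types as they appear in Python (a sketch; Python infers types at runtime, while statically typed languages declare them explicitly):

```python
age = 25            # integer: stores a whole number
is_valid = True     # Boolean: one of two values, True or False
grade = "A"         # character (Python represents it as a 1-character string)
pi = 3.14159        # floating-point number: stores a real number
user_id = "user42"  # alphanumeric string: characters and numbers combined

for value in (age, is_valid, grade, pi, user_id):
    print(type(value).__name__, value)
```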
DATA TYPES AND THEIR REPRESENTATION …
• A data type:
• Constrains the values that an expression (such as a variable or a function) might take.
• Defines the operations that can be performed on the data, the meaning of the data, and the way values
of that data type can be stored/represented.

2. Data types from a Data Analytics perspective
• From a data analytics point of view, there are three common data types or structures:
• Structured, semi-structured, and unstructured data.
• The following slides describe these three types of data, as well as metadata.
DATA TYPES AND THEIR REPRESENTATION …

Data types from a data analytics perspective


• Structured Data: is data that adheres to a pre-defined data model and is therefore straightforward
to analyze.
• Structured data conforms to a tabular format with a relationship between the different rows and
columns.
• Common examples of structured data are Excel files or SQL databases.
• Each of these has structured rows and columns that can be sorted.
• Structured data is considered the most ‘traditional’ form of data storage, since the earliest versions
of database management systems (DBMS) were able to store, process and access structured data.
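• A minimal sketch of structured data using Python's built-in sqlite3 module (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("cereal", 12), ("milk", 30), ("bread", 15)])

# Because the rows and columns conform to a pre-defined model (schema),
# the data is straightforward to sort, query, and analyze.
for row in conn.execute(
        "SELECT product, quantity FROM sales ORDER BY quantity DESC"):
    print(row)
conn.close()
```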
DATA TYPES AND THEIR REPRESENTATION …
• Unstructured Data: is information that either does not have a predefined data model or is not organized
in a pre-defined manner.
• Unstructured information is typically text-heavy but may contain data such as dates, numbers, and
facts as well.
• This results in irregularities and ambiguities that make it difficult to understand using traditional
programs as compared to data stored in structured databases.
• Common examples of unstructured data include audio files, video files, or NoSQL databases.
• Semi-structured Data: is a form of structured data that does not conform to the formal structure of
data models associated with relational databases or other forms of data tables.
• However, it contains tags or other markers to separate semantic elements and enforce hierarchies of
records and fields within the data.
• Therefore, it is also known as a self-describing structure.
• JSON and XML are common forms of semi-structured data.
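• A minimal sketch of semi-structured data: parsing a small, illustrative JSON document whose keys act as self-describing tags that separate semantic elements and express a hierarchy of records:

```python
import json

doc = """
{
  "name": "example-customer",
  "orders": [
    {"item": "cereal", "qty": 2},
    {"item": "milk"}
  ]
}
"""

record = json.loads(doc)
# Note the self-describing structure: the second order simply omits "qty",
# a flexibility that a rigid relational schema would not allow.
for order in record["orders"]:
    print(order.get("item"), order.get("qty", "n/a"))
```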
DATA TYPES AND THEIR REPRESENTATION …
• Metadata – Data about Data: A final category of data type is metadata.
• From a technical point of view, this is not a separate data structure, but it is one of the
most important elements for Big Data analysis and big data solutions.
• Metadata is data about data. It provides additional information about a specific set of
data.
• Example: In a set of photographs, metadata could describe when and where the photos
were taken.
• The metadata then provides fields for dates and locations which, by themselves, can be
considered structured data.
• For this reason, metadata is frequently used by Big Data solutions for initial
analysis.
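• A minimal standard-library sketch of metadata: the file's contents are the data, and the operating system keeps structured metadata about them (size, modification time), much like the date and location fields of a photo:

```python
import os
import time

path = "example.txt"  # hypothetical file, created just for this sketch
with open(path, "w") as f:
    f.write("the data itself")

info = os.stat(path)  # metadata about the file, not the file's contents
print("Size in bytes:", info.st_size)
print("Last modified:", time.ctime(info.st_mtime))
```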
DATA TYPES AND THEIR REPRESENTATION …
• Activity 2.4
➢ Discuss data types from programming and analytics perspectives.
➢ Compare metadata with structured, unstructured, and semi-structured data.
➢ Give at least one example each of structured, unstructured, and semi-structured data types.
THE DATA VALUE CHAIN
• Data Value Chain:
• The Data Value Chain is introduced to describe the information flow within a big data
system as a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
THE DATA VALUE CHAIN …
• Data Acquisition: is the process of gathering, filtering, and cleaning data before it is put
in a data warehouse or any other storage solution on which data analysis can be carried
out.
• Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
• The infrastructure required to support the acquisition of big data must:
• deliver low, predictable latency in both capturing data and in executing queries;
• be able to handle very high transaction volumes, often in a distributed environment; and
• support flexible and dynamic data structures.
THE DATA VALUE CHAIN …
• Data Analysis: is concerned with making the acquired raw data amenable for use in
decision-making as well as in domain-specific usage.
• Data analysis involves exploring, transforming, and modelling data with the goal of
highlighting relevant data, synthesizing and extracting useful hidden information with
high potential from a business point of view.
• Related areas include data mining, business intelligence, and machine learning.
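• A minimal, standard-library Python sketch of data analysis: transforming illustrative raw sales records to highlight the decision-relevant fact hidden in them:

```python
from collections import defaultdict

raw_sales = [
    ("cereal", 12), ("milk", 30), ("cereal", 8), ("bread", 15), ("milk", 22),
]

# Transform: aggregate the quantity sold per product.
totals = defaultdict(int)
for product, qty in raw_sales:
    totals[product] += qty

# Highlight: surface the best-selling product for the decision maker.
best = max(totals, key=totals.get)
print(dict(totals))
print("Best seller:", best)
```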
• Data Curation: is the active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as content
creation, selection, classification, transformation, validation, and preservation.
THE DATA VALUE CHAIN …
• Data curation is performed by expert curators who are responsible for improving the
accessibility and quality of data.
• Data curators (also known as scientific curators, or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable,
and fit their purpose.
• A key trend in the curation of big data is the use of community and crowdsourcing
approaches.
• Data Storage: is the persistence and management of data in a scalable way that satisfies
the needs of applications that require fast access to the data.
• Relational Database Management Systems (RDBMS) have been the main, and almost the
only, solution to the storage paradigm for nearly 40 years.
THE DATA VALUE CHAIN …
• However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that
guarantee database transactions lack flexibility with regard to schema changes, and their
performance and fault tolerance degrade as data volumes and complexity grow, making
RDBMSs unsuitable for many big data scenarios.
• NoSQL technologies have been designed with the scalability goal in mind and present a
wide range of solutions based on alternative data models.
• Data Usage: covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the business activity.
• Data usage in business decision-making can enhance competitiveness through reduction
of costs, increased added value, or any other parameter that can be measured against
existing performance criteria.
ACTIVITY 2.5
➢ Which information flow step in the data value chain do you think is the most labor-intensive? Why?
• Data Acquisition? Analysis? Curation? Storage? Usage?
• Of course, it is Curation!
➢ What are the different data types and their value chain?
BIG DATA: DEFINITION
• Big data is a blanket term for the non-traditional strategies and technologies needed to
gather, organize, process, and extract insights from large datasets.
• While the problem of working with data that exceeds the computing power or storage of
a single computer is not new, the pervasiveness, scale, and value of this type of
computing have greatly expanded in recent years.
• What Is Big Data?
• Big data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.
BIG DATA: DEFINITION …
• Generally speaking, big data is:
• Large datasets
• The category of computing strategies and technologies that are used to handle large datasets.
• In this context, “large dataset” means a dataset too large to reasonably process or store
with traditional tools or on a single computer.
BIG DATA CHARACTERISTICS – THE 4VS
• Big data differs from traditional data in the following ways:
• Volume: large amounts of data (zettabytes, massive datasets), orders of magnitude larger
than traditional datasets.
• Velocity: data is live, streaming, or in motion; the speed at which data moves through the
system. Data frequently flows into the system from multiple sources and is often
processed in real-time.
• Variety: data comes in many different forms and quality levels, and from diverse sources
(social media, server logs, sensors, …).
• Veracity: can we trust the data? How accurate is it? etc.
BIG DATA THE 4VS: INFOGRAPHIC (IBM)
[Image-only slides: IBM's 4Vs infographic illustrating Volume, Velocity, Variety, and Veracity.]
BIG DATA SOLUTIONS: CLUSTERED COMPUTING
• Individual computers are often inadequate for handling big data at most stages.
• Clustered computing is used to better address the high storage and computational needs
of big data.
• Clustered computing is a form of computing in which a group of computers (often called
nodes) connected through a LAN (local area network) behave like a single machine.
• The set of computers is called a cluster.
• The resources of these computers are pooled so that the cluster appears as a single
computer more powerful than any of its individual machines.
BIG DATA SOLUTIONS: CLUSTERED COMPUTING …
• Big data clustering software combines the resources of many smaller machines, seeking
to provide a number of benefits:
• Resource Pooling: Combining the available storage space, CPU and memory is
extremely important.
• Processing large datasets requires large amounts of all three of these resources.
• High Availability: Clusters provide varying levels of fault tolerance and availability
guarantees to prevent hardware or software failures from affecting access to data and
processing.
• Increasingly important for real-time analytics of big data.
• Easy Scalability: Clusters make it easy to scale horizontally by adding more
machines to the group. The system can react to changes in resource requirements
without expanding the physical resources on a machine.
BIG DATA SOLUTIONS: CLUSTERED COMPUTING …
• Using clusters requires a solution for managing cluster membership, coordinating
resource sharing, and scheduling actual work on individual nodes.
• Cluster membership and resource allocation can be handled by software like Hadoop’s
YARN (which stands for Yet Another Resource Negotiator).
• The assembled computing cluster often acts as a foundation that other software
interfaces with to process the data.
• The machines involved in the computing cluster are also typically involved with the
management of a distributed storage system, which we will talk about when we discuss
data persistence.
BIG DATA: ACTIVITY 2.6
➢ List and discuss the characteristics of big data.
➢ Describe the big data life cycle.
➢ Which step do you think is the most useful, and why?
➢ List and describe each technology or tool used in the big data life cycle.
➢ Discuss the three methods of computing over a large dataset.
BIG DATA SOLUTIONS: HADOOP
• Hadoop is an open-source framework intended to make interaction with big data easier.
• It is a framework that allows for the distributed processing of large datasets across
clusters of computers using simple programming models.
• The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical as ordinary computers can be used for
data processing.
• Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically.
• Flexible: It is flexible, and you can store as much structured and unstructured data as you
need.
BIG DATA SOLUTIONS: HADOOP ECOSYSTEM
• Hadoop Ecosystem is a platform or a suite which provides various services to solve the
big data problems.
• Hadoop has an ecosystem that has evolved from its four core components: data
management, access, processing, and storage.
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based Data Processing
• Spark: In-Memory data processing
BIG DATA SOLUTIONS: HADOOP ECOSYSTEM …
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• ZooKeeper: Cluster management
• Oozie: Job Scheduling
BIG DATA SOLUTIONS: HADOOP ECOSYSTEM …
[Image-only slide: diagram of the Hadoop ecosystem components.]
ACTIVITY 2.7: ASSIGNMENT I – B (REPORT + PRESENTATION)
• Discuss the purpose of each Hadoop Ecosystem component.
• Group 1, 3, 5, 7:
• Group 2, 4, 6, 8:
BIG DATA LIFE CYCLE WITH HADOOP
1. Ingesting data into the system
• The first stage of Big Data processing is to Ingest data into the system.
• The data is ingested or transferred to Hadoop from various sources such as relational
databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage.
• The second stage is Processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system, HDFS, and in the NoSQL distributed
database, HBase.
• Spark and MapReduce perform data processing.
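• A minimal PySpark sketch of this processing stage: the classic word count, expressed as map and reduce steps over a distributed dataset (assumes pyspark is installed; the in-line input stands in for data already ingested into HDFS):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.parallelize([
    "big data needs big processing",
    "spark processes big data in memory",
])

counts = (lines.flatMap(lambda line: line.split())  # map: split into words
               .map(lambda word: (word, 1))         # map: emit (word, 1)
               .reduceByKey(lambda a, b: a + b))    # reduce: sum per word

print(counts.collect())
spark.stop()
```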
BIG DATA LIFE CYCLE WITH HADOOP …
3. Computing and analyzing data
• The third stage is to Analyze Data
• Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes it.
• Hive is also based on the map and reduce programming model and is most suitable for
structured data.
4. Visualizing the results
• The fourth stage is access, which is performed by tools such as Sqoop, Hive, Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
