22UCS303 DS-Unit I-N
Text Books:
1. Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas, O'Reilly Media, 2nd edition, 2022.
2. Data Science from Scratch: First Principles with Python, Joel Grus, O'Reilly, 2nd edition, 2019.
UNIT –I
10/3/2024 2
❖ Course Objectives: The course aims to
Course outcomes: At the end of the course, students will be able to
Introduction-Data Science
• Data science is an evolutionary extension of statistics capable of dealing
with the massive amounts of data produced today.
• In "Emerging Role of the Data Scientist and the Art of Data Science", the
authors sifted through hundreds of job descriptions for data scientist,
statistician, and BI (Business Intelligence) analyst to detect the differences
between those titles.
• The main things that set a data scientist apart from a statistician are the
ability to work with big data and experience in machine learning,
computing, and algorithm building.
Introduction-Data Science Contd..
• Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional data
management techniques such as, for example, the RDBMS (relational
database management systems).
• The widely adopted RDBMS has long been regarded as a one-size-fits-all
solution, but the demands of handling big data have shown otherwise.
• Data science involves using methods to analyse massive amounts of
data and extract the knowledge it contains.
Characteristics of Big Data
• The characteristics of big data are often referred to as the three Vs:
■ Volume—How much data is there?
■ Variety—How diverse are different types of data?
■ Velocity—At what speed is new data generated?
• Often these characteristics are complemented with a fourth V, veracity:
How accurate is the data?
• These four properties make big data different from the data found in
traditional data management tools.
Volume
The name Big Data itself refers to enormous size. Big Data consists of vast
volumes of data generated daily from many sources, such as business
processes, machines, social media platforms, networks, human
interactions, and many more.
Variety
Big Data can be structured, semi-structured, or unstructured, and it is
collected from many different sources. In the past, data was collected only
from databases and spreadsheets, but these days it comes in an array of
forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below:
1. Structured data: Structured data follows a defined schema with all the required columns, so it is in
tabular form. It is stored in a relational database management system. OLTP (Online Transaction
Processing) systems are built to work with structured data stored in relations, i.e., tables.
2. Semi-structured data: In semi-structured data, the schema is not strictly defined, e.g., JSON, XML,
CSV, TSV, and email.
3. Unstructured data: Unstructured files such as log files, audio files, and image files are included in
unstructured data. Some organizations have a lot of data available but do not know how to derive
value from it because the data is raw.
4. Quasi-structured data: Textual data with inconsistent formats that can be formatted only with
effort, time, and suitable tools.
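The difference between semi-structured and structured data can be sketched in a few lines of Python. This is only an illustration: the records, field names, and the chosen columns are made up for the example. Each JSON record carries its own field names and some fields are missing, so the records are mapped onto fixed columns to obtain a tabular (structured) form.

```python
import json

# Semi-structured: each record names its own fields, and fields can differ.
# (Hypothetical sample records, not taken from the course material.)
json_text = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["111", "222"]}
]
"""
records = json.loads(json_text)

# Structured: choose a fixed set of columns and fill gaps with None.
columns = ["id", "name", "email"]
table = [tuple(record.get(column) for column in columns) for record in records]

for row in table:
    print(row)
```

Fields outside the chosen columns (here, phones) are dropped, which is one reason semi-structured data does not always fit neatly into tables.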
Veracity
Veracity refers to how reliable the data is. Data can be filtered or
translated in many ways, so veracity also covers the ability to handle and
manage data efficiently. Reliable data is essential in business development.
An example of data whose reliability has to be judged is Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. What matters is not just
any data that we process or store, but valuable and reliable data that we
store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity
refers to the speed at which data is created in real time. It covers the speed of
incoming data sets, the rate of change, and activity bursts. A primary aim of Big
Data is to provide the data that is in demand rapidly.
Big data velocity deals with the speed at which data flows from sources like application
logs, business processes, networks, social media sites, sensors, mobile
devices, etc.
Benefits- Data Science
• Commercial companies
• Governmental organizations
• Nongovernmental organizations (NGOs)
• Universities
Commercial Companies
• Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, competition,
and products.
• A good example of this is Google AdSense, which collects data from
internet users so relevant commercial messages can be matched to the
person browsing the internet.
Governmental Organizations
• Governmental organizations are also aware of data’s value. Many
governmental organizations not only rely on internal data scientists to
discover valuable information, but also share their data with the public.
• You can use this data to gain insights or build data-driven applications.
Data.gov is but one example; it’s the home of the US Government’s
open data.
• A data scientist in a governmental organization gets to work on diverse
projects such as detecting fraud and other criminal activity or
optimizing project funding.
Nongovernmental organizations (NGOs)
• Nongovernmental organizations (NGOs) are also no strangers to using data.
• They use it to raise money and defend their causes. The World Wildlife
Fund (WWF), for instance, employs data scientists to increase the
effectiveness of their fundraising efforts.
• Many data scientists devote part of their time to helping NGOs, because
NGOs often lack the resources to collect data and employ data scientists.
Universities
Universities use data science in their research but also to
enhance the study experience of their students. The rise of
massive open online courses (MOOCs) produces a lot of
data, which allows universities to study how this type of
learning can complement traditional classes.
The big data and data science landscape changes quickly, and
MOOCs allow you to stay up to date by following courses
from top universities.
Use of Big Data:
Big Data is used for many things, but some of the main uses are as follows:
• Big Data enables you to gather information about customers and their experience, and
eventually helps you to align your products and services accordingly.
• It helps predict failures beforehand by analyzing problems and suggesting
potential solutions.
• Big Data is also useful for companies to anticipate customer demand, roll out new plans,
test markets, etc.
• Big Data is very useful in predicting failures by analyzing various indicators
such as unstructured data, error messages, log entries, engine temperature, etc.
• Big Data is also very effective in maintaining operational functions while anticipating
customers' future demands and current market demands, thus providing better results.
Facets of data
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data
Structured data is data that depends on a data model and resides in a fixed
field within a record.
As such, it’s often easy to store structured data in tables within databases or
Excel files (figure 1.1).
SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases.
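As a minimal sketch of querying structured data with SQL, the example below uses Python's built-in sqlite3 module with an in-memory database. The students table, its columns, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Structured data: every record has the same fixed fields (hypothetical table).
conn.execute(
    "CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT, marks INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ravi", 67), (3, "Meena", 91)],
)

# SQL query: students scoring above 80, highest marks first.
top = conn.execute(
    "SELECT name, marks FROM students WHERE marks > 80 ORDER BY marks DESC"
).fetchall()
print(top)

conn.close()
```

Because each field has a fixed place in the record, the query can filter and sort on marks directly without any text interpretation.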
Excel File
Unstructured Data
• Unstructured data is data that isn’t easy to
fit into a data model because the content is
context-specific or varying.
• One example of unstructured data is your
regular email as shown in figure 1.2.
• Although email contains structured
elements such as the sender, title, and body
text, it’s a challenge to find the number of
people who have written an email complaint
about a specific employee because so many
ways exist to refer to a person.
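The point about email can be illustrated with Python's standard email package. The message below is invented for the example. The structured elements (sender, subject) are easy to extract, whereas counting complaints about a person from the body text is not, because the same person is referred to in several ways.

```python
from email import message_from_string

# Hypothetical email; addresses and names are made up for illustration.
raw = """From: customer@example.com
To: support@example.com
Subject: Complaint about service

I spoke to Mr. Kumar yesterday. Kumar was unhelpful, and R. Kumar never called back.
"""

msg = message_from_string(raw)

# Structured elements: available directly as header fields.
sender = msg["From"]
subject = msg["Subject"]

# Unstructured element: free text in the body.
body = msg.get_payload()

# A naive exact-phrase search finds only one of the three mentions.
exact_matches = body.count("Mr. Kumar")
all_mentions = body.count("Kumar")
print(sender, subject, exact_matches, all_mentions)
```

Searching for the surname alone finds all three mentions here, but in real mail it would also match other people with the same name, which is why this kind of analysis is hard.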
Natural Language
• Natural language is a special type of unstructured data; it’s challenging
to process because it requires knowledge of specific data science
techniques and linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion,
and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
Machine-generated data
• Machine-generated data is information that’s
automatically created by a computer,
process, application, or other machine
without human intervention.
• Machine-generated data is becoming a major
data resource and will continue to do so.
• The analysis of machine data relies on highly
scalable tools, due to its high volume and
speed.
• Examples of machine data are web server
logs, call detail records, network event logs,
and telemetry.
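A small sketch of analyzing machine-generated data: the lines below imitate web server log entries in the Common Log Format (the IP addresses, paths, and timestamps are made up). A regular expression pulls out the fields, and the HTTP status codes are counted. Real log volumes would need scalable tools, so this only shows the idea.

```python
import re
from collections import Counter

# Hypothetical web server log lines in Common Log Format.
log_lines = [
    '192.0.2.10 - - [03/Oct/2024:10:15:32 +0530] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.11 - - [03/Oct/2024:10:15:33 +0530] "GET /missing HTTP/1.1" 404 312',
    '192.0.2.10 - - [03/Oct/2024:10:15:35 +0530] "POST /login HTTP/1.1" 200 740',
]

# Named groups for the fields we care about.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

status_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:  # skip lines that do not follow the expected format
        status_counts[match.group("status")] += 1

print(dict(status_counts))
```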
Graph-based or network data
• In mathematical graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent
and store graphical data.
• Graph-based data is a natural way to represent social networks, and its
structure allows you to calculate specific metrics such as the influence of
a person and the shortest path between two people.
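The two metrics mentioned above can be sketched on a tiny, made-up friendship network. The graph is stored as an adjacency list (a dictionary of neighbours). Shortest path is found with breadth-first search, and the number of direct connections is used as a very simple stand-in for influence. Both choices are assumptions for illustration, not the only way to compute these metrics.

```python
from collections import deque

# Hypothetical social network: each person (node) maps to their friends (edges).
friends = {
    "Asha": ["Ravi", "Meena"],
    "Ravi": ["Asha", "Kiran"],
    "Meena": ["Asha", "Kiran"],
    "Kiran": ["Ravi", "Meena", "Divya"],
    "Divya": ["Kiran"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(friends, "Asha", "Divya")

# Simple influence measure: the person with the most direct connections.
most_connected = max(friends, key=lambda person: len(friends[person]))
print(path, most_connected)
```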
Graph-based or network data
• Examples of graph-based data can be found on many social media
websites.
• For instance, on LinkedIn you can see who you know at which company.
• Your follower list on Twitter is another example of graph-based data.
Graph-based or network data
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
• Graph data poses its challenges, but for a computer interpreting audio
and image data, it can be even more difficult.
Audio, image, and video
• Audio, image, and video are data types that pose specific challenges
to a data scientist.
• Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
• Recently a company called DeepMind succeeded at creating an
algorithm that’s capable of learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
Streaming data
• While streaming data can take almost any of the previous forms, it has
an extra property.
• The data flows into the system when an event happens instead of
being loaded into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or
music events, and the stock market.
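The difference between streaming and batch processing can be sketched with Python generators. The price values below are invented and simply stand in for stock market ticks. The running average is updated each time a value arrives, instead of waiting for the whole data set to be loaded first.

```python
def price_stream():
    """Hypothetical stream of stock prices, delivered one event at a time."""
    for price in [100.0, 102.0, 101.0, 105.0]:
        yield price


def running_average(stream):
    """Update the average as each event arrives (no batch load needed)."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count


averages = list(running_average(price_stream()))
print(averages)
```

Only the running total and count are kept in memory, so the same approach works even when the stream never ends.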
Formative Assessment 1
Q1
Which of the following characteristics of big data is of relatively greater
concern to data science?
a) Velocity
b) Variety
c) Volume
d) Veracity
Answer: a) Velocity
Q2
• Data in ____ bytes size is called big data
a) Giga
b) Mega
c) Tera
d) Peta
Answer: d) Peta
Q3
• Transaction data of a bank is a type of _____.
a) Structured data
b) Unstructured data
c) Both a and b
d) Semi-structured data
Answer: a) Transaction data of a bank is structured data.
Q4 Choose the primary characteristics of big data among the following
a) Value
b) Variety
c) Volume
d) All of the above
Answer: d) All of the above
Q5 ___________ is a collection of data that is huge in volume, yet
growing exponentially with time
a) Big Database
b) Big DBMS
c) Big Datafile
d) Big Data
Answer: d) Big Data