AD3491-Unit 1
PART A
1. What is big data?
Big data is data of such high volume, velocity, and variety that it cannot be
processed by traditional data processing systems.
It is characterized by the 7 Vs: velocity, variety, volume, variability,
visualization, value and veracity.
Contents
Big Data
Data Science
Benefits and Uses:
1. Commercial Companies
2. Human Resource Professionals
3. Financial Institutions
4. Government Organizations
5. Non-governmental organizations (NGOs)
6. Universities
Data Science Tools
Real Time Applications of Data Science
Data
Data is a collection of discrete values that convey information,
describing quantities, qualities, facts, and statistics.
Big data
Big data is data of such high volume, velocity, and variety that it
cannot be processed by traditional data processing systems.
It is characterized by the 7 Vs: velocity, variety, volume,
variability, visualization, value and veracity.
Data science
Data science is the field of study of data. It uses modern scientific
techniques, statistical methods, and algorithms to derive insights from
huge volumes of data and to create business and IT strategies.
It deals with where the data comes from, what it represents, and the ways
in which it can be transformed into valuable inputs and resources.
2. List and explain the facets of data or different types of data or categories of data.
Contents
1. Structured
2. Unstructured
3. Natural Language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
7. Streaming
Categories of data:
1. Structured data
Structured data is data that depends on a data model and resides in a fixed
field within a record.
It’s easy to store structured data in tables within databases or Excel files.
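The idea of fixed fields within a record can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only; the table name, column names, and rows are made up for the example and do not come from the notes.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: every record has the same three fixed fields.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (id, name, age) VALUES (?, ?, ?)",
    [(1, "Asha", 31), (2, "Ravi", 45)],
)

# Because the fields are fixed, filtering on a field is straightforward.
rows = conn.execute("SELECT name, age FROM customers WHERE age > 40").fetchall()
print(rows)
```

Because every record follows the same model, a query can address a field by name without inspecting the content of each record.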
PREPARED BY: Mrs.S.MAHALAKSHMI AP/AI&DS 9
2. Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying.
Example: a regular email (Figure 1.2).
Although the email in Figure 1.2 contains structured elements such as the
sender, title, and body text, it is a challenge to find the number of people
who have written an email complaint about a specific employee, because so
many ways exist to refer to a person.
3. Natural language
Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques and
linguistics.
The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but
models trained in one domain often don’t generalize well to other domains.
4. Machine-generated data
Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human intervention.
The analysis of machine data relies on highly scalable tools, due to its high
volume and speed.
Examples - web server logs, call detail records, network event logs, and telemetry
(Figure 1.3).
The machine data in Figure 1.3 would fit nicely in a classic
table-structured database.
This isn’t the best approach for highly interconnected or “networked” data,
where the relationships between entities have a valuable role to play.
5. Graph-based or network data
“Graph” refers to mathematical graph theory.
In graph theory, a graph is a mathematical structure used to model pairwise
relationships between objects.
Graph or network data is data that focuses on the relationship or
adjacency of objects.
Graph structures use nodes, edges, and properties to represent and store
graph data.
Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
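The nodes, edges, and properties described above can be sketched in plain Python. This is a toy in-memory structure, not a graph database; the node names, properties, and the neighbors helper are hypothetical and chosen only for illustration.

```python
# Hypothetical toy graph: names and properties are made up for illustration.
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "carol": {"type": "person"},
}

# Each edge is (source, target, properties).
edges = [
    ("alice", "bob", {"relation": "knows"}),
    ("bob", "carol", {"relation": "follows"}),
]


def neighbors(node):
    """Return the nodes adjacent to `node`, treating edges as undirected."""
    found = set()
    for source, target, _props in edges:
        if source == node:
            found.add(target)
        elif target == node:
            found.add(source)
    return found


bob_neighbors = neighbors("bob")
print(sorted(bob_neighbors))
```

The question "who is adjacent to bob?" is answered by following edges rather than by filtering rows, which is the kind of query graph data is suited to.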
6. Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a
data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn
out to be challenging for computers.
High-speed cameras at stadiums will capture ball and athlete movements to
calculate in real time, for example, the path taken by a defender relative to
two baselines.
Recently a company called DeepMind succeeded at creating an algorithm
that’s capable of learning how to play video games.
This algorithm takes the video screen as input and learns to interpret everything
via a complex process of deep learning.
This prompted Google to buy the company for their own Artificial Intelligence
(AI) development plans.
7. Streaming data
The data flows into the system in a continuous manner when an event happens
instead of being loaded into a data store in a batch.
Examples - “What’s trending” on Twitter, live sporting or music events, and
the stock market.
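The difference between streaming and batch loading can be sketched with a Python generator. The event_stream function below is a hypothetical stand-in for a live feed, and the hashtags are made-up sample data; a real system would read from a network source instead.

```python
from collections import Counter


def event_stream():
    """Hypothetical stand-in for a live feed; yields one event at a time."""
    for tag in ["#ai", "#data", "#ai", "#python", "#ai", "#data"]:
        yield tag


def running_top_tag(stream):
    """Update the counts as each event arrives and report the current leader."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
        yield tag, counts.most_common(1)[0][0]


# Each event is processed as it arrives; no complete batch is needed first.
updates = list(running_top_tag(event_stream()))
final_event, final_top = updates[-1]
print(final_top)
```

The result is refreshed after every event, which is how a "what's trending" view can stay current without waiting for a batch load to finish.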
AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS UNIT 1
The second step is to collect data by finding suitable data and getting access
to the data from the data owner.
Start with data stored within the company
o The data can be stored in official data repositories such as databases,
data marts, data warehouses, and data lakes maintained by a team of IT
professionals.
o The primary goal of a database is data storage, while a data warehouse is
designed for reading and analyzing that data.
o A data mart is a subset of the data warehouse and geared toward
serving a specific business unit.
o While data warehouses and data marts are home to preprocessed data, data
lakes contain data in its natural or raw format, which probably needs
polishing and transformation before it becomes usable.
Don’t be afraid to shop around
o Many companies specialize in collecting valuable information.
o Data can also be delivered by third-party companies and take many
forms ranging from Excel spreadsheets to different types of databases.
Refer to Table 1.2.
3 Data preparation
Example:
2. Appending or stacking:
Appending or stacking tables is effectively adding observations from
one table to another table.
The equivalent operation in set theory would be the union, and this is
also the command in SQL, the common language of relational
databases.
Other set operators are also used in data science, such as set difference
and intersection.
Example:
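A minimal sketch of appending with Python's built-in sqlite3 module follows. The table names and rows are made up for illustration. Note that in SQL, UNION removes duplicate rows, while UNION ALL keeps every observation, so UNION ALL is the closer match to plain stacking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables with the same columns.
conn.executescript("""
CREATE TABLE sales_jan (client TEXT, amount INTEGER);
CREATE TABLE sales_feb (client TEXT, amount INTEGER);
INSERT INTO sales_jan VALUES ('A', 100), ('B', 200);
INSERT INTO sales_feb VALUES ('B', 200), ('C', 300);
""")


def query(operator):
    """Combine the two tables with the given set operator."""
    sql = (
        "SELECT client, amount FROM sales_jan "
        f"{operator} "
        "SELECT client, amount FROM sales_feb "
        "ORDER BY client"
    )
    return conn.execute(sql).fetchall()


stacked = query("UNION ALL")   # appending: keeps all four observations
union_rows = query("UNION")    # set union: the duplicate ('B', 200) is dropped
only_jan = query("EXCEPT")     # set difference
in_both = query("INTERSECT")   # intersection

print(len(stacked), union_rows, only_jan, in_both)
```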
3. View
Views are a kind of virtual table.
A view can be created by selecting fields from one or more tables
present in the database.
A view can contain either all the rows of a table or only specific rows
that satisfy a certain condition.
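A view can be sketched with Python's built-in sqlite3 module as follows. The employees table, its rows, and the it_staff view are hypothetical names chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical base table plus a view that keeps only some fields and rows.
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER);
INSERT INTO employees VALUES
    (1, 'Asha', 'Sales', 50000),
    (2, 'Ravi', 'IT', 65000),
    (3, 'Meena', 'IT', 70000);
CREATE VIEW it_staff AS
    SELECT name, salary FROM employees WHERE dept = 'IT';
""")

before = conn.execute("SELECT name, salary FROM it_staff ORDER BY name").fetchall()

# The view stores no data of its own, so a new row in the base table
# shows up the next time the view is queried.
conn.execute("INSERT INTO employees VALUES (4, 'Kiran', 'IT', 60000)")
after = conn.execute("SELECT name, salary FROM it_staff ORDER BY name").fetchall()

print(before)
print(after)
```

The view is queried like a table, but it is recomputed from the underlying table each time, which is why it is described as virtual.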
4 Data transformation
Certain models require their data to be in a certain shape.
Data transformation ensures that the data is in a suitable format for use in
data models.
Relationships between an input variable and an output variable aren’t always linear.
Taking the log of the independent variables can simplify the
estimation problem dramatically, because a relationship that is nonlinear in x
may be linear in log(x).
Example: refer to Figure 1.9.
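The effect of a log transformation can be sketched in plain Python. The data below is made up and assumes a relationship of the form y = 3 log(x) + 1, which may differ from the data in Figure 1.9; the fit_line and r_squared helpers are illustrative least-squares routines, not part of any library.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight line in xs."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: y depends linearly on log(x), not on x itself.
x = [1, 2, 4, 8, 16, 32, 64]
y = [3.0 * math.log(v) + 1.0 for v in x]
log_x = [math.log(v) for v in x]

raw_fit = r_squared(x, y)       # straight line in x: imperfect fit
log_fit = r_squared(log_x, y)   # straight line in log(x): near-perfect fit
slope, intercept = fit_line(log_x, y)

print(raw_fit, log_fit, slope, intercept)
```

After the transformation a simple linear model recovers the slope and intercept used to generate the data, whereas a straight line in the untransformed x fits noticeably worse.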
Histogram
In a histogram, a variable is cut into discrete categories, and the number of
occurrences in each category is summed up and shown in the graph.
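The binning and counting behind a histogram can be sketched in plain Python. The ages list and the bin width of 10 are made-up values for illustration, and histogram_counts is a hypothetical helper, not a library function.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Cut values into bins of equal width and count the occurrences per bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up sample data.
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 55]
counts = histogram_counts(ages, 10)

# Text rendering: one '#' per observation in the bin.
for start, n in counts.items():
    print(f"{start}-{start + 9}: {'#' * n}")
```

Each key is the lower edge of a bin and each value is the height of the corresponding bar.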