Fdsa Unit 1 Aids Sem 4
PART A
1. What is Big data?
Big data is data of huge volume, high velocity and wide variety that cannot
be processed by traditional processing systems.
It is characterized by the 7 Vs: velocity, variety, volume, variability,
visualization, value and veracity.
PART B
1. Give the description about data science and its applications, also
discuss the benefits and uses of Data Science and Big Data.
Contents
Big Data
Data Science
Benefits and Uses:
1. Commercial Companies
2. Human Resource Professionals
3. Financial Institutions
4. Government Organizations
5. Non-governmental organizations (NGOs)
6. Universities
Data Science Tools
Real Time Applications of Data Science
Data
Data is a collection of discrete states that convey information,
describing quantity, quality, fact and statistics.
Big data
Big data is data of huge volume, high velocity and wide variety that
cannot be processed by traditional processing systems.
It is characterized by the 7 Vs: velocity, variety, volume,
variability, visualization, value and veracity.
Data science
Data science is the study of data, using modern scientific
techniques, statistical methods and algorithms to derive insights
from huge volumes of data and to create business and IT strategies.
It deals with where the data comes from, what it represents, and
the ways in which it can be transformed into valuable inputs and
resources.
2. List and explain the facets of data or different types of data or categories of
data.
Contents
1. Structured
2. Unstructured
3. Natural Language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
7. Streaming
Categories of data:
1. Structured data
Structured data is data that depends on a data model and resides in a
fixed field within a record.
It’s easy to store structured data in tables within databases or Excel files.
2. Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because
the content is context-specific or varying.
Example - a regular email (Figure 1.2).
3. Natural language
Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of specific data
science techniques and linguistics.
The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize
well to other domains.
4. Machine-generated data
Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without human
intervention.
The analysis of machine data relies on highly scalable tools, due to its
high volume and speed.
Examples - web server logs, call detail records, network event logs, and
telemetry (Figure 1.3).
The machine data in Figure 1.3 would fit nicely in a classic
table-structured database.
A table isn’t the best approach, however, for highly interconnected or
“networked” data, where the relationships between entities have a
valuable role to play.
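As a rough illustration of how machine-generated records can be turned into table-like rows, the sketch below parses a few web server log lines with a regular expression. The sample lines, the field names and the `parse_line` helper are assumptions made for this example; they are not taken from Figure 1.3.

```python
import re
from collections import Counter

# Hypothetical sample lines in Common Log Format (illustrative data only).
LOG_LINES = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.11 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.10 - - [10/Oct/2023:13:56:02 +0000] "POST /login HTTP/1.1" 200 512',
]

# One named group per column we want in the resulting "table".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_line(line):
    """Return a dict of named fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Keep only the lines that matched the pattern.
records = [r for r in (parse_line(line) for line in LOG_LINES) if r is not None]

# Count how many requests ended with each HTTP status code.
status_counts = Counter(r["status"] for r in records)
print(status_counts)
```

Each parsed record has the same fixed fields, which is why this kind of data fits a table-structured database.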
5. Graph-based or network data
“Graph” points to mathematical graph theory.
In graph theory, a graph is a mathematical structure to model
pairwise relationships between objects.
Graph or network data is data that focuses on the relationship or
adjacency of objects.
The graph structures use nodes, edges, and properties to represent and
store graphical data.
Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
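A minimal sketch of the node, edge and property idea, using plain Python dictionaries instead of a graph database. The people, the relationship labels and the helper names are hypothetical and only serve the illustration.

```python
# Nodes with their properties (hypothetical example data).
nodes = {
    "alice": {"age": 34},
    "bob": {"age": 29},
    "carol": {"age": 41},
}

# Each edge is (source, target, properties).
edges = [
    ("alice", "bob", {"relation": "friend"}),
    ("bob", "carol", {"relation": "colleague"}),
]

def build_adjacency(nodes, edges):
    """Map every node to the set of nodes it is directly connected to."""
    adjacency = {name: set() for name in nodes}
    for source, target, _props in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)  # edges are treated as undirected here
    return adjacency

def friends_of_friends(adjacency, start):
    """Nodes exactly two steps away from start."""
    direct = adjacency[start]
    result = set()
    for neighbour in direct:
        result |= adjacency[neighbour]
    return result - direct - {start}

adjacency = build_adjacency(nodes, edges)
print(friends_of_friends(adjacency, "alice"))  # prints {'carol'}
```

Questions such as "who is a friend of a friend" follow the edges directly, which is the kind of query a graph database is built for.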
6. Audio, image, and video
Audio, image, and video are data types that pose specific challenges to
a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
High-speed cameras at stadiums will capture ball and athlete movements
to calculate in real time, for example, the path taken by a defender
relative to two baselines.
Recently a company called DeepMind succeeded at creating an algorithm
that’s capable of learning how to play video games.
This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
This prompted Google to buy the company for their own Artificial
Intelligence (AI) development plans.
7. Streaming data
Streaming data flows into the system continuously as events happen,
instead of being loaded into a data store in a batch.
Examples - “What’s trending” on Twitter, live sporting or music events,
and the stock market.
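The difference from batch loading can be sketched with a Python generator: each event is processed as soon as it arrives and the "trending" result is updated every time. The hashtags and the `event_stream` stand-in are assumptions for the example; a real stream would come from a socket or message queue.

```python
from collections import Counter

def event_stream():
    """Stand-in for a live feed that yields one event at a time."""
    for hashtag in ["#data", "#ai", "#data", "#python", "#data", "#ai"]:
        yield hashtag

def trending(stream, top_n=2):
    """Update the counts per event and yield the current top_n after each one."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
        yield counts.most_common(top_n)

# Collected into a list only so the final state can be inspected;
# a live consumer would act on each snapshot as it is produced.
snapshots = list(trending(event_stream()))
print(snapshots[-1])  # prints [('#data', 3), ('#ai', 2)]
```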
The second step is to collect data by finding suitable data and getting
access to the data from the data owner.
Start with data stored within the company
o The data can be stored in official data repositories such as
databases, data marts, data warehouses, and data lakes
maintained by a team of IT professionals.
o The primary goal of a database is data storage, while a data
warehouse is designed for reading and analyzing that data.
o A data mart is a subset of the data warehouse and geared toward
serving a specific business unit.
o While data warehouses and data marts are home to preprocessed
data, data lakes contain data in its natural or raw format, which
probably needs polishing and transformation before it becomes
usable.
Don’t be afraid to shop around
o Many companies specialize in collecting valuable information.
o Data can also be delivered by third-party companies and take
many forms ranging from Excel spreadsheets to different types of
databases. Refer to Table 1.2.
3 Data preparation
Common Errors
Table 1.3 – Common Errors
Example:
2. Appending or stacking:
Appending or stacking tables is effectively adding observations
from one table to another table.
The equivalent operation in set theory is the union, and
UNION is also the command in SQL, the common language of
relational databases.
Other set operators are also used in data science, such as set
difference and intersection.
Example:
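One possible illustration, using Python's built-in sqlite3 module with two small hypothetical tables (`sales_jan` and `sales_feb` are made-up names and data). UNION ALL stacks every observation, UNION stacks and removes duplicate rows, and INTERSECT and EXCEPT give the set intersection and set difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_jan (client TEXT, amount INTEGER);
    CREATE TABLE sales_feb (client TEXT, amount INTEGER);
    INSERT INTO sales_jan VALUES ('Ann', 100), ('Raj', 250);
    INSERT INTO sales_feb VALUES ('Raj', 250), ('Mei', 75);
""")

def rows(query):
    """Run a query and return its rows in sorted order for easy comparison."""
    return sorted(conn.execute(query).fetchall())

# Appending: every row of both tables, duplicates kept.
stacked = rows("SELECT * FROM sales_jan UNION ALL SELECT * FROM sales_feb")
# Set union: duplicate rows appear once.
union = rows("SELECT * FROM sales_jan UNION SELECT * FROM sales_feb")
# Set intersection: rows present in both tables.
both = rows("SELECT * FROM sales_jan INTERSECT SELECT * FROM sales_feb")
# Set difference: rows in January that are not in February.
only_jan = rows("SELECT * FROM sales_jan EXCEPT SELECT * FROM sales_feb")

print(len(stacked), len(union), both, only_jan)
```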
3. View
Views are a kind of virtual table.
A view can be created by selecting fields from one or more tables
present in the database.
A view can contain either all the rows of a table or specific rows
based on a certain condition.
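A short sketch of a view, again with Python's built-in sqlite3 module. The `employees` table, its columns and the `sales_staff` view are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER
    );
    INSERT INTO employees (name, dept, salary) VALUES
        ('Ann', 'Sales', 50000),
        ('Raj', 'IT', 65000),
        ('Mei', 'Sales', 58000);

    -- The view selects two fields and only the rows that meet a condition.
    CREATE VIEW sales_staff AS
        SELECT name, salary FROM employees WHERE dept = 'Sales';
""")

view_rows = conn.execute(
    "SELECT name, salary FROM sales_staff ORDER BY name"
).fetchall()
print(view_rows)

# The view stores no data of its own: a new row in the base table
# shows up the next time the view is queried.
conn.execute(
    "INSERT INTO employees (name, dept, salary) VALUES ('Tom', 'Sales', 47000)"
)
row_count_after_insert = conn.execute("SELECT COUNT(*) FROM sales_staff").fetchone()[0]
print(row_count_after_insert)
```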
4 Data transformation
Certain models require their data to be in a certain shape.
Data transformation ensures that the data is in a suitable format for use in
data models.
Relationships between an input variable and an output variable aren’t always
linear.
In such cases, taking the log of the independent variables can simplify the
estimation problem dramatically.
Example – Refer to Figure 1.9
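A small worked sketch of the idea, under the assumption that the output depends on the logarithm of the input. The data below is generated from y = 2 + 3 ln(x), a made-up relationship that is not taken from Figure 1.9. Once x is replaced by ln(x), an ordinary straight-line fit recovers the two coefficients.

```python
import math

# Hypothetical data: curved when plotted against x, straight against ln(x).
xs = [1, 2, 4, 8, 16, 32]
ys = [2 + 3 * math.log(x) for x in xs]

def fit_line(u, v):
    """Ordinary least squares for v = intercept + slope * u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sxx = sum((a - mean_u) ** 2 for a in u)
    slope = sxy / sxx
    return mean_v - slope * mean_u, slope

# Transform the independent variable, then fit a straight line.
log_xs = [math.log(x) for x in xs]
intercept, slope = fit_line(log_xs, ys)
print(round(intercept, 6), round(slope, 6))
```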
Histogram
In a histogram, a variable is cut into discrete categories, and the
number of occurrences in each category is summed up and shown in
the graph.
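The counting behind a histogram can be sketched in a few lines of Python. The ages and the bin width of 10 are hypothetical values chosen for the example, and the text bars stand in for the graph.

```python
def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = {}
    for value in values:
        bin_start = (value // bin_width) * bin_width  # lower edge of the bin
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))

ages = [21, 22, 25, 29, 31, 34, 35, 38, 41, 47]
age_counts = histogram(ages, 10)

# Draw one bar of '#' characters per category.
for start, count in age_counts.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```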
The boxplot
The boxplot offers an impression of the distribution within categories.
It can show the maximum, minimum, median, and other
characterizing measures at the same time.
Example:
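The measures a boxplot draws can be computed per category with Python's statistics module. This is only a sketch: the two categories and their scores are made-up data, and quartile conventions differ between tools (the "inclusive" method is used here).

```python
import statistics

def five_number_summary(values):
    """Minimum, quartiles, median and maximum: the values a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Hypothetical scores for two categories.
scores = {
    "A": [55, 60, 65, 70, 75],
    "B": [40, 50, 60, 80, 100],
}

summary = {category: five_number_summary(vals) for category, vals in scores.items()}
for category, stats in summary.items():
    print(category, stats)
```

Comparing the two summaries side by side shows that category B is spread more widely than category A, which is what the boxes in a boxplot make visible.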