Big Data
Department of Mathematics
School of Advanced Sciences
Vellore Institute of Technology
Vellore Campus, Vellore - 632 014
India
In the 1980s, most enterprise data was stored in relational databases,
complete with rows/records/tuples, columns/attributes/fields, primary keys, and foreign keys.
Relational Database Management System (RDBMS)
The software used to store, manage, query, and retrieve data stored in a relational
database is called an RDBMS.
The RDBMS provides an interface between users and applications and the database, as
well as administrative functions for managing data storage, access, and performance.
Over time, RDBMSs have matured, and the products available today are more robust,
cost-effective, and efficient.
The data held in RDBMS is typically structured.
Types of data:
Structured Data
Semi-structured Data
Unstructured Data
Structured data is data in an organized form (e.g., in rows and columns) that can be
easily used by a computer program,
i.e., data that conforms to a predefined schema/structure.
Relationships exist between entities of data, such as classes and their objects.
Data stored in databases is an example of structured data.
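As a minimal sketch of what conforming to a predefined schema means, the following uses Python's built-in sqlite3 module. The table names, column names, and sample rows (department, employee) are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " dept_id INTEGER NOT NULL REFERENCES department(dept_id))"
)
conn.execute("INSERT INTO department VALUES (1, 'Mathematics')")
conn.execute("INSERT INTO employee VALUES (10, 'Asha', 1)")

# The foreign key expresses a relationship between the two entities.
rows = conn.execute(
    "SELECT e.name, d.name FROM employee e"
    " JOIN department d ON e.dept_id = d.dept_id"
).fetchall()

# A row that violates the schema (unknown department 99) is rejected.
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Ravi', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Every row must fit the declared columns and keys, which is what makes the data easy for a program to query.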
Company          Database
Oracle Corp      Oracle
IBM              DB2
Microsoft        Microsoft SQL Server
EMC              Greenplum
Teradata         Teradata
MySQL            Open source
PostgreSQL       Advanced open source
Table: Database Products by Companies
These databases are typically used to hold transaction/operational data generated and
collected by day-to-day business activities.
The data of Online Transaction Processing (OLTP) systems is generally structured.
OLTP is a type of data processing that consists of executing a number of transactions
that occur concurrently, for example online banking, shopping, order entry, or sending
text messages.
These transactions traditionally are referred to as economic or financial transactions,
recorded and secured so that an enterprise can access the information anytime for
accounting or reporting purposes.
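An OLTP-style transaction can be sketched as follows, assuming a hypothetical account table in an in-memory SQLite database. The transfer function is an illustrative helper, not part of any library. The point is that both updates succeed together or neither is applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a balance may never become negative.
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500), (2, 100)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 200)       # succeeds
failed = transfer(conn, 1, 2, 1000)  # would overdraw account 1, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer the credit to account 2 is undone as well, so the recorded balances stay consistent.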
Characteristics of semi-structured data:
Flexible Schema: Unlike structured data with fixed schemas, semi-structured data allows
for flexibility in the organization of data. It may have a loose or dynamic schema,
enabling data to be added or changed without the need for predefined structures.
Data Contains Tags or Markers: Semi-structured data often includes tags, markers, or
other identifiers that provide a level of organization or hierarchy.
No Clear Data Relationships: While semi-structured data may have some organizational
elements, it lacks the strict relationships and constraints found in structured databases.
Variety of Data Types: Semi-structured data can accommodate various types of data,
including text, images, and nested structures.
Ease of Adaptability: It can easily incorporate new data elements without requiring
modifications to the entire schema.
Examples
XML (eXtensible Markup Language): Uses tags to define elements and attributes.
JSON (JavaScript Object Notation): Uses key-value pairs to represent data objects. It is
used to transmit data between a server and a web application.
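The two formats above can be illustrated with a short sketch using Python's standard json and xml.etree.ElementTree modules. The sample record (name, skills, address, email) is made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample documents describing the same employee.
json_text = '{"name": "Asha", "skills": ["SQL", "Python"], "address": {"city": "Vellore"}}'
xml_text = '<employee id="10"><name>Asha</name><city>Vellore</city></employee>'

# JSON: key-value pairs, with nesting allowed.
record = json.loads(json_text)
city_from_json = record["address"]["city"]

# Flexible schema: a new field can be added without redefining any structure.
record["email"] = "asha@example.com"
round_trip = json.loads(json.dumps(record))

# XML: tags mark elements, attributes carry extra identifiers.
root = ET.fromstring(xml_text)
emp_id = root.attrib["id"]
city_from_xml = root.find("city").text
```

The tags and keys give the data some organization, but nothing forces every record to carry the same fields.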
Unstructured data is data that does not conform to a data model or is not in a form
that can be used easily by a computer program.
About 80-90% of an organization's data is in this format.
Examples include chat rooms, PowerPoint presentations, images, videos, letters, research
documents, the body of an email, etc.
Data Mining: First, we deal with large data sets. Second, we use methods at the
intersection of AI, ML, statistics, and database systems to unearth consistent patterns in
those data sets. It is the analysis step of the “Knowledge Discovery in Databases” process.
A few popular data mining algorithms are as follows:
Association Rule Mining (Market basket analysis):
It is used to determine “What goes with what?”
That is, when you buy a product, which other product are you likely to purchase
with it?
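As a rough sketch of the "what goes with what" idea, the snippet below computes the two standard measures of an association rule, support and confidence, over a handful of made-up shopping baskets. The transactions and helper names are assumptions for illustration. Real market basket analysis typically uses an algorithm such as Apriori over far larger data.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(*items):
    """Fraction of transactions that contain all the given items."""
    wanted = set(items)
    return sum(wanted <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also hold the consequent."""
    return support(antecedent, consequent) / support(antecedent)

bread_and_butter = support("bread", "butter")      # 2 of 4 baskets
bread_then_butter = confidence("bread", "butter")  # 2 of the 3 bread baskets
```

A rule such as bread -> butter is considered interesting when both measures exceed thresholds chosen by the analyst.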
Regression Analysis
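It helps to model the relationship between two variables, so that the value of one
(the dependent variable) can be predicted from the other (the independent variable).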
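A minimal sketch of simple linear regression, assuming one independent variable and ordinary least squares fitting. The data points are invented so that they lie exactly on a line. Real data would scatter around it.

```python
# Hypothetical observations: x is the independent variable, y the dependent one.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least squares estimates: slope = S_xy / S_xx, intercept = mean_y - slope * mean_x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict the dependent variable for a new value of x."""
    return intercept + slope * x

prediction = predict(6)
```

The fitted slope describes how much y changes per unit change in x, which is the relationship the analysis is after.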
Collaborative Filtering
It is about predicting a user’s preference or preferences based on the preferences of a group of
users.
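One common variant, user-based collaborative filtering, can be sketched as below. It is only one of several possible approaches. The users, items, and ratings are hypothetical, and cosine similarity over co-rated items is an assumed choice of similarity measure.

```python
from math import sqrt

# Hypothetical ratings (1 to 5) given by users to items A-D.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted, total = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = similarity(user, other)
        weighted += weight * their_ratings[item]
        total += weight
    return weighted / total if total else None

predicted = predict("alice", "D")  # alice has not rated D yet
```

Because alice's past ratings match bob's more closely than carol's, the prediction for item D is pulled toward bob's rating.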
Dr. Jisha Francis, Module 1
A Few Ways to Deal with Unstructured Data
Text analytics or text mining: This involves analyzing and interpreting large volumes of
textual information to discover patterns, trends, and knowledge. It includes tasks such as
text categorization, text clustering, sentiment analysis, etc.
Natural Language Processing: It focuses on enabling computers to understand,
interpret, and generate human language. It involves the interaction between computers
and humans through natural language.
Noisy text analytics: It is the process of extracting structured or semi-structured
information from noisy unstructured data such as chats, blogs, wikis, emails, text
messages, etc.
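The techniques above can be illustrated with a toy sketch that turns a noisy chat message into a small structured record. Everything here is an assumption made for illustration: the slang table, the word lists, the message, and the helper names. Production systems would rely on trained models and much larger lexicons.

```python
import re

# Hypothetical lookup tables for a toy example.
SLANG = {"gr8": "great", "u": "you", "thx": "thanks"}
POSITIVE = {"great", "good", "love", "thanks"}
NEGATIVE = {"bad", "poor", "late", "hate"}

def normalize(text):
    """Lowercase, tokenize, and expand chat abbreviations (noisy text cleanup)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [SLANG.get(token, token) for token in tokens]

def sentiment(text):
    """Lexicon-based sentiment: count positive words minus negative words."""
    tokens = normalize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def extract_record(message):
    """Pull structured fields out of free text with regular expressions."""
    order = re.search(r"order\s*#?(\d+)", message, re.IGNORECASE)
    email = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", message)
    return {
        "order_id": order.group(1) if order else None,
        "email": email.group(0) if email else None,
        "sentiment": sentiment(message),
    }

record = extract_record(
    "thx, order #4521 arrived, service was gr8! mail me at asha@example.com"
)
```

The resulting dictionary has fixed keys, so it could be stored in a table or serialized as JSON, which is the sense in which unstructured text becomes structured or semi-structured.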