Acuity Educare
NGT
SEM V: UNIT 1
Some sectors exhibit higher levels of data intensity than others; here, data intensity refers to the average amount of data accumulated per company or firm in a sector, implying greater potential to capture value from big data. Financial services sectors, including banking, investment, and securities services, are highly transaction-oriented; they are also required by regulation to store data. The analysis shows that they have the most digital data stored per firm on average. Communications and media firms, utilities, and government also have significant digital data stored per enterprise or organization, which appears to reflect the high volume of operations and multimedia data such entities handle. Discrete and process manufacturing have the highest aggregate data stored in bytes; however, these sectors rank much lower in intensity terms, since they are fragmented across a large number of firms.
at the same time as the occurrence of an event). The quality of this kind of source depends mostly on the capacity of the sensor to take accurate measurements in the way it is expected.
Social interactions: Data produced by human interactions through a network, such as the Internet. The most common is the data produced in social networks. This kind of data has both qualitative and quantitative aspects that are of interest to measure. Quantitative aspects are easier to measure than qualitative ones: the former involve counting observations grouped by geographical or temporal characteristics, while the quality of the latter relies mostly on the accuracy of the algorithms applied to extract meaning from the contents, which are commonly found as unstructured text written in natural language. Examples of analyses made from this data are sentiment analysis, trending-topic analysis, etc.
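As a minimal illustration of the idea behind sentiment analysis, consider the word-list scorer sketched below. The word lists and sample posts are invented for the example; real systems rely on trained NLP models rather than fixed lists.

# Minimal word-list sentiment scorer for short social media posts.
# Word lists and sample posts are illustrative only.

POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def sentiment(post: str) -> int:
    """Return a crude score: positive words minus negative words."""
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = ["I love this product, it is great", "terrible service, very sad"]
for p in posts:
    print(p, "->", sentiment(p))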
Business transactions: Data produced as a result of business activities can be recorded in structured or unstructured databases. When it is recorded in structured databases, the most common problems in analyzing that information and deriving statistical indicators are the sheer volume of information and the periodicity of its production, because this data is sometimes produced at a very fast pace; thousands of records can be produced in a second when big companies such as supermarket chains are recording their sales. This kind of data is not always produced in formats that can be stored directly in relational databases. An electronic invoice is an example of this kind of source: it has more or less a structure, but to put the data it contains into a relational database we need to apply some process that distributes the data across different tables (in order to normalize the data according to relational database theory), and it may not be in plain text (it could be a picture, a PDF, an Excel record, etc.). One problem here is that this process takes time and, as said before, the data may be produced too fast, so we may need different strategies to use the data: processing it as it is without putting it into a relational database, discarding some observations (by which criteria?), using parallel processing, etc. The quality of information produced from business transactions is tightly related to the capacity to obtain representative observations and to process them.
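To make the normalization step concrete, here is a minimal sketch that splits one semi-structured invoice record into normalized relational-style tables. The invoice layout, field names, and table names are assumptions made for the example, not a standard.

# Sketch: distribute one semi-structured invoice across normalized tables.
# The invoice layout and table names are hypothetical.

invoice = {
    "invoice_id": "INV-001",
    "customer": {"id": "C-42", "name": "ACME Corp"},
    "lines": [
        {"product": "P-1", "qty": 2, "unit_price": 10.0},
        {"product": "P-2", "qty": 1, "unit_price": 99.5},
    ],
}

# Target "tables" (lists of rows) after normalization.
customers, invoices, invoice_lines = [], [], []

customers.append({"id": invoice["customer"]["id"],
                  "name": invoice["customer"]["name"]})
invoices.append({"id": invoice["invoice_id"],
                 "customer_id": invoice["customer"]["id"]})
for n, line in enumerate(invoice["lines"], start=1):
    invoice_lines.append({"invoice_id": invoice["invoice_id"],
                          "line_no": n, **line})

print(customers, invoices, invoice_lines, sep="\n")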
Electronic files: These refer to unstructured documents, statically or dynamically produced, which are stored or published as electronic files: Internet pages, videos, audios, PDF files, etc. They can have contents of special interest, but those contents are difficult to extract; different techniques can be used, like text mining, pattern recognition, and so on. The quality of our measurements will rely mostly on the capacity to extract and correctly interpret all the representative information from those documents.
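A very small sketch of such extraction, here just counting word frequencies in a plain-text file. The filename is a placeholder; binary formats (PDF, audio, video) need a format-specific extraction step before this kind of analysis can run.

# Sketch: crude keyword extraction from a plain-text electronic file.
# "page.txt" is a placeholder filename.

from collections import Counter
import re

with open("page.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}
keywords = Counter(w for w in words if w not in STOPWORDS)
print(keywords.most_common(10))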
Broadcastings: This mainly refers to video and audio produced in real time. Getting statistical data from the contents of this kind of electronic data is, for now, too complex and demands large computational and communications power. Once the problems of converting “digital-analog” contents into “digital-data” contents are solved, we will face processing complications similar to the ones found in social interactions.
2. VELOCITY
Velocity refers to the speed at which data is being generated. Staying with our social media example: every day, 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube, and 3.5 billion searches are performed on Google. This is like a nuclear data explosion. Big data technologies help a company to contain this explosion, accept the incoming flow of data, and at the same time process it fast enough that it does not create bottlenecks.
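One common way to absorb a fast incoming flow without creating bottlenecks is a buffered producer/consumer pipeline. The sketch below (all names hypothetical) uses a bounded queue so ingestion and processing proceed concurrently, with back-pressure when the consumer falls behind.

# Sketch: buffering a fast stream so ingestion and processing overlap.
# A bounded queue absorbs bursts; the producer blocks only when the
# consumer falls too far behind. Event names are hypothetical.

import queue, threading

buffer = queue.Queue(maxsize=1000)  # bounded: applies back-pressure

def producer():
    for i in range(10_000):          # stands in for the incoming stream
        buffer.put({"event_id": i})
    buffer.put(None)                 # sentinel: end of stream

def consumer():
    while (event := buffer.get()) is not None:
        pass                          # process the event here

t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print("stream drained")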
3. VARIETY
Variety in big data refers to all the structured and unstructured data that may be generated either by humans or by machines. The most commonly added data include texts, tweets, pictures, and videos. Unstructured data such as emails, voicemails, hand-written text, ECG readings, audio recordings, etc., are also important elements under Variety. Variety is all about the ability to classify incoming data into various categories.
As a result, this information and knowledge can be used to improve processes and performance.
C. Segmentation and Customization: Big data enables organizations to create tailor-made products and services to meet specific segment needs. It can also be used in the social sector to accurately segment populations and target benefit schemes at specific needs. Segmentation of customers based on various parameters can aid in targeted marketing campaigns and in tailoring products to suit the needs of customers.
D. Aiding Decision Making: Big data can substantially minimize risks, improve decision making, and uncover valuable insights. Automated fraud alert systems in credit card processing and automatic fine-tuning of inventory are examples of systems that aid or automate decision making based on big data analytics (a toy fraud-alert rule is sketched after this list).
E. Innovation: Big data enables the innovation of new ideas in the form of products and services, and innovation in existing ones in order to reach large segments of people. Using data gathered from actual products, manufacturers can not only innovate to create the next-generation product but also innovate their sales offerings. As an example, real-time data from machines and vehicles can be analyzed to provide insight into maintenance schedules; wear and tear on machines can be monitored to make more resilient machines; fuel consumption monitoring can lead to higher-efficiency engines. Real-time traffic information is already making life easier for commuters by providing them options to take alternate routes.
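As a toy illustration of the fraud-alert idea mentioned in item D, a single rule-based check might look like the sketch below. The threshold and field names are invented for the example; real systems combine many statistical and machine-learned signals.

# Toy rule-based fraud alert: flag a card transaction that is far
# above the cardholder's recent average. Threshold and fields are
# illustrative; real systems use many combined signals.

def is_suspicious(amount: float, recent_amounts: list[float],
                  factor: float = 5.0) -> bool:
    if not recent_amounts:
        return False
    avg = sum(recent_amounts) / len(recent_amounts)
    return amount > factor * avg

print(is_suspicious(2500.0, [30.0, 45.0, 25.0]))  # True -> raise an alert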
systems to deal with big data on the one hand, and the lack of experienced resources in newer technologies on the other, is a challenge that any big data project has to manage.
A growing number of technologies make use of these advancements. In this book, we will discuss MongoDB, one of the technologies that can be used to store and process big data.
Challenges of RDBMS
An RDBMS assumes a well-defined structure of data and assumes that the data is largely uniform. It needs the schema of your application and its properties (columns, types, etc.) to be defined up front, before building the application. This does not match well with agile development approaches for highly dynamic applications.
As the data grows larger, you have to scale your database vertically, i.e., add more capacity to the existing servers.
• Social Network Graph: Who is connected to whom? Whose post should be visible on the user’s wall or homepage on a social networking site?
• Search and Retrieve: Search all relevant pages with a particular keyword, ranked by the number of times the keyword appears on a page.
Definition: NoSQL doesn’t have a formal definition. It represents a form of persistence/data storage mechanism that is fundamentally different from an RDBMS. But if pushed to define NoSQL, here it is: NoSQL is an umbrella term for data stores that don’t follow the RDBMS principles.
consistency.
Consistency can be implemented at both read and write operation levels.
Write Operations
N=W implies that the write operation will update all data copies before returning control to the client and marking the write as successful. This is similar to how traditional RDBMS databases work when implementing synchronous replication. This setting slows down write performance.
If write performance is a concern, meaning you want writes to happen fast, you can set W=1 and R=N. This implies that a write updates just one copy and is marked successful, but whenever the user issues a read request, all the copies are read to return the result. If any copy is not updated, the system first brings it up to date, and only then does the read succeed. This implementation slows down read performance.
Hence most NoSQL implementations use N>W>1. This implies that more than one node needs to be updated successfully, but not all nodes need to be updated at the same time.
Read Operations
If R is set to 1, the read operation will read any data copy, which can be outdated. If
R>1, more than one copy is read, and it will read most recent value. However, this can
slow down the read operation.
Using N<W+R always ensures that a read operation retrieves the latest value. This is
because the number of written copies and read copies are always greater than the
actual number of copies, ensuring that at least one read copy has the latest version.
This is quorum assembly .
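A small worked check of the quorum condition makes the overlap argument concrete, using the N, W, and R roles described above.

# Quorum check: with N copies, W acknowledged on write and R consulted
# on read, R + W > N forces the read set and the last write set to
# overlap in at least one node, so some copy read is up to date.

def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    return r + w > n

for n, w, r in [(3, 3, 1),   # N=W: synchronous replication, fast reads
                (3, 1, 3),   # W=1, R=N: fast writes, expensive reads
                (3, 2, 2)]:  # N > W > 1: common middle ground
    print(f"N={n} W={w} R={r}: consistent read guaranteed ->",
          is_strongly_consistent(n, w, r))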
ACID                      BASE
Atomicity                 Basically Available
Consistency               Soft State
Isolation                 Eventually Consistent
Durability
commodity servers, enabling users to store and process more data at a low cost.
• Flexible data models: NoSQL databases have a very flexible data model, enabling them to work with any type of data; they don’t comply with the rigid RDBMS data models. As a result, any application change that involves updating the database schema can be easily implemented.
Disadvantages of NoSQL
In addition to the above-mentioned advantages, there are many impediments that you need to be aware of before you start developing applications using these platforms.
• Maturity: Most NoSQL databases are pre-production versions with key features still to be implemented. Thus, when deciding on a NoSQL database, you should analyze the product properly to ensure that the features you need are fully implemented and not still on the to-do list.
• Support: Support is one limitation that you need to consider. Most NoSQL databases come from start-ups that open sourced them. As a result, support is minimal compared to what enterprise software companies provide, and may not have global reach or support resources.
• Limited Query Capabilities: Since NoSQL databases are generally developed to meet the scaling requirements of web-scale applications, they provide limited querying capabilities. A simple querying requirement may involve significant programming expertise.
• Administration: Although NoSQL is designed to provide a no-admin solution, it still requires skill and effort to install and maintain the solution.
• Expertise: Since NoSQL is an evolving area, expertise in the technology is limited in the developer and administrator community.
Although NoSQL is becoming an important part of the database landscape, you need to be aware of the limitations and advantages of the products to make the correct choice of NoSQL database platform.
Let’s talk about technical scenarios and how they compare in RDBMS vs. NoSQL:
• Schema flexibility: This is a must for easy future enhancements and integration with external applications (outbound or inbound). RDBMS are quite inflexible in their design. Adding a column is an absolute no-no, especially if the table has some data, for reasons ranging from default values to indexes and performance implications. More often than not, you
end up creating new tables, increasing the complexity by introducing relationships across tables.
• Complex queries: The traditional design of tables leads developers to write complex JOIN queries, which are not only difficult to implement and maintain but also take substantial database resources to execute.
• Data update: Updating data across tables is probably one of the more complex scenarios, especially if the updates are part of a transaction. Note that keeping a transaction open for a long duration hampers performance. You also have to plan for propagating updates to multiple nodes across the system. And if the system does not support multiple masters or writing to multiple nodes simultaneously, there is a risk that a node failure will move the entire application into read-only mode.
• Scalability: Often the only scalability that may be required is for read operations. However, several factors impact this speed as operations grow. Some of the key questions to ask are:
• What is the time taken to synchronize data across physical database instances?
• What is the time taken to synchronize data across datacenters?
• What is the bandwidth requirement to synchronize data?
• Is the data exchange optimized?
• What is the latency when an update is synchronized across servers? Typically, records will be locked during an update.
NoSQL-based solutions provide answers to most of the challenges listed above.
Let’s now see what NoSQL has to offer against each technical question mentioned above.
• Schema flexibility: Column-oriented databases store data as columns, as opposed to the rows of an RDBMS. This allows the flexibility of adding one or more columns as required, on the fly. Similarly, document stores that allow storing semi-structured data are also good options.
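For instance, with MongoDB via the pymongo driver (the connection string, database, and field names below are placeholders), two documents with different fields can live in the same collection with no schema change at all, a minimal sketch of the dynamic-schema idea:

# Two documents with different fields in one MongoDB collection:
# no ALTER TABLE, no up-front schema. Connection details are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo"]["users"]

users.insert_one({"name": "Alice", "email": "alice@example.com"})
users.insert_one({"name": "Bob", "phone": "1111111", "vip": True})

for doc in users.find():
    print(doc)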
• Complex queries: NoSQL databases do not have support for relationships or foreign keys. There are no complex queries. There are no JOIN statements.
Is that a drawback? How does one query across tables?
It is a functional drawback, definitely. To query across tables, multiple queries must be executed. A database is a shared resource, used across application servers, and must be released from use as quickly as possible. The options involve a combination of simplifying the queries to be executed, caching data, and performing complex operations in the application tier. A lot of databases provide built-in entity-level caching. This means that when a record is accessed, it may be automatically and transparently cached by the database. The cache may be an in-memory distributed cache, for performance and scale.
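Continuing the hypothetical pymongo example above, an application-side "join" becomes two simple queries stitched together in the application tier. The collections and fields here are invented for the sketch.

# Application-side "join": fetch an order, then fetch its customer
# with a second query, instead of a SQL JOIN. Names are hypothetical.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["demo"]

order = db.orders.find_one({"_id": 1001})
if order is not None:
    customer = db.customers.find_one({"_id": order["customer_id"]})
    print(order, customer)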
• Data update: Data updating and synchronization across physical instances are difficult engineering problems to solve. Synchronization across nodes within a datacenter has a different set of requirements than synchronization across multiple datacenters. One would want the latency to be within a couple of milliseconds, or tens of milliseconds at most. NoSQL solutions offer good synchronization options.
MongoDB, for example, allows concurrent updates across nodes, synchronization with conflict resolution, and eventual consistency across datacenters within an acceptable time that runs to a few milliseconds. As such, MongoDB has no concept of isolation. Note that because the complexity of managing the transaction may be moved out of the database, the application has to do some hard work of its own.
A plethora of databases offer multiversion concurrency control (MVCC) to achieve transactional consistency.
As Dan Pritchett (www.addsimplicity.com/), Technical Fellow at eBay, puts it, eBay.com does not use transactions. Note that PayPal does use transactions.
• Scalability: NoSQL solutions provide greater scalability for obvious reasons. A lot of the complexity required for a transaction-oriented RDBMS does not exist in
non-ACID-compliant NoSQL databases. Interestingly, NoSQL does not provide cross-table references and no JOIN queries are possible; because you can’t write a single query to collate data across multiple tables, one simple and logical solution is, at times, to duplicate the data across tables. In some scenarios, embedding the information within the primary entity, especially in one-to-one mapping cases, may be a great idea.
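A sketch of that embedding idea, with a hypothetical one-to-one customer/address mapping folded into a single document rather than split across tables:

# Embedding a one-to-one related entity (address) inside the primary
# entity (customer) avoids a cross-collection lookup entirely.
# Field names are hypothetical.

customer = {
    "_id": "C-42",
    "name": "ACME Corp",
    "address": {                      # embedded, not a separate table
        "street": "1 Industrial Way",
        "city": "Mumbai",
    },
}
# db.customers.insert_one(customer)   # with a pymongo collection as above
print(customer["address"]["city"])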
This design also makes for high performance, by grouping relevant data together internally and making it easily searchable.
A JSON document contains the actual data and is comparable to a row in SQL. However, in contrast to RDBMS rows, documents can have a dynamic schema. This means that documents within a collection can have different fields or structures, or that common fields can hold different types of data.
A document contains data in the form of key-value pairs. Let’s understand this with an example:
{
  "Name": "ABC",
  "Phone": ["1111111", "222222"],
  "Fax": ""
}
As mentioned, keys and values come in pairs. The value of a key in a document can be left blank. In the above example, the document has three keys, namely “Name,” “Phone,” and “Fax.” The “Fax” key has no value.
MongoDB doesn’t provide support for transactions in the same way as SQL. However, it guarantees atomicity at the document level. It also uses an isolation operator to isolate write operations that affect multiple documents, but it does not provide “all-or-nothing” atomicity for multi-document write operations.
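A small pymongo sketch of what document-level atomicity means in practice: a single update_one touching several fields of one document is applied atomically, whereas the two-document transfer below is not all-or-nothing. Collection and field names are placeholders.

# Document-level atomicity in MongoDB: one update_one on one document
# is atomic, even when it changes several fields at once.
# Collection and field names are placeholders.

from pymongo import MongoClient

accounts = MongoClient("mongodb://localhost:27017")["bank"]["accounts"]

# Atomic: both field changes land together or not at all.
accounts.update_one({"_id": "A"},
                    {"$inc": {"balance": -100}, "$set": {"updated": True}})

# NOT all-or-nothing: two separate single-document writes.
accounts.update_one({"_id": "A"}, {"$inc": {"balance": -100}})
accounts.update_one({"_id": "B"}, {"$inc": {"balance": 100}})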