Unit 1: BDA
Data that is very large in size is called Big Data. Normally we work with data measured in MB (Word documents, Excel sheets) or at most GB (movies, code), but data of petabyte size, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.
Four Vs:
Big data analytics applications often include data from both internal systems and external sources, such as weather data or demographic data on consumers compiled by third-party information services.
Three factors, volume, velocity, and variety, became known as the 3V's of big data. Gartner popularized this concept in 2005 after acquiring Meta Group and hiring Laney. Over time, the 3V's became the 5V's with the addition of value and veracity, and sometimes a sixth V, variability, is included.
Travel and Tourism
Travel and tourism are major users of Big Data. It enables forecasting of travel facility requirements at multiple locations, improving business through dynamic pricing, and much more.
Healthcare
Government
Government agencies use Big Data to run many departments, managing utilities, dealing with traffic jams, and tackling crime such as hacking and online fraud.
E-commerce
Social Media
Social Media is the largest data generator. Statistics show that around 500+ terabytes of fresh data are generated on social media daily, particularly on Facebook. The data mainly contains videos, photos, message exchanges, etc. A single activity on a social media site generates a lot of stored data, which gets processed when required. Because the stored data is in terabytes (TB), it takes a lot of time to process; Big Data is a solution to this problem.
The Mapper class takes the input, tokenizes it, and maps and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them.
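To make this concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API. The class names WordCountMapper and WordCountReducer are illustrative, not taken from these notes: the Mapper tokenizes each input line and emits <word, 1> pairs, and the Reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: reads one line at a time, tokenizes it, and emits <word, 1> pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives <word, [1, 1, ...]> grouped and sorted by the framework,
// and emits <word, total count>.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}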
The MapReduce paradigm is commonly used to implement the following algorithms:
• Sorting
• Searching
• Indexing
• TF-IDF
Sorting
Searching
Example
• The Map phase processes each input file and provides the
employee data in key-value pairs (<k, v> : <emp name, salary>).
See the following illustration.
• Combiner phase − In each file, compare every employee's salary with the maximum found so far, for example:
Max = the salary of the first employee;
if (v(next employee).salary > Max) {
    Max = v.salary;
} else {
    continue checking;
}
• Reducer phase − From each file, you will get the highest salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs, which come from the four input files (a minimal Java sketch of this step is shown after the output below). The final output should be as follows −
<gopal, 50000>
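As an illustration of this searching step, here is a minimal Reducer sketch using the Hadoop Java API. It assumes the earlier phases emit <employee name, salary> pairs as Text and IntWritable; the class name MaxSalaryReducer is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: tracks the single highest salary seen across all <name, salary> pairs.
public class MaxSalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private String maxName = null;
    private int maxSalary = Integer.MIN_VALUE;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable salary : values) {
            if (salary.get() > maxSalary) {   // higher salary found: update the maximum
                maxSalary = salary.get();
                maxName = key.toString();
            }                                 // otherwise keep checking
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // After all keys are processed, emit only the highest-paid employee,
        // e.g. <gopal, 50000> for the data illustrated above.
        if (maxName != null) {
            context.write(new Text(maxName), new IntWritable(maxSalary));
        }
    }
}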
Indexing
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly,
"is": {0, 1, 2} implies the term "is" appears in the files T[0], T[1], and
T[2].
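To show how this index can be derived, here is a minimal plain-Java sketch. It is a simplification rather than a full MapReduce job, and it assumes the three file contents listed above.

import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Builds an inverted index: each term is mapped to the set of file indexes
// in which it appears.
public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Assumed contents of T[0], T[1], and T[2] (see the example above).
        String[] files = {"it is what it is", "what is it", "it is a banana"};

        Map<String, TreeSet<Integer>> index = new TreeMap<>();

        for (int i = 0; i < files.length; i++) {
            for (String term : files[i].split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(i);
            }
        }

        // Prints each term with its file list, e.g. "is": [0, 1, 2]
        for (Map.Entry<String, TreeSet<Integer>> entry : index.entrySet()) {
            System.out.println("\"" + entry.getKey() + "\": " + entry.getValue());
        }
    }
}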
TF-IDF
While computing TF, all the terms are considered equally important. That means TF counts the term frequency even for common words like "is", "a", and "what". Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −
IDF(t) = log(Total number of documents / Number of documents containing the term t)
Example
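The following worked example uses hypothetical numbers, chosen only to show the arithmetic. Suppose a document contains 1,000 words and the term "hive" appears in it 50 times; then TF("hive") = 50 / 1,000 = 0.05. Now suppose the collection holds 10,000,000 documents and "hive" appears in 1,000 of them; using a base-10 logarithm, IDF("hive") = log(10,000,000 / 1,000) = log(10,000) = 4. The TF-IDF weight is the product of the two: 0.05 × 4 = 0.20.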