01 Introduction
1
In this discussion…
Introduction to data
Current trend
2
Introduction to data
Example:
10, 25, …, Kharagpur, 10CS3002, [email protected]
Anything else?
3
Current Trend
4
How large is your data?
What is the maximum file size you have dealt with so far?
Movies/files/streaming video that you have used?
5
Growth of data
6
Sources of data
“Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.”
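To put the quoted figure in familiar units, here is a minimal sketch that converts a raw byte count into decimal (SI) units. It assumes "quintillion" means 10^18 (short scale); the helper name `human_readable` is hypothetical and used only for illustration.

```python
# Decimal (SI) byte units, each 1000 times larger than the previous one.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: float) -> str:
    """Format a byte count with the largest SI unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


daily_bytes = 2.5e18  # 2.5 quintillion bytes per day, the figure quoted above
print(human_readable(daily_bytes))        # 2.5 EB (exabytes) per day
print(human_readable(daily_bytes * 365))  # the same rate sustained for one year
```

At that rate, a single day of data would fill roughly 2.5 million one-terabyte disks.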
7
Examples
8
Big Data
9
Now data is Big Data!
Big data is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
10
Characteristics of Big Data: V3
11
V3: V for Volume
The volume of data that needs to be processed is increasing rapidly
More storage capacity
More computation
More tools and techniques
Exponential increase in collected/generated data
12
V3: V for Variety
Various formats, types, and structures
Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
13
V3: V for Velocity
Data is being generated fast and needs to be processed fast
For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value
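The idea of acting on data as it streams in can be sketched in a few lines. The example below is only an illustration under stated assumptions: the `Transaction` type, the `flag_suspicious` function, and the rule "flag an amount more than `factor` times the account's running average" are all hypothetical and are not taken from any real fraud-detection system.

```python
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple


class Transaction(NamedTuple):
    account: str
    amount: float


def flag_suspicious(stream: Iterable[Transaction], factor: float = 5.0,
                    min_history: int = 3) -> Iterator[Transaction]:
    """Yield transactions far above the account's running average, one at a time."""
    count = defaultdict(int)    # transactions seen so far per account
    total = defaultdict(float)  # sum of amounts seen so far per account
    for tx in stream:
        n = count[tx.account]
        # Decide immediately, using only what has been seen before this event.
        if n >= min_history and tx.amount > factor * (total[tx.account] / n):
            yield tx
        count[tx.account] += 1
        total[tx.account] += tx.amount


# Hypothetical events, processed in arrival order.
events = [
    Transaction("A", 20.0), Transaction("A", 25.0), Transaction("B", 300.0),
    Transaction("A", 22.0), Transaction("A", 900.0), Transaction("B", 310.0),
]
for tx in flag_suspicious(events):
    print("suspicious:", tx)
```

Because the function is a generator that keeps only two running totals per account, each event is judged as it arrives instead of after the whole dataset has been stored, which is the point of the velocity requirement.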
14
Big Data vs. small data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time in nature
15
Big Data vs. small data
Big Data is more real-time in nature
than traditional applications
16
Tools and Techniques
17
Challenges ahead…
18
Big data players
19
Major players…
Hadoop
MapReduce
Mahout
Apache HBase
Cassandra
20
Tools available
NoSQL Databases
MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
MapReduce
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka,
Azkaban, Oozie, Greenplum
Storage
S3, HDFS, GDFS
Servers
EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing
R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets,
Tinkerpop
21
Questions of the day…
1. What are the smallest and largest units for measuring the size of data?
23