1 Introduction To Big Data Management and Processing
Lecture 1
Introduction to big data storage and processing
Syllabus
No. Lecture
How big is big data?
Data science: The 4th paradigm for scientific discovery
Big data in 2008
Big data in 2014
Big data today
Big numbers
Big data sources
• E-commerce
• Social networks
• Internet of things
• Data-intensive experiments (bioinformatics, quantum physics, etc.)
Data is the new oil
The 5 V's of big data
Big data is a term for data sets that are so large or complex that
traditional data processing application software is inadequate to
deal with them (Wikipedia).
Big data – big value
source: wipro.com
Big Data in education industry
• Customized and Dynamic Learning Programs
• Reframing Course Material
• Grading Systems
• Career Prediction
Edtech
• Coursera
• VioEdu
• https://byjus.com/
• Engaging Video Lessons
• Personalized Learning Journeys
• Mapped to the Syllabus
• In-depth Analysis
• Engaging Interactive Questions
Big Data in healthcare industry
• Reduce treatment costs and avoid unnecessary diagnoses
• Predict epidemic outbreaks and plan preventive measures
• Avoid preventable diseases
Big Data in government sector
• Welfare Schemes
• Make faster and informed decisions
• Identify areas that are in immediate need of attention
• Overcome national challenges such as unemployment and terrorism
• Cyber security
• Fraud detection
• Catching tax evaders
Big Data in the media and entertainment industry
• Predicting the interests of audiences
• Optimized or on-demand scheduling of media streams
in digital media distribution platforms
• Getting insights from customer reviews
• Effective targeting of the advertisements
• Example
• Spotify, Amazon Prime
Big data in scientific discovery
https://www.youtube.com/watch?v=fobx4wIS6W0
Top 10 Company Market Cap Ranking History (1998-2018)
Big data technology stack
Scalable data management
• Scalability
• Able to manage an increasingly large volume of data
• Accessibility
• Able to maintain efficiency when reading and writing data (I/O)
to and from data storage systems
• Transparency
• In a distributed environment, users should be able to access
data over the network as easily as if the data were stored
locally.
• Users should not have to know the physical location of data to
access it.
• Availability
• Fault tolerance
• The number of users, system failures, or other consequences
of distribution shouldn't compromise availability.
Data I/O landscape
[Figure: data I/O landscape. Recoverable labels: CPUs: 10 GB/s; network: 1 Gb/s or 125 MB/s; nodes in another rack: 0.1 Gb/s]
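To make the bandwidth figures on this slide concrete, the sketch below estimates how long it would take to move 1 TB at each rate. The 1 TB payload, the function name `transfer_seconds`, and the reading of 0.1 Gb/s as the rate to nodes in another rack are assumptions for illustration; latency and protocol overhead are ignored.

```python
# Back-of-the-envelope transfer times for the rates shown on the slide.
# Assumptions: 1 TB = 10**12 bytes; latency and protocol overhead are ignored.

RATES_BYTES_PER_S = {
    "CPUs (10 GB/s)": 10e9,
    "network, 1 Gb/s (125 MB/s)": 125e6,
    "network to another rack, 0.1 Gb/s": 0.1e9 / 8,  # bits -> bytes: 12.5 MB/s
}


def transfer_seconds(num_bytes, bytes_per_second):
    """Time in seconds to move num_bytes at a constant rate."""
    return num_bytes / bytes_per_second


ONE_TB = 10**12
for name, rate in RATES_BYTES_PER_S.items():
    secs = transfer_seconds(ONE_TB, rate)
    print(f"{name}: {secs:,.0f} s (about {secs / 3600:.1f} h)")
```

Under these assumptions the same terabyte takes minutes at CPU speed but most of a day across racks, which is one reason big data systems try to move computation to the data rather than the other way round.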
Scalable data ingestion and processing
• Data ingestion
• Data from different, complementary information systems must be combined to
give a more comprehensive basis for analysis
• How can data be ingested efficiently from various distributed, heterogeneous
sources?
• Different data formats
• Different data models and schemas
• Security and privacy
• Data processing
• How to process massive volumes of data in a timely fashion?
• How to process massive data streams in real time?
• Traditional parallel, distributed processing (OpenMP, MPI)
• Steep learning curve
• Limited scalability
• Fault tolerance is hard to achieve
• Expensive, high-performance computing infrastructure
• Novel real-time processing architectures
• E.g., mini-batches in Spark Streaming
• E.g., complex event processing in Apache Flink
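The mini-batch idea mentioned above can be sketched in plain Python: a stream is cut into small batches and each batch is processed as an ordinary, finite data set. This is only an illustration of the concept, not the Spark Streaming API. The helper names `mini_batches` and `count_per_batch` are hypothetical, and batches here are cut by event count, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice


def mini_batches(stream, batch_size):
    """Yield consecutive lists of at most batch_size events from a stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_per_batch(stream, batch_size):
    """Process each mini-batch independently: count events by type."""
    return [Counter(batch) for batch in mini_batches(stream, batch_size)]


# Hypothetical event stream for illustration.
events = ["click", "view", "click", "buy", "view", "view", "click"]
for i, counts in enumerate(count_per_batch(events, batch_size=3)):
    print(i, dict(counts))
```

Because each batch is a small finite data set, ordinary batch logic can be reused for streaming, at the cost of a latency of at least one batch.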
Scalable analytic algorithms
• Challenges
• Big volume
• Big dimensionality
• Real-time processing
• Scaling up machine learning algorithms
• Adapting the algorithm to handle big data on a single machine
• E.g., sub-sampling
• E.g., principal component analysis
• E.g., feature extraction and feature selection
• Scaling up algorithms by parallelism
• E.g., k-NN classification based on MapReduce
• E.g., scaling up support vector machines (SVM) with a divide-and-conquer
approach
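As one illustration of scaling up by parallelism, the sketch below mimics a MapReduce-style k-NN classifier in plain Python. Each partition finds its own k nearest neighbours (the map step), and a reduce step merges those candidates and takes a majority vote. The function names, the toy data, and the use of squared Euclidean distance are assumptions for the example; no MapReduce framework is involved.

```python
import heapq
from collections import Counter


def map_local_knn(partition, query, k):
    """Map step: the k nearest (squared distance, label) pairs in one partition."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in partition
    ]
    return heapq.nsmallest(k, scored)


def reduce_knn(local_results, k):
    """Reduce step: merge local candidates, keep the global k nearest, then vote."""
    merged = heapq.nsmallest(k, (pair for part in local_results for pair in part))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


# Toy training data split across two partitions (hypothetical values).
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((4.8, 5.1), "b"), ((5.2, 4.9), "b")],
]
query = (4.9, 5.0)
local = [map_local_knn(p, query, k=3) for p in partitions]
print(reduce_knn(local, k=3))
```

The approach is exact because the global k nearest neighbours are always contained in the union of the per-partition k nearest, so each partition only needs to send k candidates to the reducer.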
E.g., the curse of dimensionality
• The number of samples required (to achieve the same accuracy)
grows exponentially with the number of variables!
• In practice, the number of training examples is fixed!
• => the classifier's performance will usually degrade for a large
number of features!
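A simple way to see the exponential growth is to count grid cells: if each variable is split into a fixed number of bins, covering every cell with at least one sample needs bins^dimensions samples. The sketch below works through that arithmetic. The choice of 10 bins per variable and the name `samples_needed` are illustrative assumptions.

```python
def samples_needed(bins_per_dim, num_dims):
    """Number of cells in a grid with bins_per_dim bins along each of num_dims axes.

    One sample per cell is the minimum needed to cover the whole grid.
    """
    return bins_per_dim ** num_dims


for d in (1, 2, 3, 10):
    print(f"{d} variable(s): {samples_needed(10, d):,} samples")
```

With a fixed training set, the same count spreads over exponentially more cells as variables are added, so most cells end up empty and estimates in them become unreliable.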
Utilization and interpretability of big data
• Domain expertise is needed to identify problems and
interpret analytics results
• Scalable visualization of millions of data points
• to facilitate their interpretation and
understanding
Privacy and security
Big data job trends
Talent shortage in big data
Big data skill set
How to land big data related jobs
• Learn to code
• Coursera
• Udacity
• Freecodecamp
• Codecademy
• Math, Stats and machine learning
• Kaggle
• Hadoop, NoSQL, Spark
• Visualization and Reporting
• Tableau
• Pentaho
• Meetup & Share
• Find a mentor
• Internships, projects
Data science method
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
[Figure: precision vs. % answered; curves dated 12/2007, 5/2008, 8/2008, 12/2008, 5/2009, 10/2009, and 4/2010, plus a baseline]
Cleaning big data: most time-consuming,
least enjoyable data science task
• Data preparation accounts for about 80% of the work of
data scientists
source: https://www.forbes.com/
• 57% of data scientists regard cleaning and organizing
data as the least enjoyable part of their work and 19%
say this about collecting data sets.
Online courses
• https://www.coursera.org/learn/nosql-database-systems
• https://who.rocq.inria.fr/Vassilis.Christophides/Big/index.htm
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data
• https://www.coursera.org/learn/big-data-integration-processing?specialization=big-data
• https://www.coursera.org/learn/big-data-management?specialization=big-data
• https://www.coursera.org/learn/hadoop
• https://www.coursera.org/learn/scala-spark-big-data
Thank you for your attention!