UNIT-1: Overview of Big Data
The quantity of data created by humans is increasing rapidly every year as new technologies, devices, and communication channels such as social networking
sites are introduced. Big data is a collection of datasets so large that they cannot be handled with traditional computing methods. It is not a single technique
or tool; rather, it has evolved into a broad subject covering a variety of tools, techniques, and frameworks. Data, in this context, means the quantities,
characters, or symbols on which a computer performs operations and which can be stored and transmitted as electrical signals and recorded on magnetic,
optical, or mechanical media.
3. Variety:
o The diversity of data formats, such as structured (databases), semi-structured (JSON, XML), and unstructured (text, videos, images).
o Today's data comes in many formats, from structured, numeric data in traditional databases to unstructured text, video and images from diverse
sources like social media and video surveillance. This variety demands flexible data management systems that can handle and integrate disparate data
types for comprehensive analysis. NoSQL databases, data lakes and schema-on-read technologies provide the flexibility needed to accommodate
the diverse nature of big data.
4. Veracity:
o The reliability or quality of the data.
o Examples: Handling noisy or incomplete datasets to ensure accurate insights.
o Data reliability and accuracy are critical, as decisions based on inaccurate or incomplete data can lead to negative outcomes. Veracity refers to the
data's trustworthiness, encompassing data quality, noise and anomaly detection issues. Techniques and tools for data cleaning, validation and
verification are integral to ensuring the integrity of big data, enabling organizations to make better decisions based on reliable information.
5. Value:
o The actionable insights or benefits derived from data analysis.
o Example: Personalized recommendations on e-commerce platforms.
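The ideas of variety (schema-on-read) and veracity (cleaning and validation) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the sample records, the field names user, amount and country, and the helper read_with_schema are all hypothetical and do not come from any particular tool.

```python
import json

# Hypothetical raw records, stored as-is and interpreted only when read.
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "country": "IN"}',
    '{"user": "b2", "amount": 5}',
    '{"user": "c3", "amount": "n/a", "country": "US"}',
    'not valid json',
    '{"amount": 12.5, "country": "DE"}',
]

def read_with_schema(line):
    """Apply an assumed schema at read time; return a clean dict or None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # noisy input that cannot be parsed at all
    if not isinstance(record, dict) or "user" not in record:
        return None  # incomplete record: the required field is missing
    try:
        amount = float(record.get("amount"))
    except (TypeError, ValueError):
        return None  # amount is missing or not numeric
    return {
        "user": record["user"],
        "amount": amount,
        "country": record.get("country", "unknown"),  # optional field
    }

clean = [r for r in map(read_with_schema, RAW_LINES) if r is not None]
rejected = len(RAW_LINES) - len(clean)
print(clean)
print("rejected:", rejected)
```

Here the raw lines keep whatever shape they arrived in, and the schema is enforced only when the data is read. Records that fail validation are counted rather than silently dropped, so data quality can be tracked.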
1. Social Media:
o Data from platforms like Facebook, Twitter, Instagram.
o Includes likes, shares, comments, posts, and multimedia.
2. Internet of Things (IoT):
o Data from connected devices like smart thermostats, fitness trackers, and industrial sensors.
3. E-commerce:
o Purchase history, customer reviews, and browsing patterns.
4. Healthcare:
o Electronic Health Records (EHRs), medical imaging, and genetic data.
5. Finance:
o Stock market data, credit card transactions, and fraud detection systems.
6. Telecommunications:
o Call logs, text data, and customer service records.
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale
data processing.
3. NoSQL Databases:
NoSQL databases gained prominence around 2009; they provide a flexible way to store and retrieve unstructured data.
4. Cloud Computing:
Cloud computing lets companies store their important data in remote data centers, which reduces their infrastructure and
maintenance costs.
5. Machine Learning:
Machine learning algorithms learn from large datasets and analyze huge amounts of data to extract meaningful insights.
This has led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in real time.
7. Edge Computing:
Edge computing is a distributed computing paradigm in which data is processed at the edge of the network, closer to
the source of the data.
Overall, big data technology has come a long way since the early days of data warehousing. The introduction of Hadoop, NoSQL databases, cloud computing,
machine learning, data streaming, and edge computing has revolutionized how we store, process, and analyze large volumes of data. As technology evolves, we
can expect Big Data to play a very important role in various industries.
1. Business Decision-Making:
o Informed decisions based on data-driven insights.
o Example: Predicting customer churn or optimizing inventory.
2. Personalization:
o Tailored recommendations in e-commerce, entertainment (Netflix, Spotify), and social media.
3. Fraud Detection:
o Real-time analysis of transactions to prevent fraudulent activities.
4. Healthcare Advancements:
o Predicting diseases, improving patient care, and advancing precision medicine.
5. Urban Planning and Smart Cities:
o Traffic management, energy efficiency, and public safety.
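As a rough illustration of the real-time analysis mentioned under Fraud Detection, the sketch below flags a transaction when its amount is far above the average of the most recent ones. It is a toy rule under stated assumptions, not a production fraud model: the function flag_suspicious, the window size, the factor of 3 and the sample amounts are all hypothetical.

```python
from collections import deque

def flag_suspicious(amounts, window=5, factor=3.0):
    """Return indexes of amounts that exceed factor * mean of the last `window` amounts."""
    recent = deque(maxlen=window)  # sliding window over the stream
    flagged = []
    for index, amount in enumerate(amounts):
        # Only judge a transaction once the window is full.
        if len(recent) == window:
            mean = sum(recent) / window
            if amount > factor * mean:
                flagged.append(index)
        recent.append(amount)  # oldest value drops out automatically
    return flagged

# Hypothetical stream of transaction amounts for one card.
stream = [20, 25, 22, 18, 24, 500, 21, 23]
suspicious = flag_suspicious(stream)
print(suspicious)
```

Each transaction is examined once, as it arrives, and only a small window is kept in memory, which is what makes this style of check suitable for streams. Note that a flagged amount still enters the window and raises the average for the next few transactions; a real system would likely handle that case differently.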
1. Storage:
o Distributed systems like Hadoop Distributed File System (HDFS).
o Cloud storage solutions like AWS S3, Google Cloud Storage, Azure Blob.
2. Processing:
o Batch Processing: Hadoop MapReduce, Apache Spark.
o Real-Time Processing: Apache Kafka, Apache Flink.
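The batch-processing model behind Hadoop MapReduce can be sketched in plain Python as a word count. This is a single-machine illustration of the map, shuffle and reduce steps, not Hadoop's actual API; the function names map_phase, shuffle, reduce_phase and word_count are assumptions chosen for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data streams in real time"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold each block of data, and the shuffle moves the grouped pairs across the network to the reducers. The sketch keeps the same three steps but runs them in one process.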
1. Retail:
o Customer behavior analysis, inventory management, and targeted marketing.
2. Banking and Finance:
o Risk management, fraud detection, and algorithmic trading.
3. Healthcare:
o Genomic data analysis, patient monitoring, and drug discovery.
4. Media and Entertainment:
o Content recommendation, audience segmentation, and trend forecasting.
5. Energy Sector:
o Smart grid optimization and renewable energy management.
➢ Better-informed decisions
With big data analytics, organizations can uncover previously hidden trends, patterns and correlations. A deeper understanding equips leaders and
decision-makers with the information needed to strategize effectively, enhancing business decision-making in supply chain management, e-commerce,
operations and overall strategic direction.
➢ Cost savings
Big data analytics drives cost savings by identifying business process efficiencies and optimizations. Organizations can pinpoint wasteful expenditures
by analyzing large datasets, streamlining operations and enhancing productivity. Moreover, predictive analytics can forecast future trends, allowing
companies to allocate resources more efficiently and avoid costly missteps.
➢ Data scientist
Data scientists analyze complex digital data to assist businesses in making decisions. Using their data science training and advanced analytics
technologies, including machine learning and predictive modeling, they uncover hidden insights in data.
➢ Data analyst
Data analysts turn data into information and information into insights. They use statistical techniques to analyze and extract meaningful trends from
data sets, often to inform business strategy and decisions.
➢ Data engineer
Data engineers prepare, process and manage big data infrastructure and tools. They also develop, maintain, test and evaluate data solutions within
organizations, often working with massive datasets to assist in analytics projects.
➢ Data architect
Data architects design, create, deploy and manage an organization's data architecture. They define how data is stored, consumed, integrated
and managed by different data entities and IT systems.
• Edge Computing: Processing data closer to its source for faster analytics.
• AI and ML Integration: Automating insights and enabling predictive analytics.
• Quantum Computing: Addressing the challenges of complex Big Data computations.
• Sustainability: Optimizing Big Data systems to reduce environmental impact.