Big Data
Assignment -1
1. Define big data and explain how it differs from small data.
Big Data: Refers to extremely large datasets that cannot be managed, processed, or analyzed
using traditional data processing tools due to their size, complexity, and speed.
Small Data: Refers to manageable datasets that can be analyzed and processed using standard
tools.
Differences:
1. Scale: Big data involves petabytes or exabytes, whereas small data is typically megabytes or gigabytes.
2. Processing: Big data requires advanced technologies (e.g., Hadoop, Spark), whereas small data can be handled with simpler tools (e.g., Excel).
3. Applications: Big data is used in complex analytics (AI, predictive modeling), whereas small data supports traditional decision-making.
4. Speed: Big data often needs real-time processing; small data can be handled with slower batch processing.
2. What are the three main classifications of data? Provide an example for each.
1. Structured Data: Organized data in rows and columns (e.g., customer databases).
2. Semi-structured Data: Contains tags or markers for organization but lacks a rigid schema (e.g., JSON files).
3. Unstructured Data: Has no predefined format or organization (e.g., images, videos, free text).
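For illustration, a small Python sketch (all records and values below are made up) of how structured and semi-structured data are read differently:

```python
import csv
import json
import io

# Structured: fixed rows and columns; every record has the same fields.
structured = io.StringIO("id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n")
for row in csv.DictReader(structured):
    print(row["id"], row["name"], row["city"])

# Semi-structured: self-describing keys but no rigid schema;
# the second record has a nested field the first one lacks.
records = [
    '{"id": 1, "name": "Asha"}',
    '{"id": 2, "name": "Ravi", "address": {"city": "Delhi"}}',
]
for text in records:
    doc = json.loads(text)
    print(doc.get("name"), doc.get("address", {}).get("city", "unknown"))
```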
Web Data: Data collected from the internet, including websites, social media, and online
transactions.
Significance:
6. Briefly describe the four Vs of big data: Volume, Velocity, Variety, and Veracity.
10. Name any two big data technologies and briefly explain their purpose.
Semi-structured Data: Has tags but lacks a strict schema (e.g., JSON, XML files).
12. Discuss the business value of big data with real-world applications in any two
industries.
Healthcare:
1. Predictive analytics for early diagnosis and treatment planning.
Finance:
1. Fraud detection.
13. What are the challenges of managing big data? Explain any two challenges in detail.
14. Explain the role of big data analytics in improving decision-making processes.
15. Discuss the differences between big data and small data in terms of characteristics,
scale, and applications.
1. Characteristics: Big data is high-volume, high-velocity, and varied; small data is smaller, uniform, and well structured.
2. Scale: Petabytes or exabytes versus megabytes or gigabytes.
3. Applications: Big data powers AI; small data aids simple analysis.
16. What are the key big data processing architectures? Explain any one in detail.
1. Batch processing (e.g., Hadoop MapReduce).
2. Stream processing (e.g., Spark Streaming).
3. Lambda Architecture: Combines batch and stream processing for real-time insights.
Lambda Architecture: A batch layer periodically recomputes views over all historical data, a speed layer maintains incremental views over recent events, and a serving layer merges both to answer queries (see the sketch below).
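A minimal, self-contained sketch of this idea using toy in-memory data rather than a real cluster (the users and click counts below are hypothetical):

```python
from collections import Counter

# Hypothetical event stream: (user, clicks) pairs.
historical_events = [("asha", 3), ("ravi", 5), ("asha", 2)]
recent_events = [("ravi", 1), ("meera", 4)]

def batch_layer(events):
    """Recomputed periodically over the full, immutable master dataset."""
    totals = Counter()
    for user, clicks in events:
        totals[user] += clicks
    return totals

def speed_layer(events):
    """Maintains incremental views over data the batch layer has not seen yet."""
    recent = Counter()
    for user, clicks in events:
        recent[user] += clicks
    return recent

def serving_layer(batch_view, realtime_view):
    """Answers queries by merging the batch and real-time views."""
    return batch_view + realtime_view

merged = serving_layer(batch_layer(historical_events), speed_layer(recent_events))
print(merged["ravi"])  # 6 -> batch total (5) plus real-time count (1)
```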
17. Describe the importance of big data handling techniques and their role in managing
data complexity.
19. Explain the role of Hadoop or any other big data technology in processing and managing
large datasets.
20. What is the future scope of big data? Highlight its potential advancements and
challenges.
Advancements:
1. AI-driven analytics.
Challenges:
Hadoop fundamentals:
1. MapReduce.
2. Scalability.
3. Fault tolerance.
4. Open-source.
Reduce task:
1. Aggregates intermediate key-value pairs.
2. Produces the final output.
Hadoop ecosystem tools:
1. Apache Hive.
2. Apache Pig.
31. Explain the HDFS architecture, including data storage and physical organization.
32. Discuss the key features of Hadoop that make it suitable for handling big data.
1. Scalability.
2. Fault tolerance.
3. Parallel processing.
33. What are the main commands used in HDFS? Provide examples of any three
commands.
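For illustration, three commonly used HDFS shell commands (-ls, -put, -cat), shown here invoked from Python; the paths and file names are hypothetical and assume a configured Hadoop client on the PATH.

```python
import subprocess

commands = [
    ["hdfs", "dfs", "-ls", "/user/demo"],                      # list a directory
    ["hdfs", "dfs", "-put", "local_data.csv", "/user/demo/"],  # copy a local file into HDFS
    ["hdfs", "dfs", "-cat", "/user/demo/local_data.csv"],      # print a file's contents
]

for cmd in commands:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)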
35. Explain the execution flow of a MapReduce job, including the roles of the Map task and
Reduce task.
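A plain-Python sketch of this flow with toy data (single process, no Hadoop involved): the Map task emits intermediate key-value pairs, the framework shuffles and sorts them by key, and the Reduce task aggregates each group into the final output.

```python
from collections import defaultdict

documents = ["big data tools", "big data analytics"]

# Map task: emit intermediate (key, value) pairs, here (word, 1).
intermediate = []
for doc in documents:
    for word in doc.split():
        intermediate.append((word, 1))

# Shuffle/sort: group intermediate pairs by key before they reach the reducers.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce task: aggregate the values for each key and produce the final output.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```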
Assignment -2
1. What is Hadoop YARN, and how does it improve upon the Hadoop 1.x execution model?
Hadoop YARN (Yet Another Resource Negotiator): A resource management layer introduced in Hadoop 2.x. It separates resource management from job scheduling and execution, so frameworks other than MapReduce can share the cluster, improving scalability and utilization over the Hadoop 1.x model.
2. Discuss the Hadoop 2.x execution model and how it differs from Hadoop 1.x.
1. Hadoop 1.x: Relied solely on MapReduce for processing and had limited scalability.
2. Hadoop 2.x: Introduces YARN, separating resource management from data processing.
3. Framework Support: 2.x supports multiple data processing frameworks, unlike 1.x.
3. Provide an overview of Hadoop ecosystem tools such as Hive, Pig, or HBase and their
applications.
1. Hive: Data warehousing tool for querying large datasets using SQL-like syntax.
2. Pig: High-level platform for analyzing data using the Pig Latin scripting language.
3. HBase: NoSQL database for real-time read/write access to big data.
5. Explain the physical organization of HDFS and how data is stored across nodes in the
cluster.
1. Data Blocks: Files are divided into blocks (default size: 128 MB) that are distributed across DataNodes and replicated (default factor: 3) for fault tolerance; a 1 GB file, for example, is stored as eight 128 MB blocks.
6. What is NoSQL, and how does it differ from traditional relational databases?
NoSQL: Non-relational databases designed for flexible data models and horizontal scaling.
Differences:
1. Schema: NoSQL databases are typically schema-less, whereas relational databases enforce fixed schemas.
2. Data Models: NoSQL supports key-value, document, column-family, and graph models.
7. Name the main types of NoSQL databases and provide an example of each.
1. Key-Value Stores (e.g., Redis).
2. Document Stores (e.g., MongoDB).
3. Column-Family Stores (e.g., Apache HBase).
4. Graph Databases (e.g., Neo4j).
14. What is MongoDB, and how is it different from other NoSQL databases?
1. Schema-less, document-oriented design that stores records as JSON-like (BSON) documents.
16. Compare SQL and NoSQL databases in terms of scalability, structure, and use cases.
1. Scalability: SQL databases typically scale vertically; NoSQL databases scale dynamically and horizontally across commodity nodes.
2. Structure: SQL enforces fixed schemas and tables; NoSQL uses flexible data models.
3. Use Cases: SQL for structured transactional data; NoSQL for diverse formats and workloads such as real-time personalization.
20. What is the role of NoSQL in managing unstructured and semi-structured data?
22. Discuss the MongoDB query language and how it differs from SQL.
1. JSON-like syntax: queries are written as filter documents rather than SQL statements (see the example below).
Common JSON/BSON value types:
1. String: "text".
2. Number: 42.
3. Boolean: true.
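A brief sketch of this document-style querying using the pymongo driver; the connection string, database, collection, and record values are assumptions for illustration.

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance; names and values below are made up.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Insert a schema-less, JSON-like document.
orders.insert_one({"customer": "Asha", "amount": 42.5, "items": ["pen", "book"]})

# Query with a filter document instead of SQL:
# roughly equivalent to: SELECT * FROM orders WHERE amount > 40;
for doc in orders.find({"amount": {"$gt": 40}}):
    print(doc["customer"], doc["amount"])
```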
24. How does MongoDB handle big data storage and retrieval?
25. Write a brief note on the use of MongoDB in industry with real-world applications.
29. What is Pig Latin, and why is it important in big data processing?
31. What are the key components of the Pig data model?
1. Atom: a single atomic value (e.g., an int or chararray).
2. Tuple: an ordered set of fields.
3. Bag: an unordered collection of tuples.
4. Map: a set of key-value pairs.
1. Schema-based.
1. High latency.
3. What are the steps involved in executing a Pig Latin script? Discuss
the commands used.
1. Load Data: Use LOAD to read data.
2. Transform Data: Apply filters, group, and aggregate.
3. Store Results: Use STORE to write output.
4. Execution: Run the script interactively in the Grunt shell or in batch mode, as sketched below.
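A minimal sketch of this LOAD / FILTER / GROUP / STORE flow: a hypothetical Pig Latin script is written out from Python and run in local batch mode (the file names, fields, and threshold are made up, and a local Pig installation is assumed).

```python
import subprocess
from pathlib import Path

script = """
logs    = LOAD 'logs.csv' USING PigStorage(',') AS (user:chararray, bytes:int);
big     = FILTER logs BY bytes > 1000;
grouped = GROUP big BY user;
totals  = FOREACH grouped GENERATE group AS user, SUM(big.bytes) AS total_bytes;
STORE totals INTO 'totals_out';
"""

Path("byte_totals.pig").write_text(script)
# -x local runs the script in local mode (batch); running `pig` alone opens the Grunt shell.
subprocess.run(["pig", "-x", "local", "byte_totals.pig"], check=True)
```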
6. What are the key data types and file formats supported by Hive?
Provide examples.
Data Types:
1. Primitives: INT, STRING, FLOAT.
2. Complex: ARRAY, MAP, STRUCT.
File Formats:
1. Text File.
2. Sequence File.
3. ORC (Optimized Row Columnar).
4. Parquet.
7. Explain the Hive data model and its structure. How does it handle
data storage?
Data Model:
1. Databases: Logical namespaces.
2. Tables: Store data.
3. Partitions: Divide tables for efficient queries.
4. Buckets: Sub-divide partitions.
Data Storage: Table, partition, and bucket data are stored as files and directories in HDFS (see the sketch below).
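A sketch of how this model can be exercised, using Spark's HiveQL-compatible SQL for illustration; the database, table, columns, and partition year are hypothetical, and Hive support must be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-model-demo").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")  # database: logical namespace
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id INT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_year INT)  -- partitions let queries skip irrelevant data
""")

# Restricting the query to one partition prunes all the others at read time.
spark.sql("SELECT SUM(amount) FROM sales_db.orders WHERE order_year = 2024").show()
```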
13. Explain how MapReduce jobs are run on HBase for table
input/output.
1. Input data is read using TableInputFormat.
2. Mapper processes HBase rows.
3. Reducer writes output using TableOutputFormat.
21. Explain the architecture and key features of HBase. How is it used
for big data storage and retrieval in the Hadoop ecosystem?
Architecture:
1. Master-RegionServer Model: Master manages metadata;
RegionServers handle reads/writes.
2. HDFS Integration: Data stored in HDFS.
3. ZooKeeper: Coordinates the cluster (e.g., master election and RegionServer tracking).
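For illustration, a small read/write sketch against HBase using the happybase client; it assumes an HBase Thrift server on localhost, and the table, column family, and row key are hypothetical.

```python
import happybase

connection = happybase.Connection("localhost")
table = connection.table("users")

# Write: row key plus column-family:qualifier -> value (HBase stores raw bytes).
table.put(b"user#1001", {b"info:name": b"Asha", b"info:city": b"Pune"})

# Read a single row back by its key.
row = table.row(b"user#1001")
print(row[b"info:name"])
```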
26. What is Spark, and how does it facilitate data analysis and
processing in comparison to Hadoop MapReduce?
1. In-memory processing.
2. Supports real-time analytics.
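A short PySpark sketch of the in-memory model: the cached word counts are reused by two actions without re-reading the input, which a chain of MapReduce jobs would have to do from disk (the HDFS path is hypothetical).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-vs-mapreduce").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/logs.txt")

words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# cache() keeps the RDD in memory, so both actions below reuse it
# instead of recomputing from the input between stages.
counts.cache()
print(counts.count())
print(counts.take(5))
```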
29. What are the key steps involved in setting up and running Spark in a
distributed environment?
1. Set up Hadoop (HDFS/YARN) and install Spark on every node.
2. Configure cluster nodes: master URL, workers, and executor memory/cores.
3. Start the master and worker daemons, then submit applications (e.g., with spark-submit), as sketched below.
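A configuration sketch only; the master URL and resource settings are hypothetical and would normally come from spark-submit options or spark-defaults.conf.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")
    .master("spark://master-node:7077")      # standalone cluster master (use "yarn" on YARN)
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

print(spark.range(1_000_000).count())  # trivial distributed job to verify the setup
spark.stop()
```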
30. Compare and contrast the capabilities of Spark and Hadoop for big
data analytics.