
Assignment -1

1. Define big data and explain how it differs from small data.

Big Data: Refers to extremely large datasets that cannot be managed, processed, or analyzed
using traditional data processing tools due to their size, complexity, and speed.
Small Data: Refers to manageable datasets that can be analyzed and processed using standard
tools.

Differences:

1. Scale: Big data involves petabytes or exabytes, whereas small data is typically
megabytes or gigabytes.

2. Variety: Big data includes diverse data types (structured, semi-structured, unstructured), while small data is usually structured.

3. Processing: Big data requires advanced technologies (e.g., Hadoop, Spark), whereas small data can be handled with simpler tools (e.g., Excel).

4. Applications: Big data is used in complex analytics (AI, predictive modeling), whereas
small data supports traditional decision-making.

5. Speed: Big data often requires real-time or near-real-time processing; small data is typically processed in slower batches.

2. What are the three main classifications of data? Provide an example for each.

1. Structured Data: Organized data in rows and columns (e.g., customer databases).

2. Semi-structured Data: Contains tags or markers for organization but lacks a rigid
schema (e.g., JSON files).

3. Unstructured Data: No predefined format (e.g., videos, social media posts).

3. Explain the concept of "web data" and its significance.

Web Data: Data collected from the internet, including websites, social media, and online
transactions.

Significance:

1. Enables insights into user behavior and trends.

2. Fuels targeted advertising and personalization.

3. Supports web-based predictive analytics.

4. Facilitates competitive analysis through web scraping.

5. Enhances e-commerce recommendations and customer experience.


4. What challenges do conventional systems face in handling big data?

1. Scalability: Difficulty in scaling systems to manage growing datasets.

2. Speed: Inability to process real-time data streams efficiently.

3. Diversity: Challenges in managing diverse data formats.

4. Storage: Limited capacity to store massive datasets cost-effectively.

5. Security: Ensuring data privacy and protection.

5. Why is big data important in today’s digital world?

1. Drives informed decision-making with advanced analytics.

2. Powers AI and machine learning models.

3. Enhances customer experience through personalization.

4. Facilitates innovation across industries (e.g., healthcare, finance).

5. Supports predictive and real-time analytics.

6. Briefly describe the four Vs of big data: Volume, Velocity, Variety, and Veracity.

1. Volume: The massive amount of data generated daily.

2. Velocity: The speed at which data is generated and processed.

3. Variety: The different types of data (text, images, videos).

4. Veracity: The accuracy and trustworthiness of the data.

7. What is unstructured data? Provide two real-world examples.

Unstructured Data: Data without a predefined schema or organization.


Examples:

1. Social media posts (e.g., tweets).

2. Video and audio files.

8. What are the primary types of big data?

1. Structured Data: Relational databases.

2. Semi-structured Data: XML, JSON files.

3. Unstructured Data: Videos, images, documents.


9. What is meant by the complexity of big data?

1. Data Integration: Combining data from multiple sources.

2. Processing Challenges: Handling real-time and batch processing.

3. Storage: Managing cost-effective storage solutions.

4. Analysis: Extracting meaningful insights from diverse formats.

5. Security: Protecting sensitive information.

10. Name any two big data technologies and briefly explain their purpose.

1. Hadoop: Distributed storage and processing of large datasets.

2. Apache Spark: Real-time data processing and analytics.

11. Differentiate between structured, semi-structured, and unstructured data with appropriate examples.

Data Type | Characteristics | Example
Structured Data | Organized in rows and columns. | Customer database.
Semi-structured Data | Has tags but lacks a strict schema. | JSON, XML files.
Unstructured Data | No predefined format. | Images, videos, emails.

12. Discuss the business value of big data with real-world applications in any two
industries.

Healthcare:

1. Predicting patient outcomes.

2. Personalized treatment plans.

3. Managing health records efficiently.

Finance:

1. Fraud detection.

2. Credit risk assessment.

3. Automated trading systems.

13. What are the challenges of managing big data? Explain any two challenges in detail.

1. Data Quality: Ensuring accuracy and reliability of large datasets.


2. Infrastructure Costs: High costs for scalable storage and processing systems.

14. Explain the role of big data analytics in improving decision-making processes.

1. Provides insights from large datasets.

2. Supports predictive and prescriptive analysis.

3. Enables real-time decision-making.

4. Identifies trends and opportunities.

5. Improves operational efficiency.

15. Discuss the differences between big data and small data in terms of characteristics,
scale, and applications.

1. Characteristics: Big data includes diverse formats; small data is structured.

2. Scale: Big data is massive; small data is manageable.

3. Applications: Big data powers AI; small data aids simple analysis.

16. What are the key big data processing architectures? Explain any one in detail.

1. Batch Processing: Processes data in batches (e.g., Hadoop).

2. Stream Processing: Processes real-time data (e.g., Apache Spark).

3. Lambda Architecture: Combines batch and stream processing for real-time insights.

Lambda Architecture (a minimal sketch follows the bullets below):

• Batch layer stores all data.

• Speed layer processes real-time data.

• Serving layer provides analytics results.
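
A minimal, illustrative Python sketch of the three layers (the variable and function names are hypothetical and not tied to any specific framework): the batch view holds counts precomputed from all historical data, the speed view holds counts for recent events, and the serving layer merges the two to answer a query.

# Hypothetical illustration of the Lambda Architecture layers
batch_view = {"page_a": 10000, "page_b": 7500}   # batch layer: precomputed from all historical data
speed_view = {"page_a": 42, "page_c": 7}         # speed layer: counts from recent, not-yet-batched events

def serve_count(page):
    # Serving layer: merge batch and real-time views to answer a query
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(serve_count("page_a"))  # 10042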

17. Describe the importance of big data handling techniques and their role in managing
data complexity.

1. Organize diverse datasets.

2. Enable efficient processing and analysis.

3. Reduce storage costs.

4. Improve scalability and speed.

5. Enhance data security.


18. Discuss the applications of big data analytics in healthcare, finance, or e-commerce.

Healthcare: Predictive diagnostics, drug development.


Finance: Fraud detection, algorithmic trading.
E-commerce: Personalization, supply chain optimization.

19. Explain the role of Hadoop or any other big data technology in processing and managing
large datasets.

1. Stores data in a distributed manner.

2. Processes data using the MapReduce model.

3. Handles diverse data formats.

4. Scales to accommodate growing datasets.

5. Provides fault tolerance.

20. What is the future scope of big data? Highlight its potential advancements and
challenges.

Advancements:

1. AI-driven analytics.

2. Quantum computing for faster processing.

3. Real-time edge computing.

Challenges:

1. Data privacy issues.

2. Managing increasing data volumes.

21. What is Hadoop, and why is it significant in big data processing?

1. Open-source framework for distributed storage/processing.

2. Handles large datasets efficiently.

3. Provides scalability and fault tolerance.

4. Supports diverse data types.

5. Foundation for big data tools.

22. Name the core components of the Hadoop framework.

1. HDFS (Hadoop Distributed File System).


2. YARN (Yet Another Resource Negotiator).

3. MapReduce.

23. What is HDFS, and what role does it play in Hadoop?

1. Distributed file system for storing large datasets.

2. Divides data into blocks across multiple nodes.

3. Ensures data redundancy and fault tolerance.

24. Explain the purpose of the NameNode in HDFS.

1. Manages metadata for HDFS.

2. Tracks file locations on DataNodes.

3. Coordinates file operations.

25. What are the key features of Hadoop?

1. Distributed storage and processing.

2. Scalability.

3. Fault tolerance.

4. Open-source.

5. Flexible data handling.

26. Define MapReduce and its role in the Hadoop ecosystem.

1. Programming model for processing large datasets.

2. Splits tasks into Map and Reduce phases.

3. Parallel processing for speed.

27. What is the function of a "Map" task in MapReduce?

1. Processes input data in parallel.

2. Generates key-value pairs.

28. Briefly explain the "Reduce" task in the MapReduce model.

1. Aggregates intermediate key-value pairs.

2. Produces the final output.

29. What is Hadoop YARN, and why was it introduced?

1. Resource manager for Hadoop.

2. Supports multiple processing engines.

3. Enhances cluster efficiency.

30. List any two tools from the Hadoop ecosystem.

1. Apache Hive.

2. Apache Pig.

31. Explain the HDFS architecture, including data storage and physical organization.

1. NameNode: Manages metadata.

2. DataNodes: Store actual data blocks.

3. Block Replication: Ensures fault tolerance.

32. Discuss the key features of Hadoop that make it suitable for handling big data.

1. Scalability.

2. Fault tolerance.

3. Parallel processing.

4. Flexible data handling.

33. What are the main commands used in HDFS? Provide examples of any three
commands.

1. hdfs dfs -ls /: Lists files in a directory.

2. hdfs dfs -put localfile /hdfsdir: Uploads a file.

3. hdfs dfs -rm /hdfsdir/file: Deletes a file.

34. Describe the MapReduce programming model with an example.

1. Map Phase: Converts data into key-value pairs.


2. Reduce Phase: Aggregates values based on keys.
Example: Word count in a text file (a minimal sketch follows below).
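
As a hedged sketch of this example, here are a mapper and a reducer written as Hadoop Streaming-style Python scripts (the file names mapper.py and reducer.py and the submission command are assumptions for illustration). The mapper emits (word, 1) pairs; the reducer receives its input sorted by key and sums the counts per word.

# mapper.py -- Map phase: emit one "word<TAB>1" pair per word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- Reduce phase: sum counts per word (input arrives sorted by key)
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

Such scripts are typically submitted through Hadoop's streaming jar with -mapper, -reducer, -input, and -output options; the exact jar path depends on the installation.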

35. Explain the execution flow of a MapReduce job, including the roles of the Map task and
Reduce task.

1. Input data is split into chunks.

2. Map Task: Processes chunks into key-value pairs.

3. Intermediate data is shuffled and sorted.

4. Reduce Task: Aggregates and finalizes output.

5. Final output is stored.

Assignment -2
1. What is Hadoop YARN, and how does it improve upon the Hadoop 1.x execution model?

Hadoop YARN (Yet Another Resource Negotiator): A resource management layer introduced
in Hadoop 2.x.

Improvements over Hadoop 1.x:

1. Decouples resource management and job scheduling from MapReduce.

2. Supports multiple processing frameworks (e.g., Spark, Storm).

3. Improves cluster utilization and scalability.

4. Enables dynamic allocation of resources.

5. Allows parallel processing of diverse workloads.

2. Discuss the Hadoop 2.x execution model and how it differs from Hadoop 1.x.

1. Hadoop 1.x: Relied solely on MapReduce for processing and had limited scalability.

2. Hadoop 2.x: Introduced YARN to manage resources and scheduling.

3. Framework Support: 2.x supports multiple data processing frameworks, unlike 1.x.

4. Cluster Utilization: Improved resource allocation in 2.x.

5. Fault Tolerance: Enhanced failover mechanisms in 2.x.

3. Provide an overview of Hadoop ecosystem tools such as Hive, Pig, or HBase and their
applications.

1. Hive: Data warehousing tool for querying large datasets using SQL-like syntax.

2. Pig: High-level platform for analyzing data using Pig Latin scripting language.
3. HBase: NoSQL database for real-time read/write access to big data.

4. What is the role of the Hadoop ecosystem in big data processing?

1. Supports storage, processing, and analysis of big data.

2. Provides tools for diverse data formats (structured, semi-structured, unstructured).

3. Enables distributed and fault-tolerant data management.

4. Facilitates real-time and batch processing.

5. Offers integration with advanced analytics tools.

5. Explain the physical organization of HDFS and how data is stored across nodes in the
cluster.

1. Data Blocks: Files are divided into blocks (default size: 128 MB).

2. Distributed Storage: Blocks are stored across multiple DataNodes.

3. Replication: Each block is replicated (default: 3 copies) for fault tolerance.

4. Metadata: Managed by NameNode.

5. Fault Tolerance: Automatic recovery from node failures.

6. What is NoSQL, and how does it differ from traditional relational databases?

NoSQL: A database system designed for unstructured or semi-structured data.

Differences:

1. Schema: NoSQL is schema-less; SQL uses a predefined schema.

2. Data Models: NoSQL supports key-value, document, column-family, and graph models.

3. Scalability: NoSQL is horizontally scalable; SQL is vertically scalable.

4. Flexibility: NoSQL is suitable for varied data types.

5. Performance: Optimized for high-speed read/write operations.

7. Name the main types of NoSQL databases and provide an example of each.

1. Key-Value Stores: Redis.

2. Document Stores: MongoDB.

3. Column-Family Stores: Apache Cassandra.

4. Graph Databases: Neo4j.


8. Why is NoSQL preferred over SQL for managing big data?

1. Handles unstructured and semi-structured data.

2. Scales horizontally for large datasets.

3. Supports high-speed transactions.

4. Flexible schema design.

5. Ideal for distributed systems.

9. List two advantages of using NoSQL databases.

1. Supports dynamic and unstructured data.

2. Scales easily with increasing data volumes.

10. What industries commonly use NoSQL databases?

1. E-commerce (e.g., product catalogs).

2. Social media platforms (e.g., user profiles).

3. Healthcare (e.g., patient records).

4. Finance (e.g., fraud detection).

5. Gaming (e.g., real-time scoring).

11. How does NoSQL handle unstructured or semi-structured data?

1. Schema-less architecture.

2. Flexible data models (key-value, document).

3. Optimized for hierarchical and nested data.

4. Allows dynamic addition of fields.

5. Stores data in formats such as JSON, XML, or BSON.

12. What is a NoSQL datastore?

1. A database optimized for storing and managing non-relational data.

2. Supports horizontal scaling.

3. Handles high-velocity, high-volume data.

4. Often used in distributed systems.


13. Briefly explain the NoSQL data architecture pattern.

1. Supports flexible schema design.

2. Distributes data across nodes.

3. Ensures eventual consistency.

4. Uses partitioning for scalability.

5. Optimized for specific data access patterns.

14. What is MongoDB, and how is it different from other NoSQL databases?

1. Document-based NoSQL database.

2. Stores data in JSON-like BSON format.

3. Optimized for hierarchical and semi-structured data.

4. Rich query language.

5. Focus on developer productivity.

15. Name any three features of MongoDB.

1. Schema-less design.

2. Indexing for fast queries.

3. Replication for high availability.

16. Compare SQL and NoSQL databases in terms of scalability, structure, and use cases.

1. Scalability: SQL scales vertically; NoSQL scales horizontally.

2. Structure: SQL has a fixed schema; NoSQL is flexible.

3. Use Cases: SQL for structured data; NoSQL for diverse formats.

17. Explain the different types of NoSQL databases.

1. Key-Value Stores: Redis for caching.

2. Document-Based: MongoDB for JSON-like data.

3. Column-Family: Cassandra for time-series data.

4. Graph Databases: Neo4j for relationship-based data.


18. Discuss the advantages of NoSQL databases in managing big data.

1. Handles high-volume data efficiently.

2. Supports distributed systems.

3. Adapts to unstructured data.

4. High performance for read/write operations.

5. Scales dynamically.

19. How does NoSQL contribute to modern web applications?

1. Real-time data processing.

2. Scales with user traffic.

3. Enhances personalization.

4. Optimizes search and analytics.

5. Simplifies development with flexible schemas.

20. What is the role of NoSQL in managing unstructured and semi-structured data?

1. Allows flexible schema changes.

2. Stores nested and hierarchical data efficiently.

3. Supports JSON/XML formats.

4. Enables faster query execution.

5. Optimized for dynamic datasets.

21. Explain the key features of MongoDB.

1. Stores data as BSON.

2. Supports indexing for efficient querying.

3. Provides replication and sharding.

22. Discuss the MongoDB query language and how it differs from SQL.

1. JSON-like syntax.

2. Supports aggregation pipelines.

3. Example: db.collection.find({field: value}); a pymongo sketch follows below.
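
As a hedged illustration of how this differs from SQL, the pymongo sketch below expresses a WHERE-style filter and a GROUP BY-style aggregation (the connection string, database, collection, and field names are assumptions for the example).

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
orders = client["shop"]["orders"]                  # hypothetical database and collection

# SQL: SELECT * FROM orders WHERE amount > 100;
for doc in orders.find({"amount": {"$gt": 100}}):
    print(doc)

# SQL: SELECT customer, SUM(amount) FROM orders GROUP BY customer;
pipeline = [{"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}}]
for row in orders.aggregate(pipeline):
    print(row)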


23. What are the various data types supported in MongoDB?

1. String: "text".

2. Number: 42.

3. Boolean: true.

4. Array: ["value1", "value2"].

5. Object: {key: "value"}.
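
A single document can mix all of these types. A hedged pymongo sketch (the database, collection, and field names are made up for illustration):

from pymongo import MongoClient

profiles = MongoClient("mongodb://localhost:27017")["demo"]["profiles"]  # hypothetical names

profiles.insert_one({
    "name": "Asha",                                 # String
    "age": 42,                                      # Number
    "active": True,                                 # Boolean
    "tags": ["vip", "beta"],                        # Array
    "address": {"city": "Pune", "zip": "411001"},   # Object (embedded document)
})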

24. How does MongoDB handle big data storage and retrieval?

1. Stores data in distributed nodes.

2. Uses replication for fault tolerance.

3. Shards data for scalability.

25. Write a brief note on the use of MongoDB in industry with real-world applications.

1. E-commerce: Product catalogs.

2. Healthcare: Patient data.

3. Social Media: User profiles.

26. What is Apache Pig, and what is its primary use?

1. A high-level platform for big data analysis.

2. Processes large datasets using Pig Latin.

27. List two features of Apache Pig.

1. Supports parallel processing.

2. Handles structured and unstructured data.

28. What is the role of the Grunt Shell in Apache Pig?

1. Interactive shell for executing Pig scripts.

2. Debugging and testing Pig commands.

29. What is Pig Latin, and why is it important in big data processing?

1. High-level scripting language for Apache Pig.


2. Simplifies data transformation tasks.

30. Briefly explain the architecture of Apache Pig.

1. Parser: Parses Pig Latin scripts.

2. Compiler: Generates execution plans.

3. Execution Engine: Executes plans on Hadoop.

31. What are the key components of the Pig data model?

1. Atom: Single data value.

2. Tuple: Ordered collection of fields.

3. Bag: Unordered collection of tuples.

32. What is Hive, and how does it differ from Pig?

1. Hive: SQL-like querying tool for structured data.

2. Pig: Scripting tool for diverse data formats.

33. Name two characteristics of Hive.

1. Schema-based.

2. Supports SQL-like queries.

34. What are the limitations of Hive?

1. High latency.

2. Not suitable for real-time data.

35. Name any two Hive built-in functions.

1. SUM(): Calculates sum.

2. AVG(): Calculates average.


Assignment -3
1. Explain the architecture of Apache Pig with a diagram. Discuss its
components and their roles.
Architecture Components:
1. Parser: Checks syntax and converts Pig Latin scripts into logical
plans.
2. Optimizer: Optimizes logical plans into physical execution plans.
3. Execution Engine: Executes the physical plan on Hadoop
MapReduce.
4. Grunt Shell: Interactive shell for script execution.
5. Pig Latin Scripts: High-level scripts for data processing.

2. Write a note on the Pig Latin scripting language. Provide an example of a simple script.
Pig Latin:
1. High-level data-flow scripting language for Apache Pig.
2. Handles structured, semi-structured, and unstructured data.
3. Simplifies data processing over MapReduce.
Example Script:
A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);
B = FILTER A BY age > 30;
STORE B INTO 'output';

3. What are the steps involved in executing a Pig Latin script? Discuss
the commands used.
1. Load Data: Use LOAD to read data.
2. Transform Data: Apply filters, group, and aggregate.
3. Store Results: Use STORE to write output.
4. Commands: Run the script in Grunt Shell or batch mode.

4. Describe the Hive architecture and its key components.


Key Components:
1. User Interface: CLI, JDBC/ODBC for interaction.
2. Metastore: Stores metadata about tables and schemas.
3. Driver: Manages query lifecycle and execution.
4. Query Compiler: Converts HiveQL to MapReduce tasks.
5. Execution Engine: Executes MapReduce jobs on Hadoop.

5. Compare Hive with traditional RDBMS in terms of schema, data processing, and query language.
1. Schema: Hive supports schema-on-read; RDBMS uses schema-on-write.
2. Data Processing: Hive processes batch data; RDBMS processes OLTP data.
3. Query Language: HiveQL for Hive; SQL for RDBMS.

6. What are the key data types and file formats supported by Hive?
Provide examples.
Data Types:
1. Primitives: INT, STRING, FLOAT.
2. Complex: ARRAY, MAP, STRUCT.
File Formats:
1. Text File.
2. Sequence File.
3. ORC (Optimized Row Columnar).
4. Parquet.

7. Explain the Hive data model and its structure. How does it handle
data storage?
Data Model:
1. Databases: Logical namespaces.
2. Tables: Store data.
3. Partitions: Divide tables for efficient queries.
4. Buckets: Sub-divide partitions.
Data Storage: Managed in HDFS.

8. Discuss the workflow of Hive integration with Hadoop.


1. User submits HiveQL query.
2. Query Compiler generates MapReduce tasks.
3. Execution Engine runs tasks on Hadoop.
4. Results are stored in HDFS.
5. Output is displayed to the user.

9. How are HiveQL queries executed? Provide examples of a basic query for creating and querying a table.
Steps:
1. Parse and compile query.
2. Optimize and execute on Hadoop.
3. Retrieve results.
Example:
CREATE TABLE employees (id INT, name STRING, salary FLOAT);
SELECT * FROM employees WHERE salary > 50000;
10. Compare the use cases of Apache Pig and Hive in big data analytics.
Highlight their strengths and limitations.

Feature | Apache Pig | Hive
Use Case | ETL workflows | Data warehousing
Strengths | Handles unstructured data | SQL-like queries for BI
Limitations | Steeper learning curve | Higher query latency

11. What is HBase, and how does it integrate with Hadoop?


1. NoSQL database on Hadoop.
2. Provides real-time read/write access.
3. Stores data in HDFS.
4. Works with MapReduce for analytics.
5. Integrates with Hive and Pig for querying.
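
As a hedged sketch of real-time read/write access from Python, using the third-party happybase client (an assumption, not part of HBase itself; the table name, column family, and connection details are illustrative, and the HBase Thrift server must be running):

import happybase

connection = happybase.Connection("localhost")  # assumes a local HBase Thrift server
table = connection.table("users")               # hypothetical table with column family "info"

table.put(b"user1", {b"info:name": b"Asha", b"info:city": b"Pune"})  # real-time write
row = table.row(b"user1")                                            # real-time read
print(row[b"info:name"])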

12. What are the key fundamentals of HBase?


1. Column-family-oriented storage.
2. Horizontal scalability.
3. Strong consistency model.
4. Automatic sharding.
5. Real-time data access.

13. Explain how MapReduce jobs are run on HBase for table
input/output.
1. Input data is read using TableInputFormat.
2. Mapper processes HBase rows.
3. Reducer writes output using TableOutputFormat.

14. What is the role of Zookeeper in the Hadoop ecosystem?


1. Manages cluster coordination.
2. Handles leader election and configuration management.
3. Ensures distributed synchronization.
4. Provides metadata storage for HBase.

15. Name two visualization techniques for visual data analysis.


1. Scatter plots.
2. Heatmaps.

16. What are the main interaction techniques used in data visualization?
1. Filtering data.
2. Zooming and panning.

17. Define RDD (Resilient Distributed Dataset) in Spark.


1. Immutable distributed collection of objects.
2. Supports in-memory processing.
3. Enables fault tolerance.

18. How is Spark different from Hadoop MapReduce?


1. Spark supports in-memory computation.
2. Faster processing due to DAG execution.
3. Better suited for iterative tasks.
19. What is MLlib in Spark, and what types of machine learning
algorithms does it support?
1. Library for scalable machine learning.
2. Supports classification, clustering, and regression.

20. How do you download and install Spark?


1. Download from the official Apache Spark website.
2. Set up Java and Hadoop dependencies.

21. Explain the architecture and key features of HBase. How is it used
for big data storage and retrieval in the Hadoop ecosystem?
Architecture:
1. Master-RegionServer Model: Master manages metadata;
RegionServers handle reads/writes.
2. HDFS Integration: Data stored in HDFS.
3. Zookeeper: Ensures coordination.

22. Discuss the process of running MapReduce jobs on HBase, including table input and output.
1. InputFormat: Reads data from HBase tables.
2. Mapper: Processes rows.
3. Reducer: Writes back to HBase.

23. Describe the role of Zookeeper in Hadoop and HBase ecosystems.


1. Coordinates distributed nodes.
2. Ensures consistent metadata updates.
24. Discuss various visual data analysis techniques. Provide examples
of how they are used in big data analytics.
1. Bar Charts: Sales trends analysis.
2. Word Clouds: Text mining.

25. Explain interaction techniques in visual data analysis and their importance for effective data exploration.
1. Enables dynamic filtering.
2. Facilitates zoom and drill-down.

26. What is Spark, and how does it facilitate data analysis and
processing in comparison to Hadoop MapReduce?
1. In-memory processing.
2. Supports real-time analytics.

27. Describe the process of programming with RDDs in Spark. Provide an example of creating and manipulating an RDD.
# In the pyspark shell, sc is the SparkContext that is created automatically.
rdd = sc.parallelize([1, 2, 3, 4])            # create an RDD from a local Python list
result = rdd.map(lambda x: x * 2).collect()   # transform and collect: [2, 4, 6, 8]

28. Discuss how machine learning can be performed with Spark's MLlib. Provide an example of a classification algorithm using MLlib.
1. Classification Example: Logistic regression.
2. Train and evaluate using MLlib’s API (a minimal sketch follows below).
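
A minimal PySpark sketch of a logistic-regression classifier using the DataFrame-based spark.ml API (the training values are made-up toy data):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training data: (label, features)
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.2, 1.3])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)  # train the classifier
predictions = model.transform(train)               # score (here, on the same toy data)
predictions.select("label", "prediction").show()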

29. What are the key steps involved in setting up and running Spark in a
distributed environment?
1. Set up Hadoop and Spark.
2. Configure cluster nodes.

30. Compare and contrast the capabilities of Spark and Hadoop for big
data analytics.

Feature | Spark | Hadoop
Processing | In-memory | Disk-based
Use Cases | Real-time analytics | Batch processing
