Big Data
• Data Collection: Ensure data is gathered from diverse and relevant sources.
• Data Quality: Maintain high data quality through cleaning and validation.
• Scalability: Design analytics systems that can scale with increasing data volumes.
• Security & Privacy: Implement robust security measures to protect sensitive data.
• Cost Management: Optimize resources to manage costs effectively.
• Data Lakes: Centralized storage for raw, unprocessed data in its native format.
• Data Warehouses: Structured storage optimized for query and analysis.
• NoSQL Databases: Non-relational databases designed for high scalability.
• Distributed File Systems: Systems like HDFS that store data across multiple machines.
• Cloud Storage: Scalable, on-demand storage solutions provided by cloud platforms.
• Scalability Issues: Traditional systems struggle with the volume and velocity of Big Data.
• High Costs: Maintaining and scaling traditional systems can be expensive.
• Inflexibility: Rigid architectures are not well-suited for unstructured data.
• Distributed Computing: Breaks down tasks and distributes them across multiple nodes.
• Fault Tolerance: Automatically replicates data and reruns failed tasks.
• Data Locality: Processes data where it is stored to minimize data movement.
• Job Tracker: Manages MapReduce jobs, handling scheduling and task monitoring.
• Task Tracker: Executes the individual map and reduce tasks on Data Nodes.
• Data Shuffling: Transfers intermediate data between the map and reduce phases.
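The map, shuffle, and reduce phases above can be sketched in miniature. This is a toy word-count in pure Python (no Hadoop); the function names and sample documents are illustrative assumptions, not part of the framework:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from each input record.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big systems", "data locality matters"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real cluster the map and reduce functions run on different nodes, and the shuffle step moves intermediate data over the network, which is exactly why data locality matters.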
• K-Means Clustering:
o Algorithm: Partitions data into K clusters by minimizing the variance within each cluster.
o Use Cases: Customer segmentation, image compression, anomaly detection.
• Elbow Method: Plots the within-cluster sum of squared errors against K to find the optimal number of clusters.
• Silhouette Score: Measures how similar a point is to its own cluster compared to other clusters.
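A minimal K-Means plus elbow-method sketch in pure Python; the toy 2-D points and helper names are assumptions for illustration, not a production implementation:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Minimal K-Means on 2-D points: assign each point to its
    # nearest centroid, then recompute centroids, repeated.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

def sse(centroids, clusters):
    # Within-cluster sum of squared errors: the quantity plotted
    # against K in the elbow method.
    return sum(
        math.dist(p, c) ** 2
        for c, cl in zip(centroids, clusters)
        for p in cl
    )

# Two well-separated blobs: SSE should drop sharply from K=1 to K=2,
# producing the "elbow" at K=2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
errors = {k: sse(*kmeans(points, k)) for k in (1, 2, 3)}
```

Plotting `errors` against K would show the characteristic elbow where adding more clusters stops paying off.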
• Indexing: Creating data structures that improve the speed of data retrieval.
• Techniques: B-trees, hash tables, and inverted indexes for efficient search operations.
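An inverted index can be sketched in a few lines. This toy version (the document IDs and sample texts are illustrative assumptions) maps each term to the documents containing it, so lookups avoid scanning every document:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Inverted index: term -> set of IDs of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "big data storage systems",
    2: "distributed storage across machines",
    3: "query and analysis",
}
index = build_inverted_index(docs)
print(sorted(index["storage"]))  # [1, 2]
```

Search engines build on the same idea at scale, adding compression, ranking, and on-disk layouts.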
• Reasons to Choose:
o Scalability: Ability to handle large datasets.
o Cost-Effectiveness: Open-source tools reduce costs.
• Cautions:
o Complexity: Steep learning curve for setting up and managing Hadoop.
o Data Quality: Poor data can lead to inaccurate analysis.
• ID3: Uses information gain to select the feature that best splits the data.
• C4.5: An extension of ID3 that handles both categorical and continuous data.
• CART (Classification and Regression Trees): Builds binary trees for both classification and regression tasks.
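The information-gain criterion that ID3 uses can be computed directly: parent entropy minus the weighted entropy after a split. A small sketch with a made-up toy dataset (the feature names and labels are illustrative assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label distribution.
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(labels).values()
    )

def information_gain(rows, labels, feature):
    # Gain of splitting on `feature`: parent entropy minus the
    # weighted entropy of each resulting branch.
    parent = entropy(labels)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[feature], []).append(label)
    weighted = sum(
        len(b) / len(labels) * entropy(b) for b in branches.values()
    )
    return parent - weighted

# Toy data: 'outlook' perfectly predicts the label, 'windy' does not.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]
gain_outlook = information_gain(rows, labels, "outlook")  # 1.0
gain_windy = information_gain(rows, labels, "windy")      # 0.0
```

ID3 would split on `outlook` here, the feature with the highest gain; C4.5 refines this with gain ratio and support for continuous attributes.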
• Definition: Analyzing social media and other sources to gauge public sentiment in real time.
• Applications: Brand monitoring, customer feedback analysis, market research.
• Tools: Apache Storm, Spark Streaming, Twitter API.
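The per-message scoring step of such a pipeline can be sketched with a toy lexicon-based scorer. The word lists and sample messages are invented for illustration; a real deployment would run logic like this inside a stream processor such as Spark Streaming or Storm:

```python
# Tiny hand-picked sentiment lexicons (illustrative only).
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def sentiment(text):
    # Score = (# positive words) - (# negative words) per message.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

stream = [
    "I love this brand, great support",
    "terrible experience, bad product",
]
scores = [sentiment(msg) for msg in stream]
print(scores)  # [2, -2]
```

Production systems replace the lexicon with trained models and add tokenization, negation handling, and windowed aggregation over the stream.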