Big Data
Types of Machine Learning:
● Supervised Learning: Uses labeled data to predict outcomes. Common tasks include
classification and regression.
● Unsupervised Learning: Finds patterns in data without labels, primarily used for
clustering and dimensionality reduction.
Disadvantages of Supervised Learning:
● Requires labeled data, which can be time-consuming to obtain.
● Not suitable for complex tasks without sufficient computational resources.
Disadvantages of K-Nearest Neighbors (KNN):
● Computationally expensive as it requires calculating the distance for all training points.
● Sensitive to the scale of data and irrelevant features.
● High memory requirements to store training data.
Q12: How does SVM handle linearly separable and non-linearly separable data?
A:
● Linearly Separable Data: Constructs a hyperplane that separates the data points with
the maximum margin.
● Non-Linearly Separable Data: Uses kernel functions to transform the data into
higher-dimensional space, where it becomes linearly separable.
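For illustration (not part of the original notes), a minimal scikit-learn sketch comparing a linear and an RBF-kernel SVM on a non-linearly separable dataset; the dataset and parameters are arbitrary choices:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
X = StandardScaler().fit_transform(X)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The RBF kernel implicitly maps the data into a higher-dimensional space,
# so it should separate this dataset far better than the linear hyperplane.
print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))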
Q13: Explain the concept of margin and support vectors in SVM.
A:
● Margin: The distance between the hyperplane and the nearest data points from each
class. SVM aims to maximize this margin.
● Support Vectors: The data points closest to the hyperplane, which are critical in
defining the decision boundary.
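Continuing the sketch above, the fitted scikit-learn model exposes the support vectors that define its decision boundary:

# The training points closest to the hyperplane, i.e. the support vectors.
print("support vectors per class:", rbf_svm.n_support_)
print("support vectors shape:", rbf_svm.support_vectors_.shape)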
K-Means Clustering
Q1: What is K-Means Clustering, and how does it work?
A:
K-Means Clustering is an unsupervised learning algorithm used for solving clustering problems.
The algorithm aims to partition n observations into k clusters, where each observation
belongs to the cluster with the nearest mean, which serves as the cluster's prototype.
Working:
1. Choose the number of clusters k and initialize k centroids (e.g., by picking random observations).
2. Assign each observation to the cluster whose centroid is nearest.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
This iterative approach makes it efficient for large datasets but can be computationally expensive for high-dimensional data.
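A minimal NumPy sketch of these steps (not from the original notes; the two-blob sample data and k=2 are illustrative):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Minimal K-Means: assign points to the nearest centroid, then re-average.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: init
    for _ in range(n_iters):
        # Step 2: assign each observation to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)  # one centroid near (0, 0), the other near (5, 5)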
Neural Networks
Q1: What are Neural Networks and how are they inspired by the human brain?
A: Neural networks are information-processing models inspired by the human brain. They are
highly nonlinear, complex, and parallel in nature. Like the brain, they consist of structural units
called neurons that enable tasks such as perception, pattern recognition, and control. Neural
networks are capable of learning and generalizing to solve computationally complex problems.
Q2: What are the key features and properties of Neural Networks?
A:
1. Nonlinearity: Neural networks can model nonlinear relationships, which are distributed
across the network.
2. Input-Output Mapping: They map inputs to outputs using supervised learning with
labeled examples.
3. Adaptivity: Networks adapt to changing environments by modifying synaptic weights.
4. Evidential Response: Provides confidence levels for classification decisions.
5. Fault Tolerance: Gradual performance degradation under adverse conditions.
6. VLSI Implementability: Parallel structures make them efficient for Very-Large-Scale
Integration (VLSI).
7. Uniformity of Design: Common frameworks allow for modular and reusable designs.
8. Neurobiological Analogy: Mimics the brain's fault-tolerant and parallel processing
capabilities.
Q3: Explain the role of synaptic weights and activation functions in Neural Networks.
A:
● Synaptic Weights: Represent the strength of connections between neurons. They store
experiential knowledge and are adjusted during the learning process.
● Activation Functions: Define the output of neurons by transforming the input. Common
types include:
○ Threshold Function (binary outputs).
○ Sigmoid Function (smooth, differentiable, S-shaped curve).
○ ReLU (Rectified Linear Unit): Replaces negative inputs with zero.
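For illustration, the three activation functions above written in NumPy:

import numpy as np

def threshold(x):   # binary step: 1 if x >= 0, else 0
    return (x >= 0).astype(float)

def sigmoid(x):     # smooth, differentiable S-shaped curve in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):        # replaces negative inputs with zero
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(threshold(x), sigmoid(x), relu(x))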
Q: What are the main types of neural network architectures?
A:
1. Feed-Forward Networks: Data flows in one direction from input to output (e.g., image recognition).
2. Single-Layer Networks: Have only one output layer of neurons.
3. Multilayer Networks: Include hidden layers to handle complex mappings.
4. Recurrent Networks: Include feedback loops for dynamic behavior and sequential data
processing.
Q: What learning paradigms do neural networks use?
A:
1. Supervised Learning: Networks learn from labeled examples by minimizing errors (a minimal sketch follows this list). The process includes:
○ Providing inputs and desired outputs (training data).
○ Iterative adjustment of weights to reduce errors.
2. Unsupervised Learning: Finds patterns without labeled data, often using clustering
methods.
3. Reinforcement Learning: Learns through interaction with the environment, optimizing a
reward function.
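As referenced in item 1 above, a minimal sketch of supervised learning as iterative weight adjustment: a single perceptron trained on the AND function (an illustrative toy task, not from the original notes):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)           # desired outputs (AND)

w = np.zeros(2)                                   # synaptic weights
b = 0.0                                           # bias
lr = 0.1                                          # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = float(np.dot(w, xi) + b >= 0)      # threshold activation
        error = target - pred                     # compare to desired output
        w += lr * error * xi                      # adjust weights to reduce error
        b += lr * error

print("learned weights:", w, "bias:", b)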
Applications
1. Facial Recognition
2. Stock Market Prediction
3. Social Media Analytics
4. Aerospace Engineering
5. Healthcare Diagnostics
6. Signature Verification
7. Handwriting Analysis
Q: What are the common deep learning architectures?
A:
1. Convolutional Neural Networks (CNNs): Specialized for grid-like data (e.g., images).
2. Recurrent Neural Networks (RNNs): Process sequential data and retain memory of
previous inputs.
3. Autoencoders: Used for unsupervised learning tasks like feature extraction and data
compression.
4. Feedforward Neural Networks: Simplest architecture, mapping inputs to outputs via
layers.
Q: What are the main steps in applying deep learning to a problem?
A:
1. Identify the Problem: Define the objective and feasibility for deep learning application.
2. Collect Relevant Data: Gather and preprocess data.
3. Choose an Algorithm: Select an appropriate deep learning model.
4. Train the Model: Feed data into the model for training.
5. Test and Validate: Evaluate the model's performance on unseen data.
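An end-to-end sketch of these five steps using scikit-learn's small MLPClassifier as a stand-in for a deep learning model (dataset and hyperparameters are illustrative choices, not from the original notes):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                   # step 2: collect data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)              # hold out unseen data

model = MLPClassifier(hidden_layer_sizes=(64,),       # step 3: choose a model
                      max_iter=300, random_state=0)
model.fit(X_train, y_train)                           # step 4: train
print("test accuracy:", model.score(X_test, y_test))  # step 5: test and validate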
Q: What are some real-world applications of deep learning?
A:
1. Automatic Text Generation: Learns language patterns for generating coherent text.
2. Healthcare: Diagnoses diseases using medical data.
3. Machine Translation: Translates text between languages.
4. Image Recognition: Identifies and categorizes objects in images.
5. Earthquake Prediction: Models viscoelastic computations to predict seismic activity.
6. Self-Driving Cars: Processes sensor data for navigation.
Q: How does deep learning differ from traditional machine learning?
A:
● Machine Learning: Requires manual feature engineering; effective for smaller datasets.
● Deep Learning: Automates feature extraction; suitable for large datasets with high
complexity.
Architectures in Detail
Q: What are deep feedforward networks?
A:
Deep feedforward networks, also known as multilayer perceptrons (MLPs), are the foundation of
many deep learning models. They approximate a function f*(x), where x is the input that the
model maps to an output y. The parameters θ are optimized during training to minimize the
difference between predicted and actual outputs.
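A NumPy sketch of such a mapping y = f(x; θ), with θ = (W1, b1, W2, b2) and arbitrary layer sizes:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden-layer parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output-layer parameters

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)            # hidden layer with ReLU
    return W2 @ h + b2                          # predicted output y

x = np.array([0.5, -1.0, 2.0])                  # input x
print("y =", forward(x))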
Q: What is pooling in CNNs?
A:
Pooling simplifies the representation of features by reducing their spatial dimensions (e.g., max pooling keeps only the largest value in each window).
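A NumPy sketch of 2x2 max pooling with stride 2, which halves each spatial dimension:

import numpy as np

def max_pool_2x2(feature_map):
    # Crop to even dimensions, then take the max of each 2x2 block.
    h, w = feature_map.shape
    cropped = feature_map[:h - h % 2, :w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fm))   # 4x4 feature map pooled down to 2x2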
Technical Concepts
Q: What is padding in CNNs?
A:
Padding adds extra pixels around the input data to preserve spatial dimensions after
convolution. It prevents information loss at the edges.
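A one-line NumPy illustration ("same" padding of one pixel, as needed for a 3x3 kernel; the image is a dummy array):

import numpy as np

img = np.arange(9).reshape(3, 3)
# Pad one row/column of zeros on every side to preserve spatial dimensions.
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)   # (5, 5): a 3x3 convolution now yields 3x3 again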
Q: What is batch normalization?
A:
Batch normalization standardizes the input to each layer in a deep network. It stabilizes the
learning process, reduces the number of training epochs needed, and has a regularizing effect
that helps prevent overfitting.
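A NumPy sketch of the computation (gamma and beta are the learnable scale and shift; the defaults here are illustrative):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Standardize each feature over the batch, then scale and shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

batch = np.random.randn(32, 8) * 5 + 3        # 32 samples, 8 features
out = batch_norm(batch)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))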
Q15: What are strides in CNNs?
A:
Strides determine the step size with which the kernel slides during convolution. Larger strides
reduce computational cost and produce smaller, more abstracted feature maps, but may lose
fine-grained details.
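The standard output-size formula, out = floor((n + 2p - k) / s) + 1, makes the trade-off concrete; the sizes below are illustrative:

# n = input size, k = kernel size, p = padding, s = stride
def conv_output_size(n, k, p, s):
    return (n + 2 * p - k) // s + 1

print(conv_output_size(32, 3, 1, 1))   # stride 1 keeps the 32x32 size
print(conv_output_size(32, 3, 1, 2))   # stride 2 roughly halves it to 16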
Introduction
Q3: How are public opinions mined using social media platforms?
A:
Public opinions are mined using a framework like DOM (Dynamic Opinion Mining), which
involves:
1. Collecting messages from social networks, blogs, and forums using DOM’s crawler
module.
2. Storing data in a NoSQL database (e.g., MongoDB).
3. Using Natural Language Processing (NLP) for sentiment scoring, topic categorization,
and summarization.
4. Applying MapReduce on Apache Hadoop to optimize data processing.
Q5: What is the MapReduce framework, and why is it essential in sentiment analysis?
A:
MapReduce is a distributed computing framework used for processing large datasets efficiently
by splitting tasks into smaller, parallel jobs. It accelerates the analysis speed of the DOM
framework, making sentiment analysis scalable.
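A toy Python sketch of the two phases applied to sentiment counting (the word lists and tweets are invented for illustration; a real Hadoop job distributes these phases across many nodes):

from collections import defaultdict

POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "awful", "hate"}

def map_phase(tweet):
    # Map: score one tweet and emit a (sentiment, 1) pair.
    words = tweet.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    yield (label, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each sentiment label.
    totals = defaultdict(int)
    for label, count in pairs:
        totals[label] += count
    return dict(totals)

tweets = ["love this great weather", "awful traffic today", "just landed"]
pairs = [pair for t in tweets for pair in map_phase(t)]
print(reduce_phase(pairs))   # {'positive': 1, 'negative': 1, 'neutral': 1}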
Q7: What machine learning techniques are applied in Twitter sentiment analysis?
A:
Influencer Analysis
● DOM classified tweets into six predefined categories with over 85% accuracy, providing
insights into public sentiment and political opinions.
Q13: How are weather conditions correlated with emotions expressed on Twitter?
A:
Q14: What tools and technologies are used in sentiment analysis and weather
correlation?
A:
Ethics Overview
Q3: How can unethical use of data affect individuals and organizations?
A:
Unethical practices, such as misconfigured databases, data breaches, or lack of informed
consent, can lead to privacy violations, loss of trust, and legal repercussions. For organizations,
this might damage their reputation and result in financial losses.
Q5: What are the five key concerns in Big Data Ethics?
A:
Ethical Challenges
Q8: What are the challenges with informed consent in Big Data?
A:
● Traditional informed consent for specific studies is insufficient for Big Data.
● Consent cannot cover unknown future uses, as data mining often reveals patterns
unanticipated during collection.
● Alternatives like broad consent (pre-authorization for secondary uses) and tiered
consent (specific authorizations) are being adopted, but these approaches may dilute
the concept of informed consent.
Q15: What are the key principles for ethical data usage?
A:
1. Data should remain private unless explicitly authorized for sharing.
2. Customers should have control over their data’s flow.
3. Big Data analytics should not interfere with human autonomy.
4. Algorithms must avoid reinforcing unfair biases.
Final Reflections