Final Note
I. Apache Spark Architecture
1. Spark Application Processes
A Spark application is a distributed computing program built to process large datasets using the Apache
Spark framework. Each Spark application is a self-contained computation with a driver program and a set
of distributed worker processes. Here's an overview of the key processes involved in a Spark application:
a. Driver Process
The driver is the central process of a Spark application that manages and coordinates the execution of
tasks.
Responsibilities:
- Defines the Spark context (SparkContext) and manages the lifecycle of the application.
- Analyzes, schedules, and distributes tasks to worker nodes (executors).
- Collects results from the executors.
- Monitors application progress and handles failures.
Location: Runs on the submitting machine in client deploy mode, or on a node inside the cluster in
cluster deploy mode.
b. Executors
Executors are distributed processes running on worker nodes that execute tasks assigned by the driver.
Responsibilities:
- Execute tasks from the job and return results to the driver.
- Store data for caching and shuffling.
- Perform computations on local data partitions.
Lifecycle: Executors are created when the application starts and terminate when the application ends.
c. Cluster Manager
The cluster manager handles resource allocation across all nodes in the cluster.
Types:
- Standalone: Spark's built-in cluster manager.
- YARN: Resource manager for Hadoop ecosystems.
- Mesos: General-purpose cluster manager.
- Kubernetes: Container orchestration system for Spark.
Responsibilities:
- Allocate resources (CPU, memory) to the Spark application.
- Manage the lifecycle of executors.
d. Job, Stage and Task Execution
Job: A unit of work triggered by an action on an RDD/DataFrame/Dataset.
Stages:
- Subdivision of a job based on shuffle boundaries.
- Each stage consists of multiple tasks.
Tasks:
- The smallest unit of execution in Spark.
- Process data in parallel across partitions.
e. Data Flow
Transformations:
- Create new RDDs/DataFrames but do not execute immediately.
Actions:
- Trigger computation, e.g., collect, save, count.
- Generate one or more jobs for execution.
f. Execution modes
Standalone Mode: Runs Spark on its built-in cluster manager.
YARN Mode: Runs Spark within a Hadoop cluster.
Mesos Mode: Uses Apache Mesos to manage cluster resources.
Kubernetes Mode: Deploys Spark applications on Kubernetes clusters.
Local Mode: Executes Spark on a single machine (ideal for debugging).
g. Spark Application Lifecycle
Application Submission: The application is submitted via the spark-submit tool.
Driver Initialization: The driver initializes the Spark context and interacts with the cluster manager.
Executor Launch: The cluster manager launches executors on worker nodes.
Task Scheduling: The driver divides the job into stages and tasks, then schedules them on executors.
Task Execution: Executors process tasks in parallel and return results.
Completion: Once all tasks are complete, the application terminates, and resources are released.
h. Key concepts
RDD (Resilient Distributed Dataset): Immutable distributed data collection.
DAG (Directed Acyclic Graph): Logical representation of the execution plan.
Checkpointing: Saves the intermediate state of an RDD to storage for fault tolerance.
2. Run an Apache Spark Application
a. Prepare your Spark Application
Write Your Code: Develop your Spark application in Scala, Python, Java, or R. Save it as a file (e.g.,
MySparkApp.scala, MySparkApp.py, or MySparkApp.java).
Build the Application:
For Java/Scala:
- Package the code into a JAR file using a build tool like Maven or SBT. Example for Scala: sbt package
- The resulting file might be named something like my-spark-app_2.12-1.0.jar.
For Python/R, ensure the script is ready without additional packaging.
b. Set up the environment
Install Spark on your system or cluster.
- Download Spark: https://spark.apache.org/downloads.html
- Extract the package and configure environment variables (SPARK_HOME and add
SPARK_HOME/bin to PATH).
Configure the cluster manager if running on YARN, Kubernetes, or Mesos.
c. Submit the Application
Use the spark-submit command to run your application. Here's the general syntax:
spark-submit \
  --master <MASTER_URL> \
  --deploy-mode <DEPLOY_MODE> \
  --class <MAIN_CLASS> \
  --name <APP_NAME> \
  --conf <CONFIGURATION_OPTIONS> \
  --driver-memory <MEMORY> \
  --executor-memory <MEMORY> \
  --executor-cores <CORES> \
  <APPLICATION_FILE> \
  <APPLICATION_ARGUMENTS>
d. Example commands
Run locally:
spark-submit \
  --master local[*] \
  --name "MySparkApp" \
  my-spark-app_2.12-1.0.jar
--master local[*]: Runs Spark locally using all available CPU cores.
my-spark-app_2.12-1.0.jar: Path to your application JAR or script.
Run on a Cluster (YARN Example)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkApp \
  --name "MySparkApp" \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  my-spark-app_2.12-1.0.jar
--master yarn: Specifies YARN as the cluster manager.
--deploy-mode cluster: Runs the driver on the cluster.
--class com.example.MySparkApp: Fully qualified main class name (Scala/Java).
--conf: Additional configuration properties for Spark.
Run a Python Script
spark-submit \
  --master yarn \
  --deploy-mode client \
  --name "MyPythonSparkApp" \
  my_spark_app.py
Run on Kubernetes
spark-submit \
  --master k8s://https://<K8S_API_SERVER> \
  --deploy-mode cluster \
  --name "MyK8sApp" \
  --class com.example.MySparkApp \
  --conf spark.kubernetes.container.image=<SPARK_IMAGE> \
  my-spark-app_2.12-1.0.jar
3. Spark Shell Example – Run code
a. Launch Scala Spark Shell
spark-shell
- This launches an interactive Scala Spark shell, which includes a pre-configured SparkSession (spark)
and SparkContext (sc).
b. Create a Distributed DataFrame
Create a DataFrame with a single column id containing values from 0 to 9.
val df = spark.range(0, 10).toDF("id")
df.show()
- spark.range(0, 10): Generates a range of numbers from 0 to 9 as a distributed dataset.
- .toDF("id"): Converts the dataset to a DataFrame with a column named id.
c. Add a Column for Modulo Operation
Use the withColumn method to add a new column mod_2 that evaluates the modulo of the id column by 2.
import org.apache.spark.sql.functions._
val dfWithModulo = df.withColumn("mod_2", col("id") % 2)
dfWithModulo.show(4)
- import org.apache.spark.sql.functions._: Imports Spark SQL functions for use.
- withColumn("mod_2", col("id") % 2): Adds a new column mod_2 by calculating the modulo of id with 2.
- show(4): Displays the first four rows of the resulting DataFrame.
d. Output
For df.show():
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
For dfWithModulo.show(4):
+---+-----+
| id|mod_2|
+---+-----+
|  0|    0|
|  1|    1|
|  2|    0|
|  3|    1|
+---+-----+
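The difference between transformations and actions described under Data Flow can be illustrated
without a cluster. The following is a minimal pure-Python sketch, not Spark's API: the LazyDataset
class and the add_mod_2 helper are hypothetical names invented for this illustration. It records
transformations without running them and only executes the recorded plan when an action is called,
mirroring the mod_2 example from the Spark shell.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD/DataFrame (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable, steps: Optional[List[Callable]] = None):
        self._source = source            # must be re-iterable, e.g. a range or list
        self._steps = list(steps or [])  # recorded plan, loosely analogous to the DAG

    # Transformations: return a new dataset and execute nothing.
    def map(self, fn: Callable) -> "LazyDataset":
        step = lambda rows: (fn(row) for row in rows)
        return LazyDataset(self._source, self._steps + [step])

    def filter(self, predicate: Callable) -> "LazyDataset":
        step = lambda rows: (row for row in rows if predicate(row))
        return LazyDataset(self._source, self._steps + [step])

    # Actions: run the recorded plan and return a concrete result.
    def collect(self) -> list:
        rows = iter(self._source)
        for step in self._steps:
            rows = step(rows)
        return list(rows)

    def count(self) -> int:
        return len(self.collect())


evaluated = []  # records which ids were actually processed


def add_mod_2(i: int) -> dict:
    evaluated.append(i)  # side effect only to make execution visible
    return {"id": i, "mod_2": i % 2}


ids = LazyDataset(range(0, 10))
with_mod = ids.map(add_mod_2)                          # transformation
odd_rows = with_mod.filter(lambda r: r["mod_2"] == 1)  # transformation
work_before_action = len(evaluated)                    # nothing has run yet

first_rows = with_mod.collect()[:4]                    # action triggers execution
odd_count = odd_rows.count()                           # second action runs the plan again

print(work_before_action, first_rows, odd_count)
```

In this sketch every action re-runs the recorded plan from the source, which is the behavior that
caching is meant to avoid when a dataset is reused.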
CHAPTER 9: BIG DATA ANALYTICS
I. Gathering Data
1. Process for identifying data
a. Step 1: Determining the Information You Want to Collect
The Specific Information You Need:
- Identify the questions you want to answer or the problems you want to solve.
- Define the scope of your data requirements (e.g., demographic data, sales data, website traffic,
etc.).
- Specify key metrics, variables, or attributes that are necessary for your analysis.
The Possible Sources for This Data:
- Primary Sources: Data collected directly from original sources, such as surveys, interviews, or
experiments.
- Secondary Sources: Pre-existing data such as industry reports, government publications, or
organizational records.
- Internal Sources: Data generated within your organization (e.g., CRM data, employee records).
- External Sources: Data obtained from third-party services, public databases, or competitors.
b. Step 2: Define a Plan for Collecting Data
Establish Objectives:
- Define what success looks like for your data collection process.
- Align your data collection goals with broader organizational or project goals.
Outline Key Steps:
- Specify timelines and deadlines for data collection.
- Assign roles and responsibilities to team members.
- Decide on the scale and frequency of data collection (e.g., daily, weekly, quarterly).
Consider Ethical and Legal Factors:
- Ensure compliance with data protection laws (e.g., GDPR, CCPA).
- Obtain necessary permissions and informed consent if collecting personal data.
c. Step 3: Determining Your Data Collection Methods
The methods you choose depend on the following factors:
Type of Data:
- Quantitative Data: Numerical data requiring methods like surveys, structured observations, or
sensors.
- Qualitative Data: Non-numerical data requiring methods like interviews, focus groups, or
open-ended surveys.
Accessibility:
- Availability of data from the sources you identified in Step 1.
- Ease of reaching respondents or accessing systems.
Accuracy and Reliability Needs:
- The precision required in your data (e.g., high accuracy for financial analysis).
- Balance between reliability and feasibility of collection.
Resources:
- Budget constraints, time limitations, and available workforce.
- Tools or technologies required for data collection.
Volume of Data:
- Small-scale datasets may rely on manual collection.
- Large-scale data may require automated tools like web scraping, APIs, or IoT devices.
Timeliness:
- Real-time data collection may need sensors or live dashboards.
- Historical data might come from archival research or databases.
2. How to gather and import data
a. Data Gathering
This step involves identifying and collecting data from various sources.
Data Sources
Structured Data:
- Databases: Relational databases like MySQL, PostgreSQL.
- Data Warehouses: Snowflake, Amazon Redshift.
Semi-Structured Data:
- JSON, XML, or CSV files from APIs, logs, or sensors.
- Cloud storage systems (e.g., AWS S3, Azure Blob Storage).
Unstructured Data:
- Media files (audio, video, images).
- Textual data from social media, emails, or documents.
Real-Time Streaming Data:
- IoT device streams.
- Event streams from platforms like Apache Kafka, AWS Kinesis.
Web and External Data:
- Web scraping or crawling for publicly available data.
- Third-party APIs for external datasets.
Data Collection Tools
- Batch Collection: Tools like Sqoop (for databases), Flume (for logs).
- Real-Time Collection: Kafka, Spark Streaming, Flink.
- Web Data Collection: Scrapy, Beautiful Soup, or APIs.
- Cloud Data Collection: Services like AWS Glue, Google Dataflow.
b. Data Import
Once gathered, data needs to be ingested into a big data platform like Hadoop, Spark, or a cloud-based
solution.
Methods for Importing Data
File-Based Ingestion:
- Upload files (e.g., CSV, Parquet, ORC) directly into storage like HDFS, S3, or GCS.
- Tools: HDFS commands, AWS CLI, Google Cloud SDK.
Database Import:
- Use tools like Sqoop to move structured data from relational databases to big data platforms.
- Example: sqoop import --connect jdbc:mysql://hostname/dbname --table tablename --target-dir /user/hadoop/tablename
Real-Time Stream Ingestion:
- Use tools like Kafka, Flink, or Spark Streaming to continuously import data streams.
- Example:
+ Produce data to Kafka: kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic
+ Consume data in Spark Streaming:
// The Kafka source also needs the broker address; localhost:9092 matches the producer example.
val stream = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "test-topic").load()
Cloud Services:
- AWS Glue for ETL (Extract, Transform, Load).
- Google BigQuery for direct ingestion.
- Azure Data Factory for pipeline orchestration.
API Integration:
- Use APIs to fetch data programmatically and load it into your system.
- Example:
import requests
response = requests.get("https://api.example.com/data")
# Stop early on HTTP errors instead of saving an error page
response.raise_for_status()
with open("data.json", "w") as file:
    file.write(response.text)
c. Storing and Organizing Data
After importing, data must be stored and organized for efficient processing:
Distributed File Systems:
- HDFS: Hadoop's distributed storage system.
- S3/GCS: Cloud-based storage solutions.
Data Partitioning:
- Partition data by date, region, or other attributes for faster access.
Data Formats:
- Use efficient formats like Parquet, Avro, or ORC for big data processing.
d. Data Processing after Import
Once imported, process the data using tools like:
- Batch Processing: Apache Spark, Hadoop MapReduce.
- Real-Time Processing: Apache Flink, Spark Streaming.
e. Example: End-to-End Data Ingestion
Gather Data:
- Fetch log files from a server.
- Stream IoT data using Kafka.
Import Data:
- Use HDFS commands to load logs:
hdfs dfs -put logs.txt /data/logs/
- Use Kafka to ingest IoT data into Spark Streaming.
Store Data:
- Store the imported data in HDFS or S3.
Process Data:
- Use Spark for transformations and analytics.
II. Wrangling Data
1. Discovering
The first step is to understand the structure, quality, and content of the data.
Activities:
Explore Data:
- Inspect data attributes (e.g., column names, data types, missing values).
- Summarize data with descriptive statistics.
Identify Issues:
- Detect missing, inconsistent, or duplicate data.
- Look for outliers and anomalies.
Understand Context:
- Identify relationships between variables.
- Check metadata or documentation for definitions.
Tools:
- Python: Pandas, NumPy, Matplotlib.
- SQL: Exploratory queries.
- Visualization: Tableau, Power BI, or Seaborn for patterns.
Example:
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Quick summary (info() prints its report directly and returns None)
df.info()
print(df.describe())
2. Transformation
Modify and reshape the data to address quality issues and prepare it for analysis.
Activities:
Handle Missing Data:
- Impute missing values (mean, median, mode, etc.).
- Remove rows or columns with excessive missing data.
Correct Data Types:
- Convert data types (e.g., string to date, float to integer).
Normalize and Scale:
- Normalize numeric data (e.g., Min-Max scaling).
Derive New Features:
- Create new columns (e.g., aggregations, ratios).
Clean Data:
- Standardize text (e.g., consistent capitalization).
- Remove duplicates and irrelevant data.
Tools:
- Python: Pandas, PySpark.
- Standalone Tools: OpenRefine.
- ETL Tools: Apache NiFi, Talend.
Example:
# Fill missing values
df['age'] = df['age'].fillna(df['age'].mean())
# Convert column to datetime
df['date'] = pd.to_datetime(df['date'])
# Remove duplicates
df = df.drop_duplicates()
3. Validation
Ensure the data's integrity, accuracy, and consistency after transformation.
Activities:
Check Consistency:
- Verify data types, ranges, and formats.
- Check for logical inconsistencies (e.g., negative ages).
Validate Transformations:
- Compare transformed data with original datasets.
Quality Assurance:
- Ensure no data is inadvertently lost or misrepresented.
- Validate against business rules or benchmarks.
Tools:
- Python: Data validation libraries like pandas_schema.
- Big Data: Apache DQ, Great Expectations.
- Manual Checks: Use summary statistics and visualizations.
Example:
# Validate data range
assert df['age'].min() >= 0, "Age cannot be negative"
# Check for missing values
if df.isnull().values.any():
    print("Data still has missing values!")
4. Publishing
Prepare and export the wrangled data for use in downstream processes, such as modeling,
visualization, or reporting.
Activities:
Format Data:
- Save the data in required formats (e.g., CSV, JSON, Parquet).
Store Data:
- Upload data to databases, cloud storage, or data warehouses.
Document Changes:
- Keep a log of the transformations applied for reproducibility.
Share Data:
- Provide data to stakeholders or analysis teams.
Tools:
- Data Storage: HDFS, AWS S3, Google BigQuery.
- Export Formats: CSV, JSON, Parquet.
- Data Sharing: APIs, FTP, or shared repositories.
Example:
# Export to CSV
df.to_csv('cleaned_data.csv', index=False)
# Export to Parquet (requires a Parquet engine such as pyarrow or fastparquet)
df.to_parquet('cleaned_data.parquet')
5. End-to-End Example:
Discover:
- Inspect raw data, identify missing values, and check for outliers.
Transform:
- Handle missing data, clean text fields, and scale numeric values.
Validate:
- Check for logical consistency and data integrity.
Publish:
- Save cleaned data as a CSV file and upload it to a data warehouse.
CHAPTER 10: MACHINE LEARNING WITH BIG DATA
I. Machine Learning Overview
1. What is Machine Learning?
Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on developing algorithms
and statistical models that enable computers to learn from and make predictions or decisions based
on data without being explicitly programmed. It allows systems to improve their performance on a task
with experience.
Key Components of Machine Learning
- Data: The foundation of ML. It can be structured (like tables) or unstructured (like images, audio).
- Algorithms: Mathematical models that learn patterns from data.
- Training: The process of feeding data into an algorithm to help it learn.
- Model: The result of training; it’s used to make predictions or decisions.
- Features: The input variables or attributes used for making predictions.
- Labels: The target outcomes the model aims to predict (used in supervised learning).
Types of Machine Learning
a. Supervised Learning:
- The model is trained on labeled data (input-output pairs).
- Goal: Predict outcomes for new inputs.
Examples:
- Predicting house prices (regression).
- Classifying emails as spam or not (classification).
b. Unsupervised Learning:
The model works with unlabeled data and identifies patterns or structures.
Goal: Discover hidden relationships.
Examples:
- Customer segmentation (clustering).
- Anomaly detection.
c. Semi-Supervised Learning:
A mix of labeled and unlabeled data is used.
Goal: Leverage a small amount of labeled data to enhance learning.
d. Reinforcement Learning:
The model learns by interacting with its environment and receiving rewards or penalties.
Goal: Maximize long-term rewards.
Examples:
- Robotics.
- Game playing (e.g., AlphaGo).
Applications of Machine Learning
Business: Fraud detection, Demand forecasting, Customer churn prediction.
Healthcare: Disease diagnosis, Personalized medicine, Predicting patient outcomes.
Technology: Search engine algorithms, Voice assistants (e.g., Siri, Alexa), Recommendation systems
(e.g., Netflix, Spotify).
Finance: Stock market prediction, Risk assessment
Others: Self-driving cars, Natural language processing (e.g., chatbots), Image and speech recognition.
Benefits
Automation of repetitive tasks.
Improved accuracy in predictions.
Ability to handle large-scale data.
Adaptability to new data.
Challenges
Data Quality: Garbage in, garbage out.
Bias: Algorithms can reflect and amplify biases present in data.
Overfitting: Models may perform well on training data but fail on unseen data.
Scalability: Handling vast amounts of data efficiently.
2. Machine Learning Process
The machine learning process involves a systematic series of steps to build, evaluate, and deploy a
model for solving a specific problem. Here's a detailed breakdown of the process:
Step 1: Acquire Data
Gather raw data from relevant sources to serve as input for the machine learning model.
Activities:
Identify data sources (e.g., databases, APIs, web scraping, IoT devices).
Collect structured, semi-structured, or unstructured data.
Ensure data volume and variety are sufficient for the problem.
Tools:
APIs (e.g., RESTful APIs).
Databases (e.g., MySQL, PostgreSQL).
Cloud Storage (e.g., AWS S3, Google Cloud).
Example:
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('dataset.csv')
Step 2: Prepare Data
2-A: Explore
Understand and analyze the data to identify patterns, trends, and potential issues.
Activities:
Visualize data distributions (e.g., histograms, scatter plots).
Identify missing values, outliers, and inconsistencies.
Summarize data using statistics (mean, median, standard deviation).
Tools:
Visualization: Matplotlib, Seaborn, Tableau.
Data Exploration: Pandas, Excel.
Example:
# Check for missing values
print(data.isnull().sum())
# Visualize distribution of a variable
import matplotlib.pyplot as plt
data['age'].hist()
plt.show()
2-B: Pre-process
Clean and transform the data to make it suitable for modeling.
Activities:
Handle Missing Data:
- Impute missing values or remove incomplete records.
Normalize/Scale:
- Scale numerical features for consistent ranges.
Encode Categorical Variables:
- Convert categories into numeric representations (e.g., one-hot encoding).
Feature Engineering:
- Create new features or transform existing ones.
Tools:
Data Cleaning: Pandas, OpenRefine.
Feature Engineering: Scikit-learn, PySpark.
Example:
from sklearn.preprocessing import StandardScaler
# Scale numerical data
scaler = StandardScaler()
data['scaled_age'] = scaler.fit_transform(data[['age']])
Step 3: Analyze Data
Build and train machine learning models using the prepared data.
Activities:
Split data into training and testing sets.
Choose an appropriate algorithm (e.g., linear regression, decision trees, neural networks).
Train the model and evaluate its performance.
Tools:
Python Libraries: Scikit-learn, TensorFlow, PyTorch.
Jupyter Notebook for experiments.
Example:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Split data
X = data[['feature1', 'feature2']]
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
Step 4: Communicate Results
Share findings and insights with stakeholders.
Activities:
Visualize model performance (e.g., accuracy, confusion matrix, ROC curve).
Interpret the results in business terms.
Create reports or dashboards for stakeholders.
Tools:
Visualization: Matplotlib, Seaborn, Power BI.
Reporting: Tableau, Excel.
Example:
from sklearn.metrics import mean_squared_error
# Evaluate model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
Step 5: Apply Results
Implement the trained model into production and monitor its performance.
Activities:
Deploy the model via APIs or as part of an application.
Continuously monitor performance to ensure accuracy.
Retrain the model as new data becomes available.
Tools:
Deployment: Flask, FastAPI, AWS SageMaker.
Monitoring: MLflow, Prometheus.
Example:
# Save model
import joblib
joblib.dump(model, 'model.pkl')
# Load and use the model
loaded_model = joblib.load('model.pkl')
# Values must follow the training column order (feature1, feature2)
new_prediction = loaded_model.predict([[5.1, 3.5]])
II. CRISP-DM
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is a widely used methodology
for data mining and machine learning projects. It provides a structured, six-phase process to ensure
that projects are systematically executed and results are actionable.
Phase 1: Business Understanding
Focuses on understanding the problem or opportunity and aligning project goals with business
objectives.
Key Activities:
- Define Problem/Opportunity: Clearly articulate the problem or goal (e.g., "Reduce customer
churn by 10%").
- Assess Situation: Understand the constraints, resources, risks, and requirements.
- Formulate Goals: Translate business objectives into data science tasks (e.g., "Develop a churn
prediction model").
Outputs:
- Project charter.
- Success criteria.
Phase 2: Data Understanding
Involves collecting, exploring, and analyzing data to ensure it’s suitable for the project.
Key Activities:
- Data Acquisition: Gather data from various sources (databases, APIs, logs, etc.).
- Data Exploration: Summarize data with statistics and visualizations. Identify data quality issues
(missing values, outliers, inconsistencies).
Outputs:
- Initial data report.
- Description of data quality and potential challenges.
Phase 3: Data Preparation
Focuses on cleaning and transforming the data into a format suitable for modeling.
Key Activities:
- Prepare Data for Modeling: Split data into training and test sets. Normalize, scale, and encode
variables.
- Address Quality Issues: Handle missing, duplicate, or inconsistent data.
- Select Features: Identify relevant features using statistical methods or domain knowledge.
Outputs: Cleaned and transformed dataset ready for modeling.
Phase 4: Modeling
Select and apply algorithms to build predictive or descriptive models.
Key Activities:
- Determine Type of Problem: Identify whether the task is regression, classification, clustering,
etc.
- Select Modeling Technique: Choose suitable algorithms (e.g., decision trees, neural networks).
- Build Model: Train models on the prepared data and optimize hyperparameters.
Outputs:
- Trained models.
- Description of modeling techniques.
Phase 5: Evaluation
Assess the model’s performance and ensure it meets business and project objectives.
Key Activities:
- Assess Model Performance: Use metrics such as accuracy, precision, recall, F1-score, or RMSE.
- Evaluate Results with Success Criteria: Verify if the model aligns with the defined business
goals.
Outputs:
- Model evaluation report.
- Decision on whether to proceed to deployment.
Phase 6: Deployment
Deliver the final model and integrate it into the business workflow.
Key Activities:
- Produce Final Report: Document the entire process and results for stakeholders.
- Deploy Model: Implement the model in production (e.g., via APIs or applications).
- Monitor Model: Track model performance over time and retrain as necessary.
Outputs:
- Deployed solution.
- Monitoring framework.
III. KNIME and Spark MLlib
KNIME (Konstanz Information Miner) and Spark MLlib are powerful tools used for data analytics and
machine learning. Here’s an overview of both and their comparison:
1. KNIME
Type: Data Analytics and Workflow Automation Platform
Core Features:
Visual Workflow Interface: Uses a drag-and-drop interface to design workflows, making it accessible to
non-programmers.
Broad Support for Data Sources: Connects to databases, files, web services, and big data systems.
Extensibility: Allows integration of Python, R, Java, and various plugins for advanced customization.
Machine Learning & Analytics: Includes pre-built machine learning algorithms and can integrate with
other tools like TensorFlow and H2O.
Deployment: Workflows can be deployed for automation and used in production environments.
Strengths:
Intuitive interface and easy learning curve.
Rich library of nodes for data preprocessing, transformation, and visualization.
Strong integration capabilities.
Use Case: Ideal for building end-to-end workflows for data preparation, modeling, and deployment
without needing much coding.
2. Spark MLlib
Type: Scalable Machine Learning Library for Apache Spark
Core Features:
Distributed Computation: Leverages Spark's distributed processing to handle large datasets across a
cluster.
Machine Learning Algorithms: Provides algorithms for classification, regression, clustering,
collaborative filtering, and more.
DataFrames API: Simplifies the integration of ML workflows with Spark SQL and other Spark
components.
Language Support: Supports Python, Scala, Java, and R.
Scalability: Designed to scale with big data on distributed systems.
Strengths:
Highly efficient for large datasets and distributed computing.
Built into Spark, making it a natural choice for big data environments.
Seamless integration with other Spark components like Spark Streaming and Spark SQL.
Use Case: Suited for processing and modeling massive datasets in distributed environments.
3. Comparison
Use KNIME if:
- You prefer a visual, no-code or low-code environment.
- Your dataset is not extremely large or does not require distributed computing.
- You need a tool for end-to-end data processing, analytics, and machine learning.
Use Spark MLlib if:
- Your data is large and distributed across clusters.
- You are comfortable with coding in Python, Scala, Java, or R.
- You are already using Spark for other big data tasks and want seamless integration.
CHAPTER 11: ALGORITHMS
I. Classification
1. Goal
The primary goal of a classification algorithm is to categorize data into predefined classes or labels.
Given an input (features), the algorithm predicts which category (class) the input belongs to. It learns
patterns from labeled training data and uses these patterns to classify new, unseen data.
2. Key Objectives:
- Accurate Predictions: Assign correct labels to new data points.
- Generalization: Perform well on unseen data, not just the training set.
- Efficiency: Achieve accurate classification in a computationally efficient manner.
Classification algorithms are widely used in:
- Spam detection (e.g., spam or not spam)
- Sentiment analysis (e.g., positive, negative, neutral)
- Medical diagnosis (e.g., disease present or not)
- Image recognition (e.g., identifying objects in images)
3. Common Classification Algorithms
1. Logistic Regression
- A linear model for binary classification.
- Predicts the probability of an input belonging to a class using the logistic (sigmoid) function.
- Can be extended to multiclass problems (e.g., softmax regression).
2. Decision Trees
- Constructs a tree-like structure where each node represents a decision based on input features.
- Simple and interpretable but prone to overfitting.
3. Random Forest
- An ensemble method that combines multiple decision trees.
- Reduces overfitting by averaging predictions from multiple trees.
4. Support Vector Machines (SVM)
- Finds the hyperplane that best separates classes in the feature space.
- Works well for high-dimensional data and non-linear classification using kernels.
5. k-Nearest Neighbors (k-NN)
- A non-parametric method that assigns a class based on the majority class among the k nearest
neighbors.
- Simple but can be computationally expensive for large datasets. Now, given another set of data points (also called testing data), allocate these points to a group by
6. Naive Bayes analyzing the training set. Note that the unclassified points are marked as ‘White’.
- Based on Bayes’ theorem, assuming feature independence. a. Why do we need a KNN algorithm?
- Efficient for text classification tasks like spam filtering or sentiment analysis.
7. Neural Networks
- Mimic the structure of the human brain with interconnected nodes (neurons).
- Include specialized models like Convolutional Neural Networks (CNNs) for image classification and Recurrent Neural Networks (RNNs) for sequential data.
8. Gradient Boosting Algorithms
- Build models sequentially to correct errors made by previous models.
- Popular implementations: XGBoost (Extreme Gradient Boosting), LightGBM, CatBoost.
9. Multi-class Classification Algorithms
- One-vs-All (OvA) or One-vs-One (OvO): Strategies to adapt binary classifiers for multiclass problems.
- Direct algorithms like Decision Trees, Random Forests, and Neural Networks handle multiclass natively.
4. How to Choose a Classification Algorithm?
The choice depends on:
- Dataset size: k-NN and SVM may not scale well for very large datasets.
- Feature types: Naive Bayes is good for categorical data; Neural Networks work well with image or text data.
- Interpretability: Decision Trees and Logistic Regression are interpretable; Neural Networks are not.
- Computational resources: Gradient Boosting and Neural Networks may require more computational power.
5. kNN Algorithms
KNN is one of the most basic yet essential classification algorithms in machine learning. It belongs to the supervised learning domain and is widely applied in pattern recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.
As an example, consider the following table of data points containing two features:
The K-NN algorithm is a versatile and widely used machine learning algorithm, valued mainly for its simplicity and ease of implementation. It does not require any assumptions about the underlying data distribution. It can also handle both numerical and categorical data, making it a flexible choice for various types of datasets in classification and regression tasks. It is a non-parametric method that makes predictions based on the similarity of data points in a given dataset. With a suitably large K, K-NN can also be less sensitive to individual outliers, because each prediction pools several neighbors.
The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance metric, such as Euclidean distance. The class or value of the data point is then determined by the majority vote or average of the K neighbors. This approach allows the algorithm to adapt to different patterns and make predictions based on the local structure of the data.
b. Distance Metrics Used in KNN Algorithm
Euclidean Distance
This is the Cartesian (straight-line) distance between two points in a plane or hyperplane. It can be visualized as the length of the straight line that joins the two points under consideration. This metric gives the net displacement between two states of an object.
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan Distance
The Manhattan distance is generally used when we are interested in the total distance traveled by the object instead of the displacement. It is calculated by summing the absolute differences between the coordinates of the points in n dimensions.
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Minkowski Distance
The Euclidean and Manhattan distances are both special cases of the Minkowski distance.
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
From the formula above, when p = 2 it is the same as the formula for the Euclidean distance, and when p = 1 we obtain the formula for the Manhattan distance.
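The relationship between the three metrics can be checked numerically. The sketch below is a minimal illustration, not a reference implementation; the function names and the two sample points are assumptions made for this example.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def euclidean_distance(a, b):
    # Special case p = 2: straight-line distance.
    return minkowski_distance(a, b, p=2)

def manhattan_distance(a, b):
    # Special case p = 1: sum of absolute coordinate differences.
    return minkowski_distance(a, b, p=1)

# Hypothetical points chosen so the results are easy to verify by hand.
point_a = [1.0, 2.0]
point_b = [4.0, 6.0]
print(euclidean_distance(point_a, point_b))  # sqrt(3^2 + 4^2)
print(manhattan_distance(point_a, point_b))  # 3 + 4
```

For these two points the coordinate differences are 3 and 4, so the Euclidean distance is 5 and the Manhattan distance is 7.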
c. How to choose the value of k for KNN Algorithm?
The value of k is crucial in the KNN algorithm because it defines how many neighbors take part in each prediction. It should be chosen based on the input data: if the data has more outliers or noise, a higher value of k usually works better. For binary classification, an odd value of k is recommended to avoid ties in the vote. Cross-validation can help select the best k for a given dataset.
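One way to apply cross-validation to this choice is sketched below. It assumes scikit-learn is available and uses synthetic data from `make_blobs`; the candidate list and the 5-fold setting are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration only.
X, y = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1.5, random_state=4)

candidate_ks = [1, 3, 5, 7, 9, 11]  # assumed search range
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds for this k.
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

The best k depends on the dataset, so the search should be repeated whenever the data changes.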
Algorithm for K-NN
DistanceToNN = sort(distance from 1st example, ..., distance from kth example)
value = target values of those k examples, in the same order
for i = k+1 to number of training records:
    Dist = distance(test example, ith example)
    if (Dist < any example in DistanceToNN):
        Remove the farthest example from DistanceToNN and value.
        Put the new example in DistanceToNN and value in sorted order.
Return average of value
A fit using K-NN is more reasonable than one using 1-NN: K-NN is much less affected by noise when the dataset is large.
In the K-NN algorithm, the predicted value can jump after a small change in the input, because the set of nearest neighbors changes. To handle this, we can weight the neighbors. If the distance to a neighbor is high, we want less effect from that neighbor. If the distance is low, that neighbor should have more influence than the others.
d. Workings of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.
Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that need to be considered while making a prediction.
Step 2: Calculating distance
To measure the similarity between the target and the training data points, Euclidean distance is used. The distance is calculated between the target point and each data point in the dataset.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
In a classification problem, the class label is determined by majority voting among the K nearest neighbors. The class with the most occurrences among the neighbors becomes the predicted class for the target data point.
In a regression problem, the predicted value is calculated by averaging the target values of the K nearest neighbors. The calculated average becomes the predicted output for the target data point.
Let X be the training dataset with n data points, where each data point is represented by a d-dimensional feature vector Xi, and let Y be the corresponding labels or values for each data point in X. Given a new data point x, the algorithm calculates the distance between x and each data point Xi in X using a distance metric, such as Euclidean distance.
The algorithm selects the K data points from X that have the shortest distances to x. For classification tasks, the algorithm assigns the label y that is most frequent among the K nearest neighbors to x. For regression tasks, the algorithm calculates the average or weighted average of the values y of the K nearest neighbors and assigns it as the predicted value for x.
e. Advantages of the KNN Algorithm
- Easy to implement, as the complexity of the algorithm is low.
- Adapts Easily – KNN stores all the training data in memory, so whenever a new example or data point is added the algorithm adjusts itself, and the new example contributes to future predictions as well.
- Few Hyperparameters – The only parameters required when training a KNN model are the value of k and the distance metric, both of which can be chosen according to the evaluation metric.
f. Disadvantages of the KNN Algorithm
- Does not scale – KNN is considered a lazy algorithm: it defers all computation to prediction time, which takes a lot of computing power as well as data storage. This makes the algorithm both time-consuming and resource-intensive.
- Curse of Dimensionality – Because of the peaking phenomenon, KNN is affected by the curse of dimensionality: the algorithm has a hard time classifying data points properly when the dimensionality is too high.
- Prone to Overfitting – Because the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Feature selection and dimensionality reduction techniques are therefore generally applied to deal with this problem.
g. Applications of the KNN Algorithm
- Data Preprocessing – In any machine learning problem we first perform EDA, and if the data contains missing values, multiple imputation methods are available. One such method is the KNN Imputer, which is quite effective and generally used for sophisticated imputation.
- Pattern Recognition – KNN works very well for pattern recognition. For example, a KNN model trained and evaluated on the MNIST dataset typically achieves very high accuracy.
- Recommendation Engines – The main task performed by a KNN algorithm is to assign a new query point to a pre-existing group that has been created from a huge corpus of data. This is exactly what recommender systems require: assign each user to a particular group and then provide recommendations based on that group’s preferences.
h. How to Calculate Euclidean Distance in the K-Nearest Neighbors Algorithm
The table represents our data set. We have two columns: Brightness and Saturation. Each row in the table has a class of either Red or Blue.
Before we introduce a new data entry, let's assume the value of K is 5.
Here's the new data entry:
Distance #2
For the second row, d2:
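The distance calculation and the vote among the K nearest rows can be sketched in plain Python. The table and the new entry are not reproduced above, so the rows and the query point below are assumed values used only for illustration. The inverse-distance weighting at the end is one possible way to let closer neighbors count more.

```python
import math
from collections import Counter, defaultdict

# Assumed (brightness, saturation, class) rows; not taken from the original table.
training_rows = [
    (40, 20, "Red"), (50, 50, "Blue"), (60, 90, "Blue"), (10, 25, "Red"),
    (70, 70, "Blue"), (60, 10, "Red"), (25, 80, "Blue"),
]
new_entry = (20, 35)  # assumed new data entry
k = 5

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance from the new entry to every training row.
distances = [(euclidean(new_entry, (b, s)), label) for b, s, label in training_rows]

# Keep only the k rows with the smallest distances.
nearest = sorted(distances, key=lambda pair: pair[0])[:k]

# Plain majority vote among the k nearest rows.
predicted = Counter(label for _, label in nearest).most_common(1)[0][0]

# Weighted vote: each neighbor contributes 1 / distance.
weights = defaultdict(float)
for dist, label in nearest:
    weights[label] += 1.0 / dist
weighted_prediction = max(weights, key=weights.get)

print(nearest)
print(predicted, weighted_prediction)
```

With these assumed rows, three of the five nearest neighbors are Red, so both the plain and the weighted vote predict Red.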
At this point, you should understand how the calculation works. Attempt to calculate the distance for the last four rows.
Since we chose 5 as the value of K, we'll only consider the first five rows. That is:
i. Implementation
# 1. Importing the modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# 2. Creating Dataset
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)
# 3. Visualize the Dataset
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
plt.figure(figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=y, marker='*', s=100, edgecolors='black')
plt.show()
# 4. Splitting Data into Training and Testing Datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 5. KNN Classifier Implementation
knn5 = KNeighborsClassifier(n_neighbors=5)
knn1 = KNeighborsClassifier(n_neighbors=1)
# 6. Predictions for the KNN Classifiers
knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)
y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)
# 7. Predict Accuracy for both k values
print("Accuracy with k=5", accuracy_score(y_test, y_pred_5) * 100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1) * 100)
# 8. Visualize Predictions
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_5, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=5", fontsize=20)
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_1, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=1", fontsize=20)
plt.show()
6. Decision Tree
a. What is a Decision Tree?
A decision tree is a tree-like structure that represents a set of decisions and their possible consequences. Each node in the tree represents a decision, and each branch represents an outcome of that decision. The leaves of the tree represent the final decisions or predictions.
Decision trees are created by recursively partitioning the data into smaller and smaller subsets. At each partition, the data is split based on a specific feature, and the split is made in a way that maximizes the information gain.
We use statistical methods for ordering attributes as root or internal node.
c. How Decision Trees Work?
The process of creating a decision tree involves:
- Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain, the best attribute to split the data is selected.
- Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
- Repeating the Process: The process is repeated recursively for each subset, creating a new internal node or leaf node until a stopping criterion is met (e.g., all instances in a node belong to the same class or a predefined depth is reached).
d. Metrics for Splitting
Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it was randomly classified according to the distribution of classes in the dataset.
$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$
where p_i is the probability of an instance being classified into a particular class, and C is the number of classes.
Entropy: Measures the amount of uncertainty or impurity in the dataset.
$$Entropy = -\sum_{i=1}^{C} p_i \log_2 p_i$$
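The two impurity measures, and the information gain derived from entropy, can be computed directly from class labels. The sketch below is a minimal illustration under stated assumptions: the helper names and the ten-label example split are hypothetical and chosen so the values are easy to check.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency of each class among the labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini_impurity(labels):
    # 1 minus the sum of squared class probabilities.
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    # Negative sum of p * log2(p) over the classes that occur.
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 "Yes" and 5 "No", split into two subsets.
parent = ["Yes"] * 5 + ["No"] * 5
children = [["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4]

print(gini_impurity(parent))                # evenly mixed binary node
print(entropy(parent))                      # evenly mixed binary node
print(information_gain(parent, children))   # gain from the split
```

An evenly mixed binary node has Gini impurity 0.5 and entropy 1 bit. A split that separates the classes better lowers the weighted child entropy and therefore raises the information gain.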
With relation to our dataset, this concept can be understood as:
- We assume that no pair of features are dependent. For example, the temperature being ‘Hot’ has nothing to do with the humidity, and the outlook being ‘Rainy’ has no effect on the winds. Hence, the features are assumed to be independent.
- Secondly, each feature is given the same weight (or importance). For example, knowing only temperature and humidity alone can’t predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.
c. Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Just to be clear, an example of a feature vector and corresponding class variable can be (refer to the 1st row of the dataset):
X = (Rainy, Hot, High, False)
y = No
So basically, P(y∣X) here means the probability of “Not playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”.
Now, it’s time to put a naive assumption to the Bayes’ theorem, which is independence among the features. So now, we split the evidence into its independent parts.
Now, if any two events A and B are independent, then P(A,B) = P(A)P(B).
Applied to the features, this makes the posterior proportional to the class probability times the per-feature conditional probabilities:
$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
So, finally, we are left with the task of calculating P(y) and P(xi∣y).
Please note that P(y) is also called class probability and P(xi∣y) is called conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi∣y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset.
We need to find P(xi∣yj) for each xi in X and yj in y. All these calculations have been demonstrated in the tables below:
So, in the figure above, we have calculated P(xi∣yj) for each xi in X and yj in y manually in the tables 1-4. For example, the probability that the temperature is cool given that golf is played, i.e. P(temp. = cool | play golf = Yes) = 3/9.
Also, we need to find the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
Let us test it on a new set of features (let us call it today):
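The precomputation can be sketched in Python. The tables are not reproduced in this section, so the 14 rows below are an assumed reconstruction of the weather data, chosen to agree with the two values quoted in the text (3/9 and 9/14) and with the first row (Rainy, Hot, High, False) → No. The function names are hypothetical. Because the "today" feature values are also not shown, the sketch scores that first row instead.

```python
from collections import Counter
from fractions import Fraction

# Assumed rows: (outlook, temperature, humidity, windy, play_golf).
ROWS = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

class_counts = Counter(row[-1] for row in ROWS)

def prior(label):
    """Class probability P(y)."""
    return Fraction(class_counts[label], len(ROWS))

def conditional(feature_index, value, label):
    """Conditional probability P(x_i = value | y = label) from raw counts."""
    matches = sum(1 for row in ROWS
                  if row[-1] == label and row[feature_index] == value)
    return Fraction(matches, class_counts[label])

def predict(features):
    """Score each class by P(y) * product of P(x_i | y); return the best label."""
    scores = {}
    for label in class_counts:
        score = prior(label)
        for index, value in enumerate(features):
            score *= conditional(index, value, label)
        scores[label] = score
    return max(scores, key=scores.get), scores

print(conditional(1, "Cool", "Yes"))  # 3/9, printed in lowest terms
print(prior("Yes"))
label, scores = predict(("Rainy", "Hot", "High", False))
print(label, scores)
```

This version uses raw counts with no smoothing, so a feature value never seen with a class drives that class's score to zero. Practical implementations usually add a smoothing term.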
II. Data Exploration
1. Why is Data Exploration Necessary?
Data exploration is a critical first step in any data analysis or machine learning project. It involves examining the dataset to understand its structure, identify patterns, and uncover potential issues. Proper exploration ensures the foundation for accurate, meaningful analysis and modeling.
Reasons for Data Exploration:
- Understand the Data: Gain insights into the data's nature, including its size, structure, types of variables, and distributions.
- Identify Data Quality Issues: Detect missing values, duplicates, outliers, and inconsistencies.
- Guide Feature Selection: Understand relationships between variables to select or engineer features for modeling.
- Refine Objectives: Align the dataset's characteristics with the problem's objectives.
- Prevent Errors: Address issues early to avoid misleading results in later stages.
2. Objectives of Data Exploration
The main objectives of data exploration are:
Data Quality Assessment:
- Check for missing values, outliers, and errors.
- Assess data types and formats for compatibility.
Descriptive Analysis:
- Summarize key statistics (mean, median, variance) for numerical features.
- Examine frequency distributions for categorical features.
Relationship Analysis:
- Identify correlations between numerical variables.
- Explore associations and dependencies among categorical features.
Distribution Understanding:
- Visualize data distributions to understand skewness, kurtosis, and modality.
- Assess the presence of anomalies or non-normal patterns.
Prepare for Modeling:
- Identify features with predictive power.
- Flag irrelevant or redundant features.
Hypothesis Formation:
- Generate hypotheses about potential trends or patterns that may be tested in further analysis.
3. Categories of Techniques for Exploring Data
a. Descriptive Statistics
Summary Statistics: Mean, median, mode, variance, standard deviation, and range.
Frequency Tables: Show counts and proportions for categorical variables.
Percentiles: Identify spread and concentration of data.
b. Data Visualization
Univariate Visualization:
- Histogram: Visualize the distribution of a single numerical variable.
- Boxplot: Identify outliers and spread.
- Bar Chart: Examine frequencies of categorical data.
Bivariate Visualization:
- Scatterplot: Study relationships between two numerical variables.
- Heatmap: Display correlations or relationships in a matrix format.
- Stacked Bar Chart: Analyze categorical relationships.
Multivariate Visualization:
- Pair Plots: Examine relationships between multiple numerical variables.
- 3D Plots: Explore high-dimensional data.
- Parallel Coordinates: Analyze relationships in high-dimensional categorical and numerical data.
c. Data Profiling
Assess completeness, uniqueness, and conformity of data.
Examine distributions and patterns across datasets.
Identify and handle outliers.
d. Correlation Analysis
Pearson Correlation: Measures linear relationships between numerical variables.
Spearman Rank Correlation: Captures monotonic relationships.
Chi-Square Test: Explores relationships between categorical variables.
e. Dimensionality Reduction Techniques
Principal Component Analysis (PCA): Understand variance in high-dimensional datasets.
t-SNE or UMAP: Visualize complex patterns in reduced dimensions.
f. Automated Tools
EDA Libraries:
- Python: Pandas Profiling, Sweetviz, Autoviz.
- R: DataExplorer, skimr.
These tools provide comprehensive automated reports summarizing key data properties.
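Several of these techniques need only a few pandas calls. The sketch below is a minimal illustration: the housing table, its column names, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price and one duplicated row.
df = pd.DataFrame({
    "size_sqft": [850, 900, 1200, 1500, 2000, 2000],
    "bedrooms": [2, 2, 3, 3, 4, 4],
    "price": [100.0, 110.0, 150.0, np.nan, 260.0, 260.0],
    "house_type": ["flat", "flat", "house", "house", "villa", "villa"],
})

# Data quality assessment: missing values and duplicate rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# Descriptive statistics for numerical features.
summary = df.describe()

# Frequency table (as proportions) for a categorical feature.
type_shares = df["house_type"].value_counts(normalize=True)

# Correlation analysis: linear (Pearson) and monotonic (Spearman).
numeric = df[["size_sqft", "bedrooms", "price"]]
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

print(missing_per_column, duplicate_rows, summary, type_shares, sep="\n\n")
print(pearson.loc["size_sqft", "price"], spearman.loc["size_sqft", "price"])
```

`DataFrame.corr` skips missing values pair by pair, so the row with the missing price is ignored only for correlations that involve `price`.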
III. Data Terminology
1. What is a Feature and How it Relates to a Sample?
A feature is an individual measurable property, variable, or attribute of the data used to describe an observation (or sample).
A sample (also known as a data point, instance, or observation) is a single row in a dataset, representing one entity or event being analyzed.
Relationship: A sample consists of multiple features. For example, in a dataset of houses, a sample could represent one house, and its features might include the house's size, number of bedrooms, location, and price.
2. Alternative Terms for ‘Feature’
Some commonly used alternative terms include:
- Variable
- Attribute
- Dimension
- Predictor (in the context of machine learning)
- Input (in machine learning models)
- Field (in databases)
- Covariate (in statistical analysis)
3. Categorical Feature vs. Numerical Feature
Categorical Feature:
- Represents discrete categories or groups.
- Values are non-numeric and describe qualities or labels.
Examples:
- Colors (e.g., red, blue, green)
- Gender (e.g., male, female)
- Product types (e.g., electronics, clothing)
Subtypes:
- Nominal: Categories without a natural order (e.g., fruit types: apple, orange, banana).
- Ordinal: Categories with a meaningful order but without numerical differences (e.g., education levels: high school < bachelor's < master's).
Usage:
- Often one-hot encoded or label encoded to be used in machine learning models.
Numerical Feature:
- Represents continuous or discrete numeric values.
- Values have quantitative meaning and can be measured or counted.
Examples:
- Age (e.g., 25, 30)
- Salary (e.g., $40,000, $50,000)
- Temperature (e.g., 25.5°C, 30°C)
Subtypes:
- Continuous: Can take any value within a range (e.g., height, weight).
- Discrete: Integer values or counts (e.g., number of children, years of experience).
IV. Exploring Data through Plots
1. How plots can be useful in exploring data
Plots are powerful tools for visually exploring data. They make it easier to understand patterns, relationships, and distributions within the dataset. By visualizing data, analysts and data scientists can:
- Identify Patterns and Trends: Detect clusters, trends, or deviations over time or across variables.
- Spot Anomalies and Outliers: Visualize extreme or unexpected values that could affect analysis.
- Understand Distributions: Explore how data is spread or concentrated.
- Examine Relationships: Study correlations and dependencies between variables.
- Guide Feature Engineering: Gain insights that help in creating, selecting, or transforming features.
2. Using a Scatter Plot
A scatter plot is a graph where individual data points are plotted on two axes, representing the relationship between two numerical variables.
Use Cases:
Analyze Relationships:
- Scatter plots are ideal for studying the relationship or correlation between two numerical variables. For instance:
- Positive correlation: As one variable increases, the other increases.
- Negative correlation: As one variable increases, the other decreases.
- No correlation: Variables show no discernible pattern.
Detect Patterns:
- Observe clusters, gaps, or trends in the data.
Spot Outliers:
- Identify data points that fall far from the expected relationship.
Highlight Groups:
- Add color, size, or shape to distinguish categories or subgroups in the data.
Example:
Plotting house size (x-axis) against price (y-axis) can show whether larger houses tend to have higher prices.
Enhancements:
Add a regression line to assess the strength and direction of the relationship.
Use color coding or grouping to add a third variable (e.g., house type).
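A scatter plot with both enhancements (a fitted line and color coding for a third variable) might look like the sketch below. The house sizes, prices, and house-type codes are randomly generated, hypothetical values. The `Agg` backend is an assumption so the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
size = rng.uniform(50, 250, 60)                   # hypothetical house sizes
price = 1000 * size + rng.normal(0, 20000, 60)    # hypothetical prices with noise
house_type = rng.integers(0, 2, 60)               # hypothetical category codes (0 or 1)

# Least-squares line: price is approximately slope * size + intercept.
slope, intercept = np.polyfit(size, price, deg=1)
correlation = np.corrcoef(size, price)[0, 1]

fig, ax = plt.subplots(figsize=(6, 4))
# Color encodes the third variable (house type).
ax.scatter(size, price, c=house_type, cmap="coolwarm", edgecolors="black")
xs = np.linspace(size.min(), size.max(), 100)
ax.plot(xs, slope * xs + intercept, color="black", label="fitted line")
ax.set_xlabel("House size")
ax.set_ylabel("Price")
ax.set_title(f"Size vs. price (r = {correlation:.2f})")
ax.legend()
fig.tight_layout()
# Call fig.savefig(...) or plt.show() here to view the figure.
```

The sign of the slope shows the direction of the relationship, and the correlation coefficient in the title gives a rough measure of its strength.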
3. What a Boxplot Shows
A boxplot (or box-and-whisker plot) is a graphical summary of the distribution of a numerical variable. It provides insights into data spread and highlights potential outliers.
Components of a Boxplot:
- Median (Line Inside the Box): Represents the middle value of the dataset.
- Box: The bottom and top edges of the box represent the first quartile (Q1) and third quartile (Q3), encompassing the interquartile range (IQR) (middle 50% of the data).
- Whiskers: Extend to the smallest and largest values within 1.5 × IQR below Q1 and above Q3.
- Outliers: Data points beyond the whiskers are plotted individually as dots or symbols.
Use Cases:
- Summarize Distribution: Show central tendency, spread, and variability in the data.
- Identify Outliers: Highlight extreme values that may need attention.
- Compare Groups: Compare distributions across multiple categories or groups by displaying side-by-side boxplots.
Example:
Using a boxplot to compare test scores across different schools can reveal which schools have higher medians, wider variability, or more outliers.
V. Exploring Data through Summary Statistics
1. What is a Summary Statistic?
A summary statistic is a single value that provides a concise numerical representation of certain aspects of a dataset. It simplifies large datasets into meaningful metrics, allowing analysts to quickly understand key properties of the data.
2. Three Common Summary Statistics
Mean (Average):
- The arithmetic average of a dataset.
- Calculated as the sum of all data values divided by the number of values.
- Example: If the dataset is [10, 20, 30], the mean is (10 + 20 + 30) / 3 = 20.
Median:
- The middle value when the data is sorted in ascending order.
- If the dataset has an even number of values, the median is the average of the two middle numbers.
- Example: For [10, 20, 30], the median is 20. For [10, 20, 30, 40], the median is (20 + 30) / 2 = 25.
Standard Deviation:
- Measures the spread or dispersion of data around the mean.
- A low standard deviation indicates that data points are close to the mean, while a high value indicates greater variability.
- Example: A dataset [10, 20, 30] has a smaller standard deviation compared to [10, 50, 100].
3. How Summary Statistics are Useful in Exploring Data
Summary statistics are invaluable during data exploration for several reasons:
Data Overview: Quickly understand the dataset's central tendency (mean, median), variability (standard deviation, variance), and range (min/max).
Detecting Issues:
- Identify anomalies such as outliers or extreme skewness in the data distribution.
- For instance, a large difference between the mean and median may indicate skewed data.
Comparing Groups:
- Compare characteristics across different categories or subgroups in the data.
- Example: Compare average income levels between different age groups.
Guiding Feature Engineering:
- Inform decisions about scaling, normalization, or transformations.
- Example: A feature whose standard deviation is much larger than that of the other features might suggest the need for normalization before modeling.
Efficient Communication: Summarize complex datasets into easily interpretable metrics for stakeholders.
By combining summary statistics with visualization techniques, analysts can gain a well-rounded understanding of the data and make informed decisions for further analysis or modeling.
VI. Addressing Data Quality Issues
1. What is Imputation?
Imputation refers to the process of replacing missing or incomplete data in a dataset with substituted values to maintain the dataset's usability and integrity for analysis or modeling. The goal of imputation is to minimize the bias introduced by missing data while preserving the dataset's statistical properties.
2. Three Ways to Handle Missing Values
a. Delete Rows or Columns:
Method: Remove samples (rows) or features (columns) with missing values.
When to Use:
- The missing data is sparse and not critical to the analysis.
- Deleting rows or columns will not significantly reduce the dataset's size.
Drawback: Risk of losing valuable information, especially in small datasets.
Example: If 2% of rows in a dataset are missing values for a specific feature, these rows can be dropped.
b. Impute with Statistical Methods:
Method: Replace missing values with calculated metrics, such as:
- Mean/Median/Mode: Use the average (mean), middle value (median), or most frequent value (mode).
- Forward/Backward Fill: Fill values using adjacent data points (common in time-series data).
When to Use: The missing values are random and not correlated with other features.
Drawback: May introduce bias if the data distribution is skewed or non-uniform.
Example: For missing house prices, use the median house price to fill gaps.
c. Model-Based Imputation:
Method: Use machine learning models to predict missing values based on other features.
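The three approaches (deleting rows, statistical imputation, and model-based imputation) can be compared on a small table. The sketch below assumes pandas and scikit-learn are available; the column names and values are hypothetical. `KNNImputer` stands in for model-based imputation here as one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with one missing salary.
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "experience": [2, 6, 10, 15, 20],
    "salary": [40_000, 50_000, np.nan, 70_000, 80_000],
})

# a. Delete rows that contain a missing value.
dropped = df.dropna()

# b. Statistical imputation: fill with the column median, or forward fill.
median_filled = df.fillna({"salary": df["salary"].median()})
forward_filled = df["salary"].ffill()

# c. Model-based imputation: average the salary of the 2 most similar rows,
#    where similarity is measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(len(dropped))
print(median_filled.loc[2, "salary"], forward_filled[2], knn_filled.loc[2, "salary"])
```

In this small example the median and the KNN estimate happen to coincide, while forward fill simply repeats the previous row's value. On real data the three strategies usually give different results.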
- Algorithms such as K-Nearest Neighbors (KNN), regression, or deep learning can be used for model-based imputation.
When to Use: The missing data shows patterns or correlations with other features.
Drawback: Computationally expensive and may overfit the imputed values.
Example: Predict a missing salary value using a regression model based on age, education, and experience.
3. Role of Domain Knowledge in Addressing Data Quality Issues
Domain knowledge plays a crucial role in effectively handling data quality issues, including missing values:
Understanding Context:
- Helps identify the likely causes of missing values (e.g., data entry errors, non-response in surveys).
- Enables selecting imputation strategies that align with the real-world scenario.
Guiding Imputation Choices:
- Ensures imputed values are realistic and meaningful within the domain.
- For example, imputing a patient's blood pressure requires medically plausible values.
Feature Relevance Assessment:
Domain experts can help determine whether a missing feature is critical or can be discarded without affecting analysis outcomes.
Avoiding Misinterpretation:
- Domain knowledge helps avoid introducing bias or misrepresenting data relationships.
- Example: Replacing missing categorical values with the mode may not make sense if the mode is highly context-dependent.
Evaluating Data Utility:
Experts can assess whether certain data points should be corrected, imputed, or excluded based on their impact on analysis.
By combining statistical techniques with domain-specific insights, data quality issues can be addressed more effectively, leading to robust and accurate analysis.
VII. Data Preparation Overview
1. Importance of Data Preparation
Data preparation is a crucial step in the data analysis and machine learning pipeline. It involves cleaning, transforming, and organizing raw data into a format suitable for analysis or modeling. Proper data preparation ensures the quality, accuracy, and usability of the dataset, which significantly impacts the reliability of insights and model performance.
- Removes inconsistencies, errors, and redundancies in the dataset.
- Prepared data reduces noise and improves feature relevance, leading to better model predictions.
- Resolving issues early prevents complications during analysis or model training.
- Clean and well-documented datasets make results easier to replicate.
- High-quality data ensures insights are accurate and actionable.
2. Objectives of Data Preparation
The primary objectives of data preparation are to:
Ensure Data Quality:
- Address missing values, outliers, and inconsistencies.
- Validate the completeness and accuracy of the dataset.
Transform Data for Analysis: Rescale, normalize, or encode features to make them suitable for machine learning algorithms.
Enhance Data Usability: Format and organize data for easier analysis and visualization.
Optimize Performance: Reduce noise, irrelevant features, or dimensionality to improve model performance.
Align Data with Objectives: Structure the data to fit the specific goals of the analysis or machine learning task.
3. Activities in Preparing Data
Data preparation involves several tasks, including:
Data Cleaning:
- Identify and handle missing data (imputation, removal, or replacement).
- Remove duplicates and irrelevant entries.
- Correct errors in data entry, such as typos or invalid values.
Data Transformation:
- Rescaling: Standardize or normalize numerical features.
- Encoding: Convert categorical variables into numerical formats (e.g., one-hot encoding, label encoding).
- Feature Engineering: Create new features or transform existing ones to enhance model performance.
- Dimensionality Reduction: Reduce the number of features using techniques like PCA.
Data Integration: Combine data from multiple sources into a cohesive dataset (e.g., joining or merging datasets).
Data Reduction:
- Remove irrelevant or redundant features.
- Sample data to reduce the dataset size without losing critical information.
Data Validation:
- Verify data quality and consistency after cleaning and transformation.
- Perform exploratory analysis to confirm assumptions.
Data Formatting: Reorganize data into a structure compatible with the intended analysis or machine learning pipeline (e.g., tabular format for structured data).
VIII. Dimensionality Reduction
1. What is Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while retaining as much meaningful information as possible. This is achieved by either selecting a subset of the most relevant features (feature selection) or transforming the data into a lower-dimensional space (feature extraction).
2. Benefits of Dimensionality Reduction
Improves Model Performance: Reducing irrelevant or redundant features helps prevent overfitting and improves model generalization.
Reduces Computational Complexity: Smaller datasets require less storage, memory, and processing time, enabling faster analysis and model training.
Enhances Visualization: Reducing high-dimensional data to 2D or 3D allows for easier visualization of patterns, clusters, or outliers.
Minimizes Noise: By focusing on the most important features, dimensionality reduction filters out noise, improving data quality.
Addresses the Curse of Dimensionality: High-dimensional data can lead to sparsity, making it challenging for algorithms to perform well. Dimensionality reduction mitigates this issue.
3. How PCA Transforms Your Data
Principal Component Analysis (PCA) is one of the most widely used dimensionality reduction techniques. It transforms data into a lower-dimensional space by finding new, uncorrelated variables (called principal components) that capture the maximum variance in the data.
Steps in PCA:
Standardize the Data:
- Normalize the dataset so that all features have the same scale (mean = 0, variance = 1).
- This ensures that features with larger ranges don’t dominate the results.
Compute the Covariance Matrix: Measure how the features in the dataset are correlated with each other.
Determine Eigenvectors and Eigenvalues:
- Eigenvectors represent the directions (principal components) of the new feature space.
- Eigenvalues indicate the amount of variance captured by each principal component.
Select Principal Components: Rank components based on their eigenvalues and select the top ones that capture the most variance (e.g., 95% of total variance).
Project Data: Transform the original data onto the selected principal components to create the reduced-dimensional dataset.
Key Characteristics of PCA:
Linear Transformation: PCA assumes a linear relationship between features.
Orthogonal Components: Principal components are uncorrelated and orthogonal to each other.
Variance Maximization: The first component captures the highest variance, the second captures the next highest variance orthogonal to the first, and so on.
Improving Model Performance: Reducing the feature set minimizes noise, resulting in better predictions and reduced overfitting.
Reducing Complexity: Simplifies the model, reducing computational cost and training time.
Enhancing Interpretability: Smaller feature sets are easier to analyze, interpret, and communicate.
Avoiding the Curse of Dimensionality: Addresses issues related to high-dimensional datasets, such as sparsity and degraded algorithm performance.
3. Three Approaches for Selecting Features
Filter Methods:
Use statistical techniques to evaluate the relevance of each feature independently of the model.
Examples:
- Correlation Analysis: Measures the correlation between features and the target variable.
- Chi-Square Test: Evaluates the association between categorical features and the target variable.
- Mutual Information: Measures the dependency between features and the target variable.
Pros:
- Computationally efficient and easy to implement.
- Model-agnostic.
Cons:
- Ignores feature interactions.
Wrapper Methods:
Iteratively select or eliminate features by training and evaluating the model's performance.
Examples:
- Forward Selection: Starts with no features, adds features incrementally based on performance improvement.
- Backward Elimination: Starts with all features, removes the least significant features iteratively.
Example: - Recursive Feature Elimination (RFE): Trains the model, ranks features by importance, and
If you have a dataset with 10 features, and PCA finds that 95% of the variance is explained by the first 3 removes the least important features.
principal components, you can reduce the dataset to these 3 components while retaining most of the Pros:
information. - Considers feature interactions.
Benefits of PCA: - Provides optimized feature subsets.
Reduces dimensionality while preserving critical information. Cons:
Removes multicollinearity by creating uncorrelated components. - Computationally expensive, especially for large datasets.
Simplifies data structure for analysis or visualization. Embedded Methods:
Limitations: Perform feature selection during the model training process, leveraging the algorithm's built-in
PCA assumes linearity and may not perform well with nonlinear relationships. capabilities.
Interpretability of transformed components can be challenging since they are linear combinations of Examples:
the original features. - Lasso Regression (L1 Regularization): Penalizes less important features by shrinking their
XI. Feature Selection coefficients to zero.
1. What is Feature Selection? - Decision Trees and Random Forests: Measure feature importance through splits or impurity
Feature selection involves identifying and retaining the most relevant and important features from a reduction.
dataset for a specific analysis or machine learning task. It eliminates irrelevant, redundant, or noisy Pros:
features, ensuring that only the most informative variables are used in modeling or analysis. - Efficient as it combines feature selection and model training.
2. Goal of Feature Selection - Accounts for feature interactions.
The primary goal of feature selection is to enhance the efficiency and effectiveness of a machine Cons:
learning model by: - Limited to the specific algorithm used.
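As a rough illustration of a filter method, the sketch below ranks features by the absolute value of their Pearson correlation with the target and keeps the top ones. The feature names, the toy values, and the helper names pearson and rank_features are all hypothetical, chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, target, top_k):
    """Score each feature independently of any model, then keep the top_k."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:top_k], scores

# Hypothetical toy data: three candidate features and one numeric target.
features = {
    "feature_a": [2, 4, 6, 8, 10],
    "feature_b": [5, 3, 4, 1, 2],
    "feature_c": [1, -1, 1, -1, 1],
}
target = [1, 2, 3, 4, 5]

selected, scores = rank_features(features, target, top_k=2)
print("selected:", selected)
print("scores:", scores)
```

Because each feature is scored on its own, this approach is cheap, but it cannot notice that two features are only useful in combination, which is the limitation noted for filter methods.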
X. Feature Transformation
1. Purpose of Feature Transformation
Feature transformation is a crucial step in preparing data for machine learning and big data analytics.
It involves modifying, creating, or enhancing features in a dataset to improve the performance and
accuracy of predictive models. The key purposes of feature transformation include:
- Improving Model Accuracy: Transformed features can make patterns in the data more
discernible to the algorithm.
- Ensuring Algorithm Compatibility: Some algorithms (e.g., gradient descent) require data in
specific formats or scales.
- Handling Skewed Data: Transformations can reduce skewness, ensuring better model
generalization.
- Reducing Dimensionality: Helps eliminate redundant or irrelevant information, streamlining the
modeling process.
2. Three Feature Transformation Operations
Scaling:
- Adjusts the range of numerical features to ensure uniformity (e.g., normalizing values between
0 and 1 or standardizing to have a mean of 0 and standard deviation of 1).
- Methods: Min-Max Scaling, Standard Scaling, Robust Scaling.
Log Transformation:
- Reduces the impact of outliers and handles skewed distributions by applying the logarithmic
function to numerical features.
- Example: x → log(x + 1).
Encoding Categorical Features:
- Converts categorical data into numerical form so that algorithms can process it.
- Methods: One-Hot Encoding, Label Encoding, Frequency Encoding.
3. When is Scaling Important?
Scaling is particularly important in the following scenarios:
Algorithms Sensitive to Feature Magnitude: Methods like Support Vector Machines (SVM), K-Nearest
Neighbors (KNN), Principal Component Analysis (PCA), and Gradient Descent-based models require
features to be on a similar scale to prevent one feature from dominating others.
Distance-Based Algorithms: For algorithms such as K-Means Clustering and KNN, the distance
between data points is crucial. Unscaled data can lead to biased results.
Features with Different Units: When datasets contain features measured in different units (e.g., weight
in kilograms and height in meters), scaling ensures uniform contribution to the model.
XI. Association Analysis
1. What is Association Analysis?
Association analysis is a data mining technique used to uncover interesting relationships, patterns, or
associations among items in large datasets. It identifies frequently co-occurring items or events and
derives rules that describe these relationships in a meaningful way. It is widely used in transactional
databases, such as retail sales or e-commerce.
The primary objective of association analysis is to discover association rules, which are statements of
the form:
If A occurs, then B is likely to occur (e.g., A → B)
These rules are evaluated using metrics such as:
- Support: The proportion of transactions containing a specific item or item set.
- Confidence: The likelihood that a transaction containing one item also contains another.
- Lift: The strength of an association compared to random chance.
2. Applications of Association Analysis
Market Basket Analysis:
- Identifying products often purchased together to optimize store layout, promotions, or
cross-selling strategies.
- Example: Customers who buy bread are likely to buy butter.
Recommendation Systems:
- Suggesting items to users based on their past purchases or preferences.
- Example: "Customers who bought X also bought Y."
Fraud Detection: Spotting unusual patterns or combinations of transactions indicative of fraudulent
behavior.
Healthcare and Bioinformatics: Discovering relationships between symptoms, diseases, and
treatments.
Web Usage Mining: Analyzing user clickstream data to understand browsing behavior and improve
website design or navigation.
3. What is an Item Set?
An item set is a collection of one or more items that appear together in a dataset, such as a
transactional database. Item sets are classified into:
- Single Item Set: Contains a single item (e.g., {milk}).
- K-Item Set: Contains 𝑘 items (e.g., {milk, bread} is a 2-item set).
In the context of association analysis, frequent item sets are those that occur in a significant number
of transactions, based on a predefined support threshold. Identifying frequent item sets is a key step in
deriving association rules.
For example:
- In a retail store's transaction dataset:
- {milk, bread, butter} is an item set.
- If this item set appears in 60% of transactions, its support is 60%.
XII. Association Analysis in Detail
1. Key Definitions
Support:
Support measures the frequency of an item or item set in the dataset.
It is the proportion of transactions in which the item set appears:
Support(X) = (number of transactions containing X) / (total number of transactions)
For example, if an item set {milk, bread} appears in 50 out of 200 transactions, the support is:
Support({milk, bread}) = 50/200 = 0.25 = 25%.
Confidence:
Confidence indicates the likelihood that a transaction containing one item set will also contain another
item set.
It is calculated as:
Confidence(A → B) = Support(A ∪ B) / Support(A)
For example, if {milk} appears in 100 transactions and {milk, bread} appears in 50 transactions:
Confidence({milk → bread}) = 50/100 = 0.5 = 50%.
2. Steps in Association Analysis
Data Preparation:
- Collect and preprocess the dataset (e.g., transactional data).
- Represent the data in a suitable format, such as a binary transaction matrix where rows
represent transactions and columns represent items.
Frequent Item Set Generation:
- Identify frequent item sets that meet a predefined support threshold.
- Algorithms like Apriori or FP-Growth are commonly used to efficiently find these item sets.
Association Rule Mining:
- Generate association rules from frequent item sets.
- Each rule must meet a minimum confidence threshold to be considered meaningful.
Evaluation: Assess the quality and relevance of the rules using metrics such as:
- Support: Ensures the rule applies to a significant portion of the data.
- Confidence: Ensures reliability of the rule.
- Lift: Evaluates the strength of the rule relative to random chance.
Deployment: Use the rules to derive actionable insights, such as improving marketing strategies or
product recommendations.
3. Formation of Association Rules from Item Sets
Frequent Item Sets:
Start with an item set that meets the minimum support threshold.
For example, if {milk, bread, butter} is a frequent item set, it qualifies for rule generation.
Generating Rules:
For each frequent item set, create possible rules by dividing the item set into antecedent (A) and
consequent (B).
Example: From {milk, bread, butter}, possible rules include:
milk, bread → butter
milk → bread, butter
bread → milk, butter
Evaluating Rules:
Calculate confidence for each rule and retain only those with confidence above the predefined
threshold.
For instance, if milk, bread → butter has a confidence of 80%, it may be considered a strong rule.
Lift Analysis:
Optionally, compute the lift to determine the strength of the rule compared to random chance:
Lift(A → B) = Confidence(A → B) / Support(B)
A lift > 1 indicates a positive association: A and B occur together more often than they would if they
were independent.
By systematically following these steps, association analysis uncovers meaningful patterns and
actionable rules from large datasets.
CHAPTER 12: UNSUPERVISED LEARNING
I. Introduction to Unsupervised Learning
Unsupervised learning, a fundamental type of machine learning, continues to evolve. This approach,
which focuses on input vectors without corresponding target values, has seen remarkable
developments in its ability to group and interpret information based on similarities, patterns, and
differences. The latest advancements in deep unsupervised learning models have enhanced this
capability, enabling more nuanced understanding of complex datasets.
In the table below, we’ve compared some of the key differences between unsupervised and supervised
learning:
1. Types of Unsupervised Learning
In the introduction, we mentioned that unsupervised learning is a method we use to group data when
no labels are present. Since no labels are present, unsupervised learning methods are typically applied
to build a concise representation of the data so we can derive imaginative content from it.
For example, if we were releasing a new product, we can use unsupervised learning methods to identify
who the target market for the new product will be: this is because there is no historical information
about who the target customer is and their demographics.
But unsupervised learning can be broken down into three main tasks:
- Clustering
- Association rules
- Dimensionality reduction.
Let’s delve deeper into each one:
Clustering
From a theoretical standpoint, instances within the same group tend to have similar properties. You can
observe this phenomenon in the periodic table. Members of the same group, separated by eighteen
columns, have the same number of electrons in the outermost shells of their atoms and form bonds of
the same type.
This is the idea that’s at play in clustering algorithms: clustering methods involve grouping untagged
data based on their similarities and differences. When two instances appear in different groups, we can
infer they have dissimilar properties.
Clustering is a popular type of unsupervised learning approach. You can even break it down further into
different types of clustering; for example:
- Exclusive clustering: Data is grouped such that a single data point exclusively belongs to one
cluster.
- Overlapping clustering: A soft cluster in which a single data point may belong to multiple
clusters with varying degrees of membership.
- Hierarchical clustering: A type of clustering in which groups are created such that similar
instances are within the same group and different objects are in other groups, with the groups
nested inside one another to form a hierarchy.
- Probabilistic clustering: Clusters are created using probability distributions.
Association Rule Mining
This type of unsupervised machine learning takes a rule-based approach to discovering interesting
relationships between features in a given dataset. It works by using a measure of interest to identify
strong rules found within a dataset.
We typically see association rule mining used for market basket analysis: this is a data mining
technique retailers use to gain a better understanding of customer purchasing patterns based on the
relationships between various products.
The most widely used algorithm for association rule learning is the Apriori algorithm. However, other
algorithms are used for this type of unsupervised learning, such as the Eclat and FP-growth algorithms.
Dimensionality Reduction
Popular algorithms used for dimensionality reduction include principal component analysis (PCA) and
Singular Value Decomposition (SVD). These algorithms seek to transform data from high-dimensional
spaces to low-dimensional spaces without compromising meaningful properties in the original data.
These techniques are typically deployed during exploratory data analysis (EDA) or data processing to
prepare the data for modeling.
It’s helpful to reduce the dimensionality of a dataset during EDA to help visualize data: this is because
visualizing data in more than three dimensions is difficult. From a data processing perspective,
reducing the dimensionality of the data simplifies the modeling problem.
When more input features are being fed into the model, the model must learn a more complex
approximation function. This phenomenon can be summed up by a saying called the “curse of
dimensionality.”
2. Unsupervised Learning Applications
Most executives would have no problem identifying use cases for supervised machine learning tasks;
the same cannot be said for unsupervised learning.
One reason this may be is down to the simple nature of risk. Unsupervised learning introduces much
more risk than supervised learning since there’s no clear way to measure results against ground truth
in an offline manner, and it may be too risky to conduct an online evaluation.
Nonetheless, there are several valuable unsupervised learning use cases at the enterprise level.
Beyond using unsupervised techniques to explore data, some common real-world use cases
include:
- Natural language processing (NLP). Google News is known to leverage unsupervised learning to
categorize articles based on the same story from various news outlets. For instance, the results
of the football transfer window can all be categorized under football.
- Image and video analysis. Visual perception tasks such as object recognition leverage
unsupervised learning.
- Anomaly detection. Unsupervised learning is used to identify data points, events, and/or
observations that deviate from a dataset's normal behavior.
- Customer segmentation. Interesting buyer persona profiles can be created using unsupervised
learning. This helps businesses to understand their customers' common traits and purchasing
habits, thus enabling them to align their products accordingly.
- Recommendation engines. Past purchase behavior coupled with unsupervised learning can be
used to help businesses discover data trends that they could use to develop effective
cross-selling strategies.
3. Unsupervised Learning Example in Python
Principal component analysis (PCA) is the process of computing the principal components then using
them to perform a change of basis on the data. In other words, PCA is an unsupervised learning
dimensionality reduction technique.
It’s useful to reduce the dimensionality of a dataset for two main reasons:
- When there are too many dimensions in a dataset to visualize
- To identify the most predictive n dimensions for feature selection when building a predictive
model.
In this section, we will implement the PCA algorithm in Python on the Iris dataset and then visualize it
using matplotlib.
Let’s start by importing the necessary libraries and the data.
The iris dataset has four features. Attempting to visualize data in four dimensions or more is impossible
because we have no clue what things in such a high dimension would look like. The next best thing we
could do is to depict it in three dimensions, which is not impossible but still challenging.
With PCA, we can reduce the dimensions of the data down to two, which would then make it easier to
visualize our data and tell apart the classes.
For example:
In the code above, we transform the iris dataset features, only keeping two components, and then plot
the reduced data in a two-dimensional plane.
Now, it’s much easier for us to gather information about the data and how the classes are separated.
We can use this insight to decide on the next steps to take if we were to fit a machine learning model
onto our data.
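For readers who want to see the mechanics of the reduction to two components, here is a minimal, library-free sketch. It is an illustration under stated assumptions, not the listing this section refers to: it uses a small made-up dataset rather than Iris, it assumes no feature is constant, and it uses power iteration with deflation as a simple stand-in for a full eigendecomposition. All function names are hypothetical; in practice a library such as scikit-learn would normally be used.

```python
import math
import random

def standardize(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def covariance_matrix(rows):
    """Covariance matrix of data that has already been centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def leading_eigenpair(matrix, iterations=500, seed=0):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec

def pca(rows, n_components=2):
    """Standardize, build the covariance matrix, extract components, project."""
    z = standardize(rows)
    cov = covariance_matrix(z)
    d = len(cov)
    components, eigenvalues = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        eigenvalues.append(value)
        # Deflation: remove the direction just found before searching again.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(a * b for a, b in zip(row, comp)) for comp in components]
                 for row in z]
    # After standardization the total variance equals the number of features.
    explained = [value / d for value in eigenvalues]
    return projected, explained, components

# Made-up data: the first two features are almost perfectly correlated.
data = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 5.0],
    [4.0, 7.9, 2.0],
    [5.0, 10.1, 6.0],
    [6.0, 12.0, 3.0],
    [7.0, 13.8, 4.0],
]
projected, explained, components = pca(data, n_components=2)
print("explained variance ratios:", explained)
print("first projected row:", projected[0])
```

The two columns of projected are the coordinates that would be drawn on a two-dimensional scatter plot, and explained indicates how much of the total variance each axis retains.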
4. Linear Regression
How Linear Regression Works
Linear regression is a statistical method used to model the relationship between one or more
independent variables (predictors) and a dependent variable (response). The goal is to fit a linear
equation to the data that best predicts the dependent variable.
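A minimal sketch of fitting such a line for a single predictor is shown below. It assumes ordinary least squares as the fitting criterion (the text above does not name one) and uses made-up data; the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor.

    slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on a straight line.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predicted response for a new x = 6
print(slope, intercept, prediction)
```

With several predictors the same idea applies, but the coefficients are usually obtained with matrix methods or a library rather than these scalar formulas.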
QUIZZES
Chapter 8.
1) What are the two main processes associated with an Apache Spark application? Describe them
in detail.
Apache Spark applications rely on two main processes: Driver and Executor. These processes work
together to perform distributed data processing.
Driver Process:
Role: Acts as the master node of a Spark application.
Responsibilities:
- Job Coordination: The driver program defines the main logic of the application and coordinates
the execution of tasks across the cluster.
- Task Scheduling: It breaks the work into smaller tasks and schedules them to run on Executors.
- Communication: Sends tasks to Executors and collects results from them.
- Maintains State: Tracks the state of all tasks running in the Executors.
Components:
- SparkContext: This is the entry point for a Spark application. It connects the application to the
Spark cluster and is responsible for resource allocation.
- DAG Scheduler: Converts high-level operations into a directed acyclic graph (DAG) of stages.
- Task Scheduler: Breaks down stages into tasks and assigns them to Executors.
Lifespan: The Driver process runs for the duration of the Spark application.
Executor Processes:
Role: Perform the actual computation and store data for an application.
Responsibilities:
- Task Execution: Executes tasks assigned by the Driver.
- Data Storage: Manages in-memory storage of RDDs (Resilient Distributed Datasets) or
DataFrames required for computation.
- Communication: Sends task status and results back to the Driver.
Components:
- Task Runner: Executes individual tasks as directed by the Driver.
- Block Manager: Manages data storage and memory, including intermediate data between tasks.
Lifespan: Executors live for the lifetime of the Spark application unless dynamic resource allocation is
enabled, allowing them to scale up or down.
2) Explain the Apache Spark Architecture
Apache Spark follows a Master-Slave Architecture that distributes data and computation across a
cluster.
Key Components:
Cluster Manager:
- Responsible for resource allocation and managing the cluster nodes.
- Examples: YARN, Mesos, or Spark's standalone cluster manager.
Driver:
- Central coordinator of the application.
- Creates the SparkContext, which acts as the connection to the cluster manager.
- Generates the Directed Acyclic Graph (DAG) of operations and converts it into execution stages.
Executor:
- Processes on worker nodes that perform the computation tasks.
- Store intermediate data in memory or on disk during computation.
- Report task status and results back to the Driver.
Cluster Node Types:
- Master Node: Hosts the Driver program and interacts with the Cluster Manager.
- Worker Nodes: Run Executor processes to perform computations.
Workflow:
Application Submission: The Driver program is submitted to the cluster manager.
Resource Allocation: The cluster manager allocates resources (Executors) to the application.
Task Scheduling: The Driver breaks down operations into stages and tasks and schedules them on
Executors.
Task Execution: Executors execute the tasks on chunks of data, leveraging in-memory storage for
speed.
Result Collection: Executors return results to the Driver for final aggregation.
Termination: Once the application is complete, the Driver and Executors shut down.
Chapter 9.
Research several tools for data wrangling:
• OpenRefine
• Google DataPrep
• Watson Studio Refinery
• Trifacta Wrangler
Data wrangling involves cleaning and transforming raw data into a structured format suitable for
analysis. Several tools facilitate this process, each offering unique features and capabilities. Here's an
overview of four notable data wrangling tools:
OpenRefine
An open-source tool that allows users to import data in various formats (e.g., CSV, JSON, XML) and
perform data cleaning and transformation through a user-friendly, menu-based interface. It's
particularly useful for detecting and correcting inconsistencies in medium-sized datasets.
Google DataPrep
A cloud-based service integrated with Google Cloud Platform, designed for preparing structured and
unstructured data. It offers automated schema detection, anomaly identification, and interactive
transformation suggestions, making it accessible for non-technical users handling large datasets.
Watson Studio Refinery
Part of IBM's Watson Studio, this tool assists in transforming raw data with features like automatic data
type detection and the ability to process large datasets. It integrates with IBM's data governance
framework, ensuring compliance with data policies, which is beneficial for enterprises requiring strict
data governance.
Trifacta Wrangler
A cloud-based service known for its collaborative interface, designed for data cleaning and
transformation. It supports exporting cleaned data to various platforms, including Excel and Tableau,
facilitating seamless integration into existing workflows. Trifacta Wrangler is suitable for teams working
on collaborative data wrangling projects.
When selecting a data wrangling tool, consider factors such as dataset size, complexity of
transformations required, integration with existing data platforms, and the technical proficiency of the
users. Each of these tools offers distinct advantages tailored to different data preparation needs.
Chapter 10.
1. What is not machine learning?
o Explicit, step-by-step programming
o Data-driven decisions
o Discover hidden patterns
o Learning from data
Machine learning involves algorithms that learn patterns from data rather than following explicit
step-by-step instructions, so explicit, step-by-step programming is not considered machine learning.
2. Which of the following is not a category of machine learning?
o Regression
o Association Analysis
o Algorithm Prediction
o Cluster Analysis
o Classification
"Algorithm Prediction" is not a recognized category of machine learning. The main categories of
machine learning include supervised learning (e.g., regression, classification), unsupervised learning
(e.g., cluster analysis, association analysis), and reinforcement learning.
3. Which categories of machine learning techniques are supervised?
o Classification and cluster analysis
o Regression and association analysis
o Classification and regression
o Cluster analysis and association analysis
Supervised learning involves labeled data where the input-output relationship is explicitly defined.
Techniques such as classification (predicting categories) and regression (predicting continuous values)
are supervised. Cluster analysis and association analysis are unsupervised techniques.
Chapter 11.
1. What's Wrong with Pie Charts?
Problems with pie charts:
- Humans are not great at accurately perceiving and comparing angles, which makes it hard to
differentiate between similarly sized slices.
- Pie charts don’t provide a clear scale for precise comparison, unlike bar charts.
- They are ineffective for datasets with numerous categories as they become cluttered and unreadable.
- Small differences between slices can appear exaggerated depending on visual design, leading to
potential misinterpretation.
- Pie charts take up more space compared to other types of charts that convey the same information.
Good things about pie charts:
- Pie charts can be visually engaging and help grab attention in presentations.
- They provide an intuitive snapshot of proportions or percentages, making them useful for audiences
with no statistical background.
- They are good for showcasing the largest or smallest component of a dataset.
Do you agree with the statement that pie charts should never be used? Why or why not?
While pie charts have significant flaws, they are not entirely useless. When used sparingly for small
datasets with distinct categories (e.g., "50% vs. 30% vs. 20%"), they can effectively convey proportions
to non-technical audiences. However, for detailed analysis or comparisons, bar charts or stacked bar
charts are usually better choices. Pie charts should be avoided in analytical contexts but can serve a
role in presentations or reports for general audiences.
2. Domain Knowledge in Data Preparation
- Knowing whether missing values are systematic or random can guide strategies like imputation or
exclusion. For example, in healthcare, missing blood pressure readings might occur during routine
checkups but could be critical for emergencies
- Understanding units of measurement helps decide whether normalization or log transformation is
appropriate. For instance, in finance, converting logarithmic returns to percentage returns can make
results interpretable
- Domain knowledge helps distinguish between genuine outliers and data errors. For example, a very
high salary in a dataset of tech company employees might indicate a CEO rather than a mistake
- Knowing how specific variables influence outcomes allows for creating meaningful features. For
instance, in e-commerce, combining product views and add-to-cart actions could indicate purchase
intent
- Understanding the importance of rare events (e.g., fraud detection) can help determine whether
oversampling or weighted loss functions are necessary
Chapter 12.
Research the Apriori algorithm:
1) What is the Apriori algorithm?
2) Describe the Apriori algorithm
3) Run an example using the Apriori algorithm
1. The Apriori algorithm is a fundamental method in data mining used to identify frequent itemsets and
derive association rules from large datasets. It operates on the principle that any subset of a frequent
itemset must also be frequent, enabling efficient exploration of item combinations within transactional
databases.
2. The Apriori algorithm follows a systematic approach to discover frequent itemsets:
Initialization:
- Begin by identifying all individual items (1-itemsets) in the dataset and calculate their support
(frequency of occurrence).
- Define a minimum support threshold to determine which itemsets are considered frequent.
Iteration:
- Candidate Generation: Generate candidate itemsets of length 𝑘 + 1 from the frequent 𝑘-itemsets
identified in the previous iteration.
- Support Counting: Calculate the support for each candidate itemset by scanning the dataset.
- Pruning: Eliminate candidate itemsets that do not meet the minimum support threshold.
Termination: Repeat the iteration process until no new frequent itemsets are found.
Association Rule Generation: From the frequent itemsets, generate association rules that satisfy
minimum confidence thresholds, indicating strong relationships between items.
3. Example
Consider a dataset of transactions in a retail store:
Let's apply the Apriori algorithm with a minimum support threshold of 60% (i.e., an itemset must appear
in at least 3 out of 5 transactions).
Identify Frequent 1-Itemsets:
Calculate the support for each item:
Milk: 3/5 = 60%
Bread: 5/5 = 100%
Butter: 4/5 = 80%
All items meet the minimum support threshold.
Generate Candidate 2-Itemsets and Prune:
Form pairs: (Milk, Bread), (Milk, Butter), (Bread, Butter)
Calculate support:
(Milk, Bread): 3/5 = 60%
(Milk, Butter): 2/5 = 40%
(Bread, Butter): 4/5 = 80%
(Milk, Butter) is pruned as it doesn't meet the minimum support.
Generate Candidate 3-Itemsets and Prune:
Form the combination: (Milk, Bread, Butter)
Calculate support:
(Milk, Bread, Butter): 2/5 = 40%
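This itemset is pruned due to insufficient support. Apriori can also discard it before counting, because
its subset (Milk, Butter) is already known to be infrequent.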
Resulting Frequent Itemsets:
1-Itemsets: {Milk}, {Bread}, {Butter}
2-Itemsets: {Milk, Bread}, {Bread, Butter}
Generate Association Rules:
From the frequent itemsets, derive rules such as:
{Bread} → {Butter} with confidence = Support({Bread, Butter}) / Support({Bread}) = 80% / 100% = 80%
{Butter} → {Bread} with confidence = Support({Bread, Butter}) / Support({Butter}) = 80% / 80% = 100%
This process illustrates how the Apriori algorithm identifies frequent itemsets and generates
association rules, providing valuable insights for decision-making in various applications, such as
market basket analysis.
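The walkthrough above can be reproduced with a short Python sketch. The five transactions below are an assumption: they are hypothetical baskets chosen only so that they yield the supports quoted in the example. The 80% minimum confidence and the function names are likewise illustrative choices, not part of the original exercise.

```python
from itertools import combinations

# Hypothetical transactions, chosen to match the supports quoted above.
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Bread", "Butter"},
]

EPS = 1e-9  # tolerance so that thresholds such as 0.6 are met exactly

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset.issubset(basket))
    return hits / len(transactions)

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    items = sorted({item for basket in transactions for item in basket})
    current = [frozenset([item]) for item in items
               if support(frozenset([item]), transactions) >= min_support - EPS]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset, transactions)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        current = [c for c in candidates
                   if support(c, transactions) >= min_support - EPS]
    return frequent

def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence, lift) for qualifying rules."""
    rules = []
    for itemset, itemset_support in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                b = itemset - a
                confidence = itemset_support / frequent[a]
                lift = confidence / frequent[b]
                if confidence >= min_confidence - EPS:
                    rules.append((a, b, confidence, lift))
    return rules

frequent = apriori(transactions, min_support=0.6)
rules = generate_rules(frequent, min_confidence=0.8)

for itemset, value in frequent.items():
    print(sorted(itemset), value)
for a, b, confidence, lift in rules:
    print(sorted(a), "->", sorted(b), "confidence:", confidence, "lift:", lift)
```

In this sketch the candidate {Milk, Bread, Butter} is discarded by the subset check before its support is ever counted, because {Milk, Butter} is not frequent. That subset check is what keeps the number of dataset scans manageable on larger inputs.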