Unit 1-2
• 3) Poor-Quality Data
• If the training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it is
harder for the system to detect the underlying patterns, so it is less likely to perform well.
• It is often well worth the effort to spend time cleaning up your training data. Most data scientists spend a
significant part of their time doing just that.
•Healthcare: ML algorithms are used for disease diagnosis, medical imaging analysis, drug
discovery, personalized treatment plans, and predicting patient outcomes.
•Finance: In finance, ML is applied for fraud detection, algorithmic trading, credit scoring, risk
management, and personalized financial advice.
•Retail: ML helps in demand forecasting, recommendation systems, inventory management,
customer segmentation, and personalized marketing.
•Transportation: Self-driving cars, route optimization, traffic prediction, and predictive maintenance
of vehicles all leverage ML algorithms.
•Entertainment: ML is used for content recommendation on platforms like Netflix and Spotify,
automated content moderation, and audience sentiment analysis.
•Manufacturing: Predictive maintenance, quality control, supply chain optimization, and defect
detection are key applications in manufacturing.
•Customer Service: Chatbots, virtual assistants, sentiment analysis, and customer feedback analysis improve
customer support and experience.
•Agriculture: Precision farming, crop yield prediction, pest detection, and soil health monitoring benefit from
ML.
•Marketing: ML is used for customer segmentation, predictive analytics, targeted advertising, and customer
lifetime value prediction.
•Cybersecurity: Threat detection, anomaly detection, spam filtering, and intrusion detection systems
enhance cybersecurity measures.
•Energy: ML algorithms are applied for demand forecasting, grid management, predictive maintenance of
infrastructure, and optimizing energy consumption.
•Education: Personalized learning, automated grading, plagiarism detection, and student performance
prediction are some educational applications.
•Human Resources: Resume screening, employee performance analysis, talent management, and predicting
employee turnover benefit from ML.
•Environmental Science: Climate modeling, natural disaster prediction, species identification, and
environmental monitoring leverage ML algorithms.
•Data Collection:
•Gather data from various sources such as databases, APIs, web scraping, or files.
•Ensure the collected data is relevant and sufficient for the problem at hand.
•Data Cleaning:
•Handling Missing Values: Replace or impute missing values using techniques like mean/mode/median imputation,
forward/backward fill, or model-based methods such as k-nearest-neighbors imputation.
•Removing Duplicates: Identify and remove duplicate records to avoid redundancy.
•Handling Outliers: Detect and either remove or adjust outliers that can skew the results.
•Fixing Errors: Correct any inaccuracies or inconsistencies in the data.
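The cleaning steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function names, the `ages` column, and the choice of median imputation with IQR-based clipping (with the common multiplier 1.5) are all assumptions made for the example.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clip_outliers_iqr(values, k=1.5):
    """Clip values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical column with one missing value and one data-entry error (240).
ages = [25, None, 31, 29, 27, 30, 26, 28, 240]
filled = impute_median(ages)        # the gap is filled with the median
clipped = clip_outliers_iqr(filled)  # 240 is pulled back to the upper fence

# Hypothetical table with one repeated record.
rows = drop_duplicates([(1, "a"), (2, "b"), (1, "a")])
```

Clipping adjusts an outlier rather than removing it. Whether to clip, drop, or keep such a value depends on whether it is a measurement error or a genuine rare case.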
•Data Transformation:
•Normalization/Standardization: Scale the data to a common range (e.g., 0-1) or to have a mean of 0 and standard
deviation of 1. This is especially important for algorithms sensitive to the scale of data.
•Encoding Categorical Variables: Convert categorical data into numerical format using techniques like one-hot
encoding, label encoding, or ordinal encoding.
•Feature Engineering: Create new features or modify existing ones to better represent the underlying patterns in the
data. This may involve combining features, creating interaction terms, or decomposing features into more meaningful
components.
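As a rough sketch of the scaling and encoding steps above, the following uses only the Python standard library. The function names and the sample `incomes` and color values are hypothetical. The scaling functions assume the column is not constant (otherwise the denominator would be zero).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalization: rescale values linearly to the range 0-1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi != lo


def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes sigma != 0


def one_hot(labels):
    """One-hot encoding: one 0/1 column per category, in sorted order."""
    categories = sorted(set(labels))
    encoded = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, encoded


incomes = [20_000, 35_000, 50_000, 80_000]  # hypothetical numeric feature
scaled = min_max_scale(incomes)
z_scores = standardize(incomes)

# Hypothetical categorical feature.
categories, encoded = one_hot(["red", "green", "red", "blue"])
```

In practice the minimum, maximum, mean, and standard deviation would be computed on the training set only and then reused to transform the validation and test sets.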
•Data Reduction:
•Dimensionality Reduction: Reduce the number of features using techniques like Principal
Component Analysis (PCA), Linear Discriminant Analysis (LDA), or feature selection methods to
simplify the model and reduce computation time.
•Sampling: Reduce the size of the dataset by sampling, which can make the training process faster
and more efficient. Little information is lost as long as the sample remains representative of the full dataset.
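The two reduction ideas above can be illustrated with a small standard-library sketch. Note that this shows a simple variance-based feature selection, not PCA or LDA, which would normally come from a numerical library. The function names, the threshold, and the toy `data` matrix are assumptions made for the example.

```python
import random
from statistics import pvariance


def select_by_variance(rows, threshold=0.0):
    """Feature selection: keep only columns whose variance exceeds threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return keep, reduced


def sample_rows(rows, fraction, seed=0):
    """Sampling: draw a random subset of rows without replacement."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Hypothetical dataset: the middle column is constant and carries no information.
data = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.4],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.3],
]
kept_columns, reduced = select_by_variance(data)
subset = sample_rows(reduced, fraction=0.5)
```

Dropping a constant column loses nothing, since it cannot help distinguish one instance from another. Random sampling, by contrast, trades some information for speed.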
•Data Splitting:
•Split the dataset into training, validation, and test sets to evaluate the model's performance and
ensure it generalizes well to unseen data.
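A minimal sketch of a three-way split, using only the standard library. The function name, the 60/20/20 proportions, and the seed are assumptions for illustration; the source does not prescribe particular ratios.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle a copy of the rows, then slice it into three disjoint parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


samples = list(range(100))  # hypothetical dataset of 100 instances
train, val, test = train_val_test_split(samples)
```

The model is fitted on `train`, tuned against `val`, and evaluated once on `test`, so that the final score reflects data the model has never seen.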
•Data Integration:
•Combine data from multiple sources or tables to create a comprehensive dataset for analysis.
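Combining two tables on a shared key might look like the sketch below, which performs a simple left join over lists of dictionaries. The table names, the `customer_id` key, and the `left_join` helper are hypothetical, and the helper assumes the key is unique in the right-hand table.

```python
def left_join(left, right, key):
    """Attach columns from `right` to each row of `left` that shares the key.

    Rows of `left` without a match keep their data and get None for the
    columns that would have come from `right`.
    """
    index = {row[key]: row for row in right}  # assumes key is unique in right
    right_cols = sorted({c for row in right for c in row if c != key})
    merged = []
    for row in left:
        match = index.get(row[key], {})
        extra = {c: match.get(c) for c in right_cols}
        merged.append({**row, **extra})
    return merged


# Hypothetical source tables.
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
]
order_totals = [{"customer_id": 1, "total_spent": 120.0}]

dataset = left_join(customers, order_totals, "customer_id")
```

The None values introduced for unmatched rows are missing values in the integrated dataset, so integration is usually followed by another round of cleaning.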
•Data Annotation (if applicable):
•Label the data, especially in supervised learning tasks, where each instance needs a corresponding
label or target value.