
MCA520A
Introduction to Artificial Intelligence,
Machine Learning and Data Science

Unit-1
Introduction to Artificial Intelligence

‭1. Definition and Scope of Artificial Intelligence‬

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It involves the study and development of intelligent agents capable of perceiving their environment and taking actions to maximize their chances of success.
The scope of AI encompasses various cognitive functions such as understanding natural language, reasoning, problem-solving, learning from experience, and adapting to new situations.

Some daily-life applications of AI include chatbots, Google Assistant, facial recognition in mobile phones, social media applications, and spam mail detection.

‭2. Historical Background and Milestones in AI Development‬

Artificial Intelligence (AI) stands at the forefront of technological advancement today, but its roots trace back through a fascinating history marked by significant milestones and breakthroughs. From early conceptualizations to modern applications, AI has evolved into a transformative force shaping various aspects of society.

1950s: British mathematician Alan Turing proposed a test to determine a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. This seminal idea laid foundational principles for AI research.

1960s-1970s: Symbolic AI emerged, with systems capable of manipulating symbols and using logical reasoning (e.g., expert systems).

1980s-1990s: Knowledge-based systems gained popularity, focusing on encoding expert knowledge into software. Machine learning approaches began to gain traction.

2000s-2010s: The rise of big data fueled advances in machine learning, especially with neural networks and deep learning, achieving breakthroughs in tasks like image and speech recognition.

‭3. Various Branches of AI‬

Symbolic AI: Involves the use of algorithms to manipulate symbols based on predefined rules. Expert systems, which emulate the decision-making ability of a human expert, are a notable application.
An expert system is a computer program designed to solve complex problems and to provide decision-making ability like a human expert. It does this by extracting knowledge from its knowledge base and applying reasoning and inference rules to the user's queries.
The system supports decision making for complex problems using both facts and heuristics, like a human expert. It is called an expert system because it contains the expert knowledge of a specific domain and can solve complex problems within that particular domain. These systems are designed for a specific domain, such as medicine or science.

Statistical AI: Focuses on developing algorithms that can learn from and make predictions or decisions based on data. Machine learning techniques such as supervised learning, unsupervised learning, and reinforcement learning fall under this category.
Statistical AI models include linear regression (trend prediction), logistic regression (binary classification), decision trees (hierarchical decision-making), SVMs (high-dimensional classification), naive Bayes (text classification), KNN (similarity-based learning), and neural networks (complex tasks like image and speech recognition). To use them: understand the problem and the data, select a suitable model, prepare and split the data into training and testing sets, train the model, evaluate it on the test set, and optimize as needed.

Other Branches: Natural language processing (NLP) enables computers to understand and generate human language, while computer vision allows machines to interpret and understand visual information.

‭4. Applications of AI in Different Fields‬

1. Healthcare:
‭Medical Imaging and Diagnostics: AI aids in interpreting medical images like X-rays‬
‭and MRIs, improving accuracy and speed of diagnosis.‬
‭Personalized Medicine: AI analyzes patient data to tailor treatment plans based on‬
‭individual genetic profiles and medical histories.‬
‭Virtual Health Assistants: AI-powered chatbots and virtual agents provide patient‬
‭support, appointment scheduling, and medical advice.‬
‭Predictive Analytics: AI predicts patient outcomes and identifies at-risk individuals,‬
‭aiding in early intervention and preventive care.‬
‭Administrative Efficiency: AI automates tasks such as medical coding, scheduling, and‬
‭billing, improving operational efficiency.‬
‭Drug Discovery and Development: AI accelerates drug discovery processes and‬
‭predicts molecular interactions for new treatments.‬

2. Finance
‭Algorithmic Trading: AI analyzes large datasets and market trends to execute trades‬
‭autonomously and optimize investment strategies.‬
Fraud Detection: AI algorithms identify unusual patterns in transactions to detect and prevent fraudulent activities in real time.
‭Credit Scoring and Risk Assessment: AI evaluates creditworthiness by analyzing‬
‭financial data and behavioral patterns, improving accuracy in risk assessment.‬
‭Customer Service and Chatbots: AI-powered chatbots provide personalized customer‬
‭support, assist with inquiries, and manage financial transactions.‬
‭Robo-Advisors: AI algorithms recommend investment portfolios based on individual risk‬
‭profiles and financial goals, providing automated wealth management solutions.‬
‭Sentiment Analysis: AI analyzes news, social media, and other textual data to gauge‬
‭market sentiment and predict market movements.‬

3. Gaming
‭AI techniques are employed to create realistic game environments, develop intelligent‬
‭non-player characters (NPCs), and enhance player experience through procedural‬
‭content generation and adaptive gameplay.‬

‭5. Ethical Considerations and social impact of AI‬

Ethical Issues: AI systems can exhibit biases learned from training data, leading to
‭unfair treatment or decisions. Moreover, the automation of jobs raises concerns about‬
‭unemployment and the need for retraining the workforce.‬
‭Societal Impact: AI-driven automation has the potential to improve productivity and‬
‭create new job opportunities in emerging fields such as AI engineering and data‬
‭science. However, it also requires careful management to ensure that societal benefits‬
‭are equitably distributed and that ethical guidelines protect individual rights and privacy.‬
‭Unit 2‬
‭Fundamentals of Machine Learning‬

‭1. Introduction to Machine Learning (ML)‬

Machine Learning is a subfield of artificial intelligence (AI) that focuses on developing algorithms and techniques that enable computers to learn from and make predictions or decisions based on data. Unlike traditional programming, where rules are explicitly defined, ML algorithms learn patterns and relationships from data to improve their performance over time.

‭Importance of Machine Learning:‬

● ML enables computers to handle complex tasks that are difficult to program explicitly.
● It powers various applications such as recommendation systems, image and speech recognition, medical diagnostics, and autonomous driving.

‭2. Types of Machine Learning‬

Machine Learning can be broadly categorized into three main types based on the nature of the learning process and the availability of labeled data:

‭●‬ ‭Supervised Learning:‬


‭○‬ ‭In supervised learning, the algorithm learns from labeled data, where each‬
‭example is paired with a target label.‬

For example, consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed labeled images of dogs and cats to the algorithm, the machine will learn to classify a dog or a cat from these labeled images. When we input new dog or cat images that it has never seen before, it will use what it has learned to predict whether the image is of a dog or a cat. This is how supervised learning works; this particular case is image classification.
○ Other examples: predicting house prices based on features like square footage and number of bedrooms, classifying images into categories, etc.

There are two main categories of supervised learning, described below:
1. Classification 2. Regression

‭1. Classification‬

Classification deals with predicting categorical target variables, which represent discrete classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of heart disease. Classification algorithms learn to map the input features to one of the predefined classes.

Some classification algorithms:

Naive Bayes
Decision Tree
Support Vector Machine
Random Forest
K-Nearest Neighbors (KNN)

‭2. Regression‬

Regression, on the other hand, deals with predicting continuous target variables, which represent numerical values. For example, predicting the price of a house based on its size, location, and amenities, or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous numerical value.

Some regression algorithms:

Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
‭●‬ ‭Unsupervised Learning:‬
‭○‬ ‭Unsupervised learning involves learning patterns from unlabeled data.‬
‭○‬ ‭Example: Clustering similar documents together based on their content.‬

Example: Consider a dataset that contains information about the purchases you and other customers made from a shop. Through clustering, the algorithm can group customers with similar purchasing behavior, revealing customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers (a minimal clustering sketch follows at the end of this list).

‭●‬ ‭Reinforcement Learning:‬


○ Reinforcement learning is a method in which an agent learns by interacting with the environment: it takes actions and discovers errors through the feedback it receives.
○ Trial and error, and delayed reward, are the most relevant characteristics of reinforcement learning. In this technique, the model keeps improving its performance using reward feedback to learn the desired behavior or pattern. These algorithms are usually tailored to a particular problem.
○ Examples are Google's self-driving car and AlphaGo, where an agent competes with humans and even with itself to become a better and better Go player. Each time data is fed in, the agent learns and adds it to its knowledge (the training data). So the more it learns, the better trained and more experienced it becomes.
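As referenced above, here is a minimal clustering sketch of the purchasing-behavior example. It assumes scikit-learn is installed; the purchase numbers and the choice of two clusters are invented purely for illustration.

# Minimal unsupervised-learning sketch using k-means clustering (scikit-learn assumed installed).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [purchases per month, average amount spent] for one customer (invented data)
purchases = np.array([[2, 15], [3, 18], [25, 220], [30, 250], [4, 20], [28, 240]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(purchases)   # cluster index for each customer; no labels are needed
print(labels)                            # two purchasing-behavior groups emerge from the data alone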
‭3. Basic Concepts in Machine Learning‬

‭●‬ ‭Features and Labels:‬


‭○‬ ‭Features‬‭(or predictors) are individual measurable properties or‬
‭characteristics of the phenomenon being observed.‬
‭○‬ ‭Labels‬‭(or targets) are the outcomes or predictions‬‭that the model aims to‬
‭predict or classify.‬
‭●‬ ‭Training Data:‬
‭○‬ ‭Training data is the dataset used to train the machine learning model. It‬
‭consists of input-output pairs (features-labels) used to teach the model‬
‭patterns and relationships.‬

‭4. Popular Machine Learning Algorithms‬

‭Here are some widely used machine learning algorithms across different types:‬

‭●‬ ‭Linear Regression:‬


‭○‬ ‭Used for predicting a continuous value based on a linear relationship‬
‭between input features and the target variable.‬
‭○‬ ‭Example: Predicting house prices based on square footage.‬

When there is only one independent feature, it is known as Simple Linear Regression, and when there is more than one feature, it is known as Multiple Linear Regression.

Simple Linear Regression:

This is the simplest form of linear regression; it involves only one independent variable and one dependent variable. The equation for simple linear regression is:
y = β0 + β1*X
Where,
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope

‭Multiple Linear Regression‬

This involves more than one independent variable and one dependent variable. The equation for multiple linear regression is:
y = β0 + β1*X1 + β2*X2 + … + βn*Xn
Where
Y is the dependent variable
X1, X2, …, Xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes

Best fit line:

The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent variables.
In regression, a set of records with X and Y values is available, and these values are used to learn a function; if you want to predict Y for a new, unseen X, this learned function can be used (a small fitting sketch is shown below).
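A minimal sketch of fitting such a best-fit line with NumPy's least-squares polynomial fit; the square-footage and price numbers below are invented for illustration.

# Minimal best-fit-line sketch with NumPy (invented house-size/price data).
import numpy as np

X = np.array([500, 750, 1000, 1250, 1500])   # square footage
y = np.array([60, 85, 110, 140, 165])        # price in thousands

b1, b0 = np.polyfit(X, y, deg=1)             # slope (β1) and intercept (β0) of y = β0 + β1*X
print(f"intercept={b0:.2f}, slope={b1:.3f}")
print("predicted price for 1100 sq ft:", b0 + b1 * 1100)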
‭●‬ ‭Logistic Regression:‬

Used for binary classification problems where the output is a probability value between 0 and 1. Example: predicting whether an email is spam or not.
For example, with two classes, Class 0 and Class 1: if the value of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.

Key points:
=> Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.
=> It can be Yes or No, 0 or 1, True or False, etc., but instead of giving exact values of 0 and 1, it gives probabilistic values which lie between 0 and 1.
=> In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
‭●‬ ‭Decision Trees:‬

A versatile algorithm that can perform both classification and regression tasks by recursively splitting the data into subsets based on features.
A decision tree is a flowchart-like structure used to make decisions or predictions. It consists of nodes representing decisions or tests on attributes, branches representing the outcomes of these decisions, and leaf nodes representing final outcomes or predictions. Each internal node corresponds to a test on an attribute, each branch corresponds to the result of the test, and each leaf node corresponds to a class label or a continuous value.

Example: Predicting whether a customer will purchase a product based on demographic data.

‭Structure of a Decision Tree:‬

Root Node: Represents the entire dataset and the initial decision to be made.
‭Internal Nodes: Represent decisions or tests on attributes. Each internal‬
‭node has one or more branches.‬
‭Branches: Represent the outcome of a decision or test, leading to another‬
‭node.‬
‭Leaf Nodes: Represent the final decision or prediction. No further splits‬
‭occur at these nodes.‬

‭How Decision Trees Work?‬

The process of creating a decision tree involves:
Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain, the best attribute to split the data on is selected.
Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
Repeating the Process: The process is repeated recursively for each subset, creating a new internal node or leaf node until a stopping criterion is met (e.g., all instances in a node belong to the same class, or a predefined depth is reached).

‭●‬ ‭k-Nearest Neighbors (k-NN):‬


‭○‬ ‭A simple algorithm that classifies new data points based on majority vote‬
‭of their neighbors.‬
‭○‬ ‭Example: Classifying a flower species based on its measurements by‬
‭comparing it to the measurements of its nearest neighbors.‬

I‭ntuition Behind KNN Algorithm:‬


If we plot the training points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified point, we can assign it to a group by observing which group its nearest neighbors belong to. This means a point close to a cluster of points classified as 'Red' has a higher probability of being classified as 'Red'.
‭Distance Metrics Used in KNN Algorithm‬

The KNN algorithm helps us identify the nearest points or groups for a query point. But to determine the closest groups or nearest points, we need some metric. For this purpose, we use distance metrics such as Euclidean distance, Manhattan distance, and Minkowski distance.

‭Workings of KNN algorithm:‬

Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that need to be considered while making a prediction.
Step 2: Calculating distance
To measure the similarity between the target and training data points, Euclidean distance is used. The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding nearest neighbors
The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for classification or taking the average for regression
In a classification problem, the class label is determined by majority voting. The class with the most occurrences among the neighbors becomes the predicted class for the target data point.
In a regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors. The calculated average becomes the predicted output for the target data point.
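A minimal sketch of the flower-classification example using scikit-learn's KNeighborsClassifier (scikit-learn assumed installed); the split proportion and K value are arbitrary choices for illustration.

# Minimal k-NN classification sketch (scikit-learn assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # flower measurements and species labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)             # Step 1: choose K
knn.fit(X_train, y_train)                             # distances to neighbors are computed at prediction time
print("test accuracy:", knn.score(X_test, y_test))    # Step 4: majority vote among the 5 nearest neighbors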

‭5. Evaluation Metrics for Machine Learning Models‬

‭●‬ ‭Accuracy:‬‭Proportion of correctly predicted instances‬‭among the total instances.‬

● Precision: A measure of a model's performance that tells you how many of the positive predictions made by the model are actually correct. It is calculated as the number of true positive predictions divided by the number of true positive and false positive predictions.

‭●‬ R
‭ ecall (Sensitivity):‬‭Proportion of true positive‬‭predictions among all actual‬
‭positive instances.‬
‭●‬ F
‭ 1-score:‬‭Harmonic mean of precision and recall, providing‬‭a balanced measure‬
‭between them.‬
‭F1 score = 2*(1/((1/precision)+(1/recall)))‬

Note: Higher precision with lower recall can look like great accuracy, but such a model misses a large number of positive instances. The higher the F1 score, the better the performance.
‭Note:‬
‭True Positives: It is the case where we predicted Yes and the real output was also‬
‭Yes.‬
‭True Negatives: It is the case where we predicted No and the real output was also‬
‭No.‬
‭False Positives: It is the case where we predicted Yes but it was actually No.‬
‭False Negatives: It is the case where we predicted No but it was actually Yes.‬
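A minimal sketch computing these metrics, assuming scikit-learn is installed; the label vectors below are invented for illustration.

# Minimal evaluation-metrics sketch (scikit-learn assumed installed; labels invented).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall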
‭Unit-3‬
‭Machine Learning Techniques‬

‭1. Data Preprocessing Techniques‬

a. Handling Missing Data
Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like "NA" or "unknown." These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.

I‭mportance: Missing data is common in real-world datasets and can adversely affect‬
‭model performance if not handled properly.‬

‭Why Is Data Missing From the Dataset?‬

Data can be missing for many reasons, such as technical issues, human errors, privacy concerns, data processing issues, or the nature of the variable itself. Understanding the cause of missing data helps you choose appropriate handling strategies and ensure the quality of your analysis.
‭It’s important to understand the reasons behind missing data:‬

‭●‬ I‭dentifying the type of missing data: Is it Missing Completely at Random (MCAR),‬
‭Missing at Random (MAR), or Missing Not at Random (MNAR)?‬
‭●‬ ‭Evaluating the impact of missing data: Is the missingness causing bias or‬
‭affecting the analysis?‬
‭●‬ ‭Choosing appropriate handling strategies: Different techniques are suitable for‬
‭different types of missing data.‬

Functions and descriptions:

.isnull(): Identifies missing values in a Series or DataFrame.
.notnull(): Checks for non-missing values in a pandas Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.
.info(): Displays information about the DataFrame, including data types, memory usage, and the presence of missing values.
.isna(): The counterpart of notnull(): returns True for missing values and False for non-missing values (equivalent to isnull()).
dropna(): Drops rows or columns containing missing values based on custom criteria.
fillna(): Fills missing values with specific values, means, medians, or other calculated values.
replace(): Replaces specific values with other values, facilitating data correction and standardization.
drop_duplicates(): Removes duplicate rows based on specified columns.
unique(): Finds unique values in a Series.

‭Techniques:‬

Deletion: Remove rows or columns with missing data (simplest, but can lead to loss of valuable information).

Imputation: Replace missing values with a statistical estimate (mean, median, mode) or use predictive methods like K-Nearest Neighbors (KNN) imputation.

Advanced Techniques: Use algorithms like Iterative Imputer or MICE (Multivariate Imputation by Chained Equations) for more complex missing data patterns.
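A minimal pandas sketch of deletion and mean imputation on a toy DataFrame; the column names and values are invented for illustration.

# Minimal missing-data handling sketch with pandas (toy data; column names invented).
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 31], 'salary': [50000, 62000, np.nan, 58000]})

print(df.isnull().sum())                          # count missing values per column
dropped = df.dropna()                             # deletion: remove rows with any missing value
imputed = df.fillna(df.mean(numeric_only=True))   # imputation: fill each gap with the column mean
print(dropped)
print(imputed)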

‭b. Feature Scaling‬

Feature Scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to weigh greater values higher and treat smaller values as lower, regardless of the unit of the values.
‭Why feature Scaling:‬

● Scaling guarantees that all features are on a comparable scale and have comparable ranges. This process is known as feature normalization.

● Algorithm performance improvement: When the features are scaled, several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbours), and support vector machines, perform better or converge more quickly.

● Preventing numerical instability: Numerical instability can be prevented by avoiding significant scale disparities between features. Examples include distance calculations or matrix operations, where features with radically differing scales can result in numerical overflow or underflow problems.

● Scaling ensures that each feature is given the same consideration during the learning process. Without scaling, features on larger scales could dominate the learning, producing skewed outcomes.

‭Techniques:‬

Standardization: This method of scaling is based on the central tendency and variance of the data.

First, calculate the mean and standard deviation of the data you would like to standardize. Then subtract the mean from each entry and divide the result by the standard deviation: z = (x − µ) / σ.

This rescales the data so that it has a mean equal to zero and a standard deviation equal to 1 (it shifts and scales the distribution but does not change its shape).

Normalization: Scale features to a range, typically [0, 1] (e.g., using MinMaxScaler). For min-max scaling, we subtract the minimum value of the data from each entry and then divide the result by the difference between the maximum and the minimum value: x' = (x − min) / (max − min).
Robust Scaling: Scale features using statistics robust to outliers, such as the median and interquartile range (e.g., using RobustScaler).
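A minimal sketch of standardization and min-max normalization, assuming scikit-learn is installed; the toy feature matrix is invented for illustration.

# Minimal feature-scaling sketch (scikit-learn assumed installed; toy data invented).
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # two features on very different scales

print(StandardScaler().fit_transform(X))   # standardization: mean 0, standard deviation 1 per column
print(MinMaxScaler().fit_transform(X))     # normalization: each column rescaled to the range [0, 1]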

‭c. Feature Encoding‬

Better encoding leads to a better model, and most algorithms cannot handle categorical variables unless they are converted into numerical values.

Purpose: Convert categorical variables into numerical representations suitable for modeling algorithms.

Categorical features are generally divided into 3 types:
A. Binary: Either/or.
Examples: {Yes, No}, {True, False}
B. Ordinal: Specific ordered groups.
Examples: {low, medium, high}, {cold, hot, lava hot}
C. Nominal: Unordered groups.
Examples: {cat, dog, tiger}, {pizza, burger, coke}

‭Techniques:‬

Label Encoding: Label Encoding is a technique used to convert categorical columns into numerical ones so that they can be fitted by machine learning models which only take numerical data. It is an important pre-processing step in a machine-learning project.

‭ uppose we have a column Height in some dataset that has elements as Tall, Medium,‬
S
‭and short. To convert this categorical column into a numerical column we will apply label‬
‭encoding to this column. After applying label encoding, the Height column is converted‬
‭into a numerical column having elements 0,1, and 2 where 0 is the label for tall, 1 is the‬
‭label for medium, and 2 is the label for short height.‬
Height (original)    Height (label encoded)

Tall                 0

Medium               1

Short                2

One-Hot Encoding: Create binary columns for each category (suitable for nominal categorical variables).

In one-hot encoding, a separate column is prepared for each label, for example Male and Female. Wherever there is a Male, the value will be 1 in the Male column and 0 in the Female column, and vice versa.

Let's understand with an example: consider data where fruits, their corresponding categorical values, and prices are given.

‭Fruit‬ ‭Categorical value of fruit‬ ‭price‬

‭apple‬ ‭1‬ ‭5‬

‭mango‬ ‭2‬ ‭10‬

‭apple‬ ‭1‬ ‭15‬

‭orange‬ ‭3‬ ‭20‬


‭The output after applying one-hot encoding on the data is given as follows,‬

‭apple‬ ‭mango‬ ‭orange‬ ‭price‬

‭1‬ ‭0‬ ‭0‬ ‭5‬

‭0‬ ‭1‬ ‭0‬ ‭10‬

‭1‬ ‭0‬ ‭0‬ ‭15‬

‭0‬ ‭0‬ ‭1‬ ‭20‬

Target Encoding: Encode categories based on the target variable's mean or other statistics (useful for high-cardinality categorical variables).

In a binary classification problem, the simplest way to do this is by calculating the probability p(t = 1 | x = ci), in which t denotes the target, x is the input, and ci is the i-th category. In Bayesian statistics, this is the posterior probability of t = 1 given that the input was the category ci.
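A minimal sketch of label encoding and one-hot encoding with pandas and scikit-learn (both assumed installed); the toy columns mirror the Height and fruit examples above. Note that LabelEncoder assigns codes alphabetically, which can differ from a hand-chosen mapping.

# Minimal feature-encoding sketch (pandas / scikit-learn assumed installed).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Height': ['Tall', 'Medium', 'Short'], 'Fruit': ['apple', 'mango', 'orange']})

df['Height_label'] = LabelEncoder().fit_transform(df['Height'])  # label encoding (alphabetical: Medium=0, Short=1, Tall=2)
one_hot = pd.get_dummies(df['Fruit'], prefix='Fruit')            # one-hot encoding: one binary column per fruit
print(df)
print(one_hot)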

‭2. Model Selection and Hyperparameter Tuning‬

‭a. Model Selection‬

Process: Evaluate and compare different machine learning models to identify the best performer for the given task.
Techniques:

Train-Validation-Test Split: Divide data into training, validation, and test sets for model evaluation.
Parameters vs Hyperparameters: Parameters of a model are generated by the model itself during training or learning; examples are the weights of an ML model or neural network. Hyperparameters, in contrast, are fixed manually before the training phase; examples are the number of epochs, the batch size, the number of layers in a neural network, the activation function, etc. Hyperparameters are adjustable settings that can be tuned to obtain an optimal model.

I‭n machine learning, training, validation, and test data sets are used for different‬
‭purposes to evaluate the performance of algorithms that learn from data and make‬
‭predictions:‬

‭Training data‬

The largest subset of data, used to train the model by adjusting its parameters. This helps the model learn the underlying patterns in the data. The training set should not be too small, or the model won't have enough data to learn from.

‭Validation data‬

Used to evaluate the model during the training phase, to fine-tune its hyperparameters and select the best-performing model. The validation set helps improve model performance by providing feedback on predictions for held-out observations. If there are multiple models to select from, the validation set can help with model selection; otherwise, it might be redundant and can be omitted.

‭Test data‬

Used to evaluate the final model's performance on completely unseen data after the model has been trained and validated. The test set approximates the model's unbiased accuracy in the real world.

Metrics: Use appropriate metrics (accuracy, precision, recall, F1-score, etc.) for evaluation based on the problem type (classification, regression).
‭b. Hyperparameter Tuning‬

A Machine Learning model is defined as a mathematical model with several parameters that need to be learned from the data. By training a model with existing data, we can fit the model parameters.
However, there is another kind of parameter, known as hyperparameters, that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model, such as its complexity or how fast it should learn.

‭Purpose: Optimize model performance by adjusting hyperparameters.‬

‭Techniques:‬

Grid Search: Exhaustively search through a manually specified subset of hyperparameters. For example, if we want to set two hyperparameters, C and Alpha, of a Logistic Regression classifier with different sets of values, the grid search technique will construct many versions of the model with all possible combinations of hyperparameters and will return the best one.
For instance, with C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], grid search evaluates every (C, Alpha) pair; if the combination C = 0.3 and Alpha = 0.2 gives the highest performance score (say 0.726), that combination is selected (a minimal grid-search sketch follows below).
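A minimal grid-search sketch, assuming scikit-learn is installed. Note that scikit-learn's LogisticRegression exposes a C hyperparameter but no parameter literally named Alpha, so the grid below tunes C together with the solver instead; the dataset is a built-in toy dataset.

# Minimal grid-search sketch (scikit-learn assumed installed).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {'C': [0.1, 0.2, 0.3, 0.4, 0.5], 'solver': ['lbfgs', 'liblinear']}

search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)  # tries every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best combination and its cross-validated score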

Random Search: Randomly sample hyperparameter combinations from a defined space.

Bayesian Optimization: Sequential model-based optimization that uses results from past iterations to guide the search for optimal hyperparameters.

‭3. Cross-Validation Techniques‬

In machine learning, we cannot simply fit a model to the training data and claim that it will work accurately on real data. We must ensure that the model has learned the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique. This section looks at the process of cross-validation in machine learning.

Purpose: Evaluate model performance while maximizing data utilization and minimizing overfitting.

Note that cross-validation is performed on the training dataset.

‭Techniques:‬

K-Fold Cross-Validation: In K-fold cross-validation, we split the dataset into k subsets (known as folds), train on k-1 of the subsets, and leave one subset out for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.

As an example of the training and evaluation subsets generated in 5-fold cross-validation, suppose we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training), while in the second iteration we use the second 20 percent for evaluation and the remaining four subsets for training (instances [6-10] for testing and [1-5 and 11-25] for training), and so on.

Stratified Cross-Validation: Maintain the percentage of samples of each class in each fold to ensure representative training and validation sets.

‭●‬ T ‭ he dataset is divided into k folds while maintaining the proportion of classes in‬
‭each fold.‬
‭●‬ ‭During each iteration, one-fold is used for testing, and the remaining folds are‬
‭used for training.‬
‭●‬ ‭The process is repeated k times, with each fold serving as the test set exactly‬
‭once.‬

Leave-One-Out Cross-Validation (LOOCV): Use each sample as a validation set once; particularly useful for small datasets.
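A minimal cross-validation sketch with scikit-learn (assumed installed), using stratified 5-fold splitting on a built-in toy dataset.

# Minimal cross-validation sketch (scikit-learn assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # 5 folds, class proportions preserved

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores)           # one accuracy score per fold
print(scores.mean())    # average performance across the folds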

‭4. Ensemble Methods‬


‭a. Bagging‬

Definition: Bootstrap Aggregating (bagging) involves training multiple models independently and combining their predictions.

I‭mplementation Steps of Bagging‬


‭Step 1: Multiple subsets are created from the original data set with equal tuples,‬
‭selecting observations with replacement.‬
‭Step 2: A base model is created on each of these subsets.‬
Step 3: Each model is trained in parallel on its own training subset, independently of the others.
‭Step 4: The final predictions are determined by combining the predictions from all the‬
‭models.‬

Example: The Random Forest algorithm, which uses bagging to train decision trees on random subsets of the data and aggregates their predictions.
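A minimal bagging sketch using scikit-learn's RandomForestClassifier (scikit-learn assumed installed); the dataset and tree count are arbitrary choices for illustration.

# Minimal bagging sketch with a random forest (scikit-learn assumed installed).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each trained on a bootstrap sample of the training data; predictions are aggregated by voting
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))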

‭b. Boosting‬

Definition: Sequentially train models where each subsequent model corrects the errors made by the previous one.

1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points, then normalize the weights of all data points.
4. If the required results are achieved, go to step 5; else, go to step 2.
5. End

‭Example: Gradient Boosting Machines (GBM), XGBoost, AdaBoost.‬

‭c. Random Forests‬

Definition: An ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
‭5. Introduction to Deep Learning and Neural Networks‬

Deep learning is the branch of machine learning that is based on artificial neural network architectures. An artificial neural network (ANN) uses layers of interconnected nodes called neurons that work together to process and learn from the input data.

‭Components:‬

Neural Networks: Basic building blocks comprising layers of interconnected nodes (neurons).

‭Example: ANN(Artificial Neural Networks):‬

Artificial neural networks are built on the principles of the structure and operation of human neurons; they are also known as neural networks or neural nets. An artificial neural network's input layer, which is the first layer, receives input from external sources and passes it on to the hidden layer, the second layer. Each neuron in the hidden layer gets information from the neurons in the previous layer, computes the weighted total, and then transfers it to the neurons in the next layer. These connections are weighted, which means that the impact of each input from the preceding layer is scaled by giving it a distinct weight. These weights are then adjusted during the training process to enhance the performance of the model.

Activation Functions: Functions applied to node outputs to introduce non-linearity (e.g., ReLU, Sigmoid, Tanh).

Training: Techniques like backpropagation and gradient descent are used to optimize the network weights.
‭Unit‬‭4‬
‭Introduction to Data Science‬

‭1. What is Data Science and Why it is Important?‬

Definition:
Data is widely considered a crucial resource in organizations across every industry. Data Science can be described in simple terms as a field of work that deals with the management and processing of data using statistical methods, artificial intelligence, and other tools, in partnership with domain specialists. Data Science draws on concepts and methods from different fields, including mathematics, computer science, and information theory, to interpret large datasets.

‭Importance:‬

Business Insights: Helps organizations make informed decisions based on data-driven insights.
‭Scientific Discoveries: Facilitates discovery in various fields by analyzing large datasets.‬
‭Personalization: Enables personalized experiences in products and services.‬
‭Predictive Capabilities: Predicts future trends and behaviors.‬
‭2. Role of Data Scientist and Skills Required‬

Role:
A Data Scientist is responsible for analyzing, interpreting, and deriving actionable insights from complex data sets.

Skills Required:
‭Programming: Proficiency in languages like Python, R, or SQL.‬
‭Statistics and Mathematics: Understanding of statistical methods and mathematical‬
‭concepts.‬
‭Machine Learning: Knowledge of algorithms for predictive modeling and pattern‬
‭recognition.‬
‭Data Wrangling: Cleaning, transforming, and preparing data for analysis.‬
‭Data Visualization: Communicating insights through charts, graphs, and dashboards.‬
‭Domain Knowledge: Understanding of the industry or field in which data is being‬
‭analyzed.‬

‭3. Data Acquisition‬

Sources of Data:
‭Internal Sources: Data generated within an organization (e.g., databases, CRM‬
‭systems).‬
‭External Sources: Data obtained from third-party providers, APIs, social media, etc.‬
‭Public Datasets: Available from government agencies, research institutions, etc.‬

Data Formats:
‭Structured Data: Organized in a predefined format (e.g., databases, spreadsheets).‬
‭Unstructured Data: Not organized in a predefined manner (e.g., text documents,‬
‭images, videos).‬

‭4. Data Cleaning‬

Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the data science pipeline that involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in the data to improve its quality and usability. Data cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which can negatively impact the accuracy and reliability of the insights derived from it.
Process:
‭Handling Missing Values: Imputation techniques or removal.‬
‭Handling Outliers: Identifying and treating outliers appropriately.‬
‭Normalization and Standardization: Scaling numerical data.‬
‭Data Formatting: Ensuring data is in a consistent format.‬

‭5. Exploratory Data Analysis (EDA)‬

Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing and visualizing data to understand its key characteristics, uncover patterns, and identify relationships between variables. In other words, EDA is the method of studying and exploring record sets to understand their predominant traits, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.

‭Why Exploratory Data Analysis is Important?‬

Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data science and statistical modeling. Here are some of the key reasons why EDA is a critical step in the data analysis process:

● Understanding Data Structures: EDA helps in getting familiar with the dataset, understanding the number of features, the type of data in each feature, and the distribution of data points.

‭●‬ I‭dentifying Patterns and Relationships: Through visualizations and statistical‬


‭summaries, EDA can reveal hidden patterns and intrinsic relationships between‬
‭variables.‬

● Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual data points that may adversely affect the results of your analysis.

● Testing Assumptions: Many statistical models assume that data follow a certain distribution or that variables are independent. EDA involves checking these assumptions. If the assumptions do not hold, the conclusions drawn from the model could be invalid.
‭●‬ I‭nforming Feature Selection and Engineering: Insights gained from EDA can‬
‭inform which features are most relevant to include in a model and how to‬
‭transform them (scaling, encoding) to improve model performance.‬

● Optimizing Model Design: By understanding the data's characteristics, analysts can choose appropriate modeling techniques, decide on the complexity of the model, and better tune model parameters.

● Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the data, which are critical to address before further analysis to improve data quality and integrity.

‭Key aspects of EDA include:‬

Distribution of Data, Graphical Representations, Outlier Detection, Correlation Analysis, Handling Missing Values, Summary Statistics, Testing Assumptions.

Statistical Analysis

1. Descriptive Statistics
‭Definition:‬
‭Descriptive statistics are used to describe and summarize the features of a dataset.‬
‭They provide simple summaries about the sample and the measures.‬

Measures:
‭Mean: Average of all values in a dataset, sensitive to outliers.‬
‭Median: Middle value of a dataset when arranged in ascending order, less sensitive to‬
‭outliers.‬
‭Mode: Most frequent value in a dataset.‬
‭Range: Difference between the maximum and minimum values.‬
‭Variance: Measure of the spread of data points around the mean.‬
‭Standard Deviation: Square root of the variance, indicating the average deviation from‬
‭the mean.‬

2. Inferential Statistics
‭Definition:‬
‭Inferential statistics use data from a sample to make inferences or generalizations about‬
‭a larger population.‬
Techniques:
‭Hypothesis Testing: Evaluates the likelihood that a result is due to chance.‬
‭Null Hypothesis (H0): Statement of no effect or no difference.‬
‭Alternative Hypothesis (H1): Statement to be tested.‬
‭Significance Level (α): Threshold for rejecting the null hypothesis (typically 0.05).‬
‭Confidence Intervals: Range of values within which the true population parameter is‬
‭estimated to lie.‬
‭Correlation Analysis: Measures the strength and direction of the linear relationship‬
‭between two variables (Pearson correlation coefficient).‬
‭Regression Analysis: Predicts the value of one variable based on the value of another‬
‭(linear regression, logistic regression, etc.).‬
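A minimal sketch of a hypothesis test and a correlation measure, assuming SciPy is installed; the two sample lists are invented for illustration.

# Minimal inferential-statistics sketch (SciPy assumed installed; samples invented).
from scipy import stats

group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.8, 13.1, 12.9, 13.4, 12.7, 13.0]

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # two-sample t-test of the null hypothesis "equal means"
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")          # reject H0 if p < 0.05 (the significance level α)

r, p_corr = stats.pearsonr(group_a, group_b)           # Pearson correlation between two paired variables
print(f"Pearson r = {r:.2f} (p = {p_corr:.4f})")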

Data Visualization Techniques

Data visualization is crucial for exploring and communicating patterns, trends, and insights from data. Here are some key techniques and their applications:

1. Scatter Plots
‭Definition:‬
‭A scatter plot is a graph that displays values for two variables as points on a Cartesian‬
‭plane. Each point represents the value of one variable corresponding to the value of the‬
‭other.‬

Applications:
Relationship Exploration: Visualize relationships and correlations between variables.
Outlier Detection: Identify outliers and anomalies in data.
Trend Identification: Spot trends such as clusters or patterns in data points.
‭Example:‬
‭In a dataset of student scores vs. study hours, a scatter plot can show whether there's a‬
‭correlation between hours studied and exam scores.‬

2. Line Charts
‭Definition:‬
‭A line chart displays data points connected by straight line segments. It is particularly‬
‭useful for showing trends over time or ordered categories.‬
‭Applications:‬

Time Series Analysis: Track changes in data over time (e.g., stock prices, temperature
‭trends).‬
‭Comparison: Compare trends in multiple datasets (e.g., sales performance across‬
‭different regions).‬
‭Example:‬
‭Showing the growth of a company's revenue over the past five years using a line chart.‬

3. Histograms
Definition:
A histogram is a graphical representation of the distribution of numerical data. It consists of bars that show the frequency of data points within defined intervals (bins).

Applications:
‭Distribution Analysis: Understand the shape, center, and spread of data.‬
‭Identifying Skewness: Determine whether data is symmetric or skewed.‬
‭Data Preprocessing: Assess data quality and potential outliers.‬
‭Example:‬
‭Visualizing the distribution of ages in a population to understand the demographic‬
‭profile.‬

4. Bar Charts
Definition:

A bar chart uses rectangular bars to represent categorical data. The length or height of each bar corresponds to the frequency, count, or percentage of the categories.
‭Applications:‬

Comparison: Compare quantities or values across different categories.
Ranking: Rank categories based on their values.
Part-to-Whole Relationships: Show how each category contributes to the total.
Example:

Comparing sales performance of different product categories in a retail store over a month using a bar chart.
‭5. Pie Charts‬
‭Definition:‬

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions. The arc length of each slice is proportional to the quantity it represents.
Applications:

Proportional Representation: Show the contribution of each category to a whole.
Percentage Breakdown: Display parts of a whole as percentages.
Example:

Showing the distribution of expenses (e.g., rent, utilities, groceries) in a household budget using a pie chart.

Tools for Data Visualization
Python Libraries: Matplotlib, Seaborn, Plotly, Bokeh.
R Packages: ggplot2, lattice, plotly.
Business Intelligence Tools: Tableau, Power BI, QlikView.

‭6. Introduction to Libraries/Tools‬

1. NumPy
‭Definition:‬
‭NumPy (Numerical Python) is a library for the Python programming language that‬
‭supports large, multi-dimensional arrays and matrices, along with a collection of‬
‭mathematical functions to operate on these arrays.‬

‭Key Features:‬
‭●‬ ‭N-Dimensional Arrays: Core data structure is ndarray.‬
‭●‬ ‭Mathematical Functions: Functions for linear algebra, statistics, and‬
‭mathematical operations.‬
‭●‬ ‭Broadcasting: Support for arithmetic operations on arrays of different shapes.‬

I‭nstallation:‬
‭pip install numpy‬

‭Basic Usage:‬

1. Importing NumPy:
import numpy as np
2. Creating Arrays:
# Creating a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)
# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
3. Array Operations:
# Array addition
arr_sum = arr1 + 10
print(arr_sum)
# Matrix multiplication
arr_mult = np.dot(arr2, arr2.T)
print(arr_mult)
4. Array Statistics:
# Mean, Median, Standard Deviation
mean_val = np.mean(arr1)
median_val = np.median(arr1)
std_dev = np.std(arr1)

print(f"Mean: {mean_val}, Median: {median_val}, Std Dev: {std_dev}")

2. Pandas
‭Definition:‬
‭Pandas is a data manipulation and analysis library for Python. It provides data‬
‭structures and functions needed to work on structured data seamlessly.‬

‭Key Features:‬
‭●‬ ‭DataFrames: 2D labeled data structure with columns of potentially different types.‬
‭●‬ ‭Series: 1D labeled array capable of holding any data type.‬

I‭nstallation:‬
‭pip install pandas‬

‭Basic Usage:‬

1. Importing Pandas:
import pandas as pd
2. Creating DataFrames:
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
print(df)
3. DataFrame Operations:
# Adding a new column
df['Gender'] = ['F', 'M', 'M']
print(df)

# Descriptive Statistics
print(df.describe())
# Data Selection
print(df['Name'])   # Selecting a column
print(df.iloc[0])   # Selecting a row by index
4. Data Cleaning:
# Handling Missing Values
df_missing = df.copy()
df_missing.loc[1, 'Age'] = None  # Introduce a missing value
df_cleaned = df_missing.fillna(df_missing['Age'].mean())  # Fill missing values with the mean
print(df_cleaned)

3. Matplotlib
‭Definition:‬
‭Matplotlib is a plotting library for Python and is widely used for creating static, animated,‬
‭and interactive visualizations.‬

Key Features:
‭Flexibility: Wide range of plot types.‬
‭Customization: Extensive customization options for plots.‬

I‭nstallation:‬
‭pip install matplotlib‬

Basic Usage:
‭1. Importing Matplotlib:‬
‭import matplotlib.pyplot as plt‬
‭2. Creating Plots:‬
‭Line Plot:‬
‭# Line Plot‬
‭x = [1, 2, 3, 4, 5]‬
y‭ = [1, 4, 9, 16, 25]‬
‭plt.plot(x, y)‬
‭plt.title('Line Plot')‬
‭plt.xlabel('x-axis')‬
‭plt.ylabel('y-axis')‬
‭plt.show()‬
‭Bar Plot:‬
‭# Bar Plot‬
‭categories = ['A', 'B', 'C']‬
‭values = [10, 15, 7]‬
‭plt.bar(categories, values)‬
‭plt.title('Bar Plot')‬
‭plt.xlabel('Categories')‬
‭plt.ylabel('Values')‬
‭plt.show()‬

Histogram:
# Histogram
import numpy as np  # needed here for random data generation
data = np.random.randn(1000)  # Generate 1000 random data points
‭plt.hist(data, bins=30)‬
‭plt.title('Histogram')‬
‭plt.xlabel('Value')‬
‭plt.ylabel('Frequency')‬
‭plt.show()‬

4. Seaborn
‭Definition:‬
‭Seaborn is a Python visualization library based on Matplotlib that provides a high-level‬
‭interface for drawing attractive and informative statistical graphics.‬

Key Features:
‭High-Level Interface: Easier syntax for complex plots.‬
‭Integrated Data Analysis: Built-in functions for statistical plotting.‬
‭Installation:‬
‭pip install seaborn‬

Basic Usage:
1. Importing Seaborn:
import seaborn as sns
‭2. Creating Plots:‬

Scatter Plot:
‭# Scatter Plot‬
‭tips = sns.load_dataset('tips')‬
‭sns.scatterplot(x='total_bill', y='tip', data=tips)‬
‭plt.title('Scatter Plot of Total Bill vs Tip')‬
‭plt.show()‬
‭Box Plot:‬
‭# Box Plot‬
‭sns.boxplot(x='day', y='total_bill', data=tips)‬
‭plt.title('Box Plot of Total Bill by Day')‬
‭plt.show()‬
‭Heatmap:‬
# Heatmap of correlations between the numeric columns
corr = tips.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
‭plt.title('Heatmap of Correlations')‬
‭plt.show()‬
‭Unit 5‬
‭Advanced Topics and Applications‬

1. Support Vector Machines (SVM)
Definition:
Support Vector Machines (SVM) are supervised learning algorithms used for classification and regression tasks. SVM aims to find the hyperplane that best separates different classes in the feature space.

Key Concepts:
‭Hyperplane: A decision boundary that separates different classes.‬
‭Support Vectors: Data points that are closest to the hyperplane and influence its‬
‭position.‬
‭Margin: The distance between the hyperplane and the support vectors.‬
‭Types of SVM:‬

‭Linear SVM: Finds a linear hyperplane to separate classes.‬


Non-Linear SVM: Uses kernels to transform the feature space into higher dimensions in order to find a non-linear decision boundary.

How does SVM work?

One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, or hard margin.
Now consider a scenario where one blue point lies inside the region of the red points. How does SVM classify the data? The blue point among the red ones is an outlier of the blue class. The SVM algorithm can ignore such an outlier and still find the hyperplane that maximizes the margin; in this sense, SVM is robust to outliers.
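A minimal linear-SVM classification sketch with scikit-learn (assumed installed); the dataset and the value of C are arbitrary choices for illustration.

# Minimal SVM classification sketch (scikit-learn assumed installed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = SVC(kernel='linear', C=1.0)       # linear SVM; an RBF kernel would give a non-linear boundary
clf.fit(X_train, y_train)
print("support vectors per class:", clf.n_support_)
print("test accuracy:", clf.score(X_test, y_test))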
2. Neural Networks
‭Definition:‬
‭Neural Networks are computational models inspired by the human brain's network of‬
‭neurons. They consist of layers of interconnected nodes (neurons) that transform input‬
‭data into output predictions.‬

Key Concepts:
‭Neurons: Basic units that receive inputs, apply weights, and pass the result through an‬
‭activation function.‬
‭Activation Functions: Functions that introduce non-linearity (e.g., Sigmoid, ReLU, Tanh).‬

Common Architectures:
‭Feedforward Neural Networks (FNN): Data moves in one direction, from input to output.‬
‭Multi-Layer Perceptrons (MLP): FNNs with one or more hidden layers.‬

‭Let’s understand with an example of how a neural network works:‬

‭ onsider a neural network for email classification. The input layer takes features like‬
C
‭email content, sender information, and subject. These inputs, multiplied by adjusted‬
‭weights, pass through hidden layers. The network, through training, learns to recognize‬
‭patterns indicating whether an email is spam or not. The output layer, with a binary‬
‭activation function, predicts whether the email is spam (1) or not (0). As the network‬
‭iteratively refines its weights through backpropagation, it becomes adept at‬
‭distinguishing between spam and legitimate emails, showcasing the practicality of‬
‭neural networks in real-world applications like email filtering.‬
Neural networks are complex systems that mimic some features of the functioning of the human brain. A network is composed of an input layer, one or more hidden layers, and an output layer, each made up of coupled artificial neurons. The two stages of the basic process are called forward propagation and backpropagation.

‭Forward Propagation‬
‭●‬ ‭Input Layer: Each feature in the input layer is represented by a node on the‬
‭network, which receives input data.‬
‭●‬ ‭Weights and Connections: The weight of each neuronal connection indicates‬
‭how strong the connection is. Throughout training, these weights are changed.‬
‭●‬ ‭Hidden Layers: Each hidden layer neuron processes inputs by multiplying them‬
‭by weights, adding them up, and then passing them through an activation‬
‭function. By doing this, non-linearity is introduced, enabling the network to‬
‭recognize intricate patterns.‬
‭●‬ ‭Output: The final result is produced by repeating the process until the output‬
‭layer is reached.‬

‭Backpropagation:‬
‭●‬ ‭Loss Calculation: The network’s output is evaluated against the real goal values,‬
‭and a loss function is used to compute the difference. For a regression problem,‬
‭the Mean Squared Error (MSE) is commonly used as the cost function.‬
‭●‬ G ‭ radient Descent: Gradient descent is then used by the network to reduce the‬
‭loss. To lower the inaccuracy, weights are changed based on the derivative of the‬
‭loss with respect to each weight.‬
‭●‬ ‭Adjusting weights: The weights are adjusted at each connection by applying this‬
‭iterative process, or backpropagation, backward across the network.‬
‭●‬ ‭Training: During training with different data samples, the entire process of‬
‭forward propagation, loss calculation, and backpropagation is done iteratively,‬
‭enabling the network to adapt and learn patterns from the data.‬
● Activation Functions: Non-linearity is introduced into the model by activation functions like the rectified linear unit (ReLU) or sigmoid. Whether a neuron "fires" is determined from its whole weighted input.
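A minimal numerical sketch of forward propagation and a backpropagation-style gradient-descent update for a single sigmoid neuron, using NumPy only; the input, target, weights, and learning rate are invented toy values.

# Minimal forward-propagation / backpropagation sketch for one sigmoid neuron (NumPy only; toy data).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.2, 0.1])    # one input example with three features
t = 1.0                          # target output
w = np.array([0.1, -0.3, 0.2])   # initial weights
b = 0.0                          # initial bias
lr = 0.5                         # learning rate

for step in range(3):
    y = sigmoid(np.dot(w, x) + b)        # forward propagation: weighted sum + activation
    loss = 0.5 * (y - t) ** 2            # squared-error loss
    grad = (y - t) * y * (1 - y)         # backpropagation: derivative of the loss w.r.t. the pre-activation
    w -= lr * grad * x                   # gradient descent on the weights
    b -= lr * grad                       # ... and on the bias
    print(f"step {step}: output={y:.3f}, loss={loss:.4f}")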

‭Types of Neural Networks:‬

Several types of neural networks are commonly used, including the following.

Feedforward Networks: A feedforward neural network is a simple artificial neural network architecture in which data moves from input to output in a single direction. It has input, hidden, and output layers; feedback loops are absent. Its straightforward architecture makes it appropriate for a number of applications, such as regression and pattern recognition.

Multilayer Perceptron (MLP): An MLP is a type of feedforward neural network with three or more layers, including an input layer, one or more hidden layers, and an output layer. It uses nonlinear activation functions.

Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a specialized artificial neural network designed for image processing. It employs convolutional layers to automatically learn hierarchical features from input images, enabling effective image recognition and classification. CNNs have revolutionized computer vision and are pivotal in tasks like object detection and image analysis.

Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is an artificial neural network type intended for sequential data processing. It is appropriate for applications where contextual dependencies are critical, such as time series prediction and natural language processing, since it makes use of feedback loops that allow information to persist within the network.

Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome the vanishing gradient problem in training RNNs. It uses memory cells and gates to selectively read, write, and erase information.

‭3. Convolutional Neural Networks (CNNs)‬

Definition:
Convolutional Neural Networks (CNNs) are specialized neural networks designed for processing structured grid data, like images.

Key Concepts:
Convolutions: Operations that apply a filter to an image to create feature maps.
Pooling Layers: Reduce the spatial dimensions of feature maps (e.g., Max Pooling, Average Pooling).
Fully Connected Layers: Layers where each neuron is connected to every neuron in the previous layer.

1. Convolutional Layer
Function:
The convolutional layer is the core building block of a CNN. It applies a set of filters (kernels) to the input image to produce feature maps. Each filter detects specific features like edges, textures, or patterns.

‭How It Works:‬

Filters: Small matrices (e.g., 3x3 or 5x5) that slide over the input image. Each filter detects different features.
Convolution Operation: At each position, the filter's values are multiplied element-wise with the overlapping pixel values and summed to produce a single output value. This operation is repeated across the entire image.

Mathematical Operation:

Feature Map = Image ∗ Filter
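A small NumPy sketch of this operation is shown below; the 5x5 input values and 3x3 vertical-edge filter are made up, and the loops implement a "valid" convolution (no padding, stride 1).

```python
import numpy as np

# Made-up 5x5 grayscale image and a 3x3 vertical-edge filter
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

kh, kw = kernel.shape
out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))

# Slide the filter over the image: multiply overlapping values and sum
for i in range(out_h):
    for j in range(out_w):
        region = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(region * kernel)

print(feature_map)   # 3x3 feature map
```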

2. Activation Function (ReLU)
Function:

The ReLU (Rectified Linear Unit) activation function introduces non-linearity into the model. It replaces all negative values with zero.

Mathematical Operation:

ReLU(x) = max(0, x)

3. Pooling Layer
Function:

Pooling layers reduce the spatial dimensions of feature maps, decreasing the number of parameters and the computation required, and helping to avoid overfitting.

Types of Pooling:

Max Pooling: Takes the maximum value from a feature map segment.
Average Pooling: Computes the average value from a feature map segment.

Mathematical Operation (Max Pooling):
Output = Max(Region)
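The following NumPy sketch applies 2x2 max pooling with stride 2 to a made-up 4x4 feature map.

```python
import numpy as np

# Illustrative 4x4 feature map
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 8, 1],
                        [3, 4, 9, 0]], dtype=float)

pool, stride = 2, 2
out_h = feature_map.shape[0] // stride
out_w = feature_map.shape[1] // stride
pooled = np.zeros((out_h, out_w))

# Take the maximum of each non-overlapping 2x2 region
for i in range(out_h):
    for j in range(out_w):
        region = feature_map[i*stride:i*stride+pool, j*stride:j*stride+pool]
        pooled[i, j] = region.max()

print(pooled)   # [[6. 4.] [7. 9.]]
```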

4. Flattening Layer
Function:

The flattening layer converts the 2D matrix into a 1D vector. This step is necessary to feed the output into fully connected layers.

Mathematical Operation:

The 2D matrix is transformed into a single long vector.

5. Fully Connected Layer
Function:

The fully connected layer (dense layer) performs the final classification or regression task. Every neuron in this layer is connected to every neuron in the previous layer.

Mathematical Operation:

y = Wx + b

where:
x is the input vector,
W is the weight matrix,
b is the bias term,
y is the output vector.
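To tie these five layers together, here is a hedged Keras sketch of a small CNN; the input shape, filter counts, layer sizes, and ten-class output are illustrative assumptions, and TensorFlow/Keras is assumed to be installed.

```python
from tensorflow.keras import layers, models

# Illustrative CNN: convolution -> ReLU -> pooling -> flatten -> dense layers
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # 28x28 grayscale images (assumed)
    layers.Conv2D(16, (3, 3), activation="relu"),  # convolutional layer + ReLU
    layers.MaxPooling2D((2, 2)),                   # max pooling
    layers.Flatten(),                              # flattening layer
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # output layer (10 classes assumed)
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```
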
4. Recurrent Neural Networks (RNNs)
Definition:
Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to handle sequential data. Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles, allowing information to persist across time steps.

Applications:
RNNs are used in various applications such as:

Time Series Prediction: Forecasting stock prices, weather, etc.
Natural Language Processing: Language modeling, sentiment analysis.
Speech Recognition: Transcribing spoken language to text.
Machine Translation: Translating text from one language to another.

2. How RNNs Work

RNNs process sequences of data by maintaining a state that carries information about previous time steps. Here is a step-by-step explanation of the process:

a. Recurrent Structure
RNNs have a structure that includes a feedback loop, allowing the network to use information from previous time steps.
Description:

Input Vector x_t: The data at time step t.
Hidden State h_t: The output of the hidden layer, which carries information from previous time steps.
Recurrent Connection: The hidden state h_t is used as input for the next time step.

Mathematical Representation: For a given time step t, the RNN performs the following operations:

Update Hidden State:
h_t = tanh(W_h · h_{t-1} + W_x · x_t + b_h)
W_h: recurrent weight matrix
W_x: input weight matrix
b_h: bias term

Generate Output:
y_t = W_y · h_t + b_y
W_y: output weight matrix
b_y: bias term
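The short NumPy sketch below applies these two update equations to a made-up four-step sequence; the dimensions and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 5, 2      # illustrative dimensions

W_x = rng.normal(size=(hidden_size, input_size))    # input weights
W_h = rng.normal(size=(hidden_size, hidden_size))   # recurrent weights
W_y = rng.normal(size=(output_size, hidden_size))   # output weights
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

sequence = rng.normal(size=(4, input_size))         # made-up sequence of 4 time steps
h = np.zeros(hidden_size)                           # initial hidden state

for t, x_t in enumerate(sequence):
    # Update hidden state: h_t = tanh(W_h·h_{t-1} + W_x·x_t + b_h)
    h = np.tanh(W_h @ h + W_x @ x_t + b_h)
    # Generate output: y_t = W_y·h_t + b_y
    y_t = W_y @ h + b_y
    print(f"t={t}, output={y_t}")
```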

b. Backpropagation Through Time (BPTT)

To train RNNs, we use Backpropagation Through Time (BPTT), an extension of the backpropagation algorithm to sequence data.
Explanation:

Unroll the RNN: Expand the RNN into a chain of layers, one per time step.
Compute Gradients: Calculate the gradients for each layer over the entire sequence.
Update Weights: Adjust the weights using the computed gradients.

‭5‬‭.‬‭Natural Language Processing‬

What is NLP?
Definition:
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. It combines computational linguistics, machine learning, and computer science to facilitate interactions between humans and machines through natural language.

Purpose:
NLP aims to bridge the gap between human communication and computer understanding, making it possible for machines to process and analyze large amounts of natural language data.


Key Techniques in NLP
Text Classification
What: Categorizes text into predefined categories.
Examples: Spam detection, sentiment analysis.
Technique: TF-IDF, Naive Bayes, SVM (a worked example follows below).
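As a hedged illustration of this pipeline, the snippet below combines TF-IDF features with a Naive Bayes classifier using scikit-learn; the four toy messages and their spam labels are made up for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus and labels (1 = spam, 0 = not spam)
texts = ["win a free prize now", "meeting at 10 am tomorrow",
         "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize waiting for you"]))   # likely [1]
```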

Sentiment Analysis
What: Determines the sentiment expressed in text.
Examples: Analyzing product reviews, social media sentiment.
Technique: Rule-based systems, machine learning, deep learning.

Machine Translation
What: Translates text from one language to another.
Examples: Google Translate, language learning apps.
Technique: Statistical Machine Translation, Neural Machine Translation.

Named Entity Recognition (NER)
What: Identifies and classifies entities in text.
Examples: Extracting names, dates, and locations from documents.
Technique: Rule-based methods, machine learning, deep learning.

Question Answering
What: Provides answers to questions posed in natural language.
Examples: Virtual assistants, customer support.
Technique: Retrieval-based systems, generative models.

Speech Recognition
What: Converts spoken language into text.
Examples: Voice assistants, transcription services.
Technique: Acoustic models, language models, deep learning.

Chatbots and Virtual Assistants
What: Simulate human conversation for user interactions.
Examples: Customer service bots, personal assistants.
Technique: Rule-based systems, AI-driven conversation models.

Text Summarization
What: Creates a concise summary of a longer text.
Examples: Summarizing news articles, executive summaries.
Technique: Extractive summarization, abstractive summarization.

‭Information Retrieval‬
What: Searches for relevant information from large datasets.
Examples: Search engines, document retrieval.
Technique: Vector space models, ranking algorithms.

Text Generation
What: Generates coherent and contextually relevant text.
Examples: Content creation, creative writing.
Technique: Language models, generative models.

Real-World Applications
Customer Support
Example: Chatbots answering customer queries.
Benefit: 24/7 support and cost efficiency.
Language Translation
Example: Google Translate for multilingual communication.
Benefit: Breaking language barriers globally.
Sentiment Analysis
Example: Analyzing Twitter posts for brand sentiment.
Benefit: Understanding customer opinions and market trends.
Healthcare
Example: Extracting information from medical records.
Benefit: Improving patient care and management.
Education
Example: Language learning apps and automated tutoring.
Benefit: Enhancing learning experiences.

Future Trends
Advanced Models: Development of more sophisticated models like GPT-4.
Multimodal Approaches: Combining text with other data types (images, videos).
Ethical NLP: Addressing biases and ensuring fairness in AI applications.

Conclusion
NLP is a dynamic and rapidly evolving field that leverages techniques from AI and machine learning to process and analyze human language. Its applications are diverse and impactful, from enhancing customer service to enabling real-time translation and generating human-like text. As technology advances, NLP will continue to transform how we interact with machines and process language data.
‭Introduction to Big Data Technologies: Hadoop, Spark, and More‬

1. Overview of Big Data

Definition:
Big Data refers to large volumes of data that are too complex to be processed using traditional data management tools. It encompasses vast datasets that can be analyzed for insights and decision-making, characterized by the 3Vs:

Volume: The sheer amount of data generated.
Velocity: The speed at which data is generated and processed.
Variety: The different types and sources of data.

Purpose:
The aim of Big Data technologies is to efficiently store, manage, and analyze massive datasets to extract valuable insights, drive decisions, and create innovative solutions.

2. Key Big Data Technologies

a. Apache Hadoop
What is Hadoop?
Apache Hadoop is an open-source framework for storing and processing large datasets across a distributed cluster of computers. It is designed to scale from a single server to thousands of machines.

‭Components of Hadoop:‬

‭Hadoop Distributed File System (HDFS):‬

Function: A distributed file system designed to run on commodity hardware.
Feature: Stores data across multiple machines and ensures high availability and fault tolerance.
Architecture: Data is split into blocks and replicated across different nodes in the cluster.

MapReduce:
Function: A programming model for processing large data sets with a distributed algorithm.
Components:
Mapper: Processes input data and generates key-value pairs.
Reducer: Aggregates and processes the results of the Mapper.
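To make the Mapper/Reducer split concrete, the sketch below simulates a word-count job in plain Python; the two input sentences are made up, and in a real Hadoop job the two phases would run as distributed tasks (for example via Hadoop Streaming) rather than in one process.

```python
from itertools import groupby

def mapper(lines):
    # Mapper: emit (word, 1) key-value pairs for every word in the input
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reducer: sum the counts for each word (pairs must be grouped by key)
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Illustrative input; on a cluster each phase runs on separate nodes
    documents = ["big data needs big tools", "hadoop processes big data"]
    for word, total in reducer(mapper(documents)):
        print(word, total)
```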

‭YARN (Yet Another Resource Negotiator):‬

Function: Manages resources and job scheduling across the Hadoop cluster.
Components:
ResourceManager: Manages resources across the cluster.
NodeManager: Manages resources and tasks on individual nodes.

‭Use Cases:‬

Data Storage: Store and manage large datasets from various sources.
Data Processing: Analyze large-scale data for patterns and insights.
Data Integration: Combine data from different sources for a unified analysis.

b. Apache Spark
What is Spark?
Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides fast, in-memory data processing capabilities and supports various workloads like batch processing, streaming, and machine learning.

‭Components of Spark:‬

‭Spark Core:‬

Function: The foundation of Spark, providing essential functionalities like task scheduling and fault tolerance.
Features:
Resilient Distributed Datasets (RDDs): Immutable collections of objects that can be processed in parallel.
DataFrames: A higher-level abstraction for working with structured data.
‭Spark SQL:‬

Function: Provides a programming interface for working with structured and semi-structured data.
Features: Supports querying data through SQL as well as the DataFrame and Dataset APIs.

‭Spark Streaming:‬

Function: Enables processing of real-time data streams.
Features: Supports processing data from sources like Kafka and Flume.

Use Cases:
Data Processing: High-performance processing of large datasets.
Real-Time Analytics: Analyzing data as it is generated.
Machine Learning: Building and deploying machine learning models.
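As a hedged sketch of the DataFrame and Spark SQL APIs described above, the snippet below counts words in a small in-memory dataset; it assumes PySpark is installed and runs Spark in local mode rather than on a cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local-mode session for demonstration; a real cluster would use a master URL
spark = SparkSession.builder.master("local[*]").appName("wordcount-demo").getOrCreate()

# Made-up in-memory data; real jobs would read from HDFS, S3, Kafka, etc.
df = spark.createDataFrame([("big data needs big tools",),
                            ("spark processes big data",)], ["sentence"])

# Split sentences into words, then group and count
counts = (df.select(F.explode(F.split(F.col("sentence"), " ")).alias("word"))
            .groupBy("word").count()
            .orderBy(F.desc("count")))
counts.show()

# The same result queried via Spark SQL
counts.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, `count` FROM word_counts WHERE `count` > 1").show()

spark.stop()
```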

3. Case Studies and Real-World Applications

a. Case Study: Netflix
Overview:
Netflix uses Big Data technologies to recommend movies and TV shows to users. It leverages Apache Spark for data processing and Hadoop for data storage.

‭Approach:‬

Data Collection: Collects user activity data and viewing preferences.
Data Analysis: Analyzes data to provide personalized recommendations.
Outcome: Improved user engagement and satisfaction.

‭Benefits:‬

‭Personalized Recommendations: Suggests content based on user preferences.‬


Enhanced User Experience: Tailors content to individual tastes.

b. Case Study: LinkedIn
Overview:
LinkedIn uses Hadoop for managing user data and Spark for real-time analytics to provide job recommendations and improve the user experience.

‭Approach:‬

Data Collection: Collects data on job applications, user profiles, and interactions.
Data Analysis: Analyzes data to improve job matching algorithms.
Outcome: More relevant job recommendations and improved user engagement.

Benefits:

Improved Job Matching: Provides better job recommendations.
Real-Time Insights: Analyzes user interactions and trends.

c. Case Study: Amazon
Overview:
Amazon uses Hadoop and Spark to handle its vast amounts of transactional data and to optimize its supply chain.

‭Approach:‬

Data Collection: Collects data on purchases, reviews, and inventory.
Data Analysis: Analyzes data to forecast demand and optimize inventory.
Outcome: More efficient supply chain management and personalized shopping experiences.

‭Benefits:‬

Optimized Inventory: Better demand forecasting and inventory management.
Personalized Shopping Experience: Recommendations based on user behavior.

4. Future Trends and Career Prospects
a. Future Trends
Increased Adoption of Cloud Solutions:
Trend: More organizations are moving to cloud-based Big Data solutions like AWS, Azure, and Google Cloud.
Example: Amazon EMR for Hadoop and Spark.
‭Enhanced Real-Time Analytics:‬

Trend: Growing use of real-time data processing technologies.
Example: Apache Flink for advanced stream processing.

Integration with Machine Learning and AI:
Trend: Combining Big Data technologies with machine learning and AI for advanced analytics.
Example: Databricks platform for integrated analytics.

b. Career Prospects
Job Roles:

Big Data Engineer: Designs and builds Big Data systems.
Data Scientist: Analyzes data to extract insights.
Data Analyst: Interprets data and generates reports.

Skills Needed:

Programming Languages: Python, Java, Scala.
Tools and Frameworks: Hadoop, Spark, Kafka.
Mathematics and Statistics: Understanding of statistical models and data analysis techniques.

Conclusion
Big Data technologies like Hadoop and Spark are crucial for managing and analyzing massive datasets. They offer tools for scalable data storage, efficient processing, and advanced analytics. Real-world applications span various domains, from entertainment to e-commerce, demonstrating the impact of Big Data on modern businesses.
