Data Science: Industrial Training Report
Data Science
Submitted in partial fulfilment for the award
of the Degree Of
Bachelor of Technology In
ARTIFICIAL INTELLIGENCE AND
DATA SCIENCE
Certificate
Head Of Department
Training Certificate
Candidate’s Declaration
I hereby declare that the work, which is being presented in the
Industrial Training report, entitled “Data Science” in partial
fulfilment for the award of the Degree of “Bachelor of Technology”,
submitted to the Department of Artificial Intelligence and Data
Science, Arya College of Engineering, is a record of my own
investigations carried out under the guidance of Mr. Ankur Dutt
Sharma, Head of the Department of Artificial Intelligence and Data
Science.
(Signature of Candidate)
Candidate Name
Manoj Kumari
Abstract
Python, known for its simplicity, flexibility, and large ecosystem of libraries and modules, is
an excellent choice for creating AI and machine learning applications. In this report, we explore
the basics of AI as it relates to Python, discussing its core concepts, libraries for AI and ML,
and code examples showcasing basic principles.
Artificial Intelligence, Machine Learning, and Deep Learning are buzzwords that have
held the interest of many researchers for several years. Enabling computers to think,
decide, and act like humans has been one of the most significant and noteworthy developments
in the field of computer science. Various algorithms have been designed over time to make
machines imitate the human brain, and many programming languages have been used to
implement those algorithms. Python is one such programming language that provides a rich
library of modules and packages for use in scientific computing and machine learning. This
report aims at exploring the basic concepts related to machine learning and attempts to
implement a few of its applications using Python. It primarily uses the Scikit-Learn
library of Python for implementing the applications developed for the purpose of this research.
Data science is a multidisciplinary field that uses scientific methods, algorithms, and systems
to extract knowledge and insights from structured and unstructured data. It combines
statistics, computer science, and domain expertise to analyze data, build predictive models,
and solve complex problems. Essentially, it's all about turning raw data into actionable
intelligence.
Keywords: Machine Learning, Python, Scikit-Learn, AI, ML, Deep Learning, NumPy,
Matplotlib, workflow of machine learning, NLTK, statistics, multidisciplinary, data science,
predictive models.
ACKNOWLEDGEMENT
Manoj Kumari
21EAIAD020
Learning/Internship Objectives
TABLE OF CONTENTS
TITLE PAGE NO.
CERTIFICATE i
CANDIDATE’S DECLARATION ii
ABSTRACT iii
ACKNOWLEDGEMENT iv
LEARNING OBJECTIVES v
CHAPTER 1: INTRODUCTION TO DATA SCIENCE 1-2
CHAPTER 2: OVERVIEW OF AI&ML 3-4
CHAPTER 3: PYTHON OVERVIEW 5-6
CHAPTER 4: IMAGE PROCESSING 7-8
CHAPTER 5: STATISTICS 9-10
CHAPTER 6: APPLICATIONS 11-12
CHAPTER 7: LIBRARIES IN PYTHON 13-20
CHAPTER 8: MACHINE LEARNING ALGORITHMS 21-28
CHAPTER 9: NATURAL LANGUAGE PROCESSING 29-30
CHAPTER 10: DEEP LEARNING ALGORITHMS 31-35
CHAPTER 11: CONCLUSION 36
Chapter 1
INTRODUCTION TO
DATA SCIENCE
By following this structured approach, data scientists can ensure that their work is aligned with the
business objectives, leverages the most relevant data sources, and delivers actionable insights that
can be effectively communicated to stakeholders.
Data science is a game-changer across many fields. Here are some of its most impactful applications:
1) Healthcare:
Improving diagnostics, personalized treatments, and predicting disease outbreaks. Think AI-driven
medical imaging and patient data analysis.
2) Finance:
Fraud detection, risk management, and algorithmic trading. Banks use data science to spot fraudulent
transactions and optimize investment portfolios.
3) Marketing:
Personalized marketing campaigns, customer segmentation, and sentiment analysis. It's why you
get those eerily spot-on recommendations!
4) Retail:
Inventory management, demand forecasting, and customer behavior analysis. Helps companies
keep their shelves stocked with what you want.
5) Transportation:
Optimizing routes, predicting maintenance needs, and managing fleets. Makes logistics and
ridesharing highly efficient.
Chapter 2
OVERVIEW OF AI&ML
AI & ML are techniques, code or algorithms that enable machines to develop, demonstrate and mimic
human cognitive behavior or intelligence and hence the name “Artificial Intelligence”. Some of the
most successful applications of AI around us can be seen in Robotics, Computer Vision, Virtual
Reality, Speech Recognition, Automation, Gaming and so on…
Artificial Intelligence is constantly pushing the boundaries of what machines are capable of. The main
purpose of AI & ML is to train real-time smart machines to use their speed and capability. Most
importantly, machines can think and perform tasks like humans.
Through AI & ML we learn about building artificially intelligent systems, including computer
vision and natural language processing techniques. Machine Learning & Deep Learning are the key
parts of this course, and are implemented using Python scripting. Various libraries like NumPy, Pandas,
Matplotlib, Scikit-Learn, TensorFlow, etc. were used.
Introduction of AI & Machine Learning
Artificial Intelligence is a technique for building systems that mimic human behavior or
decision-making.
Machine Learning is a subset of AI that uses data to solve tasks. These solvers are trained models of
data that learn based on the information provided to them. This information is derived from probability
theory and linear algebra. ML algorithms use our data to learn and automatically solve predictive tasks.
Deep Learning is a subset of Machine Learning which relies on multi-layer neural networks to solve
tasks.
FIG 1
FIG 2: Relation between Artificial Intelligence, Machine Learning and Deep Learning
Chapter 3
Python Overview
FIG 3 FIG 4
FIG 5
o Comprehensions in Python:
1 List Comprehensions:
Syntax:
output_list = [output_exp for var in input_list if (var satisfies this condition)]
2 Dictionary Comprehensions:
Syntax:
output_dict = {key:value for (key, value) in iterable if (key, value satisfy this condition)}
3 Set Comprehensions:
Syntax:
newSet= { expression for element in iterable }
4 Generator comprehension:
Syntax:
generator= (expression for element in iterable if condition)
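A short illustrative sketch of all four comprehension forms, using assumed sample data:

nums = [1, 2, 3, 4, 5]

squares = [n * n for n in nums if n % 2 == 1]      # list: [1, 9, 25]
square_map = {n: n * n for n in nums if n > 2}     # dict: {3: 9, 4: 16, 5: 25}
remainders = {n % 3 for n in nums}                 # set: {0, 1, 2}
evens = (n for n in nums if n % 2 == 0)            # generator: evaluated lazily
print(squares, square_map, remainders, list(evens))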
Working with Python modules covers: creating a module, importing a module, renaming a module on
import, and importing a part of a module.
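A minimal sketch of these module operations, assuming a hypothetical file mymodule.py that defines a function greet(name):

import mymodule                      # import the whole (hypothetical) module
import mymodule as mm                # import the module under a renamed alias
from mymodule import greet           # import only a part of the module

print(mymodule.greet("Ada"))         # all three names refer to the same function
print(mm.greet("Ada"))
print(greet("Ada"))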
Chapter 4
Image Processing
• What Is Image Processing?
Image processing is the process of transforming an image into a digital form and performing certain
operations to get some useful information from it. The image processing system usually treats all
images as 2D signals when applying certain predetermined signal processing methods.
FIG 6
There are a few main types of image processing:
• Visualization: Objects not visible in the image are detected
• Recognition: Detect objects present in the image
• Sharpening and Restoration: Original images are enhanced
• Pattern Recognition: The patterns in the image are measured
• Retrieval: Find images that are similar to the original by searching a large database.
Some libraries that are used in image processing and data processing:
OpenCV:
OpenCV is often deployed for computer vision tasks like face detection, object detection, face
recognition, image segmentation, and much more.
Some of the main highlights of OpenCV:
1. Used by major companies like IBM, Google, and Toyota
2. Algorithmic efficiency and vast access to algorithms
3. Multiple interfaces
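A small illustrative sketch of basic OpenCV operations (assumes the opencv-python package is installed; input.jpg is a hypothetical image file):

import cv2

img = cv2.imread("input.jpg")                 # load the image as a BGR array
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to grayscale
edges = cv2.Canny(gray, 100, 200)             # Canny edge detection
cv2.imwrite("edges.jpg", edges)               # save the processed result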
Scikit-Image:
Scikit-Image, which uses NumPy arrays as image objects, offers many different algorithms for
segmentation, color space manipulation, geometric transformation, and analysis.
SciPy :
This image processing library is another great option if you’re looking for a wide range of applications
like image segmentation, convolution, reading images, face detection, feature extraction, and more.
Matplotlib:
Matplotlib is usually used for 2D visualizations like scatter plots, histograms, and
bar graphs, but it has proven to be useful for image processing by effectively pulling information out
of an image.
NumPy:
NumPy is an open-source Python library used for numerical analysis; it can also be used for image
processing tasks like image cropping, manipulating pixels, masking of pixel values, and more. NumPy
provides matrices and multi-dimensional arrays as data structures.
Pandas
Pandas is an open-source library commonly used in data science. It is primarily used for data analysis,
data manipulation, and data cleaning. Pandas allows for simple data modeling and data analysis
operations without needing to write a lot of code. As stated on their website, pandas is a fast, powerful,
flexible, and easy-to-use open-source data analysis and manipulation tool.
Scikit-Learn
The terms machine learning and scikit-learn are inseparable. Scikit-learn is one of the most used
machine learning libraries in Python. Built on NumPy, SciPy, and Matplotlib, it is an open-source
Python library that is commercially usable under the BSD license. It is a simple and efficient tool for
predictive data analysis tasks.
Chapter 5
STATISTICS
Statistics provides the framework for understanding and interpreting data. It enables us to quantify
uncertainty, spot trends, and draw conclusions about populations from samples. In data science, a
strong grasp of statistical concepts is crucial for making informed decisions, validating findings,
and building robust models.
1. Descriptive Statistics
Descriptive statistics help us summarize and describe the key characteristics of a dataset. This
includes measures of central tendency like mean (average), median (middle value), and mode
(most frequent value), which tell us about the typical or central value of a dataset. We also use
measures of variability, such as range (difference between maximum and minimum values),
variance, and standard deviation, to understand how spread out the data is. Additionally, data
visualization techniques like histograms, bar charts, and scatter plots provide visual
representations of data distributions and relationships, making it easier to grasp complex
patterns.
2. Inferential Statistics
Inferential statistics, on the other hand, allow us to make generalizations about a population
based on a sample. This involves understanding how to select representative samples and how
they relate to the overall population. Hypothesis testing is a key tool in inferential statistics,
allowing us to evaluate whether a hypothesis about a population is likely to be true based on
sample data. We also use confidence intervals to estimate the range of values within which a
population parameter is likely to fall. Finally, p-values and significance levels help us
determine the statistical significance of results and whether they are likely due to chance.
The Fundamental Statistics Concepts for Data Science:
1. Correlation
Correlation quantifies the relationship between two variables. The correlation coefficient, a
value between -1 and 1, indicates the strength and direction of this relationship. A positive
correlation means that as one variable increases, so does the other, while a negative correlation
means that as one variable increases, the other decreases. Pearson correlation measures linear
relationships, while Spearman correlation assesses monotonic relationships.
2. Regression
Regression analysis is a statistical method used to model the relationship between a dependent
variable and one or more independent variables. Linear regression models a linear relationship,
while multiple regression allows for multiple independent variables. Logistic regression is used
when the dependent variable is categorical, such as predicting whether a customer will churn or
not.
3. Bias
Bias refers to systematic errors in data collection, analysis, or interpretation that can lead to
inaccurate conclusions. Selection, measurement, and confirmation bias are examples of different
types of bias. Mitigating bias requires careful data collection and analysis practices, such as
random sampling, blinding, and robust statistical methods.
4. Probability
Probability is the study of random events and their likelihood of occurrence. Expected values,
variance, and probability distributions are examples of fundamental probability concepts.
Conditional probability and Bayes’ theorem allow us to update our beliefs about an event based
on new information.
5. Statistical Analysis
Statistical analysis is the process of testing hypotheses and making inferences about data using
statistical techniques. Analysis of variance (ANOVA) compares means between multiple groups,
while chi-square tests assess the relationship between categorical variables.
6. Normal Distribution
Numerous natural phenomena can be described by the normal distribution, commonly referred
to as the bell curve; it is one of the most common probability distributions. It is characterized by its
mean and standard deviation. Z-scores standardize values relative to the mean and standard deviation,
allowing us to compare values from different normal distributions.
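A brief sketch of two of the concepts above, correlation and z-scores, on assumed illustrative data (requires NumPy and SciPy):

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

pearson_r, _ = stats.pearsonr(x, y)    # linear relationship, between -1 and 1
spearman_r, _ = stats.spearmanr(x, y)  # monotonic relationship
print(pearson_r, spearman_r)

z_scores = (y - y.mean()) / y.std()    # z = (value - mean) / standard deviation
print(z_scores)                        # values in units of standard deviations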
Chapter 6
APPLICATIONS
When discussing applications in the context of Artificial Intelligence and Machine Learning, we're
referring to the practical uses and implementations of these technologies across various industries and
domains. Here are some notable AI and ML applications:
1. Healthcare:
• Medical Diagnosis: AI is used for diagnosing diseases, such as cancer, diabetes, and heart conditions,
by analysing medical images and patient data.
• Drug Discovery: ML models help identify potential drug candidates and predict their efficacy,
accelerating the drug development process.
• Personalized Medicine: AI assists in tailoring treatment plans and medications based on an individual's
genetic makeup and health history.
2. Finance:
• Algorithmic Trading: ML algorithms analyse financial data to make real-time trading decisions,
optimizing investment portfolios.
• Credit Scoring: AI assesses creditworthiness by analysing an applicant's financial history and
behaviour.
• Fraud Detection: ML models detect fraudulent transactions and activities by identifying unusual
patterns and anomalies.
3. Autonomous Vehicles:
• Self-Driving Cars: AI and ML enable vehicles to perceive their environment, make decisions, and
navigate without human intervention.
• Drones and UAVs: Unmanned aerial vehicles use AI for navigation, surveillance, and delivery tasks.
4. Natural Language Processing (NLP):
• Chatbots: NLP-powered chatbots provide customer support, answer queries, and automate interactions
in various industries.
• Language Translation: AI translates text and speech across languages, enabling global
communication.
• Sentiment Analysis: NLP algorithms analyse social media and customer reviews to gauge public
sentiment about products and services.
5. AI CHATBOT: AI chatbots are computer programs that use artificial intelligence to mimic human
conversation. They can be used for customer service, education, and entertainment. Some popular
AI chatbots include Bing, ChatGPT, Tay, ELIZA, and Cleverbot.
6. RECOMMENDATION SYSTEM: Various platforms that we use in our daily lives, like e-
commerce and entertainment websites, social media, and video sharing platforms such as YouTube,
all use recommendation systems to gather user data and provide customised recommendations to
users to increase engagement.
7. ROBOTICS: Robotics is another field where Artificial Intelligence applications are commonly
used. Robots powered by AI use real-time updates to sense obstacles in their path and instantly
pre-plan their journeys. They can be used for: carrying goods in hospitals, factories, and warehouses;
cleaning offices and large equipment; and inventory management.
8. AUTOMOBILES: AI is also used in self-driving vehicles. AI can be used along with the
vehicle’s camera, radar, cloud services, GPS, and control signals to operate the vehicle. AI can
improve the in-vehicle experience and provide additional systems like emergency braking, blind-
spot monitoring, and driver-assist steering.
9. SPAM FILTERS: The email services that we use in our day-to-day lives have AI that filters out
spam emails, sending them to spam or trash folders and letting us see the filtered content only. The
popular email provider Gmail has managed to reach a filtration accuracy of approximately 99.9%.
Chapter 7
Libraries in Python
Python has a rich ecosystem of libraries for data science, analysis, machine learning, and artificial
intelligence (AI). Here's a list of popular libraries in each of these categories:
7.1 Pandas
Pandas is a popular Python library for data manipulation and analysis. It provides data structures
and functions for working with structured data, such as spreadsheets or SQL tables, making it a
fundamental tool for data scientists and analysts. Below, I'll explain some of the key functions and
concepts in Pandas:
1. Data Structures:
• Series: A one-dimensional array-like object containing data and associated labels or indexes. It is
similar to a column in a spreadsheet or a single column of a database table.
• DataFrame: A two-dimensional, tabular data structure with rows and columns. It is similar to a
spreadsheet or a SQL table. DataFrames are the most commonly used Pandas data structure.
2. Data Import and Export:
• pd.read_csv(): Reads data from a CSV file into a DataFrame.
• pd.read_excel(): Reads data from an Excel file into a DataFrame.
• df.to_csv(): Writes data from a DataFrame to a CSV file.
• df.to_excel(): Writes data from a DataFrame to an Excel file.
3. Data Exploration:
• df.head(): Returns the first n rows of a DataFrame.
• df.tail(): Returns the last n rows of a DataFrame.
• df.info(): Provides information about the DataFrame, including data types and missing values.
• df.describe(): Generates summary statistics of numeric columns.
• df.shape: Returns the dimensions (number of rows and columns) of the DataFrame.
• df.columns: Returns the column names of the DataFrame.
4. Data Selection and Indexing:
• df['column_name'] or df.column_name: Selects a single column from the DataFrame.
• df[['column1', 'column2']]: Selects multiple columns.
• df.loc[row_label]: Selects rows by label.
• df.iloc[row_index]: Selects rows by integer index.
5. Data Manipulation and Transformation:
• df.drop(): Removes specified rows or columns from the DataFrame.
• df.rename(): Renames columns or indexes.
• df.sort_values(): Sorts the DataFrame by one or more columns.
• df.groupby(): Groups data based on a column or multiple columns.
• df.pivot_table(): Creates pivot tables to summarize data.
• df.apply(): Applies a function to each element or row in the DataFrame.
6. Data Cleaning:
• df.isnull(): Checks for missing values.
• df.dropna(): Removes rows or columns with missing values.
• df.fillna(): Fills missing values with specified values.
7. Data Aggregation:
• df.sum(), df.mean(), df.median(): Compute various summary statistics.
• df.max(), df.min(): Find the maximum and minimum values.
• df.count(): Counts the number of non-null elements.
8. Data Visualization Integration:
• Pandas integrates with data visualization libraries like Matplotlib and Seaborn to create plots and
charts directly from DataFrames.
9. Merging and Joining Data:
• pd.concat(): Concatenates DataFrames along rows or columns.
• pd.merge(): Performs database-style joins on DataFrames.
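A minimal sketch tying several of these functions together; the file name sales.csv and the columns region and revenue are assumptions for illustration:

import pandas as pd

df = pd.read_csv("sales.csv")              # import data into a DataFrame
print(df.head())                           # first rows for a quick look
print(df.describe())                       # summary statistics of numeric columns

df = df.dropna()                           # remove rows with missing values
by_region = df.groupby("region")["revenue"].mean()  # aggregate by a column (assumed names)
df.to_csv("sales_clean.csv", index=False)  # export the cleaned data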
7.2 NumPy
NumPy (Numerical Python) is a fundamental library in the Python ecosystem, particularly in the
context of data analysis and machine learning (ML). It provides support for working with
numerical data efficiently, making it an essential tool for data scientists and ML practitioners.
Here's how NumPy is used in data analysis and ML, along with some key functions:
Data Representation:
• ndarray: NumPy's core data structure is the ndarray (N-dimensional array). It allows for efficient
storage and manipulation of multi-dimensional data, such as matrices and tensors. This is crucial
in data analysis and ML where datasets are often multi-dimensional.
Data Cleaning and Transformation:
• Handling Missing Data: NumPy provides functions like np.isnan() and np.nan_to_num() for
identifying and handling missing data, a common preprocessing step in data analysis.
• Data Transformation: NumPy allows you to reshape and transform data using functions like
np.reshape(), np.transpose(), and np.concatenate(). This is useful for preparing data for various
analysis and modeling tasks.
Data Exploration:
• Descriptive Statistics: NumPy offers functions for computing basic statistics, such as np.mean(),
np.median(), np.std(), and np.var(), which are essential for exploring and summarizing data.
Machine Learning:
• Data Representation: In ML, datasets are often represented as NumPy arrays. Many ML libraries,
including Scikit-Learn, expect data in this format.
• Feature Engineering: NumPy is used to create new features and transform existing ones, a critical
aspect of feature engineering in ML.
• Performance Optimization: NumPy's efficient array operations are crucial for optimizing ML
algorithms, particularly when working with large datasets.
o Missing Data: NumPy provides tools to identify and manage missing data, which is a common issue
in real-world datasets.
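A short illustrative sketch of these NumPy operations on assumed data:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, np.nan]])   # a 2x2 ndarray with a missing value
print(np.isnan(a))                          # locate the missing entry
a = np.nan_to_num(a, nan=0.0)               # replace NaN with 0.0

b = a.reshape(4)                            # reshape 2x2 into a 1-D array
print(b.mean(), b.std())                    # descriptive statistics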
7.3 Matplotlib
Matplotlib is a powerful Python library for creating data visualizations and plots. It provides various
functions and modules that enable users to customize, create, and display a wide range of
visualizations. Here are some key functions and concepts associated with Matplotlib in the context
of data analysis and visualization:
• plt.annotate(): Annotates specific data points with arrows and labels.
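A minimal illustrative sketch of a scatter plot with one annotated data point (assumed sample data):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

plt.scatter(x, y)                                   # basic 2D scatter plot
plt.xlabel("x")
plt.ylabel("y")
plt.annotate("peak", xy=(5, 6), xytext=(3.5, 5.8),
             arrowprops=dict(arrowstyle="->"))      # label a specific point
plt.show()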
7.4 Seaborn
Seaborn is a Python data visualization library based on Matplotlib that provides a high-level
interface for creating informative and aesthetically pleasing statistical graphics. It is particularly
well-suited for data analysis and exploration, as it simplifies the process of creating complex
visualizations with concise code.
Here's an explanation of Seaborn in the context of data analysis and visualization, along with its
key functions.
Advantages of Seaborn:
1. High-Level Interface: Seaborn is designed to work seamlessly with Pandas DataFrames, making
it easier to visualize data directly from data structures commonly used in data analysis.
2. Beautiful Aesthetics: Seaborn provides attractive default styles and color palettes that enhance the
visual appeal of plots.
3. Statistical Plotting: Seaborn specializes in creating statistical plots that help users understand data
distributions, relationships, and patterns.
Statistical Enhancements:
• sns.regplot(): Combines a scatter plot with a linear regression fit line.
• sns.lmplot(): Creates regression plots for visualizing relationships between variables.
Pairwise Relationships:
• sns.pairplot(): Generates a grid of scatter plots for examining pairwise relationships between
numerical columns in a dataset, with histograms along the diagonal.
Heatmaps:
• sns.heatmap(): Generates heatmaps to visualize the correlation matrix or other 2D data
structures.
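A short illustrative sketch using Seaborn's bundled "tips" demo dataset (fetched on first use):

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")                  # a Pandas DataFrame
sns.regplot(x="total_bill", y="tip", data=tips)  # scatter plot plus regression line
plt.show()

corr = tips.select_dtypes("number").corr()       # correlation matrix
sns.heatmap(corr, annot=True)                    # visualize it as a heatmap
plt.show()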
7.5 Scikit-Learn
Scikit-Learn, often referred to as sklearn, is a Python library for machine learning that provides a wide
range of functions and tools for various aspects of machine learning tasks. Below, I'll explain
Scikit-Learn in the context of machine learning, along with some key functions and concepts:
Data Preparation:
• Data Splitting: train_test_split(): Splits a dataset into training and testing sets for model
evaluation.
• Data Preprocessing: Functions like StandardScaler() and MinMaxScaler() are used to scale
and normalize features. LabelEncoder() and OneHotEncoder() are used for encoding
categorical variables.
Supervised Learning:
• Classification: Scikit-Learn includes classifiers like LogisticRegression,
DecisionTreeClassifier, RandomForestClassifier, and more. Key functions include fit(),
predict(), and score().
• Regression: Regression models like LinearRegression, Ridge, and Lasso are available for
predictive modeling. Similar functions as in classification are used for regression tasks.
Unsupervised Learning:
• Clustering: Scikit-Learn provides clustering algorithms such as KMeans, DBSCAN, and
AgglomerativeClustering. Key functions include fit() and predict().
• Dimensionality Reduction: Techniques like PCA (Principal Component Analysis) and t-SNE
(t-distributed Stochastic Neighbor Embedding) are used for dimensionality reduction and
visualization.
Model Evaluation:
• Cross-Validation: cross_val_score() and KFold() are used for k-fold cross-validation to
estimate a model's performance on unseen data.
• Metrics: Scikit-Learn provides metrics like accuracy_score, precision_score, recall_score,
f1_score, and mean_squared_error for evaluating model performance.
Hyperparameter Tuning:
• Grid Search: GridSearchCV() allows you to perform hyperparameter tuning by specifying a
grid of hyperparameters to search over.
• Randomized Search: RandomizedSearchCV() performs hyperparameter tuning using
randomized search, which is often faster than grid search.
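A compact illustrative sketch of cross-validation and grid search, using the iris dataset bundled with Scikit-Learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation estimates performance on unseen data
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())

# Grid search tries every combination in the parameter grid
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 3, 4], "min_samples_split": [2, 5]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)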
7.6 TensorFlow
TensorFlow is an open-source machine learning framework developed by Google. It's designed for
creating, training, and deploying machine learning models, particularly deep learning models.
TensorFlow allows you to build and train neural networks for a wide range of machine learning
tasks. Here's an explanation of TensorFlow in the context of machine learning, along with some
key functions and concepts:
TensorFlow Core:
• Tensors: TensorFlow is named after its core concept, tensors, which are multi-dimensional
arrays. Tensors can be constants, variables, or placeholders.
• Computational Graph: TensorFlow builds a computational graph that represents the
operations to be performed on tensors. This allows for efficient execution and optimization.
Model Training:
• model.compile(): Configures the model with the chosen loss function, optimizer, and metrics.
• model.fit(): Trains the model on labeled training data, specifying the number of epochs and
batch size.
Model Evaluation:
• model.evaluate(): Evaluates the trained model on a test dataset to assess its performance using
metrics like accuracy, loss, etc.
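A minimal illustrative sketch of the compile/fit/evaluate workflow on small synthetic data:

import numpy as np
import tensorflow as tf

X = np.random.rand(100, 4).astype("float32")  # 100 samples, 4 features
y = (X.sum(axis=1) > 2.0).astype("int32")     # a synthetic binary label

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)  # train the model
print(model.evaluate(X, y, verbose=0))               # [loss, accuracy]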
Chapter 8
Machine Learning Algorithms
FIG 7
Import statements:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
FIG 8
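A minimal illustrative use of the imports above; a sketch on the iris dataset bundled with Scikit-Learn:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
clf.fit(X_train, y_train)                # train on the training split
print(clf.score(X_test, y_test))         # mean accuracy on the held-out split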
Types of Linear Regression
Linear regression can be further divided into two types of algorithm: Simple Linear Regression, which
uses a single independent variable, and Multiple Linear Regression, which uses more than one.
Cost function:
o The different values of the weights or coefficients of the line (a0, a1) give different lines of
regression, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as the Hypothesis function.
o For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average
of the squared errors between the predicted values and the actual values.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o This is done by randomly selecting values for the coefficients and then iteratively updating the
values to reach the minimum of the cost function.
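A small illustrative NumPy sketch of gradient descent for simple linear regression on assumed data, minimizing the MSE cost (1/n) Σ (y_pred - y)²:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # roughly y = 2x

a0, a1, lr = 0.0, 0.0, 0.01               # initial coefficients and learning rate
for _ in range(2000):
    error = (a0 + a1 * x) - y             # prediction error
    a0 -= lr * 2 * error.mean()           # gradient of MSE w.r.t. the intercept
    a1 -= lr * 2 * (error * x).mean()     # gradient of MSE w.r.t. the slope
print(a0, a1)                             # approaches the best-fit coefficients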
R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
o A high value of R-squared indicates less difference between the predicted values and the actual
values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for
multiple regression.
o It can be calculated from the below formula:
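In standard notation (the usual definition, where y_i are the actual values, ŷ_i the predicted values, and ȳ the mean of the actual values):

R² = 1 - (Σ(y_i - ŷ_i)²) / (Σ(y_i - ȳ)²)

that is, one minus the ratio of the residual sum of squares to the total sum of squares.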
FIG 9
8.4 Random Forest
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and, based on the majority vote of the predictions, predicts the final output. A greater number of
trees in the forest leads to higher accuracy and prevents the problem of overfitting.
FIG 10
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.
There are mainly four sectors where Random Forest is mostly used:
o Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
o Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
o Land Use: We can identify areas of similar land use with this algorithm.
o Marketing: Marketing trends can be identified using this algorithm.
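A minimal illustrative Random Forest sketch on Scikit-Learn's bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees vote by majority
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on the held-out split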
FIG 11
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
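A short illustrative SVM sketch on Scikit-Learn's bundled digits dataset, an image classification task:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0)    # RBF kernel; C controls regularization strength
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))  # accuracy on held-out digit images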
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is used to determine the probability
of a hypothesis with prior knowledge, and it depends on conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)

where:
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
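As a small worked illustration with assumed numbers: suppose 20% of emails are spam (P(A) = 0.2), the word "offer" appears in 60% of spam emails (P(B|A) = 0.6), and "offer" appears in 25% of all emails (P(B) = 0.25). Then P(A|B) = (0.6 × 0.2) / 0.25 = 0.48, so an email containing "offer" is spam with probability 48%.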
Chapter 9
Natural Language Processing
Natural Language Processing (NLP) is a field of computer science that deals with the interaction
between computers and human language. It enables computers to understand, interpret, and
generate human language in a meaningful way. NLP encompasses a wide range of tasks, from basic
language analysis to complex tasks like machine translation, text summarization, and question
answering.
NLP systems leverage various techniques from linguistics, computer science, and artificial
intelligence to process text and speech data. They employ algorithms to analyze the structure,
meaning, and context of language, allowing computers to extract valuable information, perform
tasks, and communicate effectively with humans.
NLP Algorithms
1. Statistical Methods
Statistical methods, such as Hidden Markov Models (HMMs) and Conditional Random Fields
(CRFs), have been widely used in NLP for tasks like part-of-speech tagging and named entity
recognition. These methods rely on statistical probabilities to predict linguistic patterns.
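A small illustrative NLTK sketch of statistical part-of-speech tagging (assumes the nltk package is installed and its tokenizer and tagger data have been fetched once with nltk.download()):

import nltk

text = "NLP enables computers to understand human language."
tokens = nltk.word_tokenize(text)  # split the sentence into word tokens
tags = nltk.pos_tag(tokens)        # statistical part-of-speech tagging
print(tags)                        # e.g. [('NLP', 'NNP'), ('enables', 'VBZ'), ...]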
Chapter 10
Deep Learning Algorithms
The given figure illustrates a typical diagram of a Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
FIG 12
Dendrites from the biological neural network represent inputs in artificial neural networks, the cell
nucleus represents nodes, synapses represent weights, and the axon represents output.
FIG 13
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer lies between the input and output layers. It performs all the calculations to find
hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results in
output that is conveyed using this layer.
The artificial neural network takes the inputs, computes the weighted sum of the inputs, and includes a
bias. This computation is represented in the form of a transfer function.
The weighted total is then passed as an input to an activation function to produce the output.
Activation functions decide whether a node should fire or not. Only the nodes that fire make it to the
output layer. There are distinct activation functions available that can be applied depending on the
sort of task we are performing.
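A tiny illustrative sketch of a single artificial neuron with assumed numbers:

import numpy as np

inputs = np.array([0.5, 0.3, 0.2])
weights = np.array([0.4, 0.7, 0.2])
bias = 0.1

weighted_sum = np.dot(inputs, weights) + bias  # the transfer function
output = max(0.0, weighted_sum)                # ReLU activation: fire only if positive
print(weighted_sum, output)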
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
The Convolutional layer applies filters to the input image to extract features, the Pooling layer down
samples the image to reduce computation, and the fully connected layer makes the final prediction.
The network learns the optimal filters through backpropagation and gradient descent.
Types of layers:
• Input Layers: This is the layer in which we give input to our model. In a CNN, generally, the input
will be an image or a sequence of images. This layer holds the raw input of the image with width 32,
height 32, and depth 3.
• Convolutional Layers: This layer is used to extract features from the input dataset. It
applies a set of learnable filters known as kernels to the input images. The filters/kernels are smaller
matrices, usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over the input image data and computes
the dot product between the kernel weights and the corresponding input image patch. The output of this
layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we'll get an output
volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of the preceding layer, activation
layers add nonlinearity to the network. It applies an element-wise activation function to the output
of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky
ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
• Pooling Layer: This layer is periodically inserted in the convnets, and its main function is to reduce
the size of the volume, which makes the computation fast, reduces memory usage, and also prevents
overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a
max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16x16x12.
• Flattening: The resulting feature maps are flattened into a one-dimensional vector after the
convolution and pooling layers so they can be passed into a fully connected layer for classification
or regression.
• Fully Connected Layers: It takes the input from the previous layer and computes the final
classification or regression task.
• Output Layer: The output from the fully connected layers is then fed into a logistic function for
classification tasks, such as sigmoid or softmax, which converts the output for each class into a
probability score.
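A minimal illustrative Keras sketch of the layer stack described above (32x32x3 input, 12 filters, 2x2 max pooling); a sketch, not a tuned model:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                  # input layer: 32x32 RGB image
    layers.Conv2D(12, (3, 3), padding="same",
                  activation="relu"),                 # 12 kernels -> 32x32x12 feature maps
    layers.MaxPooling2D(pool_size=(2, 2), strides=2), # downsample to 16x16x12
    layers.Flatten(),                                 # flatten to a 1-D vector
    layers.Dense(10, activation="softmax"),           # probability score per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()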
Chapter 11
CONCLUSION
“Data science is not just a technical discipline; it's a strategic asset. It has the power to reshape
industries, enhance human experiences, and address global challenges. The ability to derive actionable
insights from data sets organizations apart in today’s competitive landscape.
As we move forward, the role of data science will continue to expand, integrating with emerging
technologies like artificial intelligence and the Internet of Things (IoT). This synergy will unlock even
greater potential, pushing the boundaries of what’s possible. In essence, data science is the key to
unlocking a future driven by data, where informed decisions lead to impactful and sustainable
outcomes.”
“The integration of Artificial Intelligence (AI) and Machine Learning (ML) with Python has opened up
a world of possibilities for solving complex problems, automating tasks, and making data-driven
decisions. AIML, which stands for Artificial Intelligence and Machine Learning, leverages Python's rich
ecosystem of libraries and tools to create intelligent systems, predictive models, and data-driven
applications. Whether it's natural language processing, image recognition, recommendation systems, or
predictive analytics, Python's versatility and extensive AI and ML libraries like TensorFlow, Scikit-Learn,
and Keras have made it a leading choice for researchers, data scientists, and developers. AIML using
Python empowers us to harness the power of data and create intelligent solutions that drive innovation
and transform industries."
“The field of artificial intelligence and machine learning has made substantial progress in the past five
years and is having real-world influence on people, institutions and culture. Even if the current state of
AI technology is still far short of the field’s founding aspiration of recreating full human-like
intelligence in machines, research and development teams are building on these advances and
absorbing them into society-facing applications. Artificial Intelligence has helped people to create
robotic and computer systems to make their businesses more economically efficient. Life was forever
changed by AI because humans could use the support of machines to complete repetitive, dangerous and
difficult tasks. With the help of AI machines, people could get jobs done faster and easier. Businesses
could improve the efficiency of manufacturing output, data processing and customer service.”