
Artificial Intelligence and Data Science

Chapter 1
COMPANY PROFILE

Name Abeyaantrix Softlab(opc) Pvt Ltd.


Address Abeyaantrix Softlab, #1659/5,2nd floor, NN
complex, beside coffee day, Vidyanagar,Vidyanagar
first bus stop, Davanagere, Karnataka 577005
Contact Number +91 9972140290
Email [email protected]
Website http://www.asoftlab.com
Company Registration Number U72900KA2018OPC114372
Type of the Company Private
Nature of the Company Information Technology
Company Logo

Vision Abeyaantrix Softlab (opc) Pvt Ltd. in Davanagere is one of the leading computer software training institutes. It is also known for computer training, libraries, Python training, computer hardware training, Java training and much more. Learning computers has become a necessity, which is why several people enrol in computer software training institutes that provide the best computer software courses, such as mobile app development, data analytics, UI/UX development, databases and more.
Company Operational Status Private


Chapter 2
INTRODUCTION

2.1 Introduction To Artificial Intelligence (AI)

A recommender system (RecSys), or a recommendation system (sometimes replacing "system" with terms such as platform, engine, or algorithm), is a subclass of information filtering system that provides suggestions for the items most pertinent to a particular user. Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer. Modern recommendation systems, such as those used on large social media sites, make extensive use of AI, machine learning and related techniques to learn the behaviour and preferences of each user and tailor their feed accordingly.

Fig 2.1: Artificial Intelligence


Fig. 2.1 illustrates how such AI-driven suggestions apply to various decision-making processes, such as what product to purchase, what music to listen to, or what online news to read [2]. Recommender
systems are used in a variety of areas, with commonly recognised examples taking the form of
playlist generators for video and music services, product recommenders for online stores, or content
recommenders for social media platforms and open web content recommenders. These systems can
operate using a single type of input, like music, or multiple inputs within and across platforms like
news, books and search queries. There are also popular recommender systems for specific topics
like restaurants and online dating. Recommender systems have also been developed to explore
research articles and experts, collaborators, and financial services.


2.2 Introduction to Data Science


Data Science is an interdisciplinary field that focuses on extracting meaningful insights and
knowledge from structured and unstructured data. It combines techniques from statistics, computer
science, mathematics, and domain knowledge to analyze, interpret, and visualize data.

With the explosion of digital data in recent years, data science has become essential for solving
real-world problems and making data-driven decisions. It is widely used in industries such as
healthcare, finance, marketing, e-commerce, and technology. Data scientists use tools and
programming languages like Python, R, SQL, and libraries such as pandas, NumPy, and scikit-learn
to clean, explore, model, and visualize data. They apply machine learning and statistical methods to
discover trends, make predictions, and support strategic decisions. As organizations increasingly rely
on data, the role of data science has become critical in shaping innovation, improving efficiency, and
gaining competitive advantage.

Fig 2.2: Data Science

Fig. 2.2 shows that Data Science is a multidisciplinary field that focuses on analyzing
and interpreting large volumes of data to uncover patterns, trends, and insights that can guide
decision-making. It combines elements from computer science, mathematics, statistics, and domain
expertise to process raw data into valuable information. In today’s digital world, data is generated at
an unprecedented rate through smartphones, social media, sensors, online transactions, and countless
other sources. This vast amount of data—often referred to as big data—holds tremendous potential,
but only if it is properly analyzed. Data science provides the tools, techniques, and methodologies to
extract meaningful knowledge from this data.


Chapter 3
TASK PERFORMED

3.1 Review of Python Programming Language


Python stands out as a programming language due to its simplicity and readability. Its syntax, emphasizing clean and readable code, resembles pseudo-code, making it beginner-friendly while remaining clear to experienced developers. This readability is enforced through an indentation-based structure, which encourages organized and consistent code.

Figure 3.1 : Python

The language's versatility allows it to support multiple programming paradigms, making it


adaptable across various domains. One of its standout features is the extensive library ecosystem.
Python boasts an array of libraries and frameworks catering to diverse needs—from data analysis
with Pandas to machine learning using TensorFlow and web development with Django or Flask.
These libraries expedite development and enhance Python's capabilities. The language benefits from
a large and active community, contributing to its rich documentation, abundant online resources,
and a plethora of third-party packages.

Example 1: Creating a dictionary from two lists

list1 = [1, 2, 3, 4, 5]
list2 = ["A", "B", "C", "D", "E"]
# Pair each element of list1 with the corresponding element of list2
dictionary = dict(zip(list1, list2))
print(dictionary)   # {1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E'}


3.1.1 List :
In Python, a list is a versatile and fundamental data structure used to store an ordered collection of
elements. Lists are mutable, meaning they can be changed or modified after creation. They allow
for a flexible and dynamic way to store and manipulate data.

Creating a List: A list is defined by enclosing elements within square brackets [], separated by commas.

my_list = [1, 2, 3, 'hello', 'world']

Figure 3.1.1 : List in Python

List Methods:

Python lists have built-in methods for various operations, illustrated in the short example after the list below:

• append(): Add an element to the end of the list.


• insert(): Insert an element at a specific position.
• remove(): Remove the first occurrence of a specified value.
• pop(): Remove and return an element at a given index (by default the last element).
• index(): Find the index of the first occurrence of a value.
• count(): Count the number of occurrences of a value in the list.
• sort(): Sort the list in ascending order.
• reverse(): Reverse the elements of the list in place.
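A short example illustrating these list methods (the values are illustrative):

my_list = [3, 1, 4, 1, 5]
my_list.append(9)        # [3, 1, 4, 1, 5, 9]
my_list.insert(0, 2)     # [2, 3, 1, 4, 1, 5, 9]
my_list.remove(1)        # removes the first 1 -> [2, 3, 4, 1, 5, 9]
last = my_list.pop()     # removes and returns 9
print(my_list.index(4))  # 2
print(my_list.count(1))  # 1
my_list.sort()           # [1, 2, 3, 4, 5]
my_list.reverse()        # [5, 4, 3, 2, 1]
print(my_list)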

3.1.2 Dictionary :
In Python, a dictionary is a powerful and versatile data structure used to store collections of data in
the form of key-value pairs. It allows you to associate a unique key with a value, enabling efficient
retrieval of values based on their associated keys.


Creating a Dictionary:

You can create a dictionary by specifying key-value pairs within curly braces {}, where each key is separated from its value by a colon (:) and the pairs are separated by commas.

my_dict = {'name': 'Alice', 'age': 30, 'city': 'New York'}

Figure 3.1.2 : Dictionary in Python

Dictionary Methods: Python provides built-in methods to perform various operations on dictionaries, illustrated in the short example after the list below:

• keys(): Retrieve a list of all keys in the dictionary.


• values(): Retrieve a list of all values in the dictionary.
• items(): Retrieve a list of key-value pairs as tuples.
• get(): Retrieve the value for a given key, providing a default value if the key doesn't exist.
• update(): Merge one dictionary into another.
• clear(): Remove all items from the dictionary.
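A short example illustrating these dictionary methods, reusing my_dict from above:

my_dict = {'name': 'Alice', 'age': 30, 'city': 'New York'}
print(my_dict.keys())                     # dict_keys(['name', 'age', 'city'])
print(my_dict.values())                   # dict_values(['Alice', 30, 'New York'])
print(my_dict.items())                    # key-value pairs as tuples
print(my_dict.get('country', 'Unknown'))  # 'Unknown' (default, key absent)
my_dict.update({'age': 31, 'country': 'USA'})  # merge/overwrite entries
my_dict.clear()                           # now {}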

3.2 Numpy

• Unlike a Python list, a NumPy array does not allow more than one data type in a single array.
• However, it allows data to be stored in any number of dimensions.
• NumPy arrays are much faster than lists.
• The NumPy library is written in optimized C, so its performance and efficiency are very high.
• It is used in most numeric and scientific computing.


Importing the library with an alias and printing its version:

import numpy as np
print(np.__version__)

Fig 3.2 : Numpy


Numpy is a powerful Python library for numerical computing. It provides support for arrays,
matrices, and a wide range of mathematical functions to operate on these data structures efficiently.
NumPy, short for "Numerical Python," is an open-source Python library primarily used for
numerical and mathematical operations. Its core feature is the ndarray (n-dimensional array) object,
offering a versatile and efficient way to work with large datasets and perform various computations.
At its heart, NumPy revolves around the ndarray, a multi-dimensional container for
homogeneous data. These arrays can have different shapes and sizes, ranging from simple 1D
arrays (vectors) to 2D arrays (matrices) and higher-dimensional arrays. One of NumPy's significant
advantages is its ability to execute operations on entire arrays without the need for explicit loops.
This capability, often termed vectorization, leverages optimized C and Fortran libraries at the
backend, making computations faster and more efficient compared to using standard Python lists.
Creating arrays in NumPy is straightforward. You can initialize an array using a Python list
or through NumPy's functions like np.array(), np.zeros(), np.ones(), or np.arange().


import numpy as np
# Creating a 1D array
arr_1d = np.array([1, 2, 3, 4, 5])
# Creating a 2D array (matrix)
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
# Creating an array of zeros and ones
zeros_array = np.zeros((2, 3)) # 2 rows, 3 columns
ones_array = np.ones((3, 2)) # 3 rows, 2 columns
NumPy also provides various functions for array manipulation, statistical operations, linear
algebra, Fourier analysis, and more. For instance, you can calculate the mean, standard deviation,
perform matrix multiplication, transpose matrices, find eigenvalues, solve linear equations, and apply
Fourier transforms. NumPy's array indexing and slicing capabilities allow efficient access to specific
elements, rows, columns, or subarrays within arrays. This feature is essential for data extraction and
manipulation. Here are some key aspects of NumPy, illustrated in the short sketch after the list below:

• Arrays: The fundamental data structure in Numpy is the numpy.ndarray, an n-dimensional


array that can hold elements of the same data type. These arrays offer efficient storage and
manipulation of numerical data.
• Mathematical Operations: Numpy provides a vast collection of mathematical functions for
performing operations on arrays. These include arithmetic operations, trigonometric
functions, statistical functions, linear algebra operations, and more.
• Indexing and Slicing: Numpy arrays support powerful indexing and slicing operations to
access specific elements, rows, columns, or subarrays efficiently.
• Broadcasting: Numpy's broadcasting feature allows operations between arrays of different
shapes and sizes, making it easier to perform element-wise operations without explicitly
aligning array dimensions.
• Integration with Python: Numpy seamlessly integrates with other libraries in the scientific
Python ecosystem, such as SciPy, Pandas, and Matplotlib, making it an essential tool for data
analysis, scientific computing, and machine learning.
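A brief sketch illustrating these aspects (the arrays and values are chosen only for illustration):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2x3 ndarray holding a single data type

# Mathematical operations are vectorized (no explicit loops)
print(a * 2)            # element-wise multiplication
print(a.mean(axis=0))   # column-wise mean: [2.5 3.5 4.5]

# Indexing and slicing
print(a[0, 1])          # element in row 0, column 1 -> 2
print(a[:, 1:])         # all rows, columns 1 onwards

# Broadcasting: a 2x3 array combined with a length-3 vector
b = np.array([10, 20, 30])
print(a + b)            # b is broadcast across each row of a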


3.3 Matplotlib
Visualization: Visualization makes data easily understandable to any person. Data visualization is the process of presenting data in the form of graphs or charts. It helps in understanding large and complex amounts of data very easily. It allows decision-makers to make decisions very efficiently and also helps them identify new trends and patterns easily. It is also used in high-level data analysis for Machine Learning and Exploratory Data Analysis (EDA). Data visualization can be done with various tools like Tableau, Power BI, and Python. In this section, we discuss how to visualize data with the help of the Matplotlib library of Python.
Matplotlib: Matplotlib is a low-level Python library used for data visualization. It is easy to use and emulates MATLAB-like graphs and visualizations. The library is built on top of NumPy arrays and consists of several plot types like line charts, bar charts, histograms, etc. It provides a lot of flexibility but at the cost of writing more code.

Fig 3.3 : Matplotlib.

Installation: We will use the pip command to install this module. If you do not have pip installed, first download and install the latest version of pip.
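For example, from a terminal or command prompt:

pip install matplotlib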

Matplotlib stands as a cornerstone for data visualization in Python, offering a


comprehensive toolkit for creating a diverse range of plots and charts. With a MATLAB-like
interface provided by its pyplot module, Matplotlib allows users to generate high-quality
visualizations with relative ease. Through simple commands, you can create line plots to display
trends or relationships between variables, scatter plots to visualize data distributions, and an array
of other plot types to suit specific data visualization needs.

Beyond its simplicity, Matplotlib offers extensive customization options. Users can fine-
tune every aspect of their plots, altering colors, markers, line styles, labels, titles, and grid lines.
This flexibility empowers individuals to craft visualizations that effectively communicate their data
insights. Matplotlib also supports the creation of multiple subplots within a single figure, enabling
the display of various plots simultaneously.


3.3.1 Line Chart


A line chart, also known as a line plot, is a type of chart that displays data points as a series of markers connected by straight lines. It is commonly used to visualize trends and relationships between continuous data points over a certain period or sequence. In Python, you can create line charts using Matplotlib. Creating a line chart involves a few simple steps. First, you import Matplotlib, a widely used plotting library that provides powerful tools for visualizing data. Once imported, you prepare your data by defining values for the x and y axes, representing the horizontal and vertical components of your plot.

Next, using Matplotlib's plot() function, you can generate the line chart by passing your x
and y values. This function creates a basic line representation of your data. You can further
customize the chart by adding parameters to the plot() function, such as markers, line styles, colors,
and labels. These adjustments allow you to personalize the appearance and enhance the readability
of your visualization. Lastly, additional features like axis labels and a title can be set using xlabel(),
ylabel(), and title() functions, respectively. Finally, the show() function displays the generated line
chart.
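A minimal sketch of these steps, plotting a sine wave similar to the one shown in Fig 3.3.1 (the data values are illustrative):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)   # 100 points between 0 and 2*pi
y = np.sin(x)                        # sine of each x value

plt.plot(x, y, color='blue', linestyle='-', label='sin(x)')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.title('Sine Wave')
plt.legend()
plt.show()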

Fig 3.3.1 : Sine wave


3.3.2 Bar Chart


A bar chart is a visualization that represents categorical data with rectangular bars. The length or height of each bar is proportional to the value it represents. In Python, a bar chart can be created using the Matplotlib library, a powerful tool for data visualization. To start, import Matplotlib using import matplotlib.pyplot as plt. Next, define the data that you want to visualize. Typically, this includes two lists: one for the categories or labels you want on the x-axis and another for the corresponding values on the y-axis. For example, you could have a list of categories like ['Category A', 'Category B', 'Category C'] and a list of values like [20, 35, 45].

Once you have your data, use plt.bar() to create the bar chart. Pass the category list as the
first argument and the values list as the second argument (plt.bar(categories, values)). You can
further customize the chart by adding a title (plt.title()), labels for the x and y axes (plt.xlabel() and
plt.ylabel()), and even change the colors or styles of the bars.

Finally, display the chart using plt.show(). This command will render the bar chart in a
window or notebook depending on your Python environment. If you want to save the chart as an
image file, you can use plt.savefig('bar_chart.png') before plt.show().
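A short sketch of these steps, reusing the example categories and values mentioned above:

import matplotlib.pyplot as plt

categories = ['Category A', 'Category B', 'Category C']
values = [20, 35, 45]

plt.bar(categories, values, color='skyblue')
plt.title('Bar Chart Example')
plt.xlabel('Category')
plt.ylabel('Value')
plt.savefig('bar_chart.png')   # optional: save the figure before showing it
plt.show()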

Fig 3.3.2 : Bar chart


3.3.3 Histogram
A histogram is a type of bar plot that represents the distribution of a continuous numerical variable
by dividing the data into bins or intervals and displaying the frequency of occurrences within each
bin. It's an effective way to visualize the distribution and identify patterns or trends in your data.
In Python, you can create histograms, a graphical representation of the distribution of numerical data, using the Matplotlib library. To begin, import Matplotlib with import matplotlib.pyplot as plt. Histograms are particularly useful for understanding the frequency or density distribution of a dataset. Prepare your data, usually in the form of a list or array of numerical values. The plt.hist() function is used to create the histogram. Pass your data to this function, specifying the number of bins (which determines the number of intervals on the x-axis) or let Matplotlib choose a default value. For instance, plt.hist(data, bins=10) will generate a histogram with 10 bins.

You can further customize the histogram by adding labels to the x and y axes
(plt.xlabel() and plt.ylabel()), a title (plt.title()), adjusting colors, and specifying the range of values
using the range parameter if needed. Matplotlib will calculate the frequencies or densities and plot the bars accordingly. Lastly, use plt.show() to display the histogram. This will render the histogram in
your Python environment. If you wish to save the histogram as an image file, you can employ
plt.savefig('histogram.png') before plt.show().
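A minimal sketch, assuming illustrative normally distributed data and 10 bins as described above:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)   # 1000 illustrative values from a normal distribution

plt.hist(data, bins=10, color='green', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()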

Fig 3.3.3 : Histogram


3.3.4 Scatter Plot


A scatter plot is a type of plot that displays values for two sets of data as points on a two-
dimensional coordinate system. It's useful for visualizing the relationship between two continuous
variables and identifying any patterns, trends, or correlations within the data. A scatter plot in
Python, created using Matplotlib library, is a visualization that displays individual data points on a
two-dimensional graph. It's especially useful for showing the relationship between two variables. To
start, import Matplotlib with import matplotlib.pyplot as plt. Prepare your data in the form of two
lists or arrays, one for the x-axis values and another for the y-axis values. Then, use plt.scatter() to
generate the scatter plot. For instance, plt.scatter(x_data, y_data) will plot the points according to
the values provided in x_data and y_data.

You can customize the scatter plot by adding labels to the x and y axes (plt.xlabel() and
plt.ylabel()), giving it a title (plt.title()), adjusting point size or colors, and setting markers to
differentiate data points. Additionally, you can add a color map or change marker styles for added
clarity or aesthetic appeal. Once you've customized your plot, use plt.show() to display it in
your Python environment. This will show the scatter plot with the data points represented on the
graph. If you want to save the scatter plot as an image file, you can use plt.savefig('scatter_plot.png')
before plt.show().
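A minimal sketch of these steps, with illustrative x_data and y_data values:

import matplotlib.pyplot as plt

x_data = [1, 2, 3, 4, 5]   # illustrative x values
y_data = [2, 4, 5, 4, 6]   # illustrative y values

plt.scatter(x_data, y_data, color='red', marker='o', s=50)
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Scatter Plot Example')
plt.show()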

Fig 3.3.4 : Scatter Plot


3.3.5 Pie Chart

Creating a pie chart in Python using Matplotlib is a straightforward process, enabling the
visualization of data distributions and proportions within a dataset. To begin, you import Matplotlib,
a powerful plotting library that facilitates the creation of various chart types, including pie charts.
Utilizing Matplotlib's pie() function, you can generate the pie chart by passing the data values as
arguments. This function translates the values into proportional segments of the pie, creating a visual
representation of each category's contribution relative to the whole.
Creating a pie chart in Python, typically using the Matplotlib library, allows you to
represent categorical data as a circular statistical graphic divided into slices to illustrate
proportions. Begin by importing Matplotlib with import matplotlib.pyplot as plt.

Prepare your data in the form of a list or array containing values representing the sizes or
proportions of different categories. Use plt.pie() to generate the pie chart. For instance, plt.pie(sizes,
labels=labels, autopct='%1.1f%%') would create a pie chart based on the sizes list and label each
slice using the corresponding values in the labels list, while autopct='%1.1f%%' adds percentage
labels to each slice showing their proportion relative to the whole.

Customize the pie chart by specifying colors for the slices, exploding certain slices to
emphasize them, adding a title using plt.title(), and adjusting settings like the starting angle or
shadow effects for visual enhancements. To display the pie chart, use plt.show(). This will render
the pie chart in your Python environment. If you want to save the pie chart as an image file, you
can use plt.savefig('pie_chart.png') before plt.show().
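A minimal sketch, with illustrative labels and sizes lists:

import matplotlib.pyplot as plt

labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [40, 30, 20, 10]   # illustrative proportions

plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart Example')
plt.show()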

Fig 3.3.5 : Pie chart


3.4 Pandas

Pandas is one of the most powerful and widely used Python libraries for data analytics. It is an open-source library for data manipulation and analysis, providing data structures and functions that make working with structured and time-series data intuitive and straightforward.

Fig 3.4: Pandas


3.4.1 Data Frame and Series

• Data Frame: It resembles a table or spreadsheet with rows and columns, allowing users to
work with 2-dimensional labeled data. Each column in a DataFrame can hold different data
types, such as integers, floats, strings, or even complex objects. Pandas DataFrames enable easy
handling and manipulation of data in a tabular format, resembling a SQL table or Excel
spreadsheet.
• Series: A Series is a one-dimensional labeled array that can hold any data type. It represents a
single column or row in a DataFrame and is the building block for DataFrames.

3.4.2 Features and Functionalities

• Data Loading and Input/Output: Pandas simplifies the process of loading data from various
sources like CSV, Excel, JSON, SQL databases, HTML, and more. It allows for seamless data
import/export, providing functions like read_csv(), read_excel(), to_csv(), and others.
• Cleaning and Data Preparation: Handling missing data is crucial in data analysis. Pandas
provides tools to deal with missing or inconsistent data, allowing for tasks like filling missing
values, removing duplicates, reshaping data, and transforming data structures.


• Data Manipulation and Exploration: Pandas offers a plethora of operations for data
manipulation, including indexing, slicing, filtering, merging, and grouping. With functions like
loc[], iloc[], and groupby(), users can access, filter, and aggregate data based on their
requirements.
• Time Series Analysis: Pandas is highly effective for working with time series data, offering
specialized data structures and functions for handling dates and time-related data. It facilitates
tasks like date range generation, frequency conversion, and calculating moving window
statistics.
• Integration with Visualization Libraries: While Pandas itself doesn't specialize in
visualization, it seamlessly integrates with popular visualization libraries like Matplotlib and
Seaborn. Users can plot Pandas data structures using these libraries to create various visual
representations like line plots, histograms, scatter plots, and more.

3.4.3 Example Use Cases

• Data Import: Loading a CSV file into a Pandas DataFrame using read_csv().
• Data Exploration: Displaying the first few rows of a DataFrame using head(), generating
summary statistics with describe(), and checking for missing values with isnull().sum().
• Data Manipulation: Selecting specific columns, filtering data based on conditions, grouping
data, and performing aggregation operations like mean, sum, count, etc.
• Visualization: Using Matplotlib or other libraries to visualize data within Pandas DataFrames
by creating various plots and charts. Pandas plays a pivotal role in data analysis, data
preparation, and transformation workflows, forming the backbone of numerous data-related
tasks in fields such as finance, economics, data science, and beyond. Its user-friendly interface,
extensive functionality, and seamless integration with other Python libraries make it a
fundamental tool for data professionals and analysts.
Import the library and read a CSV file:

import pandas as pd

# Reading a CSV file
data = pd.read_csv('data.csv')


3.5 Data Exploration

Pandas, a powerful Python library for data analysis, streamlines data exploration through intuitive
functions. With effortless data loading from diverse file formats, it provides immediate insights via
head() and describe() for a quick understanding of the dataset's structure and summary statistics.
Pandas' info() method offers crucial details like data types and missing values. Handling missing
data becomes simple with isnull().sum() and options to drop or fill missing values. Selecting
columns, filtering data based on conditions, and performing group-wise operations using groupby()
aid in insightful analysis. While Pandas doesn't specialize in visualization, its seamless integration
with libraries like Matplotlib enables easy generation of various plots for visual exploration, making
it an indispensable tool for thorough data understanding and preprocessing.
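A brief sketch of these exploration steps, assuming a CSV file named data.csv (file and column names are illustrative):

import pandas as pd

data = pd.read_csv('data.csv')    # load the dataset

print(data.head())                # first five rows
print(data.describe())            # summary statistics for numeric columns
data.info()                       # data types and non-null counts
print(data.isnull().sum())        # missing values per column

# Example group-wise operation (column names are hypothetical)
# print(data.groupby('category_column')['value_column'].mean())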

Fig 3.5 : Data Exploration in Python


Visualization plays a pivotal role in data exploration. Matplotlib, Seaborn, and Plotly enable
the creation of diverse plots—histograms, box plots, scatter plots, and more—to visualize
distributions, relationships between variables, identify outliers, and comprehend patterns within
the data. These visualizations aid in uncovering trends, correlations, and potential insights.
Handling missing data is another critical aspect. Visualizations like heatmaps help visualize the
distribution of missing values, guiding decisions on how to handle them through imputation or
removal.


3.6 Missing Data Handling


Handling missing data in a dataset is a critical aspect of data analysis to ensure accurate and reliable
insights. Here's an overview of approaches and techniques commonly used to address missing values
within datasets:

3.6.1 Identifying Missing Values


3.6.1.1 Dropping Missing Values: dropna() function allows dropping rows or columns with
missing values. This method is useful when the missing values don’t significantly impact the
analysis and when removing data doesn’t compromise the integrity of the dataset.
3.6.1.2 Imputation: Imputing missing values involves replacing them with estimated values.
Common techniques include:
3.6.1.2.1 Mean/Median/Mode Imputation: Filling missing values with the mean, median, or
mode of the respective column.
3.6.1.2.2 Forward/Backward Fill: In time-series data, filling missing values with the previous
or next known value.
3.6.1.2.3 Interpolation: Using interpolation techniques like linear or polynomial to
estimate missing values based on existing data points.
3.6.1.3 Custom Imputation: Employing domain knowledge or context-specific
information to impute missing values based on logical assumptions or business rules.

Fig 3.6.1: Identifying Missing Values


3.6.2 Implementation in Pandas

3.6.2.1 Drop Rows/Columns: the dropna() method drops rows or columns containing missing values.
3.6.2.2 Imputation: the fillna() method fills missing values using the desired strategy (mean, median, forward fill, etc.).
Example:

# Drop rows with any missing values

data.dropna(inplace=True)

# Impute missing values with mean


data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
Handling missing data requires a careful approach, considering the impact on analysis while
maintaining the integrity of the dataset.

3.7 Categorical Data Handling

Categorical data is a set of predefined categories or groups an observation can fall into. Categorical
data can be found everywhere. For instance, survey responses like marital status, profession,
educational qualifications, etc. However, certain problems can arise with categorical data that must
be dealt with before proceeding with any other task. This section discusses various methods to handle categorical data, looking at some problems posed by categorical data and how to handle
them.

Handling categorical data in Python involves various techniques aimed at effectively


managing non-numeric information within datasets. Categorical data represents qualitative
variables, such as colors, types, or categories, which can't be directly processed by machine learning
algorithms in their raw form. Python provides several approaches for categorical data handling. One
common method is encoding, where categorical variables are transformed into a numerical format.
One-Hot Encoding creates binary columns for each category, while Label Encoding assigns a unique
numerical label to each category. These techniques are easily implemented using libraries like
Pandas or Scikit-learn.


Ordinal categorical data, with a specific order among categories, can be encoded using
Ordinal Encoding, preserving the inherent order in the numerical representation. Dealing with
missing values in categorical data involves strategies like filling missing values with the mode
(most frequent category) or introducing a separate category for missing data. Feature engineering
is another crucial aspect, involving the creation of new features from categorical data to enhance
model performance. Techniques like extracting information from date-time data or combining
categories based on domain knowledge can be valuable.
Target encoding or mean encoding assigns the mean of the target variable for each category,
capturing relationships between categorical variables and the target in regression or classification
problems. In deep learning, category embeddings are employed to represent categorical variables
as continuous vectors, allowing models to learn meaningful representations. For high cardinality
categorical features, techniques like frequency encoding or grouping rare categories can prevent
overfitting and manage the abundance of unique categories.

• Nominal data: the categories have no weight associated with them and no specific order or distinct ranking between them, e.g., country names such as India, USA, China, UK.
• Binomial data: only two categories are present.
• Ordinal data: the categories have a weight factor or a specific order associated with them.

3.8 Numerical Data Handling

Quantitative data is the measurement of something—whether class size, monthly sales, or student
scores. The natural way to represent these quantities is numerically (e.g., 29 students, $529,392 in
sales). In this chapter, we will cover numerous strategies for transforming raw numerical data into
features purpose-built for machine learning algorithms.

Numerical data handling in Python involves managing and manipulating data that consists of
quantitative or continuous values. Python provides various techniques and libraries to effectively
handle numerical data:

• Data Cleaning: Dealing with missing values by imputation (replacing missing values with statistical measures like mean, median, or mode) or removal of rows or columns with missing data using libraries like Pandas or NumPy.

• Normalization and Scaling: Rescaling numerical features to a similar scale to avoid dominance of certain features in machine learning models.

• Outlier Detection and Treatment: Identification and handling of outliers through statistical
methods like IQR (Interquartile Range) or Z-scores, either by removing them or transforming
them to minimize their impact on the analysis.
• Feature Engineering: Creating new features from existing numerical data, such as binning or
discretization to convert continuous variables into categorical ones, or deriving new features
through mathematical operations or domain-specific knowledge.
• Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) or Singular
Value Decomposition (SVD) to reduce the dimensionality of datasets containing numerous
numerical features, aiding in visualization or improving computational efficiency.
• Handling Skewed Data: Transformation of skewed numerical data using methods like
logarithmic, square root, or Box-Cox transformations to improve the distribution of data and
meet assumptions of certain statistical models.
• Feature Scaling for Machine Learning Models: Preparing numerical features to fit machine
learning models by ensuring they are on a similar scale, preventing certain features from
dominating the learning process.

Python libraries like Pandas, NumPy, and Scikit-learn offer robust functionalities for
handling numerical data. Choosing the appropriate technique depends on the dataset characteristics,
the analysis or modeling objectives, and the specific requirements of the data analysis or machine
learning task at hand.

3.9 Label Encoding


Label Encoding in Python, using the LabelEncoder from sklearn.preprocessing, converts categorical
data into numerical form. This method is beneficial for machine learning algorithms that require
numeric input. To employ Label Encoding, create a LabelEncoder object and apply it to categorical
data. For instance, a list containing categories like 'red', 'blue', and 'green' is transformed into numeric
labels (0, 1, 2) via the fit_transform() method. It's crucial to consider that Label Encoding is ideal
for categorical data with an inherent order or ranking among categories.

However, for nominal categorical data lacking a hierarchical order, alternative techniques
like One-Hot Encoding are more suitable to avoid introducing unintended relationships. The
inverse_transform() function allows for the reversal of encoded labels back to their original
categorical representation, enhancing flexibility in data transformation workflows. While Label Encoding simplifies categorical data handling, it can introduce an artificial ordering among categories, so it should be applied carefully to nominal variables.


Fig 3.9 : Label encoding

Importing Libraries:
from sklearn.preprocessing import LabelEncoder

Creating LabelEncoder Object:
label_encoder = LabelEncoder()

Applying Label Encoding to Categorical Data:
# Example categorical data
categories = ['red', 'blue', 'green', 'green', 'red']
# Fit label encoder and transform data
encoded_labels = label_encoder.fit_transform(categories)
print(encoded_labels)                                   # [2 0 1 1 2]
# Reverse the encoding back to the original categories
print(label_encoder.inverse_transform(encoded_labels))

3.10 One Hot Encoding

One hot encoding is a technique used in machine learning when dealing with categorical
data. Categorical data represents groups or labels and isn't inherently numerical. For example,
"Color" with labels like "Red," "Green," "Blue" is categorical.

Fig 3.10: One Hot Encoding


For instance, if you have a 'Color' column with three categories 'Red', 'Green', 'Blue', after
one hot encoding, you'd have three separate columns: 'Red', 'Green', 'Blue'. If a row initially had
'Red' in the 'Color' column, it would have a '1' in the 'Red' column and '0' in the other two
columns after encoding. This conversion allows machine learning models to better understand and utilize categorical data for making predictions or classifications.

df1 = pd.get_dummies(df, columns=['Country'])
df1
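A small self-contained sketch of the same idea; the DataFrame and its 'Country' values are hypothetical:

import pandas as pd

df = pd.DataFrame({'Country': ['India', 'USA', 'China', 'India']})

# One hot encode the 'Country' column into separate binary columns
df1 = pd.get_dummies(df, columns=['Country'])
print(df1)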

3.10.1 The advantages of using one hot encoding include

3.10.1.1 It allows the use of categorical variables in models that require numerical input.
3.10.1.2 It can improve model performance by providing more information to the model
about the categorical variable.
3.10.1.3 It can help to avoid the problem of ordinality, which can occur when a categorical
variable has a natural ordering (e.g. “small”, “medium”, “large”).

3.10.2 The disadvantages of using one hot encoding include


3.10.2.1 It can lead to increased dimensionality, as a separate column is created for each
category in the variable. This can make the model more complex and slow to train.
3.10.2.2 It can lead to sparse data, as most observations will have a value of 0 in most of the
one-hot encoded columns.
3.10.2.3 It can lead to overfitting, especially if there are many categories in the variable and the
sample size is relatively small.
One-hot encoding is a powerful technique to treat categorical data, but it can lead to increased dimensionality, sparsity, and overfitting. It is important to use it cautiously and consider other methods such as ordinal encoding or binary encoding.


3.11 Standardization And Normalization

Fig 3.11 : Standardization And Normalization


3.11.1 Standardization

Standardization scales the features so that they have a mean of 0 and a standard deviation of 1. This is
done by subtracting the mean of each feature and dividing by its standard deviation.

In Python, you can use sklearn.preprocessing.StandardScaler from the scikit-learn library to


perform standardization
Why standardization is used:

3.11.1.1 To reduce the impact of the different scales of measure used in the collected data.
3.11.1.2 To bring all the numerical data onto the same scale so that all columns have a similar influence on the ML algorithms (no scaling bias).
Standardization, a fundamental data preprocessing technique, is pivotal in preparing numerical data
for analysis, modeling, and machine learning in Python. It aims to transform numerical features into
a standardized scale, ensuring a mean of 0 and a standard deviation of 1 across the dataset. This
normalization technique is widely employed to mitigate issues arising from varying scales and
different units within numerical features.

In Python, implementing standardization is streamlined, especially with libraries like Scikit-


learn and NumPy. The process involves a straightforward mathematical transformation applied to
each feature. First, the mean of the feature is subtracted from each data point to center the values around zero; each centered value is then divided by the feature's standard deviation, so that z = (x − mean) / standard deviation.


Scikit-learn's StandardScaler class encapsulates this process efficiently. Initializing and


fitting the scaler on the training data using fit() computes the mean and standard deviation of each
feature. Then, the transform() function applies the transformation to the training and test datasets,
maintaining consistency in scaling between them. This step is crucial to prevent information leakage
and ensure that both datasets are standardized using the same parameters. Standardization offers
numerous advantages, particularly in machine learning. Algorithms sensitive to feature scaling, such
as SVMs, KNN, or PCA, benefit significantly from standardized data. By bringing features onto a
common scale, standardization prevents certain features from disproportionately influencing model
training due to their scales or units. This prevents biases in the model's learning process, enabling
more reliable and robust model performance.
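A minimal sketch of this workflow with scikit-learn's StandardScaler, using illustrative training and test arrays:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # illustrative data
X_test = np.array([[2.5, 350.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learn mean and std from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the same parameters for the test set

print(X_train_scaled.mean(axis=0))         # approximately [0. 0.]
print(X_train_scaled.std(axis=0))          # approximately [1. 1.]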

Moreover, standardization is beneficial when features have different ranges or distributions.


It addresses the issue of varying scales, ensuring that no feature dominates simply because of its
larger magnitude. It aids in faster convergence of gradient-based optimization algorithms by
providing a more balanced landscape for model training. Additionally, it facilitates the
interpretability of model coefficients, making comparisons between feature importance more
straightforward. However, it's essential to note that standardization might not always be the ideal
choice for every dataset or model.

In some cases, alternative scaling methods or normalization techniques like Min-Max


scaling might be more suitable, depending on the nature of the data and the requirements of the
analysis or model. In summary, standardization in Python using libraries like Scikit-learn empowers practitioners
to preprocess numerical data effectively. By centering data around zero and scaling it to a standard
deviation of 1, this technique enhances the performance and reliability of machine learning models,
especially those sensitive to feature scaling. Understanding when and how to apply standardization
is crucial for optimizing model training and ensuring robustness in data-driven analyses.
Standardization is valuable when features have different units or different ranges of values,
allowing algorithms to converge faster and preventing some features from dominating others due to
their scales. However, it doesn't necessarily guarantee improved performance for all machine
learning models or datasets, and its effectiveness depends on the specific characteristics of the data
and the algorithms being used.


3.11.2 Normalization
Normalization scales the values of features between 0 and 1. It's particularly useful when the features
have varying scales and ranges. In Python, you can use sklearn.preprocessing.MinMaxScaler from
scikit-learn to perform normalization. Both techniques are helpful for machine learning algorithms,
as they make the data more conducive for training by ensuring that different features contribute
equally and that the algorithms converge more efficiently.
3.11.2.1 StandardScaler: maintains the original distribution of the data.
3.11.2.2 MinMaxScaler: used when you want to bring the data into the range of 0 to 1.
Normalization, a pivotal data preprocessing technique in Python, involves transforming
numerical data to a common scale, typically between 0 and 1, to facilitate consistent comparisons
and effective modeling across different features. In numerous data analysis and machine learning
scenarios, normalization plays a crucial role in handling varying scales, magnitudes, and units within
numerical data.
Python, with its rich ecosystem of libraries such as Scikit-learn and NumPy, offers versatile
methods to perform normalization efficiently. The primary objective of normalization is to rescale
numerical features to a uniform range while preserving the inherent relationships and distributions
within the data. This ensures that no single feature dominates due to its scale, thereby preventing
biases during analysis or model training.

The process of normalization involves different techniques, with the most common being Min-Max scaling. This method linearly transforms data to a range between 0 and 1, where the minimum value of the feature becomes 0 and the maximum value becomes 1. Using Scikit-learn's MinMaxScaler, one can easily apply this normalization technique uniformly to numerical features across the dataset. Another normalization technique, RobustScaler, is
beneficial when the dataset contains outliers. It uses statistics that are robust to outliers by scaling
data based on the interquartile range (IQR), making it more resilient to extreme values.
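A short sketch of Min-Max normalization with scikit-learn (the values are illustrative); RobustScaler could be substituted when outliers are present:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])   # illustrative single feature

scaler = MinMaxScaler()             # scales each feature to the range [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())             # [0.  0.333...  0.666...  1.]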
Normalization offers several advantages in data analysis and machine learning workflows.
Firstly, it facilitates a fair comparison between features by bringing them to a common scale,
allowing algorithms to treat all features equally during analysis or modeling. This aids in preventing
certain features from disproportionately influencing the learning process based on their scales or
units. Furthermore, normalization contributes to faster convergence of gradient-based optimization
algorithms in machine learning models, enabling more efficient training. It also enhances the
interpretability of models by ensuring that coefficients or feature importances are comparable across
different features, simplifying the understanding of their contributions.


3.12 Tableau Desktop

Fig 3.12 : Tableau Software Logo

3.12.1 Download and Install Tableau Desktop:


Step 1 : Go to https://www.tableau.com/products/desktop in your web browser.

Fig 3.12.1 : Tableau Web browser

Step 2 : Click on the “TRY NOW” button shown in the top right corner of the website.

Fig 3.12.2 : Click on the “TRY NOW” button


Step 3 : It will redirect to the page where you need to enter your email id and click on “DOWNLOAD
FREE TRIAL” button.

Fig 3.12.3: Click on Download free trial button


Step 4 : This will start downloading the latest version of Tableau. An .exe file for Windows is downloaded, and you can see the download progress in the bottom left corner of the website.

Fig 3.12.4 : Start downloading tableau latest version

Step 5 : Open the downloaded file. Check the box to accept the terms and conditions and click on the "Install" button.

Fig 3.12.5 : Open the downloaded file


Step 6 : An optional pop-up message may be shown to get the approval of the Administrator to install the software. Click on "Yes" to approve it. Installation of Tableau Desktop on the Windows system starts.

Step 7 : Once the Tableau Desktop download and installation is completed, open the Tableau Desktop software.

Step 8 : In the Registration window

• Click on Activate Tableau and enter your license details.
• If you do not have a license, enter your credentials.
• Click on Start Trial now.

Fig 3.12.6: Activate Tableau license


Step 9 : Start Screen of Tableau is shown

Fig 3.12.7: Start Screen of Tableau


3.13 Email Spam Detection Using Python

3.13.1 Introduction

In today’s digital world, email is one of the most widely used methods of communication. However,
users often receive unwanted or irrelevant emails, commonly referred to as spam. These spam emails
may include advertisements, phishing attempts, or harmful content that can compromise user safety
and data privacy. To address this issue, this project focuses on developing an Email Spam Detection
system using Machine Learning techniques. The model is built using the Python programming
language and trained on a dataset named emails.csv. By analyzing the content of emails, the model can
automatically predict whether a given message is spam or not. This project demonstrates a basic but
effective approach to filtering out spam emails using text classification algorithms, specifically the
Naive Bayes classifier.

3.13.2 Libraries Used in the Project:

1. import pandas as pd

Used for data loading and manipulation. It helps in reading the dataset and handling data in tabular
format.

2. from sklearn.feature_extraction.text import CountVectorizer

Converts text data (emails) into numerical feature vectors so that machine learning algorithms can
process them.

3. from sklearn.model_selection import train_test_split

Splits the dataset into training and testing sets to evaluate the performance of the model.

4. from sklearn.naive_bayes import MultinomialNB

A Naive Bayes classifier suitable for classification with discrete features, like word counts in emails.

5. from sklearn.metrics import accuracy_score

Calculates how accurate the model's predictions are compared to the actual results


Fig 3.13 : Email spam sent by an unknown sender (spammer or attacker) to a recipient

3.13.3 Python Code


import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset and convert the email text into word-count vectors
dataset = pd.read_csv("/content/emails.csv")
print(dataset.head())
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['text'])

# Split into training (80%) and testing (20%) sets, then train the classifier
x_train, x_test, y_train, y_test = train_test_split(x, dataset['spam'], test_size=0.2)
model = MultinomialNB()
model.fit(x_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Predict whether a new message is spam (completed as shown in the explanation below)
def predict(message):
    message_vector = vectorizer.transform([message])
    prediction = model.predict(message_vector)
    return 'spam' if prediction[0] == 1 else 'not spam'

userMessage = input('Enter text to predict: ')
print('Prediction:', predict(userMessage))


3.13.4 Python Code with Explanations

1. import pandas as pd

This imports the pandas library, which is used for loading and handling datasets in a table format called

DataFrames.

2. from sklearn.feature_extraction.text import CountVectorizer

This imports the CountVectorizer tool, which converts text data (like email content) into numerical form

(word count vectors) that a machine learning model can understand.

3. from sklearn.model_selection import train_test_split

This imports a function to split the dataset into training and testing sets. This is essential for training the

model on one part of the data and testing its performance on unseen data.

4. from sklearn.naive_bayes import MultinomialNB

This imports the Multinomial Naive Bayes classifier, a model that works well with text data and is

commonly used for tasks like spam detection.

5. from sklearn.metrics import accuracy_score

This function calculates the accuracy of the model by comparing its predicted values with the actual labels

from the test set.

6. dataset = pd.read_csv("/content/emails.csv")

This line loads the email dataset from a CSV file located at /content/emails.csv into a DataFrame named

dataset.

7. dataset.head()

This displays the first few rows of the dataset so you can get an overview of its structure and contents.

8. vectorizer = CountVectorizer()

This initializes the CountVectorizer object, which will later be used to transform the email text data into

vectors (numerical format) for model training.

9. x = vectorizer.fit_transform(dataset['text'])

Transforms the email text column into numerical feature vectors using the CountVectorizer.


10. x_train, x_test, y_train, y_test = train_test_split(x, dataset['spam'], test_size=0.2)

Splits the data into training (80%) and testing (20%) sets. This allows the model to learn on one portion

and be evaluated on another.

11. model = MultinomialNB()

Creates an instance of the Multinomial Naive Bayes model, suitable for text classification.

12. model.fit(x_train, y_train)

Trains the model using the training data (x_train, y_train), learning patterns in the emails.

13. y_pred = model.predict(x_test)

Uses the trained model to make predictions on the test data (x_test).

14. accuracy = accuracy_score(y_test, y_pred)

Calculates the accuracy by comparing the predicted labels (y_pred) with the true labels (y_test).

15. print("Accuracy:", accuracy)

Prints the calculated accuracy score to the console.

16. def predict(message):
        message_vector = vectorizer.transform([message])
        prediction = model.predict(message_vector)
        return 'spam' if prediction[0] == 1 else 'not spam'

This function takes a new email message, converts it to vector form, uses the trained model to predict whether it is spam, and returns the result.

17. userMessage = input('Enter text to predict: ')
        prediction = predict(userMessage)
        print('Prediction:', prediction)

This takes user input from the console, runs it through the prediction function, and prints whether it is spam or not spam.



Fig 3.13.1 Accuracy of the model is clearly very good

Fig 3.13.2 Input is given and output is recognized as spam.


Fig 3.13.3 Input is given and the output is predicted as ham (not spam)


3.14 Applications

Email Service Providers

• Spam Filtering: This model helps major email services like Gmail, Outlook, and Yahoo Mail to
automatically filter and block spam or phishing messages before reaching the user's inbox.

• User Security: It enhances user safety by detecting harmful content that may contain malware or
phishing links.

Corporate Email Gateways

• Preventing Data Breach: Organizations can use this spam detection system to protect sensitive
company data by filtering out suspicious emails targeting employees.

• Reducing Bandwidth Waste: By automatically discarding spam emails, companies reduce unnecessary network load and save server bandwidth.

3.15 Conclusion

The email spam detection system is a highly practical and efficient application of machine learning in
cybersecurity. By leveraging natural language processing and classification techniques like Naive
Bayes, this model helps in automatically identifying and filtering out spam messages. Its
implementation ensures secure, clean, and reliable email communication for both individuals and large-
scale organizations.


Chapter 4

OUTCOMES FROM THE INTERNSHIP

Participating in an Applied Data Analytics internship presents an unparalleled opportunity for individuals aspiring to delve deep into the world of data-driven insights and analytics. This
immersive experience provides a multifaceted platform to acquire and sharpen skills, forge
meaningful connections, and lay the foundation for a promising career in this dynamic field.
Central to the internship experience is the hands-on application of theoretical knowledge to real-
world scenarios. Interns are exposed to a myriad of tools, methodologies, and techniques essential
for data analysis.

The hallmark of an Applied Data Analytics internship lies in its project-based nature.
Interns are tasked with projects that mirror authentic challenges encountered in the industry. These
projects serve as laboratories for innovation and problem-solving. They offer a hands-on
experience where interns can experiment, apply theoretical concepts, and derive actionable
insights from datasets. Successful completion of these projects not only demonstrates proficiency
but also showcases adaptability and the ability to thrive in a dynamic, data-centric environment.

Beyond skill development, internships present a unique networking opportunity. Engaging with seasoned professionals offers invaluable insights into industry practices and nuances. These interactions foster mentorship possibilities and establish connections that could potentially shape the trajectory of one's career. Through these relationships, interns gain practical insights into the real-world application of data analytics across diverse sectors and industries.

Moreover, an internship in applied data analytics is a catalyst for personal and professional
growth. It provides a panoramic view of how data fuels decision-making processes within
businesses. Exposure to industry standards and practices nurtures a comprehensive understanding
of the role of analytics in driving strategic initiatives and shaping organizational outcomes. An
internship in this domain significantly enhances one's employability. The experience serves as a
pivotal element on resumes, distinguishing candidates in a competitive job market. Successful
completion of an internship not only demonstrates technical proficiency but also showcases a
candidate's adaptability, problem-solving acumen, and capacity to thrive in a data-centric
environment.


Chapter 5

RESULTS & CONCLUSION

During my internship focused on "Applied Data Analytics," the primary goal was to leverage data-
driven methodologies to derive valuable insights and support informed decision-making. The initial
phase involved meticulous data collection and preparation, incorporating diverse datasets and
addressing challenges associated with data quality and consistency. Subsequently, the exploratory
data analysis (EDA) phase unearthed compelling patterns, trends, and unexpected findings,
providing a comprehensive understanding of the dataset.

The heart of the project lay in the application of various data modeling techniques. Machine
learning and statistical models were deployed to extract meaningful information from the data.
Evaluation metrics were employed to assess the performance of each model, identifying those that
proved most effective while acknowledging inherent limitations. The results highlighted key
findings that directly aligned with the project's objectives, offering actionable insights for
stakeholders.

In conclusion, the internship not only met but exceeded its objectives by delivering tangible
and applicable outcomes. The practical implications of the data analytics findings were discussed,
emphasizing their potential impact on real-world scenarios and decision-making processes. The
experience provided valuable lessons, including overcoming challenges and refining analytical
skills. Looking ahead, recommendations for future work were outlined, suggesting areas for further
exploration and improvements in the data analytics methodology. The success of the internship
owes gratitude to the collaborative efforts of individuals and organizations involved in the project.


Chapter 6
Monthly Report

6.1 Month-1 Report: February 2025


6.1.1 Introduction
The training began with an overview of three key areas: Python programming, Artificial
Intelligence (AI), and Data Science. These domains form the foundation of modern computational
and analytical fields.
Python Programming
Python is a high-level, interpreted programming language known for its simplicity, readability, and
wide range of libraries. It has become the preferred language for AI and data science applications due
to its strong community support and extensive libraries like NumPy, Pandas, Matplotlib, TensorFlow,
and Scikit-learn. The session covered the basics of Python syntax, its importance, and its role in real-
world problem-solving.
Artificial Intelligence (AI)
AI involves creating systems that can simulate human intelligence processes, including learning,
reasoning, and self-correction. The session introduced the fundamental concepts of AI, such as
machine learning, deep learning, and neural networks. Real-life applications like recommendation
systems, chatbots, and self-driving cars were discussed to highlight the significance of AI in today's
world.
Data Science
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to
extract insights and knowledge from structured and unstructured data. The introduction covered the
data science workflow — including data collection, cleaning, analysis, visualization, and
interpretation. It also emphasized the importance of data-driven decision-making in various
industries. This introductory session laid a solid foundation for the technical topics that followed in
subsequent days.


6.1.2 Python Basics, Variables, Comments

The second day of the training focused on building foundational programming skills in Python by
introducing two essential concepts: variables and comments.

Variables in Python: Variables are used to store data values in a program. Unlike some other programming languages, Python does not require an explicit declaration of variable types, as it uses dynamic typing. This means a variable's type is determined by the value assigned to it.

Comments in Python: Comments are used to explain code, making it easier to understand and maintain. They are ignored by the Python interpreter and are an essential tool for documenting code.
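A short illustrative snippet (the variable names and values are hypothetical):

count = 10            # an int, inferred from the assigned value
name = "Python"       # a str; no type declaration is needed
count = count + 5     # variables can be reassigned freely
# This is a comment: the interpreter ignores this line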

6.1.3 Data Types, Strings, Slicing


On the third day of training, the focus was on understanding Python Data Types and exploring
String manipulation using slicing techniques. These are fundamental concepts that play a crucial
role in data handling and preprocessing tasks in both Artificial Intelligence and Data Science
applications.
Python Data Types
Data types define the kind of value a variable can hold. Python has several built-in data types,
categorized as follows:
⚫ Numeric Types: int, float, complex
⚫ Text Type: str (string)
⚫ Sequence Types: list, tuple, range
⚫ Mapping Type: dict
⚫ Set Types: set, frozenset
⚫ Boolean Type: bool
⚫ Binary Types: bytes, bytearray, memoryview

Understanding data types is essential for memory management, operations, and choosing the
appropriate structures for storing and processing data.


Strings in Python
Strings are sequences of Unicode characters and are one of the most commonly used data types in
Python. They are immutable, meaning they cannot be changed after creation.
String Slicing
Slicing is a technique used to extract a portion (substring) of a string using indexing. Python
supports both positive and negative indexing.
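A small slicing example (the sample string is chosen only for illustration):

text = "Data Science"
print(text[0:4])    # 'Data'    -> characters at positions 0 to 3 (end index excluded)
print(text[-7:])    # 'Science' -> negative indices count from the end
print(text[::2])    # every second character of the string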

6.1.4 Strings – Looping Statements, If–Else Statements


Loops are used to execute a block of code repeatedly as long as a certain condition is met. Two
primary types of loops in Python were discussed:

⚫ for Loop: Used for iterating over a sequence (such as a string, list, or range).
⚫ while Loop: Repeats a block of code as long as a specified condition is True.
⚫ If–Else Statements: Conditional statements enable decision-making in code by executing different blocks based on Boolean expressions. A short example combining these constructs follows this list.
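A minimal sketch combining a for loop, a while loop, and an if–else block (the values are illustrative):

word = "spam"
for ch in word:               # for loop: iterate over a sequence
    print(ch)

count = 0
while count < 3:              # while loop: repeat while the condition is True
    count += 1

if count >= 3:                # if-else: choose between two blocks
    print("Counting finished")
else:
    print("Still counting")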
6.1.5 Strings – String Methods, Encoding and Decoding
Python provides a wide range of built-in methods to manipulate strings efficiently. These methods
do not change the original string (since strings are immutable) but return new modified strings.
Common String Methods Covered:
⚫ lower()
⚫ upper()
⚫ strip()
⚫ replace()
⚫ find()
⚫ split()
Encoding and Decoding
Encoding is the process of converting a string into a byte representation, which is necessary for file
handling, web communication, and machine learning tasks. Decoding is the reverse process, converting bytes back into a readable string format.
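A brief illustration of chained string methods and of encoding/decoding (the sample text is arbitrary):

message = "  Hello, World!  "
cleaned = message.strip().lower().replace("world", "python")
print(cleaned)                      # 'hello, python!'
print(cleaned.find("python"))       # index where the substring starts
print(cleaned.split(","))           # ['hello', ' python!']

encoded = "café".encode("utf-8")    # encoding: str -> bytes
decoded = encoded.decode("utf-8")   # decoding: bytes -> str
print(encoded, decoded)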


6.1.6 Lists, Sets, Tuples, Dictionary


Lists: The session began with an introduction to lists, a fundamental data structure in Python. Lists are ordered, mutable, and allow duplicate elements. They are widely used for storing collections of data and manipulating sequences in data science workflows.
Tuples are immutable sequences used to store fixed collections of items.
Sets are unordered collections of unique elements used for operations like union, intersection, and difference.
A dictionary in Python is a mutable collection of key-value pairs (insertion-ordered in Python 3.7 and later). Each key must be unique and immutable (such as a string, number, or tuple), while values can be of any data type and may repeat.
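A compact example of the four structures (the values are hypothetical):

scores = [85, 92, 78]                    # list: ordered and mutable
scores.append(90)                        # lists can grow in place
point = (3, 4)                           # tuple: immutable sequence
tags = {"ai", "ml", "ai"}                # set: duplicates removed -> {'ai', 'ml'}
student = {"name": "Asha", "marks": 92}  # dict: key-value pairs
student["marks"] = 95                    # update a value via its key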

6.1.7 Practical Applications


Data Preprocessing: String encoding/decoding is widely used in cleaning raw text data in AI.
User Interaction Systems: Conditional logic and loops are essential in building chatbots and
interactive AI systems.
Automation Scripts: Python basics learned this week are applicable in automating tasks and
building intelligent workflows.

6.1.8 Conclusion

The month successfully provided a solid grasp of Python, setting the stage for more advanced topics
in AI and Data Science. The practical knowledge of loops, conditions, and string operations ensures
that I can handle input/output systems, write clean code, and understand how data is structured and
transformed. These skills are crucial as I move forward into domains like machine learning, deep
learning, and NLP.


6.2 Month-2 Report: March 2025

6.2.1 Pandas, Matplotlib, Weather Forecast Analysis

This module introduced data manipulation and analysis using the Pandas library in Python, with a
focus on handling real-world datasets. A mini-project was conducted on Weather Forecast Analysis
to reinforce concepts through practical application.

Pandas: Introduction to data handling with Python's Pandas library, including DataFrames and Series.

Matplotlib: Basic plotting functions for data visualization (line charts, bar graphs).

Mini-Project – Weather Forecast Analysis:

⚫ Real-world dataset analysis for weather forecasting.


⚫ Use of Pandas and Matplotlib for trend visualization.
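A minimal sketch of this kind of workflow, assuming a hypothetical CSV file weather.csv with 'date' and 'temperature' columns:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weather.csv", parse_dates=["date"])   # load the dataset
print(df.head())                                        # quick look at the data
df.plot(x="date", y="temperature", kind="line", title="Temperature trend")
plt.show()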

6.2.2 Pandas, CRUD Operations

Pandas – CRUD Operations:

⚫ Create, Read, Update, and Delete operations on datasets using Pandas.


⚫ Handling DataFrames and Series for efficient data processing
⚫ Learn to perform Create, Read, Update, and Delete operations on structured data.
⚫ Techniques for data cleaning and preprocessing.
⚫ Use of functions like .loc[], .iloc[], .drop(), .fillna(), etc.
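A small sketch of these CRUD-style operations on a hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"], "score": [90, None]})  # Create
print(df.loc[0, "name"], df.iloc[0, 1])                             # Read by label / by position
df.loc[1, "score"] = 75                                             # Update a cell
df["score"] = df["score"].fillna(0)                                 # Clean missing values
df = df.drop(0)                                                     # Delete a row by index
print(df)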

6.2.3 Seaborn, House Price Prediction Project, Customer Churn Model Project,
Customer Segmentation Project

Seaborn: Advanced data visualization library built on top of Matplotlib. Covered techniques for generating attractive and informative statistical graphics, including pairplot(), heatmap(), and catplot().
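For instance, a short Seaborn sketch (it assumes internet access to fetch Seaborn's built-in 'tips' sample dataset):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                       # built-in sample dataset
numeric_cols = tips[["total_bill", "tip", "size"]]    # numeric columns only
sns.heatmap(numeric_cols.corr(), annot=True)          # correlation heatmap
plt.show()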


House Price Prediction Project:

⚫ Used regression techniques to predict housing prices based on features like location, area,
rooms, and amenities.
⚫ Focused on data preprocessing, feature engineering, model training, and evaluation using
metrics like RMSE.

Customer Churn Model Project:

Developed a classification model to predict whether a customer will leave a service provider. Used logistic regression and decision tree classifiers with confusion matrix, precision, recall, and F1-score evaluation.

Customer Segmentation Project:

⚫ Applied clustering (K-Means) to group customers based on behavior, spending habits, and
demographic details.
⚫ Visualized the clusters using PCA and Seaborn plots for interpretation.
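A toy K-Means sketch on made-up customer features (annual spend and visit frequency):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[500, 4], [2500, 20], [600, 5], [2400, 18]])   # hypothetical customers
X_scaled = StandardScaler().fit_transform(X)                 # scale features before clustering
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_scaled)
print(labels)                                                # cluster index per customer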

6.2.4 Advanced Machine Learning, Lasso and Ridge Regression

Advanced Machine Learning: Overview of ensemble methods, overfitting/underfitting challenges, regularization, and model validation.

Lasso and Ridge Regression: Introduction to regularization techniques:

⚫ Lasso (L1): Shrinks some coefficients to zero – useful for feature selection.
⚫ Ridge (L2): Penalizes large coefficients to prevent overfitting.
⚫ Compared both methods using sklearn with cross-validation.
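A brief comparison sketch on synthetic data (the alpha values and data parameters are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=0)
for model in (Lasso(alpha=1.0), Ridge(alpha=1.0)):
    scores = cross_val_score(model, X, y, cv=5)              # 5-fold cross-validation
    print(type(model).__name__, round(scores.mean(), 3))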

6.2.5 Decision Tree Regression, Extreme Gradient Boosting (XGBoost)

Decision Tree Regression:

⚫ Explained decision trees for regression tasks.


⚫ Applied to datasets with continuous target variables.


⚫ Emphasized splitting criteria, tree depth, and pruning to avoid overfitting.

Extreme Gradient Boosting (XGBoost):

⚫ Introduction to one of the most powerful ML algorithms.


⚫ Hands-on implementation using the xgboost library.
⚫ Showed how XGBoost optimizes both performance and speed.
⚫ Hyperparameter tuning with cross-validation and GridSearchCV.
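A compact sketch contrasting the two regressors on synthetic data (it assumes the xgboost package is installed; hyperparameters are illustrative):

from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=5, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)      # limited depth to curb overfitting
xgb = XGBRegressor(n_estimators=100, max_depth=4, learning_rate=0.1).fit(X_train, y_train)
print("Decision tree MSE:", mean_squared_error(y_test, tree.predict(X_test)))
print("XGBoost MSE:", mean_squared_error(y_test, xgb.predict(X_test)))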

6.2.6 Neural Networks, Optimizers, Loss functions

⚫ Neural Networks: Architecture, activation functions, forward and backward propagation.


⚫ Optimizers: SGD, Adam, RMSprop.
⚫ Loss functions for classification and regression. Explained overfitting, regularization, and
dropout. Practical demos with TensorFlow/Keras.
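A minimal Keras sketch showing how an optimizer, a loss function, and dropout are specified (the layer sizes and learning rate are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.2),                      # dropout to reduce overfitting
    tf.keras.layers.Dense(1, activation="sigmoid"),    # binary classification output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()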

6.2.7 Deep Learning Refresher, Handwritten Digit Classification using ANN

Deep Learning Refresher: A continuation and deeper insight into deep learning principles.

Handwritten Digit Classification using ANN: Built an Artificial Neural Network (ANN) to classify digits from the MNIST dataset. Steps included:

⚫ Data preprocessing and normalization.


⚫ Designing the ANN architecture using Keras.
⚫ Model training and accuracy evaluation.
⚫ Visualization of predictions using matplotlib.
⚫ Discussed overfitting and regularization techniques in ANN.
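A condensed version of the workflow described above, assuming TensorFlow/Keras (layer sizes and epoch count are illustrative):

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0           # normalize pixel values

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),           # 28x28 image -> 784-value vector
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),          # one output per digit class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))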

6.2.8 Practical Applications

Agriculture Planning: Farmers can plan crop cycles based on rainfall and temperature trends. Forecasting drought or excessive rainfall helps in choosing the right irrigation techniques.

Urban Flood Management: Authorities can analyze seasonal rainfall patterns and predict flood-prone periods. This enables better drainage system planning and early warning systems.


Energy Consumption Forecasting: Weather data (e.g., cold spells or heatwaves) affects electricity
and gas usage. Power companies can prepare supply forecasts and optimize grid usage.

6.2.9 Conclusion

This month provided hands-on exposure to practical machine learning and deep learning workflows.
The application of theoretical concepts on real datasets improved both programming and analytical
skills. Projects like weather forecasting, customer churn, and digit recognition offered real-world
relevance, reinforcing understanding of ML pipelines, data preprocessing, model tuning, and
performance evaluation.


6.3 Month-3 Report: April 2025

6.3.1 Adaptive Thresholding Techniques, Filters in Image Processing, Edge Detection

Adaptive Thresholding Techniques:

⚫ Explored local thresholding for handling variable lighting conditions in image processing.
⚫ Implemented adaptive mean and Gaussian thresholding using OpenCV.

Filters in Image Processing:

⚫ Covered basics of convolution and kernel-based filtering.


⚫ Applied smoothing (Gaussian Blur), sharpening, and edge enhancement techniques.

Edge Detection:

⚫ Studied the theory behind edge detection algorithms.


⚫ Implemented Sobel, Laplacian, and Canny Edge Detection in practice.
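A short OpenCV sketch covering the three topics (the input file name is hypothetical; threshold and kernel values are illustrative):

import cv2

img = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)     # load image in grayscale
thresh = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2) # adaptive Gaussian thresholding
blurred = cv2.GaussianBlur(img, (5, 5), 0)               # smoothing filter
edges = cv2.Canny(blurred, 100, 200)                     # Canny edge detection
cv2.imwrite("edges.jpg", edges)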

6.3.2 Object Detection, Image Classification using CNN

Object Detection:

⚫ Introduced traditional and CNN-based object detection methods.


⚫ Used bounding boxes for identifying objects within images.

Image Classification using CNN:

⚫ Trained a Convolutional Neural Network from scratch for multiclass image classification.
⚫ Used datasets like CIFAR-10 for practical implementation and evaluated model performance.
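A small CNN sketch of the kind used for CIFAR-10 classification (hyperparameters and epoch count are illustrative):

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))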


6.3.3 Transfer Learning in Image Classification, CNN Transfer Learning, Plant Disease Detection

Transfer Learning in Image Classification:

⚫ Discussed reusing pre-trained models like VGG16 and ResNet.


⚫ Covered feature extraction and fine-tuning techniques.

CNN Transfer Learning:

⚫ Practical implementation using TensorFlow and Keras.


⚫ Loaded pre-trained weights and customized dense layers for a new dataset.

Plant Disease Detection:

⚫ Applied transfer learning for plant leaf disease classification.


⚫ Employed data augmentation and trained the model on plant dataset images.
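A minimal transfer learning sketch with a frozen VGG16 base (the number of leaf-disease classes is hypothetical); fine-tuning, covered in the next section, would unfreeze some of the deeper base layers:

import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                                   # freeze the pre-trained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),      # e.g. 3 hypothetical disease classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()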

6.3.4 Fine Tuning Pre-trained Models, Transfer Learning using Keras Applications

Fine Tuning Pre-trained Models: Focused on selectively training deeper layers to enhance
performance on target tasks.

Transfer Learning using Keras Applications: Used the Keras Applications collection of pre-trained models to build a practical transfer learning-based image classification system. Implemented model deployment using a GUI-based interface.


6.3.5 Practical Applications

Medical Imaging: Enhancing MRI and CT scans to highlight anomalies like tumors using filters and
edge detection (Canny/Sobel).

Document Scanning: Removing shadows and correcting lighting inconsistencies with adaptive
thresholding to digitize handwritten or printed documents.

Robotics: Used in robots for real-time navigation and object detection in varying light environments.

Security Systems: Identifying motion or changes in surveillance feeds using edge detection.

6.3.6 Conclusion

The exploration of image processing and deep learning in Module 6.3 provided a strong foundation
in both fundamental and advanced concepts of computer vision. Beginning with adaptive thresholding
and edge detection, learners gained practical skills in manipulating image data under varying
conditions. Progressing into CNN-based classification and object detection, the module emphasized
the power of neural networks in recognizing complex visual patterns.

A major highlight was the use of transfer learning, enabling high-accuracy models to be built
efficiently by leveraging pre-trained architectures like VGG16 and ResNet. These techniques proved
especially effective for real-world tasks such as plant disease detection, where labeled data is scarce.

The inclusion of GUI-based application deployment (built around the Keras transfer learning model) demonstrated the accessibility and
deployability of AI solutions, bridging the gap between development and user interaction. Projects
like house price prediction, customer churn modeling, and image classification also fostered practical
understanding and industry relevance.

Overall, this module not only enhanced technical proficiency in tools like OpenCV,
TensorFlow, and Keras, but also cultivated the ability to apply these skills to impactful real-world
problems in healthcare, agriculture, manufacturing, and more.


6.4 Month-4 Report: May 2025

6.4.1 Large Language Models (LLMs) and Conversational AI

In recent years, Large Language Models (LLMs) and Conversational AI have revolutionized
human-computer interaction, enabling machines to understand, generate, and respond to natural
language in a human-like manner. These technologies are central to advancements in virtual
assistants, chatbots, customer support automation, and more.

Large Language Models (LLMs)

Definition:
LLMs are deep learning models trained on massive amounts of text data to understand and generate
human language. These models are built using transformer architectures and have billions (or even
trillions) of parameters.

Examples:

⚫ GPT-4 (OpenAI)
⚫ BERT (Google)
⚫ LLaMA (Meta)
⚫ PaLM (Google)

Conversational AI

Definition: Conversational AI refers to technologies that enable machines to understand, process, and
respond to human language in a conversational manner. It combines Natural Language Processing
(NLP), Machine Learning (ML), and sometimes LLMs to simulate human-like interactions.

Components:

⚫ Natural Language Understanding (NLU) – To interpret user input.


⚫ Dialog Management – To maintain context and decide system responses.
⚫ Natural Language Generation (NLG) – To generate coherent replies.
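A toy, rule-based sketch of these three components working together (purely illustrative; real conversational systems use ML/LLM models for each stage):

def understand(text):                       # NLU: extract a simple intent from user input
    return "greeting" if "hello" in text.lower() else "unknown"

def decide(intent, history):                # dialog management: track context, choose an action
    history.append(intent)
    return "greet_back" if intent == "greeting" else "clarify"

def generate(action):                       # NLG: produce the reply text
    replies = {"greet_back": "Hello! How can I help you today?",
               "clarify": "Could you rephrase that?"}
    return replies[action]

history = []
print(generate(decide(understand("Hello there"), history)))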


6.4.2 Practical Applications

Virtual Assistants

⚫ LLMs power natural conversation, understanding user queries, scheduling tasks, and providing information instantly. Used in smartphones, smart homes, and vehicles.

Healthcare Support

⚫ Medical chatbots answer patient questions, provide symptom checks, and schedule appointments.
LLMs help doctors by summarizing medical records and suggesting possible diagnoses based on
large datasets.

Customer Service Automation

⚫ LLM-powered bots are used in banking, e-commerce, and telecom to handle large volumes of
queries 24/7. They reduce workload and improve response time with natural language
understanding and generation.

Education and E-Learning

⚫ Personalized tutoring systems powered by LLMs adapt content based on student queries and progress. They can explain complex topics, generate quizzes, and help in language learning.

6.4.3 Conclusion

This month covered Large Language Models (LLMs) and Conversational AI, which are at the heart of today's intelligent systems, capable of understanding, processing, and generating human-like
text with remarkable accuracy. Their ability to adapt across domains—healthcare, education,
customer service, business, and more—makes them invaluable tools in the digital age.

Through practical projects and tools like ChatGPT, voice assistants, and AI chatbots, learners
and developers gain hands-on experience in how these models work and are deployed. The
advancements in LLMs have shifted human-computer interaction from rigid commands to fluid,
contextual conversations—paving the way for more accessible, personalized, and intelligent digital
experiences.


