Data Analytics Using Python
ELC Activity
Thapar Institute of Engineering and Technology
By:
Dr. Aditi Sharma
Assistant Professor
Python
Python is a high-level programming language that is also widely
used as a scripting language.
It is easy to learn because of its simple syntax.
It can be used for simple tasks as well as complex ones such as
machine learning.
Built-in data types include numeric and Boolean types (int, float,
bool), strings, lists, tuples, sets, and dictionaries.
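The data types listed above can be illustrated with a short sketch. All variable names and values here are made up for illustration, and "primitive" is read as the basic numeric and Boolean types.

```python
# Illustrative values only; the names are not from the slides.
count = 42                        # int
price = 19.99                     # float
is_valid = True                   # bool
name = "Python"                   # str: immutable sequence of characters
scores = [88, 92, 79]             # list: ordered and mutable
point = (3, 4)                    # tuple: ordered and immutable
tags = {"ai", "ml", "ai"}         # set: unordered, duplicates are dropped
ages = {"Alice": 25, "Bob": 30}   # dict: key-value mapping

scores.append(95)                 # lists can grow in place
ages["Charlie"] = 35              # dictionaries accept new keys

for value in (count, price, is_valid, name, scores, point, tags, ages):
    print(type(value).__name__, value)
```

Lists and dictionaries are changed in place here, whereas strings and tuples would need a new object for any change.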
Applications of Python for AI
Data Preprocessing: Python libraries like Pandas and NumPy are widely used for cleaning,
transforming, and preprocessing raw data into a suitable format for machine learning models.
Machine Learning Libraries: Python offers powerful machine learning libraries such as
scikit-learn, TensorFlow, and PyTorch. Scikit-learn provides simple and efficient tools for data
mining and data analysis, while TensorFlow and PyTorch are deep learning frameworks that
allow users to build and train complex neural network models.
Natural Language Processing (NLP): Python's NLTK (Natural Language Toolkit) and spaCy
libraries are extensively used for processing and analyzing human language data. These
libraries are crucial for applications like sentiment analysis, language translation, and
chatbots.
Computer Vision: Libraries like OpenCV and Dlib are widely used for computer vision tasks
such as image and video analysis, facial recognition, object detection, and image segmentation.
Reinforcement Learning: Python is often used in reinforcement learning
applications, and libraries like OpenAI Gym provide environments for
developing and testing reinforcement learning algorithms.
Big Data Processing: Python can be integrated with big data processing
frameworks such as Apache Hadoop and Apache Spark for large-scale
machine learning tasks on big datasets.
Web Development and APIs: Python frameworks like Flask and Django
are used to deploy machine learning models as web applications or APIs,
allowing easy integration of machine learning functionalities into web
services.
Automated Machine Learning (AutoML): Python has several AutoML libraries
like TPOT and Auto-sklearn that automate the process of selecting the best machine
learning model and hyperparameters for a given dataset, making it easier for non-
experts to work on machine learning projects.
Data Visualization: Libraries like Matplotlib, Seaborn, and Plotly enable data
visualization, helping data scientists and researchers to understand complex
patterns and relationships in data, which is crucial for feature selection and model
evaluation.
Predictive Analytics: Python is used for building predictive models in various
domains such as finance, healthcare, marketing, and sales, helping businesses make
data-driven decisions.
Python Libraries
• NumPy
• Pandas
• SciPy
• Scikit-Learn
• Matplotlib
• Seaborn
NumPy
NumPy is a powerful Python library for numerical computing.
It provides support for large, multi-dimensional arrays and matrices,
along with a collection of high-level mathematical functions to operate
on these arrays.
NumPy is a fundamental package for scientific computing in Python
and is widely used in fields such as physics, engineering, data
science, and machine learning.
Arrays: NumPy provides a homogeneous, multidimensional array type
(ndarray) whose size is fixed at creation.
import numpy as np
# Creating a 1D array
a = np.array([1, 2, 3, 4, 5])
# Creating a 2D array
b = np.array([[1, 2, 3], [4, 5, 6]])
# Element-wise operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b  # [5, 7, 9]
NumPy Functions
• Shape and Dimensions
• Indexing and Slicing
• Universal Functions
• Linear Algebra
• Scientific Computing
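The topics above can be sketched on one small array. The array values and variable names are chosen for illustration only.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

# Shape and dimensions
shape = m.shape          # (rows, columns)
ndim = m.ndim            # number of axes

# Indexing and slicing
element = m[1, 2]        # row 1, column 2
first_col = m[:, 0]      # every row, column 0

# Universal functions (ufuncs) operate element-wise
roots = np.sqrt(m)

# Linear algebra
product = m @ m.T        # matrix product of a (2, 3) and a (3, 2) array
sq = np.array([[2.0, 0.0], [0.0, 4.0]])
inverse = np.linalg.inv(sq)

print(shape, ndim, element, first_col)
print(product)
print(inverse)
```

The `@` operator performs matrix multiplication, while `*` on arrays multiplies element by element.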
Pandas
Pandas is a popular open-source data analysis and manipulation
library for Python.
It provides easy-to-use data structures such as Series and
DataFrame, along with data analysis tools for cleaning,
transforming, and analyzing structured data.
Pandas is widely used in data science, machine learning, and
finance for handling and analyzing data efficiently.
Series & DataFrame
A Series is a one-dimensional labeled array that can hold any data
type. It is like a column in a DataFrame or a single attribute of an
object.
import pandas as pd
# Creating a Series
s = pd.Series([1, 3, 5, 6, 8])
A DataFrame is a two-dimensional labeled data structure with
columns that can be of different data types. You can think of it as
a spreadsheet, a table in a relational database, or a dictionary of
Series objects.
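To make the "dictionary of Series objects" view concrete, here is a minimal sketch. The labels, names, and values are invented for illustration.

```python
import pandas as pd

# A Series with explicit labels instead of the default 0..n-1 index
s = pd.Series([1, 3, 5, 6, 8], index=["a", "b", "c", "d", "e"])
third = s["c"]           # look a value up by its label

# Two Series that share an index become the columns of a DataFrame
height = pd.Series([1.6, 1.8], index=["Alice", "Bob"])
age = pd.Series([25, 30], index=["Alice", "Bob"])
people = pd.DataFrame({"height": height, "age": age})

print(people)
print(type(people["age"]).__name__)   # each column is itself a Series
```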
DataFrame
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
DataFrame
A DataFrame can be treated as an ordered collection of columns. Each
column can have a different data type, and the DataFrame has both row
and column indices.
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
print(frame)
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
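The per-column data types and the two indices can be inspected directly. This is a small sketch that rebuilds the same table; the exact dtype names that are printed can vary between pandas versions.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

print(frame.dtypes)    # one data type per column
print(frame.index)     # row index (default integer labels)
print(frame.columns)   # column index
```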
DataFrame – Retrieving a Column
A column in a DataFrame can be retrieved as a Series, either with
dict-like notation or by attribute access.
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
print(frame['state'])
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
print(frame.state)
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
DataFrame – Fetching Rows
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN
print(frame2.loc['A'])
year     2000
state    Ohio
pop       1.5
debt      NaN
Name: A, dtype: object
print(frame2.loc[['A', 'B']])
   year state  pop debt
A  2000  Ohio  1.5  NaN
B  2001  Ohio  1.7  NaN
print(frame2.loc['A':'E', ['state', 'pop']])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9
print(frame2.iloc[:, 1:3])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9
print(frame2.iloc[1:3])
   year state  pop debt
B  2001  Ohio  1.7  NaN
C  2002  Ohio  3.6  NaN
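One detail worth noting when fetching rows is that label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. The sketch below assumes the same data with a row index of 'A' to 'E'.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])

by_label = frame2.loc['B':'D']    # label slice: the end label 'D' is included
by_position = frame2.iloc[1:3]    # position slice: position 3 is excluded

print(list(by_label.index))
print(list(by_position.index))
```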
DataFrame – Modifying Columns
frame2['debt'] = 0
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     0
C  2002    Ohio  3.6     0
D  2001  Nevada  2.4     0
E  2002  Nevada  2.9     0
frame2['debt'] = range(5)
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     1
C  2002    Ohio  3.6     2
D  2001  Nevada  2.4     3
E  2002  Nevada  2.9     4
val = pd.Series([10, 10, 10], index=['A', 'C', 'D'])
frame2['debt'] = val
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5  10.0
B  2001    Ohio  1.7   NaN
C  2002    Ohio  3.6  10.0
D  2001  Nevada  2.4  10.0
E  2002  Nevada  2.9   NaN
• Rows or individual elements can be modified similarly, using loc or
iloc.
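A minimal sketch of modifying individual elements and a whole row with `loc` and `iloc`, assuming the same example data. The new values are invented for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])

# Single element by row label and column label
frame2.loc['A', 'pop'] = 1.6

# Single element by position: row 1 is 'B', column 2 is 'pop'
frame2.iloc[1, 2] = 1.8

# Whole row by label: one value per column, in column order
frame2.loc['E'] = ['Nevada', 2003, 3.1]

print(frame2)
```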
DataFrame – Removing Columns
del frame2['debt']
print(frame2)
   year   state  pop
A  2000    Ohio  1.5
B  2001    Ohio  1.7
C  2002    Ohio  3.6
D  2001  Nevada  2.4
E  2002  Nevada  2.9
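`del` removes the column from the DataFrame in place. As an alternative sketch, `DataFrame.drop` returns a new DataFrame and leaves the original unchanged; the variable name `trimmed` is illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['A', 'B', 'C', 'D', 'E'])

# drop() builds a new DataFrame without the listed columns
trimmed = frame2.drop(columns=['debt'])

print(list(frame2.columns))    # original still has 'debt'
print(list(trimmed.columns))
```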
Data Reading/Writing
Pandas provides functions to read data from various file formats,
such as CSV files, Excel files, and SQL databases, and to write data
back out to these formats.
data = pd.read_csv('data.csv')
data.to_csv('output.csv', index=False)
pd.read_excel('myfile.xlsx', sheet_name='sheet1',
              index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5', 'df')
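Because the file names on this slide refer to files that may not exist on your machine, here is a self-contained sketch that writes a small table to a temporary CSV file and reads it back. The table contents and the temporary path are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.csv')
    df.to_csv(path, index=False)    # index=False keeps row labels out of the file
    loaded = pd.read_csv(path)      # column types are inferred again on read

print(loaded)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then reappears as data when the file is read back.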
Data Cleaning and Preprocessing
Pandas provides functions for handling missing data, dropping
unnecessary columns, filling missing values, and performing other
data cleaning tasks.
# Handling missing data
df.dropna()  # Drop rows with missing values
df.fillna(value=0)  # Fill missing values with 0
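A small sketch of both calls on a made-up table with one missing score. Note that `dropna` and `fillna` return new DataFrames and leave `df` unchanged unless the result is assigned back. Filling with the column mean is shown as one further option, not as a recommendation from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'score': [90.0, np.nan, 75.0]})

dropped = df.dropna()                # rows with any missing value removed
filled = df.fillna(value=0)          # missing values replaced by 0
mean_filled = df.fillna({'score': df['score'].mean()})   # per-column fill value

print(dropped)
print(filled)
print(mean_filled)
```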
Projects
• Fraud Detection
• Healthcare
• Automated Machine Learning
• Social Media Analytics
• Voice Recognition
• Customer Segmentation
• Text Summarization
• Handwritten Data Recognition
• Object Identification
• Emotion Analysis
• Sentiment Analysis
• Recommender System
• Game Development
Thank You