Unit 5
SciPy
SciPy is a scientific computation library; the name stands for Scientific Python.
It builds on NumPy and provides additional functionality for various tasks (a brief
sketch follows this list), including:
Optimization:
Functions for finding minima and maxima of objective functions.
Integration:
Tools for numerical integration of functions and ordinary differential equations.
Interpolation:
Techniques for estimating values between known data points.
Linear Algebra:
Routines for solving linear systems and performing matrix operations.
Statistics:
A wide range of statistical tests and probability distributions.
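The snippet below is a minimal sketch of a few of these features using scipy.optimize, scipy.integrate, scipy.linalg, and scipy.stats; the sample data is made up for illustration.
import numpy as np
from scipy import optimize, integrate, linalg, stats
# Optimization: find the minimum of f(x) = (x - 3)^2
res = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print("Minimum at x =", res.x)
# Integration: integrate sin(x) from 0 to pi (exact value is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print("Integral:", value)
# Linear Algebra: solve the system Ax = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print("Solution:", linalg.solve(A, b))
# Statistics: two-sample t-test on randomly generated samples
t_stat, p_value = stats.ttest_ind(np.random.normal(0, 1, 50), np.random.normal(0.5, 1, 50))
print("t statistic:", t_stat, "p-value:", p_value)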
Jupyter
Jupyter is an open-source project that provides a web-based interactive computing
environment, allowing users to create and share documents that contain live code,
equations, visualizations, and narrative text.
Jupyter key features
• Jupyter Notebooks:
A web-based interface to programming environments for Python, Julia, R, and many
others.
• Interactive Widgets:
Jupyter supports interactive elements like sliders and buttons, making it easier to create
dynamic visualizations and user interfaces.
• Kernel Support:
Jupyter can run code in various programming languages through different kernels, with
the most popular being the IPython kernel for Python.
• Export Options:
Notebooks can be exported to various formats, including HTML, PDF, and Markdown,
making it easy to share results and documentation.
• Collaboration:
Jupyter Notebooks can be shared and collaborated on through platforms like JupyterHub
or cloud services like Google Colab.
STATSMODELS
● Statsmodels is a Python library built specifically for statistics.
● Statsmodels is built on top of NumPy, SciPy, and matplotlib, but it contains more
advanced functions for statistical testing and modeling that are not found in numerical
libraries like NumPy or SciPy.
● It complements SciPy for statistical computations, including descriptive statistics and
estimation and inference for statistical models.
Key features
● Regression Analysis: Supports ordinary least squares (OLS) and generalized linear
models (GLMs).
● Time Series: Offers methods for ARIMA and seasonal decomposition.
● Statistical Tests: Includes t-tests, ANOVA, and other hypothesis tests.
● Model Diagnostics: Tools for checking assumptions and validating models.
● Formula Interface: Allows R-like formula syntax for model specification (see the sketch after this list).
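A minimal sketch of the formula interface, fitting an OLS model to randomly generated data (the variable names x and y are made up for illustration):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Generate a small synthetic dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 2.0 * df['x'] + rng.normal(scale=0.5, size=100)
# Fit y = a + b*x using R-like formula syntax
model = smf.ols('y ~ x', data=df).fit()
print(model.summary())  # coefficients, standard errors, R-squared, t-tests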
Pandas
● Pandas is a Python library for working with data sets; it is a data manipulation
package for tabular data.
● That is, data in the form of rows and columns, also known as DataFrames.
● Pandas has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas key features
Data Structures:
● Series: One-dimensional labeled array capable of holding any data type.
● DataFrame: Two-dimensional labeled data structure, similar to a table or
spreadsheet, with columns of potentially different types.
Data Manipulation:
● Indexing and Selection: Easily access rows and columns using labels, conditions,
and various slicing methods.
● Filtering and Subsetting: Select specific data based on conditions (see the sketch below).
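A short sketch of indexing, selection, and filtering on a small hypothetical DataFrame:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
})
print(df['Name'])            # select a single column (returns a Series)
print(df.loc[0])             # select a row by label
print(df.iloc[1:3])          # select rows by position (slicing)
print(df[df['Age'] > 28])    # filter rows by a condition
print(df[['Name', 'City']])  # select a subset of columns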
What is pandas used for?
● Import datasets from databases, spreadsheets, comma-separated values (CSV)
files, and more.
● Clean datasets, for example, by dealing with missing values.
● Tidy datasets by reshaping their structure into a suitable format for analysis.
● Aggregate data by calculating summary statistics such as the mean of columns,
correlation between them, and more.
● Visualize datasets and uncover insights.
AGGREGATE FUNCTION
● The aggregate function in Pandas allows you to compute summary statistics for
groups of data within a DataFrame.
● An aggregate is a function where the values of multiple rows are grouped to form a
single summary value.
● Common aggregate functions and their descriptions:
Function Description
count() Counts the number of non-NA/null entries.
sum() Computes the sum of values.
mean() Calculates the average of values.
median() Computes the median of values.
min() Returns the minimum value.
max() Returns the maximum value.
std() Computes the standard deviation.
var() Calculates the variance.
quantile(q) Returns the value at the given quantile (0-1).
first() Returns the first value in the group.
last() Returns the last value in the group.
unique() Returns unique values in a Series.
value_counts() Returns a Series containing counts of unique values.
Example Program
import pandas as pd
data = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70],
'Quantity': [1, 2, 3, 4, 5, 6, 7]
}
df = pd.DataFrame(data)
grouped = df.groupby('Category')
result = grouped.agg(
count=('Value', 'count'),
size=('Value', 'size'),
total_sum=('Value', 'sum'),
mean_value=('Value', 'mean'),
std_dev=('Value', 'std'),
variance=('Value', 'var'),
min_value=('Value', 'min'),
max_value=('Value', 'max'),
first_value=('Value', 'first'),
last_value=('Value', 'last'),
nth_value=('Value', lambda x: x.iloc[1] if len(x) > 1 else None)
)
description = grouped['Value'].describe()
# Display results
print("Aggregate Functions Result:")
print(result)
print("\nDescription Statistics:")
print(description)
DATA OBJECTS IN PANDAS
1. Series
Definition: A one-dimensional labeled array capable of holding any data type (integers,
strings, floating-point numbers, Python objects, etc.).
Indexing: Each element in a Series has an associated index label, which can be either a
default integer index or a custom index.
Uses: Useful for representing a single column of data or a single variable.
Example
import pandas as pd
# Creating a Series with default integer index
s1 = pd.Series([10, 20, 30, 40])
print(s1)
# Creating a Series with a custom index
s2 = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])
print(s2)
2. DataFrame
Description: A two-dimensional labeled data structure with columns that can hold
different types of data (similar to a table).
Index: Contains both row and column indices.
Uses: Suitable for storing and manipulating datasets with multiple variables, akin to a
spreadsheet.
Example
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Write a Python program using Pandas to analyze a dataset. Perform the following
tasks:
Load a dataset from a CSV file into a Pandas DataFrame.
Calculate and display the mean and standard deviation of a specific numeric column.
Group the data by a categorical column and compute the average of the numeric column for each
group.
pip install pandas
import pandas as pd
# Load the dataset from a CSV file
df = pd.read_csv('data.csv')
# Display the first few rows of the DataFrame
print("Dataset:")
print(df)
# Calculate and display the mean and standard deviation of the 'Salary' column
mean_salary = df['Salary'].mean()
std_salary = df['Salary'].std()
print(f"\nMean Salary: {mean_salary}")
print(f"Standard Deviation of Salary: {std_salary}")
# Group the data by 'City' and compute the average 'Salary' for each group
average_salary_by_city = df.groupby('City')['Salary'].mean()
print("\nAverage Salary by City:")
print(average_salary_by_city)
Data Munging: Introduction to Data Munging
Data munging, sometimes called data wrangling or data cleaning, is the process of
converting and mapping unprocessed data into a different format to improve its
suitability and value for various downstream uses, including analytics. It entails
preparing raw data for analysis by cleaning, organizing, and enriching it into a readable
format.
Why is Data Munging important?
Data munging holds immense significance in the field of data analysis, playing a crucial
role in ensuring the quality and reliability of the data used for making informed decisions.
Several key aspects highlight the significance of data munging in the data
analysis process:
• Accuracy and Precision: Data munging addresses discrepancies and errors in raw
data, leading to more accurate and precise analyses. Cleaning and organizing data
ensure that the insights derived are trustworthy and dependable.
• Quality Improvement: By cleaning and preprocessing data, errors and inconsistencies
are reduced, improving overall data quality.
• Compatibility: Data munging ensures that data from different sources can be
integrated and used together effectively.
• Analysis Readiness: Properly formatted data is essential for accurate analysis and
modeling, enabling organizations to make data-driven decisions.
• Facilitation of Decision-Making: Clean and well-structured data, resulting from
effective data munging, facilitates the decision-making process. Decision-makers can
rely on accurate insights derived from trustworthy data.
Techniques and Steps in Data Munging
Data munging involves a series of techniques and steps to transform raw data into a
usable form. Here are some common techniques (a brief pandas sketch follows the list):
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in
the data. Common tasks include handling missing values, correcting data formats
(e.g., dates, numeric values), and removing duplicates.
2. Data Transformation: Data often needs to be transformed to fit the analytical
requirements. This may include converting categorical data into numerical format
(encoding), normalizing or scaling numeric data, and aggregating or summarizing
data.
3. Handling Missing Data: Techniques such as imputation (replacing missing values
with estimated ones) or deletion (removing rows or columns with missing data) are
used to handle missing data appropriately.
4. Data Integration: Combining data from multiple sources involves aligning
schemas, resolving inconsistencies, and merging datasets to create a unified view.
5. Feature Engineering: Creating new features or variables from existing data that
can enhance the predictive power of machine learning models.
6. Data Validation: Checking data integrity to ensure it meets expected standards and
business rules.
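A brief pandas sketch of the cleaning, transformation, imputation, and validation steps above, applied to a small made-up DataFrame:
import pandas as pd
raw = pd.DataFrame({
'City': ['Paris', 'Paris', 'London', None],
'Temp': ['21.5', '21.5', '18.0', '19.2'],   # numbers stored as strings
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-03']
})
# Data Cleaning: remove duplicates, correct data formats, handle missing values
clean = raw.drop_duplicates().copy()
clean['Temp'] = clean['Temp'].astype(float)      # fix the numeric format
clean['Date'] = pd.to_datetime(clean['Date'])    # fix the date format
clean['City'] = clean['City'].fillna('Unknown')  # impute the missing category
# Data Transformation: encode the categorical column
encoded = pd.get_dummies(clean, columns=['City'])
# Data Validation: a simple business-rule check
assert clean['Temp'].between(-60, 60).all(), "temperature out of range"
print(encoded)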
Data Pipeline in Machine Learning
A data pipeline in machine learning is a series of data processing steps that are
designed to transform raw data into a format suitable for training machine learning
models. It automates the workflow, ensuring that data is collected, processed, and fed
into models in a consistent and repeatable manner.
Benefits of a Data Pipeline
• Automation: Reduces manual intervention, making the workflow efficient and
reproducible.
• Scalability: Easily adapts to larger datasets or more complex data processing steps.
• Consistency: Ensures that data is processed in the same way every time, which is
crucial for model performance.
• Collaboration: Makes it easier for teams to work together, as the pipeline provides a
clear structure.
Example of a Simple Data Pipeline in Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Step 1: Data Ingestion
data = pd.read_csv('data.csv')
# Step 2: Data Cleaning (handling missing values)
data = data.ffill()  # forward-fill missing values (fillna(method='ffill') is deprecated in newer pandas)
# Step 3: Splitting the Data
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Creating a Pipeline
pipeline = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', RandomForestClassifier())])
# Step 5: Training the Model
pipeline.fit(X_train, y_train)
# Step 6: Evaluating the Model
accuracy = pipeline.score(X_test, y_test)
print(f'Model Accuracy: {accuracy:.2f}')
Write a program that loads the Iris dataset, splits it into train and test sets, and
computes the accuracy score of a pipeline on the test data.
pip install pandas scikit-learn
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Step 1: Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Create a pipeline
pipeline = Pipeline(steps=[
('scaler', StandardScaler()), # Data scaling
('classifier', RandomForestClassifier(random_state=42)) # Model training
])
# Step 4: Train the model
pipeline.fit(X_train, y_train)
# Step 5: Make predictions on the test set
y_pred = pipeline.predict(X_test)
# Step 6: Compute the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')
Output:
Model Accuracy: 1.00
7. Model Evaluation
• Description: Assess the model's performance using appropriate metrics (accuracy,
precision, recall, etc.).
• Tools:
o Scikit-learn: Use accuracy_score, classification_report, or
confusion_matrix (a brief sketch follows).
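A brief sketch of these metrics, reusing the y_test, y_pred, and iris objects from the Iris pipeline example above:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))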
Types of Matplotlib Charts
1. Line Chart
• A line chart is represented by a series of data points connected with a straight line.
• It is used to represent a relationship between two variables, X and Y, plotted on
different axes.
• Generally, line charts are used to display trends over time.
• A line chart or line graph can be created using the plot() function available in the
pyplot module.
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
# Adding title to the plot
plt.title("Line Chart")
# Adding label on the y-axis
plt.ylabel('Y-Axis')
# Adding label on the x-axis
plt.xlabel('X-Axis')
plt.show()
2. Bar Chart
• A bar chart is a graph that represents categories of data with rectangular bars whose
lengths or heights are proportional to the values they represent.
• The bar plots can be plotted horizontally or vertically.
• A bar chart describes the comparisons between the discrete categories.
• It can be created using the bar() method.
import matplotlib.pyplot as plt
import pandas as pd
# Reading the tips.csv file
data = pd.read_csv('tips.csv')
# initializing the data
x = data['day']
y = data['total_bill']
# plotting the data
plt.bar(x, y)
# Adding title to the plot
plt.title("Tips Dataset")
# Adding label on the y-axis
plt.ylabel('Total Bill')
# Adding label on the x-axis
plt.xlabel('Day')
plt.show()
3. Histogram
• A histogram is used to represent data that has been grouped into ranges (bins).
• It is a type of bar plot where the X-axis represents the bin ranges while the Y-axis
gives information about frequency.
• The hist() function is used to compute and create a histogram of x.
import matplotlib.pyplot as plt
import pandas as pd
# Reading the tips.csv file
data = pd.read_csv('tips.csv')
# initializing the data
x = data['total_bill']
# plotting the data
plt.hist(x)
# Adding title to the plot
plt.title("Tips Dataset")
# Adding label on the y-axis
plt.ylabel('Frequency')
# Adding label on the x-axis
plt.xlabel('Total Bill')
plt.show()
4. Scatter Plot
• Scatter plots are ideal for visualizing the relationship between two continuous
variables.
• A scatter plot uses dots to represent values for two different numeric variables.
• The position of each dot on the horizontal and vertical axis indicates values for an
individual data point.
import matplotlib.pyplot as plt
# data to display on plots
x = [3, 1, 3, 12, 2, 4, 4]
y = [3, 2, 1, 4, 5, 6, 7]
# This will plot a simple scatter chart
plt.scatter(x, y, label="A")
# Adding legend to the plot
plt.legend()
# Title to the plot
plt.title("Scatter chart")
plt.show()
5. Box Plot
• A box plot, also known as a box-and-whisker plot, provides a visual summary of the
distribution of a dataset.
• It represents key statistical measures such as the median, quartiles, and potential
outliers in a concise and intuitive manner.
• Box plots are particularly useful for comparing distributions across different groups or
identifying anomalies in the data.
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize =(10, 7))
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
6. Pie Chart
• A Pie Chart is a circular statistical plot that can display only one series of data.
• The area of the chart is the total percentage of the given data.
• The area of slices of the pie represents the percentage of the parts of the data. The
slices of pie are called wedges.
import matplotlib.pyplot as plt
# data to display on plots
x = [1, 2, 3, 4]
# this will explode the 1st wedge
# i.e. will separate the 1st wedge
# from the chart
e = (0.1, 0, 0, 0)
# This will plot a simple pie chart
plt.pie(x, explode=e)
# Title to the plot
plt.title("Pie chart")
plt.show()
2. Line Plot
A line plot isn't typically used for this dataset, but we can visualize trends by plotting
petal length against sepal length for each species. The examples in this part assume the
Iris dataset has been loaded into a DataFrame named iris_data, as in the snippet below.
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Iris dataset (columns: sepal_length, sepal_width, petal_length,
# petal_width, species)
iris_data = sns.load_dataset('iris')
plt.figure(figsize=(10, 6))
for species in iris_data['species'].unique():
    subset = iris_data[iris_data['species'] == species]
    plt.plot(subset['sepal_length'], subset['petal_length'], marker='o',
             linestyle='', label=species)
plt.title('Petal Length vs. Sepal Length')
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.legend()
plt.grid()
plt.show()
3. Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=iris_data, x='sepal_length', y='sepal_width', hue='species',
style='species', markers=["o", "s", "D"])
plt.title('Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.grid()
plt.show()
4. Bar Chart
plt.figure(figsize=(10, 6))
mean_petal_length = iris_data.groupby('species')['petal_length'].mean()
mean_petal_length.plot(kind='bar', color='orange')
plt.title('Average Petal Length by Species')
plt.xlabel('Species')
plt.ylabel('Average Petal Length')
plt.grid()
plt.show()
5. Histogram
plt.figure(figsize=(10, 6))
plt.hist(iris_data['petal_width'], bins=15, color='purple', alpha=0.7)
plt.title('Distribution of Petal Width')
plt.xlabel('Petal Width')
plt.ylabel('Frequency')
plt.grid()
plt.show()
6. Box Plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='species', y='petal_length', data=iris_data)
plt.title('Box Plot of Petal Length by Species')
plt.xlabel('Species')
plt.ylabel('Petal Length')
plt.grid()
plt.show()
Bokeh is a versatile and powerful Python library for creating interactive visualizations.
It’s designed to enable users to create elegant and informative graphics that can be
embedded in web applications.
Key Features of Bokeh
• Interactivity: Easily add widgets like sliders, dropdowns, and buttons to create
interactive plots.
• Web-Based: Bokeh visualizations can be embedded in web applications or exported
as standalone HTML files.
• Rich Output: Supports various types of visualizations including line plots, scatter
plots, bar charts, heatmaps, and more.
• Customizable: Offers extensive customization options for aesthetics and
functionality.
Interactive Scatter Plot with Bokeh
pip install bokeh
from bokeh.plotting import figure, show, output_file
from bokeh.models import ColumnDataSource, Slider, HoverTool
from bokeh.layouts import column
import numpy as np
# Step 1: Generate sample data
N = 100
x = np.random.rand(N) * 10
y = np.random.rand(N) * 10
sizes = np.random.randint(5, 50, size=N)
# Create a ColumnDataSource
source = ColumnDataSource(data=dict(x=x, y=y, size=sizes))
# Step 2: Prepare output file
output_file("interactive_scatter_plot.html")
# Step 3: Create a figure
p = figure(title="Interactive Scatter Plot", tools="")
# Add scatter renderer
scatter = p.scatter('x', 'y', source=source, size='size', alpha=0.6)
# Add a hover tool
hover = HoverTool()
hover.tooltips = [("X", "@x"), ("Y", "@y")]
p.add_tools(hover)
# Step 4: Create a slider for point size
size_slider = Slider(start=5, end=50, value=10, step=1, title="Point Size")
# Update function for the slider (note: a Python callback like this only runs
# under a Bokeh server, e.g. `bokeh serve script.py`; in the standalone HTML
# output the slider renders but the callback will not fire)
def update_size(attr, old, new):
    source.data['size'] = np.random.randint(5, size_slider.value + 1, size=N)
# Link the slider to the update function
size_slider.on_change('value', update_size)
# Step 5: Layout and show the plot
layout = column(size_slider, p)
show(layout)
Visualizations - Visual data analysis techniques
Visual data analysis techniques involve the use of graphical representations to explore,
understand, and communicate data insights. These techniques leverage visual elements to
highlight patterns, trends, relationships, and outliers within datasets.
1. Charts and Graphs
• Bar Charts: Display categorical data with rectangular bars. Useful for comparing
quantities across different categories.
• Histograms: Show the distribution of numerical data by grouping values into bins.
Ideal for visualizing frequency distributions.
• Line Charts: Connect data points with lines to illustrate trends over time. Commonly
used in time series analysis.
• Pie Charts: Represent parts of a whole. Useful for displaying proportions of
categories.
2. Statistical Visualizations
• Box Plots: Summarize data through quartiles, highlighting the median and potential
outliers. Useful for comparing distributions.
• Scatter Plots: Display relationships between two continuous variables, helping to
identify correlations and trends.
• Heatmaps: Use color gradients to represent values in a matrix format. Effective for
visualizing correlation matrices or frequency counts (see the sketch below).
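A minimal heatmap sketch, drawing the correlation matrix of the numeric Iris columns with seaborn (assumes the iris_data DataFrame loaded earlier):
import matplotlib.pyplot as plt
import seaborn as sns
corr = iris_data.drop(columns='species').corr()  # correlations between numeric columns
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Iris Features')
plt.show()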
3. Multivariate Analysis
• Pair Plots: Show scatter plots for multiple pairs of variables, allowing for a
comprehensive view of relationships (see the sketch after this list).
• 3D Scatter Plots: Visualize relationships involving three variables. Useful for
exploring complex datasets.
• Faceted Plots: Create a grid of plots based on the values of another variable, enabling
comparisons across groups.
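A short pair-plot sketch over the Iris dataset (again assuming the iris_data DataFrame from earlier):
import matplotlib.pyplot as plt
import seaborn as sns
# One scatter plot per pair of numeric variables, coloured by species
sns.pairplot(iris_data, hue='species')
plt.show()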
4. Geospatial Visualization
• Choropleth Maps: Use color shading to represent data values across geographical
regions. Effective for showing demographic or economic data.
• Point Maps: Display individual data points on a map, often used for location-based
analysis.
• Heat Maps: Show density of events over a geographical area, highlighting hotspots.
5. Interactive Visualizations
• Dashboards: Combine multiple visualizations into one interface for dynamic
analysis. Useful for monitoring key performance indicators (KPIs).
• Tooltips and Annotations: Provide additional context when hovering over data
points or elements, enhancing understanding.
• Filters and Sliders: Allow users to manipulate data views dynamically, facilitating
exploratory analysis.
6. Advanced Visualizations
• Tree Maps: Visualize hierarchical data using nested rectangles, representing
proportions within a category.
• Sankey Diagrams: Show flow and relationships between categories using arrows
whose width indicates the flow quantity.
• Network Graphs: Represent relationships between entities using nodes and edges,
useful for social network analysis.
7. Data Storytelling
• Narrative Visualizations: Combine visual elements with narrative techniques to tell
a data-driven story, enhancing engagement and understanding.
Interaction techniques
Interaction techniques in visual data analysis are designed to enhance user engagement
and facilitate a deeper exploration of data. These techniques allow users to interact with
visualizations in various ways, making it easier to analyze and interpret complex datasets.
1. Tooltips
Tooltips are small pop-up boxes that display additional information when a user hovers
over a data point or visual element. They provide context-specific details without
cluttering the visualization and can show values, labels, or descriptions that enhance
understanding. Tooltips are typically implemented using mouseover events in
visualization libraries, allowing for dynamic content display.
2. Hover Effects
Hover effects involve changing the appearance of data points or elements when the
mouse hovers over them, such as changing color, size, or opacity. They improve
visibility and focus attention on specific data elements, making it clear which data point
is being examined. Hover effects are often achieved with CSS styling or JavaScript
event listeners that trigger visual changes on hover.
3. Filters and Selection
Filters allow users to narrow down the data displayed based on specific criteria (e.g.,
categories, date ranges, or numeric values). They enable targeted analysis, helping users
focus on relevant data subsets and reducing information overload.
4. Sliders
Sliders are graphical controls that allow users to adjust numerical values or date ranges
dynamically. They provide a way to explore how changes in parameters affect the data
visualization in real time (a minimal sketch follows).
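A minimal slider sketch using ipywidgets inside a Jupyter notebook (assumes the iris_data DataFrame loaded earlier; the helper name plot_hist is made up for illustration):
import matplotlib.pyplot as plt
from ipywidgets import interact
def plot_hist(bins=10):
    # Redraw the petal-length histogram whenever the slider moves
    plt.hist(iris_data['petal_length'], bins=bins, color='teal')
    plt.xlabel('Petal Length')
    plt.ylabel('Frequency')
    plt.show()
# interact() builds an integer slider from the (min, max) tuple
interact(plot_hist, bins=(5, 50))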