Data Analytics Lab Manual
AM605PC: DATA ANALYTICS LAB
B.Tech. III Year II Sem.
Course Objectives:
To explore the fundamental concepts of data analytics.
To learn the principles and methods of statistical analysis.
To discover interesting patterns, analyze supervised and unsupervised models,
and estimate the accuracy of the algorithms.
To understand the various search methods and visualization techniques.
Course Outcomes:
Understand linear regression and logistic regression
Understand the functionality of different classifiers
Implement visualization techniques using different graphs
Apply descriptive and predictive analytics for different types of data
List of Experiments:
1. Data Preprocessing
a. Handling missing values
b. Noise detection and removal
c. Identifying data redundancy and elimination
2. Implement any one imputation model
3. Implement Linear Regression
4. Implement Logistic Regression
5. Implement Decision Tree Induction for classification
6. Implement Random Forest Classifier
7. Implement ARIMA on Time Series data
8. Object segmentation using hierarchical based methods
9. Perform Visualization techniques (types of maps - Bar, Column, Line, Scatter,
3D Cubes etc)
10. Perform Descriptive analytics on healthcare data
11. Perform Predictive analytics on Product Sales data
12. Apply Predictive analytics for Weather forecasting
TEXT BOOKS:
1. Student’s Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann
Publishers.
REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman
(Milliway Labs), Jeffrey D. Ullman (Stanford Univ.).
SOFTWARE:
1. Python IDLE 2. PyCharm 3. Visual Studio
1. Data Preprocessing
a. Handling missing values
b. Noise detection and removal
c. Identifying data redundancy and elimination
a. Handling missing values
PROGRAM:
import pandas as pd
import numpy as np
# Sample data with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, np.nan, 5],
'C': ['foo', 'bar', 'baz', np.nan, 'qux']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Method 1: Remove rows with any missing values
df_dropna = df.dropna()
print("\nDataFrame after removing rows with any missing values:")
print(df_dropna)
# Method 2: Fill missing values with a specific value (e.g., 0)
df_fillna = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_fillna)
# Method 3: Fill missing values with the mean of the column (numerical columns only)
df_mean = df.copy()
df_mean['A'] = df_mean['A'].fillna(df_mean['A'].mean())
df_mean['B'] = df_mean['B'].fillna(df_mean['B'].mean())
print("\nDataFrame after filling missing values with the mean of the column:")
print(df_mean)
# Method 4: Fill missing values using forward fill
df_ffill = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
print("\nDataFrame after forward fill:")
print(df_ffill)
# Method 5: Fill missing values using backward fill
df_bfill = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas
print("\nDataFrame after backward fill:")
print(df_bfill)
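# Method 6: Interpolation for numerical columns (column 'C' holds text and is left as-is)
df_interp = df.copy()
num_cols = df_interp.select_dtypes(include='number').columns
df_interp[num_cols] = df_interp[num_cols].interpolate()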
print("\nDataFrame after interpolation:")
print(df_interp)
output:
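Before choosing among the methods above, it can help to quantify how much data is missing in each column. The following is a minimal sketch on the same sample data; the helper name `missing_summary` is an assumption introduced for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(1)
    return pd.DataFrame({'missing': counts, 'percent': percent})


# Same sample data as in the program above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': ['foo', 'bar', 'baz', np.nan, 'qux'],
})

summary = missing_summary(df)
print(summary)
```

Columns with only a few gaps are often reasonable candidates for mean fill or interpolation, while heavily incomplete columns may need a different treatment.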
b. Noise detection and removal
PROGRAM:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
# Sample data with noise
data = {
'A': [1, 2, 3, 4, 100, 6, 7, 8, 9, 10],
'B': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Detect noise using Isolation Forest (random_state fixed so results are repeatable)
iso_forest = IsolationForest(contamination=0.1, random_state=42)
df['anomaly'] = iso_forest.fit_predict(df[['A', 'B']])
# Remove noise
df_clean = df[df['anomaly'] == 1].drop(columns=['anomaly'])
print("\nDataFrame after noise removal:")
print(df_clean)
OUTPUT:
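As a simple cross-check on the Isolation Forest result, a rule based on the interquartile range (IQR) can flag the same kind of extreme value. This is a sketch of a different, deterministic technique, not the method used in the program above; the helper name `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions chosen for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Same sample data as in the program above
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 100, 6, 7, 8, 9, 10],
    'B': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
})

mask = iqr_outlier_mask(df['A'])
df_clean = df[~mask]
print("Rows flagged as noise:")
print(df[mask])
print("\nDataFrame after IQR-based noise removal:")
print(df_clean)
```

On this data the rule flags only the row where `A` is 100, since every other value lies well inside the IQR fences.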
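c. Identifying data redundancy and elimination
PROGRAM:
import pandas as pd
import numpy as np
# Sample data with a duplicate row and a redundant column (illustrative values,
# assumed here because the original sample data is missing from this listing)
data = {
'A': [1, 2, 2, 3, 4, 5],
'B': [2, 4, 4, 6, 8, 10],
'C': [5, 3, 3, 8, 1, 7]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)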
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicate rows:")
print(df_no_duplicates)
# Remove duplicate columns
df_no_duplicates = df_no_duplicates.loc[:, ~df_no_duplicates.columns.duplicated()]
print("\nDataFrame after removing duplicate columns:")
print(df_no_duplicates)
# Calculate correlation matrix and drop highly correlated columns
correlation_matrix = df_no_duplicates.corr().abs()
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df_no_redundant_columns = df_no_duplicates.drop(columns=to_drop)
print("\nDataFrame after removing highly correlated columns:")
print(df_no_redundant_columns)
output:
2. Implement any one imputation model
PROGRAM:
import pandas as pd
from sklearn.impute import SimpleImputer
# Sample DataFrame with missing values
data = {
'A': [1, 2, None, 4, 5],
'B': [None, 2, 3, 4, None],
'C': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# Display original DataFrame
print("Original DataFrame:\n", df)
# Mean Imputation using SimpleImputer
mean_imputer = SimpleImputer(strategy='mean')
df_mean_imputed = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)
# Display DataFrame after Mean Imputation
print("\nDataFrame After Mean Imputation:\n", df_mean_imputed)
OUTPUT:
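Mean imputation ignores relationships between columns. As one possible alternative imputation model, the sketch below uses scikit-learn's `KNNImputer` on the same sample data; the choice of `n_neighbors=2` is an assumption made for this tiny dataset, not a recommended default.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Same sample data as in the program above
df = pd.DataFrame({
    'A': [1, 2, None, 4, 5],
    'B': [None, 2, 3, 4, None],
    'C': [1, 2, 3, 4, 5],
})

# Each missing value is replaced by the average of that column over the
# 2 rows that are closest on the columns both rows have observed
knn_imputer = KNNImputer(n_neighbors=2)
df_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)

print("DataFrame After KNN Imputation:\n", df_knn_imputed)
```

Because neighbours are chosen by distance, this approach tends to work best when the columns are on comparable scales.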
3. Implement Linear Regression
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Sample DataFrame
data = {
'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Feature2': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
'Target': [2.5, 3.5, 6, 7, 8.5, 10, 11.5, 13, 14.5, 16]
}
df = pd.DataFrame(data)
# Define features and target variable
X = df[['Feature1', 'Feature2']]
y = df['Target']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Visualize the results
plt.scatter(X_test['Feature1'], y_test, color='blue', label='Actual')
plt.scatter(X_test['Feature1'], y_pred, color='red', label='Predicted')
plt.xlabel('Feature1')
plt.ylabel('Target')
plt.title('Linear Regression Results')
plt.legend()
plt.show()
output:
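To make the two evaluation metrics concrete, the sketch below recomputes Mean Squared Error and R-squared directly from their definitions with NumPy. The helper name `mse_and_r2` and the three sample values are assumptions for illustration only; they are not taken from the program's output.

```python
import numpy as np


def mse_and_r2(y_true, y_pred):
    """Compute MSE and R-squared from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)                     # average squared error
    ss_res = np.sum(residual ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2


# Hypothetical actual and predicted values
mse, r2 = mse_and_r2([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
print("Mean Squared Error:", mse)
print("R-squared:", r2)
```

An R-squared close to 1 means the predictions explain most of the variation in the target, while a value near 0 means they do little better than predicting the mean.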
4. Implement Logistic Regression
PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Sample DataFrame
data = {
'Hours_Studied': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
'Attendance': [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
'Passed': [0, 0, 1, 1, 0, 0, 1, 1, 0, 1] # 1: Passed, 0: Failed
}
df = pd.DataFrame(data)
# Define features and target variable
X = df[['Hours_Studied', 'Attendance']]
y = df['Passed']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
# Visualize the results (Optional)
import matplotlib.pyplot as plt
plt.scatter(df['Hours_Studied'], df['Passed'], color='blue', label='Actual')
plt.scatter(X_test['Hours_Studied'], y_pred, color='red', label='Predicted')
plt.xlabel('Hours Studied')
plt.ylabel('Passed')
plt.title('Logistic Regression Results')
plt.legend()
plt.show()
output:
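Logistic regression turns a weighted sum of the features into a probability through the sigmoid function, and the class is chosen by thresholding that probability at 0.5. The sketch below shows this step in isolation; the weights `w` and bias `b` are hypothetical values picked for illustration and are not the coefficients learned by the program above.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients for [Hours_Studied, Attendance] and a bias term
w = np.array([0.1, 0.5])
b = -3.0

# Two hypothetical students: 10 hours and 40 hours, both attending
X = np.array([[10, 1], [40, 1]])

probs = sigmoid(X @ w + b)           # probability of passing
labels = (probs >= 0.5).astype(int)  # threshold at 0.5

print("Probabilities:", probs)
print("Predicted classes:", labels)
```

With a fitted scikit-learn model, the same probabilities are available through `model.predict_proba`.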
5. Implement Decision Tree Induction for classification
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
# Visualize the Decision Tree
plt.figure(figsize=(15, 10))
plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names,
filled=True)
plt.title('Decision Tree for Iris Classification')
plt.show()
output:
6. Implement Random Forest Classifier
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Random Forest model
model = RandomForestClassifier(random_state=42, n_estimators=100)
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
output:
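A fitted random forest also reports how much each feature contributed to its splits. The sketch below ranks the Iris features by the model's `feature_importances_` attribute; for brevity it fits on the full dataset instead of a train/test split, which is an assumption of this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Fit the forest (full data, for illustration only)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Importances are non-negative and sum to 1 across all features
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features with very low importance are candidates for removal when a simpler model is wanted.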
7. Implement ARIMA on Time Series data
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Sample time series data
data = {
'Date': pd.date_range(start='2022-01-01', periods=24, freq='MS'),  # 'MS' = month start; the 'M' alias is deprecated in recent pandas
'Value': [120, 130, 135, 140, 150, 160, 170, 175, 180, 190, 200, 210,
220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)
# Plot the time series data
plt.figure(figsize=(10, 6))
plt.plot(df, label='Original Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series Data')
plt.legend()
plt.show()
# Plot ACF and PACF
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# With 24 observations, the PACF lags must stay below half the sample size
plot_acf(df['Value'], lags=10, ax=axes[0])
plot_pacf(df['Value'], lags=10, ax=axes[1])
plt.show()
# Fit ARIMA model
model = ARIMA(df['Value'], order=(2, 1, 2))
fit = model.fit()
# Summary of the model
print(fit.summary())
# Forecasting
forecast = fit.forecast(steps=12)
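# Build the next 12 dates with the same frequency as the historical index
forecast_dates = pd.date_range(start=df.index[-1], periods=13,
freq=pd.infer_freq(df.index))[1:]
# Use the raw forecast values; passing the Series with a new column name would yield NaNs
forecast_df = pd.DataFrame({'Forecast': np.asarray(forecast)}, index=forecast_dates)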
# Plot the forecast
plt.figure(figsize=(10, 6))
plt.plot(df, label='Original Data')
plt.plot(forecast_df, label='Forecast', color='red')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('ARIMA Forecast')
plt.legend()
plt.show()
output:
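The middle term of `order=(2, 1, 2)` is the differencing order d. With d = 1, ARIMA models the change between consecutive observations instead of the raw values, which removes a steady upward trend like the one in this sample series. The sketch below shows that first difference using pandas alone.

```python
import pandas as pd

# Same values as the sample time series above
values = [120, 130, 135, 140, 150, 160, 170, 175, 180, 190, 200, 210,
          220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
series = pd.Series(values)

# First difference: value at time t minus value at time t-1
diff1 = series.diff().dropna()

print("First differences:")
print(diff1.tolist())
print("Mean change per step:", diff1.mean())
```

The raw series climbs steadily, while the differenced series stays within a narrow band, which is closer to the stationarity that the AR and MA terms assume.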
8. Object segmentation using hierarchical based methods
PROGRAM:
import matplotlib.pyplot as plt
from skimage import data
from skimage.segmentation import felzenszwalb
from skimage.color import label2rgb
# Load a sample image
image = data.astronaut()
# Apply Felzenszwalb's Graph-Based Segmentation
segments_fz = felzenszwalb(image, scale=100, sigma=0.5, min_size=50)
# Create an overlay of the original image and the segmented image
segmented_image = label2rgb(segments_fz, image, kind='avg')
# Plot the results
fig, ax = plt.subplots(1, 2, figsize=(15, 10), sharex=True, sharey=True)
ax[0].imshow(image)
ax[0].set_title('Original Image')
ax[0].axis('off')
ax[1].imshow(segmented_image)
ax[1].set_title('Felzenszwalb Segmentation')
ax[1].axis('off')
plt.tight_layout()
plt.show()
output:
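Felzenszwalb's method in the program above is graph-based. For comparison, the sketch below shows agglomerative hierarchical clustering with SciPy, which builds a merge tree bottom-up and then cuts it into a chosen number of groups. It runs on a handful of hypothetical 2D points, not on image pixels, so it illustrates the hierarchical idea only and is not a drop-in replacement for the segmentation program.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2D points forming two well-separated groups
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

# Build the merge tree with Ward linkage, which merges the pair of
# clusters giving the smallest increase in within-cluster variance
Z = linkage(points, method='ward')

# Cut the tree so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print("Cluster labels:", labels)
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the full hierarchy.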
9. Perform Visualization techniques (types of maps - Bar, Column, Line, Scatter,
3D Cubes etc)
PROGRAM:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Sample data for bar and column charts
categories = ['A', 'B', 'C', 'D']
values = [10, 20, 15, 25]
# Sample data for line chart and scatter plot
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Sample data for 3D plot
z = [1, 4, 9, 16, 25]
# Plotting Bar Chart
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
plt.bar(categories, values, color='blue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart')
# Plotting Column Chart
plt.subplot(2, 2, 2)
plt.bar(categories, values, color='green')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Column Chart')
# Plotting Line Chart
plt.subplot(2, 2, 3)
plt.plot(x, y, marker='o', linestyle='-', color='red')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Chart')
# Plotting Scatter Plot
plt.subplot(2, 2, 4)
plt.scatter(x, y, color='purple')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.tight_layout()
plt.show()
# Plotting 3D Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, color='orange')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')
ax.set_title('3D Scatter Plot')
plt.show()
output:
10. Perform Descriptive analytics on healthcare data
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample healthcare dataset
data = {
'PatientID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Age': [25, 45, 35, 50, 65, 55, 40, 30, 70, 60],
'Gender': ['F', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'M'],
'BloodPressure': [120, 140, 130, 150, 160, 155, 135, 125, 165, 150],
'Cholesterol': [200, 220, 215, 250, 240, 230, 210, 205, 255, 245],
'Diabetes': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)
# Display basic statistics
print("Basic Statistics:")
print(df.describe())
# Count of gender
gender_count = df['Gender'].value_counts()
print("\nGender Count:")
print(gender_count)
# Count of diabetes status
diabetes_count = df['Diabetes'].value_counts()
print("\nDiabetes Count:")
print(diabetes_count)
# Plot Age distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=10, kde=True)
plt.xlabel('Age')
plt.title('Age Distribution')
plt.show()
# Plot Blood Pressure distribution by Gender
plt.figure(figsize=(10, 6))
sns.boxplot(x='Gender', y='BloodPressure', data=df)
plt.xlabel('Gender')
plt.ylabel('Blood Pressure')
plt.title('Blood Pressure Distribution by Gender')
plt.show()
# Plot Cholesterol levels by Diabetes status
plt.figure(figsize=(10, 6))
sns.boxplot(x='Diabetes', y='Cholesterol', data=df)
plt.xlabel('Diabetes')
plt.ylabel('Cholesterol')
plt.title('Cholesterol Levels by Diabetes Status')
plt.show()
output:
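Descriptive analytics often goes one step further than overall statistics by comparing groups. The sketch below computes group-wise means by diabetes status with `groupby`, using the relevant columns of the same sample healthcare data.

```python
import pandas as pd

# Relevant columns of the sample healthcare dataset above
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 65, 55, 40, 30, 70, 60],
    'BloodPressure': [120, 140, 130, 150, 160, 155, 135, 125, 165, 150],
    'Cholesterol': [200, 220, 215, 250, 240, 230, 210, 205, 255, 245],
    'Diabetes': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes'],
})

# Mean of each numeric measure within each diabetes group
group_means = df.groupby('Diabetes')[['Age', 'BloodPressure', 'Cholesterol']].mean()
print("Mean values by Diabetes status:")
print(group_means)
```

In this sample, the diabetic group is older on average (59 versus 36 years) and has a higher mean blood pressure (154 versus 132).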
11. Perform Predictive analytics on Product Sales data
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
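# Sample product sales data (assumed for illustration; not in the original listing)
data = {
'Advertising': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
'Price': [50, 48, 47, 45, 44, 42, 41, 40, 38, 37],
'Sales': [100, 120, 135, 150, 170, 185, 200, 220, 235, 250]
}
df = pd.DataFrame(data)
# Define features and target variable
X = df[['Advertising', 'Price']]
y = df['Sales']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)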
output:
12. Apply Predictive analytics for Weather forecasting.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Sample weather dataset
data = {
'Day': pd.date_range(start='2022-01-01', periods=30, freq='D'),
'Temperature': [25, 26, 27, 25, 28, 29, 30, 28, 27, 26,
25, 24, 23, 26, 27, 28, 29, 27, 25, 26,
28, 29, 30, 31, 32, 31, 29, 28, 27, 26]
}
df = pd.DataFrame(data)
df['Day_Num'] = df['Day'].dt.dayofyear
# Plot the historical temperature data
plt.figure(figsize=(10, 6))
plt.plot(df['Day'], df['Temperature'], marker='o', linestyle='-', color='blue')
plt.xlabel('Day')
plt.ylabel('Temperature')
plt.title('Historical Temperature Data')
plt.grid(True)
plt.show()
# Define features and target variable
X = df[['Day_Num']]
y = df['Temperature']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Forecast future temperatures
future_days = pd.date_range(start='2022-02-01', periods=7, freq='D')
# Wrap the day numbers in a DataFrame so the column name matches the training feature
future_days_num = pd.DataFrame({'Day_Num': future_days.dayofyear})
future_temperatures = model.predict(future_days_num)
# Plot the forecasted temperatures
plt.figure(figsize=(10, 6))
plt.plot(df['Day'], df['Temperature'], marker='o', linestyle='-', color='blue',
label='Historical Temperature')
plt.plot(future_days, future_temperatures, marker='o', linestyle='--', color='red',
label='Forecasted Temperature')
plt.xlabel('Day')
plt.ylabel('Temperature')
plt.title('Temperature Forecast')
plt.legend()
plt.grid(True)
plt.show()
output:
Manual Prepared by
Bhuvaneswari Beeram
Assistant Professor in ECET