TYCS Practical

Data science involves analyzing data to extract useful information and insights. Common techniques include data wrangling, preprocessing, modeling, and visualization. This document covers regression, clustering, and principal component analysis, and shows how to apply them in Python.

Data science



INDEX

Sr No  Title                                       Date  Sign
1      Introduction to Excel
2      Data Frames and Basic Data Pre-processing
3      Feature Scaling and Dummification
4      Hypothesis Testing
5      ANOVA (Analysis of Variance)
6      Regression and Its Types
7      Logistic Regression and Decision Tree
8      K-Means Clustering
9      Principal Component Analysis (PCA)
10     Data Visualization and Storytelling



PRACTICAL 1
Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.

Steps:
Step 1: Select the data range, then go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.

Step 2: Enter the threshold value, for example 2000.



Step 3: Go to Conditional Formatting > Data Bars and choose a Solid Fill style.

B. Create a pivot table to analyse and summarize data.


Steps:
Step 1: Select the entire table and go to Insert > PivotChart > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window.



Step 3: Drag the attributes into the field boxes below (Filters, Columns, Rows, and Values).

C. Use the VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
Step 1: Click on an empty cell and enter a formula of the form =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]), for example:
=VLOOKUP(B3, B3:D3, 1, TRUE)



D. Perform what-if analysis using Goal Seek to determine input values for a desired output.
Steps:
Step 1: On the Data tab, go to What-If Analysis > Goal Seek.

Step 2: Fill in the Set cell, To value, and By changing cell fields, then click OK.



PRACTICAL 2
Data Frames and Basic Data Pre-processing
A. Read data from CSV and JSON files into a data frame.
B. Perform basic data pre-processing tasks such as handling missing values and outliers.
Code:
import pandas as pd

# Reading CSV file into a DataFrame
df = pd.read_csv("samp.csv")
print("Our dataset:")
print(df)

# Reading JSON file into a DataFrame
data = pd.read_json("sample.json")
print(data)

# Displaying the first 10 rows of the DataFrame
print(df.head(10))

# Filling missing values with 0
print("Dataset after filling NA values with 0:")
df2 = df.fillna(value=0)
print(df2)

# Dropping rows with any missing values
print("Dataset after dropping NA values:")
df.dropna(inplace=True)
print(df)
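The task also mentions outliers, which the listing above does not handle. A minimal IQR-based sketch, assuming samp.csv has a numeric "age" column (the column name is an illustrative assumption, though part C below filters on it):

# IQR-based outlier removal; the column name "age" is assumed for illustration
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_no_outliers = df[(df["age"] >= lower) & (df["age"] <= upper)]
print("Dataset after removing outliers:")
print(df_no_outliers)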



C. Manipulate and transform data using functions like filtering, sorting, and grouping.
Code:
import pandas as pd

# Reading CSV file into a DataFrame
df = pd.read_csv("samp.csv")

# Filtering data based on a condition (e.g., age greater than 25)
filtered_df = df[df["age"] > 25]

# Sorting data based on a column (e.g., sorting by age in descending order)
sorted_df = df.sort_values(by="age", ascending=False)

# Grouping data based on a column and applying an aggregation function
# (e.g., finding the average age per city)
grouped_df = df.groupby("city").agg({"age": "mean"})

# Displaying the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)

# Displaying the sorted DataFrame
print("\nSorted DataFrame:")
print(sorted_df)

# Displaying the grouped DataFrame
print("\nGrouped DataFrame:")
print(grouped_df)



PRACTICAL 3
Feature Scaling and Dummification
A. Apply feature-scaling techniques like standardization and normalization to numerical features.

Code:
# Standardization and normalization
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler

print("Printing a few rows")
df = pd.read_csv(r"D:\TYCS\Data Science\SampleFile.csv")  # raw string so the backslashes are kept literally
print(df.head())

print("Max values")
max_vals = np.max(np.abs(df))
print(max_vals)
print(df / max_vals)  # max-abs scaling: divide each column by its maximum absolute value

print("Normalization")
scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())

print("Standardization")
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
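Note that sklearn's Normalizer rescales each row to unit norm; if min-max scaling to [0, 1] is the "normalization" intended, a sketch using MinMaxScaler (reusing the df loaded above) would be:

# Min-max normalization to [0, 1] as an alternative reading of "normalization"
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()
minmax_df = pd.DataFrame(mm.fit_transform(df), columns=df.columns)
print(minmax_df.head())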



B. Perform feature dummification to convert categorical variables into numerical representations.
Code:

import pandas as pd

data = pd.read_csv("data32.csv")
categorical_features = data.select_dtypes(include="object")
dummies = pd.get_dummies(categorical_features)
data = pd.concat([data, dummies], axis=1)
# Drop the original categorical columns, keeping only their dummy versions
data.drop(categorical_features.columns, axis=1, inplace=True)
data.to_csv("Output.csv")



Practical 4
Hypothesis Testing
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).

# t-test
import numpy as np
import scipy.stats as stats

np.random.seed(42)
scoreA = np.random.normal(loc=70,scale=10,size=30)
scoreB = np.random.normal(loc=75,scale=10,size=30)

t_stat,pvalue = stats.ttest_ind(scoreA,scoreB)
print(f"T-Statistics: {t_stat}\nP-Value: {pvalue}")

alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis. There is a significant difference in exam scores.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in exam scores.")

Output:

# Chi-square test
import numpy as np
import scipy.stats as stats

observed_data = np.array([[25, 15], [20, 40]])
chi2, pvalue, dof, expected = stats.chi2_contingency(observed_data)
print(f'Chi-Square Statistic: {chi2}\nP-value: {pvalue}\nDegrees of Freedom: {dof}\nExpected frequency:\n{expected}')

alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis. There is a significant association between gender and job satisfaction.")
else:
    print("Fail to reject the null hypothesis. Gender and job satisfaction are independent.")
Output:



Practical 5
ANOVA (Analysis of Variance)
Perform one-way ANOVA to compare means across multiple groups.
from scipy.stats import f_oneway

# Define sample data for each group
group1 = [15, 20, 25, 30, 35]
group2 = [10, 18, 22, 28, 32]
group3 = [12, 16, 20, 24, 28]

f_statistic, p_value = f_oneway(group1, group2, group3)

print("One-way ANOVA results:")
print("F-statistic:", f_statistic)
print("P-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print(
        "Reject null hypothesis: There are significant differences between the means of the groups."
    )
else:
    print(
        "Fail to reject null hypothesis: There are no significant differences between the means of the groups."
    )
Output:



Practical 6
Regression and its Types.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Independent variable (predictor)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])

# Dependent variable (response)
y = np.array([[7], [9], [11], [13], [15], [17], [19], [21], [23], [25]])

# Splitting the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)

# Simple Linear Regression


model = LinearRegression()
model.fit(X_train, y_train) # Fitting the model

# Coefficients
print("Intercept:", model.intercept_[0])
print("Coefficient:", model.coef_[0][0])

# Predictions
y_pred = model.predict(X_test)

# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)



# Plotting the regression line
plt.scatter(X_test, y_test, color="blue")
plt.plot(X_test, y_pred, color="red")
plt.title("Simple Linear Regression")
plt.xlabel("Independent Variable (X)")
plt.ylabel("Dependent Variable (y)")
plt.show()

Output:
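The title says "Regression and its Types", while the listing covers only simple linear regression. A brief polynomial-regression sketch on the same data, where degree 2 is an arbitrary illustrative choice, could be:

# Polynomial regression as a second type, reusing X_train, X_test, y_train,
# y_test, LinearRegression, and r2_score from the listing above
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)
print("Polynomial R-squared:", r2_score(y_test, poly_model.predict(X_test)))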



Practical 7
Logistic Regression and Decision Tree
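The listing further below performs K-Means clustering with silhouette analysis (Practical 8 repeats K-Means with the elbow method). For the logistic regression and decision tree this practical names, a minimal sketch, assuming scikit-learn and the Iris dataset (both the dataset and the parameters are illustrative choices, not taken from the original listing), could be:

# Hedged sketch: logistic regression and a decision tree on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression classifier
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
print("Logistic Regression accuracy:", accuracy_score(y_test, log_reg.predict(X_test)))

# Decision tree classifier (max_depth chosen for illustration)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Decision Tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))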
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate sample data


X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)

# Determine the optimal number of clusters using the silhouette score


silhouette_scores = []
for k in range(2, 11):
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
score = silhouette_score(X, kmeans.labels_)
silhouette_scores.append(score)

# Plot the silhouette scores


plt.plot(range(2, 11), silhouette_scores, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score for Optimal Number of Clusters")
plt.show()

# Choose the optimal number of clusters based on the silhouette score


optimal_k = silhouette_scores.index(max(silhouette_scores)) + 2

# Apply K-Means clustering with the optimal number of clusters


kmeans = KMeans(n_clusters=optimal_k, random_state=0).fit(X)

# Visualize the clustering results


plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=50, alpha=0.7)
plt.scatter(
kmeans.cluster_centers_[:, 0],
kmeans.cluster_centers_[:, 1],
s=200,
c="red",
marker="X",
label="Centroids",
)
plt.title("K-Means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

# Analyze the cluster characteristics


silhouette_avg = silhouette_score(X, kmeans.labels_)
print(f"Silhouette Score: {silhouette_avg}")
Output:



Practical 8
K-Means Clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv("wholesale.csv")

# Display the first few rows of the dataset


print(data.head())

# Define categorical and continuous features


categorical_features = ["Channel", "Region"]
continuous_features = [
"Fresh",
"Milk",
"Grocery",
"Frozen",
"Detergents_Paper",
"Delicassen",
]

# Descriptive statistics for continuous features


print(data[continuous_features].describe())

# Convert categorical features into dummy variables


for col in categorical_features:
dummies = pd.get_dummies(data[col], prefix=col)
data = pd.concat([data, dummies], axis=1)
data.drop(col, axis=1, inplace=True)



# Display the first few rows of the updated dataset
print(data.head())

# Normalize the data


mms = MinMaxScaler()
data_transformed = mms.fit_transform(data)

# Calculate the sum of squared distances for different values of k


sum_of_squared_distances = []
K = range(1, 15)
for k in K:
km = KMeans(n_clusters=k)
km.fit(data_transformed)
sum_of_squared_distances.append(km.inertia_)

# Plot the elbow method graph


plt.plot(K, sum_of_squared_distances, "bx-")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Sum of Squared Distances")
plt.title("Elbow Method for Optimal k")
plt.show()

Output:
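The elbow plot only suggests a value of k; the final model still has to be fitted. A short follow-up sketch, where the chosen k is an assumption to be read off your own plot:

# Fit the final K-Means model; chosen_k = 6 is an illustrative assumption
chosen_k = 6
km = KMeans(n_clusters=chosen_k)
labels = km.fit_predict(data_transformed)
print(pd.Series(labels).value_counts())  # cluster sizes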



Practical 9
Principal Component Analysis (PCA)
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Perform PCA
pca = PCA(n_components=2) # Specify the number of components (dimensions)
X_r = pca.fit_transform(X)

# Create a DataFrame for visualization


df = pd.DataFrame(data=X_r, columns=['PC1', 'PC2'])
df['target'] = y

# Plot the data


plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
lw = 2

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(
        df.loc[df['target'] == i, 'PC1'],
        df.loc[df['target'] == i, 'PC2'],
        color=color,
        alpha=0.8,
        lw=lw,
        label=target_name,
    )

plt.title('PCA of IRIS dataset')


plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

Output:
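How much of the original variance the two components retain can be checked on the fitted pca object; a one-line addition to the script above:

# Proportion of variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)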



Practical 10
Data Visualization and Storytelling

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


# Assume 'data.csv' contains your dataset
df = pd.read_csv("data.csv")

# Perform data analysis


# Example: Calculate summary statistics
summary_stats = df.describe()

# Create meaningful visualizations


# Example: Plot a histogram of a numerical variable
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x="numerical_variable", bins=20, kde=True)
plt.title("Histogram of Numerical Variable")
plt.xlabel("Numerical Variable")
plt.ylabel("Frequency")
plt.show()

# Example: Plot a bar chart of a categorical variable


plt.figure(figsize=(8, 6))
sns.countplot(data=df, x="categorical_variable", palette="viridis")
plt.title("Bar Chart of Categorical Variable")
plt.xlabel("Categories")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

# Present findings and insights in a clear and concise manner


# Example: Use Markdown to format text for presentation
print("# Data Analysis and Visualization Report\n")
print("## Summary Statistics:\n")
print(summary_stats)
print("\n## Insights:\n")
print(
    "- The histogram shows that the distribution of the numerical variable is approximately normal."
)
print(
    "- The bar chart indicates that category A is the most frequent in the categorical variable."
)
print(
    "- The scatterplot suggests a positive correlation between numerical variables 1 and 2, "
    "with different categories showing distinct patterns.\n"
)

Output:
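The insights above mention a scatterplot that the listing does not produce. A minimal sketch, with hypothetical column names standing in for whatever data.csv actually contains:

# Hypothetical scatterplot of two numerical variables, colored by category;
# all column names here are placeholders
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="numerical_variable_1", y="numerical_variable_2", hue="categorical_variable")
plt.title("Scatterplot of Numerical Variables by Category")
plt.show()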
