
An example of importing data from different formats with Pandas, combining it, and exporting the combined result:

```python
import pandas as pd

# Import data from different formats
data_csv = pd.read_csv('data.csv')
data_excel = pd.read_excel('data.xlsx')
data_json = pd.read_json('data.json')

# Display the imported data
print("Imported CSV Data:")
print(data_csv)

print("Imported Excel Data:")
print(data_excel)

print("Imported JSON Data:")
print(data_json)

# Combine data from different sources (assuming the columns are the same)
combined_data = pd.concat([data_csv, data_excel, data_json], ignore_index=True)

# Perform data manipulation or analysis
# For example, calculating the average of a column ('column_name' is a placeholder)
average_column = combined_data['column_name'].mean()
print("Average of column_name:", average_column)

# Export the combined data to different formats
combined_data.to_csv('combined_data.csv', index=False)
combined_data.to_excel('combined_data.xlsx', index=False)
combined_data.to_json('combined_data.json', orient='records')

print("Data exported to combined_data.csv, combined_data.xlsx, combined_data.json")
```

Replace `'data.csv'`, `'data.xlsx'`, `'data.json'`, and `'column_name'` with your own files and column names.


Data preprocessing is a crucial step in data analysis and machine learning. Here's an example of various data preprocessing techniques applied to a sample dataset using the Pandas and scikit-learn libraries:

Assuming you have a dataset named `'sample_data.csv'` with columns like `'age'`, `'gender'`,
`'income'`, and `'education'`:

```python
import pandas as pd

# Load the dataset
data = pd.read_csv('sample_data.csv')

# Display the first few rows of the dataset
print("Original Data:")
print(data.head())

# Handling Missing Values
data.dropna(inplace=True)  # Remove rows with missing values

# Encoding Categorical Variables
data_encoded = pd.get_dummies(data, columns=['gender', 'education'], drop_first=True)

# Scaling Numerical Variables (Standardization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_encoded[['age', 'income']] = scaler.fit_transform(data_encoded[['age', 'income']])

# Handling Outliers (using Z-scores; keep rows within 3 standard deviations)
z_scores = (data_encoded - data_encoded.mean()) / data_encoded.std()
data_no_outliers = data_encoded[(z_scores.abs() < 3).all(axis=1)]

# Splitting Data into Features and Target ('target_column' is a placeholder)
X = data_no_outliers.drop('target_column', axis=1)
y = data_no_outliers['target_column']

# Splitting Data into Training and Testing Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Dimensionality Reduction (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)

# Feature Scaling (Normalization)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X_train)

# Text Data Preprocessing (only if the dataset has a text column; 'text_column' is a placeholder)
from sklearn.feature_extraction.text import CountVectorizer
text_data = data['text_column']
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(text_data)

# Time Series Data Resampling (only if the dataset has a timestamp column; 'timestamp_column' is a placeholder)
time_series_data = data.set_index('timestamp_column')
resampled_data = time_series_data.resample('D').sum()

# Displaying Preprocessed Data
print("Preprocessed Data:")
print(X_train.head())
print(y_train.head())
print(X_pca[:5])
print(X_normalized[:5])
print(X_text[:5])
print(resampled_data.head())
```

Remember to replace `'sample_data.csv'`, column names, and other placeholders with your
actual data and features. This example covers various preprocessing techniques like handling
missing values, encoding categorical variables, scaling numerical variables, handling outliers,
splitting data, dimensionality reduction, feature scaling, text data preprocessing, and time series
resampling. Adapt these techniques based on the nature of your dataset and analysis goals.
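
In practice, these steps are often chained together so they can be applied consistently to training and test data. Below is a minimal, hedged sketch of how the numeric and categorical steps above could be composed with scikit-learn's `Pipeline` and `ColumnTransformer`, assuming the same hypothetical `'age'`, `'income'`, `'gender'`, and `'education'` columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups from the sample dataset described above
numeric_features = ['age', 'income']
categorical_features = ['gender', 'education']

# Impute then scale numeric columns; impute then one-hot encode categorical columns
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features),
])

data = pd.read_csv('sample_data.csv')  # placeholder file name
X_prepared = preprocessor.fit_transform(data[numeric_features + categorical_features])
print(X_prepared[:5])
```
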
Example of how to implement dimensionality reduction using the PCA (Principal Component Analysis) method with the `sklearn` library in Python:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create a DataFrame for visualization
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
pca_df['target'] = y

# Visualize the results
plt.figure(figsize=(10, 6))
targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indices = pca_df['target'] == target
    plt.scatter(pca_df.loc[indices, 'PC1'], pca_df.loc[indices, 'PC2'], c=color, s=50)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(targets, title='Target')
plt.title('PCA of Iris Dataset')
plt.show()
```

In this example, we apply PCA to the Iris dataset to reduce the dimensionality from 4 features to 2 principal components. The code first standardizes the features to have zero mean and unit variance, then applies PCA to transform the standardized data into lower-dimensional components. Finally, the reduced data is visualized using a scatter plot.

Replace the `iris.data` and `iris.target` with your own dataset's features and target variables.
Also, adjust the code as needed to fit your specific dataset and requirements.
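
When choosing the number of components, it can help to look at how much variance each component explains. A minimal sketch, assuming the same standardized Iris data as above:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", pca.explained_variance_ratio_.sum())
```
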
Here's an example of how to implement both Simple and Multiple Linear
Regression models using the `sklearn` library in Python:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Generate some sample data
np.random.seed(0)
X_simple = np.random.rand(100, 1) * 10
y_simple = 2 * X_simple + 1 + np.random.randn(100, 1)

# Split the data into training and testing sets
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42)

# Create and train the Simple Linear Regression model
simple_model = LinearRegression()
simple_model.fit(X_train_simple, y_train_simple)

# Make predictions using the Simple Linear Regression model
y_pred_simple = simple_model.predict(X_test_simple)

# Calculate metrics for the Simple Linear Regression model
mse_simple = mean_squared_error(y_test_simple, y_pred_simple)
r2_simple = r2_score(y_test_simple, y_pred_simple)

# Print metrics for the Simple Linear Regression model
print("Simple Linear Regression:")
print("Mean Squared Error:", mse_simple)
print("R-squared:", r2_simple)

# Plot the Simple Linear Regression model's predictions
plt.scatter(X_test_simple, y_test_simple, color='blue')
plt.plot(X_test_simple, y_pred_simple, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.show()

# Generate some sample data for Multiple Linear Regression
X_multiple = np.random.rand(100, 2) * 10
y_multiple = 2 * X_multiple[:, 0] + 3 * X_multiple[:, 1] + 1 + np.random.randn(100)

# Split the data into training and testing sets
X_train_multiple, X_test_multiple, y_train_multiple, y_test_multiple = train_test_split(
    X_multiple, y_multiple, test_size=0.2, random_state=42)

# Create and train the Multiple Linear Regression model
multiple_model = LinearRegression()
multiple_model.fit(X_train_multiple, y_train_multiple)

# Make predictions using the Multiple Linear Regression model
y_pred_multiple = multiple_model.predict(X_test_multiple)

# Calculate metrics for the Multiple Linear Regression model
mse_multiple = mean_squared_error(y_test_multiple, y_pred_multiple)
r2_multiple = r2_score(y_test_multiple, y_pred_multiple)

# Print metrics for the Multiple Linear Regression model
print("Multiple Linear Regression:")
print("Mean Squared Error:", mse_multiple)
print("R-squared:", r2_multiple)
```

In this example, we first generate sample data for both Simple and Multiple Linear
Regression. We then split the data into training and testing sets, create and train the Linear
Regression models, make predictions, calculate metrics (Mean Squared Error and R-squared),
and visualize the results.

Remember to replace the generated sample data with your actual data for real-world
scenarios. The code demonstrates how to use `LinearRegression` from `sklearn` for both
Simple and Multiple Linear Regression.
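
If you also want to read off the fitted equation, the learned slope(s) and intercept are available on the trained model. A minimal sketch on small illustrative data (not the generated data above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset that roughly follows y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])
model = LinearRegression().fit(X, y)

# The coefficients and intercept define the fitted line y = coef * x + intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```
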
An example of how to develop a Decision Tree Classification model using the `sklearn`
library in Python, and then use it to classify a new sample:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree Classification model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# Make predictions using the trained model
y_pred = tree_model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classify a new sample using the trained model
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])  # Replace with your own sample data
predicted_class = tree_model.predict(new_sample)
predicted_species = iris.target_names[predicted_class][0]
print("Predicted Species:", predicted_species)
```

In this example, we use the Iris dataset for demonstration. The code splits the data into
training and testing sets, creates and trains a Decision Tree Classification model, makes
predictions, calculates the accuracy of the model, and finally classifies a new sample using
the trained model.

Replace `new_sample` with the feature values of the new sample you want to classify. The
code uses the Decision Tree Classifier from `sklearn.tree` and showcases the process of
training and using the model for classification.
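
If you want to see what the trained tree actually learned, scikit-learn's `export_text` prints the splitting rules. A minimal sketch, refitting a tree on the full Iris dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree_model = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Print the learned splitting rules as plain text
print(export_text(tree_model, feature_names=iris.feature_names))
```
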
Example of how to implement Naïve Bayes Classification using the `sklearn` library in
Python:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Naïve Bayes Classification model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Make predictions using the trained model
y_pred = nb_model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classify a new sample using the trained model
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])  # Replace with your own sample data
predicted_class = nb_model.predict(new_sample)
predicted_species = iris.target_names[predicted_class][0]
print("Predicted Species:", predicted_species)
```

In this example, we use the Iris dataset for demonstration. The code splits the data into training
and testing sets, creates and trains a Naïve Bayes Classification model (specifically, Gaussian
Naïve Bayes), makes predictions, calculates the accuracy of the model, and classifies a new
sample using the trained model.

Replace `new_sample` with the feature values of the new sample you want to classify. The
code uses the Gaussian Naïve Bayes classifier from `sklearn.naive_bayes` and showcases the
process of training and using the model for classification.
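
Because Gaussian Naïve Bayes is a probabilistic model, you can also inspect the per-class probabilities rather than just the predicted label. A minimal sketch, refitting the model on the full Iris dataset for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
nb_model = GaussianNB().fit(iris.data, iris.target)

# Per-class probabilities for one sample; columns follow iris.target_names
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])
for species, prob in zip(iris.target_names, nb_model.predict_proba(new_sample)[0]):
    print(species, round(prob, 4))
```
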
Example of how to build a k-Nearest Neighbors (KNN) Classification model using the
`sklearn` library in Python:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the KNN Classification model
k = 3  # Number of neighbors
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)

# Make predictions using the trained model
y_pred = knn_model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classify a new sample using the trained model
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])  # Replace with your own sample data
predicted_class = knn_model.predict(new_sample)
predicted_species = iris.target_names[predicted_class][0]
print("Predicted Species:", predicted_species)
```

In this example, we use the Iris dataset for demonstration. The code splits the data into training
and testing sets, creates and trains a KNN Classification model with a specified number of
neighbors (`k`), makes predictions, calculates the accuracy of the model, and classifies a new
sample using the trained model.

Replace `new_sample` with the feature values of the new sample you want to classify. The
code uses the `KNeighborsClassifier` from `sklearn.neighbors` and showcases the process of
training and using the KNN model for classification.
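
The choice of `k` affects accuracy, so it is common to compare a few values with cross-validation before settling on one. A minimal sketch using the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# 5-fold cross-validated accuracy for a few candidate values of k
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), iris.data, iris.target, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")
```
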
Example of how to implement the K-Means clustering algorithm from scratch in Python:

```python
import numpy as np

class KMeans:
    def __init__(self, n_clusters, max_iters=100):
        self.n_clusters = n_clusters
        self.max_iters = max_iters

    def fit(self, X):
        # Initialize centroids by picking random points from the data
        self.centroids = X[np.random.choice(len(X), self.n_clusters, replace=False)]
        for _ in range(self.max_iters):
            # Assign each point to its nearest centroid
            distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
            labels = np.argmin(distances, axis=1)
            # Recompute each centroid as the mean of the points assigned to it
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(self.n_clusters)])
            if np.allclose(new_centroids, self.centroids):
                break
            self.centroids = new_centroids

        self.labels = labels

# Generate some sample data
np.random.seed(0)
X = np.random.rand(200, 2)

# Create a KMeans instance and fit the data
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

# Get cluster labels and centroids
labels = kmeans.labels
centroids = kmeans.centroids

print("Cluster Labels:")
print(labels)
print("Centroids:")
print(centroids)
```

In this example, the `KMeans` class is implemented with a `fit` method that takes the data `X`
and iteratively updates the centroids to cluster the data. The algorithm stops when either the
maximum number of iterations is reached or when the centroids do not change significantly.

Replace `X` with your own dataset, and adjust the `n_clusters` parameter to the desired number
of clusters. The example demonstrates how to implement the core K-Means clustering
algorithm using NumPy for data manipulation and calculations. Keep in mind that there are
more efficient and optimized libraries (like `scikit-learn`) available for K-Means clustering.
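
For comparison, a minimal sketch of the same clustering task with scikit-learn's built-in `KMeans`, which also handles multiple random initializations for you:

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)
X = np.random.rand(200, 2)

# n_init controls how many random initializations are tried; the best run is kept
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Cluster Labels:", kmeans.labels_[:10])
print("Centroids:")
print(kmeans.cluster_centers_)
```
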
