Lab Report 8

Code

from google.colab import drive
drive.mount('/content/drive')

!pip install pyspark

import seaborn as sns
import matplotlib.pyplot as plt

from pyspark.ml import Pipeline                    # pipeline to transform data
from pyspark.sql import SparkSession               # to initiate Spark
from pyspark.sql.types import FloatType
from pyspark.ml.feature import RegexTokenizer      # tokenizer
from pyspark.ml.feature import HashingTF, IDF      # vectorizers
from pyspark.ml.feature import StopWordsRemover    # to remove stop words
from pyspark.sql.functions import concat_ws, col   # to concatenate columns
from pyspark.ml.classification import LogisticRegression             # ML model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator  # to evaluate the model
from pyspark.mllib.evaluation import MulticlassMetrics               # performance metrics

spark = SparkSession.builder.appName("Lab-8").getOrCreate()

df = spark.read.csv(path="/content/drive/MyDrive/train.csv",
                    header=True, inferSchema=True)
df.show()
Output

Description
The code starts by mounting Google Drive so that files stored there can be accessed, then installs the PySpark library. It imports the required Python libraries, including Seaborn, Matplotlib, and several PySpark components, and creates a Spark session named "Lab-8". Finally, it reads the CSV file ("train.csv") into a PySpark DataFrame named df. header=True indicates that the first row contains column names, and inferSchema=True automatically infers the data type of each column.
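As a quick sanity check (not shown in the original output), the inferred schema can be printed to confirm that each column was read with the expected type:

# verify the column names and inferred data types
df.printSchema()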
Code

# Rename the 'Class Index' column to 'label'
df = df.withColumnRenamed('Class Index', 'label')

# Add a new column 'Text' by concatenating 'Title' and 'Description'
df = df.withColumn("Text", concat_ws(" ", "Title", 'Description'))

# Keep only the label and the combined text column
df = df.select('label', 'Text')

# Show the top 10 rows
df.show(10)

Output

Description
This code performs data preprocessing on the PySpark DataFrame df to prepare it for text-based analysis or natural language processing (NLP): it renames 'Class Index' to 'label', concatenates 'Title' and 'Description' into a single 'Text' column, and drops the original text columns.
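Optionally, the class distribution can be inspected at this point. This is a small sketch using the columns defined above and is not part of the original lab code:

# count how many articles fall under each class label
df.groupBy('label').count().orderBy('label').show()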
Code

# tokenize 'Text' into a 'words' column (needed by the stop word remover below)
tokenizer = RegexTokenizer(inputCol="Text", outputCol="words", pattern="\\W")
df = tokenizer.transform(df)

stopwords_remover = StopWordsRemover(inputCol="words",
                                     outputCol="filtered")

# add a 'filtered' column to df with stop words removed
df = stopwords_remover.transform(df)

df.select(['label', 'Text', 'words', 'filtered']).show(5)

Output

Description
This code continues the preprocessing for text analysis. It tokenizes the combined 'Text' column into a 'words' column and then applies stop word removal, adding a 'filtered' column that contains the tokens without common English stop words.
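By default, StopWordsRemover uses Spark's built-in English stop word list. If extra domain-specific words need to be dropped, the list can be extended; the extra words below are only illustrative and not part of the original code:

# start from Spark's default English stop words and append custom ones (illustrative)
custom_stopwords = StopWordsRemover.loadDefaultStopWords("english") + ["said", "reuters"]
stopwords_remover = StopWordsRemover(inputCol="words",
                                     outputCol="filtered",
                                     stopWords=custom_stopwords)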

Code
hashing_tf = HashingTF(inputCol="filtered",
                       outputCol="raw_features",
                       numFeatures=10000)

# add raw term-frequency features to df
featurized_data = hashing_tf.transform(df)

# Inverse document frequency
idf = IDF(inputCol="raw_features", outputCol="features")
idf_vectorizer = idf.fit(featurized_data)

# convert text to TF-IDF vectors
rescaled_data = idf_vectorizer.transform(featurized_data)

# top 20 rows
rescaled_data.select("label", 'Text', 'words', 'filtered',
                     "features").show()

Output

Description
This code segment converts the text data into numerical features. HashingTF maps each filtered token into one of 10,000 hash buckets and counts term frequencies (TF), and IDF rescales those counts by inverse document frequency. The resulting DataFrame (rescaled_data) contains a new column ("features") with TF-IDF vectors, which are commonly used as input features for machine learning models in text analysis tasks.
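To see what HashingTF and IDF actually do, a toy example can help: HashingTF hashes each token into one of numFeatures buckets and counts occurrences, and IDF then down-weights terms that appear in many documents. A minimal sketch on made-up data (not from the lab dataset), reusing the hashing_tf and idf objects defined above:

# toy corpus: two tiny "documents" already tokenized and filtered
toy = spark.createDataFrame(
    [(1, ["spark", "is", "fast"]),
     (2, ["spark", "handles", "big", "data"])],
    ["label", "filtered"])

toy_tf = hashing_tf.transform(toy)              # sparse term-frequency vectors
toy_tfidf = idf.fit(toy_tf).transform(toy_tf)   # rescaled TF-IDF vectors
toy_tfidf.select("raw_features", "features").show(truncate=False)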
Code
# Split Train/Test data
(train, test) = rescaled_data.randomSplit([0.75, 0.25], seed = 202)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Output

Description
This code performs a random split of the rescaled_data DataFrame into training and test datasets, allocating 75% of the data for training and 25% for testing; the fixed seed (202) makes the split reproducible.
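Since both splits are used more than once (for the counts here and for fitting and scoring later), caching them can avoid recomputing the upstream transformations. This is an optional optimization, not part of the original code:

# keep the splits in memory so repeated actions don't recompute the TF-IDF steps
train = train.cache()
test = test.cache()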

Code
lr = LogisticRegression(featuresCol='features',
                        labelCol='label',
                        family="multinomial",
                        regParam=0.3,
                        elasticNetParam=0,
                        maxIter=50)

# train the model on the training set
lrModel = lr.fit(train)

# get predictions for the test set
predictions = lrModel.transform(test)

# show the top 20 predictions
predictions.select("Text", 'probability', 'prediction', 'label').show()

Output
Description
This code segment trains a multinomial logistic regression model (L2 regularization with regParam=0.3, elasticNetParam=0, maxIter=50) on the training set, makes predictions on the test set, and displays the top 20 rows of the predictions DataFrame.
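If more detail about the fit is wanted, recent Spark versions expose a training summary on the fitted model; the attributes below assume Spark 2.3+ and are shown only as a sketch:

# inspect the training summary (available on recent Spark versions)
summary = lrModel.summary
print("Training accuracy:", summary.accuracy)
print("Number of training iterations:", len(summary.objectiveHistory))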

Code
# to evaluate the model
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")

# print the test metric (f1 by default)
print("Test-set Accuracy is : ", evaluator.evaluate(predictions))
Output

Description
This code segment evaluates the logistic regression model on the test dataset. Note that MulticlassClassificationEvaluator computes the F1 score by default, so to report accuracy explicitly the metricName="accuracy" option should be set, as sketched below.
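A sketch of requesting specific metrics by name instead of relying on the evaluator's default:

# request accuracy and F1 explicitly
acc_evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")
print("Test-set Accuracy:", acc_evaluator.evaluate(predictions))

f1_evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                 predictionCol="prediction",
                                                 metricName="f1")
print("Test-set F1:", f1_evaluator.evaluate(predictions))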

Code
labels = ["World", "Sports", "Business","Science"]

# important: need to cast to float type, and order by prediction, else it


won't work
preds_and_labels = predictions.select(['prediction','label']) \
.withColumn('label', col('label') \
.cast(FloatType())) \
.orderBy('prediction')
# generate metrics
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

# figure object
_ = plt.figure(figsize=(7, 7))

# plot confusion matrix


sns.heatmap(metrics.confusionMatrix().toArray(),
cmap='viridis',
annot=True,fmt='0',
cbar=False,
xticklabels=labels,
yticklabels=labels)
plt.show()
Output

Description
This code visualizes the confusion matrix for the classification model's predictions on the test dataset.
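Beyond the confusion matrix, the same MulticlassMetrics object can report per-class precision and recall; the sketch below assumes the labels in this dataset are the numeric class indices 1.0 to 4.0:

# per-class precision/recall from the metrics object (class indices assumed to be 1.0-4.0)
for class_index, name in zip([1.0, 2.0, 3.0, 4.0], labels):
    print(name,
          "precision:", round(metrics.precision(class_index), 3),
          "recall:", round(metrics.recall(class_index), 3))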
Code
# load dataset
df = spark.read.csv("/content/drive/MyDrive/Big data lab/Lab-5/train.csv",
                    inferSchema=True, header=True)

# Rename the 'Class Index' column to 'label'
df = df.withColumnRenamed('Class Index', 'label')

# Add a new column 'Text' by concatenating 'Title' and 'Description'
df = df.withColumn("Text", concat_ws(" ", "Title", 'Description'))

# Select the new text feature and the labels
df = df.select('label', 'Text')

# tokenizer
tokenizer = RegexTokenizer(inputCol="Text", outputCol="words",
                           pattern="\\W")

# stop word remover
stopwords_remover = StopWordsRemover(inputCol="words",
                                     outputCol="filtered")

# term frequency
hashing_tf = HashingTF(inputCol="filtered",
                       outputCol="raw_features",
                       numFeatures=10000)

# Inverse Document Frequency - vectorizer
idf = IDF(inputCol="raw_features", outputCol="features")

# model
lr = LogisticRegression(featuresCol='features',
                        labelCol='label',
                        family="multinomial",
                        regParam=0.3,
                        elasticNetParam=0,
                        maxIter=50)

# Put everything in a pipeline
pipeline = Pipeline(stages=[tokenizer,
                            stopwords_remover,
                            hashing_tf,
                            idf,
                            lr])

# Fit the pipeline to the training documents
pipelineFit = pipeline.fit(df)

# transform the training data with the fitted pipeline
dataset = pipelineFit.transform(df)

# show the top 10 rows with predictions
dataset.show(10)

Output

Description
This code sets up a pipeline to preprocess text data and train a logistic regression
model for classification. The pipeline includes tokenization, stopwords removal, TF-
IDF vectorization, and model training stages.
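One advantage of wrapping everything in a Pipeline is that the fitted stages can be saved and reloaded as a single unit. A sketch, with an illustrative Drive path rather than one used in the lab:

from pyspark.ml import PipelineModel

# persist the fitted pipeline (path is illustrative)
pipelineFit.write().overwrite().save("/content/drive/MyDrive/lab8_pipeline")

# reload it later and score new data with the same preprocessing + model
loaded = PipelineModel.load("/content/drive/MyDrive/lab8_pipeline")
loaded.transform(df).select("Text", "prediction").show(5)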
