Lab Report 8
Code
from pyspark.sql import SparkSession

# create the Spark session and load the dataset from Google Drive
spark = SparkSession.builder.appName("Lab-8").getOrCreate()
df = spark.read.csv("/content/drive/MyDrive/train.csv",
                    header=True, inferSchema=True)
df.show()
Output
Description
The code starts by mounting Google Drive to access files stored there, then installs the PySpark library. Next, it imports the necessary Python libraries, including Seaborn, Matplotlib, and various components from PySpark, and creates a Spark session named "Lab-8". Finally, it reads a CSV file ("train.csv") into a PySpark DataFrame named df. header=True indicates that the first row contains column names, and inferSchema=True automatically infers the data type of each column.
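The mounting and installation steps mentioned above are not shown in the snippet; a minimal sketch of how they typically look in Google Colab (assuming the notebook runs in Colab, as the /content/drive path suggests):

# mount Google Drive so the CSV is reachable under /content/drive
from google.colab import drive
drive.mount('/content/drive')

# install PySpark in the Colab runtime
!pip install pyspark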
Code
from pyspark.ml.feature import RegexTokenizer

# split the raw text into lowercase word tokens on non-word characters
tokenizer = RegexTokenizer(inputCol="Text", outputCol="words",
                           pattern="\\W")
tokenized_data = tokenizer.transform(df)  # intermediate name assumed
Output
Description
This code tokenizes the text column of the PySpark DataFrame df, splitting each document into a list of words. This is the first preprocessing step in preparing the data for text-based analysis or natural language processing (NLP) tasks.
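To make the tokenization step concrete, here is a small illustrative sketch on toy data (the sample sentence and variable names are made up; RegexTokenizer lower-cases by default, and pattern="\\W" splits on non-word characters):

sample = spark.createDataFrame([("Stocks rally on Wall Street!",)], ["Text"])
tokenizer.transform(sample).select("words").show(truncate=False)
# roughly: [stocks, rally, on, wall, street]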
Code
from pyspark.ml.feature import StopWordsRemover

stopwords_remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filtered_data = stopwords_remover.transform(tokenized_data)
Output
Description
This code continues the preprocessing by removing common English stop words from the tokenized text. It adds a new column ("filtered") containing the token lists with the stop words removed.
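StopWordsRemover falls back on a built-in English stop word list; a quick sketch to inspect it and see the effect on the toy example above (sample tokens assumed):

# the default English stop word list ("the", "on", "a", ...)
print(StopWordsRemover.loadDefaultStopWords("english")[:10])

# applied to the toy tokens, "on" would be dropped:
# [stocks, rally, on, wall, street] -> [stocks, rally, wall, street]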
Code
from pyspark.ml.feature import HashingTF, IDF

# term frequency via the hashing trick
hashing_tf = HashingTF(inputCol="filtered",
                       outputCol="raw_features",
                       numFeatures=10000)
featurized_data = hashing_tf.transform(filtered_data)
# inverse document frequency (the IDF definition was omitted in the snippet)
idf = IDF(inputCol="raw_features", outputCol="features")
idf_vectorizer = idf.fit(featurized_data)
rescaled_data = idf_vectorizer.transform(featurized_data)
# top 20 rows
rescaled_data.select("label", "Text", "words", "filtered", "features").show()
Output
Description
This code segment converts the text data into numerical features using the TF (term frequency) and IDF (inverse document frequency) transformations. The resulting DataFrame (rescaled_data) contains a new column ("features") holding TF-IDF vectors, which are commonly used as input features for machine learning models in text analysis tasks.
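For reference, Spark MLlib's IDF uses a smoothed logarithm, so each raw term frequency is rescaled as sketched below (the corpus size and document frequency here are made-up numbers):

import math
# Spark's weighting: idf(t) = ln((m + 1) / (df(t) + 1)),
# where m is the number of documents and df(t) the term's document frequency
m, df_t = 1000, 20
idf_weight = math.log((m + 1) / (df_t + 1))   # ~3.86
tfidf = 3 * idf_weight                        # raw TF of 3 -> ~11.6 in "features"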
Code
# Split Train/Test data
(train, test) = rescaled_data.randomSplit([0.75, 0.25], seed=202)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))
Output
Description
This code performs a random split of the rescaled_data DataFrame into training
and test datasets, allocating 75% of the data for training and 25% for testing.
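Note that randomSplit treats the weights as expected fractions, so the printed counts will be only approximately 75%/25%; the fixed seed makes the split reproducible, as this small sketch (variable names assumed) illustrates:

# re-splitting with the same seed on the same DataFrame yields the same partition
(train_again, test_again) = rescaled_data.randomSplit([0.75, 0.25], seed=202)
print(train_again.count() == train.count())   # True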
Code
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features",
                        labelCol="label",
                        family="multinomial",
                        regParam=0.3,
                        elasticNetParam=0,
                        maxIter=50)
# fit on the training split and predict on the test split (steps omitted above)
lr_model = lr.fit(train)
predictions = lr_model.transform(test)
predictions.show()
Output
Description
This code segment trains a logistic regression model using the training set, makes
predictions on the test set, and displays the top 20 rows of the predictions
DataFrame.
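Once fitted, the multinomial model exposes one coefficient vector per class, which makes for a quick sanity check; a brief sketch using the PySpark model attributes (lr_model matches the reconstructed snippet above):

# one row of coefficients per class, one intercept per class
print(lr_model.coefficientMatrix.numRows, "classes x",
      lr_model.coefficientMatrix.numCols, "features")
print(lr_model.interceptVector)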
Code
# to evaluate the model on the test predictions
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print("Test Accuracy = %g" % accuracy)
Description
This code segment calculates and prints the accuracy of the logistic regression
model on the test dataset.
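The same evaluator can report other multiclass metrics by overriding metricName; a short sketch (metric names are from the PySpark API):

for metric in ["f1", "weightedPrecision", "weightedRecall"]:
    score = evaluator.evaluate(predictions, {evaluator.metricName: metric})
    print(metric, "=", score)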
Code
labels = ["World", "Sports", "Business", "Science"]
# figure object
_ = plt.figure(figsize=(7, 7))
Description
This code visualizes the confusion matrix for the classification model's predictions on the test dataset.
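The snippet above only sets up the class labels and the figure; a fuller sketch of how the confusion matrix could be built and drawn with pandas and Seaborn (variable names follow the earlier snippets, and the class-to-index mapping is assumed):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# collect (label, prediction) pairs from the test predictions
pred_pd = predictions.select("label", "prediction").toPandas()
cm = pd.crosstab(pred_pd["label"], pred_pd["prediction"])

_ = plt.figure(figsize=(7, 7))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()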
Code
from pyspark.ml import Pipeline

# load dataset
df = spark.read.csv("/content/drive/MyDrive/Big data lab/Lab-5/train.csv",
                    inferSchema=True, header=True)
# tokenizer
tokenizer = RegexTokenizer(inputCol="Text", outputCol="words",
                           pattern="\\W")
# stopwords
stopwords_remover = StopWordsRemover(inputCol="words",
                                     outputCol="filtered")
# term frequency
hashing_tf = HashingTF(inputCol="filtered",
                       outputCol="raw_features",
                       numFeatures=10000)
# inverse document frequency (stage implied by the description below)
idf = IDF(inputCol="raw_features", outputCol="features")
# model
lr = LogisticRegression(featuresCol="features",
                        labelCol="label",
                        family="multinomial",
                        regParam=0.3,
                        elasticNetParam=0,
                        maxIter=50)
# assemble all stages into a single pipeline
pipeline = Pipeline(stages=[tokenizer, stopwords_remover, hashing_tf, idf, lr])
Output
Description
This code sets up a pipeline to preprocess text data and train a logistic regression
model for classification. The pipeline includes tokenization, stopwords removal, TF-
IDF vectorization, and model training stages.
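As a usage sketch, the assembled pipeline can be fit on a raw training split and applied to a held-out split in one step (the split variable names here are assumed):

(train_raw, test_raw) = df.randomSplit([0.75, 0.25], seed=202)
pipeline_model = pipeline.fit(train_raw)
pipeline_model.transform(test_raw).select("Text", "label", "prediction").show(5)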