Lab Report 8

Code

from google.colab import drive
drive.mount('/content/drive')

!pip install pyspark

import seaborn as sns
import matplotlib.pyplot as plt

from pyspark.ml import Pipeline                    # pipeline to transform data
from pyspark.sql import SparkSession               # to initiate Spark
from pyspark.sql.types import FloatType
from pyspark.ml.feature import RegexTokenizer      # tokenizer
from pyspark.ml.feature import HashingTF, IDF      # vectorizers
from pyspark.ml.feature import StopWordsRemover    # to remove stop words
from pyspark.sql.functions import concat_ws, col   # to concatenate columns
from pyspark.ml.classification import LogisticRegression             # ML model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator  # to evaluate the model
from pyspark.mllib.evaluation import MulticlassMetrics               # performance metrics

spark = SparkSession.builder.appName("Lab-8").getOrCreate()

df = spark.read.csv(path="/content/drive/MyDrive/train.csv",
                    header=True, inferSchema=True)
df.show()
Output

Description
The code starts by mounting Google Drive so that files stored there can be accessed, then installs the PySpark library. It imports the required Python libraries, including Seaborn, Matplotlib, and several PySpark components, and creates a Spark session named "Lab-8". Finally, it reads the CSV file ("train.csv") into a PySpark DataFrame named df. header=True indicates that the first row contains column names, and inferSchema=True automatically infers the data type of each column.
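As a quick sanity check (not shown in the original output), the inferred schema can be printed to confirm that each column was read with the expected type:

# verify the column names and inferred data types
df.printSchema()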
Code

# Rename the 'Class Index' column to 'label'
df = df.withColumnRenamed('Class Index', 'label')

# Add a new column 'Text' by concatenating 'Title' and 'Description'
df = df.withColumn("Text", concat_ws(" ", "Title", 'Description'))

# Keep only the label and the combined text column
df = df.select('label', 'Text')

# Show the top 10 rows
df.show(10)

Output

Description
This code performs data preprocessing on the PySpark DataFrame df to prepare it for text-based analysis or natural language processing (NLP): it renames 'Class Index' to 'label', concatenates 'Title' and 'Description' into a single 'Text' column, and drops the original text columns.
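Optionally, the class distribution can be inspected at this point. This is a small sketch using the columns defined above and is not part of the original lab code:

# count how many articles fall under each class label
df.groupBy('label').count().orderBy('label').show()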
Code

# tokenize 'Text' into a 'words' column (needed by the stop word remover below)
tokenizer = RegexTokenizer(inputCol="Text", outputCol="words", pattern="\\W")
df = tokenizer.transform(df)

stopwords_remover = StopWordsRemover(inputCol="words",
                                     outputCol="filtered")

# add a 'filtered' column to df with stop words removed
df = stopwords_remover.transform(df)

df.select(['label', 'Text', 'words', 'filtered']).show(5)

Output

Description
This code continues the preprocessing for text analysis. It tokenizes the combined 'Text' column into a 'words' column and then applies stop word removal, adding a 'filtered' column that contains the tokens without common English stop words.
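By default, StopWordsRemover uses Spark's built-in English stop word list. If extra domain-specific words need to be dropped, the list can be extended; the extra words below are only illustrative and not part of the original code:

# start from Spark's default English stop words and append custom ones (illustrative)
custom_stopwords = StopWordsRemover.loadDefaultStopWords("english") + ["said", "reuters"]
stopwords_remover = StopWordsRemover(inputCol="words",
                                     outputCol="filtered",
                                     stopWords=custom_stopwords)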

Code
hashing_tf = HashingTF(inputCol="filtered",
                       outputCol="raw_features",
                       numFeatures=10000)

# add raw term-frequency features to df
featurized_data = hashing_tf.transform(df)

# Inverse document frequency
idf = IDF(inputCol="raw_features", outputCol="features")
idf_vectorizer = idf.fit(featurized_data)

# convert text to TF-IDF vectors
rescaled_data = idf_vectorizer.transform(featurized_data)

# top 20 rows
rescaled_data.select("label", 'Text', 'words', 'filtered',
                     "features").show()

Output

Description
This code segment converts the text data into numerical features. HashingTF maps each filtered token into one of 10,000 hash buckets and counts term frequencies (TF), and IDF rescales those counts by inverse document frequency. The resulting DataFrame (rescaled_data) contains a new column ("features") with TF-IDF vectors, which are commonly used as input features for machine learning models in text analysis tasks.
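To see what HashingTF and IDF actually do, a toy example can help: HashingTF hashes each token into one of numFeatures buckets and counts occurrences, and IDF then down-weights terms that appear in many documents. A minimal sketch on made-up data (not from the lab dataset), reusing the hashing_tf and idf objects defined above:

# toy corpus: two tiny "documents" already tokenized and filtered
toy = spark.createDataFrame(
    [(1, ["spark", "is", "fast"]),
     (2, ["spark", "handles", "big", "data"])],
    ["label", "filtered"])

toy_tf = hashing_tf.transform(toy)              # sparse term-frequency vectors
toy_tfidf = idf.fit(toy_tf).transform(toy_tf)   # rescaled TF-IDF vectors
toy_tfidf.select("raw_features", "features").show(truncate=False)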
Code
# Split Train/Test data
(train, test) = rescaled_data.randomSplit([0.75, 0.25], seed = 202)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Output

Description
This code performs a random split of the rescaled_data DataFrame into training and test datasets, allocating 75% of the data for training and 25% for testing; the fixed seed (202) makes the split reproducible.
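Since both splits are used more than once (for the counts here and for fitting and scoring later), caching them can avoid recomputing the upstream transformations. This is an optional optimization, not part of the original code:

# keep the splits in memory so repeated actions don't recompute the TF-IDF steps
train = train.cache()
test = test.cache()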

Code
lr = LogisticRegression(featuresCol='features',
                        labelCol='label',
                        family="multinomial",
                        regParam=0.3,
                        elasticNetParam=0,
                        maxIter=50)

# train the model on the training set
lrModel = lr.fit(train)

# get predictions for the test set
predictions = lrModel.transform(test)

# show the top 20 predictions
predictions.select("Text", 'probability', 'prediction', 'label').show()

Output
Description
This code segment trains a multinomial logistic regression model (L2 regularization with regParam=0.3, elasticNetParam=0, maxIter=50) on the training set, makes predictions on the test set, and displays the top 20 rows of the predictions DataFrame.
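If more detail about the fit is wanted, recent Spark versions expose a training summary on the fitted model; the attributes below assume Spark 2.3+ and are shown only as a sketch:

# inspect the training summary (available on recent Spark versions)
summary = lrModel.summary
print("Training accuracy:", summary.accuracy)
print("Number of training iterations:", len(summary.objectiveHistory))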

Code
# to evaluate the model
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")

# print the test metric (f1 by default)
print("Test-set Accuracy is : ", evaluator.evaluate(predictions))
Output

Description
This code segment evaluates the logistic regression model on the test dataset. Note that MulticlassClassificationEvaluator computes the F1 score by default, so to report accuracy explicitly the metricName="accuracy" option should be set, as sketched below.
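A sketch of requesting specific metrics by name instead of relying on the evaluator's default:

# request accuracy and F1 explicitly
acc_evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")
print("Test-set Accuracy:", acc_evaluator.evaluate(predictions))

f1_evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                 predictionCol="prediction",
                                                 metricName="f1")
print("Test-set F1:", f1_evaluator.evaluate(predictions))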

Code
labels = ["World", "Sports", "Business","Science"]

# important: need to cast to float type, and order by prediction, else it


won't work
preds_and_labels = predictions.select(['prediction','label']) \
.withColumn('label', col('label') \
.cast(FloatType())) \
.orderBy('prediction')
# generate metrics
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

# figure object
_ = plt.figure(figsize=(7, 7))

# plot confusion matrix


sns.heatmap(metrics.confusionMatrix().toArray(),
cmap='viridis',
annot=True,fmt='0',
cbar=False,
xticklabels=labels,
yticklabels=labels)
plt.show()
Output

Description
This code visualizes the confusion matrix for the classification model's predictions on the test dataset.
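Beyond the confusion matrix, the same MulticlassMetrics object can report per-class precision and recall; the sketch below assumes the labels in this dataset are the numeric class indices 1.0 to 4.0:

# per-class precision/recall from the metrics object (class indices assumed to be 1.0-4.0)
for class_index, name in zip([1.0, 2.0, 3.0, 4.0], labels):
    print(name,
          "precision:", round(metrics.precision(class_index), 3),
          "recall:", round(metrics.recall(class_index), 3))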
Code
# load dataset
df = spark.read.csv("/content/drive/MyDrive/Big data lab/Lab-5/train.csv",
                    inferSchema=True, header=True)

# Rename the 'Class Index' column to 'label'
df = df.withColumnRenamed('Class Index', 'label')

# Add a new column 'Text' by concatenating 'Title' and 'Description'
df = df.withColumn("Text", concat_ws(" ", "Title", 'Description'))

# Select the new text feature and the labels
df = df.select('label', 'Text')

# tokenizer
tokenizer = RegexTokenizer(inputCol="Text", outputCol="words",
                           pattern="\\W")

# stop word remover
stopwords_remover = StopWordsRemover(inputCol="words",
                                     outputCol="filtered")

# term frequency
hashing_tf = HashingTF(inputCol="filtered",
                       outputCol="raw_features",
                       numFeatures=10000)

# Inverse Document Frequency - vectorizer
idf = IDF(inputCol="raw_features", outputCol="features")

# model
lr = LogisticRegression(featuresCol='features',
                        labelCol='label',
                        family="multinomial",
                        regParam=0.3,
                        elasticNetParam=0,
                        maxIter=50)

# Put everything in a pipeline
pipeline = Pipeline(stages=[tokenizer,
                            stopwords_remover,
                            hashing_tf,
                            idf,
                            lr])

# Fit the pipeline to the training documents
pipelineFit = pipeline.fit(df)

# transform the training data with the fitted pipeline
dataset = pipelineFit.transform(df)

# show the top 10 rows with predictions
dataset.show(10)

Output

Description
This code sets up a pipeline to preprocess text data and train a logistic regression
model for classification. The pipeline includes tokenization, stopwords removal, TF-
IDF vectorization, and model training stages.
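One advantage of wrapping everything in a Pipeline is that the fitted stages can be saved and reloaded as a single unit. A sketch, with an illustrative Drive path rather than one used in the lab:

from pyspark.ml import PipelineModel

# persist the fitted pipeline (path is illustrative)
pipelineFit.write().overwrite().save("/content/drive/MyDrive/lab8_pipeline")

# reload it later and score new data with the same preprocessing + model
loaded = PipelineModel.load("/content/drive/MyDrive/lab8_pipeline")
loaded.transform(df).select("Text", "prediction").show(5)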
