pyspark-nlp-from-scratch

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

!pip install pyspark

import pandas as pd
import seaborn as sns

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = (
    SparkSession.builder.appName("ModelTraining")
    .config("spark.executor.memory", "6g")
    .getOrCreate()
)

import html

# Schema for the Sentiment140 tweet CSVs: polarity label, tweet id, timestamp, query, user handle, tweet text
schema = "polarity FLOAT, id LONG, date_time TIMESTAMP, query STRING, user STRING, text STRING"
# Timestamps in the CSV look like "Mon Apr 06 22:19:45 PDT 2009"; the legacy time parser is enabled so this pattern parses under Spark 3
timestampformat = "EEE MMM dd HH:mm:ss zzz yyyy"
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

IN_PATH_RAW = "/kaggle/input/twitter-nlp/training.1600000.processed.noemoticon.csv"
IN_PATH_TEST = "/kaggle/input/twitter-nlp/testdata.manual.2009.06.14.csv"
#OUT_PATH_CLEAN = "CLEAN"

spark_reader = spark.read.schema(schema)

user_regex = r"(@\w{1,15})"
hashtag_regex = r"(#\w{1,})"
url_regex = r"((https?|ftp|file):\/{2,3})+([-\w+&@#/%=~|$?!:,.]*)|(www.)+([-\w+&@#/%=~|$?!:,.]*)"
email_regex = r"[\w.-]+@[\w.-]+\.[a-zA-Z]{1,}"
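
As a quick sanity check (not part of the original notebook; the sample tweet below is made up), the same patterns can be exercised with Python's re module before they are handed to regexp_replace:

# Illustrative only: apply the cleaning patterns to a made-up tweet with plain re
import re

sample = "thanks @someuser, details at https://example.com/info or mail [email protected]"
cleaned = re.sub(url_regex, "", sample)
cleaned = re.sub(email_regex, "", cleaned)
cleaned = re.sub(user_regex, "", cleaned)
print(cleaned)  # mention, URL and e-mail address are stripped out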

@f.udf
def html_unescape(s: str):
    # Undo HTML entities such as &amp; and &lt; that appear in the raw tweet text
    if isinstance(s, str):
        return html.unescape(s)
    return s

def clean_data(df):
    # Keep the raw tweet, strip URLs, e-mails and @mentions, turn '#' into a space,
    # unescape HTML entities, and drop rows that end up empty
    df = (
        df
        .withColumn("original_text", f.col("text"))
        .withColumn("text", f.regexp_replace(f.col("text"), url_regex, ""))
        .withColumn("text", f.regexp_replace(f.col("text"), email_regex, ""))
        .withColumn("text", f.regexp_replace(f.col("text"), user_regex, ""))
        .withColumn("text", f.regexp_replace(f.col("text"), "#", " "))
        .withColumn("text", html_unescape(f.col("text")))
        .filter("text != ''")
    )
    return df

df_train_raw = spark_reader.csv(IN_PATH_RAW, timestampFormat=timestampformat)
df_train_clean = clean_data(df_train_raw)

df_test_raw = spark_reader.csv(IN_PATH_TEST, timestampFormat=timestampformat)
df_test_clean = clean_data(df_test_raw)

df_train_clean.show(10, True)
df_test_clean.show(10, True)

traindf = (
    df_train_clean
    # Keep only letters and apostrophes; digits and punctuation become spaces
    .withColumn("text", f.regexp_replace(f.col("text"), "[^a-zA-Z']", " "))
    # Collapse double/multiple spaces into one
    .withColumn("text", f.regexp_replace(f.col("text"), " +", " "))
    # Remove leading and trailing whitespace
    .withColumn("text", f.trim(f.col("text")))
    # Ensure we don't end up with empty rows
    .filter("text != ''")
)

# Keep only the columns used for training, collapse to a single partition and cache
traindata = traindf.select("text", "polarity").coalesce(1).cache()

df_test = (
    df_test_clean
    # Keep only letters and apostrophes; digits and punctuation become spaces
    .withColumn("text", f.regexp_replace(f.col("text"), "[^a-zA-Z']", " "))
    # Collapse double/multiple spaces into one
    .withColumn("text", f.regexp_replace(f.col("text"), " +", " "))
    # Remove leading and trailing whitespace
    .withColumn("text", f.trim(f.col("text")))
    # Ensure we don't end up with empty rows
    .filter("text != ''")
)

testdata = df_test.select("text", "polarity").coalesce(1).cache()

%%time
from pyspark.ml.feature import (
    StopWordsRemover,
    Tokenizer,
    HashingTF,
    IDF,
)
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

tokenizer = Tokenizer(inputCol="text", outputCol="words1")

stopwords_remover = StopWordsRemover(
    inputCol="words1",
    outputCol="words2",
    stopWords=StopWordsRemover.loadDefaultStopWords("english"),
)

# Hashed term frequencies followed by inverse-document-frequency weighting (TF-IDF features)
hashing_tf = HashingTF(
    inputCol="words2",
    outputCol="term_frequency",
)
idf = IDF(
    inputCol="term_frequency",
    outputCol="features",
    minDocFreq=5,
)

lr = LogisticRegression(labelCol="polarity")

semantic_analysis_pipeline = Pipeline(
    stages=[tokenizer, stopwords_remover, hashing_tf, idf, lr]
)

semantic_analysis_model = semantic_analysis_pipeline.fit(traindata)

semantic_analysis_model.transform(testdata).show()
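
The notebook stops after transforming the test set. As a follow-up sketch that is not part of the original, the transformed test DataFrame could be scored with Spark's built-in MulticlassClassificationEvaluator, assuming the polarity column holds the integer class labels:

# Follow-up sketch (not in the original notebook): accuracy of the fitted pipeline on the test tweets
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = semantic_analysis_model.transform(testdata)
evaluator = MulticlassClassificationEvaluator(
    labelCol="polarity",
    predictionCol="prediction",
    metricName="accuracy",
)
print("test accuracy:", evaluator.evaluate(predictions))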
