Apriori Algorithm in SQL, PL/SQL and Spark SQL
********************************************************************************
This script is provided for educational purposes only.
import time

start_time = time.time()
"""
Previously I presented a document with the title "Apriori Algorithm in pl/sql",
wherein I used for loops and if/then/else logic in pl/sql to fetch frequent
patterns from a transactional dataset.
I was curious how to pull frequent pattern data from transactional data using
only sql commands, without any procedural logic.
This table join logic works fine for finding frequent patterns on smaller
datasets, but it takes more time and is not efficient for large amounts of data.
Using the libraries supplied with python, large amounts of data can be processed
efficiently in less time.
I am sure that in Oracle the join logic can be fine tuned using hints and the
tables' join order.
I created the file apriori_spark.csv with the following data and saved it in my
local folder "f:/data".
tran_id,item
1,curd
1,apple
1,fruit
1,lamp
1,pot
1,rose
1,jam
2,spoon
2,detergent
2,glue
2,hat
2,tape
2,umbrella
2,jam
3,jam
4,curd
4,apple
4,fruit
4,pot
4,rose
4,tape
5,fruit
5,banana
5,spoon
5,hat
5,ink
5,lamp
5,umbrella
6,curd
6,detergent
6,glue
6,hat
6,ink
6,jam
7,fruit
7,glue
7,tape
8,spoon
8,ink
8,rose
8,umbrella
9,curd
9,banana
9,ink
10,apple
10,spoon
10,ink
10,umbrella
10,jam
"""
#spark sql code starts here################################################################
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("Apriori").setMaster("local[*]")
sc = SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()  #SparkSession is needed for spark.read and spark.sql below
df = (spark.read.format("csv")
      .option("inferSchema", "false")
      .option("header", "true")
      .option("sep", ",")
      .load("f:/data/apriori_spark.csv"))
print(df.columns)
print(df.describe())
tranCount = df.count()
print(tranCount, len(df.columns))  #row count and column count of the dataframe
df.createOrReplaceTempView("data")
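#NOTE (assumed reconstruction): the pair query below expects an "item_agg" view holding
#one row per transaction with all of its items collected into an array. That statement,
#and a few intermediate ones whose output appears further down, are not shown in this
#excerpt; a sketch that matches the printed item_agg output could be:
spark.sql("select tran_id, sort_array(collect_list(item)) as item from data group by tran_id").createOrReplaceTempView("item_agg")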
#convert array type to str type using concat_ws in spark sql and search for frequent patterns
spark.sql("""with a as (select distinct item item from data order by 1),
t2 as (select t1.item item1,t2.item item2 from a t1, a t2 where
t1.item < t2.item order by 1,2)
select t2.item1,t2.item2,count(*) from t2,item_agg
where concat_ws(' ',item_agg.item) like '%'||t2.item1||'%'
and concat_ws(' ',item_agg.item) like '%'||t2.item2||'%'
group by t2.item1,t2.item2 having count(*)>1 order by 3
desc,1,2""").show(80,False)
print(time.time() - start_time) #check time taken for the script to run
#spark sql code ends here################################################################
"""
When I run the above script with the following command
spark-submit.cmd python/pysparkapriori.py
I get the output as shown below:
['tran_id', 'item']
DataFrame[summary: string, tran_id: string, item: string]
49 2
+-------+---------+
|tran_id| item|
+-------+---------+
| 1| curd|
| 1| apple|
| 1| fruit|
| 1| lamp|
| 1| pot|
| 1| rose|
| 1| jam|
| 2| spoon|
| 2|detergent|
| 2| glue|
| 2| hat|
| 2| tape|
| 2| umbrella|
| 2| jam|
| 3| jam|
| 4| curd|
| 4| apple|
| 4| fruit|
| 4| pot|
| 4| rose|
+-------+---------+
only showing top 20 rows
49
+-------+------------+
|tran_id|num_products|
+-------+------------+
| 1| 7|
| 10| 5|
| 2| 7|
| 4| 6|
| 5| 7|
| 6| 6|
| 7| 3|
| 8| 4|
| 9| 3|
+-------+------------+
+-------+---------+
|tran_id| item|
+-------+---------+
| 1| apple|
| 1| curd|
| 1| fruit|
| 1| jam|
| 1| lamp|
| 1| pot|
| 1| rose|
| 10| apple|
| 10| ink|
| 10| jam|
| 10| spoon|
| 10| umbrella|
| 2|detergent|
| 2| glue|
| 2| hat|
| 2| jam|
| 2| spoon|
| 2| tape|
| 2| umbrella|
| 4| apple|
+-------+---------+
only showing top 20 rows
+--------------------------------------------------+--------+
|item |count(1)|
+--------------------------------------------------+--------+
|[apple, ink, jam, spoon, umbrella] |1 |
|[banana, fruit, hat, ink, lamp, spoon, umbrella] |1 |
|[apple, curd, fruit, pot, rose, tape] |1 |
|[banana, curd, ink] |1 |
|[apple, curd, fruit, jam, lamp, pot, rose] |1 |
|[detergent, glue, hat, jam, spoon, tape, umbrella]|1 |
|[ink, rose, spoon, umbrella] |1 |
|[fruit, glue, tape] |1 |
|[curd, detergent, glue, hat, ink, jam] |1 |
+--------------------------------------------------+--------+
+---------+--------+--------+
|item1 |item2 |count(1)|
+---------+--------+--------+
|spoon |umbrella|4 |
|ink |spoon |3 |
|ink |umbrella|3 |
|apple |curd |2 |
|apple |fruit |2 |
|apple |jam |2 |
|apple |pot |2 |
|apple |rose |2 |
|banana |ink |2 |
|curd |fruit |2 |
|curd |ink |2 |
|curd |jam |2 |
|curd |pot |2 |
|curd |rose |2 |
|detergent|glue |2 |
|detergent|hat |2 |
|detergent|jam |2 |
|fruit |lamp |2 |
|fruit |pot |2 |
|fruit |rose |2 |
|fruit |tape |2 |
|glue |hat |2 |
|glue |jam |2 |
|glue |tape |2 |
|hat |ink |2 |
|hat |jam |2 |
|hat |spoon |2 |
|hat |umbrella|2 |
|ink |jam |2 |
|jam |spoon |2 |
|jam |umbrella|2 |
|pot |rose |2 |
+---------+--------+--------+
Here I will generate the frequent patterns with the Apriori algorithm in sql, using table joins.
connect veeksha/saketh
The csv file can be read into an Oracle external table to process the data.
Instead, I preferred to insert the data into a table using insert commands.
You may also load the csv file via SQL*Loader.
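The insert statements themselves are not reproduced in this excerpt; a minimal sketch of
the table and a couple of the inserts, assuming the same "apriori" table referenced in the
queries below, would look like this:
create table apriori (tran_id number, item varchar2(20));
insert into apriori values (1, 'curd');
insert into apriori values (1, 'apple');
-- ... one insert per (tran_id, item) row of apriori_spark.csv ...
commit;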
49 rows selected.
32 rows selected.
I verified that the final results from the spark-sql script output and the sql query
output match.
Here I used the 19c feature of creating a function in the "with" clause of my select
query and pulled the data:
WITH
FUNCTION get_str(id IN NUMBER) RETURN varchar2 IS
v_str varchar2(200):='';
BEGIN
for c1 in (select item from apriori where tran_id = id order by 1) loop
v_str:=v_str||' '||c1.item;
end loop;
RETURN v_str;
END;
t1 as (SELECT get_str(tran_id) item
FROM (select distinct tran_id from apriori order by tran_id)),
t2 as (select distinct item item from apriori order by item),
t3 as (select a1.item item1,a2.item item2 from t2 a1,t2 a2 where a1.item < a2.item)
select t3.item1, t3.item2,count(*) from t3,t1
where t1.item like '%'||t3.item1||'%'||t3.item2||'%'
group by t3.item1,t3.item2 having count(*)>1 order by 3 desc,1,2;
32 rows selected.
I don't like using complex aggregate functions or the 19c WITH clause FUNCTION feature
to process the transaction data and find frequent patterns with the apriori algorithm.
Can I just use the transactional table's data, as it is, and produce the frequent
patterns?
The following is the simplest solution for producing frequent patterns with the apriori
algorithm.
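The query itself is omitted from this excerpt; a plain self-join of the apriori table
against itself, along the following lines (a sketch, not necessarily the original query),
yields the same item pairs:
select a.item item1, b.item item2, count(*)
from apriori a, apriori b
where a.tran_id = b.tran_id
and a.item < b.item
group by a.item, b.item
having count(*) > 1
order by 3 desc, 1, 2;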
32 rows selected.
Please note that the three sql queries above produce identical results.
17 rows selected.
Sql is very powerful; you can write the most complex joins and solve complicated logic with it.
6 rows selected.
This simplified table join is used to find the frequent patterns of 4 items sold
together.
The same join can be used in a package to find the frequent patterns of n items sold
together in a dataset.
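The 4-item query is not reproduced in this excerpt; a sketch extending the pair
self-join above to four copies of the table (hypothetical, not the original query)
illustrates the idea:
select a.item item1, b.item item2, c.item item3, d.item item4, count(*)
from apriori a, apriori b, apriori c, apriori d
where a.tran_id = b.tran_id
and b.tran_id = c.tran_id
and c.tran_id = d.tran_id
and a.item < b.item
and b.item < c.item
and c.item < d.item
group by a.item, b.item, c.item, d.item
having count(*) > 1
order by 5 desc, 1, 2, 3, 4;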
end if;
v_nut:=c1.tran_id;
v_stt:=c1.item;
v_str:=c1.item;
end if;
end loop;
end if;
if (v_num > 1) then
dbms_output.put_line(v_num||' -- '||c0.item1||' '||c0.item2);
end if;
end loop;
end;
Statement processed.
2 -- apple curd
2 -- apple fruit
2 -- apple jam
2 -- apple pot
2 -- apple rose
2 -- banana ink
2 -- curd fruit
2 -- curd ink
2 -- curd jam
2 -- curd pot
2 -- curd rose
2 -- detergent glue
2 -- detergent hat
2 -- detergent jam
2 -- fruit lamp
2 -- fruit pot
2 -- fruit rose
2 -- fruit tape
2 -- glue hat
2 -- glue jam
2 -- glue tape
2 -- hat ink
2 -- hat jam
2 -- hat spoon
2 -- hat umbrella
2 -- ink jam
3 -- ink spoon
3 -- ink umbrella
2 -- jam spoon
2 -- jam umbrella
2 -- pot rose
4 -- spoon umbrella
Please note that if we are mining for frequent patterns of three or more items together,
some of the above scripts' logic changes.
Most of these scripts hold good for a frequent pattern search of two items in a given
dataset.
Please note that I chose a transactional table that stores item details in multiple
rows, one item per row.
If we need to process data that holds multiple items in a single row, the scripts need
to be modified accordingly, as sketched below.
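For example, if each row held a whole basket as a comma-separated string, a Spark SQL
reshaping step along these lines (assuming a hypothetical raw_baskets table with columns
tran_id and items) would bring the data back to one item per row, after which the
queries above apply unchanged:
select tran_id, explode(split(items, ',')) as item
from raw_baskets;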
Next steps:
I tested these scripts with a smaller dataset.
Benchmark all the available options on a larger dataset and choose the best method.
Fine tune the sql queries to perform better, using hints, modifying the tables' join
order, etc.
Implement code changes to generate the support and confidence numbers (see the sketch
after this list).
I implemented the sql query with the listagg option in spark sql; next, try the other
query options in spark sql and generate the output.
Using pl/sql in spark sql is not supported; convert the pl/sql logic into hpl/sql and
use it alongside spark sql to generate the output.
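As a hedged sketch of that support/confidence step (not implemented in this document),
both measures could be derived from the same pair self-join, dividing the pair count by
the number of distinct transactions and by the item1 transaction count respectively:
with pairs as
 (select a.item item1, b.item item2, count(*) cnt
    from apriori a, apriori b
   where a.tran_id = b.tran_id and a.item < b.item
   group by a.item, b.item),
singles as
 (select item, count(distinct tran_id) cnt from apriori group by item),
trans as
 (select count(distinct tran_id) n from apriori)
select p.item1, p.item2,
       round(p.cnt / t.n, 2) support,      -- share of all transactions containing both items
       round(p.cnt / s.cnt, 2) confidence  -- confidence of the rule item1 => item2
  from pairs p, singles s, trans t
 where s.item = p.item1
 order by 3 desc, 4 desc, 1, 2;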
Happy scripting.
References:
http://www.orafaq.com/node/3138
http://www.orafaq.com/node/3222 (to install and set up spark on my desktop)
"""