18-Sub-Modular Functions
Date:
▪ Monday, March 11, 2:00 PM – Wednesday, March 13, 2:00 PM Pacific Time
▪ Logistics:
▪ Administered on Gradescope
▪ 3 hours long (timer starts once you open the exam)
▪ Submitting answers (all questions visible at the same time):
One PDF for the entire exam (uploaded at the top of the exam)
One PDF for each question (uploaded to each question)
▪ You can do this as you go through the questions (do not need to
wait until the end)
Write answers directly in text boxes
▪ Please budget your time for submission (~10 min) and solve
questions you find easy first – the exam tends to be on the
longer side
3/7/2024 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
If you think a question isn't clear on the
exam...
▪ Ask on Ed or state your (reasonable
and valid) assumptions in your answer
▪ We will actively monitor Ed on...
▪ Monday: 2 PM – 10 PM PT
▪ Tuesday: 8 AM – 3 PM, 5 PM – 10 PM PT
▪ Wednesday: 8 AM – 2 PM PT
▪ We will answer clarifying questions only
Exam Review Session: Friday, 6 PM – 7 PM PT
via Zoom (see Ed, Canvas for details)
Final exam is open book and open notes
A calculator or computer is REQUIRED
▪ You may only use your computer to do arithmetic
calculations (i.e., the buttons found on a standard
scientific calculator)
▪ You may also use your computer to read course
notes or the textbook
▪ No use of AI chatbots (including, but not limited
to, ChatGPT)
▪ No collaboration with other students
Practice finals are posted on Ed, Gradescope
Good luck with the exam! ☺
You Have Done a Lot!!!
And (hopefully) learned a lot!!!
▪ Answered questions and
proved many interesting results
▪ Implemented a number of methods
Q: What is being covered?
A: Concepts (In our case: Named entities)
France Mali Hagel Pentagon Obama Romney Zero Dark Thirty Argo NFL
$F(A) = \left| \bigcup_{i \in A} X_i \right|$
Goal: We want to $\max_{|A| \le k} F(A)$
Note: $F(A)$ is a set function: $F(A): \text{Sets} \to \mathbb{N}$
Given universe of elements $W = \{w_1, \dots, w_n\}$
and sets $X_1, \dots, X_m \subseteq W$
[Figure: sets $X_1$, $X_2$, $X_3$, $X_4$ drawn as regions inside the universe $W$]
Example:
▪ Eval. $F(\{d_1\}), \dots, F(\{d_m\})$, pick best (say $d_1$)
▪ Eval. $F(\{d_1\} \cup \{d_2\}), \dots, F(\{d_1\} \cup \{d_m\})$, pick best (say $d_2$)
▪ Eval. $F(\{d_1, d_2\} \cup \{d_3\}), \dots, F(\{d_1, d_2\} \cup \{d_m\})$, pick best
▪ And so on…
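The evaluate-and-pick-best steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the names `coverage` and `greedy_max_coverage` and the toy documents are hypothetical, and $F$ is assumed to be plain set coverage (the number of distinct concepts covered).

```python
from typing import Dict, List, Set


def coverage(sets: Dict[str, Set[str]], chosen: List[str]) -> int:
    """F(A): number of distinct elements covered by the chosen sets."""
    covered: Set[str] = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


def greedy_max_coverage(sets: Dict[str, Set[str]], k: int) -> List[str]:
    """Pick up to k sets, each time adding the set with the largest gain in F."""
    chosen: List[str] = []
    for _ in range(min(k, len(sets))):
        base = coverage(sets, chosen)
        best_name, best_gain = None, -1
        for name in sets:
            if name in chosen:
                continue
            gain = coverage(sets, chosen + [name]) - base
            if gain > best_gain:  # strict: ties go to the earlier set
                best_name, best_gain = name, gain
        chosen.append(best_name)
    return chosen


# Hypothetical toy data: each document is the set of concepts it mentions.
docs = {
    "d1": {"France", "Mali", "Hagel", "Pentagon"},
    "d2": {"Obama", "Romney"},
    "d3": {"France", "Mali"},
    "d4": {"Argo", "NFL", "Obama"},
}
picked = greedy_max_coverage(docs, 2)
print(picked, coverage(docs, picked))  # ['d1', 'd4'] 7
```

Each round re-evaluates $F$ for every remaining document, so this naive version performs on the order of $k \cdot m$ evaluations of $F$.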
Goal: Maximize the covered area
Greedy produces a solution $A$
where: $F(A) \ge (1 - 1/e) \cdot OPT$ (i.e., $F(A) \ge 0.63 \cdot OPT$)
[Nemhauser, Fisher, Wolsey ’78]
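To make the guarantee concrete, the sketch below compares greedy against exhaustive search on a tiny hand-made instance where greedy is not optimal. The three sets and the helper names `F` and `greedy` are hypothetical; the example only checks the $(1 - 1/e)$ bound on this one instance and proves nothing in general.

```python
import math
from itertools import combinations

# Hypothetical instance: A looks best at first, but B and C together cover more.
sets = {"A": {1, 2, 4, 5}, "B": {1, 2, 3}, "C": {4, 5, 6}}


def F(chosen):
    """Coverage: number of distinct elements in the union of the chosen sets."""
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


def greedy(k):
    """Add, k times, the set that maximizes F of the enlarged selection."""
    chosen = []
    for _ in range(k):
        candidates = [n for n in sets if n not in chosen]
        chosen.append(max(candidates, key=lambda n: F(chosen + [n])))
    return chosen


k = 2
greedy_value = F(greedy(k))                               # greedy takes A first
opt_value = max(F(c) for c in combinations(sets, k))      # exhaustive search
print(greedy_value, opt_value)                            # 5 6
print(greedy_value >= (1 - 1 / math.e) * opt_value)       # True
```

Here greedy reaches 5 of the optimal 6, a ratio of about 0.83, comfortably above the worst-case 0.63.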
Definition:
Set function F(·) is called submodular if:
For all $A, B \subseteq W$:
$F(A) + F(B) \ge F(A \cup B) + F(A \cap B)$
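On a small ground set the definition can be checked by brute force over all pairs of subsets. The sketch below does this for a coverage function; the sets in `X` and the helper names `powerset` and `is_submodular` are hypothetical, and the check is exponential in the ground-set size, so it is only meant for toy examples.

```python
from itertools import combinations

# Hypothetical sets indexed by 1..3; F(A) counts the elements covered by A.
X = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}


def F(A):
    """Coverage function: size of the union of the sets indexed by A."""
    return len(set().union(*(X[i] for i in A)))


def powerset(items):
    """All subsets of items, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_submodular(func, ground):
    """Check F(A) + F(B) >= F(A | B) + F(A & B) for every pair of subsets."""
    subsets = powerset(ground)
    return all(func(A) + func(B) >= func(A | B) + func(A & B)
               for A in subsets for B in subsets)


print(is_submodular(F, X))                        # True: coverage passes
print(is_submodular(lambda A: len(A) ** 2, X))    # False: |A|^2 fails
```

The second call shows a function that violates the inequality: with $A = \{1\}$ and $B = \{2\}$, $1 + 1 < 4 + 0$.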
Diminishing returns characterization
Equivalent definition:
Set function F(·) is called submodular if:
For all $A \subseteq B$:
▪ $A$ + $d$: large improvement
▪ $B$ + $d$: small improvement
$F(\cdot)$ is submodular: for $A \subseteq B$
$F(A \cup \{d\}) - F(A) \ge F(B \cup \{d\}) - F(B)$
(left side: gain of adding $d$ to a small set; right side: gain of adding $d$ to a large set)
Natural example:
▪ Sets $d_1, \dots, d_m$
▪ $F(A) = \left| \bigcup_{i \in A} d_i \right|$ (size of the covered area)
▪ Claim: $F(A)$ is submodular!
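A short derivation of the claim, offered as a sketch in the notation above: take index sets $A \subseteq B$ and one more set $d$. The gain from adding $d$ is the part of $d$ that is not yet covered, and that uncovered part can only shrink as the selection grows.

```latex
\begin{align*}
F(A \cup \{d\}) - F(A)
  &= \Bigl| d \setminus \bigcup_{i \in A} d_i \Bigr| \\
  &\ge \Bigl| d \setminus \bigcup_{i \in B} d_i \Bigr|
     && \text{since } \bigcup_{i \in A} d_i \subseteq \bigcup_{i \in B} d_i \\
  &= F(B \cup \{d\}) - F(B).
\end{align*}
```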
Submodularity is a discrete analogue of concavity
[Figure: $F(\cdot)$ plotted against $|A|$, marking $F(A \cup \{d\})$, $F(B)$, and $F(B \cup \{d\})$ for $A \subseteq B$]
Let $F_1, \dots, F_m$ be submodular and $\lambda_1, \dots, \lambda_m > 0$;
then $F(A) = \sum_{i=1}^{m} \lambda_i F_i(A)$ is submodular
▪ Submodularity is closed under non-negative
linear combinations!
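The closure property follows in one line from the diminishing-returns form. As a sketch, for $A \subseteq B$ and any element $d$:

```latex
\begin{align*}
F(A \cup \{d\}) - F(A)
  &= \sum_{i=1}^{m} \lambda_i \bigl[ F_i(A \cup \{d\}) - F_i(A) \bigr] \\
  &\ge \sum_{i=1}^{m} \lambda_i \bigl[ F_i(B \cup \{d\}) - F_i(B) \bigr] \\
  &= F(B \cup \{d\}) - F(B).
\end{align*}
```

The inequality holds term by term: each bracket shrinks when $A$ is replaced by $B$ because $F_i$ is submodular, and multiplying by $\lambda_i \ge 0$ preserves its direction.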
Document coverage function:
$\text{cover}_d(c)$ = probability that document $d$ covers concept $c$
▪ $\text{cover}_d(c)$ can also model how relevant concept $c$ is for user $u$
Set coverage function:
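A minimal sketch of one way to build a set coverage function from the per-document probabilities. It assumes a noisy-or form, $\text{cover}_A(c) = 1 - \prod_{d \in A} (1 - \text{cover}_d(c))$, that is, the probability that at least one document in $A$ covers $c$ if documents cover concepts independently. The weighted objective $\sum_c w_c \, \text{cover}_A(c)$, the function names, and all numbers below are likewise assumptions made for illustration.

```python
from typing import Dict, Iterable

# Hypothetical per-document coverage probabilities cover_d(c).
cover: Dict[str, Dict[str, float]] = {
    "d1": {"France": 0.9, "Mali": 0.8},
    "d2": {"France": 0.5, "Obama": 0.7},
}


def set_coverage(cover: Dict[str, Dict[str, float]],
                 A: Iterable[str], c: str) -> float:
    """Assumed noisy-or: probability that at least one doc in A covers c."""
    miss = 1.0
    for d in A:
        miss *= 1.0 - cover[d].get(c, 0.0)  # docs not mentioning c never cover it
    return 1.0 - miss


def objective(cover: Dict[str, Dict[str, float]],
              A: Iterable[str], weights: Dict[str, float]) -> float:
    """Assumed objective: concept-weighted sum of set coverage."""
    A = list(A)
    return sum(w * set_coverage(cover, A, c) for c, w in weights.items())


weights = {"France": 0.5, "Mali": 0.25, "Obama": 0.25}  # hypothetical weights
print(set_coverage(cover, ["d1", "d2"], "France"))      # about 0.95
print(objective(cover, ["d1", "d2"], weights))          # about 0.85
```

Under this form a second document about France adds only $0.95 - 0.9 = 0.05$ coverage for that concept, which is the diminishing-returns behavior the greedy analysis relies on.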
The objective function is also submodular
▪ Intuitively, it has a diminishing returns property
▪ Greedy algorithm leads to a (1 – 1/e) ~ 63%
approximation, i.e., a near-optimal solution
Objective: pick k docs that cover most concepts
France Mali Hagel Pentagon Obama Romney Zero Dark Thirty Argo NFL
[Leskovec et al., KDD ’07]
By submodularity property:
$F(A_i \cup \{d\}) - F(A_i) \ge F(A_j \cup \{d\}) - F(A_j)$ for $i < j$
Observation: By submodularity:
For every $d \in D$, with $\Delta_i(d) = F(A_i \cup \{d\}) - F(A_i)$:
$\Delta_i(d) \ge \Delta_j(d)$ for $i < j$ since $A_i \subseteq A_j$
Marginal benefits $\Delta_i(d)$ only shrink (as $i$ grows)!
Selecting document $d$ in step $i$ covers at least as many words as selecting $d$ at step $j$ ($j > i$)
[Leskovec et al., KDD ’07]
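Idea:
▪ Use $\Delta_i$ as an upper bound on $\Delta_j$ ($j > i$)
Lazy Greedy:
[Figure: documents ranked by (upper bound on) marginal gain $\Delta_1$; the top document $a$ is selected, giving $A_1 = \{a\}$]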
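One common way to exploit the upper-bound idea is a priority queue of stale marginal gains: pop the document with the largest bound, recompute its gain only if the bound is out of date, and select it as soon as a fresh gain is still on top. The sketch below assumes plain set coverage as $F$; the name `lazy_greedy`, the evaluation counter, and the toy data are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple


def lazy_greedy(sets: Dict[str, Set[str]], k: int) -> Tuple[List[str], int]:
    """Lazy greedy for set coverage. Returns (chosen sets, gain re-evaluations)."""
    covered: Set[str] = set()
    chosen: List[str] = []
    # Max-heap via negated gains. Each entry: (-bound, name, step the bound is from).
    # With nothing chosen, the gain of a set is its size, so step-0 bounds are exact.
    heap = [(-len(s), name, 0) for name, s in sets.items()]
    heapq.heapify(heap)
    evaluations = 0
    while heap and len(chosen) < k:
        neg_bound, name, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # Fresh gain that beats every other upper bound: safe to select.
            chosen.append(name)
            covered |= sets[name]
        else:
            # Stale bound: recompute the true gain and push it back.
            gain = len(sets[name] - covered)
            evaluations += 1
            heapq.heappush(heap, (-gain, name, len(chosen)))
    return chosen, evaluations


# Hypothetical toy data: each document is the set of concepts it mentions.
docs = {
    "d1": {"France", "Mali", "Hagel", "Pentagon"},
    "d2": {"Obama", "Romney"},
    "d3": {"France", "Mali"},
    "d4": {"Argo", "NFL", "Obama"},
}
chosen, evaluations = lazy_greedy(docs, 2)
print(chosen, evaluations)  # ['d1', 'd4'] 1
```

In the second round only `d4` is re-evaluated: its refreshed gain of 3 still exceeds the stale bounds of 2 for `d2` and `d3`, so those two are never touched, whereas naive greedy would re-evaluate all three.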
Summary so far:
▪ Diversity can be formulated as a set cover
▪ Set cover is a submodular optimization problem
▪ Can be (approximately) solved using greedy algorithm
▪ Lazy-greedy gives significant speedup
[Figure: exhaustive search (all subsets), naive greedy, and lazy greedy compared against the number of blogs selected (1 to 10); y-axis from 0 to 400, lower is better]
But what about personalization?
[Figure: example recommendations, including "Election trouble" and "Sandy delays"]
We assumed same concept weighting for all users
[Figure: weights over the concepts France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL, shown for the article "France intervenes" and for two users, a politico and a movie buff]
Assume each user $u$ has a different preference
vector $w_c^{(u)}$ over concepts $c$
[Figure: articles "France intervenes", "Chuck for Defense", and "Argo wins big" shown against the concepts France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL]
Multiplicative Weights algorithm
▪ Assume each concept $c$ has weight $w_c$
▪ We recommend document $d$ and receive feedback, say $r = +1$ or $-1$
▪ Update the weights:
▪ For each $c \in X_d$ set $w_c = \beta^{r} w_c$
▪ If concept $c$ appears in doc $d$ and we received positive feedback ($r = +1$), then we increase the weight $w_c$ by multiplying it by $\beta$ ($\beta > 1$); otherwise we decrease the weight (divide by $\beta$)
▪ Normalize weights so that $\sum_c w_c = 1$
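The update rule above fits in a few lines. This is a sketch under stated assumptions: the function name `update_weights` is hypothetical, and $\beta = 2$ is an arbitrary illustrative default, since the slide only requires $\beta > 1$.

```python
from typing import Dict, Set


def update_weights(weights: Dict[str, float], doc_concepts: Set[str],
                   r: int, beta: float = 2.0) -> Dict[str, float]:
    """One multiplicative-weights step: w_c <- beta**r * w_c for c in X_d, then normalize."""
    if beta <= 1:
        raise ValueError("beta must be greater than 1")
    if r not in (1, -1):
        raise ValueError("feedback r must be +1 or -1")
    # Only concepts that appear in the recommended document are rescaled.
    updated = {c: (w * beta ** r if c in doc_concepts else w)
               for c, w in weights.items()}
    total = sum(updated.values())
    return {c: w / total for c, w in updated.items()}  # weights sum to 1 again


# Hypothetical uniform starting weights over four concepts.
w0 = {"France": 0.25, "Mali": 0.25, "Argo": 0.25, "NFL": 0.25}
liked = update_weights(w0, {"France", "Mali"}, r=+1)
disliked = update_weights(w0, {"France", "Mali"}, r=-1)
print(liked["France"], liked["Argo"])        # about 0.333 and 0.167
print(disliked["France"], disliked["Argo"])  # about 0.167 and 0.333
```

Because of the normalization step, positive feedback on a document also lowers the relative weight of every concept it does not contain.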
Steps of the algorithm:
1. Identify items to recommend from
2. Identify concepts [what makes items redundant?]
3. Weigh concepts by general importance
4. Define item-concept coverage function
5. Select items using probabilistic set cover
6. Obtain feedback, update weights