Problem: Calculate word co-occurrence.
Explanation: A large corpus contains a great many words, and counting how often words co-occur within some well-defined context is a meaningful statistic. For instance, we may want to know how many times word A appears in the same sentence as word B. This is especially useful in natural language processing, because co-occurrence counts reveal semantic correlations between words.
The definition of word co-occurrence can be a little confusing at first, so an example follows below.
Solution:
Since the corpus may be too large to fit in a single machine's memory, we only discuss MapReduce solutions here.
Given the string of words AABCDABD, set the window size to 2 (a word's neighbours are the words at most 2 positions away on either side).
1. Pairs solution
We scan the string and, for each position, collect the neighbouring words of the current word within the window. That is:
| current word | neighbour word pairs with counts |
|---|---|
| A | (<A,A>, 1) (<A,B>, 1) |
| A | (<A,A>, 1) (<A,B>, 1) (<A,C>, 1) |
| B | (<B,A>, 2) (<B,C>, 1) (<B,D>, 1) |
| C | (<C,A>, 2) (<C,B>, 1) (<C,D>, 1) |
| D | (<D,B>, 2) (<D,C>, 1) (<D,A>, 1) |
| A | (<A,C>, 1) (<A,D>, 2) (<A,B>, 1) |
| B | (<B,D>, 2) (<B,A>, 1) |
| D | (<D,A>, 1) (<D,B>, 1) |
Namely, the key-value design in the Pairs pattern is: key = (word_i, word_j), value = count (e.g. 1).
In the shuffle phase, pairs with the same key are sent to the same reducer, which sums their counts.
Thus, the final result is:
|   | A | B | C | D |
|---|---|---|---|---|
| A | 2 | 3 | 2 | 2 |
| B | 3 | 0 | 1 | 3 |
| C | 2 | 1 | 0 | 1 |
| D | 2 | 3 | 1 | 0 |
This is the final word co-occurrence matrix for window size 2.
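To make the Pairs flow concrete, here is a minimal sketch in plain Python that simulates the map, shuffle, and reduce steps locally on the example string; the names `pairs_map` and `pairs_reduce` are mine, not part of any MapReduce framework.

```python
from collections import defaultdict

def pairs_map(words, window=2):
    """Map step: for each word, emit ((word, neighbour), 1) for every
    neighbour within `window` positions on either side."""
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (w, words[j]), 1

def pairs_reduce(pairs):
    """Shuffle + reduce step: group by the composite key and sum the counts."""
    counts = defaultdict(int)
    for key, c in pairs:
        counts[key] += c
    return dict(counts)

words = list("AABCDABD")
cooccurrence = pairs_reduce(pairs_map(words))
print(cooccurrence[("A", "B")])  # 3, matching row A, column B of the matrix above
```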
2. Stripes solution
We scan the string and collect the neighbouring words of each word just as in the Pairs solution, but we use a different key-value design that keeps a word and all of its neighbours together.
Here are the details:
| current word | stripe (neighbour: count) |
|---|---|
| A | (A:1, B:1) |
| A | (A:1, B:1, C:1) |
| B | (A:2, C:1, D:1) |
| C | (A:2, B:1, D:1) |
| D | (A:1, B:2, C:1) |
| A | (B:1, C:1, D:2) |
| B | (A:1, D:2) |
| D | (A:1, B:1) |
Namely, the key-value design in the Stripes pattern is: key = (current word), value = an associative array of neighbour counts (w1:c1, w2:c2, ...).
Key-value pairs with the same key are distributed to the same reducer, which sums the stripes element-wise and produces the following output:
A -> (A:2, B:3, C:2, D:2)
B -> (A:3, C:1, D:3)
C -> (A:2, B:1, D:1)
D -> (A:2, B:3, C:1)
Therefore, the final word co-occurrence matrix is the same:

|   | A | B | C | D |
|---|---|---|---|---|
| A | 2 | 3 | 2 | 2 |
| B | 3 | 0 | 1 | 3 |
| C | 2 | 1 | 0 | 1 |
| D | 2 | 3 | 1 | 0 |
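Likewise, here is a minimal plain-Python sketch of the Stripes flow under the same assumptions; `stripes_map` and `stripes_reduce` are illustrative names, and a real job would emit each stripe as a serialized associative array.

```python
from collections import Counter, defaultdict

def stripes_map(words, window=2):
    """Map step: for each word, emit (word, stripe), where the stripe maps
    each neighbour within the window to its count."""
    for i, w in enumerate(words):
        stripe = Counter()
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                stripe[words[j]] += 1
        yield w, stripe

def stripes_reduce(keyed_stripes):
    """Shuffle + reduce step: element-wise sum of all stripes sharing a key."""
    merged = defaultdict(Counter)
    for w, stripe in keyed_stripes:
        merged[w].update(stripe)  # Counter.update adds counts element-wise
    return dict(merged)

words = list("AABCDABD")
matrix = stripes_reduce(stripes_map(words))
print(dict(matrix["A"]))  # {'A': 2, 'B': 3, 'C': 2, 'D': 2}
```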
Comparison
The Pairs pattern is easy to understand and implement, but its key is a composite (word_i, word_j): two key-value pairs can only be combined locally when this **composite** key matches exactly, which is rare. A local combiner therefore shrinks the mapper output very little, and a large amount of intermediate data must be shuffled, costing much I/O time.
In contrast, the Stripes pattern is a little harder to understand, but its key is simpler than Pairs': just the current word. It is far more likely that two records share the same key, so a local combiner can merge many of them, and far fewer key-value pairs are shipped from the mappers to the reducers. This saves shuffle and disk I/O, so Stripes is usually faster than Pairs (the sketch at the end of this section quantifies this on our example string).
However, the value in a Stripes record can become very long when the context is large; with a window size of, say, 1,000,000,000, a single stripe may hold counts for a huge number of distinct neighbours and no longer fit in memory.
When it comes to scalability, Pairs is therefore better than Stripes.
So in practice we should look for a middle ground between the two solutions.
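To quantify the combiner argument on our example string, here is a small illustrative check that reuses the `pairs_map` and `stripes_map` sketches defined above: it counts how many intermediate records each pattern produces and how many distinct keys a local combiner could merge them into.

```python
words = list("AABCDABD")

# Raw mapper output: Pairs emits one record per co-occurrence event,
# Stripes emits one record per word occurrence.
pair_records = list(pairs_map(words))      # 26 records, key = (word, neighbour)
stripe_records = list(stripes_map(words))  # 8 records, key = word

# A local combiner can only merge records whose keys are exactly equal.
distinct_pair_keys = len({k for k, _ in pair_records})       # 13 composite keys remain
distinct_stripe_keys = len({k for k, _ in stripe_records})   # only 4 keys remain

print(len(pair_records), "->", distinct_pair_keys)        # 26 -> 13
print(len(stripe_records), "->", distinct_stripe_keys)    # 8 -> 4
```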
Appendix: pseudo-code for the two patterns.
Pairs:
```
class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            for all term u ∈ Neighbors(w) do
                Emit(pair (w, u), count 1)        ▷ Emit count for each co-occurrence

class Reducer
    method Reduce(pair p, counts [c1, c2, ...])
        s ← 0
        for all count c ∈ counts [c1, c2, ...] do
            s ← s + c
        Emit(pair p, count s)                     ▷ Sum co-occurrence counts
```
Stripes:
```
class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            H ← new AssociativeArray
            for all term u ∈ Neighbors(w) do
                H{u} ← H{u} + 1                   ▷ Tally words co-occurring with w
            Emit(term w, stripe H)

class Reducer
    method Reduce(term w, stripes [H1, H2, H3, ...])
        Hf ← new AssociativeArray
        for all stripe H ∈ stripes [H1, H2, H3, ...] do
            Sum(Hf, H)                            ▷ Element-wise sum
        Emit(term w, stripe Hf)
```