Saturday, September 14, 2013

MapReduce Design Pattern -- Pairs and Stripes

Problem: Calculate word co-occurrence.
Explanation: A large corpus contains many words, and counting how often words co-occur within some well-defined context is a meaningful statistic. For instance, I may want to know how many times word A appears in a sentence in which word B also appears. This is especially useful in natural language processing, because co-occurrence counts reveal semantic correlations between words.
The definition of word co-occurrence may be a little confusing to newcomers, so I will give an example below.
Solution:
Since the corpus may be too large to process in a single machine's memory, we only discuss MapReduce solutions here.
Given a string of words, AABCDABD, set the window size to 2.
1. Pairs solution
We scan the string and, for each word, collect the neighboring words within the window. That is:
current word        emitted (pair, count) list
A                         (<A,A>, 1) (<A,B>, 1)
A                         (<A,A>, 1) (<A,B>, 1) (<A,C>, 1)
B                         (<B,A>, 2) (<B,C>, 1) (<B,D>, 1)
C                         (<C,A>, 2) (<C,B>, 1) (<C,D>, 1)
D                         (<D,B>, 2) (<D,C>, 1) (<D,A>, 1)
A                         (<A,C>, 1) (<A,D>, 2) (<A,B>, 1)
B                         (<B,D>, 2) (<B,A>, 1)
D                         (<D,A>, 1) (<D,B>, 1)

Namely, the key-value design in the Pairs pattern is: key = (word_i, word_j), value = count (e.g., 1).
Then, in the shuffle phase, pairs with the same key are sent to the same reducer, which sums their counts.
Thus, the final result is:
      A  B  C  D
A     2  3  2  2
B     3  0  1  3
C     2  1  0  1
D     2  3  1  0
Above is the final word co-occurrence matrix with window size 2.
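
As a sanity check, here is a quick single-machine Java sketch (my own addition, not the MapReduce job itself) that reproduces the matrix above for AABCDABD with window size 2:

import java.util.HashMap;
import java.util.Map;

public class CooccurrenceCheck {
    public static void main(String[] args) {
        String[] words = "A A B C D A B D".split(" ");
        int window = 2;
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < words.length; i++) {
            // neighbors are the words at most `window` positions away on either side
            for (int j = Math.max(0, i - window); j <= Math.min(words.length - 1, i + window); j++) {
                if (j == i) continue;
                String pair = words[i] + "," + words[j];
                Integer c = counts.get(pair);
                counts.put(pair, c == null ? 1 : c + 1);
            }
        }
        System.out.println(counts);  // prints A,B=3, B,D=3, A,A=2, ... matching the matrix
    }
}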

2. Stripes solution
We scan the string and collect the neighboring words of each word, as in the Pairs solution.
But we design a different key-value pattern to keep the relation between words.
Here are the details:
current word        emitted stripe (neighbor: count)
A                         (A:1, B:1)
A                         (A:1, B:1, C:1)
B                         (A:2, C:1, D:1)
C                         (A:2, B:1, D:1)
D                         (A:1, B:2, C:1)
A                         (B:1, C:1, D:2)
B                         (A:1, D:2)
D                         (A:1, B:1)
Namely, the key-value design in the Stripes pattern is: key = (current word), value = (w1:c1, w2:c2, ...), an associative array mapping each neighbor to its count.
Key-value pairs with the same key are distributed to the same reducer, which sums the stripes element-wise and produces the following output:
A -> (A:2, B:3, C:2, D:2)
B -> (A:3, C:1, D:3)
C -> (A:2, B:1, D:1)
D -> (A:2, B:3, C:1)
Therefore, the final word co-occurrence matrix is:
      A  B  C  D
A     2  3  2  2
B     3  0  1  3
C     2  1  0  1
D     2  3  1  0

Comparison
The Pairs pattern is easy to understand and implement, but its key is complex, which makes it hard to shrink the reducers' input with a local combiner: two key-value pairs can be combined locally only if their complex keys match exactly, which is rare. As a result, it costs a lot of I/O.
In contrast, the Stripes pattern is a little harder to understand, but its key is simpler than Pairs's, so local key-value pairs are easy to combine: it is much more likely that several stripes share the same word as key. When the mappers' output is shipped to the reducers, far fewer key-value pairs have to be transferred and processed, saving disk and network I/O. So Stripes is usually faster than Pairs.
However, the value in Stripes can become very long when the context is very big, for instance with a window size of 1,000,000,000; a single stripe may then not even fit in memory.
When it comes to scalability, then, Pairs is better than Stripes.
So we should find a middle ground between these two solutions (for example, splitting each stripe into several smaller sub-stripes).

Appendix: pseudo-code of the two patterns:
Pairs:
class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            for all term u ∈ Neighbors(w) do
                Emit(pair (w, u), count 1)        // emit count for each co-occurrence

class Reducer
    method Reduce(pair p, counts [c1, c2, ...])
        s ← 0
        for all count c ∈ counts [c1, c2, ...] do
            s ← s + c                             // sum co-occurrence counts
        Emit(pair p, count s)
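
For readers who want something executable, here is a minimal Hadoop (Java) sketch of the Pairs pattern. The WINDOW constant, the class names, and the "w,u" Text encoding of a pair are my own simplifications; a production job would normally define a custom WritableComparable pair type.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsJob {
    public static class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int WINDOW = 2;
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] terms = value.toString().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                // Neighbors(w): words at most WINDOW positions away, on both sides
                for (int j = Math.max(0, i - WINDOW);
                         j <= Math.min(terms.length - 1, i + WINDOW); j++) {
                    if (j == i) continue;
                    context.write(new Text(terms[i] + "," + terms[j]), ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int s = 0;
            for (IntWritable c : counts) s += c.get();   // sum co-occurrence counts
            context.write(pair, new IntWritable(s));
        }
    }
}

Because the reduce is a plain sum, SumReducer can also be registered as a combiner, though as discussed above it rarely finds matching keys to merge.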

Stripes:
class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            H ← new AssociativeArray
            for all term u ∈ Neighbors(w) do
                H{u} ← H{u} + 1                   // tally words co-occurring with w
            Emit(term w, stripe H)

class Reducer
    method Reduce(term w, stripes [H1, H2, H3, ...])
        Hf ← new AssociativeArray
        for all stripe H ∈ stripes [H1, H2, H3, ...] do
            Sum(Hf, H)                            // element-wise sum
        Emit(term w, stripe Hf)
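
And a matching Hadoop sketch of the Stripes pattern, using MapWritable as the associative array; as before, WINDOW and the class names are my assumptions:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesJob {
    public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
        private static final int WINDOW = 2;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] terms = value.toString().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                MapWritable stripe = new MapWritable();   // H <- new AssociativeArray
                for (int j = Math.max(0, i - WINDOW);
                         j <= Math.min(terms.length - 1, i + WINDOW); j++) {
                    if (j == i) continue;
                    Text u = new Text(terms[j]);
                    IntWritable old = (IntWritable) stripe.get(u);
                    stripe.put(u, new IntWritable(old == null ? 1 : old.get() + 1));
                }
                context.write(new Text(terms[i]), stripe);
            }
        }
    }

    public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text term, Iterable<MapWritable> stripes, Context context)
                throws IOException, InterruptedException {
            MapWritable sum = new MapWritable();
            for (MapWritable stripe : stripes) {          // element-wise sum of stripes
                for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                    IntWritable old = (IntWritable) sum.get(e.getKey());
                    int add = ((IntWritable) e.getValue()).get();
                    sum.put(e.getKey(), new IntWritable(old == null ? add : old.get() + add));
                }
            }
            context.write(term, sum);
        }
    }
}

Here StripesReducer can likewise serve as a combiner, and it merges far more often because many stripes share the same single-word key.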




Friday, April 5, 2013

An LDA-Based Blog Classification Algorithm


Problem: given a blog post by an ACMer, output the type of the post. There are three classes: 1. ACM-related (including problem-solving reports, contest impressions, retirement posts, etc.); 2. technical posts unrelated to ACM (e.g., networking, databases); 3. everything else (e.g., chatting about somebody's music).
Known information: the contents and titles of 5,000 blog posts (unlabeled).

Approach
Since the posts to classify are all written by ACMers, the domain is actually quite narrow and has its own characteristics: words like OJ and "algorithm" are regulars, and code appears in the articles frequently.

My solution has two levels: 1) judge the class directly from certain features (only effective for classes 1 and 2); 2) when 1) fails, fall back to a statistical, probability-based method (which one could pretentiously call a data-mining method). That is, vote using two conditions:
    a. the proportion of code in the article: the algorithm assumes that the higher the code ratio, the more likely the post belongs to class 1 or 2;
    b. a model trained on the training data, used to predict the class of a new blog post.

Level 1) essentially means that the author asserts, from personal experience, some conditions indicating class 1 or class 2; once an article satisfies one of them, its classification is finished.
In level 2), condition a is computed directly while reading the article in: count the ratio of code characters to the total characters of the post. The rule I set is: if this ratio is above 50%, count the post as class 1 (this only casts one vote for class 1 and does not by itself decide the class; likewise below); if the ratio is between 20% and 50%, count it as class 2; otherwise count it as class 3. If a post's votes for classes 1, 2 and 3 are tied, class 1 takes priority. This may look unreasonable, but given the nature of the posts being classified, the vast majority are class 1, so in practice it does not perform badly. A sketch of this vote appears right below.
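
A minimal Java sketch of the vote in condition a; the class and method names are mine, and the thresholds are the ones stated above:

public class CodeRatioVote {
    // codeChars and totalChars are assumed to be counted while reading the post in
    static int vote(int codeChars, int totalChars) {
        double ratio = totalChars == 0 ? 0.0 : (double) codeChars / totalChars;
        if (ratio > 0.5)  return 1;   // one vote for class 1 (ACM-related)
        if (ratio >= 0.2) return 2;   // one vote for class 2 (other technical)
        return 3;                     // one vote for class 3 (everything else)
    }
}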

The following describes the process of training the model.
    1. The training data are unlabeled blog posts, so the data must be preprocessed first, including: 1) removing HTML tags; 2) removing Chinese stop words.
    2. For the data cleaned in step 1, remove the code sections (code can be detected via HTML tags), then run word segmentation; the result serves as the input data for training the model.
    3. I use the Topic Modeling in Mallet to label the training data (Mallet's topic modeling is an LDA implementation). I set LDA to 10 topics, so every training document gets assigned to one of these 10 topics. Then I manually review which class (1, 2, or 3) each of the 10 topics belongs to; as you can see, this is actually very little work. At this point it should be obvious that LDA's main role here is simply to put class labels on the training data. Exactly; that is the only thing I use it for.
    4. Once the data have labels, things get easy: this is now the model-training part of a supervised classification problem. I use Naïve Bayesian classification. Bluntly, that means computing a pile of probabilities and using them at prediction time. What is computed is P[word | classID], the probability of a word given class classID. I only use relatively high-frequency words, where the definition of high-frequency was set casually. The other probability is P[classID], the prior probability of each class.
At this point the model is built; that is, the pile of probabilities has been computed.

The following describes how the model is used to predict a label.
    1. For an input blog post, first clean it (remove HTML and code; no word segmentation is needed, that would be too complicated).
    2. Compute the code ratio to get the first vote.
    3. Using the high-frequency words in the model, scan the blog post; whenever a high-frequency word occurs, factor in its P[word | classID]. Finally, take the classID that maximizes ∏ P[word | classID] * P[classID] as this post's label.
With that, the post's label is decided. In my final test the accuracy was about 70% (quite low, but the method above also has a lot of room for improvement). A sketch of this prediction step follows.
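
A minimal Java sketch of the prediction step; the class and field names are mine, and I use log probabilities instead of the raw product ∏ P[word | classID] * P[classID] to avoid floating-point underflow, which the text does not mention but any real implementation needs:

import java.util.List;
import java.util.Map;

public class NaiveBayesPredictor {
    Map<Integer, Double> pClass;                        // P[classID]
    Map<Integer, Map<String, Double>> pWordGivenClass;  // P[word | classID]

    NaiveBayesPredictor(Map<Integer, Double> pClass,
                        Map<Integer, Map<String, Double>> pWordGivenClass) {
        this.pClass = pClass;
        this.pWordGivenClass = pWordGivenClass;
    }

    int predict(List<String> postWords) {
        int best = 1;                                   // fall back to class 1
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c : pClass.keySet()) {
            double score = Math.log(pClass.get(c));
            for (String w : postWords) {
                Double p = pWordGivenClass.get(c).get(w);
                if (p != null) score += Math.log(p);    // only the model's high-frequency words
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}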

If you have read this far, you may still have some questions:
    1. What is LDA?
    2. What is Naïve Bayesian classification about?
    3. How do you process Chinese with Mallet?
Mallet does not support Chinese; in particular, it cannot segment Chinese text. The method I use is to do the word segmentation myself first, then encode the words as numbers and feed those to Mallet as input (a sketch of this encoding is below).
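
A minimal sketch of that encoding trick: map each segmented Chinese word to a numeric token so Mallet treats it as an ordinary word. The class name and the space-joined, "w"-prefixed output format are my assumptions:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NumericEncoder {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    String encode(List<String> segmentedWords) {
        StringBuilder line = new StringBuilder();
        for (String w : segmentedWords) {
            if (!ids.containsKey(w)) ids.put(w, ids.size());  // assign the next free id
            if (line.length() > 0) line.append(' ');
            line.append("w").append(ids.get(w));              // "w42" stays one Mallet token
        }
        return line.toString();                               // one encoded document per line
    }
}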
关于MalletTopic Modeling的用法,详见:http://lillianxiong.blogspot.com/2013/02/new-to-mallet-machine-learning.html

At the very end: honestly, I only did this problem because I want that Moleskine notebook. Please don't reject me~

Monday, February 25, 2013

New to Mallet: A Machine Learning Repository

In order to build an excellent exploration-based query recommendation, I have tried several methods. Today, under the guidance of Xian Wu, I started to give Mallet a try. It is an open-source machine learning toolkit developed by UMass.

Some points:
*As I use it on Windows, all the following is Windows-based.
(0) What is Topic Modeling?
It is a way to understand hundreds of documents based on the frequencies of the words in them. Note that term frequency is the only information we directly observe in the documents.
It is a probabilistic model that clusters text into several topics.
For more information, see: Probabilistic Topic Models.
(1) Install Mallet.
As described on Mallet's homepage, first download it from the site, then extract everything. You must put the extracted folder (i.e. mallet) in the C: directory, and you also need to set the environment variable %MALLET_HOME%='the path of mallet_2.0.7'. Note: in C:\..\.. (later, I found that this is not necessary).
(2) Data format:
    There are two ways to import training files.
    <1> import-dir 'path': each training text file is viewed as one document.
    <2> import-file 'path': each line of the training text is viewed as one document (a sample input is sketched below).
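
For reference, import-file by default expects each line to look like "instance_name label text..."; a made-up two-line sample (the label can be a dummy value when you only do topic modeling):

doc1 X this is the text of the first training document
doc2 X the second document also sits on a single line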
(3) The commands:
    a. import training data
        bin\mallet import-file --input path\filename --output path\outfilename.mallet --keep-sequence --remove-stopwords
    b. train a model
        bin\mallet train-topics --input path\outfilename.mallet --num-topics 100 --output-state path\topic-state.gz --output-topic-keys path\name_keys.txt --output-doc-topics path\name_composition.txt --inferencer-filename path\inferencer_name.mallet
    c. infer topics using the trained model
        bin\mallet infer-topics --input path\test_instance.mallet --inferencer path\inferencer_name.mallet --output-doc-topics path\result.doc

Notes:
1. In step c, the input file, i.e. the instances whose topics you want inferred, must first be processed into a feature-sequence file with the command: import-file --input ... --keep-sequence. Otherwise there will be a Java IO exception.
2. The output consists of vectors that include the instance ID, the label, and topicID + topicProbability pairs.
3. The commands Mallet provides are not limited to those listed above; use xxx --help to learn more.

References
[1]http://mallet.cs.umass.edu/topics.php
[2]http://programminghistorian.org/lessons/topic-modeling-and-mallet






Saturday, February 23, 2013

How to encode the output files to UTF-8 format with Java

By default, i.e. if you use FileWriter, the output file is in the platform's default (ANSI) encoding. If you want your output file in another encoding, wrap a FileOutputStream in an OutputStreamWriter.
Like this:

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outpath),"UTF-8"));

This works because OutputStreamWriter lets you specify the encoding of the output file.
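
A complete minimal example of the same idea; the output path is made up:

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class Utf8WriteDemo {
    public static void main(String[] args) throws IOException {
        String outpath = "out.txt";  // hypothetical output path
        BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(outpath), "UTF-8"));
        out.write("UTF-8 文本");     // non-ASCII text is preserved
        out.newLine();
        out.close();                 // flushes and releases the file handle
    }
}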

Tuesday, February 19, 2013

[C#] List Sort with Two Keywords

In C#, there is a collection type List<T> that can hold any data type; of course, all the items in one list have the same type. Here is my topic: how to sort a List<T>, where T is a user-defined class, by two or more keys?
I will show one way to do it.
Sample:

  class TopicCandidate : IComparable<TopicCandidate>
  {
   public string topic;
   public int length;
   public int weight;

   public TopicCandidate(string t, int l, int w)
   {
    this.topic = t;
    this.length = l;
    this.weight = w;
   }

   #region IComparable<TopicCandidate> Members
   public int CompareTo(TopicCandidate other)
   {
    // primary key: length; secondary key: weight
    if (this.length == other.length)
     return this.weight.CompareTo(other.weight);
    else
     return this.length.CompareTo(other.length);
   }
   #endregion
  }

This is a class I defined. I would like to sort a List<TopicCandidate> cand ordered first by length and second by weight. To do that, I implement the interface IComparable<TopicCandidate> on my class by implementing its CompareTo(...) method. In this way I can sort my list simply by calling List's Sort() method: cand.Sort(). That is all.

Sure, there are other ways to sort a List<T>, like lambda expressions and LINQ. After I learn them, I will add them to this blog.

P.S. Thanks to the article from which I learned this approach; it is a very clear and readable article about List sorting.