Monday, February 25, 2013

New to Mallet: A Machine Learning Repository

In my search for a good exploratory query recommendation method, I have tried several approaches. Today, under the guidance of Xian Wu, I started trying out Mallet. It is an open-source machine learning toolkit developed at UMass.

Some points:
* I use it on Windows, so everything below is based on Windows.
(0) What is Topic Modeling?
It is a way to understand a large collection of documents based on the frequency of the words in them. Note that the only information we observe directly from the documents is the term frequency.
It is a probabilistic model that clusters text into several topics.
For more information, see: Probabilistic Topic Models.
(1) Install Mallet.
As described on Mallet's homepage, first download it from the site, then extract everything. Put the extracted folder (i.e. mallet) on the C: drive, and set the environment variable %MALLET_HOME%='the path of mallet_2.0.7'. Note: I first put it in C:\..\.., but later found that this is not necessary.
(2) Data format:
    There are two ways to import training files.
    <1> import-dir 'the path': each text file in the directory is treated as one document.
    <2> import-file 'path': each line of the file is treated as one document.
(3) The commands:
    a. import training data
        bin\mallet import-file --input path\filename --output path\outfilename.mallet --keep-sequence --remove-stopwords
    b. train a model
        bin\mallet train-topics --input path\outfilename.mallet --num-topics 100 --output-state path\topic-state.gz --output-topic-keys path\name_keys.txt --output-doc-topics path\name_composition.txt --inferencer-filename path\inferencer_name.mallet
    c. infer topics using the trained model (the inferencer file is the one written in step b)
        bin\mallet infer-topics --input path\test_instance.mallet --inferencer path\inferencer_name.mallet --output-doc-topics path\result.doc

Notes:
1. In step c, the input file (the instances whose topics you want to infer) must first be converted to a feature-sequence file with the command: import-file --input ... --keep-sequence. Otherwise, there will be a Java IO exception.
2. The output is a set of vectors, each containing the instance ID, the label, and pairs of topic ID and topic probability.
3. Mallet provides more commands than those listed above. Run xxx --help to learn more.
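
To make note 2 concrete, here is a minimal Java sketch of reading one line of such output. It assumes each line is whitespace-separated: the instance ID, then the label, then pairs of topic ID and probability. The exact layout may differ between Mallet versions, so check your own output file first. The class name DocTopicsLine and the sample line are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses one line of a doc-topics file, ASSUMING the layout
// "<instanceID> <label> <topicID> <probability> <topicID> <probability> ..."
class DocTopicsLine {
    final String instanceId;
    final String label;
    // Topic ID -> probability, kept in the order found on the line.
    final Map<Integer, Double> topicProbabilities = new LinkedHashMap<>();

    DocTopicsLine(String line) {
        String[] fields = line.trim().split("\\s+");
        // Two leading fields plus complete (topic, probability) pairs.
        if (fields.length < 2 || fields.length % 2 != 0) {
            throw new IllegalArgumentException("Unexpected line layout: " + line);
        }
        instanceId = fields[0];
        label = fields[1];
        for (int i = 2; i + 1 < fields.length; i += 2) {
            topicProbabilities.put(Integer.parseInt(fields[i]),
                                   Double.parseDouble(fields[i + 1]));
        }
    }

    // Returns the topic ID with the highest probability, or -1 if the line has no topics.
    int mostLikelyTopic() {
        int best = -1;
        double bestProbability = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> entry : topicProbabilities.entrySet()) {
            if (entry.getValue() > bestProbability) {
                bestProbability = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample line, not real Mallet output.
        DocTopicsLine parsed = new DocTopicsLine("0 doc1.txt 7 0.62 3 0.25 12 0.13");
        System.out.println(parsed.instanceId + " " + parsed.label
                + " -> most likely topic " + parsed.mostLikelyTopic());
    }
}
```

Note that splitting on whitespace breaks if a label itself contains spaces. In that case, splitting on tabs would be the first thing to try.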

Reference
[1]http://mallet.cs.umass.edu/topics.php
[2]http://programminghistorian.org/lessons/topic-modeling-and-mallet






Saturday, February 23, 2013

How to encode the output files to UTF-8 format with Java

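By default, that is, if you use FileWriter, the output file is written in the platform's default encoding (ANSI on my Windows machine). If you want the output file in another encoding, wrap a FileOutputStream in an OutputStreamWriter and pass the encoding name.
Like this: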

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outpath),"UTF-8"));

This works because OutputStreamWriter lets you specify the encoding of the output file.
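
Here is a small self-contained sketch of the same idea that you can run as is. The class name Utf8WriteExample, the helper writeUtf8 and the temp file are only for illustration. It uses StandardCharsets.UTF_8 instead of the string "UTF-8", which avoids a misspelled charset name, and try-with-resources so the writer is always closed.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8WriteExample {
    // Writes text to outPath as UTF-8, regardless of the platform default encoding.
    static void writeUtf8(Path outPath, String text) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(outPath.toFile()), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temp file used only for this demo.
        Path outPath = Files.createTempFile("utf8-demo", ".txt");
        // "\u4e3b\u9898" is a two-character Chinese word, written with escapes
        // so the source file itself stays plain ASCII.
        writeUtf8(outPath, "\u4e3b\u9898 topic");
        // Read the raw bytes back and decode them as UTF-8 to check the round trip.
        String readBack = new String(Files.readAllBytes(outPath), StandardCharsets.UTF_8);
        System.out.println("Round trip ok: " + readBack.equals("\u4e3b\u9898 topic"));
        Files.delete(outPath);
    }
}
```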

Tuesday, February 19, 2013

[C#] List Sort with Two Keywords

In C#, List<T> is a collection that can hold items of any data type, as long as all the items in one list have the same type. Here is my topic: how do you sort a List<T> where T is a user-defined class, using two or more sort keys?
I will show one way to do it.
Sample:

  using System;

  class TopicCandidate : IComparable<TopicCandidate>
  {
   public string topic;
   public int length;
   public int weight;
   public TopicCandidate(string t, int l, int w)
   {
    this.topic = t;
    this.length = l;
    this.weight = w;
   }
   #region IComparable<TopicCandidate> Members
   // Compare by length first; if the lengths are equal, compare by weight.
   public int CompareTo(TopicCandidate other)
   {
    if (this.length == other.length)
     return this.weight.CompareTo(other.weight);
    else
     return this.length.CompareTo(other.length);
   }
   #endregion
  }

This is a class I defined. I would like to sort a List<TopicCandidate> named cand first by length and then by weight. To do that, my class implements the IComparable<TopicCandidate> interface and its CompareTo(..) method. Then I can sort cand by just calling the Sort() method of List, like cand.Sort().
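
For comparison, the same two-key idea can be sketched in Java, the language of the other posts here, where Comparable<T> and compareTo play the role of IComparable<T> and CompareTo. This is only a rough counterpart, not a translation of any particular library. The sample topics and numbers are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Java counterpart of the C# class above: ordered by length, then by weight.
class TopicCandidate implements Comparable<TopicCandidate> {
    final String topic;
    final int length;
    final int weight;

    TopicCandidate(String topic, int length, int weight) {
        this.topic = topic;
        this.length = length;
        this.weight = weight;
    }

    @Override
    public int compareTo(TopicCandidate other) {
        // First key: length. Second key (tie-breaker): weight.
        if (this.length != other.length) {
            return Integer.compare(this.length, other.length);
        }
        return Integer.compare(this.weight, other.weight);
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<TopicCandidate> cand = new ArrayList<>();
        cand.add(new TopicCandidate("b", 2, 5));
        cand.add(new TopicCandidate("a", 1, 9));
        cand.add(new TopicCandidate("c", 2, 1));
        // Uses compareTo, like cand.Sort() in C#.
        Collections.sort(cand);
        for (TopicCandidate c : cand) {
            System.out.println(c.topic + " " + c.length + " " + c.weight);
        }
    }
}
```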

Of course, there are other ways to sort a List<T>, such as lambda expressions and LINQ. After I learn them, I will add them to this blog.

P.S. Thanks to this article, from which I learned this approach. It is a very clear and readable article about sorting a List.