Monday, February 25, 2013

New to Mallet: A Machine Learning Repository

To build a good exploratory query recommender, I have tried several approaches. Today, under the guidance of Xian Wu, I started trying Mallet. It is an open-source machine learning (ML) toolkit developed at UMass.

Some points:
* I use Mallet on Windows, so everything below is Windows-specific.
(0) What is topic modeling?
It is a way to understand a collection of hundreds of documents based on the frequency of the words in them. Note that the only information we observe directly from the documents is the term frequency.
It is a probabilistic model that clusters text into several topics.
For more background, see: Probabilistic Topic Models.
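As a rough illustration of the point that term frequency is the only directly observed information, the Python sketch below reduces two documents to word counts. The documents and the simple tokenizer are made up for illustration; this is not Mallet's own pipeline.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two made-up documents (hypothetical data, for illustration only).
docs = {
    "doc1": "The cat sat on the mat",
    "doc2": "The dog chased the cat",
}

# A topic model only ever sees counts like these, not word order or meaning.
tf = {name: term_frequencies(text) for name, text in docs.items()}
for name, counts in tf.items():
    print(name, dict(counts))
```

A topic model then looks for groups of words that tend to be frequent in the same documents.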
(1) Install Mallet.
As described on Mallet's homepage, first download it from the site, then extract the archive. I initially put the extracted folder (i.e. mallet) under the C: drive (in C:\..\..), but later found that this is not necessary. You also need to set the environment variable %MALLET_HOME% to the path of the extracted mallet_2.0.7 folder.
(2) Data format:
    There are two ways to import training files.
    <1> import-dir 'the path': each text file in the directory is treated as one document.
    <2> import-file 'path': each line of the file is treated as one document.
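For the one-document-per-line case, a small Python sketch like the following can flatten documents into lines. It assumes import-file's default layout of name, label, then text on each line (worth checking with bin\mallet import-file --help for your version); the function name, document names, and labels are hypothetical.

```python
def to_import_line(name, label, text):
    """Build one input line: name, label, then the document text.

    Assumes the 'name label text' layout. Newlines and tabs inside the
    text are collapsed to single spaces so each document stays on one line.
    The name and label must not contain whitespace themselves.
    """
    one_line = " ".join(text.split())
    return f"{name} {label} {one_line}"


# Made-up documents: (name, label, raw text).
docs = [
    ("doc1", "en", "First line.\nSecond line."),
    ("doc2", "en", "Another\tdocument"),
]

lines = [to_import_line(name, label, text) for name, label, text in docs]
print("\n".join(lines))
```

The printed lines can be saved to a text file and passed to import-file with --input.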
(3) The commands:
    a. Import the training data
        bin\mallet import-file --input path\filename --output path\outfilename.mallet --keep-sequence --remove-stopwords
    b. Train a model
        bin\mallet train-topics --input path\outfilename.mallet --num-topics 100 --output-state path\topic-state.gz --output-topic-keys path\name_keys.txt --output-doc-topics path\name_composition.txt --inferencer-filename path\inferencer_name.mallet
    c. Infer topics with the trained model (--inferencer must point to the inferencer file written in step b)
        bin\mallet infer-topics --input path\test_instance.mallet --inferencer path\inferencer_name.mallet --output-doc-topics path\result.doc
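To keep the file names consistent across steps a to c, the commands can be assembled in a short Python sketch. It only builds and prints the command lines and does not run Mallet. The helper names are hypothetical, the paths are the same placeholders used above, and the sketch deliberately reuses one inferencer file name for steps b and c.

```python
MALLET = r"bin\mallet"  # launcher script, relative to the Mallet folder


def import_command(text_path, mallet_path):
    """Step a: convert a one-document-per-line text file to a .mallet file."""
    return [MALLET, "import-file", "--input", text_path,
            "--output", mallet_path, "--keep-sequence", "--remove-stopwords"]


def train_command(mallet_path, num_topics, keys_path, doc_topics_path, inferencer_path):
    """Step b: train a topic model and save an inferencer for later use."""
    return [MALLET, "train-topics", "--input", mallet_path,
            "--num-topics", str(num_topics),
            "--output-topic-keys", keys_path,
            "--output-doc-topics", doc_topics_path,
            "--inferencer-filename", inferencer_path]


def infer_command(test_mallet_path, inferencer_path, result_path):
    """Step c: infer topics for new instances with the saved inferencer."""
    return [MALLET, "infer-topics", "--input", test_mallet_path,
            "--inferencer", inferencer_path,
            "--output-doc-topics", result_path]


inferencer = r"path\inferencer_name.mallet"  # shared by steps b and c
steps = [
    import_command(r"path\filename", r"path\outfilename.mallet"),
    train_command(r"path\outfilename.mallet", 100, r"path\name_keys.txt",
                  r"path\name_composition.txt", inferencer),
    infer_command(r"path\test_instance.mallet", inferencer, r"path\result.doc"),
]

for cmd in steps:
    # Plain join is fine for display; paths containing spaces would need quoting.
    print(" ".join(cmd))
```

Passing the inferencer path through a single variable avoids the easy mistake of training with one file name and inferring with another.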

Notes:
1. In step c, the input file (the instances whose topics you want to infer) must first be converted to a feature-sequence file with the command: import-file --input ... --keep-sequence. Otherwise, a Java IO exception is thrown.
2. The output is a set of vectors, each containing the instance ID, the label, and topic ID + topic probability pairs.
3. Mallet provides more commands than those listed above. Run xxx --help to learn more.
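The output described in note 2 can be read back with a few lines of Python. This is a minimal sketch that assumes each line is whitespace-separated as: instance ID, label, then alternating topic ID and probability. The sample text is made up, and other Mallet versions may write a different layout, so check your own output file first.

```python
def parse_doc_topics(text):
    """Parse lines of: instance_id label topic_id prob [topic_id prob ...]."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        instance_id, label = fields[0], fields[1]
        pairs = fields[2:]
        # Walk the remaining fields two at a time: (topic ID, probability).
        topics = [(int(pairs[i]), float(pairs[i + 1]))
                  for i in range(0, len(pairs) - 1, 2)]
        rows.append((instance_id, label, topics))
    return rows


# Made-up sample in the assumed layout (not real Mallet output).
sample = """#doc name topic proportion ...
0 doc1 3 0.62 1 0.30 0 0.08
1 doc2 0 0.55 3 0.45
"""

rows = parse_doc_topics(sample)
# For each document, pick the topic with the highest probability.
best = {label: max(topics, key=lambda t: t[1])[0] for _, label, topics in rows}
print(best)
```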

References
[1] http://mallet.cs.umass.edu/topics.php
[2] http://programminghistorian.org/lessons/topic-modeling-and-mallet





