(i) The application parses the input XML files one by one with a DOM parser and selects the nodes of interest for n-gram construction according to the XPath select expression.

(ii) It splits the series of selected nodes at the stop nodes identified by an XPath stop expression. This is particularly useful when punctuation marks are not wanted in the output but still carry important information for building the n-grams: no n-gram spans a stop node.

(iii) Each document fragment, consisting of a series of selected nodes (with the number of items ranging from Minimum to Maximum), is filtered by an XSL stylesheet. The results are added to a hash table.

(iv) The items in the hash tables that exceed the floor number are written to XML files of a specified maximum size. Every file has a root element named “N-GRAMS” with the attributes ID (a unique identifier for the file) and COUNT (the total number of n-grams of that kind; e.g., COUNT=2500 for the 2-grams). The COUNT attribute is not influenced by the floor number. Optionally, every n-gram can have an ID that is unique over the entire output of the program, and can be written as CDATA to speed up DOM parsing of the output files.
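Steps (i) to (iv) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the application's own code: the sample input is made up, the function names are hypothetical, the XSL filtering of step (iii) is replaced by simply taking each node's text, and the floor is treated as an inclusive minimum. The standard library's ElementTree supports only a subset of XPath and needs relative paths, so ".//W" stands in for "//W".

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample input; element and attribute names follow the Examples section.
SAMPLE = (
    "<TEXT><S>"
    '<W POS="DET">the</W><W POS="NOUN">clocks</W><W POS="VERB">were</W>'
    '<W POS="PUNCT">,</W>'
    '<W POS="DET">the</W><W POS="NOUN">clocks</W>'
    "</S></TEXT>"
)


def extract_ngrams(xml_text, select, stop, minimum, maximum):
    """Count n-grams of `minimum`..`maximum` items that do not span a stop node."""
    root = ET.fromstring(xml_text)
    selected = root.findall(select)                   # step (i): select nodes
    stop_ids = {id(n) for n in root.findall(stop)}    # step (ii): stop nodes

    # Split the series of selected nodes into runs at the stop nodes.
    runs, current = [], []
    for node in selected:
        if id(node) in stop_ids:
            if current:
                runs.append(current)
            current = []
        else:
            # Assumption: the node text stands in for the XSL filter of step (iii).
            current.append(node.text)
    if current:
        runs.append(current)

    counts = Counter()                                # the "hash table"
    for run in runs:
        for n in range(minimum, maximum + 1):
            for i in range(len(run) - n + 1):
                counts[tuple(run[i:i + n])] += 1
    return counts


def apply_floor(counts, floor):
    """Step (iv): keep n-grams with at least `floor` occurrences (assumed inclusive)."""
    return {gram: c for gram, c in counts.items() if c >= floor}


counts = extract_ngrams(SAMPLE, ".//W", ".//W[@POS='PUNCT']", 2, 2)
kept = apply_floor(counts, 2)
total = sum(counts.values())  # analogous to COUNT: not influenced by the floor
```

In this sketch the comma acts as a stop node, so no 2-gram joins "were" with the following "the", and the total is computed before the floor is applied.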

Examples

The distribution includes several files that illustrate n-gram extraction.

For the input file 1984en1_00.xml, the select expression is “//W”, the stop expression is “//W[@POS='PUNCT']”, and the XSL filter is “en-filter.xsl”.

The n-gram dialog box

Files group box

Input files

A read-only text box that lists the input files. Files are added through the open dialog invoked by the menu item File / Select input files.

Root string of the output files

Determines how the output files are named (e.g., "2-gram_file00_" + root string, "2-gram_file01_" + root string, "3-gram_file0_" + root string).

Maximum file size

The maximum size of an output file, in KB. If it is 0, the program does not split the results across multiple files.

View output as html

Adds the processing instruction <?xml-stylesheet type="text/xsl" href="n-gram2html.xsl"?> to the output files.
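As a rough illustration of what this option produces, the sketch below builds an output document with Python's xml.dom.minidom, placing the processing instruction before the “N-GRAMS” root element. The function name and the ID and COUNT values are hypothetical; only the root element name, its two attributes, and the processing instruction come from this documentation.

```python
from xml.dom.minidom import Document


def build_output(file_id, count, view_as_html):
    """Return an output document with an N-GRAMS root and an optional stylesheet PI."""
    doc = Document()
    if view_as_html:
        # The processing instruction must come before the root element.
        pi = doc.createProcessingInstruction(
            "xml-stylesheet", 'type="text/xsl" href="n-gram2html.xsl"'
        )
        doc.appendChild(pi)
    root = doc.createElement("N-GRAMS")
    root.setAttribute("ID", file_id)        # hypothetical value format
    root.setAttribute("COUNT", str(count))
    doc.appendChild(root)
    return doc


xml_with_pi = build_output("2-gram_00", 2500, True).toxml()
xml_plain = build_output("2-gram_00", 2500, False).toxml()
```

A browser that opens the first variant will look for n-gram2html.xsl next to the output file, so the stylesheet has to be copied alongside the results.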

Behavior group box

Stylesheet used to filter n-gram data

The XSL stylesheet that filters the data to be written. Each selected node is passed to it as the context node.

Write attribute ID

Specifies whether an ID attribute is written for each n-gram.

Write attribute HASH

Associates a hash value with every extracted n-gram.

Selected node written as CDATA

Writes the n-gram either as a CDATA section or as a document fragment.
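The difference between the two forms can be illustrated with a small Python sketch using xml.dom.minidom. The NGRAM element name, the temporary wrapper element, and the function name are assumptions made for the example; the documentation does not state how the application names the element that holds an n-gram.

```python
from xml.dom.minidom import Document, parseString


def ngram_element(doc, fragment_xml, as_cdata):
    """Wrap an n-gram either as one CDATA section or as real child nodes."""
    element = doc.createElement("NGRAM")  # hypothetical element name
    if as_cdata:
        # The markup is stored as opaque text; a DOM parser creates a single node.
        element.appendChild(doc.createCDATASection(fragment_xml))
    else:
        # The markup is parsed and every item becomes a node of the output tree.
        wrapper = parseString("<X>" + fragment_xml + "</X>").documentElement
        for child in list(wrapper.childNodes):
            element.appendChild(doc.importNode(child, True))
    return element


doc = Document()
fragment = "<W>the</W><W>clocks</W>"
as_cdata = ngram_element(doc, fragment, True).toxml()
as_fragment = ngram_element(doc, fragment, False).toxml()
```

With CDATA the consumer of the output receives a single text node per n-gram, which is cheaper to parse but has to be parsed again if the individual items are needed.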

Node selection

The XPath expression that selects the nodes from the input files (e.g., nodes such as paragraph and sentence are not needed).

Stop node identification

The XPath expression that identifies the nodes at which the series is split before the n-grams are extracted (e.g., punctuation marks).

Floor

Minimum number of occurrences needed for an n-gram to be written to the output.

Minimum

Minimum number of items in an n-gram.

Maximum

Maximum number of items in an n-gram.

Progress group box

It consists of a text box and a progress bar used to display the progress of each item in the log. The contents of the log can be saved to a file from the menu item Action / Save progress log.

Developer information

Author: Alexandru Ceausu (alceausu@yahoo.com)

www.geocities.com\alceausu

Master of Computational Linguistics (2nd year)

Tel 04 0232 223731

Address: Toma Cozma 73, 6600, Iasi, Romania