Tool for n-gram extraction from XML files

Overview

How it works

The Algorithm

Examples

The n-gram dialog box

Menu

File / Select input files

File / Select filter stylesheet

Action / Extract n-grams

Action / Save progress log

Files group box

Input files

Root string of the output files

Maximum file size

View output as html

Behavior group box

Stylesheet used to filter n-gram data

Write attribute ID

Write attribute HASH

Selected node written as CDATA

Node selection

Stop node identification

Floor

Minimum

Maximum

Progress group box

Developer information

Overview

N-GRAM is a tool for extracting n-grams from XML files. The application was written in C# using .NET Framework 1.2.

Some of the application’s features are:

*      Selection of the nodes used to build n-grams with an XPath expression;

*      Identification of stop nodes, also with an XPath expression, e.g. to exclude punctuation marks (optional);

*      Filtering of the data written to the output files with an XSL stylesheet;

*      ID and HASH attributes associated with n-grams (optional);

*      N-gram data written as CDATA, which reduces memory use when the output is parsed with DOM (optional);

*      Automatic splitting of the output file (optional).

How it works

The Algorithm

     (i)            The application parses the input XML files one by one (DOM) and selects the nodes of interest for n-gram construction according to the XPath select expression.

   (ii)            It splits the series of selected nodes at the stop nodes identified by the XPath stop expression. This is particularly useful when punctuation marks are not wanted in the output but still carry information that improves n-gram construction: an n-gram never spans a stop node.

 (iii)            Each document fragment, a series of selected nodes whose length ranges from Minimum to Maximum, is filtered by an XSL stylesheet. The results are added to a hash table.

  (iv)            The hash table entries that exceed the floor are written to XML files of the specified maximum size. Every file has a root element named “N-GRAMS” with the attributes ID (a unique identifier for the file) and COUNT (the total number of n-grams of one kind, e.g. COUNT=2500 for the 2-grams). The COUNT attribute is not affected by the floor. Every n-gram can have an ID that is unique across the entire output of the program, and can be written as CDATA to speed up DOM parsing of the output files.
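The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the application's C# code: the function names are hypothetical, plain strings stand in for XML nodes, the XSL filtering step is left out, and the floor is assumed to mean "at least this many occurrences".

```python
from collections import Counter

def split_at_stops(tokens, is_stop):
    """Split a token sequence into runs that contain no stop tokens."""
    runs, current = [], []
    for tok in tokens:
        if is_stop(tok):
            if current:
                runs.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        runs.append(current)
    return runs

def count_ngrams(tokens, is_stop, minimum, maximum):
    """Count every n-gram of length minimum..maximum; none spans a stop token."""
    counts = Counter()
    for run in split_at_stops(tokens, is_stop):
        for n in range(minimum, maximum + 1):
            for i in range(len(run) - n + 1):
                counts[tuple(run[i:i + n])] += 1
    return counts

def apply_floor(counts, floor):
    """Keep n-grams seen at least `floor` times (assumed reading of the floor)."""
    return {gram: c for gram, c in counts.items() if c >= floor}

tokens = ["the", "cat", "sat", ".", "the", "cat", "ran", "."]
counts = count_ngrams(tokens, lambda t: t == ".", minimum=2, maximum=3)
frequent = apply_floor(counts, floor=2)
```

With this input, the only n-gram that reaches a floor of 2 is ("the", "cat"), and no n-gram crosses the full stop between "sat" and the second "the".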

Examples

The distribution includes several files that illustrate n-gram extraction.

*      For the input file 1984en1_00.xml the select expression is “//W”, the stop expression “//W[@POS='PUNCT']” and the XSL filter “en-filter.xsl”.

*      For the input file 1984roPOS_00.xml the select expression is “(TOK|//COMP|//RSPLIT|//LSPLIT|//DIG|//ABBR|//DATE|//PUNCT|//PTERM|//PTERM_P|//OPUNCT|//CPUNCT)”, the stop expression “PUNCT|PTERM|PTERM_P|OPUNCT|CPUNCT” and the XSL filter “ro-filter.xsl”.
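To show how a select expression and a stop expression interact, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The sample document is invented for illustration and is not taken from 1984en1_00.xml, whose exact structure is not described here; the function name select_runs is also hypothetical. ElementTree only accepts relative paths on an element, so “//W” is written as “.//W”.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the real files may be structured differently.
SAMPLE = """<TEXT>
  <S>
    <W POS="DET">The</W>
    <W POS="NOUN">clocks</W>
    <W POS="PUNCT">,</W>
    <W POS="VERB">were</W>
    <W POS="VERB">striking</W>
    <W POS="PUNCT">.</W>
  </S>
</TEXT>"""

def select_runs(xml_text, select_path, stop_path):
    """Return the text of the selected nodes, split into runs at the stop nodes."""
    root = ET.fromstring(xml_text)
    stops = set(root.findall(stop_path))   # elements matched by the stop expression
    runs, current = [], []
    for node in root.findall(select_path):  # selected nodes, in document order
        if node in stops:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(node.text)
    if current:
        runs.append(current)
    return runs

runs = select_runs(SAMPLE, ".//W", ".//W[@POS='PUNCT']")
```

Each run is a series of nodes from which n-grams can be built; the punctuation nodes mark the boundaries and do not appear in any run.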

The n-gram dialog box

Menu

File / Select input files

The menu item opens a dialog to select the xml files that will be used as input.

File / Select filter stylesheet

The menu item opens a dialog to select the XSL or XSLT stylesheet used to filter the n-gram data.

Action / Extract n-grams

Processes the input files and writes the results to the output XML files.

Action / Save progress log

Saves the log (the text in the box above the progress bar) as a plain text file.

Files group box

Input files

A read-only text box that lists the input files. Files are added through the open dialog invoked by the menu item File / Select input files.

Root string of the output files

Determines how the output files are named (e.g. "2-gram_file00_" + root string, "2-gram_file01_" + root string, "3-gram_file0_" + root string).

Maximum file size

The maximum size of an output file (Kb). If it is 0, the program does not split the results across several files.
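The splitting rule can be pictured with the following Python sketch. It is an assumption about the general approach, not the application's actual logic: the function name split_entries is hypothetical, each entry stands for one already serialized n-gram, 1 Kb is taken as 1024 bytes, and sizes are measured in UTF-8.

```python
def split_entries(entries, max_kb):
    """Group serialized entries into chunks of at most max_kb Kb; 0 disables splitting."""
    if max_kb == 0:
        return [list(entries)]
    limit = max_kb * 1024  # assumed: 1 Kb = 1024 bytes
    chunks, current, size = [], [], 0
    for entry in entries:
        entry_size = len(entry.encode("utf-8"))
        # Start a new chunk when adding this entry would exceed the limit.
        if current and size + entry_size > limit:
            chunks.append(current)
            current, size = [], 0
        current.append(entry)
        size += entry_size
    if current:
        chunks.append(current)
    return chunks

entries = ["a" * 600, "b" * 600, "c" * 600]
chunks = split_entries(entries, max_kb=1)
single = split_entries(entries, max_kb=0)
```

In this sketch an entry larger than the limit still gets a chunk of its own instead of being cut in the middle.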

View output as html

Adds the processing instruction <?xml-stylesheet type="text/xsl" href="n-gram2html.xsl"?> to the output files.

Behavior group box

Stylesheet used to filter n-gram data

The stylesheet that filters the data to be written. Each node is passed to the stylesheet as the context node.

Write attribute ID

Specifies whether an ID attribute is written for each n-gram.

Write attribute HASH

Associates a hash value with every extracted n-gram.

Selected node written as CDATA

Selects whether the n-gram is written as a CDATA node or as a document fragment.
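The difference between the two forms can be illustrated with Python's xml.dom.minidom. This is a sketch under assumptions: only the root element N-GRAMS and its ID and COUNT attributes come from this document, while the child element name NGRAM and the function name build_output are hypothetical.

```python
from xml.dom.minidom import getDOMImplementation, parseString

def build_output(fragments, file_id, total, as_cdata):
    """Serialize n-gram markup under an N-GRAMS root, as CDATA or as real nodes."""
    doc = getDOMImplementation().createDocument(None, "N-GRAMS", None)
    root = doc.documentElement
    root.setAttribute("ID", file_id)
    root.setAttribute("COUNT", str(total))
    for markup in fragments:
        item = doc.createElement("NGRAM")  # hypothetical element name
        if as_cdata:
            # The markup is stored as character data: one node per n-gram.
            item.appendChild(doc.createCDATASection(markup))
        else:
            # The markup is parsed and copied in as a document fragment.
            wrapper = parseString("<wrapper>" + markup + "</wrapper>").documentElement
            for child in list(wrapper.childNodes):
                item.appendChild(doc.importNode(child, True))
        root.appendChild(item)
    return root.toxml()

fragment = "<W>the</W><W>cat</W>"
as_cdata = build_output([fragment], "f0", 1, as_cdata=True)
as_nodes = build_output([fragment], "f0", 1, as_cdata=False)
```

A DOM parser reading the CDATA form creates a single character-data node per n-gram, whereas the fragment form produces one element node and one text node for every item in the n-gram.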

Node selection

Selects the nodes from the input files that are used to build n-grams (e.g. paragraph and sentence nodes are not needed).

Stop node identification

Identifies the nodes at which the series of selected nodes is split before n-grams are extracted (e.g. punctuation marks).

Floor

Minimum number of occurrences needed for an n-gram to be written to the output.

Minimum

Minimum number of items in an n-gram.

Maximum

Maximum number of items in an n-gram.

Progress group box

Consists of a text box and a progress bar that display the progress of each item in the log. The contents of the log can be saved to a file with the menu item Action / Save progress log.

Developer information

Author: Alexandru Ceausu (alceausu@yahoo.com)

www.geocities.com\alceausu

Master of Computational Linguistics (2nd year)

Tel 04 0232 223731

Address: Toma Cozma 73, 6600, Iasi, Romania