Tool for n-gram extraction from xml files
File / Select filter stylesheet
Root string of the output files
Stylesheet used to filter n-gram data
Selected node written as CDATA
N-GRAM is a tool for extraction of n-grams from xml files. The application was coded in C# using .Net Framework 1.2.
Some of the application’s features are:
Selection of the nodes for n-gram construction with an XPath expression;
Identification of stop nodes, also with an XPath expression (e.g., to exclude punctuation marks) – optional;
Filtering of the data written to the output files using an xsl stylesheet;
ID and HASH attributes associated with the n-grams – optional;
N-gram data written as CDATA (does not stress memory on DOM parsing) – optional;
Automatic splitting of the output file – optional.
(i) The application parses the input xml files one by one (DOM), selecting the nodes of interest for n-gram construction according to the XPath select expression.
(ii) It splits the series of selected nodes at the stop nodes identified by an XPath stop expression. This is particularly useful when the punctuation marks are not wanted in the output but still carry information that improves the construction of n-grams (an n-gram does not span a stop node).
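Steps (i) and (ii) can be illustrated with a minimal Python sketch. It is not the application's C# code: the function name split_at_stops and the sample fragment are invented for illustration, and Python's ElementTree supports only a subset of XPath, so the expressions are written relative to the root (".//W" rather than "//W").

```python
import xml.etree.ElementTree as ET

def split_at_stops(root, select_path, stop_path):
    """Return runs of selected nodes, cut wherever a stop node occurs."""
    stops = set(root.iterfind(stop_path))      # nodes that end a run
    runs, current = [], []
    for node in root.iterfind(select_path):    # visited in document order
        if node in stops:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(node)
    if current:
        runs.append(current)
    return runs

# Hypothetical input fragment, not taken from the distributed sample files.
SAMPLE = """<S>
  <W POS="DET">the</W><W POS="NOUN">clocks</W><W POS="PUNCT">,</W>
  <W POS="VERB">were</W><W POS="VERB">striking</W><W POS="PUNCT">.</W>
</S>"""

root = ET.fromstring(SAMPLE)
runs = split_at_stops(root, ".//W", ".//W[@POS='PUNCT']")
words = [[w.text for w in run] for run in runs]
print(words)
```

Because the punctuation nodes only cut the series, no n-gram built from these runs can contain or span a punctuation mark.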
(iii) Each document fragment, consisting of a series of selected nodes (with the number of items ranging between a Minimum and a Maximum), is filtered by an xsl stylesheet. The results are added to a hash table.
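The windowing and counting in step (iii) can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: the xsl filtering is replaced by plain strings, a Python Counter stands in for the hash table, and count_ngrams is a hypothetical name.

```python
from collections import Counter

def count_ngrams(runs, n_min, n_max):
    """Count every n-gram of n_min..n_max items inside each run."""
    counts = Counter()
    for run in runs:
        for n in range(n_min, n_max + 1):
            # Slide a window of n items; runs shorter than n yield nothing.
            for start in range(len(run) - n + 1):
                counts[tuple(run[start:start + n])] += 1
    return counts

# Each inner list is one run of items between two stop nodes.
runs = [["the", "clocks"], ["were", "striking", "thirteen"], ["the", "clocks"]]
counts = count_ngrams(runs, 1, 2)
print(counts[("the", "clocks")])
```

Since the windows never leave a run, a pair such as ("clocks", "were") is never counted, even though the two words are adjacent once the stop nodes are dropped.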
(iv) The items in the hash tables that exceed the floor number are written to xml files of a specified maximum size. Every file has a root element named “N-GRAMS” with the attributes ID (for unique identification of the file) and COUNT (the total number of n-grams of a kind; e.g., for 2-grams the file will have COUNT=2500). The COUNT attribute is not influenced by the floor number. Every n-gram can have an ID that is unique over the entire output of the program and can be written as CDATA to speed up the DOM parsing of the output files.
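A rough Python sketch of step (iv) follows. Only the root element name N-GRAMS and its ID and COUNT attributes come from the description above. The element name NGRAM, the attribute FREQ, and the function build_output are hypothetical. The sketch also assumes that COUNT is the number of distinct n-grams of the given length and that "exceed" means strictly greater than the floor.

```python
import xml.etree.ElementTree as ET

def build_output(counts, n, floor, file_id):
    """Build an N-GRAMS element holding the n-grams of length n above the floor."""
    kind = {gram: c for gram, c in counts.items() if len(gram) == n}
    # COUNT is taken before the floor is applied, so the floor does not affect it.
    root = ET.Element("N-GRAMS", {"ID": str(file_id), "COUNT": str(len(kind))})
    for gram, c in sorted(kind.items()):
        if c > floor:                          # assumption: strict comparison
            item = ET.SubElement(root, "NGRAM", {"FREQ": str(c)})  # hypothetical names
            item.text = " ".join(gram)
    return root

counts = {("the", "clocks"): 3, ("were", "striking"): 1, ("the",): 4}
out = build_output(counts, n=2, floor=1, file_id=0)
print(ET.tostring(out, encoding="unicode"))
```

Here two 2-grams exist, so COUNT is 2, but only one of them is frequent enough to be written.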
The distribution comes with several files that exemplify the n-gram extraction.
For the input file 1984en1_00.xml the select expression is “//W”, the stop expression is “//W[@POS='PUNCT']” and the xsl filter is “en-filter.xsl”.
For the input file 1984roPOS_00.xml the select expression is “(TOK|//COMP|//RSPLIT|//LSPLIT|//DIG|//ABBR|//DATE|//PUNCT|//PTERM|//PTERM_P|//OPUNCT|//CPUNCT)”, the stop expression is “PUNCT|PTERM|PTERM_P|OPUNCT|CPUNCT” and the xsl filter is “ro-filter.xsl”.
The menu item File / Select input files opens a dialog to select the xml files that will be used as input.
The menu item File / Select filter stylesheet opens a dialog to select the xsl or xslt stylesheet used to filter the n-gram data.
Processes the input files and writes the results to the output xml files.
The menu item Action / Save progress log saves the log (the text in the box above the progress bar) as a plain text file.
A read-only text box that stores the list of input files. The files are added via the open dialog invoked by the menu item File / Select input files.
The root string determines how the output files will be named (e.g., "2-gram_file00_" + root string, "2-gram_file01_" + root string, "3-gram_file0_" + root string).
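The naming scheme can be sketched in Python as below. The function output_name and the root string "out.xml" are hypothetical, and the sketch assumes a two-digit, zero-padded file index as in the first two examples.

```python
def output_name(n, index, root_string):
    """Compose an output file name from the n-gram size, file index and root string."""
    # Assumption: the index is zero-padded to two digits.
    return f"{n}-gram_file{index:02d}_{root_string}"

names = [output_name(2, 0, "out.xml"), output_name(2, 1, "out.xml")]
print(names)
```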
The maximum size of an output file, in Kb. If it is 0, the program does not split the results into several files.
Add processing instruction <?xml-stylesheet type="text/xsl" href="n-gram2html.xsl"?> to the output files.
The stylesheet used to filter the data to be written; each selected node is passed to it as the context node.
Specify if an ID attribute should be written for an n-gram.
Associate a hash value with every n-gram extracted.
Write the n-gram as a CDATA node or as a document fragment.
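The difference between the two modes can be illustrated with Python's xml.dom.minidom. The element name NGRAM and the sample content are hypothetical. With CDATA, the markup of the n-gram is stored as a single text node, so a DOM parser reading the output does not build an element for every item.

```python
from xml.dom.minidom import Document
from xml.dom import Node

doc = Document()
root = doc.createElement("N-GRAMS")
doc.appendChild(root)

item = doc.createElement("NGRAM")              # hypothetical element name
root.appendChild(item)
# The n-gram markup is stored verbatim inside one CDATA section.
fragment = '<W POS="DET">the</W><W POS="NOUN">clocks</W>'
item.appendChild(doc.createCDATASection(fragment))

xml_text = root.toxml()
print(xml_text)
```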
Select the nodes from the input files (e.g., nodes such as paragraph and sentence are not needed).
Used to identify the nodes that fragment the series from which the n-grams are extracted (e.g., punctuation marks).
Minimum number of occurrences needed for an n-gram to be written to the output.
Minimum number of items in an n-gram.
Maximum number of items in an n-gram.
It consists of a text box and a progress bar used to display the progress of each item in the log. The contents of the log can be saved to a file from the menu item Action / Save progress log.
Author: Alexandru Ceausu (alceausu@yahoo.com)
Master of Computational Linguistics (2nd year)
Tel 04 0232 223731
Address: Toma Cozma 73,
6600,