the next-generation search engine for parse trees
Like its predecessor tgrep, which
was written by Richard Pito, TGrep2 is a search engine for finding structures
in a corpus of trees. The most common application of these programs is in
extracting data from the Penn
Treebank corpora of parsed sentences.
TGrep2 is mostly backward compatible with tgrep, but it also offers a number
of new features, including the following major enhancements:
- Rather than simply having a set of required relationships and a set of
prohibited relationships, nodes can have full boolean expressions of
relationships to other nodes.
- Nodes can be given unique labels and may then be referred to by those labels
in the pattern specification or in selecting trees for printing.
- Patterns are no longer restricted to simple tree architectures. The use of
node labels and segmented patterns allows links in a pattern to form
back-edges as well, permitting cycles of links.
- Customizable output formats allow a variety of information to be reported
in a flexible manner.
- Multiple search patterns may be specified and one can retrieve the first
subtree matching any pattern, the first subtree matching each pattern, or all
subtrees matching all patterns.
- Subtrees can be reported using a code rather than by printing the whole
structure. The trees themselves can later be retrieved using the codes.
- A variety of new links have been added and the immediately-precedes
link now has a more conventional meaning.
- TGrep2 corpus files are substantially smaller than tgrep corpora.
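The first enhancement in the list above can be illustrated with a minimal Python sketch. It is not TGrep2 code and does not use TGrep2 pattern syntax. The Node class and all helper names are hypothetical, chosen only to contrast a required/prohibited pair of relationship sets with a full boolean expression over relationships.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Toy parse-tree node (hypothetical, for illustration only)."""
    label: str
    children: List["Node"] = field(default_factory=list)


# A relationship is modeled as a predicate on a single node.
Relation = Callable[[Node], bool]


def has_child(label: str) -> Relation:
    """Relationship: the node has an immediate child with this label."""
    return lambda node: any(c.label == label for c in node.children)


def required_prohibited(required: List[Relation],
                        prohibited: List[Relation]) -> Relation:
    """Set-based model: all required relations hold, no prohibited one does."""
    return lambda node: (all(r(node) for r in required)
                         and not any(p(node) for p in prohibited))


def all_of(*relations: Relation) -> Relation:
    """Conjunction of relationships."""
    return lambda node: all(r(node) for r in relations)


def any_of(*relations: Relation) -> Relation:
    """Disjunction, which a single required/prohibited pair cannot express."""
    return lambda node: any(r(node) for r in relations)


def negate(relation: Relation) -> Relation:
    """Negation of a relationship."""
    return lambda node: not relation(node)


np1 = Node("NP", [Node("DT"), Node("NN")])
np2 = Node("NP", [Node("PRP$"), Node("JJ"), Node("NN")])
np3 = Node("NP", [Node("NNS")])

# "Has a DT child or a PRP$ child, and has no JJ child."
expr = all_of(any_of(has_child("DT"), has_child("PRP$")),
              negate(has_child("JJ")))
matches = [expr(n) for n in (np1, np2, np3)]
print(matches)  # [True, False, False]
```

The disjunction is the point of the sketch: with only one set of required and one set of prohibited relationships, "DT child or PRP$ child" would need two separate searches.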
The TGrep2 manual is available in PostScript and PDF formats. It explains how to use tgrep2 and how to
build tgrep2 corpus files from tgrep files.
TGrep2 comes precompiled for most up-to-date Linux machines. Simply download the tar file and unpack it with "tar xvzf
tgrep2.tgz". Try running tgrep2 with no arguments. If it is correctly
compiled for your machine, you should see a help message.
If the tgrep2 executable does not work on your machine, you will need to
compile your own. This involves two steps:
First you must compile the libdrutils.a library. To do this, go to this web page and follow the instructions.
Next, assuming you have already unpacked the tgrep2
tar file, you should edit the first line of the Makefile to point to the
location of the DRUtils header files (drutils.h).
Now run make and with any luck it will compile nicely. There is no built-in
mechanism for installing it someplace special on your machine. But it's only
a single, stand-alone executable, so you should be able to handle that
- Added a mode for working with Combinatory Categorial Grammar (CCG).
- Converted to the GNU General Public License.
- Fixed a bug when using %Na without %Nb. Thanks to Neal Snider for
finding this one.
- Fixed a parsing bug in segmented patterns introduced in 1.08.
- Added the %k, %d, %y, and %z tree-printing styles to report the length of
a subtree in terminal symbols, its maximum depth, and the terminal indices of
the first and last terminals in the tree.
- The -f and -a options are now independent and can be combined to report,
without duplication, subtrees that match one or more patterns.
- Fixed a potential crash when reading badly formatted subtree code files.
- Thanks to Eric Joanis for fixing a potential crash in reading commented
- Patterns of the following form are now all well-formed and equivalent:
A < B < C
(A < B < C)
((A < B) < C)
((((A) < B)) < C)
- Fixed some problems with subtree copies. Copied subtrees can no longer
have any back edges.
- If a pattern segment is empty, it will now be ignored.
- Fixed a bug introduced in 1.06.
- Macros can now be defined and used to simplify pattern specification.
- Multiple patterns or pattern files can be given on the command-line.
- Optional links have been added, thanks to Eric Joanis, to allow
certain features of a tree to be reported if they occur, without
preventing a match when they don't.
- A bug was fixed in which nodes could be printed out even when a
match did not occur below a disjoined or optional link.
- Added the = operator, which means that the pattern on the right
must match the node matched by the one on the left. This has various
uses; for one, it can be combined with ! to rule
out certain patterns. For example, this matches any node whose name starts
with NP, except for NP-TMP: (/^NP/ != NP-TMP)
- You can also append = to any other link to mean that the identity
match is allowed (in addition to the standard meaning of the link).
For example, A >>= B means B dominates A or is equal to A.
- I added support for reading and writing bzip2 files. Thanks to
Eric Joanis for writing the first implementation of this.
- Fixed a bug that caused extra newlines in -l (long format) output mode.
- Added the -i flag, which causes node-name matching to be case-insensitive.
- Added the ability to complement node-name matching by preceding the pattern
with !. For example, !DT|RB now matches all nodes whose name is neither DT nor RB.
- Note that version 1.04 of the manual supersedes 1.3, which was named
- Added support for sentence-level comments. When compiling a corpus,
comments can appear in the input provided the line starts with a #. The -C
option causes the comments to be stored in the corpus file or printed when a
match is made while searching. The %c field allows comments to be printed in
- Corpora can now be constructed for trees with a branching factor higher
than 255, using the -K flag.
- The new link types <: >: <<: and >>: have been added to match
only children or parents of only children.
- If a labeled node is printed in formatted output and the label is not
bound to any node, it will print "<none>" rather than generate an error.
- Added the -u option and %u... formatting flag to allow the printing of
just the top symbol in a subtree.
- Removed an unaligned memory access that caused problems under Solaris.
- Thanks to Eric Joanis
for suggesting the ability to handle comments and more than 255
children per node and for writing the first implementation of those
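The = links described in the list above can be read as simple predicates over a tree. The following Python sketch illustrates that reading under stated assumptions and is not TGrep2's implementation. The Node class and function names are hypothetical, and the regular expression ^NP stands in for the pattern /^NP/.

```python
import re


class Node:
    """Toy parse-tree node with a parent pointer (hypothetical)."""

    def __init__(self, label, children=None):
        self.label = label
        self.parent = None
        self.children = children or []
        for child in self.children:
            child.parent = self


def walk(node):
    """Yield the node and all of its descendants in preorder."""
    yield node
    for child in node.children:
        yield from walk(child)


def dominates(b, a):
    """True if b is a proper ancestor of a."""
    node = a.parent
    while node is not None:
        if node is b:
            return True
        node = node.parent
    return False


def dominated_by_or_equal(a, b):
    """Reading of 'A >>= B': B dominates A or is equal to A."""
    return a is b or dominates(b, a)


def np_but_not_tmp(node):
    """Reading of '(/^NP/ != NP-TMP)': name starts with NP, but is not NP-TMP."""
    return re.search(r"^NP", node.label) is not None and node.label != "NP-TMP"


np_sbj = Node("NP-SBJ", [Node("DT"), Node("NN")])
np_tmp = Node("NP-TMP", [Node("NN")])
s = Node("S", [np_sbj, Node("VP", [Node("VBD"), np_tmp])])

hits = [n.label for n in walk(s) if np_but_not_tmp(n)]
print(hits)                              # ['NP-SBJ']
print(dominated_by_or_equal(np_tmp, s))  # True: S dominates NP-TMP
print(dominated_by_or_equal(s, s))       # True: identity is allowed
print(dominates(s, s))                   # False: plain dominance excludes identity
```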
TGrep2 is Copyright (C) 2001-2005, Douglas L. T. Rohde. It is available
free of charge and is governed by the GNU General Public License, version 2.
Comments, questions, and bug reports should be addressed to firstname.lastname@example.org.
Doug Rohde, email@example.com,
formerly of the Department of Brain and Cognitive Science,
Massachusetts Institute of Technology