Penn Treebank

In this section, we will not give you detailed information about the Treebank’s history or the concepts behind its creation. That is covered in class, and you can find more on the corresponding websites or via a search engine. Here we concentrate on what you need to know for your queries.

As you might have already seen, the Penn Treebank has a rather small tagset compared to the SUSANNE Corpus. While this may make it your favorite for now, that opinion will probably change soon. Because the Penn Treebank offers few options for describing a particular node and its functions, much of the information about the structure of a sentence can only be derived from the hierarchy of the tree. The Penn Treebank therefore has a very deep structure and requires rather complex queries to describe a given grammatical construction. You may well need a query with multiple links to describe a pattern in the Penn Treebank for which a single regular expression would suffice in the SUSANNE Corpus. In return, the regular expressions themselves will be quite easy. Most often, you can simply use the tag enclosed in slashes. There are exceptions to this, but with some basic knowledge of regular expressions (see the corresponding section for further help), these should be easy enough to spot.
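
The following Python sketch illustrates how a tag used as a regular expression can match more labels than you intended. It is not the syntax of the query tool: it uses Python's re.search, which finds the pattern anywhere in a label, and the query tool may anchor its patterns differently. The labels are a small hand-picked selection of Penn Treebank tags, and the helper name matching_labels is made up for this illustration.

```python
import re

# A handful of Penn Treebank labels; the selection is illustrative only.
LABELS = ["NP", "NP-SBJ", "WHNP", "NNP", "VP", "PP", "PP-LOC"]

def matching_labels(pattern, labels=LABELS):
    """Return the labels in which the pattern is found anywhere (re.search)."""
    regex = re.compile(pattern)
    return [label for label in labels if regex.search(label)]

# The bare tag also hits labels that merely contain the letters "NP".
loose = matching_labels(r"NP")

# Anchoring keeps NP with or without a function suffix and nothing else.
strict = matching_labels(r"^NP(-|$)")

print(loose)   # ['NP', 'NP-SBJ', 'WHNP', 'NNP']
print(strict)  # ['NP', 'NP-SBJ']
```

If a simple tag returns unexpected hits, checking the pattern against a short list of labels like this is a quick way to see which exception you have run into.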

By far the most difficult part of formulating a good query for the Penn Treebank is finding out how the Penn Treebank actually encodes the grammatical construction you are looking for. It is hard to give general advice here, but a good starting point is always to find a couple of sentences that contain the construction and then look for similarities in their structure. To find such examples, search for a word that is either unique to the construction or at least occurs in it often. This procedure is useful not only for the Penn Treebank but for the SUSANNE Corpus as well.
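
The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample trees are invented and simplified (they follow Penn Treebank style bracketing but are not taken from the corpus and omit traces), and the helper names parse and paths_to are hypothetical. The sketch looks up the word "by", which often occurs in passive constructions, and lists the chain of labels from the root down to that word so that the similarities between the sentences become visible.

```python
def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (the words) are kept as plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return node()


def paths_to(tree, word, prefix=()):
    """Return the label chain from the root to every occurrence of word."""
    label, children = tree
    here = prefix + (label,)
    found = []
    for child in children:
        if isinstance(child, str):
            if child.lower() == word:
                found.append(here)
        else:
            found.extend(paths_to(child, word, here))
    return found


# Invented, simplified sample sentences (assumption: not real corpus data).
SAMPLES = [
    "(S (NP-SBJ (DT The) (NN report)) (VP (VBD was) (VP (VBN written) "
    "(PP (IN by) (NP-LGS (DT the) (NN committee))))))",
    "(S (NP-SBJ (DT The) (NN house)) (VP (VBD was) (VP (VBN sold) "
    "(PP (IN by) (NP-LGS (PRP$ its) (NN owner))))))",
]

paths = [paths_to(parse(sample), "by") for sample in SAMPLES]
for found in paths:
    print(found)   # [('S', 'VP', 'VP', 'PP', 'IN')] for both samples
```

Both samples place "by" under the same chain of nodes, which suggests the kind of hierarchy a query for this construction would have to describe with several links.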