SUSANNE Corpus

As with the Penn Treebank, we will not talk about the history of the SUSANNE Corpus or its aims since this is not really relevant for your queries. For further information about topics we do not cover here, you will get texts during the seminar and you can of course search for more information online if you are interested.

The SUSANNE Corpus has an intimidating tagset which will probably scare you off initially. Do not worry, though: once the initial scare wears off and you get accustomed to it, you will see that it really is helpful. Thanks to its extensive tagset, there is a lot of redundancy, and for our purposes redundancy is a good thing.

We will take a short look at the passive in the SUSANNE Corpus. In the Penn Treebank, even an ordinary query for passive sentences is quite complicated, because we first have to find out how the passive is implicitly encoded and then describe that encoding in a query. In the SUSANNE Corpus it is really easy, since there is a subcategory symbol for any passive verb group. Subcategories are appended to a general phrasetag like V (for verb group). In the case of the passive, that symbol is a p. Thus, any sentence that includes a verb group whose tag has a lower-case p somewhere after the capital V has a passive in it. Explicit symbols of this kind make querying many structures a lot easier in the SUSANNE Corpus.

There is, however, a price to pay for that convenience. Since there are many subcategories, the regular expressions needed to describe your tags will be more difficult. An example of this is the verb group must have been noticed. If you take a look at the list of possible subcategories (Section 4.10; you will also be given one in class), you will see that the conditions for several subcategory symbols apply. The verb group begins with a modal, so we will have to add a c. Looking at the tense, there will also have to be an f, and since the verb group is passive as well, we must not forget the p. Thus, our verb group will be tagged Vcfp.

As you can see, the p does not have a fixed position: depending on which other subcategory symbols apply, it can be anywhere in the tag after the V. This is something that you have to pay attention to when formulating your regular expressions.
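
To make this concrete, here is a minimal sketch of such a regular expression, written in Python rather than in the query language of any particular search tool. It assumes that a verb group tag consists of a capital V followed only by lower-case subcategory letters; real SUSANNE tags may carry further material that this pattern does not cover. The function name and the list of sample tags are made up for illustration.

```python
import re

# Assumption: a verb group tag is a capital V followed only by
# lower-case subcategory letters, e.g. Vcfp.
# The pattern allows any number of letters before and after the p,
# so the p may sit anywhere after the V.
PASSIVE_VG = re.compile(r"V[a-z]*p[a-z]*")


def is_passive_verb_group(tag: str) -> bool:
    """Return True if the whole tag is a verb group tag containing a p."""
    return PASSIVE_VG.fullmatch(tag) is not None


# "Vpx" is a hypothetical string, used only to show a p that is not
# the last symbol; "Np" shows that a p after another phrasetag is ignored.
tags = ["Vcfp", "Vp", "Vcf", "Np", "Vpx"]
passive_tags = [t for t in tags if is_passive_verb_group(t)]
print(passive_tags)  # ['Vcfp', 'Vp', 'Vpx']
```

A pattern that expected the p directly after the V (V followed by p and nothing else in between) would miss Vcfp, which is exactly the kind of mistake to watch out for.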