Grammar Induction for Musical MelodiesUniversiy of Southern California, Spring 2007 in ISE575/EE675/CSCI575/PSYCH675 by Reid Swanson (2007) |
|||||||||||
|
Constituent Context ModelKlein & Manning's constituent context model is inspired by the distributional hypothesis (Harris 1954) that states the meaning of a word is highly dependent on the context in which it appears. In this case we are not particularly interesting in meaning but in syntactic structure. However, an analogous hypothesis about constituent structure being highly dependent on context seems reasonable and intuitive. In this model two features based on this hypothesis are considered most important in determining if a node is a constituent. The first is the actual span of terminals that are dominated by the candidate node (the constituent part). The second are the terminals on either side of the candidate span (the context part). The left figure is the same typical English parse tree from before and the right figure is a table, also from Klein & Manning, of the corresponding features associated with that tree. The probability of a tree given these features is represented as an exponential model of the form: where σ represents the constituent features and c represents the context features. Formed this way the problem becomes to maximize the sum of the probabilities over all trees. This is done, similar to the EM algorithm, by first making an initial guess at the lambda values, parsing each sentence using these values, and then reestimating the lambdas from the resulting set of trees. Initially all lambdas are set to 0 so that each resulting tree is equally likely. Since this will result in many equally probabl trees all ties are broken randomly. To parse the sentence a modified version of the CYK algorithm is employed. Finding the best set of lambdas can be solved using a variety of optimization techniques such as conjugate gradient ascent. However, this optimization is slow and in practice was not always found to produce the desired results so another method is used; simply λi = λi/T where T is the total number of features (count(σn) + count(cm)). The major problem with this approach is data sparsity. Spans of words longer than 3 or 4 are likely to only be seen a handful of times, if at all, making any statistics based on these features highly unreliable. In the original work words were replaced by part of speech tags, which reduces the vocabulary size by several orders of magnitude but still does not completely alleviate the problem. Smoothing the data can also help, but it is unclear exactly how this was performed in their work or how much this will really help for longer spans. |
||||||||||
| Previous Next |