Class EarleyParser
The main class responsible for parsing using the Earley algorithm (Earley 1970, Stolcke 1995).
Inherited Members
Namespace: EarleyParser
Assembly: EarleyParser.dll
Syntax
public class EarleyParser
Remarks
This is a chart parser with dynamic programming that builds a parse table where:
- Each column corresponds to a position in the input
- Each state represents a partially matched rule
- Three operations (predict, scan, complete) fill the table
The parser implements the following key features:
- Dynamic programming through a chart structure
- Support for both context-free and linear indexed grammars
- Handling of ambiguous parses
- Generation of all possible parse trees
- Analysis of parse tree properties (recursion, basic trees)
The parser can be used in two modes:
- Parser mode: Parses a given input sentence according to the grammar
- Generator mode: Produces all possible sentences of a grammar up to a specified length
The implementation follows Stolcke's 1995 optimizations for handling ambiguity efficiently.
Constructors
| Edit this page View SourceEarleyParser(Grammar, Vocabulary, string[], int)
Initializes a new instance of the EarleyParser class.
Declaration
public EarleyParser(Grammar g, Vocabulary v, string[] text, int maxWords = 0)
Parameters
| Type | Name | Description |
|---|---|---|
| Grammar | g | The context-free grammar to use for parsing |
| Vocabulary | v | The vocabulary that maps words to their possible parts of speech |
| string[] | text | The input tokens to parse |
| int | maxWords | Maximum length when generating sentences. Only used when calling GenerateSentence(). Default is 0, which means no generation. |
Fields
| Edit this page View SourceGrammar
Gets the grammar used for parsing.
Declaration
public Grammar Grammar
Field Value
| Type | Description |
|---|---|
| Grammar |
Voc
Gets the vocabulary that maps words to their possible parts of speech.
Declaration
protected Vocabulary Voc
Field Value
| Type | Description |
|---|---|
| Vocabulary |
_table
Gets the Earley parsing table containing columns for each position in the input.
Declaration
protected EarleyColumn[] _table
Field Value
| Type | Description |
|---|---|
| EarleyColumn[] |
Methods
| Edit this page View SourceAddLexicalizedRules(List<Rule>, int, bool)
Adds lexicalized rules to the grammar.
Declaration
protected virtual void AddLexicalizedRules(List<Rule> lexicalRules, int i, bool preprocess = true)
Parameters
| Type | Name | Description |
|---|---|---|
| List<Rule> | lexicalRules | The list of lexical rules to add. |
| int | i | The index of the current column. |
| bool | preprocess | Whether to preprocess the rules before adding them. |
CountDerivations()
Counts the number of distinct parse trees that can be derived for the input.
Declaration
public int CountDerivations()
Returns
| Type | Description |
|---|---|
| int | The total number of valid parse trees for the input |
CountLengthsOfBasicTrees()
Analyzes the parse trees to determine the counts of lengths of all "basic trees" in the derivation. A basic tree is a non-pumped tree according to the pumping lemma.
Declaration
public List<int> CountLengthsOfBasicTrees()
Returns
| Type | Description |
|---|---|
| List<int> | A list of lengths for all basic trees found in the parse, or null if no parse exists |
GenerateSentence()
Generates all possible trees for the grammar up to input length n.
Declaration
public void GenerateSentence()
Remarks
This method builds the complete parse forest for all possible sentences up to the length specified in the constructor. The generated trees can be retrieved using GetFormattedString() to get either bracketed representations or sequences of parts of speech.
Exceptions
| Type | Condition |
|---|---|
| TooManyEarleyItemsGeneratedException | Thrown when the number of generated states exceeds the maximum allowed threshold. |
GetCompletedStartNonterminal(int)
Retrieves the completed parse node for the start symbol of the grammar.
Declaration
public EarleySpan GetCompletedStartNonterminal(int columnIndex = 0)
Parameters
| Type | Name | Description |
|---|---|---|
| int | columnIndex | The offset from the end of the input to look for the completed parse. Default is 0, which means the entire input. |
Returns
| Type | Description |
|---|---|
| EarleySpan | The EarleySpan representing the completed parse, or null if no such parse exists |
GetFormattedString(int, bool)
Gets a formatted string representation of the parse tree.
Declaration
public string[] GetFormattedString(int columnIndex = 0, bool onlyPartsOfSpeechSequences = false)
Parameters
| Type | Name | Description |
|---|---|---|
| int | columnIndex | The index of the column to get the string from. |
| bool | onlyPartsOfSpeechSequences | Whether to only include parts of speech in the output. |
Returns
| Type | Description |
|---|---|
| string[] | Array of string representations of the parse tree. |
GetPOSRulesInEachDerivation(int)
Extracts all part-of-speech rules used in each possible parse derivation.
Declaration
public List<HashSet<Rule>> GetPOSRulesInEachDerivation(int i)
Parameters
| Type | Name | Description |
|---|---|---|
| int | i | The derivation index to analyze (when there are multiple parses) |
Returns
| Type | Description |
|---|---|
| List<HashSet<Rule>> | A list of rule sets, where each set contains the POS rules for one derivation |
GetPossibleSyntacticCategoriesForToken(string)
Gets the possible syntactic categories for a given token.
Declaration
protected virtual HashSet<string> GetPossibleSyntacticCategoriesForToken(string nextScannableTerm)
Parameters
| Type | Name | Description |
|---|---|---|
| string | nextScannableTerm | The token to get categories for. |
Returns
| Type | Description |
|---|---|
| HashSet<string> | Set of possible syntactic categories for the token. |
HasDerivation()
Determines whether there exists a valid derivation for the input.
Declaration
public bool HasDerivation()
Returns
| Type | Description |
|---|---|
| bool | True if there is at least one valid derivation; otherwise, false. |
ParseSentence()
The main parsing loop that processes all states in the Earley chart.
Declaration
public (bool, string[]) ParseSentence()
Returns
| Type | Description |
|---|---|
| (bool, string[]) | A tuple containing:
|
Remarks
The method traverses the chart in two phases:
- First processes completed states (priority queue ordered by decreasing start indices)
- Then processes predicted states (queue of non-terminals to predict)
Epsilon transitions may cause looping back to completion phase. Parsing is rejected if the number of completed states in any column exceeds the maximum threshold.
PrepareEarleyTable(string[], int)
Creates and initializes the Earley parsing table with columns for each position in the input. The first column (index 0) is initialized with an empty token.
Declaration
protected virtual EarleyColumn[] PrepareEarleyTable(string[] text, int maxWord)
Parameters
| Type | Name | Description |
|---|---|---|
| string[] | text | The input tokens to be parsed |
| int | maxWord | The maximum length when generating sentences (used in generation mode) |
Returns
| Type | Description |
|---|---|
| EarleyColumn[] | An array of EarleyColumn objects representing the parse table |