Axis.Pulsar.Core 0.0.1

There is a newer version of this package available.
See the version list below for details.
dotnet add package Axis.Pulsar.Core --version 0.0.1                
NuGet\Install-Package Axis.Pulsar.Core -Version 0.0.1                
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="Axis.Pulsar.Core" Version="0.0.1" />                
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add Axis.Pulsar.Core --version 0.0.1                
#r "nuget: Axis.Pulsar.Core, 0.0.1"                
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install Axis.Pulsar.Core as a Cake Addin
#addin nuget:?package=Axis.Pulsar.Core&version=0.0.1

// Install Axis.Pulsar.Core as a Cake Tool
#tool nuget:?package=Axis.Pulsar.Core&version=0.0.1                

Axis.Pulsar.Core

Pulsar.Core is a library built to recognize context-free grammars.

Contents

  1. Introduction
  2. Production
  3. Atom Symbols
  4. Composite Symbols
  5. Syntax Validation
  6. Recognition
  7. Language IO

Introduction

<a id="Introduction"></a> Pulsar.Core is at it's core a grammar recognition library. It does it's business in two steps:

  1. It presents units that when assembled together, can define any context-free grammar. Each of these units represents specific recognition logic, to be applied according to the assembling order.
  2. Given any input string, the assembled grammar recognition structure is executed against the input, and a result, success or failure, is reported.

This simplified process is what underpins the working of Pulser.Core. Internally, it is modelled to closely resemble the canonical representation of context-free grammar, the famous 4-element tuple: (N, T, P, S), where:

  • N is a finite set of non-terminals symbols, which Pulsar.Core refers loosely to as Composites;
  • T is a finite set of terminals, which Pulsar.Core refers to as Atoms;
  • P is the finite set of productions;
  • S is the starting non-terminal symbol, an element of N.

Tying all of these structues together is the concept of a Rule, represented internally by the interface Axis.Pulsar.Core.Grammar.IRule. A rule is a unit that can ingest a Token, report if it contains a sequence of characters that conform to a pattern. Pulsar.Core implements all elements of the context-free grammar's 4-element tuple as rules.

Tokens

<a id="Tokens"></a> A good time to mention the Tokens structure is now. Strings are ingested into Pulsar.Core in the form of token instances, represented internally by the Axis.Pulsar.Core.Utils.Tokens struct. This is akin to the ReadOnlySpan<char> struct type, without the restriction of being a ref struct, and with the addition that the Tokens instance publicly holds on to the original input string. This affords Pulsar.Core a lot of performance benefits as, rather than having new string instances generated by all of the string comparison operations, the Tokens are passed around, much like ReadOnlySpan<char> instances.

Symbol

<a id="Symbol"></a> A symbol is a string that conforms to the pattern: ^[a-zA-Z]([a-zA-Z0-9-])*\\z.

Concrete Syntax Trees (CST)

<a id="Concrete-Syntax-Trees-(CST)"></a> The result of recognizing a string as a product of a grammar, is an instance of Pulsar.Core's implimentation of a concrete syntax tree. The Tree is made of nodes which can be one of two types:

  1. Atom Node: a successfully recognized terminal.
  2. Composite Node: a successfully recognized non-terminal.

Each Node holds at a minimum, the symbol it is designated to. Individually, Atoms hold a reference to the Tokens instance, a section of the input string that was recognized by the terminal; while Composites hold a reference to all the other nodes that were recognized by the rule of it's designated symbol. See composites for more details.

Walking the Concrete Syntax Tree

<a id="Walking-the-Concrete-Syntax-Tree"></a> Pulsar.Core provides a minimalistic API for searching for nodes within the CST. Rather than using a Tree Walking/Visiting implementation, Interaction with the CST is achieved by calling:

  1. FindNodes(...): This takes a NodePath instance, which is converted from it's Node Query Language string representation. If the query succeeds, all nodes matching the query are returned.
  2. FindAllNodes(...): This takes a string representing the node symbol. And the entire tree is searched for all nodes with the given symbol.
Node Query Langauage (NQL)

<a id="Node-Query-Language-(NQL)"></a> NQL is an extremely simple query language used to describe conditions to match nodes against. Considering it is used against a tree of nodes to retrieve nodes that match conditions, it is expressed in heirarchical segments, each segment describing conditions for nodes at that level to meet. It is akin to the file-path format, where each file name can be thought of as a node symbol name, or, conditions that nodes at that level must match.

E.g, node-a/node-b/node-c. This is a simple query that checks the given node if it has direct children nodes with the name node-a, whatever is returned is in turn checked if they have direct children nodes with node-b as the symbol name, and so on. Executing the query is done in the following fashion:

var node = ...
var resultNodes = node.FindNodes("node-a/node-b/node-c");

In the example above, the string "node-a/node-b/node-c" is implicitly converted into an instance of Axis.Pulsar.Core.CST.NodePath, and the search operation is run using the instance.

NQL Syntax

<a id="NQL-Syntax"></a> NQL can be visualized as a Path, which consists of Segment instances, each of which in turn consists of Filter instances.

  • Path: A path is the overall query expression. It contains one or more Segment instances, separated by a / character. Paths represent criteria that the descendants of the target node must meet.

    Syntax: 'segment'[/'segment']*

  • Segment: A segment is just a group of one or more filters. The filters in a segment represent alternative criteria against which each node is tested - in layman's terms, segments constitute a logical OR combination of predicates, executed in the order they appear.

    Syntax: 'filter'[|'filter']*

  • Filter: A filter is a predicate that has at least one of a few optional components that the nodes will be tested against. These components include:

    • Node type: a character depicting the type of the node being sought. There 3 possible values for the node type:

      • t or T - for Atom/Termianl nodes.
      • n or N - for Composite/Non Terminal nodes.
      • u or U - for when the node type is unspecified, and as such all nodes are accepted.

      Syntax: @'node-character' e.g @n or @U.

    • Symbol name: a literal string value that the name of the symbol must exactly match. Valid names conform to the pattern described in the Symbol section. Note that the delimiter : is optional when there is no node-type specified.

      Syntax: :'symbol-name' e.g :'symbol-1', 'symbol-2'

    • Tokens: a literal string value that the tokens of the symbol must exactly match. Considering that tokens are wrapped in the delimiters < and >, if the character > must be included in the tokens, it must be escaped using \>.

      Syntax: <'token text'> e.g <the quick brown fox>.

    Complete filter syntax examples:

    • @n:decimal-part<.000034> - matches nodes that are non-terminal, with the name decimal-part, and have the tokens .000034.
    • day-of-the-week or :day-of-the-week - matches any node with the symbol name day-of-the-week.
    • @t - matches any terminal/atomic node.
    • <protected>, or @u<protected> - matches any node with the token protected.

Production

<a id="Production"></a> A production is, simply put, a rule that maps from a symbol name, to either an AtomicRule, or a CompositeRule.

Atomic Symbols

<a id="Atomic-Symbols"></a> An atom is a symbol whose rule matches the ingested Tokens to an internally encapsulated pattern. Pulsar.Core comes with 5 Atom symbols/rules out of the box, all of which are located in the Axis.Pulsar.Core.Grammar.Atomic namespace:

  1. EOF: This matches the end of the input Tokens - meaning, an attempt to read another character from the input will yield "nothing".
  2. Literal: This matches from the current position in the input Tokens, a string of characters that are equal to a literal string encapsulated by the symbol/rule. By default, this uses case-sensitive matching, but this can be turned off by passing the appropriate argument to the symbol upon creation.
  3. Pattern: This matches input characters against an internally encapsulated regular expression. The pattern rule operates under two configurations:
    1. Open Match: In this state, characters are read from the input, one at a time, and matched against the regular expression, until a given number of failures is encountered, at which point, whatever was matched prior to the first failure is reported as the matched tokens.
    2. Closed Match: In this state, a given max number of characters are read from the input, then matched against the encapsulated regular expression. With each failure, a character is removed from the end of the tokens, till either a match is found, or a minimum number of characters is reached.
  4. CharRange: This is setup of with 2 lists, each of disjointed ranges of characters; one list represents an "include" list, while the other represetns an "exclude" list. The CharRange rule works by reading a character from the input, and making sure it is in the include list, and not in the exclude list. Note that either of the lists may be absent, but not both. A character range is simply a 2-element tuple of characters, with one character being a lower boundary, and the other being the upper boundary.
  5. DelimitedString: This rule matches matches tokens that conform to a pattern that contains the following elements:
    1. Start Delimiter: a sequence of characters that marks the start of this parttern.
    2. End Delimiter: a sequence of characters that marks the end of this pattern.
    3. Accepts empty value: boolean indicating if the start and end delimiters can exist without any characters between them.
    4. End Delimiter Escape Sequence: a sequence of characters that "escape" the end-delimiter, allowing the rule to recognize cases where an existence of the end-delimiter is not recognized as the end-delimiter.
    5. Legal/Illegal sequeces: sequences of characters that must be found or must not be found bewteen the start and end delimiters.
    6. Legal/Illegal character ranges: ranges of characters that must or must not be found between the start and end delimiters.

The point of extension is the Axis.Pulsar.Core.Grammar.Atomic.IAtomicRule interface. Any instance of this interface is accepted while building up the Grammer via the Axis.Pulsar.Core.Lang.ILanguageImporter API.

Composite and Group Symbols

<a id="Composite-and-Group-Symbols"></a> A composite symbol is a rule that recognizes one or more rules according to a specified order. Composite rules are Pulsar.Core's implementation of Non-Terminal symbols, and as such, there is only one implementation of the ICompositeRule interface, the Axis.Pulsar.Core.Grammar.Composite.NonTerminal class.

The non-terminal simply delegates recognition responsibilities to it's encapculated IGroupRule instance.

Group Symbols

<a id="Group-Symbols"></a> Group symbols are what implement the specialized ordering of encapsulated symbols; they each contain one or more Group Element. Each element possesses a Cardinality property that facilitates recognizing repetitions of the element. Pulsar.Core comes with 5 types of groups:

  1. Choice: A choice functions by searching for the first of it's element rules to successsfully recognize characters from the input tokens. Soon as a successful recognition is found, it is reported.
  2. Sequence: A sequence functions by ensuring all element rules successfully recognize characters from the input string, in the order that they exist internally.
  3. Set: A set functions by ensureing all element rules successfully recognize characters from the input string, but unlike it's sibling, Sequence, Set does not enforce any order. It only enforces that a given minimum recognition-count be successful.
  4. Production Ref: A production ref is a reference to another production. This is how Pulsar.Core is able to nest symbols inside other symbols, constituting the composition of symbols to make a production.
  5. Atom Ref: Pulsar.Core enables inline-expression of atomic symbols/rules. What this means is, rather than ONLY having atomic symbols appearing as the single symbol of a production, atoms can be placed right within a composite rule.
Cardinality

<a id="Cardinality"></a> Cardinality expresses the number of occurences expected for a Group Element. Cardinality provides two values, a min occurence, and a max occurence value. Obviously, the min cannot be greater than the max. However, the max value can also indicate "infinity", signifying a situation where no limit is specified. This is modeled similar to cardinality in regular expressions.

Syntax Validation

<a id="Syntax-Validation"></a> A simple validation API is provided by Pulsar.Core, enabling users plug into the recognition execution and provide validation services. Valitation is triggered after every Production is successfully recognized, and the recognized CST Node is passed into the validation function, for which any exceptions thrown signals a an invalid recognition.

Recognition

<a id="Recognition"></a> Recognition within Pulsar.Core is a process where every Rule reads a sufficient number of characters from the TokenBuffer that conforms to the rules internal expectations. The following assumptions/invariants MUST always be true for every recognition operation:

  1. The TokenBuffer is initially at a position where the next read character has not be recognized by any previous rule.
  2. Upon successful recognition, the TokenBuffer is in a position pointing to the last recognized character of the current rule.
  3. Upon failed recognition, the TokenBuffer is left in a state exactly as it was initially passed into the rule.
  4. There are 3 recognition results
    1. Success - returns the designated success object, e.g the CST Node.

    2. FailedRecognition - signifies that a recognition was not possible, and other alternative rules may be tried if possible.

    3. PartialRecognition - signifies that recognition was only partially possible. This is a fatal error and all further recongition operations are aborted. This usually happens when the unique "delimiters" for symbols were recognized, but further characters failed to be recognized. E.g, a rule that must match a signed decimal integer defined by the following pseudo-production:

      decimal-integer = {/[+-]/, \d+\}

      Given the input "+abc", the production will return a PartialRecognition result, because a + was recognized, but the rest of the expected characters weren't found.

Language IO

<a id="Language-IO"></a> Pular.Core only facilitates building grammars with the programatic structures described above, as such, it has no language of its own to enable the declaritive expression of context-free grammar. It however provides an extension point for any grammar-expression language to plug into it's APIs. This happens through the interfaces in the Axis.Pulsar.Core.Lang namespace.

  • ILanguageContext: At the heart of Pulsar.Core is the language context. This contains all information needed to ingest a string of characters, and return a CST representation of the recognized grammar.

  • ILanguageExporter: This interface provides an API for grammar-expression languages to convert the IGrammar within a ILanguageContext into a string representation of it's grammar-expression langauge.

  • ILanguageImporter: This interfaces provides an API for grammar-expression languages to convert their string representations into the various constructs provided by Pulsar.Core.

Product Compatible and additional computed target framework versions.
.NET net7.0 is compatible.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages (2)

Showing the top 2 NuGet packages that depend on Axis.Pulsar.Core:

Package Downloads
Axis.Pulsar.Core.XBNF

eXtended BNF implementation of the Axis.Pulsar.Core grammar recognizer

Axis.Dia.PathQuery

Package Description

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
0.3.4 208 7/27/2024
0.3.3 101 7/17/2024
0.3.1 93 7/16/2024
0.3.0 97 7/15/2024
0.2.0 140 3/16/2024
0.1.5 149 1/27/2024
0.1.4 129 1/27/2024
0.1.3 121 1/26/2024
0.1.2 112 1/26/2024
0.1.1 116 1/25/2024
0.0.1 171 1/5/2024
0.0.1-testing-preview-trigg... 102 1/13/2024