BlingFireNuget 0.1.7

There is a newer version of this package available.
See the version list below for details.

dotnet add package BlingFireNuget --version 0.1.7

NuGet\Install-Package BlingFireNuget -Version 0.1.7

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="BlingFireNuget" Version="0.1.7" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add BlingFireNuget --version 0.1.7

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: BlingFireNuget, 0.1.7"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

// Install BlingFireNuget as a Cake Addin
#addin nuget:?package=BlingFireNuget&version=0.1.7

// Install BlingFireNuget as a Cake Tool
#tool nuget:?package=BlingFireNuget&version=0.1.7

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

Bling Fire

Introduction

Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), we help Bing be smarter. Here we wanted to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing such as Tokenization, Multi-word expression matching, Unknown word-guessing, Stemming / Lemmatization just to mention a few.

Bling Fire Tokenizer Overview

Bling Fire Tokenizer provides state of the art performance for Natural Language text tokenization. Bling Fire supports the following tokenization algorithms:

Pattern-based tokenization
WordPiece tokenization
SentencePiece Unigram LM
SentencePiece BPE
Induced/learned syllabification patterns (identifies possible hyphenation points within a token)

Bling Fire provides uniform interface for working with all four algorithms so there is no difference for the client whether to use tokenizer for XLNET, BERT or your own custom model.

Model files describe the algorithms they are built for and are loaded on demand from external file. There are also two default models for NLTK-style tokenization and sentence breaking, which does not need to be loaded. The default tokenization model follows logic of NLTK, except hyphenated words are split and a few "errors" are fixed.

Normalization can be added to each model, but is optional.

Diffrences between algorithms are summarized here.

Bling Fire Tokenizer high level API designed in a way that it requires minimal or no configuration, or initialization, or additional files and is friendly for use from languages like Python, Ruby, Rust, C#, JavaScript (via WASM), etc.

Oh yes, it is also the fastest! We did a comparison of Bling Fire with tokenizers from Hugging Face, Bling Fire runs 4-5 times faster than Hugging Face Tokenizers, see also Bing Blog Post. We did comparison of Bling Fire Unigram LM and BPE implementaion to the same one in SentencePiece library and our implementation is ~2x faster, see XLNET benchmark and BPE benchmark. Not to mention our default models are 10x faster than the same functionality from SpaCy, see benchmark wiki and this Bing Blog Post.

So if low latency inference is what you need then you have to try Bling Fire!

See Bling Fire Github page for API description and examples.

A C# example, calling XLM Roberta tokenizer and getting ids and offsets

Note, everything that is supported in Python is supported by C# API as well. C# also has ability to use parallel computations since all models and functions are stateless you can share the same model across the threads without locks. Let's load XLM Roberta model and tokenize a string, for each token let's get ID and offsets in the original text.

using System;
using BlingFire;

namespace BlingUtilsTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // load XLM Roberta tokenization model
            var h = BlingFireUtils.LoadModel("./xlm_roberta_base.bin");


            // input string
            string input = "Autophobia, also called monophobia, isolophobia, or eremophobia, is the specific phobia of isolation. I saw a girl with a telescope. Я увидел девушку с телескопом.";
            // get its UTF8 representation
            byte[] inBytes = System.Text.Encoding.UTF8.GetBytes(input);


            // allocate space for ids and offsets
            int[] Ids =  new int[128];
            int[] Starts =  new int[128];
            int[] Ends =  new int[128];

            // tokenize with loaded XLM Roberta tokenization and output ids and start and end offsets
            outputCount = BlingFireUtils.TextToIdsWithOffsets(h, inBytes, inBytes.Length, Ids, Starts, Ends, Ids.Length, 0);
            Console.WriteLine(String.Format("return length: {0}", outputCount));
            if (outputCount >= 0)
            {
                Console.Write("tokens from offsets: [");
                for(int i = 0; i < outputCount; ++i)
                {
                    int startOffset = Starts[i];
                    int surfaceLen = Ends[i] - Starts[i] + 1;

                    string token = System.Text.Encoding.UTF8.GetString(new ArraySegment<byte>(inBytes, startOffset, surfaceLen));
                    Console.Write(String.Format("'{0}'/{1} ", token, Ids[i]));
                }
                Console.WriteLine("]");
            }

            // free loaded models
            BlingFireUtils.FreeModel(h);
        }
    }
}

This code will print the following output:

return length: 49
tokens from offsets: ['Auto'/4396 'pho'/22014 'bia'/9166 ','/4 ' also'/2843 ' called'/35839 ' mono'/22460 'pho'/22014 'bia'/9166 ','/4 ' is'/83 'olo'/7537 'pho'/22014 'bia'/9166 ','/4 ' or'/707 ' '/6 'eremo'/102835 'pho'/22014 'bia'/9166 ','/4 ' is'/83 ' the'/70 ' specific'/29458 ' pho'/53073 'bia'/9166 ' of'/111 ' '/6 'isolation'/219488 '.'/5 ' I'/87 ' saw'/24124 ' a'/10 ' girl'/23040 ' with'/678 ' a'/10 ' tele'/5501 'scope'/70820 '.'/5 ' Я'/1509 ' увидел'/79132 ' дев'/29513 'у'/105 'шку'/46009 ' с'/135 ' теле'/18293 'скоп'/41333 'ом'/419 '.'/5 ]

See this project for more C# examples: https://github.com/microsoft/BlingFire/tree/master/nuget/test .

License

Licensed under the MIT License.

There are no supported framework assets in this package.

Learn more about Target Frameworks and .NET Standard.

.NETCoreApp 3.1
- No dependencies.

NuGet packages (2)

Showing the top 2 NuGet packages that depend on BlingFireNuget:

Package	Downloads
SS.SemanticKernel.Extensions This is a SemanticKernel extension built on the Embedding codebase.	1.0K
BlingFireNetStandard BlingFire wrapper for .Net Standard, see https://github.com/microsoft/BlingFire for details.	205

GitHub repositories (1)

Showing the top 1 popular GitHub repositories that depend on BlingFireNuget:

Repository	Stars
Azure-Samples/semantic-kernel-rag-chat Tutorial for ChatGPT + Enterprise Data with Semantic Kernel, OpenAI, and Azure Cognitive Search	132

Version	Downloads	Last updated
0.1.8	21,771	9/24/2021
0.1.7	1,311	5/25/2021
0.1.6	1,669	4/7/2021
0.1.5	840	3/11/2021
0.1.4	2,273	7/7/2020

BlingFire wrapper for .Net Core, see https://github.com/microsoft/BlingFire for details.