LanguageDetection.Ai 1.1.0

Language Detection

Detect the language of a text using a naive Bayesian filter with language profiles generated from Wikipedia abstract XML, with over 99% precision for 51 languages. Original author: Nakatani Shuyo.

.NET port of the Language Detection Library for Java by @shuyo. Forked from TechnikEmpire/language-detection.

This package has been updated to C# 11 and .NET 7, and all external dependencies have been removed. Execution has also been optimized with threading and other improvements: with all 51 languages loaded, detection now takes 0.5 ms, down from 1.12 ms.

The LanguageDetector is now thread-safe, so you can create a singleton instance of it and reuse it throughout your application.

Feel free to send pull requests to this repo.

The Naive Bayesian filter

The Naive Bayesian filter is a classification method based on Bayes' Theorem that treats every feature as independent of the others. In the context of language detection, these "features" could be the frequencies of certain words, characters, or n-grams (sequences of n characters) in a text.
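To make the idea of n-gram features concrete, here is a minimal sketch that counts character n-grams in a string. The helper name CountNGrams and its maxN parameter are hypothetical and chosen for illustration only; they are not part of the LanguageDetection.Ai API, and the library's own feature extraction may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the library): counts every character
// n-gram of length 1..maxN that occurs in the text.
static Dictionary<string, int> CountNGrams(string text, int maxN = 3)
{
    var counts = new Dictionary<string, int>();
    for (int n = 1; n <= maxN; n++)
    {
        // Slide a window of width n over the text.
        for (int i = 0; i + n <= text.Length; i++)
        {
            string gram = text.Substring(i, n);
            counts.TryGetValue(gram, out int current); // current is 0 when the gram is new
            counts[gram] = current + 1;
        }
    }
    return counts;
}

var grams = CountNGrams("banana", maxN: 2);
foreach (var pair in grams)
{
    Console.WriteLine($"\"{pair.Key}\": {pair.Value}");
}
```

For "banana" with maxN set to 2, the counts cover the unigrams b, a, n and the bigrams ba, an, na. Frequency tables of this kind are what a language profile is built from.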

Here's a high-level overview of how it works:

Training Phase: During training, the filter uses a set of labeled training data (in this case, the Wikipedia abstract XML data for each language) to calculate the prior probability of each class (language) and the conditional probability of each feature (word, character, or n-gram) given each class. The prior probability of a class is the overall likelihood of that class in the training set, while the conditional probability of a feature given a class is the likelihood of that feature occurring in instances of that class.

Prediction Phase: To predict the class of a new, unlabeled instance (in this case, a piece of text whose language we want to detect), the filter first transforms the instance into a feature vector. It then applies Bayes' Theorem to calculate the posterior probability of each class given this feature vector. The class with the highest posterior probability is chosen as the prediction.
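The two phases above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the library's implementation: the training sentences are made up, the features are character bigrams only, and Laplace (add-one) smoothing is an assumption introduced here so that unseen bigrams do not zero out a score. The names NGrams, Predict, gramCounts, totals, and vocabulary are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int N = 2; // use character bigrams as features

// Toy labeled training data; real profiles are built from a large corpus.
var training = new Dictionary<string, string[]>
{
    ["eng"] = new[] { "this is a text", "the language is english" },
    ["dan"] = new[] { "denne tekst er på dansk", "sproget er dansk" },
};

// Yields every character n-gram of the given length.
static IEnumerable<string> NGrams(string text, int n)
{
    for (int i = 0; i + n <= text.Length; i++)
    {
        yield return text.Substring(i, n);
    }
}

// Training phase: count n-grams per language and collect the vocabulary.
var gramCounts = new Dictionary<string, Dictionary<string, int>>();
var totals = new Dictionary<string, int>();
var vocabulary = new HashSet<string>();
int docCount = training.Values.Sum(docs => docs.Length);

foreach (var (lang, docs) in training)
{
    var counts = new Dictionary<string, int>();
    foreach (string doc in docs)
    {
        foreach (string gram in NGrams(doc, N))
        {
            counts.TryGetValue(gram, out int current);
            counts[gram] = current + 1;
            vocabulary.Add(gram);
        }
    }
    gramCounts[lang] = counts;
    totals[lang] = counts.Values.Sum();
}

// Prediction phase: pick the language with the highest log posterior,
// log P(lang) + sum of log P(gram | lang), using add-one smoothing.
string Predict(string text)
{
    string best = "";
    double bestScore = double.NegativeInfinity;
    foreach (string lang in training.Keys)
    {
        double score = Math.Log((double)training[lang].Length / docCount);
        foreach (string gram in NGrams(text, N))
        {
            gramCounts[lang].TryGetValue(gram, out int count);
            score += Math.Log((count + 1.0) / (totals[lang] + vocabulary.Count));
        }
        if (score > bestScore)
        {
            bestScore = score;
            best = lang;
        }
    }
    return best;
}

Console.WriteLine(Predict("dansk tekst"));
Console.WriteLine(Predict("english text"));
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which is why the sketch sums logarithms instead of multiplying probabilities directly.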

In the context of this Language Detector, the "naive" assumption of the Naive Bayesian filter—that every feature is independent of every other feature—is not strictly true, as words and characters in a language are often dependent on each other. However, despite this assumption, the Naive Bayesian filter often performs well in practice and is particularly effective for language detection due to its ability to handle many features and its robustness to irrelevant features.
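Written out, the independence assumption lets the posterior factor into a product of per-feature terms. The notation below is introduced here for illustration: c is a candidate language and x_1, ..., x_n are the features (for example n-grams) extracted from the text.

```latex
\[
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \Big]
\]
```

The product form is exact only if the features really are independent given the language. In practice it serves as an approximation whose ranking of languages is usually still reliable.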

It's important to note that the accuracy of the Naive Bayesian filter heavily depends on the quality and representativeness of the training data. The language profiles generated from Wikipedia abstract XML data provide a broad and diverse sample of language use, contributing to the high precision of this Language Detector.

Supported Languages

This library provides language detection support for the following languages:

Language ISO 639-2/T Code
Afrikaans afr
Arabic ara
Bengali ben
Bulgarian bul
Czech ces
Danish dan
German deu
Greek ell
English eng
Estonian est
Persian fas
Finnish fin
French fra
Gujarati guj
Hebrew heb
Hindi hin
Croatian hrv
Hungarian hun
Indonesian ind
Italian ita
Japanese jpn
Kannada kan
Korean kor
Latvian lav
Lithuanian lit
Malayalam mal
Marathi mar
Macedonian mkd
Nepali nep
Dutch nld
Norwegian nor
Punjabi pan
Polish pol
Portuguese por
Romanian ron
Russian rus
Slovak slk
Slovenian slv
Somali som
Spanish spa
Albanian sqi
Swahili swa
Swedish swe
Tamil tam
Telugu tel
Tagalog tgl
Thai tha
Turkish tur
Twi twi
Ukrainian ukr
Urdu urd
Vietnamese vie
Chinese zho

These language codes follow the ISO 639-2/T standard. Make sure to use the correct language code when invoking the language detection methods.
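One way to catch a wrong code early is to check requested codes against the table above before handing them to the detector. The sketch below does not call the library at all; the names supported and requested are hypothetical, and the set is simply copied from the table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Codes copied from the table of supported languages above.
var supported = new HashSet<string>
{
    "afr", "ara", "ben", "bul", "ces", "dan", "deu", "ell", "eng", "est",
    "fas", "fin", "fra", "guj", "heb", "hin", "hrv", "hun", "ind", "ita",
    "jpn", "kan", "kor", "lav", "lit", "mal", "mar", "mkd", "nep", "nld",
    "nor", "pan", "pol", "por", "ron", "rus", "slk", "slv", "som", "spa",
    "sqi", "swa", "swe", "tam", "tel", "tgl", "tha", "tur", "twi", "ukr",
    "urd", "vie", "zho",
};

// "sv" is the two-letter ISO 639-1 code for Swedish, not the three-letter "swe".
string[] requested = { "dan", "eng", "sv" };
string[] unknown = requested.Where(code => !supported.Contains(code)).ToArray();

Console.WriteLine(unknown.Length == 0
    ? "All requested codes are in the table."
    : "Not in the table: " + string.Join(", ", unknown));
```

A check like this makes the common mistake of passing a two-letter code visible before any detection runs.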

Install

dotnet add package LanguageDetection.Ai

Use

using LanguageDetection;

Load all supported languages

LanguageDetector detector = new LanguageDetector();
detector.AddAllLanguages();
// Detect returns the code of the detected language; Assert.Equal is the xUnit assertion.
Assert.Equal("dan", detector.Detect("Denne tekst er skrevet i dansk"));

or a small subset

LanguageDetector detector = new LanguageDetector();
detector.AddLanguages("dan", "eng", "swe");
Assert.Equal("dan", detector.Detect("Denne tekst er skrevet i dansk"));

You can also change parameters

LanguageDetectorSettings settings = new LanguageDetectorSettings() {
    RandomSeed = 1,
    ConvergenceThreshold = 0.9,
    MaxIterations = 50,
};
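// Pass the settings to the detector (assumes a constructor overload that accepts LanguageDetectorSettings).
LanguageDetector detector = new LanguageDetector(settings);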

License

Apache 2.0

Target frameworks

  • net8.0 (included in the package)
  • Computed as compatible: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows

Dependencies

  • No dependencies.

Version Downloads Last updated
1.1.0 2,109 12/21/2023
1.0.5 1,527 6/27/2023
1.0.4 123 6/27/2023
1.0.3 119 6/27/2023
1.0.1 121 6/27/2023
1.0.0 126 6/27/2023

This release delivers a speed improvement of approximately 67%, reducing processing time from 1.4 ms to 0.46 ms for large multilingual texts. Other improvements include a thread-safe LanguageDetector, internal threading optimizations, and a comprehensive code reorganization.