Text analyzer that extracts tokens from text for use in full-text search queries and indexes.

GM Consult Pty Ltd

Tokenize text, compute document readability and compare terms in Natural Language Processing.

THIS PACKAGE IS PRE-RELEASE, and SUBJECT TO DAILY BREAKING CHANGES.

Overview

The text_analysis package provides methods to tokenize text, compute readability scores for a document and evaluate the similarity of terms. It is intended to be used in Natural Language Processing (NLP) as part of an information retrieval system.

It is split into four libraries:

  • text_analysis is the core library that exports the tokenization, analysis and string similarity functions;
  • extensions exports extension methods that are also provided as static methods of the TermSimilarity class;
  • implementation exports the mixins and base classes that implement the interfaces; and
  • type_definitions exports all the typedefs used in this package.

Refer to the references to learn more about information retrieval systems and the theory behind this library.

Tokenization

Tokenization comprises the following steps:

  • a term splitter splits text into a list of terms at appropriate places like white-space and mid-sentence punctuation;
  • a character filter manipulates terms prior to tokenization (e.g. changing case and / or removing non-word characters);
  • a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
  • the tokenizer converts terms to a collection of tokens that contain tokenized versions of the term and a pointer to the position of the tokenized term (n-gram) in the source text. The tokens are generated for keywords, terms and/or n-grams, depending on the TokenizingStrategy selected. The desired n-gram range can be passed in when tokenizing the text or document.
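
The steps above can be sketched in plain Dart as follows. This is only an illustration of the pipeline, not the package's implementation: Token, stopWords, termSplitter, characterFilter, termFilter and tokenize are hypothetical names, and the position recorded here is the index in the filtered term list rather than in the source text.

```dart
// A minimal sketch of a tokenization pipeline. All names are hypothetical
// and are NOT part of the text_analysis API.

/// A token: the tokenized term (or n-gram) and the index of its first term.
class Token {
  final String term;
  final int termPosition;
  const Token(this.term, this.termPosition);

  @override
  String toString() => '$term@$termPosition';
}

/// A tiny, illustrative stopword list.
const stopWords = {'a', 'an', 'and', 'is', 'of', 'the'};

/// Term splitter: split text at white-space.
List<String> termSplitter(String text) =>
    text.split(RegExp(r'\s+')).where((t) => t.isNotEmpty).toList();

/// Character filter: lower-case a term and remove non-word characters,
/// keeping hyphens so that the term filter can split on them.
String characterFilter(String term) =>
    term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9-]'), '');

/// Term filter: split hyphenated terms and drop stopwords and empty terms.
List<String> termFilter(String term) => term
    .split('-')
    .where((t) => t.isNotEmpty && !stopWords.contains(t))
    .toList();

/// Tokenizer: emit a token for every n-gram of 1..maxN consecutive terms.
List<Token> tokenize(String text, {int maxN = 2}) {
  final terms =
      termSplitter(text).map(characterFilter).expand(termFilter).toList();
  final tokens = <Token>[];
  for (var i = 0; i < terms.length; i++) {
    for (var n = 1; n <= maxN && i + n <= terms.length; n++) {
      tokens.add(Token(terms.sublist(i, i + n).join(' '), i));
    }
  }
  return tokens;
}

void main() {
  print(tokenize('The well-known platypus is a mammal.'));
  // [well@0, well known@0, known@1, known platypus@1, platypus@2,
  //  platypus mammal@2, mammal@3]
}
```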

Text analysis

Readability

The TextDocument enumerates a text document's paragraphs, sentences, terms and tokens and computes readability measures:

  • the average number of words in each sentence;
  • the average number of syllables per word;
  • the Flesch reading ease score, a readability measure calculated from sentence length and word length on a 100-point scale; and
  • Flesch-Kincaid grade level, a readability measure relative to U.S. school grade level.

The TextDocument also includes a co-occurrence graph generated using the Rapid Automatic Keyword Extraction (RAKE) algorithm, from which the keywords (and keyword scores) can be obtained.
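
For reference, the two Flesch measures listed above can be computed from raw counts with the standard published formulas. The sketch below is stand-alone Dart and does not call the package. The function names and the counts in main are hypothetical; in particular, the syllable count depends on the syllable counter used.

```dart
/// Flesch reading ease from raw counts (standard published formula).
double fleschReadingEase(int words, int sentences, int syllables) =>
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);

/// Flesch-Kincaid grade level from raw counts (standard published formula).
double fleschKincaidGrade(int words, int sentences, int syllables) =>
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;

void main() {
  // Hypothetical counts: one sentence of 13 words and 24 syllables.
  print(fleschReadingEase(13, 1, 24).toStringAsFixed(1)); // 37.5
  print(fleschKincaidGrade(13, 1, 24).toStringAsFixed(1)); // 11.3
}
```

Longer sentences and longer words both lower the reading ease score and raise the grade level.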

String Comparison

The following measures of term similarity are provided as extensions on String:

  • Damerau–Levenshtein distance is the minimum number of single-character edits (transpositions, insertions, deletions or substitutions) required to change one term into another;
  • edit similarity is a normalized measure of Damerau–Levenshtein distance on a scale of 0.0 to 1.0, calculated by dividing the difference between the maximum edit distance (sum of the length of the two terms) and the computed editDistance, by the maximum edit distance;
  • length distance returns the absolute value of the difference in length between two terms;
  • character similarity returns the similarity of two terms as it relates to the collection of unique characters in each term on a scale of 0.0 to 1.0;
  • length similarity returns the similarity in length between two terms on a scale of 0.0 to 1.0 on a log scale (1 - the log of the ratio of the term lengths); and
  • Jaccard similarity measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
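
To make the edit distance and edit similarity definitions above concrete, here is a minimal stand-alone sketch. It uses the optimal string alignment variant of Damerau–Levenshtein (a common simplification, which may differ from the package's algorithm). The function names mirror the measures described above but are written from scratch here.

```dart
import 'dart:math' as math;

/// Edit distance counting insertions, deletions, substitutions and adjacent
/// transpositions (optimal string alignment variant). Case-insensitive.
int editDistance(String a, String b) {
  final s = a.toLowerCase();
  final t = b.toLowerCase();
  // d[i][j] = distance between the first i chars of s and first j chars of t.
  final d =
      List.generate(s.length + 1, (_) => List<int>.filled(t.length + 1, 0));
  for (var i = 0; i <= s.length; i++) {
    d[i][0] = i;
  }
  for (var j = 0; j <= t.length; j++) {
    d[0][j] = j;
  }
  for (var i = 1; i <= s.length; i++) {
    for (var j = 1; j <= t.length; j++) {
      final cost = s[i - 1] == t[j - 1] ? 0 : 1;
      d[i][j] = math.min(
          math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
      // Adjacent transposition, e.g. "dr" <-> "rd".
      if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]) {
        d[i][j] = math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[s.length][t.length];
}

/// (maxDistance - editDistance) / maxDistance, where maxDistance is the sum
/// of the lengths of the two terms.
double editSimilarity(String a, String b) {
  final maxDistance = a.length + b.length;
  if (maxDistance == 0) return 1.0;
  return (maxDistance - editDistance(a, b)) / maxDistance;
}

void main() {
  print(editDistance('bodrer', 'border')); // 1 (one transposition)
  print(editSimilarity('bodrer', 'border').toStringAsFixed(3)); // 0.917
}
```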

The TermSimilarity class enumerates all the similarity measures of two terms and provides the TermSimilarity.similarity property that combines the four measures into a single value.

The TermSimilarity class also provides a function for splitting terms into k-grams, used in spell correction algorithms.
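
As an illustration of k-gram splitting, the sketch below pads a term with "$" at the start and end and slides a window of k characters over it. Combining k-grams with the Jaccard index is one common use in spell correction. It is shown here as an assumption, not as the package's behaviour, and kGrams and jaccard are hypothetical stand-alone functions.

```dart
/// Splits a term into its set of k-grams, using "$" to mark the start and
/// the end of the term.
Set<String> kGrams(String term, [int k = 3]) {
  final padded = '\$${term.toLowerCase()}\$';
  if (padded.length <= k) return {padded};
  return {
    for (var i = 0; i + k <= padded.length; i++) padded.substring(i, i + k)
  };
}

/// Jaccard index: size of the intersection divided by the size of the union.
double jaccard(Set<String> a, Set<String> b) {
  final union = a.union(b);
  if (union.isEmpty) return 1.0;
  return a.intersection(b).length / union.length;
}

void main() {
  print(kGrams('castle')); // {$ca, cas, ast, stl, tle, le$}
  print(jaccard(kGrams('border'), kGrams('bodrer'))); // 0.2
}
```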

Usage

In the pubspec.yaml of your Flutter project, add the following dependency:

dependencies:
  text_analysis: <latest version>

In your code file add the text_analysis library import. This will also import the Porter2Stemmer class from the porter_2_stemmer package.

// import the core classes
import 'package:text_analysis/text_analysis.dart';

To use the package's extensions and/or type definitions, also add any of the following imports:

// import the typedefs, if needed
import 'package:text_analysis/type_definitions.dart';

// import the extensions, if needed
import 'package:text_analysis/extensions.dart';

// import the implementation classes, if needed
import 'package:text_analysis/implementation.dart';

Basic English tokenization can be performed with the TextTokenizer.english static const instance, or by constructing a TextTokenizer with the English text analyzer and no token filter:

  // Construct a TextTokenizer with the English analyzer and tokenize the text.
  // `readabilityExample` is a String of sample text defined elsewhere.
  final tokens = await TextTokenizer(analyzer: English()).tokenize(
      readabilityExample,
      strategy: TokenizingStrategy.all,
      nGramRange: NGramRange(1, 2));

To analyze text or a document, hydrate a TextDocument to obtain the text statistics and readability scores:

  // get some sample text
  final sample =
      'The Australian platypus is seemingly a hybrid of a mammal and reptilian creature.';

  // hydrate the TextDocument
  final textDoc = await TextDocument.analyze(
      sourceText: sample,
      analyzer: English.analyzer,
      nGramRange: NGramRange(1, 3));

  // print the `Flesch reading ease score`
  print(
      'Flesch Reading Ease: ${textDoc.fleschReadingEaseScore().toStringAsFixed(1)}');
  // prints "Flesch Reading Ease: 37.5"

For more complex text analysis:

  • implement a TextAnalyzer for a different language or non-language documents;
  • implement a custom TextTokenizer or extend TextTokenizerBase;
  • pass in a TokenFilter function to a TextTokenizer to manipulate the tokens after tokenization as shown in the examples; and/or
  • extend TextDocumentBase.

To compare terms, call the desired extension on the term, or the static method from the TermSimilarity class:

  // define a misspelt term
  const term = 'bodrer';

  // a collection of auto-correct options
  const candidates = [
    'bord',
    'board',
    'broad',
    'boarder',
    'border',
    'brother',
    'bored'
  ];

  // get a list of the terms ordered by descending similarity
  final matches = term.matches(candidates);
  // same as TermSimilarity.matches(term, candidates)

  // print matches
  print('Ranked matches: $matches');
  // prints:
  //     Ranked matches: [border, boarder, bored, brother, board, bord, broad]

Please see the examples for more details.

API

The key interfaces of the text_analysis library are briefly described in this section. Please refer to the documentation for details.

The API contains a fair amount of boiler-plate, but we aim to make the code as readable, extendable and re-usable as possible:

  • We use an interface > implementation mixin > base-class > implementation class pattern:
    • the interface is an abstract class that exposes fields and methods but contains no implementation code. The interface may expose a factory constructor that returns an implementation class instance;
    • the implementation mixin implements the interface class methods, but not the input fields;
    • the base-class is an abstract class with the implementation mixin and exposes a default, unnamed generative const constructor for sub-classes. The intention is that implementation classes extend the base class, overriding the interface input fields with final properties passed in via a const generative constructor.
  • To maximise performance of the indexers the API performs lookups in nested hashmaps of Dart core types. To improve code legibility the API makes use of type aliases, callback function definitions and extensions. The typedefs and extensions are not exported by the text_analysis library, but can be found in the type_definitions and extensions mini-libraries. Import these libraries separately if needed.
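
The interface > implementation mixin > base-class > implementation class pattern described above can be sketched with a small, hypothetical example. Sentence, SentenceMixin, SentenceBase and _SentenceImpl are illustrative names only and are not part of the package.

```dart
/// Interface: exposes fields and methods but contains no implementation.
abstract class Sentence {
  /// Input field, supplied by implementation classes.
  String get text;

  /// Computed value, implemented by the mixin.
  int get wordCount;

  /// Factory constructor that returns an implementation class instance.
  factory Sentence(String text) => _SentenceImpl(text);
}

/// Implementation mixin: implements the methods, but not the input fields.
mixin SentenceMixin implements Sentence {
  @override
  int get wordCount =>
      text.split(RegExp(r'\s+')).where((w) => w.isNotEmpty).length;
}

/// Base class: abstract, mixes in the implementation and exposes a default
/// const generative constructor for sub-classes.
abstract class SentenceBase with SentenceMixin {
  const SentenceBase();
}

/// Implementation class: extends the base class and overrides the input
/// field with a final property passed in via a const constructor.
class _SentenceImpl extends SentenceBase {
  @override
  final String text;
  const _SentenceImpl(this.text);
}

void main() {
  print(Sentence('The quick brown fox').wordCount); // 4
}
```

A custom implementation would extend the base class in the same way as _SentenceImpl, inheriting the mixin's methods while supplying its own fields.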

TermSimilarity

The TermSimilarity class provides the following measures of similarity between two terms:

  • characterSimilarity returns the similarity of two terms as it relates to the collection of unique characters in each term on a scale of 0.0 to 1.0;
  • editDistance returns the Damerau–Levenshtein distance, the minimum number of single-character edits (transpositions, insertions, deletions or substitutions) required to change one term into another;
  • editSimilarity returns a normalized measure of Damerau–Levenshtein distance on a scale of 0.0 to 1.0, calculated by dividing the difference between the maximum edit distance (sum of the length of the two terms) and the computed editDistance, by the maximum edit distance;
  • lengthDistance returns the absolute value of the difference in length between two terms;
  • lengthSimilarity returns the similarity in length between two terms on a scale of 0.0 to 1.0 on a log scale (1 - the log of the ratio of the term lengths);
  • jaccardSimilarity returns the Jaccard Similarity Index of two terms.

To compare one term with a collection of other terms, the following static methods are also provided:

  • editDistanceMap returns a hashmap of terms to their editDistance with a term;
  • editSimilarityMap returns a hashmap of terms to their editSimilarity with a term;
  • lengthSimilarityMap returns a hashmap of terms to their lengthSimilarity with a term;
  • jaccardSimilarityMap returns a hashmap of terms to Jaccard Similarity Index with a term;
  • termSimilarityMap returns a hashmap of terms to termSimilarity with a term;
  • termSimilarities, editSimilarities, characterSimilarities, lengthSimilarities and jaccardSimilarities all return a list of [SimilarityIndex] values for candidate terms; and
  • matches returns the best matches from terms for a term, in descending order of term similarity (best match first).
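
The shape of these collection methods can be illustrated with a stand-alone sketch. It substitutes a deliberately simple measure (the Jaccard index of the unique characters in each term) for the package's combined term similarity, so the rankings it produces will differ from those of the package. All three function names are hypothetical.

```dart
/// Stand-in similarity measure: Jaccard index of the sets of unique
/// characters in each term (case-insensitive).
double characterJaccard(String a, String b) {
  final x = a.toLowerCase().split('').toSet();
  final y = b.toLowerCase().split('').toSet();
  final union = x.union(y);
  return union.isEmpty ? 1.0 : x.intersection(y).length / union.length;
}

/// Maps each candidate to its similarity with [term].
Map<String, double> similarityMap(String term, Iterable<String> candidates) =>
    {for (final c in candidates) c: characterJaccard(term, c)};

/// Returns the candidates in descending order of similarity with [term].
List<String> matches(String term, Iterable<String> candidates) {
  final scores = similarityMap(term, candidates);
  return scores.keys.toList()
    ..sort((a, b) => scores[b]!.compareTo(scores[a]!));
}

void main() {
  print(matches('bodrer', ['cat', 'broad', 'border']));
  // [border, broad, cat]
}
```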

Term comparisons are NOT case-sensitive.

The TermSimilarity class relies on extension methods that can be imported from the extensions library.

TextAnalyzer

The TextAnalyzer interface exposes language-specific properties and methods used in text analysis:

  • characterFilter is a function that manipulates text prior to stemming and tokenization;
  • termFilter is a filter function that returns a collection of terms from a term. It returns an empty collection if the term is to be excluded from analysis, multiple terms if the term is split (e.g. at hyphens), and/or modified terms (e.g. after applying a stemmer algorithm);
  • termSplitter returns a list of terms from text;
  • sentenceSplitter splits text into a list of sentences at sentence and line endings;
  • paragraphSplitter splits text into a list of paragraphs at line endings;
  • stemmer is a language-specific function that returns the stem of a term;
  • lemmatizer is a language-specific function that returns the lemma of a term;
  • termExceptions is a hashmap of words to token terms for special words that should not be re-capitalized, stemmed or lemmatized;
  • stopWords are terms that commonly occur in a language and that do not add material value to the analysis of text; and
  • syllableCounter returns the number of syllables in a word or text.

The LatinLanguageAnalyzerMixin implements the TextAnalyzer interface methods for languages that use the Latin/Roman alphabet/character set.

The English implementation of TextAnalyzer is included in this library and mixes in the LatinLanguageAnalyzerMixin.
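
As a rough illustration of two of the functions a TextAnalyzer exposes, the sketch below shows naive paragraph and sentence splitters. These are simplified assumptions, not the English analyzer's implementation. For example, they would wrongly split at abbreviations such as "e.g.".

```dart
/// Splits text into paragraphs at line endings, dropping empty paragraphs.
List<String> paragraphSplitter(String text) => text
    .split(RegExp(r'[\r\n]+'))
    .map((p) => p.trim())
    .where((p) => p.isNotEmpty)
    .toList();

/// Splits text into sentences at white-space that follows sentence-ending
/// punctuation, and at line endings.
List<String> sentenceSplitter(String text) => text
    .split(RegExp(r'(?<=[.!?])\s+|[\r\n]+'))
    .map((s) => s.trim())
    .where((s) => s.isNotEmpty)
    .toList();

void main() {
  print(sentenceSplitter('Hello there. How are you?\nFine!'));
  // [Hello there., How are you?, Fine!]
  print(paragraphSplitter('First paragraph.\n\nSecond paragraph.'));
  // [First paragraph., Second paragraph.]
}
```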

TextTokenizer

The TextTokenizer extracts tokens from text for use in full-text search queries and indexes. It uses a TextAnalyzer and token filter in the tokenize and tokenizeJson methods that return a list of tokens from text or a document.

An unnamed factory constructor hydrates an implementation class. Alternatively you can extend TextTokenizerBase.

TextDocument

The TextDocument object model enumerates a text document's paragraphs, sentences, terms, n-grams, syllable count and tokens and provides functions that return text analysis measures.

The TextDocumentMixin implements the averageSentenceLength, averageSyllableCount, wordCount, fleschReadingEaseScore and fleschKincaidGradeLevel methods.

A TextDocument can be hydrated with the unnamed factory constructor or using the analyze or analyzeJson static methods. Alternatively, extend the TextDocumentBase class.

Definitions

The following definitions are used throughout the documentation:

  • corpus - the collection of documents for which an index is maintained.
  • character filter - filters characters from text in preparation of tokenization.
  • Damerau–Levenshtein distance - a metric for measuring the edit distance between two terms by counting the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one term into the other (from Wikipedia).
  • dictionary (in an index) - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
  • document - a record in the corpus, that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
  • document frequency (dFt) - the number of documents in the corpus that contain a term.
  • edit distance - a measure of how dissimilar two terms are by counting the minimum number of operations required to transform one string into the other (from Wikipedia).
  • etymology - the study of the history of the form of words and, by extension, the origin and evolution of their semantic meaning across time (from Wikipedia).
  • Flesch reading ease score - a readability measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document (from Wikipedia).
  • Flesch-Kincaid grade level - a readability measure relative to U.S. school grade level. It is also calculated from sentence length and word length (from Wikipedia).
  • IETF language tag - a standardized code or tag that is used to identify human languages on the Internet (from Wikipedia).
  • index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
  • index-elimination - selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
  • inverse document frequency (iDft) - a normalized measure of how rare a term is in the corpus. It is defined as log (N / dft), where N is the total number of documents in the corpus and dft is the document frequency of the term. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
  • Jaccard index measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets (from Wikipedia).
  • JSON - an acronym for "JavaScript Object Notation", a common format for persisting data.
  • k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
  • lemma or lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
  • n-gram (sometimes also called Q-gram) is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles (from Wikipedia).
  • Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data (from Wikipedia).
  • Part-of-Speech (PoS) tagging is the task of labelling every word in a sequence of words with a tag indicating what lexical syntactic category it assumes in the given sequence (from Wikipedia).
  • Phonetic transcription - the visual representation of speech sounds (or phones) by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the International Phonetic Alphabet (from Wikipedia).
  • postings - a separate index that records which documents the vocabulary occurs in. In a positional index, the postings also record the positions of each term in the text to create a positional inverted index.
  • postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text. In a zoned index, the postings list records the positions of each term in the text of a zone.
  • stem or stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form) (from Wikipedia).
  • stopwords - common words in a language that are excluded from indexing.
  • term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
  • term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
  • term expansion - finding terms with similar spelling (e.g. spelling correction) or synonyms for a term.
  • term frequency (Ft) - the frequency of a term in an index or indexed object.
  • term position - the zero-based index of a term in an ordered array of terms tokenized from the corpus.
  • text - the indexable content of a document.
  • token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
  • token filter - returns a subset of tokens from the tokenizer output.
  • tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
  • vocabulary - the collection of terms indexed from the corpus.
  • zone - the field or zone of a document that a term occurs in, used for parametric indexes or where scoring and ranking of search results attribute a higher score to documents that contain a term in a specific zone (e.g. the title rather than the body of a document).
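
As a worked illustration of the inverse document frequency definition above, assume base-10 logarithms (the base is not stated in the definition) and take N as the number of documents in the corpus. A term found in 10 of 1,000 documents then scores 2, while a term found in 500 of them scores about 0.3:

```latex
\mathrm{iDf}_t = \log_{10}\frac{N}{\mathrm{df}_t}, \qquad
\log_{10}\frac{1000}{10} = 2, \qquad
\log_{10}\frac{1000}{500} \approx 0.301
```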

References

Issues

If you find a bug please file an issue.

This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
