Role of Tokenization in NLP

Gautam Kumar
8 min read · Aug 10, 2023


What is Tokenization?

In natural language processing (NLP), tokenization is the process of breaking up a stream of text, typically a sentence or document, into smaller units called “tokens”. These tokens are typically words, but can be sub-words or characters depending on the granularity required for a particular NLP task. Tokenization is a fundamental step in NLP as it lays the foundation for further analysis and processing of text data. The main goal of tokenization is to break up a continuous stream of text into discrete units that computers can easily process and analyze. By segmenting the text into tokens, NLP models and algorithms can understand the individual parts of the text and extract meaningful information from them.

Why is Tokenization required in NLP?

Before processing text data for any natural language problem, we need to identify the words that make up a string of characters. That is why tokenization is the most basic step in an NLP pipeline: the meaning of a text can largely be interpreted by analyzing the words it contains. Tokens are the building blocks that NLP algorithms use to process and understand human languages.

The main reasons why tokenization is required in NLP are listed below:
Text Segmentation
Feature Extraction
Text Understanding
Vocabulary Creation
Removing Redundancy
Statistical Analysis
Handling Ambiguity
Named Entity Recognition (NER)
Search and Information Retrieval
Sentiment Analysis and Language Modeling

There are different types of tokenization; we will discuss each of them below.

  1. Word Tokenization
  2. Character Tokenization
  3. Sentence Tokenization
  4. Subword Tokenization
  5. Morphological Tokenization

Word Tokenization: a fundamental process in NLP that involves breaking a sequence of text into individual words. The input text, which can be a sentence, a paragraph, or even an entire document, is divided into discrete linguistic units, each representing a single word. These individual units are referred to as “word tokens”. For example, the text below is split into word tokens:

Text : “Generative artificial intelligence is artificial intelligence capable of generating text, images, or other media.”

Word Tokens: [“Generative”, “artificial”, “intelligence”, “is”, “artificial”, “intelligence”, “capable”, “of”, “generating”, “text”, “,”, “images”, “,”, “or”, “other”, “media”, “.”]

A few approaches for obtaining word tokens are listed below:

Using split() method

Sentence="Generative artificial intelligence is artificial intelligence capable of generating text, images, or other media."
tokens_split=Sentence.split()
tokens_split

Using RegEx method

import re
# \w matches word characters, so punctuation is dropped by this pattern
tokens_re = re.findall(r"[\w']+", Sentence)
tokens_re

Using NLTK library

from nltk.tokenize import word_tokenize
token_nltk=word_tokenize(Sentence)
token_nltk

Using spaCy library

from spacy.lang.en import English
nlp = English()
doc = nlp(Sentence)
token_list_spacy = []
for token in doc:
    token_list_spacy.append(token.text)
token_list_spacy

Using Keras library

# In recent TensorFlow releases, import from tensorflow.keras.preprocessing.text instead.
# Note that this helper lowercases the text and strips punctuation by default.
from keras.preprocessing.text import text_to_word_sequence
token_keras=text_to_word_sequence(Sentence)
token_keras

Output (from split(); punctuation stays attached to the adjoining word, whereas NLTK and spaCy return “,” and “.” as separate tokens):

['Generative', 'artificial', 'intelligence', 'is', 'artificial', 'intelligence',
'capable', 'of', 'generating', 'text,', 'images,', 'or', 'other',
'media.']

Limitations of Word Tokenization:

Out-of-Vocabulary (OOV) Words: NLP models, especially those relying on pre-trained embeddings, can struggle with words that are not present in their vocabulary. If a word is not seen during training, the model might represent it as <UNK> (unknown), which leads to information loss.

Word Ambiguity: Many words have multiple meanings depending on the context in which they appear. Models may not always accurately disambiguate between these meanings, which can affect tasks like sentiment analysis, text classification, and machine translation.

Lack of Morphological Information: Word tokens often lack morphological information like tense, case, and plurality. This can be especially problematic for languages with rich inflectional systems.

Loss of Word Order Information: Representations that treat tokens as an unordered bag of words discard the original word order, which can be crucial for understanding context and meaning in sentences.

Handling Punctuation and Special Characters: Tokenization can lead to issues in handling punctuation, special characters, and emojis, which can convey important information in text.

Named Entities and Abbreviations: Splitting named entities (e.g., “New York”) or abbreviations (e.g., “U.S.A.”) into individual tokens can disrupt their meaning and hinder downstream tasks.
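
To make the OOV limitation above concrete, here is a toy vocabulary lookup (the vocabulary and the unseen word are invented for illustration):

# Any word not in the vocabulary collapses to the <UNK> id, losing its identity
vocab = {"generative": 0, "artificial": 1, "intelligence": 2, "<UNK>": 3}
tokens = ["generative", "superintelligence"]
ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
print(ids)  # [0, 3] -> "superintelligence" is mapped to <UNK>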

The limitations listed above open the door to “Character Tokenization”.

Character Tokenization: a text processing technique that breaks a given text down into its individual characters. It overcomes several of the drawbacks of word tokenization discussed above: a character tokenizer handles OOV words consistently by decomposing an unseen word into its characters and representing it in terms of those characters, so the word's information is preserved. It also keeps the vocabulary small.

A few approaches for obtaining character tokens are listed below:

Using base Python syntax

Sentence="Generative artificial intelligence is artificial intelligence capable of generating text, images, or other media."
char_token = list(Sentence)
print(char_token)

Using RegEx method

import re
char_tokens_re = re.findall(r'\S|\s', Sentence)
print(char_tokens_re)

Output:
['G', 'e', 'n', 'e', 'r', 'a', 't', 'i', 'v', 'e', ' ', 'a', 'r', 't', 'i',
'f', 'i', 'c', 'i', 'a', 'l', ' ', 'i', 'n', 't', 'e', 'l', 'l', 'i', 'g',
'e', 'n', 'c', 'e', ' ', 'i', 's', ' ', 'a', 'r', 't', 'i', 'f', 'i', 'c',
'i', 'a', 'l', ' ', 'i', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c',
'e', ' ', 'c', 'a', 'p', 'a', 'b', 'l', 'e', ' ', 'o', 'f', ' ', 'g', 'e',
'n', 'e', 'r', 'a', 't', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ',', ' ',
'i', 'm', 'a', 'g', 'e', 's', ',', ' ', 'o', 'r', ' ', 'o', 't', 'h', 'e',
'r', ' ', 'm', 'e', 'd', 'i', 'a', '.']

Limitations of Character Tokenization:

Loss of Semantic Information: Character tokenization treats each character as a separate unit, which means that the inherent meaning and semantics of words and phrases are lost, and you lose the contextual information that word tokens provide.

Increased Sequence Length: Character-level tokenization results in longer sequences compared to word-level tokenization. This can lead to increased computational and memory requirements, potentially making training and inference slower and more resource-intensive.

Increased Noise and Ambiguity: Characters like punctuation marks and whitespace can introduce noise and ambiguity when tokenized individually. Word tokenization, on the other hand, often helps in segmenting these characters appropriately to aid in understanding sentence boundaries and structure.

Reduced Generalization: Character-level models might struggle to generalize well, especially when dealing with rare or unseen character combinations. Word-level models benefit from learning common linguistic patterns and generalizing them across various contexts.

Character tokens solve the OOV(out of vocab) problem, but the length of input and output sentences increases rapidly as we represent the sentence as a sequence of characters. As a result, it becomes difficult to learn the relationships between the characters to form meaningful words. This brings us to another tokenization technique known as “subword tokenization”, which lies between word and character tokenization.

Subword Tokenization: a text processing technique used in natural language processing (NLP) to break words down into smaller units, usually called subword units or subword tokens. This approach is particularly useful for languages with complex morphology and for tasks such as machine translation, text generation, and sentiment analysis. Subword tokenization is especially valuable when dealing with out-of-vocabulary words or when working with limited vocabulary sizes.

“Subword tokenization is based on the idea that words can be decomposed into smaller meaningful units.”

One of the most popular algorithms for subword tokenization is Byte-Pair Encoding (BPE), which incrementally merges the most frequent pairs of characters or character n-grams in a corpus to create subword tokens. This allows the model to learn representations for both common words and rare words by combining existing subword tokens.
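
As a quick illustration of subword tokenization in practice, the Hugging Face tokenizers library can train a BPE tokenizer on a small corpus. This is a minimal sketch; the vocabulary size and special token below are illustrative choices, not values from this article.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Train a tiny BPE tokenizer on a toy corpus
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["low", "lower", "newest", "wider", "widest"], trainer=trainer)

# An unseen word is segmented into known subword units instead of a single unknown token
print(tokenizer.encode("lowest").tokens)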

Let's discuss the Byte-Pair Encoding (BPE) algorithm in detail.

Byte Pair Encoding (BPE) is a subword tokenization technique used to segment words into subword units based on their frequency in a given text corpus. The basic idea of BPE is to iteratively merge the most common pairs of characters or character sequences to create a vocabulary of subword units.

Step 1: Initialize Vocabulary
Let’s say we have a small text corpus containing the following words:
[“low”, “lower”, “newest”, “wider”, “widest”]

We start by initializing the vocabulary with individual characters:

Vocabulary: {‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘i’, ‘d’, ‘t’}

Step 2: Merge Frequent Pairs
In each iteration, we find the most frequent pair of characters or character sequences in the corpus that can be merged. We then add the merged unit to the vocabulary.

In our tiny corpus several pairs tie in frequency, and ties are broken arbitrarily. Let’s merge ‘e’ and ‘s’, which appears in “newest” and “widest”:

Merged unit: ‘es’

Update the vocabulary:

Vocabulary: {‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘i’, ‘d’, ‘t’, ‘es’}

Step 3: Repeat Merging
We repeat the merging process iteratively. The next most frequent pair is ‘es’ followed by ‘t’ (again from “newest” and “widest”):

Merged unit: ‘est’

Update the vocabulary:

Vocabulary: {‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘i’, ‘d’, ‘t’, ‘es’, ‘est’}

Continuing this process, we merge ‘l’ and ‘o’ (from “low” and “lower”), and then ‘lo’ and ‘w’:

Merged units: ‘lo’, then ‘low’

Update the vocabulary:

Vocabulary: {‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘i’, ‘d’, ‘t’, ‘es’, ‘est’, ‘lo’, ‘low’}

Step 4: Final Vocabulary

We continue these iterations for a predefined number of merge steps or until the vocabulary reaches a certain size. In this example, let’s stop here. The final vocabulary contains the following subword units:

Final Vocabulary: {‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘i’, ‘d’, ‘t’, ‘es’, ‘est’, ‘lo’, ‘low’}

Now we can tokenize words by applying the learned merges in order. For example:

Tokenization of “newest” : [‘n’, ‘e’, ‘w’, ‘est’]
Tokenization of “lowest” (a word never seen in the corpus) : [‘low’, ‘est’]

The subword vocabulary can then be used for various NLP tasks, allowing models to handle unseen or rare words efficiently.
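
To make the procedure concrete, here is a minimal BPE sketch in plain Python. The word frequencies are assumed for illustration so that the “most frequent pair” is well defined; with uniform counts, ties would be broken arbitrarily and the merge order could differ.

import re
from collections import Counter

# Toy corpus from the walkthrough; the word frequencies are illustrative assumptions
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e r": 3, "w i d e s t": 3}

def get_pair_counts(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    # Replace every occurrence of the pair with its merged symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):  # learn four merge rules, as in the walkthrough above
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(best, corpus)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]

# Segment a new word by applying the learned merges in order
word = " ".join("lowest")
for pair in merges:
    word = next(iter(merge_pair(pair, {word: 1})))
print(word.split())  # ['low', 'est']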

Limitations of Subword Tokenization:

Increased Vocabulary Size: Subword tokenization can lead to larger vocabularies compared to word-level tokenization, as the vocabulary needs to include subword units as well as complete words. This can impact memory and computation requirements.

Processing Time: Tokenizing text into subword units can be more time-consuming compared to simple word tokenization due to the need to analyze character-level patterns and generate subword units.

Sentence Tokenization: sentence tokenization, also known as sentence segmentation or sentence boundary detection, is the process of breaking a continuous stream of text into individual sentences. In NLP, sentence tokenization is an important preprocessing step because many language analysis tasks, such as sentiment analysis, machine translation, and text summarization, operate at the sentence level.

Using split() method

Sentence = "Generative artificial intelligence is artificial intelligence capable of generating text, images, or other media. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics."
Sentences_token=Sentence.split('. ')
Sentences_token

Using RegEx method

import re
Sentences_token_Re = re.compile('[.!?] ').split(Sentence)
Sentences_token_Re

Using NLTK library

from nltk.tokenize import sent_tokenize
Sentences_token_nltk=sent_tokenize(Sentence)
Sentences_token_nltk

Using spaCy library

from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')  # spaCy v3 API; in v2: nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(Sentence)
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
sents_list

Output (from NLTK or spaCy; the split()- and regex-based approaches drop the punctuation they split on):
['Generative artificial intelligence is artificial intelligence capable of generating text, images, or other media.',
'Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.']

Limitations of Sentence Tokenization:

Ellipses and Dots: Ellipses (…) or other instances of consecutive dots can be used to indicate pauses or incomplete thoughts within a sentence. Sentence tokenization might incorrectly split these into separate sentences.

URLs and Emails: URLs, email addresses, or other web-related content might contain periods or punctuation that can be incorrectly interpreted as sentence boundaries.
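
For instance, the naive regex splitter shown above mangles an ellipsis (a small illustration with a made-up sample sentence):

import re
sample = "Wait... what happened? See https://example.com for details."
print(re.compile('[.!?] ').split(sample))
# ['Wait..', 'what happened', 'See https://example.com for details.']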

Morphological Tokenization: a text processing technique that focuses on breaking words down into their smallest meaningful units, called “morphemes”. A “morpheme” is the smallest unit of language that carries meaning or grammatical function.

For example, the word “unhappiness”:
Morphemes: “un-” (prefix), “happy” (root), “-ness” (suffix)
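
As a rough illustration only, here is a toy affix-stripping sketch. The prefix and suffix lists are invented for this example, and real morphological analyzers rely on curated lexicons and also handle spelling changes such as “happi” → “happy”:

PREFIXES = ["un", "re", "dis"]      # illustrative, not exhaustive
SUFFIXES = ["ness", "ment", "ing"]  # illustrative, not exhaustive

def toy_morpheme_split(word):
    # Strip at most one known prefix and one known suffix; keep the rest as the stem
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "-" + s
            word = word[:-len(s)]
            break
    morphemes.append(word)
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(toy_morpheme_split("unhappiness"))  # ['un-', 'happi', '-ness']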

Limitations of Morphological Tokenization:

Morphological analyzers often rely on linguistic resources or dictionaries, and if a word or morpheme is not present in those resources, it might be incorrectly tokenized or not recognized at all.

Determining precise morpheme boundaries, especially in languages with agglutinative or fusional morphology, can be difficult due to the fusion of morphemes.

Some words might contain morphemes that can serve multiple functions (e.g., as a prefix or suffix), leading to difficulties in disambiguating tokenization.

  1. See that 👏 icon? Send my article some claps.
  2. Connect with me via LinkedIn, GitHub, and Medium 👈, and buy me a coffee if you like this blog.
