Language Processing Pipeline in Spacy

Brief Overview of Spacy

The essential goal of natural language processing is to enable people to communicate with computers in a natural way to perform tasks that require intelligence. With this goal, we want the computer to be able to understand and process natural language intelligently. In the case of text, which is the kind of data we have been concentrating on, this task includes an ability to relate sentences in a coherent way and to interpret the subtle nuances of human language. This interpretation calls for different kinds of reasoning, such as knowledge-based, probabilistic, or rational-consequence methods.

No doubt, the most common and most meaningful type of data is human language, and so human-computer interaction is the "mother of all computer applications" - an area whose scientific and technological foundations continue to grow by leaps and bounds. Natural language processing has of late made much progress with mathematical, statistical, and computational methods. That progress has thrown open the doors to many applications, some major, some more modest. Looking behind us, we might wonder what self-respecting language application could bear to go without at least some form of clever computer support. Yet, looking ahead, we are just beginning to explore the wonders of natural language processing.

Whether to use NLTK or Spacy?

Have you ever wondered whether to use NLTK or Spacy? Interestingly, in the data science world Spacy has become one of the most popular libraries for natural language processing. It is very fast and makes it straightforward to go from a few lines of code to a working model. Spacy is built as a robust, production-oriented NLP library: it supports more than one language, ships trained deep learning pipelines, can run them on the GPU, and is widely used for real-world NLP tasks. Here we touch upon the architecture of Spacy and see how it is used in various applications.

How it works you may wonder?

Well, Spacy is built around the idea of a text object: a portion of text is passed in, and an object comes back that can be processed further. This text object is created by calling an nlp object, which is Spacy's language object and the entry point for parsing text. Calling the nlp object on a string runs the language processing pipeline and returns a document (doc) object, whose information can then be processed and used further.

It did not take me long to realize that the doc object provided by Spacy is meant for further processing through the various pipeline components that ship with the library itself. I would advise one to first look into the language processing pipeline offered by Spacy, followed by some of the other attributes associated with it.
Secondly, as a sanity check, confirm the type of the nlp object, which is Spacy's language object. After that, one can inspect the type of doc, which is an instance of the class spacy.tokens.doc.Doc.
The nlp object is what turns raw text into a doc. The simplest example of this is set out below:

doc = nlp("She ate the pizza while Sienna braided her hair. Ram drank the water while Sanya made a beautiful card")
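
As a minimal sketch of those sanity checks (assuming the en_core_web_sm model has been downloaded), one might inspect the types and the pipeline components like this:

import spacy

# Load a small English pipeline; this assumes en_core_web_sm is installed
nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza while Sienna braided her hair.")

# Sanity checks: nlp is the language object, doc is a Doc instance
print(type(nlp))       # <class 'spacy.lang.en.English'>
print(type(doc))       # <class 'spacy.tokens.doc.Doc'>

# The components that make up the language processing pipeline
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']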

Up until now, I have been explaining some of the basics of natural language processing and the necessity of deep learning techniques for extracting and utilizing various natural language features. I have also discussed some of the basic model architectures used, along with libraries that ship pre-trained models and are known for their natural language processing utility functions; in this case, Spacy is the one I have mostly touched upon.

What is Tokenization you may ask?

Another topic that caught my attention under natural language processing was tokenization. Based on my understanding, tokenization is the process of breaking a text down into words, phrases, symbols, or other meaningful elements, which are called tokens; the body of text being tokenized is often referred to as a corpus.

I think it is fair to say that throwing a piece of text or a set of words into this black box and having all the word occurrences spit out in front of you is what it means to be tokenized. There are various tokenization processes, such as sub-word tokenization, SentencePiece tokenization, byte pair encoding, WordPiece tokenization, and character tokenization. Each process is slightly different, but the basic unit is the same: they all produce words or sub-words. Tokenization matters for analysis and machine learning algorithms, as it affects the counts, and therefore the probabilities, of word occurrences. If the tokenization is flawed, the models built on top of it will be flawed, because they may not handle the badly split cases. For all of these models to learn and perform their tasks well, we need to do tokenization properly.

Tokenization is the process by which a large quantity of text is divided into smaller chunks called tokens. The input to tokenization is a document or a sentence, and the output is a set of tokens. Each token is a smaller element of the text, such as a word, symbol, or number. Tokenization can also be described as splitting a string of text into its constituent elements. In the context of natural language processing, the tokens are usually the words of the language as they naturally occur. However, this is not applicable only to natural language processing, which was something that actually came as a surprise to me. Indeed, the concept of tokenization also applies to parsing tokens in source code. Both share the same basic meaning: we seek to extract all the basic elements that make up the text, whether or not they happen to be words.
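
As an aside, here is a minimal sketch of that same idea applied to code rather than natural language, using Python's built-in tokenize module (the tiny input string is made up purely for illustration):

import io
import tokenize

# Tokenize a small piece of Python source code into its lexical elements
source = "x = 1 + 2"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))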

Why is Tokenization Important Though?

This is a step that forms part of the Spacy pipeline, but why is it important? By breaking a chunk of text into sensible pieces, tokenization helps increase the pace of the text analysis that follows. The later steps of text analysis become harder as the size of the text grows, so breaking the text into tokens both gives the text structure and opens up the opportunity for quicker analysis.

Like for instance:

Input:
"Tokenization is an important NLP task."

Output:
["Tokenization", "is", "an", "important", "NLP", "task", "."]

Secondly, the relations among words conveyed by grammar are often the key to the meaning of a passage. For example, consider these two sentences: "The best way to commit a violent act is with a knife," and "The best way to commit a violent act is with a prayer." In the first sentence, the word "with" connects "knife" to the violent act, suggesting an instrument or means. In the second sentence, "with" connects "prayer" to the violent act, suggesting a method or manner. The use of grammar, and therefore of the relations between words, is key to understanding these sentences. With regard to text tokenization, the process yields the individual words of the text, which in turn help us analyze its sentence structure.

First, tokenization brings the important element of structure into a text. In a manner of speaking, it reduces the breadth and randomness of an entire passage or novel into predictable patterns of people saying things to each other. Language can thus be studied more scientifically when reduced to tokens. Tokenization of text therefore acts as a cleansing or structuring step and opens the way for computational techniques.
It further came to my understanding that tokenization, or the process of dividing a large piece of text into smaller units, is an important step in a number of natural language processing tasks. These tasks include chatbots, machine translation, language parsing, and named entity recognition.

Definition and Importance of Parts Of Speech

Has it ever piqued your curiosity what POS stands for, or what the parts of speech you learned in class have to do with natural language processing? Well, text corpora come annotated with part-of-speech (POS) tags; a widely used example is the well-known "Penn Treebank" POS-annotated corpus. An annotated corpus, also known as a treebank or parsed corpus, includes lexical and other information such as part-of-speech tags as well as syntactic structure, for instance the syntax parse, which is typically produced at the sentence level. Such corpora are extensively analyzed for the varied, new, and unique knowledge that can be gained from them, such as novel linguistic insights and the development of new language resources and tools.

So, going back to the topic: what is part-of-speech tagging, also called grammatical tagging? Well, it is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, for instance its relationship with adjacent and related words in a phrase, sentence, or paragraph. As I mentioned above, this may take you back to the classroom and make you wonder how computers make use of these parts of speech; to my surprise, a simplified form of this is commonly taught to school-aged children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

For example:

Input:

doc = nlp("Captain South Africa ate 100$ of samosa. Then he said I can do this all day")

for token in doc:
    print(token, "|", token.pos_, "|", token.lemma_)

Output:

Captain  |  PROPN  |  Captain
South  |  PROPN  |  South
Africa  |  PROPN  |  Africa
ate  |  VERB  |  eat
100  |  NUM  |  100
$  |  NUM  |  $
of  |  ADP  |  of
samosa  |  PROPN  |  samosa
.  |  PUNCT  |  .
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
this  |  PRON  |  this
all  |  DET  |  all
day  |  NOUN  |  day

Types of POS Tagging

There are various types of POS tagging, and I will explain just a few that I understand. Let us start with stochastic tagging systems. Stochastic tagging involves assigning a probability to each tag, which often depends on the identity of the word being tagged and the identities of the preceding tags. What I learned was that stochastic tagging systems derive their probabilities from a corpus or a treebank, using statistics of words, parts of speech, and their morphological variants across the whole corpus. Such a system calculates, for example, the conditional probability of a tag given the word, and of a tag given the preceding tags, as in a hidden Markov model. From these numbers, the most likely tag can be selected. These systems require more memory and time to calculate the probabilities.
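
As a toy sketch of just the word-level part of this idea (the tiny hand-tagged corpus below is made up for illustration, and the tag-sequence probabilities of a real hidden Markov model are left out):

from collections import Counter, defaultdict

# Tiny hand-tagged corpus, made up purely for illustration
tagged = [("the", "DET"), ("sun", "NOUN"), ("rises", "VERB"),
          ("the", "DET"), ("run", "NOUN"), ("run", "VERB"), ("run", "VERB")]

# Count how often each word appears with each tag
counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1

def most_likely_tag(word):
    # Estimate P(tag | word) by relative frequency and pick the most likely tag
    tags = counts[word]
    return tags.most_common(1)[0][0] if tags else None

print(most_likely_tag("run"))  # VERB, since 'run' was tagged VERB in 2 of its 3 occurrences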

The second one is rule-based POS tagging. Rule-based POS tagging systems contain a lot of hand-written rules, and they are domain dependent. This type of system is built on morphology (crazy, I know). What I mean by that is that it uses the surface characteristics of a word, for example string matching, string comparison, and analogy, together with syntax rules of the kind you would see in a sentence diagram.

The last one is lexical-based POS tagging. These tagging systems use knowledge about the word directly, meaning that they tag the word mainly based on what is known about that word. The default tag may be the word's most common tag, the tag with the longest dictionary definition, and so on. Such a system uses a dictionary to look up the tags, as in the sketch below.
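
A toy sketch of that dictionary-lookup idea (not Spacy's implementation; the lexicon and the default tag here are made up purely for illustration):

# Toy lexicon-based tagger: look each word up in a small hand-made dictionary
# and fall back to a default tag (here NOUN) when the word is unknown.
LEXICON = {
    "the": "DET",
    "sun": "NOUN",
    "rises": "VERB",
    "every": "DET",
    "morning": "NOUN",
}

def lexical_tag(words, default_tag="NOUN"):
    return [(word, LEXICON.get(word.lower(), default_tag)) for word in words]

print(lexical_tag("The sun rises every morning".split()))
# [('The', 'DET'), ('sun', 'NOUN'), ('rises', 'VERB'), ('every', 'DET'), ('morning', 'NOUN')]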

We could spend the whole day going through the various types, as there are many part-of-speech tagging techniques in NLP.

Here is a list of the rest below:

  • Maximum Entropy tagging.

  • Hidden Markov Models-based POS tagging systems.

  • Brill tagger.

  • N-gram based POS tagging systems.

  • Transformation-Based POS tagging system.

Definition and Importance of Dependency Parsing:

I won’t lie, these were the hardest for me to understand at first, but once they clicked it was not hard to comprehend, or to explain to anyone who asked, what dependency parsing is. In dependency parsing, as the name suggests, we analyze and thus understand the dependencies between the words in a specific sentence. Each sentence is seen as a directed graph where the words of the sentence are nodes and the relations between dependent words are arcs. We identify the words as nodes and the relationships, or the way they interact, as the arcs. The main concepts are the head and the dependent: the head is the word the dependent relies on. To understand this with an example, take the sentence 'The sun rises every morning'; a dependency parser can help us see that "sun" and "rises" are in a relationship, because "rises" is the verb and "sun" is the subject of that action.
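
A minimal sketch of how to surface those head-dependent relations in Spacy (again assuming en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The sun rises every morning")

# For each token, show the word, the dependency label, and its syntactic head
for token in doc:
    print(token.text, "<--", token.dep_, "--", token.head.text)
# e.g. "sun <-- nsubj -- rises": "sun" depends on the verb "rises" as its subject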

On the other hand, text analysis has naturally become one of the major fields where machine learning and AI have shone. It has helped in many linguistic tasks, such as part-of-speech (POS) tagging, named entity recognition, named entity linking, and others. The good part is that it is a real boon for researchers and developers: there are tools and packages which, with a few lines of code, will give them results and help them understand text data better. One of those tools is Textacy, a Python library that includes many ways to work with various text tasks, such as generating topics, providing datasets, extracting specific information, and other general work.

Types of Dependency Parsing

There are various types we can actually work with, and parsing can be looked at in two ways: firstly constituency parsing, and secondly dependency parsing. Constituency parsing, you may ask? It is concerned with identifying the constituents of a sentence or a document. Examples of such constituents would be: sentence -> noun phrase + verb phrase; noun phrase -> DT JJ NN. Dependency parsing, on the other hand, is focused on identifying how the words in a sentence or a document are related to each other. Generally, the relationship identified is of the type "governing word" -> "dependent word". For instance, in the sentence "The quick brown fox jumps over the lazy dog.", 'jumps' is the governing word: 'jumps' -> 'fox'; 'jumps' -> 'over'; 'fox' -> 'The'; 'fox' -> 'quick'; 'fox' -> 'brown'; and so on. So, in other words, we can say that dependency parsing is about parsing the relationships between the word tokens in a sentence.

There was a previous article I read where this lady talked in detail about ways to understand the grammar of any language. She looked at how to understand the structure of any language with the help of language trees. But as I continued reading, it came to my understanding that such trees aren't available for all languages. As of today, these language trees are available only for a few of the common languages. This calls for another way to understand the grammar of a language when its language tree isn't available. In such cases, it came to my understanding that we can make use of dependency parsing. So below I used a code snippet to show a practical example of how dependency parsing is used in Python:

Well, first things first, let me try and explain the code below. The token.dep_ attribute takes each individual token of the parsed sentence and returns the corresponding type of dependency. These values can be further detailed by calling the spacy.explain() function in the spaCy library, which provides in-depth information on each identified dependency type and what it denotes in the concrete section of the sentence. Going further down, the rest of this code simply organizes and prints these details for the example text input.

The following code snippets go through the basic steps for dependency parsing tasks executed with the spaCy library. They assume the library has been installed with !pip install spacy in a Jupyter notebook, and that a model which includes the parser has been installed with a command like !python -m spacy download en_core_web_sm. From there, the library can be used to pre-process text documents. The sample text used in the following code snippets is: "The teacher gave a difficult assignment to the students."

import spacy  # importing the spacy library

nlp = spacy.load("en_core_web_sm")
doc = nlp("The teacher gave a difficult assignment to the students.")  # Processing the string
for token in doc:  # Visualizing the result
    print(token.text, "\t", token.dep_, "\t", spacy.explain(token.dep_))
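
As an optional follow-up (not part of the original snippet), spaCy also ships the displacy visualizer, which can draw the same dependency arcs; in a Jupyter notebook a sketch would look like this:

from spacy import displacy

# Render the dependency arcs for the already-parsed doc inside the notebook
displacy.render(doc, style="dep", jupyter=True)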

Another example is similar, except that here you want spaCy to explain the type of entity to you: the output explains whether the object is, say, a person or an organization, which is what is known as named entity recognition. The text below goes "Tesla Inc va racheter Twitter pour $45 milliards de dollars" (French for "Tesla Inc is going to buy Twitter for $45 billion dollars"):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

How I would conclude you may ask?

Indeed, resources and imagination combined have revitalized thinking about language and writing. One of my friends, when interviewed by a prominent company's search engine division, said: "The problem is that the web is not big enough." Given that English is evolving faster now than at any time in its history, this seems a nonsensical comment. Even so, it is the case that 'normative' English, as it used to be spoken or written, survives as a distinct collection of publications. For all the time that I have been talking about language and writing, or 'computational linguistics/philology', people have looked askance at one conclusion or another. Some have not even been polite, either to the study or to its principal proponent. It therefore seems only fair that I pop some of the balloons that give rise to these unfounded allegations.

Here we are in the concluding section of the concluding chapter of a very long overview of natural language processing (with apologies to A.A. Milne). What then can we observe, assert, and simply hope for in the future of natural language processing (NLP)? One of our observations is that the future of NLP is perpetual and self-reinforcing. That is, the more complex analyses that can be exploited will open up a lively new agenda of empirical (i.e., experimental) writing. This, of course, demands resources. Fortunately, the availability of really useful large text corpora (especially for grammar induction, summarizing, and question-answering) will make analysis and speculation that much more straightforward.

