NLP Fundamentals in Python
Working with natural language involves being able to extract, clean, and break down text into a list of relevant words to be processed.
Below is a list of must-know libraries in Python for that purpose.
Regular expressions
Regular expressions are a classic way to extract character classes or patterns from text, using a well-known metalanguage that describes which symbols to match and/or replace.
This can be achieved with Python's built-in “re” library.
import re
re.findall(r'\sa[a-z]*', 'carlos arthur jack amanda') # ==> [' arthur', ' amanda']
m = re.search(r'\sa[a-z]*', 'carlos arthur') # ==> <re.Match object; span=(6, 13), match=' arthur'>
re.match(r'\sa[a-z]*', 'carlos arthur jack amanda') # ==> None (re.match only matches at the start of the string)
re.split(r'[0-9]+', 'carlos00arthur11jack') # ==> ['carlos', 'arthur', 'jack']
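The snippet above covers matching; for the replacement side mentioned earlier, a minimal sketch with re.sub (the pattern and input string here are illustrative, not from the original examples):

import re

re.sub(r'[0-9]+', ' ', 'carlos00arthur11jack') # ==> 'carlos arthur jack' (each run of digits replaced by a space)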
Tokenization
Tokenization is the process of breaking down a body of text into sentences and then sentences into words.
The “nltk” library specializes in just that.
import nltk # Python for NLP tokenization
from nltk.tokenize import TweetTokenizer

nltk.download('punkt_tab')

sentence = "Hello everyone! \nHow are you?\n Tomorrow."

nltk.word_tokenize(sentence) # ==> ['Hello', 'everyone', '!', 'How', 'are', 'you', '?', 'Tomorrow', '.']
nltk.sent_tokenize(sentence) # ==> ['Hello everyone!', 'How are you?', 'Tomorrow.']

tweet = '@aminbaybon please dont stop'
tokenizer = TweetTokenizer() # tweet tokenizer
tokenizer.tokenize(tweet) # ==> ['@aminbaybon', 'please', 'dont', 'stop']
Word count with bag of words
After extracting words from sentences, word frequencies can be counted with the “Counter” class from the collections module.
from collections import Counter
word_list = ['you', 'are', 'not', 'here', 'and', 'you', 'are', 'my', 'friend']

word_freq = Counter(word_list) # --- produces ---> a dict-like Counter mapping word -> frequency
word_freq.most_common(10) # ----> 10 most common tokens as (word, frequency) tuples
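Combining the tokenizer from the previous section with Counter gives a quick bag of words over raw text; a minimal sketch (the sample sentence is made up, and the punkt download from the tokenization section is assumed):

from collections import Counter
import nltk

text = "You are not here and you are my friend"
tokens = [t.lower() for t in nltk.word_tokenize(text)] # lowercase so 'You' and 'you' are counted together
Counter(tokens).most_common(2) # ==> [('you', 2), ('are', 2)]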
Word Pre-processing
As part of preprocessing a body of text, words sometimes need to be converted to a common root.
This avoids variability such as in “play” and “plays”, or in “player” and “players”.
NLTK offers support for this type of processing as well.
from nltk.stem import WordNetLemmatizer # -> convert words to their common root

nltk.download('wordnet') ### --> initializing nltk with the WordNet vocabulary database

wnl = WordNetLemmatizer()
wnl.lemmatize('play') # --> play
wnl.lemmatize('plays') # --> play
wnl.lemmatize('player') # --> player
wnl.lemmatize('players') # --> player
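Stemming is a related, more aggressive way to reduce words to a root form; it is not part of the example above, but a minimal sketch with NLTK's PorterStemmer on the same illustrative words would look like this:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
[stemmer.stem(w) for w in ['play', 'plays', 'playing', 'player', 'players']]
# typically yields ['play', 'play', 'play', 'player', 'player']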
gensim's Dictionary class offers another way to build a bag-of-words representation:

from gensim.corpora.dictionary import Dictionary # - maps each word to an integer id

dictionary = Dictionary(documents) # documents is a list of tokenized documents (lists of words)
for document in documents:
    bag_of_words = dictionary.doc2bow(document) # list of tuples (key, value) where key is a word id and value is its frequency in the document
TF-IDF
TF-IDF stands for term frequency - inverse document frequency, and it is a way to reduce the importance of words that are too common across documents while increasing the importance of unique or rare ones.
TF-IDF can be described with the following formula:
Given term i and document j, tf-idf is given by:
tfidf_i_j = (% of term i in doc j) * log(number_of_docs / number_of_docs_with_term_i)
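As a sanity check, the formula can be applied by hand; a minimal sketch with two made-up, tokenized documents (values are illustrative only):

import math

docs = [['you', 'are', 'my', 'friend'], ['you', 'are', 'not', 'here']]
term, doc = 'friend', docs[0]

tf = doc.count(term) / len(doc) # proportion of term in the document: 1 / 4
n_docs_with_term = sum(term in d for d in docs) # 'friend' appears in 1 of the 2 documents
tfidf = tf * math.log(len(docs) / n_docs_with_term) # 0.25 * log(2) ≈ 0.173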
Below is an example of TF-IDF processing with the gensim library:

from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(corpus)
word_weights = tfidf[doc] # list of tuples (word id, tfidf weight)
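Since corpus and doc are not defined above, here is a self-contained sketch tying the Dictionary and TfidfModel snippets together with a made-up, already-tokenized corpus:

from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel

documents = [['you', 'are', 'my', 'friend'],
             ['you', 'are', 'not', 'here'],
             ['my', 'friend', 'is', 'not', 'here']]

dictionary = Dictionary(documents) # maps each word to an integer id
corpus = [dictionary.doc2bow(doc) for doc in documents] # list of (word id, count) pairs per document

tfidf = TfidfModel(corpus)
for doc in corpus:
    print(tfidf[doc]) # (word id, tf-idf weight) pairs for each document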
Named Entity Recognition
Named entity recognition is the task of identifying important named entities such as people's names, states, organizations, places, etc.
Installation requires the Java Stanford NLP tools and environment variable configuration.
NLTK example
import nltk
# assumes token_sentences is a list of tokenized sentences; the NLTK tagger and chunker models may need to be downloaded first
pos_sentences = [nltk.tag.pos_tag(sentence) for sentence in token_sentences] # tag each token with its part of speech
chunked_sentences = nltk.chunk.ne_chunk_sents(pos_sentences, binary=True) # chunk the POS-tagged sentences into named entities
for sentence in chunked_sentences:
    for chunk in sentence:
        if hasattr(chunk, "label") and chunk.label() == "NE": # check for label method and value "NE"
            print(chunk)
SpaCy example
import spacy # spaCy NLP library

nlp = spacy.load('en', tagger=False, parser=False, matcher=False) # load library vocabulary ('en_core_web_sm' in newer spaCy versions)
doc = nlp(article) # process the article text
for ent in doc.ents: # iterate over the recognized entities in the text
    print(ent.label_, ent.text) # print the corresponding category (person, location, etc.) and the entity text
PS: each word is mapped to a pre-trained embedding from the model's vocabulary.
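As a self-contained variant (newer spaCy releases load a named model such as en_core_web_sm instead of 'en'; the sample sentence is taken from spaCy's documentation and the model must be downloaded first):

import spacy # assumes: python -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.label_, ent.text) # e.g. ORG Apple, GPE U.K., MONEY $1 billion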
Word Processing with sklearn
Sklearn offers support for processing words and building classifiers.
Word vectorizers offer different options to process and score the words identified in a text.
Below are two code examples to illustrate each option.
Count Vectorizer
This is the simplest vectorizer: each word is identified and its frequency is counted.
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train) # fit on the training documents and build the document-term count matrix
count_vectorizer.get_feature_names() # list of feature names (words); get_feature_names_out() in newer scikit-learn versions
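To make the document-term matrix concrete, a minimal sketch on two made-up documents (stop words are kept here so every word appears):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["you are my friend", "you are not here"]
cv = CountVectorizer()
X = cv.fit_transform(docs)
print(cv.get_feature_names_out()) # ['are' 'friend' 'here' 'my' 'not' 'you'] (newer scikit-learn API)
print(X.toarray()) # [[1 1 0 1 0 1]
                   #  [1 0 1 0 1 1]]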
TF-IDF Vectorizer
With TF-IDF (term frequency - inverse document frequency), the score depends not only on a word's frequency but also on how rare the word is across documents.
See the section above for TF-IDF details.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
tfidf_vectorizer.get_feature_names() # ---> list of feature names
Text Classification with Multinomial Naive Bayes
A text classifier can be trained and evaluated with sklearn in a few lines of code.
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb_classifier = MultinomialNB()

nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
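For context, here is a minimal end-to-end sketch tying the vectorizer and classifier together; the texts, labels, and split are made up for illustration and are not from the original example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

texts = ["aliens landed in my backyard last night",
         "the city council approved the new budget",
         "miracle pill cures everything in one day",
         "local team wins the championship game",
         "celebrity secretly replaced by a robot",
         "new library opens downtown next month"]
labels = ['FAKE', 'REAL', 'FAKE', 'REAL', 'FAKE', 'REAL']

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=0)

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_train = tfidf_vectorizer.fit_transform(X_train) # learn vocabulary and weights on the training texts
tfidf_test = tfidf_vectorizer.transform(X_test) # reuse the same vocabulary for the test texts

nb_classifier = MultinomialNB()
nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)
print(metrics.accuracy_score(y_test, pred))
print(metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL']))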
Conclusion
There are several libraries in Python for tokenizing text, processing the extracted words, counting word frequencies, and ranking word importance with a mechanism such as “tf-idf”.
This variety of libraries, combined with their simplicity, makes Python the preferred programming language for this type of work.