Keyword searching using vectorizers in Python.

Have you ever wondered how search engines like Google handle the keywords you type into their search box? Even if you use incorrect grammar or make spelling mistakes, they manage to give you the right results.

The process of selecting web pages whose content is similar to the search query involves machine learning, and it is much more complex than basic string matching, since most of the time we don’t type the exact text as it appears on a web page.

In this post I will teach you the basics of keyword matching using machine learning in Python. Keep in mind that this project is only a basic introduction; the actual technology search engines use for this is much more complex.

Setting up the environment.

Download the Miniconda package for Python and install the numpy, scipy, scikit-learn and nltk libraries with the command:

conda install numpy scipy scikit-learn nltk
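To confirm that everything installed correctly, you can run a quick sanity check that imports each package and prints its version (this check is just for convenience and is not part of the project itself):

import numpy, scipy, sklearn, nltk
print(numpy.__version__, scipy.__version__, sklearn.__version__, nltk.__version__)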

Now you need to download a sample dataset called toy, which contains some text files to search through. Here is a link for it: Download.

Extract the folder and save it in the same directory where you will keep your Python file. The folder contains 5 txt files with some text in each. We will input a query string and then, out of the 5 text files, find the one that best matches it. To do this, we will first store the contents of all the files in a list and vectorize them. Then we will vectorize the query string and compute its distance to each file's vector. The file with the minimum distance will be the best possible match to the query. Here is the code:

import os, sys, nltk.stem
import scipy as sp
import scipy.linalg  # imported explicitly so that sp.linalg is guaranteed to be available.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# words which are to be ignored during analysis.
stopwords = ['a', 'about', 'above', 'across', 'after', 'afterwards', 'again',
             'against', 'all', 'almost', 'alone', 'along', 'already', 'also',
             'although', 'always', 'am', 'among', 'amongst', 'amoungst']
# read the content of each text file into a list element; sorting keeps list
# indices aligned with the numbered file names (01.txt, 02.txt, ...).
posts = [open(os.path.join("toy", f)).read() for f in sorted(os.listdir("toy"))]
vectorizer = CountVectorizer(min_df=1, stop_words=stopwords)
X_train = vectorizer.fit_transform(posts)  # build vectors from the posts according to the frequency of each word in each post.
print("vectorized data: ", X_train)
num_samples, num_features = X_train.shape  # no. of text files, no. of distinct words.
new_post = "imaging databases" # query string

new_post_vec = vectorizer.transform([new_post])
# each line of the sparse output has the form (x, y)  c: the word with feature
# index y occurs c times in input document x. transform() received a single
# document (the query), so x is always 0 here.
print("vector for new_post string: \n", new_post_vec)


# compute the distance between two normalised vectors.
def dist_raw(v1, v2):
    v1_normalised = v1 / sp.linalg.norm(v1.toarray())  # vector divided by its magnitude.
    v2_normalised = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalised - v2_normalised
    return sp.linalg.norm(delta.toarray())


best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(num_samples):
    if posts[i] == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    if d < best_dist:
        best_dist = d
        best_i = i
print("for count vectorizer, best post is %i.txt with dist=%.2f" % (best_i+1, best_dist))

We see that, for the count vectorizer, the closest matching post is 04.txt with dist=0.77.
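If you are curious about what the vectorizer actually learned, you can inspect its vocabulary and the dense form of the query vector. This snippet continues from the code above; note that get_feature_names_out() is the method name in recent scikit-learn versions, while older versions call it get_feature_names():

print(vectorizer.get_feature_names_out())  # the distinct words (features) found across all posts.
print(new_post_vec.toarray())  # dense counts of each feature in the query string.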

Now let us improve our vectorizer by adding stemming. Stemming shortens a word to its root form (which is not necessarily a meaningful word itself). For example, “listing” becomes “list” after stemming. Suppose we type the query “listing databases” into a search engine, or we type “list database”; in both cases we mean the same thing, and both should produce the same results. But our vectorizer above will treat them as different. So before vectorizing our strings, we need to stem the words so that equivalent forms become identical: after stemming, both “listing databases” and “list database” become “list databas”. To do this we make a new StemmedCountVectorizer class extending the original CountVectorizer class, and add stemming to the original build_analyzer() method.
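To see stemming in action before wiring it into the vectorizer, here is a minimal standalone sketch using NLTK's SnowballStemmer, the same stemmer we use below (the word list is just an illustration):

import nltk.stem

stemmer = nltk.stem.SnowballStemmer("english")
for word in ["listing", "list", "databases", "database", "imaging"]:
    print(word, "->", stemmer.stem(word))  # e.g. listing -> list, databases -> databas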

english_stemmer = nltk.stem.SnowballStemmer("english")


class StemmedCountVectorizer(CountVectorizer):
    # extends CountVectorizer so that every token is stemmed before it is counted.
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))


vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(num_samples):
    if posts[i] == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    if d < best_dist:
        best_dist = d
        best_i = i
print("for stemmed count vectorizer, best post is %i.txt with dist=%.2f" % (best_i+1, best_dist))

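Raw word counts have a drawback: words that occur in many of the documents carry little information about which document matches best, yet they dominate the counts. TF-IDF (term frequency-inverse document frequency) fixes this by scaling each word's frequency in a document by how rare the word is across the whole collection. Here is a minimal sketch of the textbook formula on a toy corpus (scikit-learn's TfidfVectorizer uses a smoothed variant and normalises the resulting vectors, so its numbers will differ):

import math

def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)  # term frequency within this document.
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)  # rarer terms get a higher weight.
    return tf * idf

corpus = [["imaging", "databases"], ["databases", "store", "data"], ["data", "analysis"]]
print(tfidf("databases", corpus[0], corpus))  # appears in 2 of 3 docs: lower weight.
print(tfidf("imaging", corpus[0], corpus))    # appears in 1 of 3 docs: higher weight.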

# trying the TF-IDF vectorizer with stemming:
class StemmedTfidfVectorizer(TfidfVectorizer):
    # extends TfidfVectorizer (term frequency-inverse document frequency) with the same stemming step.
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))


vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(num_samples):
    if posts[i] == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    if d < best_dist:
        best_dist = d
        best_i = i
print("for stemmed tfid vectorizer, best post is %i.txt with dist=%.2f" % (best_i+1, best_dist))

We see that, for the stemmed count vectorizer, the best post is 03.txt with dist=0.63.

Thus we saw how to pick the best match to a query string. Search engines work in a similar way: they vectorize the text from web pages after stemming and applying various other techniques, then vectorize the query string and find the n closest results for you to browse.
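If you want the top n matches instead of only the single best one, scikit-learn's cosine_similarity gives a compact way to rank every post against the query in one call. This is a sketch that reuses X_train and new_post_vec from the tf-idf code above:

from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(new_post_vec, X_train).ravel()  # similarity of the query to every post.
ranking = sims.argsort()[::-1]  # post indices, most similar first.
for i in ranking[:3]:  # show the top 3 matches.
    print("%02i.txt  similarity=%.2f" % (i + 1, sims[i]))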

For more source codes, programming tips and tricks, stay tuned to The Coding Bot!
