In this hands-on project, we'll use our knowledge of TFIDF to implement a search engine! Our dataset will be a set of 25,000+ Wikipedia articles. The assignment serves three primary objectives: (a) understand and apply TFIDF on a realistic task, (b) see what solving an NLP problem looks like end-to-end, and (c) understand the fundamentals of how a search engine works.
Overview
We're going to implement a search engine over the 25,000+ Wikipedia articles in our dataset.
Our search ranking function is going to be quite simple. Suppose the query is "albert einstein". Then, score(article) = TFIDF(article, "albert") + TFIDF(article, "einstein"). Our search engine will display the articles sorted by score.
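To make the ranking function concrete, here is a minimal sketch on a three-article toy corpus. Everything in it is an assumption made for illustration: the sample articles, the helper names (tokenize, tfidf, score, search), whitespace tokenization, and the particular TFIDF variant (raw term count times log(N / document frequency)). The project itself may define these differently.

```python
import math
from collections import Counter

# Toy corpus standing in for the Wikipedia articles (hypothetical data).
articles = {
    "Albert Einstein": "albert einstein was a physicist einstein developed relativity",
    "Albert Camus": "albert camus was a french writer",
    "Isaac Newton": "isaac newton was a physicist",
}

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real text needs more care.
    return text.lower().split()

# Term frequency: how often each word occurs in each article.
term_freq = {title: Counter(tokenize(body)) for title, body in articles.items()}

# Document frequency: in how many articles each word occurs at least once.
doc_freq = Counter()
for counts in term_freq.values():
    doc_freq.update(counts.keys())

def tfidf(title, term):
    # Assumed variant: raw count * log(N / df). Unknown terms contribute 0.
    df = doc_freq.get(term, 0)
    if df == 0:
        return 0.0
    return term_freq[title][term] * math.log(len(articles) / df)

def score(title, query):
    # score(article) = sum of TFIDF(article, term) over the query terms.
    return sum(tfidf(title, term) for term in tokenize(query))

def search(query):
    # Return (title, score) pairs sorted from highest to lowest score.
    ranked = sorted(articles, key=lambda title: score(title, query), reverse=True)
    return [(title, score(title, query)) for title in ranked]

for title, s in search("albert einstein"):
    print(f"{s:.3f}  {title}")
```

In this toy corpus, "einstein" appears in only one article, so it carries more weight than "albert", which appears in two. That is why the Einstein article outranks the Camus article even though both contain "albert".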
In the first half of the project, we'll compute term frequency and document frequency statistics. In the second half, we'll use these statistics to calculate the final score and rank the articles. Finally, we'll end with some ideas for taking this project further.
Let's get started!