CommonLounge Archive

Hands-on Project: Implementing a Search engine from scratch

June 28, 2018

In this hands-on project, we’ll use our knowledge of TFIDF to implement a search engine! Our dataset will be a set of 25,000+ Wikipedia articles. The assignment serves three primary objectives: (a) understand and apply TFIDF on a realistic task, (b) see what solving an NLP problem looks like end-to-end, and (c) understand the fundamentals of how a search engine works.

Overview

We’re going to implement a search engine for searching the 25,000+ Wikipedia articles in our dataset.

Our search ranking function is going to be quite simple. Suppose the query is “albert einstein”. Then, score(article) = TFIDF(article, "albert") + TFIDF(article, "einstein"). Our search engine will display the articles sorted by score.

In the first half of the project, we’ll compute term frequency and document frequency statistics. In the second half, we’ll use these statistics to calculate the final score and rank the articles. Finally, we’ll end with some ideas for taking this project further.

Let’s get started!

Step 0: Downloading the dataset and installing NLTK

First download (and unzip) the dataset from this link: Search Engine. It has two files wiki-600 (4MB) and wiki-26000 (190MB). The numbers indicate the number of Wikipedia articles included in the dataset.

Additionally, make sure you install the nltk Python library.

Step 1: Loading the dataset

Now that we have our Python libraries installed and the dataset downloaded, let’s load the dataset and take a look.

from __future__ import print_function
import codecs
import re
############################################################################
## load the dataset
text = codecs.open('./wiki-600', encoding='utf-8').read()
starts = [match.span()[0] for match in re.finditer('\n = [^=]', text)]
articles = list()
for ii, start in enumerate(starts):
    end = starts[ii+1] if ii+1 < len(starts) else len(text)
    articles.append(text[start:end])
snippets = [' '.join(article[:200].split()) for article in articles]
for snippet in snippets[:20]:
    print(snippet)
############################################################################

Some notes:

  1. We are using the wiki-600 to begin with.
  2. All articles are in one file. Article titles are formatted as follows: = Albert Einstein = . If a line has two or more = signs on both sides, it’s a subheading. The regex looks for article titles and splits the text file at each one.
  3. Then we calculate snippets, i.e. the first 200 characters for each article.
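To see how the title regex behaves, here is a minimal sketch that runs the same splitting logic on a small made-up string instead of the wiki-600 file. The toy text and its article names are assumptions for illustration only; they just follow the title and subheading convention described above.

```python
import re

# Made-up stand-in for the dataset file, using the same title convention
text = (
    "\n = Alpha = \n"
    "\n Alpha is the first article . \n"
    "\n = = History = = \n"
    "\n Some history . \n"
    "\n = Beta = \n"
    "\n Beta is the second article . \n"
)

# '\n = [^=]' needs a newline, a space, one '=', a space, then a non-'='
# character, so the subheading '= = History = =' is not treated as a title.
starts = [match.start() for match in re.finditer(r'\n = [^=]', text)]

articles = []
for ii, start in enumerate(starts):
    # each article runs from its title up to the next title (or end of text)
    end = starts[ii + 1] if ii + 1 < len(starts) else len(text)
    articles.append(text[start:end])

# collapse whitespace in the first 200 characters to get a one-line snippet
snippets = [' '.join(article[:200].split()) for article in articles]
for snippet in snippets:
    print(snippet)
```

Note that any body line that happens to begin with " = " followed by a non-"=" character is also treated as a title by this regex, which seems to be what produces the short fragments in the sample output below.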

The output of the above code looks as follows:

= Valkyria Chronicles III = Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside
= Tower Building of the Little Rock Arsenal = The Tower Building of the Little Rock Arsenal , also known as U.S. Arsenal Building , is a building located in MacArthur Park in downtown Little Roc
= Cicely Mary Barker = Cicely Mary Barker ( 28 June 1895 – 16 February 1973 ) was an English illustrator best known for a series of fantasy illustrations depicting fairies and flowers . Barker '
= Gambia women 's national football team = The Gambia women 's national football team represents the Gambia in international football competition . The team , however , has not competed in a mat
= Plain maskray = The plain maskray or brown stingray ( Neotrygon annotata ) is a species of stingray in the family Dasyatidae . It is found in shallow , soft @-@ bottomed habitats off northern
= 2011 – 12 Columbus Blue Jackets season = The 2011 – 12 Columbus Blue Jackets season was the team 's 12th season in the National Hockey League ( NHL ) . The Blue Jackets ' record of 29 – 46 – 7
= Position ; GP = Games played in ; G
= Goals ; A = Assists ; Pts
= Points ; PIM = Penalty minutes ; + / - = Plus / minus = = = Goaltenders = = = Note : GP
= Games Played ; TOI = Time On Ice ( minutes ) ; W
= Wins ; L = Losses ; OT
= Overtime Losses ; GA = Goals Against ; GAA = Goals Against Average ; SA = Shots Against ; SV
= Saves ; Sv % = Save Percentage ; SO = Shutouts † Denotes player spent time with another team before joining Blue Jackets . Stats reflect time with the Blue Jackets only . ‡ Traded mid @-@ seas
= Gregorian Tower = The Gregorian Tower ( Italian : Torre Gregoriana ) or Tower of the Winds ( Italian : Torre dei Venti ) is a round tower located above the Gallery of Maps , which connects the
= There 's Got to Be a Way = " There 's Got to Be a Way " is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as
= Nebraska Highway 88 = Nebraska Highway 88 ( N @-@ 88 ) is a highway in northwestern Nebraska . It has a western terminus at Wyoming Highway 151 ( WYO 151 ) at the Wyoming – Nebraska state line
= USS Atlanta ( 1861 ) = Atlanta was a casemate ironclad that served in the Confederate and Union Navies during the American Civil War . She was converted from a Scottish @-@ built blockade runn
= Jacqueline Fernandez = Jacqueline Fernandez ( born 11 August 1985 ) is a Sri Lankan actress , former model , and the winner of the 2006 Miss Universe Sri Lanka pageant . As Miss Universe Sri L
= John Cullen = Barry John Cullen ( born August 2 , 1964 ) is a Canadian former professional ice hockey centre who played in the National Hockey League ( NHL ) for the Pittsburgh Penguins , Hart
= SMS Erzherzog Ferdinand Max = For the ironclad present at the Battle of Lissa of the same name , see SMS Erzherzog Ferdinand Max ( 1865 ) . SMS Erzherzog Ferdinand Max ( German : " His Majes

Step 2: Calculating term frequencies

This section is the bulk of the project.

Our next step is to calculate the term frequencies. In particular, we want to create a variable term_frequency such that term_frequency[token][article_id] = number of times token appears in articles[article_id]. Note that term_frequency is a dictionary of dictionaries.

A short code template is given below, but you’ll be doing most of the implementation.

###########################################################################
## tokenize the articles, calculate term frequencies
import sys
from collections import defaultdict
term_frequency = defaultdict(dict)
def get_tokens(article):
    # TODO: return list of 'tokens' that appear in article
    return tokens
def index(id, article):
    tokens = get_tokens(article)
    # TODO: calculate term frequencies and store in term_frequency[token][id]
for ii, article in enumerate(articles):
    if ii and ii % 10 == 0: print(ii, end=', ')
    sys.stdout.flush()
    index(ii, article)
###########################################################################

Some notes. In get_tokens():

  1. Make sure you convert text to lowercase. This will allow our search to be case-insensitive. That is, we’ll be able to search for “einstein” instead of “Einstein”.
  2. If a token appears multiple times in the article, it should be listed multiple times in the returned list.
  3. It’s up to you to decide what is a token. One option is any consecutive sequence of the following characters: a-z, A-Z, 0-9 and - (hyphen). I recommend using word_tokenize from NLTK. See: nltk.org/howto/tokenize.html
  4. Make sure you remove stopwords like ‘the’, ‘if’, etc. from the list of tokens. NLTK has a list of stopwords (see the NLTK imports hint below).
  5. Make sure you stem the tokens, so that “caring” and “care” and “cares” are all mapped to the same word. I recommend using PorterStemmer from NLTK. See: Stemmers
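As a rough starting point, the sketch below fills in the template without any dependencies. It is not the NLTK-based approach recommended above: it assumes a plain regex tokenizer (runs of a-z, 0-9 and hyphen), a tiny hand-picked stopword set, and no stemming, so its counts will differ from what word_tokenize and PorterStemmer would give. The two sample sentences are made up.

```python
import re
from collections import defaultdict

# Assumption: a tiny hand-picked stopword set, standing in for NLTK's list
STOPWORDS = {'the', 'a', 'an', 'if', 'of', 'and', 'is', 'was', 'in', 'to'}

term_frequency = defaultdict(dict)

def get_tokens(article):
    # lowercase first so the search is case-insensitive
    tokens = re.findall(r'[a-z0-9-]+', article.lower())
    # drop stopwords; a token that occurs several times stays listed several times
    # (stemming is intentionally left out of this sketch)
    return [token for token in tokens if token not in STOPWORDS]

def index(id, article):
    # count each token occurrence under term_frequency[token][id]
    for token in get_tokens(article):
        counts = term_frequency[token]
        counts[id] = counts.get(id, 0) + 1

index(0, 'Einstein was a physicist . Einstein developed the theory of relativity .')
index(1, 'The theory of evolution')
print(term_frequency['einstein'])
print(term_frequency['theory'])
```

Swapping in word_tokenize, the NLTK stopword list, and a stemmer should only require changes inside get_tokens(); index() can stay as it is.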

To verify the correctness of the code, you can do the following:

print('term_frequency for "einstein"')
print(term_frequency['einstein'])
# Expected output: {300: 1, 84: 5, 294: 1}
# That is, articles[300] has token einstein 1 time, articles[84] has 
# token einstein 5 times, and articles[294] has token einstein 1 time.

Hint: NLTK imports

from nltk.tokenize import word_tokenize               # <=== tokenizer 
from nltk.stem.porter import PorterStemmer            # <=== stemmer 
from nltk.corpus import stopwords as nltk_stopwords   # <=== stopwords
STOPWORDS = set(nltk_stopwords.words('english'))

Step 3: TFIDF

We actually already have all the information to calculate TFIDF(article_id, “einstein”). Most of the required information is available in term_frequency. We’ll need snippets to know the total number of documents in the corpus (= len(snippets)), and to display the results.
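Here is a minimal sketch of how those pieces combine. The article does not spell out a TFIDF formula, so this assumes one common variant, tf * log(D / df), where df is the number of articles containing the token (which is simply the length of term_frequency[token]). The toy index, the snippets list and the helper name tfidf are made up for illustration.

```python
import math

# Toy index in the same shape as term_frequency[token][article_id]
term_frequency = {
    'einstein': {0: 5, 2: 1},
    'theory': {0: 2, 1: 1, 2: 3, 3: 1},
}
snippets = ['article 0', 'article 1', 'article 2', 'article 3']
D = len(snippets)  # total number of documents in the corpus

def tfidf(article_id, token):
    postings = term_frequency.get(token, {})
    tf = postings.get(article_id, 0)      # term frequency in this article
    if tf == 0:
        return 0.0
    df = len(postings)                    # document frequency of the token
    # assumed variant: tf * log(D / df); float() keeps Python 2 from truncating
    return tf * math.log(float(D) / df)

print(tfidf(0, 'einstein'))   # 5 * log(4 / 2)
print(tfidf(0, 'theory'))     # 'theory' is in every article, so idf is log(1) = 0
```

With this variant, a token that appears in every article contributes nothing to the score, which is the intended effect of the IDF term.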

Step 4: Saving the computation

Now that all the heavy computation is complete, let’s save the work we’ve done so far (we don’t want to lose it all and wait for things to compute again). We are only storing the variables we’ll need in the future: term_frequency and snippets.

###########################################################################
## saving and loading
import pickle
def picklesave(obj, filename):
    print('Saving .. ')
    # open() works on both Python 2 and 3; file() exists only in Python 2
    with open(filename, 'wb') as ff:
        pickle.dump(obj, ff)
    print('Done')
    return True
def pickleload(filename):
    print('Loading .. ')
    with open(filename, 'rb') as ff:
        obj = pickle.load(ff)
    print('Done')
    return obj
picklesave([snippets, term_frequency], 'data-600.pdata')
snippets, term_frequency = pickleload('data-600.pdata')
########################################################################### 

If we were building a proper search engine, we would be saving this data in a database. However, for our case, we’ll just be saving it into a single file.

Step 5: Ranking the articles for Search!

It’s time to write the final search function. Below is a template to get you started.

import math
from collections import defaultdict
D = len(snippets)
def search(query, nresults=10):
    tokens = get_tokens(query)
    scores = defaultdict(float)
    for token in tokens:
        for article, score in term_frequency[token].items():
            # TODO: scores[article] += TFIDF(article, token) 
            pass  # placeholder so the template parses; replace with your code
    return # TODO: top nresults results
def display_results(query, results):
    print('You search for: "%s"' % query)
    print('-'*100)
    for result in results:
        print(snippets[result])
    print('='*100)
display_results('obama', search('obama'))
display_results('einstein', search('einstein'))
display_results('physics', search('physics'))
display_results('india', search('india'))
display_results('director', search('director'))
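The ranking step in the template can be sketched as follows. To stay self-contained, this version takes an already tokenized query rather than calling get_tokens(), uses a made-up toy index with an assumed corpus size D = 4, and assumes the tf * log(D / df) form of TFIDF, which the article does not prescribe. Ties are broken by article id, which is also an arbitrary choice.

```python
import math
from collections import defaultdict

# Toy index in the same shape as term_frequency[token][article_id]
term_frequency = {
    'albert': {0: 3, 1: 1},
    'einstein': {0: 5, 2: 1},
}
D = 4  # assumed corpus size for this toy example

def search(query_tokens, nresults=10):
    scores = defaultdict(float)
    for token in query_tokens:
        postings = term_frequency.get(token, {})
        if not postings:
            continue  # token appears in no article
        idf = math.log(float(D) / len(postings))
        for article, tf in postings.items():
            # score(article) is the sum of TFIDF over the query tokens
            scores[article] += tf * idf
    # highest score first; ties broken by article id for a stable order
    ranked = sorted(scores, key=lambda article: (-scores[article], article))
    return ranked[:nresults]

print(search(['albert', 'einstein']))
```

Sorting every scored article is fine at this scale; for much larger result sets, heapq.nlargest from the standard library would avoid a full sort.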

If everything goes well, you should see output similar to the following:

You search for: "obama"
----------------------------------------------------------------------------------------------------
= Bob Dylan = Bob Dylan ( / ˈdɪlən / ; born Robert Allen Zimmerman , May 24 , 1941 ) is an American singer @-@ songwriter , artist and writer . He has been influential in popular music and cultu
= 2010 Haiti earthquake = The 2010 Haiti earthquake ( French : Séisme de 2010 à Haïti ; Haitian Creole : Tranblemanntè 12 janvye 2010 nan peyi Ayiti ) was a catastrophic magnitude 7 @.@ 0 Mw ear
= Berkley Bedell = Berkley Warren Bedell ( born March 5 , 1921 ) is a former U.S. Representative from Iowa . After starting a successful business in his youth , Berkley Fly Co . , he ran for the
= Rio de Janeiro bid for the 2016 Summer Olympics = The Rio de Janeiro bid for the 2016 Summer Olympics and Paralympics was a successful bid to host the Games of the XXXI Olympiad and the XV Par
= Chris Turner ( American football ) = Chris Turner ( born September 8 , 1987 ) is an American football quarterback . He played quarterback for the Maryland Terrapins at the University of Maryla
= Sholay = Sholay ( pronunciation , meaning " Embers " ) is a 1975 Indian Hindi @-@ language action @-@ adventure film directed by Ramesh Sippy and produced by his father G. P. Sippy . The film
= Mumia Abu @-@ Jamal = Mumia Abu @-@ Jamal ( born Wesley Cook April 24 , 1954 ) is a convicted murderer who was sentenced to death in 1982 for the 1981 murder of Philadelphia police officer Dan
= Cambodian Campaign = The Cambodian Campaign ( also known as the Cambodian Incursion and the Cambodian Invasion ) was a series of military operations conducted in eastern Cambodia during 1970 b
====================================================================================================
You search for: "einstein"
----------------------------------------------------------------------------------------------------
= Edward Creutz = Edward Creutz ( January 23 , 1913 – June 27 , 2009 ) was an American physicist who worked on the Manhattan Project at the Metallurgical Laboratory and the Los Alamos Laboratory
= Transit of Venus = A transit of Venus across the Sun takes place when the planet Venus passes directly between the Sun and a superior planet , becoming visible against ( and hence obscuring a
= Bob Dylan = Bob Dylan ( / ˈdɪlən / ; born Robert Allen Zimmerman , May 24 , 1941 ) is an American singer @-@ songwriter , artist and writer . He has been influential in popular music and cultu
====================================================================================================
You search for: "physics"
----------------------------------------------------------------------------------------------------
= Edward Creutz = Edward Creutz ( January 23 , 1913 – June 27 , 2009 ) was an American physicist who worked on the Manhattan Project at the Metallurgical Laboratory and the Los Alamos Laboratory
= Jane 's Attack Squadron = Jane 's Attack Squadron is a 2002 combat flight simulator developed by Looking Glass Studios and Mad Doc Software and published by Xicat Interactive . Based on World
= Frederick Reines = Frederick Reines ( RYE @-@ ness ) ; ( March 16 , 1918 – August 26 , 1998 ) was an American physicist . He was awarded the 1995 Nobel Prize in Physics for his co @-@ detectio
= Ten Commandments in Catholic theology = The Ten Commandments are a series of religious and moral imperatives that are recognized as a moral foundation in several of the Abrahamic religions , i
= Carre 's Grammar School = Carre 's Grammar School is a selective secondary school for boys in Sleaford , a market town in Lincolnshire , England . Founded on 1 September 1604 by an indenture o
= Track and field = Track and field is a sport which includes athletic contests established on the skills of running , jumping , and throwing . The name is derived from the sport 's typical venu
= Martin Keamy = First Sergeant Martin Christopher Keamy is a fictional character played by Kevin Durand in the fourth season and sixth season of the American ABC television series Lost . Keamy
= Marshall Applewhite = Marshall Herff Applewhite , Jr . ( May 17 , 1931 – March 26 , 1997 ) , also known as " Bo " and " Do " , among other names , was an American cult leader who founded what
= Bodyline = Bodyline , also known as fast leg theory bowling , was a cricketing tactic devised by the English cricket team for their 1932 – 33 Ashes tour of Australia , specifically to combat t
= Crown Fountain = Crown Fountain is an interactive work of public art and video sculpture featured in Chicago 's Millennium Park , which is located in the Loop community area . Designed by Cata
====================================================================================================
You search for: "india"
----------------------------------------------------------------------------------------------------
= Independence Day ( India ) = Independence Day , observed annually on 15 August is a national holiday in India commemorating the nation 's independence from the British Empire on 15 August 1947
= Mortimer Wheeler = Sir Robert Eric Mortimer Wheeler CH , CIE , MC , TD , FSA , FRS , FBA ( 10 September 1890 – 22 July 1976 ) was a British archaeologist and officer in the British Army . Over
= Varanasi = Varanasi ( Hindustani pronunciation : [ ʋaːˈraːɳəsi ] ) , also known as Benares , Banaras ( Banāras [ bəˈnaːrəs ] ) , or Kashi ( Kāśī [ ˈkaːʃi ] ) , is a North Indian city on the ba
= Sholay = Sholay ( pronunciation , meaning " Embers " ) is a 1975 Indian Hindi @-@ language action @-@ adventure film directed by Ramesh Sippy and produced by his father G. P. Sippy . The film
= Jacqueline Fernandez = Jacqueline Fernandez ( born 11 August 1985 ) is a Sri Lankan actress , former model , and the winner of the 2006 Miss Universe Sri Lanka pageant . As Miss Universe Sri L
= Vistara = Tata SIA Airlines Limited , operating as Vistara , is an Indian domestic airline based in Gurgaon with its hub at Delhi @-@ Indira Gandhi International Airport . The carrier , a join
= Arikamedu = Arikamedu is an archaeological site in Southern India , inKakkayanthope , Ariyankuppam Commune , Puducherry . It is 4 kilometres ( 2 @.@ 5 mi ) from the capital , Pondicherry of th
= Elephanta Caves = Elephanta caves are a network of sculpted caves located on Elephanta Island , or Gharapuri ( literally " the city of caves " ) in Mumbai Harbour , 10 kilometres ( 6 @.@ 2 mi
= Battle of Tellicherry = The Battle of Tellicherry was a naval action fought off the Indian port of Tellicherry between British and French warships on 18 November 1791 during the Third Anglo @-
= HMS Marlborough ( 1912 ) = HMS Marlborough was an Iron Duke @-@ class battleship of the British Royal Navy , named in honour of John Churchill , 1st Duke of Marlborough . She was built at Devo
====================================================================================================
You search for: "director"
----------------------------------------------------------------------------------------------------
= Mortimer Wheeler = Sir Robert Eric Mortimer Wheeler CH , CIE , MC , TD , FSA , FRS , FBA ( 10 September 1890 – 22 July 1976 ) was a British archaeologist and officer in the British Army . Over
= Laurence Olivier = Laurence Kerr Olivier , Baron Olivier , OM ( / ˈlɒrəns kɜːr ɒˈlɪvieɪ / ; 22 May 1907 – 11 July 1989 ) was an English actor who , along with his contemporaries Ralph Richards
= Paul Thomas Anderson = Paul Thomas Anderson ( born June 26 , 1970 ) also known as P.T. Anderson , is an American film director , screenwriter and producer . Interested in film @-@ making at a
= Welsh National Opera = Welsh National Opera ( WNO ) ( Welsh : Opera Cenedlaethol Cymru ) is an opera company based in Cardiff , Wales ; it gave its first performances in 1946 . It began as a m
= Magadheera = Magadheera ( English : Great Warrior ) is a 2009 Indian Telugu @-@ language romantic @-@ action film , written by K. V. Vijayendra Prasad and directed by S. S. Rajamouli . Based o
= Sholay = Sholay ( pronunciation , meaning " Embers " ) is a 1975 Indian Hindi @-@ language action @-@ adventure film directed by Ramesh Sippy and produced by his father G. P. Sippy . The film
= Gold dollar = The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from 1849 to 1889 . The coin had three types over its li
= Edward Creutz = Edward Creutz ( January 23 , 1913 – June 27 , 2009 ) was an American physicist who worked on the Manhattan Project at the Metallurgical Laboratory and the Los Alamos Laboratory
= Not Quite Hollywood : The Wild , Untold Story of Ozploitation ! = Not Quite Hollywood : The Wild , Untold Story of Ozploitation ! is a 2008 Australian documentary film about the Australian New
= Civilian Public Service = The Civilian Public Service ( CPS ) was a program of the United States government that provided conscientious objectors with an alternative to military service during
====================================================================================================

Step 6: Searching among 25,000 Wikipedia articles

Congrats on getting this far. Now, it’s time to use our large dataset (and do a lot of waiting). Run all the code so far on the wiki-26000 dataset.

Here’s the output I got on the larger dataset.

You search for: "obama"
----------------------------------------------------------------------------------------------------
= Barack Obama = Barack Hussein Obama II ( US / bəˈrɑːk huːˈseɪn oʊˈbɑːmə / ; born August 4 , 1961 ) is the 44th and current President of the United States . He is the first African American to
= First inauguration of Barack Obama = The first inauguration of Barack Obama as the 44th President of the United States took place on Tuesday , January 20 , 2009 . The inauguration , which set
= Michelle Obama = Michelle LaVaughn Robinson Obama ( born January 17 , 1964 ) is an American lawyer and writer . She is married to the 44th and current President of the United States , Barack O
= Joe Biden = Joseph Robinette " Joe " Biden Jr . ( / ˈdʒoʊsᵻf rɒbᵻˈnɛt ˈbaɪdən / ; born November 20 , 1942 ) is the 47th and current Vice President of the United States , having been jointly el
= Hillary Clinton = Hillary Diane Rodham Clinton ( / ˈhɪləri daɪˈæn ˈrɒdəm ˈklɪntən / ; born October 26 , 1947 ) is an American politician and the nominee of the Democratic Party for President o
= John McCain = John Sidney McCain III ( born August 29 , 1936 ) is the senior United States Senator from Arizona . He was the Republican presidential nominee in the 2008 United States president
= Barack Obama " Hope " poster = The Barack Obama " Hope " poster is an image of Barack Obama designed by artist Shepard Fairey , which was widely described as iconic and came to represent his 2
= Chris Lu = Christopher P. Lu ( simplified Chinese : 卢沛宁 ; traditional Chinese : 盧沛寧 ; pinyin : Lú Pèiníng ; born June 12 , 1966 ) is the United States Deputy Secretary of Labor . He also serve
= Illinois 's 1st congressional district election , 2000 = The 2000 United States House of Representatives election for the 1st district in Illinois took place on November 7 , 2000 to elect a re
= Patient Protection and Affordable Care Act = The Patient Protection and Affordable Care Act ( PPACA ) , commonly called the Affordable Care Act ( ACA ) or , colloquially , Obamacare , is a Uni
====================================================================================================
You search for: "einstein"
----------------------------------------------------------------------------------------------------
= Albert Einstein = Albert Einstein ( / ˈaɪnstaɪn / ; German : [ ˈalbɛɐ ̯ t ˈaɪnʃtaɪn ] ; 14 March 1879 – 18 April 1955 ) was a German @-@ born theoretical physicist . He developed the general t
= General relativity = General relativity ( GR , also known as the general theory of relativity or GTR ) is the geometric theory of gravitation published by Albert Einstein in 1915 and the curre
= Introduction to general relativity = General relativity is a theory of gravitation that was developed by Albert Einstein between 1907 and 1915 . According to general relativity , the observed
= hν . As shown by Albert Einstein , some form of energy quantization must be assumed to account for the thermal equilibrium observed between matter and electromagnetic radiation ; for this explanat
= Einstein – Szilárd letter = The Einstein – Szilárd letter was a letter written by Leó Szilárd and signed by Albert Einstein that was sent to the United States President Franklin D. Roosevelt o
= Wilhelm Reich = Wilhelm Reich ( 24 March 1897 – 3 November 1957 ) was an Austrian psychoanalyst , a member of the second generation of analysts after Sigmund Freud . The author of several infl
= Jürgen Ehlers = Jürgen Ehlers ( German : [ ˈjʏʁɡŋ ̩ ˈeːlɐs ] ; 29 December 1929 – 20 May 2008 ) was a German physicist who contributed to the understanding of Albert Einstein 's theory of gene
= Fizeau experiment = The Fizeau experiment was carried out by Hippolyte Fizeau in 1851 to measure the relative speeds of light in moving water . Fizeau used a special interferometer arrangement
= Universe = The Universe is all of time and space and its contents . It includes planets , moons , minor planets , stars , galaxies , the contents of intergalactic space , and all matter and en
= Black hole = A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing — including particles and electromagnetic radiation such as light — can escape from
====================================================================================================
You search for: "physics"
----------------------------------------------------------------------------------------------------
= Philosophy of mind = Philosophy of mind is a branch of philosophy that studies the nature of the mind , mental events , mental functions , mental properties , consciousness , and their relatio
= M @-@ theory = M @-@ theory is a theory in physics that unifies all consistent versions of superstring theory . The existence of such a theory was first conjectured by Edward Witten at a strin
= AdS / CFT correspondence = In theoretical physics , the anti @-@ de Sitter / conformal field theory correspondence , sometimes called Maldacena duality or gauge / gravity duality , is a conjec
= Condensed matter physics = Condensed matter physics is a branch of physics that deals with the physical properties of condensed phases of matter . Condensed matter physicists seek to understan
= Enrico Fermi = Enrico Fermi ( Italian : [ enˈriːko ˈfermi ] ; 29 September 1901 – 28 November 1954 ) was an Italian physicist , who created the world 's first nuclear reactor , the Chicago Pil
= Josiah Willard Gibbs = Josiah Willard Gibbs ( February 11 , 1839 – April 28 , 1903 ) was an American scientist who made important theoretical contributions to physics , chemistry , and mathema
= Mirror symmetry ( string theory ) = In algebraic geometry and theoretical physics , mirror symmetry is a relationship between geometric objects called Calabi – Yau manifolds . The term refers
= Joint custody ( United States ) = Joint custody is a court order whereby custody of a child is awarded to both parties . In joint custody both parents are custodial parents and neither parent
= Albert Einstein = Albert Einstein ( / ˈaɪnstaɪn / ; German : [ ˈalbɛɐ ̯ t ˈaɪnʃtaɪn ] ; 14 March 1879 – 18 April 1955 ) was a German @-@ born theoretical physicist . He developed the general t
= General relativity = General relativity ( GR , also known as the general theory of relativity or GTR ) is the geometric theory of gravitation published by Albert Einstein in 1915 and the curre
====================================================================================================
You search for: "india"
----------------------------------------------------------------------------------------------------
= India = India , officially the Republic of India ( Sanskrit : Bhārata Gaṇarājya ) , is a country in South Asia . It is the seventh @-@ largest country by area , the second @-@ most populous co
= Political integration of India = At the time of Indian independence in 1947 , India was divided into two sets of territories , one under the control of the British Empire , and the other over
= Harbhajan Singh = Harbhajan Singh Plaha ( pronunciation ; born 3 July 1980 in Jalandhar , Punjab , India ) , commonly known as Harbhajan Singh , is an Indian international cricketer and former
= Virat Kohli = Virat Kohli ( pronunciation ; born 5 November 1988 ) is an Indian international cricketer . He is a right @-@ handed batsman and occasional right @-@ arm medium pace bowler , who
= Air India = Air India is the flag carrier airline of India and the third largest airline in India in terms of passengers carried , after IndiGo and Jet Airways . It is owned by Air India Limit
= South India = South India is the area encompassing the Indian states of Andhra Pradesh , Karnataka , Kerala , Tamil Nadu and Telangana as well as the union territories of Andaman and Nicobar ,
= Irfan Pathan = Irfan Khan Pathan ( pronunciation ; born 27 October 1984 ) is an Indian cricketer who made his debut for India in the 2003 / 04 Border @-@ Gavaskar Trophy , and was a core membe
= Hindu – German Conspiracy = The Hindu – German Conspiracy ( Note on the name ) was a series of plans between 1914 and 1917 by Indian nationalist groups to attempt Pan @-@ Indian rebellion agai
= Climate of India = The climate of India comprises a wide range of weather conditions across a vast geographic scale and varied topography , making generalisations difficult . Based on the Köpp
= Muhammad Ali Jinnah = Muhammad Ali Jinnah ( Urdu : محمد علی جناح ALA @-@ LC : Muḥammad ʿAlī Jināḥ , born Mahomedali Jinnahbhai ; 25 December 1876 – 11 September 1948 ) was a lawyer , politicia
====================================================================================================
You search for: "director"
----------------------------------------------------------------------------------------------------
= Christopher Nolan = Christopher Edward Nolan ( / ˈnoʊlən / ; born July 30 , 1970 ) is an English @-@ American film director , screenwriter , and producer . He is one of the highest @-@ grossin
= Film noir = The film noir genre generally refers to mystery and crime dramas produced from the early 1940s to the late 1950s . Movies of this genre were characteristically shot in black and wh
= Clint Eastwood = Clinton " Clint " Eastwood Jr . ( born May 31 , 1930 ) is an American actor , film director , producer , musician , and political figure . He rose to international fame with h
= Stanley Kubrick = Stanley Kubrick ( / ˈkuːbrɪk / ; July 26 , 1928 – March 7 , 1999 ) was an American film director , screenwriter , producer , cinematographer , editor , and photographer . Par
= Georgia Tech Research Institute = The Georgia Tech Research Institute ( GTRI ) is the nonprofit applied research arm of the Georgia Institute of Technology in Atlanta , Georgia , United States
= Stanley Donen = Stanley Donen ( / ˈdɔːnən / DAWN @-@ ən ; born April 13 , 1924 ) is an American film director and choreographer whose most celebrated works are Singin ' in the Rain and On the
= English National Opera = English National Opera ( ENO ) is an opera company based in London , resident at the London Coliseum in St. Martin 's Lane . It is one of the two principal opera compa
= The Cabinet of Dr. Caligari = The Cabinet of Dr. Caligari ( German : Das Cabinet des Dr. Caligari ) is a 1920 German silent horror film , directed by Robert Wiene and written by Hans Janowitz
= Donnie Darko : The Director 's Cut = Donnie Darko : The Director 's Cut is a 2004 extended version of Richard Kelly 's directorial debut , Donnie Darko . A critical success but a commercial fa
= Abbas Kiarostami = Abbas Kiarostami ( Persian : عباس کیارستمی pronunciation ; 22 June 1940 – 4 July 2016 ) was an Iranian film director , screenwriter , photographer and film producer . An act
====================================================================================================

Step 7: Doing searches interactively

Also add the following code to perform search interactively from the command line.

###########################################################################
## interactive
while True:
    # replace raw_input() with input() if using Python3
    query = raw_input("Please enter the search query: ")
    display_results(query, search(query))
###########################################################################

Here’s a sample interaction:

Please enter the search query: feynman
You search for: "feynman"
----------------------------------------------------------------------------------------------------
= Quantum electrodynamics = In particle physics , quantum electrodynamics ( QED ) is the relativistic quantum field theory of electrodynamics . In essence , it describes how light and matter int
= Force = In physics , a force is any interaction that , when unopposed , will change the motion of an object . In other words , a force can cause an object with mass to change its velocity ( wh
= John Archibald Wheeler = John Archibald Wheeler ( July 9 , 1911 – April 13 , 2008 ) was an American theoretical physicist . He was largely responsible for reviving interest in general relativi
= Drexler – Smalley debate on molecular nanotechnology = The Drexler – Smalley debate on molecular nanotechnology was a public dispute between K. Eric Drexler , the originator of the conceptual
= Primer ( film ) = Primer is a 2004 American indie science fiction drama film about the accidental discovery of a means of time travel . The film was written , directed , produced , edited and
= Klaus Fuchs = Emil Julius Klaus Fuchs ( 29 December 1911 – 28 January 1988 ) was a German theoretical physicist and atomic spy who , in 1950 , was convicted of supplying information from the A
= Robert Bacher = Robert Fox Bacher ( August 31 , 1905 – November 18 , 2004 ) was an American nuclear physicist and one of the leaders of the Manhattan Project . Born in Loudonville , Ohio , Bac
= Frederick Reines = Frederick Reines ( RYE @-@ ness ) ; ( March 16 , 1918 – August 26 , 1998 ) was an American physicist . He was awarded the 1995 Nobel Prize in Physics for his co @-@ detectio
= AdS / CFT correspondence = In theoretical physics , the anti @-@ de Sitter / conformal field theory correspondence , sometimes called Maldacena duality or gauge / gravity duality , is a conjec
= Trinity ( nuclear test ) = Trinity was the code name of the first detonation of a nuclear weapon , conducted by the United States Army on July 16 , 1945 , as part of the Manhattan Project . Th
====================================================================================================
Please enter the search query: i love einstein
You search for: "i love einstein"
----------------------------------------------------------------------------------------------------
= Albert Einstein = Albert Einstein ( / ˈaɪnstaɪn / ; German : [ ˈalbɛɐ ̯ t ˈaɪnʃtaɪn ] ; 14 March 1879 – 18 April 1955 ) was a German @-@ born theoretical physicist . He developed the general t
= Courtney Love = Courtney Michelle Love ( born Courtney Michelle Harrison , July 9 , 1964 ) is an American musician , actress , and visual artist . Prolific in the punk and grunge scenes of the
= General relativity = General relativity ( GR , also known as the general theory of relativity or GTR ) is the geometric theory of gravitation published by Albert Einstein in 1915 and the curre
= Introduction to general relativity = General relativity is a theory of gravitation that was developed by Albert Einstein between 1907 and 1915 . According to general relativity , the observed
= hν . As shown by Albert Einstein , some form of energy quantization must be assumed to account for the thermal equilibrium observed between matter and electromagnetic radiation ; for this explanat
= Crazy in Love = " Crazy in Love " is a song from American singer Beyoncé 's debut solo album Dangerously in Love ( 2003 ) . Beyoncé wrote the song with Rich Harrison , Jay Z , and Eugene Recor
= Love on Top = " Love on Top " is a song recorded by American singer Beyoncé for her fourth studio album 4 ( 2011 ) . Inspired from her state of mind while playing Etta James in the 2008 musica
= Einstein – Szilárd letter = The Einstein – Szilárd letter was a letter written by Leó Szilárd and signed by Albert Einstein that was sent to the United States President Franklin D. Roosevelt o
= Troilus = Troilus ( English pronunciation : / ˈtrɔɪləs / or / ˈtroʊələs / ; Ancient Greek : Τρωΐλος Troïlos ; Latin : Troilus ) is a legendary character associated with the story of the Trojan
= I Could Fall in Love = " I Could Fall in Love " is a song recorded by American Tejano singer Selena for her fifth studio album , Dreaming of You ( 1995 ) , released posthumously by EMI Latin o
====================================================================================================
Please enter the search query: einstein is the best
You search for: "einstein is the best"
----------------------------------------------------------------------------------------------------
= Albert Einstein = Albert Einstein ( / ˈaɪnstaɪn / ; German : [ ˈalbɛɐ ̯ t ˈaɪnʃtaɪn ] ; 14 March 1879 – 18 April 1955 ) was a German @-@ born theoretical physicist . He developed the general t
= General relativity = General relativity ( GR , also known as the general theory of relativity or GTR ) is the geometric theory of gravitation published by Albert Einstein in 1915 and the curre
= Introduction to general relativity = General relativity is a theory of gravitation that was developed by Albert Einstein between 1907 and 1915 . According to general relativity , the observed
= Pete Best = Randolph Peter " Pete " Best ( born Randolph Peter Scanland , 24 November 1941 ) is an English musician , principally known as the original drummer for the Beatles from 1960 to 196
= hν . As shown by Albert Einstein , some form of energy quantization must be assumed to account for the thermal equilibrium observed between matter and electromagnetic radiation ; for this explanat
= Einstein – Szilárd letter = The Einstein – Szilárd letter was a letter written by Leó Szilárd and signed by Albert Einstein that was sent to the United States President Franklin D. Roosevelt o
= Wilhelm Reich = Wilhelm Reich ( 24 March 1897 – 3 November 1957 ) was an Austrian psychoanalyst , a member of the second generation of analysts after Sigmund Freud . The author of several infl
= Best Thing I Never Had = " Best Thing I Never Had " is a song recorded by the American singer Beyoncé for her fourth studio album , 4 ( 2011 ) . It was released by Columbia Records on June 1 ,
= Jürgen Ehlers = Jürgen Ehlers ( German : [ ˈjʏʁɡŋ ̩ ˈeːlɐs ] ; 29 December 1929 – 20 May 2008 ) was a German physicist who contributed to the understanding of Albert Einstein 's theory of gene
= Fizeau experiment = The Fizeau experiment was carried out by Hippolyte Fizeau in 1851 to measure the relative speeds of light in moving water . Fizeau used a special interferometer arrangement
====================================================================================================
Please enter the search query: marx
You search for: "marx"
----------------------------------------------------------------------------------------------------
= Karl Marx = Karl Marx ( / mɑːrks / ; German : [ ˈkaɐ ̯ l ˈmaɐ ̯ ks ] ; 5 May 1818 – 14 March 1883 ) was a philosopher , economist , sociologist , journalist , and revolutionary socialist . Bor
= Sociology = Sociology is the study of social behavior or society , including its origins , development , organization , networks , and institutions . It is a social science that uses various m
= Flywheel , Shyster , and Flywheel = Flywheel , Shyster , and Flywheel is a situation comedy radio show starring two of the Marx Brothers , Groucho and Chico , and written primarily by Nat Perr
= Vladimir Lenin = Vladimir Ilyich Ulyanov , alias Lenin ( / ˈlɛnɪn / ; 22 April [ O.S. 10 April ] 1870 – 21 January 1924 ) , was a Russian communist revolutionary , politician , and political t
= The Blood Red Tape of Charity = The Blood Red Tape of Charity is a 1913 American silent short drama film written , directed and starring by Edwin August and produced by Pat Powers . August wro
= Max Weber = Karl Emil Maximilian " Max " Weber ( German : [ ˈmaks ˈveːbɐ ] ; 21 April 1864 – 14 June 1920 ) was a German sociologist , philosopher , jurist , and political economist whose idea
= My Musical = " My Musical " is a musical episode of the American comedy @-@ drama television series Scrubs . It is the 123rd episode of the show , and was originally aired as episode 6 of seas
= Max Weber = Karl Emil Maximilian " Max " Weber ( German : [ ˈmaks ˈveːbɐ ] ; 21 April 1864 – 14 June 1920 ) was a German sociologist , philosopher , jurist , and political economist whose idea
= Che Guevara = Ernesto " Che " Guevara ( Spanish pronunciation : [ ˈtʃe ɣeˈβaɾa ] ; June 14 , 1928 – October 9 , 1967 ) , commonly known as El Che or simply Che , was an Argentine Marxist revol
= William Morris = William Morris ( 24 March 1834 – 3 October 1896 ) was an English textile designer , poet , novelist , translator , and socialist activist . Associated with the British Arts an
====================================================================================================
Please enter the search query: harry potter
You search for: "harry potter"
----------------------------------------------------------------------------------------------------
= Harry Potter = Harry Potter is a series of fantasy novels written by British author J. K. Rowling . The novels chronicle the life of a young wizard , Harry Potter , and his friends Hermione Gr
= Religious debates over the Harry Potter series = Religious debates over the Harry Potter series of books by J. K. Rowling are based on claims that the novels contain occult or Satanic subtexts
= J. K. Rowling = Joanne " Jo " Rowling , OBE , FRSL ( / ˈroʊlɪŋ / ; born 31 July 1965 ) , pen names J. K. Rowling and Robert Galbraith , is a British novelist , screenwriter and film producer b
= Harry Potter and the Deathly Hallows = Harry Potter and the Deathly Hallows is the seventh and final novel of the Harry Potter series , written by British author J. K. Rowling . The book was r
= Legal disputes over the Harry Potter series = Since first coming to wide notice in the late 1990s , the Harry Potter book series by J. K. Rowling has engendered a number of legal disputes . Ro
= Harry and the Potters = Harry and the Potters are an American rock band known for spawning the genre of wizard rock . Founded in Norwood , Massachusetts in 2002 , the group is primarily compos
= Harry Potter and the Philosopher 's Stone = Harry Potter and the Philosopher 's Stone is the first novel in the Harry Potter series and J. K. Rowling 's debut novel , first published in 1997 b
= Harry Potter and the Deathly Hallows – Part 1 = Harry Potter and the Deathly Hallows – Part 1 is a 2010 British @-@ American fantasy film directed by David Yates and distributed by Warner Bros
= Harry Potter and the Order of the Phoenix ( film ) = Harry Potter and the Order of the Phoenix is a 2007 British @-@ American fantasy film directed by David Yates and distributed by Warner Bro
= Harry Potter and the Chamber of Secrets = Harry Potter and the Chamber of Secrets is the second novel in the Harry Potter series , written by J. K. Rowling . The plot follows Harry 's second y
====================================================================================================
Please enter the search query: chamber of secrets
You search for: "chamber of secrets"
----------------------------------------------------------------------------------------------------
= Dwain Chambers = Dwain Anthony Chambers ( born 5 April 1978 ) is a British track sprinter . He has won international medals at World and European level and is one of the fastest European sprin
= Secret trusts in English law = In English law , secret trusts are a class of trust defined as an arrangement between a testator and a trustee , made to come into force after death , that aims
= Wookey Hole Caves = Wookey Hole Caves are a series of limestone caverns , show cave and tourist attraction in the village of Wookey Hole on the southern edge of the Mendip Hills near Wells in
= Harry Potter and the Chamber of Secrets = Harry Potter and the Chamber of Secrets is the second novel in the Harry Potter series , written by J. K. Rowling . The plot follows Harry 's second y
= Barzillai J. Chambers = Barzillai Jefferson Chambers ( December 5 , 1817 – September 16 , 1895 ) was an American surveyor , lawyer , and politician of the Gilded Age . Born in Kentucky , he mo
= Treblinka extermination camp = Treblinka ( pronounced [ trɛˈblʲinka ] ) was an extermination camp , built by Nazi Germany in occupied Poland during World War II . It was located in a forest no
= Elimination Chamber ( 2010 ) = Elimination Chamber ( 2010 ) ( also known as No Way Out ( 2010 ) in Germany ) was a professional wrestling pay @-@ per @-@ view event produced by World Wrestling
= The Secret Service = The Secret Service is a British children 's espionage television series , filmed by Century 21 for ITC Entertainment and broadcast on Associated Television , Granada Telev
= Secret ( Madonna song ) = " Secret " is a song recorded by American singer and songwriter Madonna from her sixth studio album Bedtime Stories ( 1994 ) . It was released on September 27 , 1994
= Portal 2 = Portal 2 is a 2011 first @-@ person puzzle @-@ platform video game developed and published by Valve Corporation . It is the sequel to Portal ( 2007 ) and was released on April 19 ,
====================================================================================================
Please enter the search query: order of phoenix
You search for: "order of phoenix"
----------------------------------------------------------------------------------------------------
= Phoenix , Arizona = Phoenix ( / ˈfiːnɪks / ) is the capital and largest city of the U.S. state of Arizona . With 1 @,@ 563 @,@ 025 people ( as of 2015 ) , Phoenix is the sixth most populous ci
= Beth Phoenix = Elizabeth " Beth " Kocianski ( born November 24 , 1980 ) is an American retired professional wrestler , better known by her ring name Beth Phoenix . She is best known for her ti
= Phoenix Wright : Ace Attorney = Phoenix Wright : Ace Attorney , known in Japan as Gyakuten Saiban ( 逆転裁判 , lit . " Turnabout Trial " ) , is a visual novel adventure video game developed by Cap
= Phoenix ( fireboat ) = Phoenix is a fireboat owned by State of California and operated by the city of San Francisco in the San Francisco Bay since 1955 . Phoenix is known for helping to save M
= Harry Potter and the Order of the Phoenix ( film ) = Harry Potter and the Order of the Phoenix is a 2007 British @-@ American fantasy film directed by David Yates and distributed by Warner Bro
= Roads and freeways in metropolitan Phoenix = The metropolitan area of Phoenix in the U.S. state of Arizona contains one of the nation 's largest and fastest @-@ growing freeway systems , boast
= Ace Attorney = Ace Attorney , known in Japan as Gyakuten Saiban ( Japanese : 逆転裁判 , " Turnabout Trial " ) , is a series of visual novel adventure video games developed by Capcom . The first en
= Phoenix Wright : Ace Attorney − Justice for All = Phoenix Wright : Ace Attorney − Justice for All , known in Japan as Gyakuten Saiban 2 ( Japanese : 逆転裁判2 , " Turnabout Trial 2 " ) , is a visu
= Remedies in Singapore administrative law = The remedies available in Singapore administrative law are the prerogative orders – the mandatory order ( formerly known as mandamus ) , prohibiting
= Apollo Justice : Ace Attorney = Apollo Justice : Ace Attorney , known in Japan as Gyakuten Saiban 4 ( Japanese : 逆転裁判4 , lit . " Turnabout Trial 4 " ) , is a visual novel adventure video game
====================================================================================================
Please enter the search query: japan
You search for: "japan"
----------------------------------------------------------------------------------------------------
= Germany – Japan relations = The Germany – Japan relations ( Japanese : 日独関係 , Hepburn : Nichidokukankei ) and German : Deutsch @-@ japanische Beziehungen ) were established in 1860 with the fi
= Japan = Japan ( Japanese : 日本 Nippon [ nip ̚ põ ̞ ɴ ] or Nihon [ nihõ ̞ ɴ ] ; formally 日本国 Nippon @-@ koku or Nihon @-@ koku , " State of Japan " ) is an island country in East Asia . Located
= Japan Airlines = Japan Airlines Co . , Ltd . ( JAL ) ( 日本航空株式会社 , Nihon Kōkū Kabushiki @-@ gaisha , TYO : 9201 , OTC Pink : JAPSY ) , is the flag carrier airline of Japan and the second larges
= Air raids on Japan = Allied forces conducted many air raids on Japan during World War II , causing extensive destruction to the country 's cities and killing between 241 @,@ 000 and 900 @,@ 00
= Surrender of Japan = The surrender of Japan was announced by Imperial Japan on August 15 and formally signed on September 2 , 1945 , bringing the hostilities of World War II to a close . By th
= Sea of Japan naming dispute = The international name for the body of water which is bordered by Japan , North Korea , Russia , and South Korea is disputed . In 1992 , objections to the name Se
= Flag of Japan = The national flag of Japan is a white rectangular flag with a large red disc representing the sun in the center . This flag is officially called Nisshōki ( 日章旗 , " sun @-@ mark
= World War II = World War II ( often abbreviated to WWII or WW2 ) , also known as the Second World War , was a global war that lasted from 1939 to 1945 , although related conflicts began earlie
= X Japan = X Japan ( エックス ・ ジャパン , Ekkusu Japan ) is a Japanese heavy metal band from Chiba , formed in 1982 by drummer Yoshiki and lead vocalist Toshi . Predominantly a power / speed metal ban
= Atomic bombings of Hiroshima and Nagasaki = The United States , with the consent of the United Kingdom as laid down in the Quebec Agreement , dropped nuclear weapons on the Japanese cities of
====================================================================================================
Please enter the search query: japan climate
You search for: "japan climate"
----------------------------------------------------------------------------------------------------
= Japan = Japan ( Japanese : 日本 Nippon [ nip ̚ põ ̞ ɴ ] or Nihon [ nihõ ̞ ɴ ] ; formally 日本国 Nippon @-@ koku or Nihon @-@ koku , " State of Japan " ) is an island country in East Asia . Located
= Germany – Japan relations = The Germany – Japan relations ( Japanese : 日独関係 , Hepburn : Nichidokukankei ) and German : Deutsch @-@ japanische Beziehungen ) were established in 1860 with the fi
= Global warming = Global warming and climate change are terms for the observed century @-@ scale rise in the average temperature of the Earth 's climate system and its related effects . Multipl
= Japan Airlines = Japan Airlines Co . , Ltd . ( JAL ) ( 日本航空株式会社 , Nihon Kōkū Kabushiki @-@ gaisha , TYO : 9201 , OTC Pink : JAPSY ) , is the flag carrier airline of Japan and the second larges
= Climate = Climate is the statistics ( usually , mean or variability ) of weather , usually over a 30 @-@ year interval . It is measured by assessing the patterns of variation in temperature ,
= Air raids on Japan = Allied forces conducted many air raids on Japan during World War II , causing extensive destruction to the country 's cities and killing between 241 @,@ 000 and 900 @,@ 00
= Surrender of Japan = The surrender of Japan was announced by Imperial Japan on August 15 and formally signed on September 2 , 1945 , bringing the hostilities of World War II to a close . By th
= Sea of Japan naming dispute = The international name for the body of water which is bordered by Japan , North Korea , Russia , and South Korea is disputed . In 1992 , objections to the name Se
= Joseph J. Romm = Joseph J. Romm ( born June 27 , 1960 ) is an American author , blogger , physicist and climate expert who advocates reducing greenhouse gas emissions and global warming and in
= Flag of Japan = The national flag of Japan is a white rectangular flag with a large red disc representing the sun in the center . This flag is officially called Nisshōki ( 日章旗 , " sun @-@ mark
====================================================================================================
Please enter the search query: tokyo
You search for: "tokyo"
----------------------------------------------------------------------------------------------------
= Hiroh Kikai = Hiroh Kikai ( 鬼海 弘雄 , Kikai Hiroo , born 18 March 1945 ) is a Japanese photographer best known within Japan for four series of monochrome photographs : scenes of buildings in and
= Air raids on Japan = Allied forces conducted many air raids on Japan during World War II , causing extensive destruction to the country 's cities and killing between 241 @,@ 000 and 900 @,@ 00
= Tokyo Tower = Tokyo Tower ( 東京タワー , Tōkyō tawā ) is a communications and observation tower located in the Shiba @-@ koen district of Minato , Tokyo , Japan . At 332 @.@ 9 metres ( 1 @,@ 092 ft
= Japan Airlines = Japan Airlines Co . , Ltd . ( JAL ) ( 日本航空株式会社 , Nihon Kōkū Kabushiki @-@ gaisha , TYO : 9201 , OTC Pink : JAPSY ) , is the flag carrier airline of Japan and the second larges
= Tokyo Mew Mew = Tokyo Mew Mew ( 東京ミュウミュウ , Tōkyō Myū Myū ) is a Japanese shōjo manga series written by Reiko Yoshida and illustrated by Mia Ikumi . It was originally serialized in Nakayoshi fr
= Antonin Raymond = Antonin Raymond ( or Czech : Antonín Raymond ) , born as Antonín Reimann ( 10 May 1888 , Kladno , Bohemia – 21 November 1976 Langhorne , Pennsylvania ) , was a Czech American
= Kenzō Tange = Kenzō Tange ( 丹下 健三 , Tange Kenzō , 4 September 1913 – 22 March 2005 ) was a Japanese architect , and winner of the 1987 Pritzker Prize for architecture . He was one of the most
= Germany – Japan relations = The Germany – Japan relations ( Japanese : 日独関係 , Hepburn : Nichidokukankei ) and German : Deutsch @-@ japanische Beziehungen ) were established in 1860 with the fi
= Shin Megami Tensei IV = Shin Megami Tensei IV ( Japanese : 真 ・ 女神転生IV , literally " True Goddess Reincarnation IV " ) is a Japanese post @-@ apocalyptic role @-@ playing video game developed b
= Kanō Jigorō = Kanō Jigorō ( 嘉納 治五郎 , 28 October 1860 – 4 May 1938 ) was a Japanese educator and athlete , the founder of Judo . Judo was the first Japanese martial art to gain widespread inter
====================================================================================================

Solution on Google Colaboratory

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small machine learning experiments in your browser. If you’re unfamiliar with Google Colaboratory, read this 1-minute tutorial first. Note that, for this project, you’ll have to upload the dataset to Google Colab after saving the notebook to your own system.

Discussion

We covered a lot of ground. Here are a few things worth noting.

Result quality:

  1. Notice that when we searched for phrases like “einstein is the best” and “i love einstein”, Albert Einstein was still the top result, even though the dataset contains articles on people with the surnames “Love” and “Best”. That’s because “einstein” appears in fewer articles, so its IDF is larger than the IDF of “love” and “best”.
  2. When we searched for “order of phoenix”, the top result was the article on Phoenix, Arizona. This is because “order” and “of” have a low IDF, and “phoenix” appears many more times in the Phoenix, Arizona article than in the Harry Potter one. We can improve this by adding an “advanced search” option that requires all query words to be present.
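
The “all words must be present” idea from point 2 can be sketched as a filter that runs before scoring. The snippet below is only an illustration on a made-up three-document corpus, not the code from the earlier steps: `tokenize`, `tfidf_score` and `search_all_words` are hypothetical names, and the IDF form log(N / df) with raw counts as TF is an assumption.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/digits (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

# Toy corpus standing in for the Wikipedia articles (made-up text).
docs = {
    "Phoenix , Arizona": "phoenix is the capital of arizona phoenix phoenix city",
    "Order of the Phoenix": "harry potter and the order of the phoenix film",
    "Court order": "a court order is an official order of a judge",
}
term_counts = {title: Counter(tokenize(body)) for title, body in docs.items()}
# Number of documents each word appears in.
doc_freq = Counter(word for counts in term_counts.values() for word in counts)

def tfidf_score(title, words):
    # Sum of TF * IDF over the query words, with IDF = log(N / df) (assumed form).
    n_docs = len(term_counts)
    score = 0.0
    for word in words:
        if doc_freq[word]:
            idf = math.log(n_docs / float(doc_freq[word]))
            score += term_counts[title][word] * idf
    return score

def search_all_words(query, require_all=False):
    words = tokenize(query)
    candidates = list(term_counts)
    if require_all:
        # Keep only documents that contain every query word.
        candidates = [t for t in candidates
                      if all(term_counts[t][w] > 0 for w in words)]
    return sorted(candidates, key=lambda t: tfidf_score(t, words), reverse=True)

print(search_all_words("order of phoenix")[0])
print(search_all_words("order of phoenix", require_all=True))
```

On this toy corpus the plain search still puts the Phoenix, Arizona document first, while `require_all=True` leaves only the document that contains “order”, “of” and “phoenix”.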

Google?

  1. Our simple ranking function did a pretty good job on Wikipedia because the data is so clean. Google can’t use such a simple ranking function, because site owners would game it by repeating “Einstein” thousands of times on their pages.
  2. Google indexes about 130 trillion webpages. That’s about 5,000,000,000 times more documents than we indexed (26,000).
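
To see why point 1 matters, here is a toy illustration of keyword stuffing. It assumes TF is a raw word count and uses two made-up page texts; `term_frequency`, `honest_page` and `spam_page` are hypothetical names.

```python
from collections import Counter

def term_frequency(text, word):
    # Raw count of `word` in lowercased, whitespace-separated text
    # (an assumed TF definition for this sketch).
    return Counter(text.lower().split())[word]

honest_page = "albert einstein developed the general theory of relativity"
spam_page = "buy cheap watches " + "einstein " * 1000

honest_tf = term_frequency(honest_page, "einstein")
spam_tf = term_frequency(spam_page, "einstein")
print(honest_tf, spam_tf)  # prints: 1 1000
```

The IDF of “einstein” is the same for both pages, so under this scoring the spam page would outrank the honest one by a factor of 1000 for the query “einstein”.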

Ideas for further exploration

You can take this project a lot further. Some ideas related to NLP / search / ranking:

  1. adding a feature for matching all words in the query,
  2. supporting search for complete phrases,
  3. excluding results that contain a specific word,
  4. searching for numbers within a particular range,
  5. choosing a dataset with much shorter documents (say, tweets).
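
As a starting point for ideas 2 and 3, phrase matching and word exclusion can both be written as filters over tokenized documents. This is a rough sketch on made-up texts, with hypothetical helper names (`contains_phrase`, `filter_docs`); a real implementation would probably store word positions in the index rather than rescan every document.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of letters/digits (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens, phrase_tokens):
    # True if phrase_tokens occurs as a contiguous run inside tokens.
    n = len(phrase_tokens)
    if n == 0:
        return True
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))

def filter_docs(docs, phrase=None, exclude=None):
    # docs maps title -> text. Returns the titles that pass both filters.
    phrase_tokens = tokenize(phrase) if phrase else []
    exclude_tokens = set(tokenize(exclude)) if exclude else set()
    kept = []
    for title, text in docs.items():
        tokens = tokenize(text)
        if not contains_phrase(tokens, phrase_tokens):
            continue  # the exact phrase is missing
        if exclude_tokens & set(tokens):
            continue  # an excluded word is present
        kept.append(title)
    return kept

docs = {
    "Harry Potter": "harry potter is a series of fantasy novels",
    "Harry and the Potters": "harry and the potters are an american rock band",
    "Harry Potter ( film )": "harry potter film directed by david yates",
}
print(sorted(filter_docs(docs, phrase="harry potter")))
print(filter_docs(docs, phrase="harry potter", exclude="film"))
```

The surviving titles can then be ranked with the same TFIDF score as before.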

You can also take it in a completely different direction, by

  1. performing the indexing in a database,
  2. supporting adding, editing and deleting documents without re-indexing everything,
  3. making search faster (currently, the time for a single search keeps growing as more documents get indexed).
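
One common way to approach idea 3 is an inverted index, which maps each word to the documents that contain it, so a query only touches documents that share at least one word with it. The sketch below is a minimal, assumption-laden version: `build_index` and `search_index` are hypothetical names, the documents are made up, and the IDF form log(N / df) with raw counts as TF is assumed rather than taken from the earlier steps.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep runs of letters/digits (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index: word -> {doc_id: count of the word in that document}.
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for word, count in Counter(tokenize(text)).items():
            index[word][doc_id] = count
    return index

def search_index(index, n_docs, query, top_k=10):
    # Only documents listed under the query words are ever scored.
    scores = defaultdict(float)
    for word in tokenize(query):
        postings = index.get(word, {})
        if not postings:
            continue  # word not in any document
        idf = math.log(n_docs / float(len(postings)))  # assumed IDF form
        for doc_id, count in postings.items():
            scores[doc_id] += count * idf
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

docs = [
    "tokyo tower is a tower in tokyo japan",
    "japan airlines is the flag carrier of japan",
    "climate is the statistics of weather",
]
index = build_index(docs)
print(search_index(index, len(docs), "japan"))  # prints: [1, 0]
```

Because each word keeps its own per-document entries, adding or deleting a document only changes the entries for the words in that document, which is also a reasonable starting point for idea 2.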

Hope you enjoyed this project!

