In this post, I will giving two detailed implementations of Trie. One using vectors, and another using pointers.
Prerequisites:
- The introductory tutorial on trie: Trie and Strings Searching
- It's preferable to know what's a struct in programming languages.
- To understand Section 2: Implementation using pointers, you need to know what pointers are.
Quick Revision:
- A Trie is a tree where each node represents a word or a prefix.
- The root represents the empty string, each node will have outgoing edges. Each of them labeled with one letter of the alphabet.
- Since each node represents a prefix, moving along an edge would be equivalent to appending the letter it's associated with to this prefix.
- Prefixes/Words of length i are presented with nodes that are i edges away from the root.
- If a node w is an ancestor of node v then the word represented by w is a prefix of the word represented by v.
- Inserting / Deleting / Searching for a word in the Trie is done in linear time in length of the word, i.e. O(W).
- Memory complexity of trie is O(TOTAL_LENGTH), where TOTAL_LENGTH is the sum of the lengths of the individual words.
Problem setup:
We will implement a dictionary that supports the following operations:
- Insert a word into the dictionary
- For a given word W, report the number of words V in the dictionary such that W is a prefix of V.
Assume that our words consist only of lowercase english letters.
Section 1: Implementation using vectors:
First of all, we should decide what information we should keep in each node of our Trie (this depends of course on the problem we are solving).
Each node in our trie will be stored in a vector. Observe this piece of code.
int NIL = -1;struct node {int count;int child[26];node(){count = 0;memset(child, NIL, sizeof(child));}};vector <node> data; // check Note [1]
Each node in our trie has an array of exactly 26 edges. Each edge stores the index of the node (the index at which the child node can be found in the data vector). [2, 3]
- child[0] refers to the node representing current prefix + letter 'a'
- child[1] refers to the node representing current prefix + letter 'b'
- and so on, ...
- child[25] refers to the node representing current prefix + letter 'z'
If a child does not exist, then we set child[x] = NIL. The special value NIL denotes that the current node does not have an x-th child, i.e. there are no words with starting with current prefix + x-th letter.
We are going to discuss variable count a bit later.
What does insert_word() function look like?
First of all we should declare the head of our Trie and assign it an element in our array that was never assigned to any other node. After that we may insert our words starting from the head. We must be able to get any word in the dictionary by starting from the head and going along edges.
vector <node> data;void insert_word(int head, string str){int cur = head;for(int i = 0 ; i < str.size() ; i++){int child_index = str[i] - 'a';if(data[cur].child[child_index] == NIL){data[cur].child[child_index] = data.size();data.append(node());}cur = data[cur].child[child_index];++data[cur].count;}/*Comments:- Line 6: We start with the head of our Trie- Line 8: We loop on the letters of our string from thebeginning to the end- Line 12: We check whether the child of this nodeassociated with the letter exists or not- If it doesn't we create a new one- Line 14: We initialize our new created node to anempty one.- Line 20: We move from our node to the child associatedwith the letter, thus we have extended the currentmatched prefix from our string by one letter.- Our loop will continue on till it matches theword representing our string completely.*/}int main(){data.append(node());// initializing our head to an empty node}
You may be asking yourself, what would we do with the variable count in our struct. Our second query is: For a given word W, report the number of words V such that W is a prefix of V.
If we find the node representing our full word W, then our answer would be the number of words that passed by this node while constructing them from the root. Thus we are maintaining a variable keeping track of the number of words which have passed by a certain node.
Can you implement the second query function yourself?
int find_word(int head, string str){int cur = head;for(int i = 0 ; i < str.size() ; i++){int child_index = str[i] - 'a';if(data[cur].child[child_index] == NIL)return 0;// Prefix is not present in the dictionary. Terminate.cur = data[cur].child[child_index];}// We reached the node representing our word.// Thus return count of this node as the answerreturn data[cur].count;}
Section 2: Implementing using pointers:
For each of the code snippers above, there will be some fairly minor changes.
node:
int NIL = -1;struct node{int count;node* child[26]; // NOTE: this is now an array of pointers/* 0~a , 1~b , 2~c , 3~d ... 24~y , 25~z */node(){count = 0;memset(child , NIL , sizeof(child));}};// NOTE: no data array / vector needed
insert:
void insert_word(node head, string str){node *it = head; // NOTE: pointer to headfor(int i = 0 ; i < str.size() ; i++){int child_index = str[i] - 'a';if(it->child[child_index] == NIL)it->child[child_index] = new node();it = it->child[child_index];++it->count;}/*Comments:- Line 2: We start with a pointer referring to thehead of our Trie- Line 3: We loop on the letters of our string fromthe beginning to the end- Line 7: We check whether the child of this nodeassociated with the letter exists or not- If it doesn't we create a new one- Line 9: We move our pointer to the child associatedwith the letter, thus we have extended the currentmatched prefix from our string by one letter.- Our loop will continue on till it matches theword representing our string completely.*/}int main(){node *head;head = new node();insert_word(head, "hussain");insert_word(head, "keshav");}
query:
int find_word(node *head, string str){node *it = head;for(int i = 0 ; i < str.size() ; i++){int child_index = str[i] - 'a';if(it->child[child_index] == NULL)return 0;// Prefix is not in the dictionary. Terminate.it = it->child[child_index];}// We reached the node representing our word// Thus returning the count of this node as the answerreturn it->count;}
Notes:
- The data vector can also be stored as an array of size MAX_TOTAL_LENGTH, if the maximum size is known beforehand. In this case, you will also have to maintain a size counter, which tells you the current numbers of nodes that have been used.
- This has similarities to the Adjacency List representation of graphs, in the sense that you store the index of the vertex in the adjacency list, and here too you store the index of the node.
- In terms of memory needed, the representation is similar to Adjacency Matrix, in the sense even if there is only one child node, we store an entire vector with almost all elements equal to NIL. This can be made more memory efficient using a map.
- The above 2 implementations here consumes memory of (26 * TOTAL_LENGTH), we can make it linear by turning our array in the struct into a vector, but that would affect the searching complexity. In most problems we are dealing with the normal alphabet so we can say that it's liear. When dealing with large alphabets the situation is different.
Hope you enjoyed! Share the love and the knowledge! :)