CommonLounge Archive

Hands-on Project: Word Count

August 02, 2018

In this project, you’ll write a program that counts and displays the number of times each word appears in Life, the Universe and Everything (or any other book). In the process, you’ll practice almost all the Python skills you have acquired over the course, including command-line arguments, file input-output, strings and string methods, functions, lists and dictionaries, custom sorting, loops and if-statements. Let’s get started! (Sidenote: Life, the Universe and Everything is the third book in The Hitchhiker’s Guide to the Galaxy science fiction series by Douglas Adams.)

Note: You need to have Python installed on your computer to be able to do this project.

Overview

Although we provide guidance and instructions, you’ll write all the code for this project yourself. Your program should read text from a file and output the number of times each word appears in that file.

Here’s a sample of the expected interaction. The following command:

python3 wordcount.py hitch3.txt

should produce two output files, most_popular.txt and alphabetical.txt:

most_popular.txt

the 3076
and 1599
of 1490
to 1371
a 1344
he 1248
it 1061
was 917
... more lines ... 
robot 55
any 54
made 54
will 54
eyes 53
how 53
too 53
anything 51
galaxy 51
mind 51
round 51
got 50
nothing 50
rather 50
right 50
being 49
sky 49
... many more lines ... 

alphabetical.txt

' 11
'cos 2
'em 1
'strue 1
- 83
--indeed 1
1 1
10 1
108 4
11 1
... more lines ...
about 173
above 20
abrupt 1
abruptly 1
absence 2
absolute 5
absolutely 2
abstractedly 1
... many more lines ... 
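The two orderings shown in the sample files above can be sketched as follows. This is only a minimal sketch, not the project's official solution: the function names sort_by_popularity, sort_alphabetically and write_counts are hypothetical, and the code assumes the counts are already stored in a dictionary that maps each lowercase word to its count. Python compares strings character by character using character codes, which is why entries starting with an apostrophe, a hyphen or a digit come before entries starting with a letter in alphabetical.txt.

```python
def sort_by_popularity(counts):
    """Return (word, count) pairs, highest count first, ties broken alphabetically."""
    # Negating the count sorts it in descending order, while the word itself
    # stays in ascending order as the tie-breaker.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def sort_alphabetically(counts):
    """Return (word, count) pairs ordered by word."""
    # Tuples compare element by element, so the word decides the order.
    return sorted(counts.items())


def write_counts(items, filename):
    """Write one 'word count' line per pair (the UTF-8 encoding is an assumption)."""
    with open(filename, "w", encoding="utf-8") as f:
        for word, count in items:
            f.write(f"{word} {count}\n")


# A few counts taken from the sample output above.
sample = {"made": 54, "robot": 55, "will": 54, "any": 54, "-": 83, "'": 11, "10": 1, "1": 1}

for word, count in sort_by_popularity(sample):
    print(word, count)
```

With a complete dictionary of counts, write_counts(sort_by_popularity(counts), "most_popular.txt") and write_counts(sort_alphabetically(counts), "alphabetical.txt") would produce the two files.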

Keep reading for more detailed instructions. If you feel confident, try downloading the data and completing the project without looking at the rest of the instructions (or using only as much of them as you need). Once you’re done coding, take the quiz!

Guidelines

  • Step 1: Download the data
      • You can download the data from the following URL: hitch3.txt. This file contains a plain-text version of Life, the Universe and Everything, the third book in The Hitchhiker’s Guide to the Galaxy science fiction series. (A sample is included below.)
      • Tip: I suggest creating another file, say hitch3small.txt, that contains only the first 50 lines or so. This makes it easier to print out what your code is doing and inspect the output.
  • Step 2: Get the filename from the command line and read the input
      • Use the sys module to get the filename from the command line, and then read the file. Relevant tutorials:
          • Python3 Modules and Command-line execution
          • Python3 Sorting and File input-output
      • Tip: After every step, print out the values of your intermediate variables to check that everything is working as you expect.
  • Step 3: Split the text into a list of words
      • For this exercise, we’ll define a word as any sequence of letters (a-z, A-Z), digits (0-9), apostrophes (') or hyphens (-). For example, "You're a jerk, Dent," it said simply. has the following words: You're, a, jerk, Dent, it, said and simply.
      • You might find it helpful to define is_legal(chr), which returns True if chr is one of the characters mentioned above.
      • Relevant tutorial: Python3 Lists and Loops
  • Step 4: Count the number of occurrences of each word
      • When counting, convert words to lowercase, so that Hello, HELLO and hello all count towards hello.
      • Relevant tutorial: Python3 Dictionaries and Tuples
  • Step 5: Sort the items by word count and output to most_popular.txt
      • List the most popular words first, and break ties alphabetically. In the example above, robot comes before any because robot has the higher word count. any comes before made because the two have the same word count and any is earlier in alphabetical order.
      • Relevant tutorial: Python3 Sorting and File input-output
  • Step 6: Sort the words in alphabetical order and output to alphabetical.txt
  • Step 7: Double-check that everything works as expected and take the quiz!
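If you get stuck on Steps 2 to 4, the following is one possible sketch, not the official solution. It assumes the input file can be read as UTF-8 text. The helper names split_words, count_words and read_text are hypothetical, and the parameter of is_legal is called ch rather than chr here only to avoid hiding Python's built-in chr function.

```python
import string
import sys

# Characters allowed inside a word: ASCII letters, digits, apostrophe, hyphen.
LEGAL_CHARS = set(string.ascii_letters + string.digits + "'-")


def is_legal(ch):
    """Return True if the single character ch may appear inside a word."""
    return ch in LEGAL_CHARS


def split_words(text):
    """Split text into maximal runs of legal characters."""
    words = []
    current = []  # characters of the word being built
    for ch in text:
        if is_legal(ch):
            current.append(ch)
        elif current:
            # An illegal character ends the current word.
            words.append("".join(current))
            current = []
    if current:  # flush a word that runs to the end of the text
        words.append("".join(current))
    return words


def count_words(words):
    """Map each lowercased word to its number of occurrences."""
    counts = {}
    for word in words:
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


def read_text(filename):
    """Read the whole file as text (the UTF-8 encoding is an assumption)."""
    with open(filename, encoding="utf-8", errors="replace") as f:
        return f.read()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 wordcount.py hitch3.txt
    counts = count_words(split_words(read_text(sys.argv[1])))
    print(len(counts), "distinct words")
```

Scanning character by character keeps the definition of a word explicit. Printing split_words on a single sentence, such as the example in Step 3, is a quick way to check it before running the whole book.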

Sample of hitch3.txt

The file begins as follows:

             Douglas Adams
           Life, the Universe, and Everything
=================================================================
Douglas Adams   The Hitch Hiker's Guide to the Galaxy
Douglas Adams   The Restaurant at the End of the Universe
Douglas Adams   Life, the Universe, and Everything
Douglas Adams   So long, and thanks for all the fish
=================================================================
Life, the universe and everything
for Sally
=================================================================
Chapter 1
The regular early morning yell of horror was the sound of  Arthur
Dent waking up and suddenly remembering where he was.
It wasn't just that the cave was cold, it wasn't just that it was
damp  and smelly. It was the fact that the cave was in the middle
of Islington and there wasn't a bus due for two million years.
... 
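The guidelines suggest working with a smaller file, such as hitch3small.txt, that holds only the first 50 lines or so. One way to create it is sketched below. The helper names first_lines and make_small_copy are hypothetical, and the UTF-8 encoding is an assumption about the downloaded file.

```python
from itertools import islice


def first_lines(lines, num_lines=50):
    """Return at most num_lines items from an iterable of lines."""
    return list(islice(lines, num_lines))


def make_small_copy(src, dst, num_lines=50):
    """Copy the first num_lines lines of the file src into the file dst."""
    with open(src, encoding="utf-8", errors="replace") as fin:
        head = first_lines(fin, num_lines)
    with open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(head)  # each line still ends with its newline
```

Calling make_small_copy("hitch3.txt", "hitch3small.txt") would then write the shortened file next to the original.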

Solution

The solution to this project is included at the end of the quiz.

