In this project, you'll write a program to calculate and display the number of times each word appears in Life, the Universe and Everything! (or any other book). In the process, you'll practice almost all the Python skills you have acquired over the course, including command-line arguments, file input-output, strings and string methods, functions, lists and dictionaries, custom sorting, loops and if-statements. Let's get started! (Sidenote: Life, the Universe and Everything! is the third book in the Hitchhiker's Guide to the Galaxy science fiction series by Douglas Adams.)
Note: You need to have Python installed on your computer to be able to do this project.
Overview
Although we provide guidance and instructions, you'll be writing all the code for this project yourself. Your program should read text from a file and output the number of times each word appears in that file.
Here's a sample of the expected interaction. The following command:
python3 wordcount.py hitch3.txt
should produce two output files, most_popular.txt and alphabetical.txt:
most_popular.txt
the 3076
and 1599
of 1490
to 1371
a 1344
he 1248
it 1061
was 917
... more lines ...
robot 55
any 54
made 54
will 54
eyes 53
how 53
too 53
anything 51
galaxy 51
mind 51
round 51
got 50
nothing 50
rather 50
right 50
being 49
sky 49
... many more lines ...
alphabetical.txt
' 11
'cos 2
'em 1
'strue 1
- 83
--indeed 1
1 1
10 1
108 4
11 1
... more lines ...
about 173
above 20
abrupt 1
abruptly 1
absence 2
absolute 5
absolutely 2
abstractedly 1
... many more lines ...
Keep reading for more detailed instructions. If you feel confident, try downloading the data and completing the project without looking at the rest of the instructions (or using as little of them as needed). Once you're done with coding, take the quiz!
Guidelines
- Step 1: Download the data
- You can download the data from the following URL: hitch3.txt. This file contains a plain-text version of Life, the Universe and Everything!, the third book in the Hitchhiker's Guide to the Galaxy science fiction series. (A sample is included below.)
- Tip: I suggest creating another file, say hitch3small.txt, which only has the first 50 lines or so. It will make it easier to print out what your code is doing and look at the output.
- Step 2: Get the filename from the command line and read the input
- Use the sys module to get the filename from the command line, and then read the file. Relevant tutorials:
- Tip: After every step, keep printing out the values of your intermediate variables to check if everything is working as you expect it to.
- Step 3: Split the text into a list of words
- For this exercise, we'll define a word as any sequence consisting of letters (a-z, A-Z), digits (0-9), apostrophes (') or hyphens (-). For example, "You're a jerk, Dent," it said simply. has the following words: You're, a, jerk, Dent, it, said and simply.
- You might find it helpful to define a function is_legal(chr) which returns True if chr is one of the characters mentioned above.
- Relevant tutorial: Python3 Lists and Loops
- Step 4: Count the number of occurrences of each word.
- When counting, convert words to lowercase, so that Hello, HELLO and hello all count towards the total for hello.
- Relevant tutorial: Python3 Dictionaries and Tuples
- Step 5: Sort the items by word count and output to most_popular.txt
- Most popular first. Break ties by alphabetical ordering. In the example above, robot comes before any because the word count for robot is higher. Meanwhile, any comes before made because they have the same word count and any is earlier in alphabetical order.
- Relevant tutorial: Python3 Sorting and File input-output
- Step 6: Sort the words by alphabetical order and output to alphabetical.txt
- Step 7: Double check everything works as expected and take the quiz!
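If you get stuck, the steps above can be sketched roughly as follows. This is one possible outline under stated assumptions, not the official solution: apart from is_legal, every function name here (split_words, count_words, by_popularity, alphabetically, write_counts, main) is made up for illustration, and the parameter is called ch rather than chr to avoid shadowing Python's built-in chr function. The output format of one "word count" pair per line is inferred from the sample files shown earlier.

```python
import string

# Characters allowed inside a word, following the Step 3 definition.
LEGAL_CHARS = set(string.ascii_letters + string.digits + "'-")


def is_legal(ch):
    """Return True if ch is a letter, digit, apostrophe or hyphen."""
    return ch in LEGAL_CHARS


def split_words(text):
    """Step 3: split text into maximal runs of legal characters."""
    words = []
    current = []
    for ch in text:
        if is_legal(ch):
            current.append(ch)
        elif current:
            # An illegal character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:  # flush the last word if the text ends mid-word
        words.append("".join(current))
    return words


def count_words(words):
    """Step 4: count occurrences, ignoring case."""
    counts = {}
    for word in words:
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


def by_popularity(counts):
    """Step 5: highest count first, ties broken alphabetically."""
    # Negating the count puts larger counts first in an ascending sort.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def alphabetically(counts):
    """Step 6: (word, count) pairs ordered by word."""
    return sorted(counts.items())


def write_counts(pairs, path):
    """Write one 'word count' pair per line to the file at path."""
    with open(path, "w") as out:
        for word, count in pairs:
            out.write(f"{word} {count}\n")


def main(argv):
    """Step 2 onwards: argv[1] is the input filename, e.g. hitch3.txt."""
    with open(argv[1]) as f:
        text = f.read()
    counts = count_words(split_words(text))
    write_counts(by_popularity(counts), "most_popular.txt")
    write_counts(alphabetically(counts), "alphabetical.txt")
```

To use this as wordcount.py, you would import sys and call main(sys.argv) at the bottom of the file. Sorting on the pair (-count, word) is just one way to get the tie-breaking order; sorting alphabetically first and then by count with a stable sort would work too.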
Sample of hitch3.txt
The file begins as follows:
Douglas Adams
Life, the Universe, and Everything
=================================================================
Douglas Adams The Hitch Hiker's Guide to the Galaxy
Douglas Adams The Restaurant at the End of the Universe
Douglas Adams Life, the Universe, and Everything
Douglas Adams So long, and thanks for all the fish
=================================================================
Life, the universe and everything
for Sally
=================================================================
Chapter 1
The regular early morning yell of horror was the sound of Arthur
Dent waking up and suddenly remembering where he was.
It wasn't just that the cave was cold, it wasn't just that it was
damp and smelly. It was the fact that the cave was in the middle
of Islington and there wasn't a bus due for two million years.
...
Solution
The solution to this project is included at the end of the quiz.