To start this guide, download this zip file.

Counting

Dictionaries make it easy to count items. For example, let’s say we want to count the number of times each vowel appears in a string. Here is what this program should do:

% python vowel_counts.py 'Hello. How are you?'
{'a': 1, 'e': 2, 'i': 0, 'o': 3, 'u': 1}

Notice that for the string Hello. How are you? we have created a dictionary that maps each vowel to the number of times it appears:

a: 1
e: 2
i: 0
o: 3
u: 1

To see how we can do this, take a look at this function:

def count(letters, text):
    # create an empty dictionary
    counts = {}

    # loop through all of the letters we are counting
    # and initialize their counts to zero
    for letter in letters:
        counts[letter] = 0

    # loop through all of the letters in the text
    # be sure to convert to lowercase
    for c in text.lower():
        # if this letter is one we are counting, add 1 to its count
        if c in counts:
            counts[c] += 1

    # return the dictionary
    return counts

This function takes the letters to count and the text to search, both as strings. For example, we could call it with:

vowel_counts = count('aeiou', text)

In this function we:

- create an empty dictionary
- loop through all of the letters we are counting and initialize their counts to zero
- loop through all of the letters in the text
  - if the letter we are looking at is one of the ones we are counting, then add one to its count
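
As a quick check of these steps, here is a small sketch that calls count() on the example string from the start of this guide. The second call, with the letters 'lh', is only an illustration that the function is not limited to vowels.

```python
def count(letters, text):
    # start every letter we are counting at zero
    counts = {}
    for letter in letters:
        counts[letter] = 0

    # add one each time a lowercased character is a letter we are counting
    for c in text.lower():
        if c in counts:
            counts[c] += 1
    return counts


vowel_counts = count('aeiou', 'Hello. How are you?')
print(vowel_counts)   # {'a': 1, 'e': 2, 'i': 0, 'o': 3, 'u': 1}

# the same function works for any letters (these two are just an example)
other_counts = count('lh', 'Hello. How are you?')
print(other_counts)   # {'l': 2, 'h': 2}
```

Notice that 'i' still appears with a count of zero, because every letter we are counting is initialized before we look at the text.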

Here is a program that uses this function, which you can find in vowel_counts.py:

import sys


def count(letters, text):
    # create an empty dictionary
    counts = {}

    # loop through all of the letters we are counting
    # and initialize their counts to zero
    for letter in letters:
        counts[letter] = 0

    # loop through all of the letters in the text
    # be sure to convert to lowercase
    for c in text.lower():
        # if this letter is one we are counting, add 1 to its count
        if c in counts:
            counts[c] += 1

    # return the dictionary
    return counts


def main(text):
    # count how many times each vowel occurs in the text
    vowel_counts = count('aeiou', text)
    # print out the dictionary
    print(vowel_counts)


if __name__ == '__main__':
    main(sys.argv[1])

We can test this program by giving it another string:

% python vowel_counts.py "I am going to double major in Computer Science and Journalism"
{'a': 4, 'e': 4, 'i': 5, 'o': 6, 'u': 3}

Looks like it works!


States

To practice this, we are going to write a program that has a group of people enter their home state or country. After all of the places are entered, the program then prints out how many people are from each place. For example:

% python place_count.py
State or Country: Delaware
State or Country: Montana
State or Country: Pakistan
State or Country: Iran
State or Country: Montana
State or Country: Pakistan
State or Country: India
State or Country: California
State or Country:
{'Delaware': 1, 'Montana': 2, 'Pakistan': 2, 'Iran': 1, 'India': 1, 'California': 1}

Here is a function to compute the dictionary:

def get_places():
    # create an empty dictionary
    places = {}
    while True:
        # get a place
        place = input('State or Country: ')
        # break if we are done
        if not place:
            break
        # if this place is not in the dictionary yet
        # then initialize this place to zero
        if place not in places:
            places[place] = 0
        # increment this place by one
        # this doesn't cause an error because we were sure
        # to initialize it to zero above
        places[place] += 1

    # return the dictionary
    return places

Notice that this follows a pattern similar to the one we used to count vowels. The difference here is that we don’t know the keys for the dictionary in advance. If we are counting vowels, the keys are always the letters in “aeiou”. But for this problem, the keys are whatever states and countries people enter.

We can handle this problem by using this code:

if place not in places:
    places[place] = 0

Whenever we find a place that is not in the dictionary, we initialize its value to zero.
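
To see why this check matters, here is a minimal sketch of what happens with and without it. The list of places is made up for illustration, and it stands in for the values a person would type at the prompt.

```python
places = {}

# without initializing first, += has no existing value to add to,
# so Python raises a KeyError
try:
    places['Montana'] += 1
except KeyError:
    print('Montana is not in the dictionary yet')

# with the check, the first increment for each place starts from zero
for place in ['Montana', 'Iran', 'Montana']:
    if place not in places:
        places[place] = 0
    places[place] += 1

print(places)   # {'Montana': 2, 'Iran': 1}
```

The failed increment leaves the dictionary unchanged, so the counts come only from the loop.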

Here is a complete program using this function, which you can find in places_count.py:

def get_places():
    # create an empty dictionary
    places = {}
    while True:
        # get a place
        place = input('State or Country: ')
        # break if we are done
        if not place:
            break
        # if this place is not in the dictionary yet
        # then initialize this place to zero
        if place not in places:
            places[place] = 0
        # increment this place by one
        # this doesn't cause an error because we were sure
        # to initialize it to zero above
        places[place] += 1

    # return the dictionary
    return places

def main():
    places = get_places()
    print(places)


if __name__ == '__main__':
    main()

Counting words

For this program, we are going to count how many times each word occurs in a file. To do this, we need to ignore both case and punctuation. This is important because if the file contains:

Twinkle, twinkle, little star,
how I wonder, what you are!
Up above the world so high,
like a diamond in the sky.
Twinkle, twinkle, little star,
how I wonder what you are!

Then we need “Twinkle” to be counted the same as “twinkle”, and we need to remove commas and exclamation points.

Reading the file as a long string

When we want to count words in a file, we could read the file as a list of lines, like we usually do, and then split each line into words. However, a simpler thing to do is to read the file as one long string. Then you can split this long string into words all at once using split().

Here is how to read a file as one long string:

def readfile(filename):
    with open(filename) as file:
        return file.read()

This function uses file.read() instead of file.readlines():

file.read() — read an entire file and return it as one long string:

'Line one\nLine two\nLine three\n'

file.readlines() — read an entire file and return it as a list of strings, one per line in the file:

['Line one\n', 'Line two\n', 'Line three\n']
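
The difference can be checked with a short sketch. It writes a three-line file to a temporary folder (the file name lines.txt is just a placeholder) and reads it back both ways.

```python
import os
import tempfile

# create a small file in a temporary folder that is deleted afterwards
with tempfile.TemporaryDirectory() as folder:
    filename = os.path.join(folder, 'lines.txt')
    with open(filename, 'w') as file:
        file.write('Line one\nLine two\nLine three\n')

    # read() returns the whole file as one string
    with open(filename) as file:
        content = file.read()

    # readlines() returns a list with one string per line
    with open(filename) as file:
        lines = file.readlines()

print(repr(content))    # 'Line one\nLine two\nLine three\n'
print(lines)            # ['Line one\n', 'Line two\n', 'Line three\n']

# split() with no arguments splits on any whitespace, including newlines
print(content.split())  # ['Line', 'one', 'Line', 'two', 'Line', 'three']
```

Because split() treats newlines like any other whitespace, the long string can be broken into words in one step.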

Removing punctuation

To remove punctuation, we can use strip(). Normally, strip() removes all leading and trailing whitespace. But if we give it a string as an argument, it instead removes any leading and trailing characters that appear in that string.

For example, this will remove just exclamation points and question marks:

word = word.strip('!?')

To remove all punctuation, you could imagine trying to list all the punctuation characters in something like:

word = word.strip('.,?!#@$%^&*()')

However, with this strategy it is easy to overlook something. Instead, Python provides a string that contains all of the ASCII punctuation characters:

from string import punctuation
word = word.strip(punctuation)
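
Here is a short sketch of how strip(punctuation) behaves on a few sample words. The words are chosen for illustration; the point is that only the ends of the word are affected.

```python
from string import punctuation

# leading and trailing punctuation is removed
first = 'star,'.strip(punctuation)
second = '"are!"'.strip(punctuation)
print(first)    # star
print(second)   # are

# punctuation inside a word is left alone, because strip() stops
# at the first character on each end that is not in the argument
third = "don't!".strip(punctuation)
print(third)    # don't
```

This means contractions and hyphenated words keep their inner punctuation, which is usually what we want when counting words.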

A function to count words

Here is a function that will count words in a long string (containing multiple lines):

from string import punctuation

def count_words(content):
    """Count the number of each word in content.
    Ignore casing and punctuation."""
    # create an empty dictionary
    counts = {}
    # loop through all of the words, first converting to lowercase
    # and then splitting them using white space
    for word in content.lower().split():
        # strip any leading or trailing punctuation from the word
        word = word.strip(punctuation)
        # if the word is not in the dictionary,
        # initialize an entry to zero
        if word not in counts:
            counts[word] = 0
        # increment the count by one for this word
        counts[word] += 1
    # return the dictionary
    return counts

The two important things to notice here are:

we convert the content to lowercase using lower() before we split it into words using split()
we remove all of the punctuation using strip()

Otherwise, this follows the same pattern as counting places.
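
As a quick check, here is a sketch that runs count_words() on the first two lines of the rhyme. The second call uses a made-up string to show one edge case: a piece of text that is only punctuation becomes an empty string after strip(), and it is counted under the key ''.

```python
from string import punctuation


def count_words(content):
    counts = {}
    # lowercase first, then split on whitespace
    for word in content.lower().split():
        # strip leading and trailing punctuation
        word = word.strip(punctuation)
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts


counts = count_words('Twinkle, twinkle, little star,\nhow I wonder, what you are!')
print(counts)
# {'twinkle': 2, 'little': 1, 'star': 1, 'how': 1, 'i': 1,
#  'wonder': 1, 'what': 1, 'you': 1, 'are': 1}

# edge case: the lone '-' is stripped down to an empty string
edge = count_words('well - done')
print(edge)   # {'well': 1, '': 1, 'done': 1}
```

If empty strings should not be counted, one possible guard (an assumption, not part of the function above) is to skip the word when it is empty after stripping.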

The file count_words.py contains a complete program:

import sys
from string import punctuation


def readfile(filename):
    with open(filename) as file:
        return file.read()


def count_words(content):
    """Count the number of each word in content.
    Ignore casing and punctuation."""
    # create an empty dictionary
    counts = {}
    # loop through all of the words, first converting to lowercase
    # and then splitting them using white space
    for word in content.lower().split():
        # strip any leading or trailing punctuation from the word
        word = word.strip(punctuation)
        # if the word is not in the dictionary,
        # initialize an entry to zero
        if word not in counts:
            counts[word] = 0
        # increment the count by one for this word
        counts[word] += 1
    # return the dictionary
    return counts


def main(filename):
    # read the file
    content = readfile(filename)
    # count how many times each word appears
    counts = count_words(content)
    # print the counts dictionary
    print(counts)


if __name__ == '__main__':
    main(sys.argv[1])

You can run this using the file twinkle.txt:

% python count_words.py twinkle.txt
{'twinkle': 4, 'little': 2, 'star': 2, 'how': 2, 'i': 2, 'wonder': 2,
 'what': 2, 'you': 2, 'are': 2, 'up': 1, 'above': 1, 'the': 2, 'world': 1,
 'so': 1, 'high': 1, 'like': 1, 'a': 1, 'diamond': 1, 'in': 1, 'sky': 1}