Random Words, Sets and Instance Identity

Recently I’ve been testing database-related operations, and for that I need lots of small lists of random words. I also ran into cases where I wanted to remove all duplicate items from a list, which sent me down the path of understanding how Python’s set class works with instances of custom classes.

Building a list of random words for testing

I’m still in love with Python comprehensions. Here I use one inside a generator function to build a list of random words.

In [1]:
import string
import random
In [204]:
## Produce a word list generator where:
## wordlen is the length of each word
## listlen is the number of words that will eventually be generated
## charlist is the sequence of characters (here a string) to choose from
##
def wordlist(wordlen=10, listlen=10, charlist=string.ascii_uppercase):
    for _ in range(listlen):
        # Pick wordlen random characters and join them into one word.
        yield "".join([random.choice(charlist) for _ in range(wordlen)])

Demonstrate a wordlist

In [205]:
list(wordlist())
Out[205]:
['PNVEEHHJKD',
 'ZSQEIPZVQA',
 'ZLPMMMKCDA',
 'GBJLDDZCZG',
 'JAABASOGKQ',
 'RNVGMYFJBG',
 'PASFFOCRKH',
 'CGZYSCCVJS',
 'ELIOIQXYMF',
 'RQFSAMXKWE']

Demonstrate a word list with a different set of characters.

In [207]:
list(wordlist(charlist=string.ascii_letters))
Out[207]:
['jiEYIVSwtM',
 'aLSxzKSOph',
 'FLykQrpaIQ',
 'eTVzQGxYUC',
 'UCcTakLbgZ',
 'BmqUfdyKAw',
 'yUFqkAAsjf',
 'SbivFeUqGp',
 'PocykjUfKq',
 'uYqukzslRa']
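
Since these lists are meant for tests, it can be handy to get the same words on every run. The following is a minimal sketch of one way to do that, not part of the original notebook: seeded_wordlist and its seed parameter are hypothetical names, and the function draws from its own random.Random instance so the global random state is left alone.

```python
import random
import string

def seeded_wordlist(seed, wordlen=10, listlen=10, charlist=string.ascii_uppercase):
    # Hypothetical variant of wordlist: a private generator seeded by the caller.
    rng = random.Random(seed)
    for _ in range(listlen):
        yield "".join(rng.choice(charlist) for _ in range(wordlen))

# The same seed yields the same words each time.
first = list(seeded_wordlist(seed=42, wordlen=4, listlen=3))
second = list(seeded_wordlist(seed=42, wordlen=4, listlen=3))
print(first == second)
```

The actual words depend on the seed and the Python version, so a test should compare two runs rather than hard-code the expected words.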

Removing duplicates

First, set up a word list with some unique words to start with.

In [208]:
ulist = list(wordlist(wordlen=4, listlen=3))
ulist
Out[208]:
['JGWH', 'YOAP', 'TMBN']

Repeatedly call choice over ulist, which will produce duplicates. Note that ulist must be a list because choice picks an element by index, and sets are not index addressable.

In [209]:
duplist = [random.choice(ulist) for _ in range(10)]
duplist
Out[209]:
['JGWH',
 'TMBN',
 'JGWH',
 'JGWH',
 'TMBN',
 'TMBN',
 'YOAP',
 'TMBN',
 'TMBN',
 'YOAP']
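
To see why the list matters, here is a small sketch of what happens when choice is handed a set instead. The variable names are only for illustration; the point is that a non-empty set raises TypeError because it cannot be indexed, and converting it to a list first fixes that.

```python
import random

words = {"JGWH", "YOAP", "TMBN"}  # a set, so not index addressable

try:
    random.choice(words)
    raised = False
except TypeError:
    # choice tries to index into its argument, which a set does not support.
    raised = True

print(raised)

# Converting to a list first makes the call valid.
picked = random.choice(sorted(words))
print(picked in words)
```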

Demonstrate using the set class to show removal of duplicates.

In [212]:
set(duplist)
Out[212]:
{'JGWH', 'TMBN', 'YOAP'}
In [215]:
# list.sort() sorts in place and returns None, so compare sorted copies instead
sorted(set(duplist)) == sorted(ulist)
Out[215]:
True

Note: sets are unordered by definition, so the order in which a set prints is not something you can rely on or control.
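
If the original order does matter, one common alternative is to go through a dict rather than a set. This is a sketch under the assumption of Python 3.7 or later, where dicts keep insertion order; duplist is copied from the output above so the example stands on its own.

```python
duplist = ['JGWH', 'TMBN', 'JGWH', 'JGWH', 'TMBN',
           'TMBN', 'YOAP', 'TMBN', 'TMBN', 'YOAP']

# dict keys are unique and remember the order of first insertion,
# so this drops duplicates while keeping first-seen order.
deduped = list(dict.fromkeys(duplist))
print(deduped)  # ['JGWH', 'TMBN', 'YOAP']
```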

Sets on instances of a defined class

I have some cases where I would like to strip duplicate objects out of a list. I can use the set class approach for this too. However, in order to do so, I have to implement the __eq__ and __hash__ methods in the class: a set uses __hash__ to locate candidate matches and __eq__ to confirm them, so together they define unicity for the class.

A counter example: class without unicity

As an example, here is an item class that does not implement __eq__ and __hash__.

In [185]:
class item:

    def __init__(self, key, data=None):
        self.key = key
        self.data = data

    def __repr__(self):
        return('item(key={},data={})'.format(self.key,self.data))

This definition of item has neither __eq__ nor __hash__. I want the following two item objects to be considered the same.

In [186]:
a = item('THAT')
b = item('THAT',data=4)
In [190]:
print(a.__hash__())
print(b.__hash__())

278991750
-9223372036575784062

Two different hash values indicate that these two instances are not considered to be the same thing.

In [191]: 
a == b
Out[191]:
False
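
The consequence for sets can be sketched as follows. PlainItem is a hypothetical stand-in for the item class above (same attributes, no __eq__ or __hash__), so equality and hashing fall back to object identity and a set keeps both instances.

```python
class PlainItem:
    # Hypothetical stand-in for item: no __eq__ or __hash__ defined.
    def __init__(self, key, data=None):
        self.key = key
        self.data = data

a = PlainItem('THAT')
b = PlainItem('THAT', data=4)

# With the default identity-based comparison, an object only equals itself.
print(a == a, a == b)  # True False

# Both instances survive in the set even though their keys match.
distinct = {a, b}
print(len(distinct))  # 2
```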

Class with unicity definition

Now redefine item to include __eq__ and __hash__, both based on the key attribute.

In [216]:
class item:

    def __init__(self, key, data=None):
        self.key = key
        self.data = data

    def __repr__(self):
        return('item(key={},data={})'.format(self.key,self.data))

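    # In my case, equal keys are sufficient to consider two objects equal.
    def __eq__(self, other):
        if not isinstance(other, item):
            return NotImplemented  # let Python handle comparisons with other types
        return self.key == other.key

    # Use the key to produce the hash so that equal items hash alike.
    def __hash__(self):
        return hash(self.key)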

Now these two objects are considered to be the same.

In [217]:
a = item('THAT')
b = item('THAT',data=4)
In [218]:
print(a.__hash__())
print(b.__hash__())

-3822845408751240381
-3822845408751240381

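Identical hashes are what we need here, but on their own they only mean the two objects may be equal; the final decision is made by __eq__.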

In [198]:
a == b
Out[198]:
True

Now a and b are considered equal even though they may have different data.
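
It is worth separating the two roles of __hash__ and __eq__. The sketch below uses a hypothetical SameBucket class whose hash is deliberately constant, to illustrate that equal hashes alone do not merge objects in a set; __eq__ still has to agree.

```python
class SameBucket:
    # Hypothetical class for illustration only.
    def __init__(self, key):
        self.key = key

    def __eq__(self, other):
        if not isinstance(other, SameBucket):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        return 0  # deliberately poor: every instance collides

x = SameBucket('THIS')
y = SameBucket('THAT')

print(hash(x) == hash(y))  # True
print(x == y)              # False
print(len({x, y}))         # 2
```

A constant hash like this is legal but slow for large sets, which is why hashing the key itself is the more usual choice.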

Sets of items using unicity by key

Now a set of item objects reduces to a collection of items with distinct keys. Note that which instance is kept in the set should be treated as arbitrary. Of the three items with key ‘this’, the run below happened to keep the first one, but I would not rely on that.

In [199]:
{item("this",data=3),item("this",data=44),item("this",data=220),item("that")}
Out[199]:
{item(key=that,data=None), item(key=this,data=3)}
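
When it matters which duplicate survives, the choice can be made explicit by keying a dict on the key attribute instead of relying on the set. This is a minimal sketch, not the approach used above: Item, first_by_key and last_by_key are hypothetical names, and it assumes Python 3.7 or later so that dicts keep insertion order.

```python
class Item:
    # Hypothetical stand-in with the same attributes as item.
    def __init__(self, key, data=None):
        self.key = key
        self.data = data

def first_by_key(items):
    # Keep the first instance seen for each key.
    kept = {}
    for it in items:
        kept.setdefault(it.key, it)
    return list(kept.values())

def last_by_key(items):
    # Later instances overwrite earlier ones that share a key.
    return list({it.key: it for it in items}.values())

items = [Item("this", data=3), Item("this", data=44),
         Item("this", data=220), Item("that")]

first = [(it.key, it.data) for it in first_by_key(items)]
last = [(it.key, it.data) for it in last_by_key(items)]
print(first)  # [('this', 3), ('that', None)]
print(last)   # [('this', 220), ('that', None)]
```

Neither helper needs __eq__ or __hash__ on the class, since the dict hashes the key values directly.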