Find unique words in a sentence or paragraph using Python

In this article, we will see how to find the unique words in a sentence, a paragraph, or a file using Python's set data type. This is useful, for example, when creating word clouds.

If you want to know more about sets and the methods associated with them, check out our previous article.

Let us quickly recap what a set is and what it does.

What is a set?

A set is an unordered, unindexed collection whose elements are always unique. In Python, sets are written with curly braces. Because duplicates are discarded, we can use the set class to remove duplicate elements from an iterable.
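As a quick illustration, here is a minimal sketch of these properties. The variable names and values are made up for the example.

```python
# A set literal uses curly braces; duplicates are dropped automatically.
fruits = {"apple", "banana", "apple", "cherry"}
print(len(fruits))  # 3

# set() removes duplicates from any iterable, such as a list.
numbers = [1, 2, 2, 3, 3, 3]
unique_numbers = set(numbers)
print(sorted(unique_numbers))  # [1, 2, 3]

# {} creates an empty dict, so an empty set is written as set().
empty = set()
print(type(empty).__name__)  # set
```

Since a set is unordered, the example sorts the elements before printing them so the result is predictable.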


We now know that a set holds only unique elements, and we will use this property to find the unique words in a paragraph. The idea is to add the words to a set one by one: duplicates are discarded, so the set ends up containing each word exactly once.

Example

Let us follow these steps to get what we want.

  1. Create a set.
  2. Open a text file.
  3. Read the lines from the file one by one.
  4. Split each line into words using the split() method.
  5. Convert each word to lowercase using the lower() method.
  6. Add the words to the set using the add() method.
  7. The result is a set containing only the unique words.
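The steps above can be sketched as follows. The file name poem.txt and the three sample lines are assumptions made for illustration; the file is written to a temporary directory only so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path

# Sample text (an assumption for this sketch); note that "And" and "I" repeat.
sample_text = (
    "Two roads diverged in a yellow wood,\n"
    "And sorry I could not travel both\n"
    "And be one traveler, long I stood\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "poem.txt"  # hypothetical file name
    file_path.write_text(sample_text, encoding="utf-8")

    unique_words = set()                             # 1. create a set
    with open(file_path, encoding="utf-8") as text_file:  # 2. open the file
        for line in text_file:                       # 3. read line by line
            for word in line.split():                # 4. split into words
                unique_words.add(word.lower())       # 5, 6. lowercase and add

# 7. the set now holds each word once: 21 words in the text, 19 unique
print(len(unique_words))  # 19
```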

We can simplify the code with the update() method, which adds every item from one or more iterables to the set, so all the words of a line can be added in a single call. The syntax is A.update(iterable).
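A minimal sketch of the update() version is shown below. To keep it short, it loops over a list of strings (sample lines chosen for illustration) rather than a file; a file object can be iterated line by line in the same way.

```python
# Sample lines (an assumption for this sketch).
lines = [
    "Two roads diverged in a yellow wood,",
    "And sorry I could not travel both",
    "And be one traveler, long I stood",
]

unique_words = set()
for line in lines:
    # update() adds every word from the list returned by split();
    # words already in the set are ignored.
    unique_words.update(line.lower().split())

print(len(unique_words))  # 19
```

Lowercasing the whole line before splitting replaces the inner loop over individual words.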

We can also reduce this code to a single line.
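A single-line version might look like the sketch below. The file name poem.txt and the sample lines are assumptions made for illustration, and the temporary directory is there only so that the snippet runs on its own; the one-liner itself is the assignment to unique_words.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    poem_path = Path(tmp_dir) / "poem.txt"  # hypothetical file name
    poem_path.write_text(
        "Two roads diverged in a yellow wood,\n"
        "And sorry I could not travel both\n"
        "And be one traveler, long I stood\n",
        encoding="utf-8",
    )

    # The one-liner: read the file, lowercase it, split on whitespace, build a set.
    unique_words = set(poem_path.read_text(encoding="utf-8").lower().split())

print(unique_words)
```

This reads the whole file into memory at once, which is fine for small texts; for very large files the line-by-line loop is the safer choice.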

The output of this one-liner is:

{'same,', 'knowing', 'i', 'if', 'better', 'undergrowth;', 'lay', 'telling', 'sorry', 'way', 'diverged', 'black.', 'though', 'passing', 'traveler,', 'oh,', 'stood', 'another', 'bent', 'that', 'difference.', 'kept', 'ages', 'travel', 'where', 'wear;', 'back.', 'how', 'somewhere', 'first', 'to', 'just', 'worn', 'not', 'all', 'day!', 'could', 'morning', 'be', 'looked', 'this', 'no', 'a', 'one', 'leaves', 'leads', 'with', 'ever', 'other,', 'way,', 'having', 'sigh', 'then', 'two', 'should', 'step', 'traveled', 'because', 'has', 'trodden', 'by,', 'made', 'less', 'yet', 'and', 'about', 'fair,', 'them', 'equally', 'there', 'took', 'on', 'hence:', 'far', 'as', 'perhaps', 'come', 'had', 'really', 'wood,', 'in', 'the', 'down', 'shall', 'for', 'long', 'claim,', 'was', 'wanted', 'doubted', 'roads', 'both', 'it', 'grassy', 'yellow'}
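Notice that split() separates words on whitespace only, so punctuation stays attached: 'way' and 'way,' are counted as two different words. If that matters, one possible approach (a sketch, not the only way) is to strip leading and trailing punctuation from each word with string.punctuation before building the set.

```python
import string

# A few entries taken from the output above.
raw_words = {"way", "way,", "other,", "black.", "day!"}

# Strip punctuation from both ends of each word; skip anything left empty.
clean_words = {
    word.strip(string.punctuation)
    for word in raw_words
    if word.strip(string.punctuation)
}

print(sorted(clean_words))  # ['black', 'day', 'other', 'way']
```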

Using a set is one of the simplest ways to find the unique elements of an iterable. Hope this article was helpful.

Happy coding!