Using Set Operations on Ruby Arrays

The array construct in Ruby is a very versatile tool. In this blog post we introduce the set operations that are supported on ruby arrays. We discuss cases in which these set operations can be exploited to write eloquent and idomatic ruby code. We also introduce the Set class and consider under what circumstances you might wish to consider an actual Set over an Array.

Introduction

The ruby language has always valued developer productivity and eloquent, expressive code. Arrays are a cornerstone of most programming languages and, as such, the ruby Array is packed full of syntactic delights. Amongst the extensive Array interface are a number of set operations that provide some excellent additions to your arsenal for writing terse and expressive ruby code.

In this post we will look at the set operations offered on the ruby Array class, where you might choose to use these operations and when you might want to opt for an actual Set.

Array set operations

A set is a fundamental data construct that contains zero or more members and guarantees the uniqueness of those members. There are a number of common set operations that are supported by ruby arrays. The operations to be discussed in this post are union, intersection and difference. The last of these is commonly discussed in the context of ruby array set operations and, whilst a useful operation, I contest that ruby's implementation is not strictly a set operation, but more on that later.

Let's look at each of these operations in turn.

Union

The union of two sets $A$ and $B$, $A \cup B$, will return all the members that are either in set $A$, in set $B$ or in both. It can be visualized as shown in the Venn diagram.
If $A = \{ 1, 2, 3, 4 \}$ and $B = \{ 2, 4, 6, 8 \}$, then the union $A \cup B = \{ 1, 2, 3, 4, 6, 8 \}$.

In ruby we can form the union of two arrays using the | operator as follows:

    
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]

ministry | gryffindor      # Returns ["weasley", "fudge", "shacklebolt", "granger", "potter", "longbottom"]

Whilst arrays do not require that the entries are unique, this is a fundamental requirement of a set. As a result, our union operation on our arrays, ministry | gryffindor, only returns the unique entries. This fact gives us an interesting alternative to #uniq for returning the unique elements in an array:

    
A = ["foo", "bar", "foo", "baz", "foo", "bar"]
A.uniq     # Returns ["foo", "bar", "baz"]
A | []    # Returns ["foo", "bar", "baz"]

Asides from this slightly esoteric application the array union operator provides a neat way to incrementally build lists of distinct items. For example, imagine a helper method for building an HTML element that could be embellished with a number of different classes

    
module CustomTagHelper
  def build_tag(text_content, options = {})
    content_tag(:span, text_content, {
      class: [
        "custom_class_a",
        "custom_class_b"
      ] | (options[:class] || "").split(" ")
    }.merge(options.slice(:style, :title)))
  end
end

In this example we make use of the content_tag view-helper method exposed by Rails (see here). Our CustomTagHelper module defines our own helper method build_tag, which takes a string of content and will stick it into a span tag. It also allows the caller to pass a number of options which will be used to embellish the tag with a limited set of HTML attributes. The style and title attributes can be passed in the options hash and will be added as attributes to the element. More interestingly, the client code can also pass a string of classes within the options[:class], and these will be merged into a list of default classes which are applied to the element. Building a unique array of HTML class names is a frequent task, and the array union operator has proven a very useful and concise tool to use in such cases.

Intersection

The intersection of two sets $A$ and $B$, $A \cap B$, will return all the members that are present in both set $A$ and $B$. It can be visualized as shown in the Venn diagram with the red-coloured region representing the intersection.
If $A = \{ 1, 2, 3, 4 \}$ and $B = \{ 2, 4, 6, 8 \}$, then the intersection $A \cap B = \{ 2, 4 \}$.

The interseciton of two sets will return a set that includes only the elements that exist in both sets. The result is indicated by the red region in the diagram.

In ruby we can form the union of two arrays using the & operator as follows:

     
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]

gryffindor & ministry     # Returns ["weasley"]

As for the union operation our intersection operation, gryffindor & ministry, will remove all duplicate entries in the resulting array. And once again, this provides another interesting way to get the unique elements in an array:

    
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
ministry & ministry     # Returns ["weasley", "fudge", "shacklebolt"]
ministry.uniq     #  Returns ["weasley", "fudge", "shacklebolt"]

The intersection operator can be useful when filtering some user-provided values against an allow-list of accepted terms. For example, we could imagine the following code appearing in a Rails controller:

    
class FoosController < ApplicationController
  ALLOWED_PROPERTIES = %w( name description updated_at )

  def index
    sanitized_properties = params[:properties] & ALLOWED_PROPERTIES
    @foos = Foo.all.pluck(*sanitized_properties)
  end
end

In this example we can imagine that users will hit the FoosController#index endpoint with a list of :properties that they wish to retrieve for each Foo instance. It is probably a bad idea to allow users to see any detail they choose, so to restrict the properties that they can actually retrieve we filter the user-supplied params[:properties] through the allow-list defined by ALLOWED_PROPERTIES. I have found the array intersection operation a neat tool to use in similar filtering situations.

Difference

The difference between sets $A$ and $B$, $B-A$, will return the members that are in set $B$ that are not in set $A$. It can be visualized as the coloured region in the Venn diagram.
If $A = \{ 1, 2, 3, 4 \}$ and $B = \{ 2, 4, 6, 8 \}$, then the difference $B - A = \{ 6, 8 \}$.

In ruby we can take the difference between two arrays as follows:

    
ministry = ["weasley", "fudge", "shacklebolt", "weasley"]
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]

gryffindor - ministry     # Returns ["granger", "potter", "longbottom"]

In this example, the elements that are in gryffindor but not in ministry are returned. However, I am unsure if we can really classify this as a set operation, because duplicates in the first array can remain after the operation (i.e. this method does not return a set, small s). Consider, for example:

    
gryffindor = ["weasley", "granger", "potter", "longbottom", "weasley"]
gryffindor -  ["shacklebolt", "potter"]     # Returns ["weasley", "granger", "longbottom", "weasley"]

Here the duplicate weasley entry remains in the output. So it seems that the difference operator (-) does not return a set. Interestingly the ruby v2.7 docs for this operator, make an acknowledgement of this non-set-like behaviour with this remark:

If you need set-like behavior, see the library class Set.

However, this note is actually removed from the documentation in the ruby v3+ docs. Not sure why, but there you go. Nonetheless, this remark in the documentation does get us thinking: If we are looking for set-like behaviour why aren't we using the Set class rather than using the Array class as our proverbial hammer?

When to use `Set`

The truth is, probably, that we stick with the Array class due to its familiarity and flexibility.

Familiar because, as a data structure, arrays are ubiquitous in programming and all developers will be comfortable reasoning in terms of arrays.
Flexible because the Array interface is more extensive than that of Set. This means that, for better or worse, we can easily reuse the same array in multiple different ways. We can treat our array like a set (albeit without any in-built uniqueness guarantees), but then we can go ahead an use array methods also, like indexing into the array etc.

So if arrays provide all the tools we need, why would I ever choose to use a Set?

Asides from the semantic benefits to our code, the main powers of Set are two-fold:

Entries are guaranteed to be unique
Checking for inclusion in the set is extremenly fast

Relating to this second point, if we want to see if the Set A contains a member, then the call A.include?(member) will be very fast. And as an extension to this, the set operations for #subset? and #superset? will also be very efficient.

A simple benchmark

To give an idea of the relative performance merits of Set versus Array we can run a very simple benchmark:

    
require 'benchmark'
require 'set'

def generate_random_string(len=6)
  (0..len).map{ ('a'..'z').to_a[rand(26)] }.join('')
end

n=100_000
big_set = nil
big_arr = nil
Benchmark.bmbm do |x|
  x.report("Building Set")   { big_set = Set.new; n.times { big_set << generate_random_string } }
  x.report("Building Array") { big_arr = Array.new; n.times { big_arr << generate_random_string } }
  x.report("Set#include?")   { 1_000.times{ big_set.include?(generate_random_string) } }
  x.report("Array#include?") { 1_000.times{ big_arr.include?(generate_random_string) } }
end

This benchmark code defines a generate_random_string method which builds a 6 character string with lowercase letters. We call this method in two separate loops to populate a large Set and a large Array. We report on the time taken to build these two separate data structures. The benchmark then repeatedly (1000 times) tries to find a random string in each of the big_set and the big_array. Again, we report on the time it takes to do the 1000 searches against the Set and how long it takes to complete the searches against the Array. The results are below:

    
                     user     system      total        real
Building Set     3.152817   0.019997   3.172814 (  3.174221)
Building Array   2.948688   0.016002   2.964690 (  2.965641)
Set#include?     0.027470   0.000000   0.027470 (  0.027548)
Array#include?   6.498631   0.007982   6.506613 (  6.509682)

We should not read too much into the absolute values here, as we make no attempt to understand the impact of the calls to generate_random_string. However, these costs should be the same for both the Array and the Set loops, so the relative timings should be significant.

We can see that building the Set burns more time than building the Array. This stands to reason, as we build the Set there will need to be internal checks to ensure that we are not adding a duplicate entry. By contrast, the Array does not care if the entry is a duplicate, so adding an element to an Array is quicker than adding to a Set of equivalent size.

When it comes to presence checks (i.e. #include?) we can see tha the Set absolutely smokes (technical term) the Array. When a Set checks for existence it can use a hash-based lookup, but the Array needs to check every item, which can get quite slow then the array is large.

Summary

In this post we have introduced the set operations for union (|) and intersection (&) that are exposed on ruby arrays, and we have seen a couple of simple scenarios where these operators prove their worth.

We have also discussed the difference operator (-) which, while convenient, is perhaps miscategorised as a set operation.

Finally we took a look at why you might choose to use a Set rather than an Array, with efficient lookups over large unique collections being the primary use case for the Set class.

In my own experience I use the Set class very infrequently, typically reaching for an Array instead. It seems that most array instances are small thow-away data constructs, for which the true value of Set will not be very evident. However, in writing this post I have resolved to keep an open mind to using the Set class more liberally in my day-to-day programming. Even when the lookup efficiency is not a significant, enforcing of a unique-membered-collection has undoubtedly got some benefits, even just from a semantic perspective.

References

Ruby docs for Array class
Ruby docs for Set class
Wikipedia entry for set union operation
Wikipedia entry for set intersection operation
Wikipedia entry for set difference operation
Please note that the Venn diagrams reproduced in this blog post have been borrowed from Wikipedia
Rails documentation for the content_tag view helper method
Ruby v2.7 docs for Array - operator
Wikipedia entry for Venn diagrams

Using Set Operations on Ruby Arrays

Introduction

Array set operations

Union

Intersection

Difference

When to use `Set`

A simple benchmark

Summary

References

Comments

Got your own view or feedback? Share it with us below …

Subscribe

Using Set Operations on Ruby Arrays

Introduction

Array set operations

Union

Intersection

Difference

When to use Set

A simple benchmark

Summary

References

Comments

Got your own view or feedback? Share it with us below …

Subscribe

When to use `Set`