Set union problem

Today I had the privilege of a job interview with one of the leading companies in the online streaming music space. I’d like to think the interview went well, although I was incredibly nervous and my brain decided it was going to operate in a way that suggested it was wading through treacle; but I digress. During the interview I was asked an algorithmic question and I have to admit I was initially quite flummoxed. This post is about that question.

I spend quite a lot of time trying to solve algorithmic programming puzzles. Those that I find interesting I blog about. For whatever reason I’d never come across this problem before. It was a great question but the solution was not immediately obvious to me. Anyway, with the help of some (lots!) of prompting from the interviewer I think we got there in the end.

I can’t say this was my finest hour and I left the interview feeling slightly miffed with myself. Still, as amazing as I am I don’t know everything and so to make sure I never forget how to solve this problem I now present it here along, with the solution we eventually derived during the interview.

The problem

Abstract

It’s a little tricky to explain (the interviewer had to explain it to me about 4 times before I understood it – which probably says more about me than anything else), but here goes…

You have a collection of strings. Implement an algorithm that will union any string with any other string that has any matching character. Continue this process until there are no more matches to be found and what remains is a smaller collection of strings that contain unique characters. Any strings that can’t be unioned with any other string shall remain in the collection untouched.

Detail

To make that clearer let’s look at an example.

Suppose have the following vector V that contains the strings S1 through Sn.

V = [
   S1{ "abc" },
   S2{ "def" },
   S3{ "cf" },
   S4{ "ghi" },
   S6{ "jkl" },
   S7{ "mno" },
   S8{ "pqtl" }
]

Following the transformation applied by the algorithm, this should be the result.

V = [ S1{ "abcdef" }, S4{ "ghi" }, S6{ "jklpqt" }, S7{"mno"} ]

We get to this point because S1 shares a common character with S3 (‘c’), which in term shares a common character with S2 (‘f’) and so are merged. Likewise, S6 shares a common character with S8 (‘l’) and so are merged. The remaining strings share no common characters so just stay as they are in the collection.

The solution

Initial attempt

My initial thoughts on how to solve this were to use recursion (you can solve anything with recursion, right?). The idea being that you pass the vector and current string position to a function. It then checks the rest of the vector for matches with the string whose position you passed in. If it finds a match it calls itself with the vector and the position of that string. If it finds no other match it just returns the string at the original position found.

In this way, as the stack unwinds you then just merge the current string with the one coming down the stack and then return the newly merged string. Once the stack has finished unwinding you’ll have all the valid merges for the target string and you can then move on and start checking the next one.

In fact, this probably wouldn’t work very well for a number of reasons:

  • To prevent infinite recursion (well, until your stack blows up), you’d have to remove the current string out of the vector before making the recursive call. For example, from S1 I match S3, I make the recursive call and I now match S2, I make a recursive calls and I now match S3, I make a recursive call and I match S2 and so on… I think you can probably see where this is heading.
  • Removing items from the vector isn’t necessarily that simple. You’re already in the process of enumerating the vector and there is a good chance that if you’re not careful you’ll invalidate it and you’ll end up iterating up your own tail pipe.
  • Even if you manage to resolve problems 1 and 2 you still have the problem that unless you come up with a fancy tail-recursion implementation, your recursive solution will quite probably blow the stack if the number of strings was large

So, in summary, I suspect it is quite possible to implement a recursive solution for this but it doesn’t take too long thinking about it to realise that it’s probably not really the way to go.

A better attempt

On this occasion, the best solution is iterative. The idea is to focus on one target string at a time and keep looking for matches until no more can be found and then move on to the next. Let’s walk it through.

We need 3 loops. The outer loop is a control loop, that will keep repeating until we find there are no more candidates for unification. The next loop will iterate the strings in the vector and, likewise, so will the inner loop. We should have something like this.

while not done
   done = true
   for outer in vector
      for inner in vector
         if no merges then
           done = false
         end
      done
   done
done

So, upon initialisation outer is S1 and inner is S1. There is no point in comparing these as they are the same thing so we skip inner on one. Next, inner is S2 and there is no match so we skip on again. Now inner is S3 and there is a match so we union S1 (outer) and S3 (inner) and store the result in S1 (as this is currently the target). We now remove S3 because we’ve dealt with it. We try S4 through S8 and find no more matches so this stage is done.

Now we have to process S1 again because during the last stage we merged new stuff into it so we might now be able to find another match. As it happens, S2 will now match because S1 now contains the contents of S3, which means it contains the letter ‘f’. Again, we merge S2 into S1 and then trip on through the remaining strings. We should find no more matches.

Again, we’ve updated the target string so once again we need to iterate through and make sure nothing else matches. This time there are no more matches so S1 is now complete. We can now move on to the next remaining string, which is S4. We now repeat this process again and keep going until we’ve exhausted all the target strings (ie. there are no more matches to be found). At this point we should have a much smaller set of strings, all with unique letters in them.

I hope that makes sense. As you can see, it’s not that easy to explain. Hopefully, some code will help. Today I’ve chosen C++ as the language. Whilst I normally try and present algorithms in Python I wanted to do this in C++ as it was a new problem for me and I tend to think new things through better when coding them in C++, just purely because that is the language I know best. As some point I might also add a Python version :)

#include
#include
#include
#include
#include

using namespace std;

class merger
{
public:
   merger(vector< string > const & vv)
      : vv_(vv)
   {
      process();
   }

   vector< string > const & get() const
   {
      return vv_;
   }

private:
   void process()
   {
      vector< string >::iterator oitr = vv_.begin();
      string tmp;

      bool more = true;
      do
      {
         more = false;

         while(oitr != vv_.end())
         {
            vector< string >::iterator iitr = oitr + 1;
            bool changed = false;

            while(iitr != vv_.end())
            {
               if(find_first_of(
                  oitr->begin(), oitr->end(),
                  iitr->begin(), iitr->end()) != oitr->end())
               {
                  sort(oitr->begin(), oitr->end());
                  sort(iitr->begin(), iitr->end());

                  tmp.resize(oitr->size() + iitr->size(), char());

                  string::const_iterator itr = set_union (
                     oitr->begin(), oitr->end(),
                     iitr->begin(), iitr->end(),
                     tmp.begin());

                  tmp.erase(itr, tmp.end());

                  oitr->swap(tmp);

                  iitr = vv_.erase(iitr);
                  changed = more = true;
               }
               else
               {
                  ++iitr;
               }
            }

            if (! changed)
               ++oitr;
         }
      }
      while(more);
   }

private:
   vector< string > vv_;
};

int main()
{
   vector< string > vv;
   vv.push_back("abc");
   vv.push_back("def");
   vv.push_back("cf");
   vv.push_back("ghi");
   vv.push_back("jkl");
   vv.push_back("mno");
   vv.push_back("pqtl");

   merger m(vv);

   copy(vv.begin(), vv.end(), ostream_iterator(cout, " "));
   cout << endl;

   copy(m.get().begin(), m.get().end(), ostream_iterator(cout, " "));
   cout << endl;
}

And we’re done

So, there you have it. I hope you found this as interesting as I did. If you have your own solution for this problem I’d welcome you posting in the comments below.

8 thoughts on “Set union problem

  1. Wow, took me at least four times until I understood the task 🙂

    I think there’s a subtle problem in the explanation that makes it even harder to understand: in the final sentence “Any strings that can be unioned with any other string shall remain in the collection untouched.” that should probably be “Any strings that cannot be unioned.”

    Anyway, it turns out my initial solution to the problem is quite similar to yours (I’ve no idea how to format code here, oh well…):

    #include 
    #include 
    #include 
    #include 
    #include 
    
    using namespace std;
    
    void merge(vector&amp;amp; v)
    {
       bool any;
    
       do
       {
          size_t i = 0;
          any = false;
    
          while (i &amp;lt; v.size())
          {
             size_t j = i + 1;
    
             while (j &amp;lt; v.size())
             {
                if (v[i].find_first_of(v[j]) != string::npos)
                {
                   sort(v[i].begin(), v[i].end());
                   sort(v[j].begin(), v[j].end());
                   string tmp;
                   set_union(v[i].begin(), v[i].end(),
                             v[j].begin(), v[j].end(),
                             std::back_inserter(tmp));
                   v[i] = tmp;
                   v.erase(v.begin() + j);
                   any = true;
                }
                else
                {
                   ++j;
                }
             }
    
             ++i;
          }
       }
       while (any);
    }
    
    int main()
    {
       vector v;
       v.push_back(&quot;abc&quot;);
       v.push_back(&quot;def&quot;);
       v.push_back(&quot;cf&quot;);
       v.push_back(&quot;ghi&quot;);
       v.push_back(&quot;jkl&quot;);
       v.push_back(&quot;mno&quot;);
       v.push_back(&quot;pqtl&quot;);
    
       merge(v);
    
       for (const std::string&amp;amp; s : v)
       {
          cout &amp;lt;&amp;lt; s &amp;lt;&amp;lt; endl;
       }
    }
    

    After that I looked at your code and I think there's a problem: at the point where you vv.erase(iitr), your oitr is invalidated. It's pure luck if that code doesn't crash. 😉

  2. Hi mhx,

    Thanks for spotting that typo (and sorry if it caused you any headache). I’ve now corrected it.

    Regarding the iterator problem, since oitr is always behind iiter it shouldn’t be a problem because the C++ Standard guarantees that only the iterator being erased and any after it are invalidated. Any before it are still valid.

    http://www.cplusplus.com/reference/vector/vector/erase/

    “Iterators, pointers and references pointing to position (or first) and beyond are invalidated, with all iterators, pointers and references to elements before position (or first) guaranteed to keep referring to the same elements they were referring to before the call.”

    I’ve tested this code in Visual Studio with its secure CRT checks enabled (which will assert if you try to access any invalid iterators) and it reported no problems. Of course, it may just be that I’ve mis-understood your observation. If so, please to let me know and I’d be happy to take another look (and fix).

    Regards.

  3. HI Parvel,

    Well, the truth is I don’t really think I could claim to have solved in during the interview. Unfortunately, it took quite a bit of prompting from the interviewer; however, I will say that I was extremely nervous (it’s a job I’d quite like to get) and didn’t help. =/

  4. I’m sure you’ll find much better place.
    I do not like such questions of the interview. Almost always people asks you such questions and then, when you already works there, you perform very standard tasks absolutely not related to such questions. Even more – these people who asked you on the interview, personally are very bad programmers.

  5. Hey Ricky, you’re absolutely right regarding the iterator validity. I could blame it on the fact that I wrote this at 1am, but I just wasn’t aware that the standard made this guarantee. 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s