5/10/19

Simplified Algorithm to find largest Color Blob in a Grid (in Python and C++)

I have taught how to use algorithms and data structures for years as a professor. And many students think that just learning about algorithm and data structure is about choosing the correct implementation of a queue, search, standard graph algorithm, etc. from the either a library or canned code they learned in their class.

The best students can use a deep knowledge of how each algorithm and data structure works and then design variations or combinations of the original structures and algorithms to solve a problem. This shows an real application of knowledge and the ability to think out of the box that interviewers look for.

This article will show how I created a custom algorithm that combines a 2d array, and linked list, and some other 'tricks' to solve the problem of given a 2d grid where the cells of the grid are colored(based in integer in grid). I believe this is a very efficient, and with a small amount of code to solve this problem.

I was inspired to develop this method after the article: Bet you can't solve this Google interview question! by Kevin Ghadyani and then also by the Geeksforgeeks article: Largest connected component on a grid .

The Problem:

Find the size of the largest contiguous color area (the number cells in the color blob). For example given this grid with 4 different colors

[[1, 4, 4, 4, 4, 3, 3, 1], 
 [2, 1, 1, 4, 3, 3, 1, 1], 
 [3, 2, 1, 1, 2, 3, 2, 1], 
 [3, 3, 2, 1, 2, 2, 2, 2], 
 [3, 1, 3, 1, 1, 4, 4, 4], 
 [1, 1, 3, 1, 1, 4, 4, 4]]

... which represents this colored grid:

You can see the biggest color group are the underlined 1's, there are 9 of them

The approach:

There are a lot of ways to approach this problem. Most articles will create a graph and do depth first search recursivly or bredth first search. You need to keep track of what cells have been visited and have a way to pick the next un-processed cell.

When every you are working a problem with a grid that is represented by a 2d array, you don't need to use a generalized graph alorithm such as a graph library that keeps a map of nodes and list of adjacent nodes. The indexes to the 2d array serve as a constant time lookup, and the adjacentcy list is just the four directions left, right, top or bottom limited just limited at the outer bounds of the gird. You do need an effencient way to keep track of visited and to pick a unvisited cell to continue. You can easily do this by having each cell(node) have composit data of not just the color number, but a visited flag.

Below is both a python 3.6 version and a C++ version:

Python 3.6:

"""
Finding largest contiguous region of same color (or integer value) in grid.
bfs is run to find all connected cells for each region,
each cell is linked into a double linked list builtin to grid and counted(visited) flag
"""
class Node:
def __init__(self, x, y, color, prev, next):
self.x = x
self.y = y
self.color = color
self.prev = prev # double linked list of unvisited cells
self.next = next # so we can choose next cell from this list
self.counted = False # keeps track of visited cells
class LargestColorBlob:
def __init__(self, grid):
self.n = len(grid)
self.m = len(grid[0])
self.head = None
self.tail = None
self.grid = []
prev = None
for x in range(self.n): # build grid and double linked list
self.grid.append(list())
for y in range(self.m):
self.grid[-1].append(Node(x, y, grid[x][y], prev, None))
n = self.grid[x][y]
if prev:
prev.next = n
prev = n
self.head = self.grid[0][0]
self.tail = self.grid[-1][-1]
def delete_linked(self, n): # delete node in double linked list
before = n.prev
after = n.next
if before:
before.next = after
else:
self.head = after
if after:
after.prev = before
else:
self.tail = before
# note that if you are solving such large grids that the stack would overflow, then you can change
# to a non-recursive dfs using a list to hold the adjacent cells to be explored from.
# in fact you could unlink the cell from the big double linked list and string them together to form the list
def dfs(self, n, c): # dfs to find all connected colors and count them
if n.color != c or n.counted:
return 0
count = 1
self.delete_linked(n)
n.counted = True
# explore adjacent cells, not edge conditions are checked before
if n.x > 0: count += self.dfs(self.grid[n.x-1][n.y], c)
if n.x < self.n-1: count += self.dfs(self.grid[n.x+1][n.y], c)
if n.y > 0: count += self.dfs(self.grid[n.x][n.y-1], c)
if n.y < self.m-1: count += self.dfs(self.grid[n.x][n.y+1], c)
return count
def biggest_blob(self):
max_blob = 0 # 1 for starting cell
while self.head: # each time though loop will collect a color blob count
m = self.dfs(self.head, self.head.color) # do next item in not counted list
max_blob = max(m, max_blob) # only need to keep track of max count
return max_blob
data = [[1, 4, 4, 4, 4, 3, 3, 1],
[2, 1, 1, 4, 3, 3, 1, 1],
[3, 2, 1, 1, 2, 3, 2, 1],
[3, 3, 2, 1, 1, 2, 2, 2],
[3, 1, 3, 1, 1, 2, 4, 4],
[1, 1, 3, 1, 1, 4, 4, 4]]
lcb = LargestColorBlob(data)
print("largest size: ", lcb.biggest_blob())

C++:

#include <iostream>
using namespace std;
// Finding largest contiguous region in grid or same color in value.
// bfs is used run to find all connected cells for each region,
// each cell is linked into a double linked list builtin to grid and counted(visited) flag
// this also shows how to do it outside of the standard C++ libraries for data structures
// which in mostly a exercise in learning to develop customize implementations of algorithms that
// can be a little faster. Not in a big O way but just less instructions cycles or memory.
// I could have writen a vector based version, but instead I keep the nodes for the grid
// in a linear array and use a macro I(X,Y) to calculate the linear index from x and y indexes
struct Node {
int color;
int x, y;
Node *next;
Node *prev;
bool visited;
};
struct G {
int n, m;
Node *head;
Node *tail;
Node *grid;
} g;
// macro for indexing into g.grid
#define I(X,Y) (Y)+g.m*(X)
void setup(int a[], int n, int m) {
g.n = n;
g.m = m;
g.grid = new Node[n*m]; // index as [x*g.m + y] with I(X,Y) macro
Node * prev = NULL;
for(int x = 0; x < n; x++)
for(int y = 0; y < m; y++) { // build grid of nodes and linked list
Node* node = &g.grid[I(x,y)];
node->color = a[I(x,y)];
node->prev = prev;
node->x = x;
node->y = y;
if(prev != NULL) prev->next = node;
prev = node;
}
g.head = &g.grid[I(0,0)];
g.tail = &g.grid[I(n-1,m-1)];
};
void delete_linked(Node* n) {
Node *before = n->prev;
Node *after= n->next;
if(before != NULL) {
before->next = after;
} else {
g.head = after;
}
if(after != NULL) {
after->prev = before;
} else {
g.tail = before;
}
}
int dfs(Node* node, int color) { // find count of color blob starting at node
if(node->color != color || node->visited) {
return 0;
}
delete_linked(node);
node->visited = true;
int count = 1;
// explore adjacent cells filter if at edge of grid
if(node->x > 0)
count += dfs(&g.grid[I( node->x-1, node->y)], color);
if(node->x < g.n-1)
count += dfs(&g.grid[I( node->x+1, node->y)], color);
if(node->y > 0)
count += dfs(&g.grid[I( node->x, node->y-1)], color);
if(node->y < g.m-1)
count += dfs(&g.grid[I( node->x, node->y+1)], color);
return count;
}
int max_color() {
int max_blob = 0; // biggest blob size
while(g.head != NULL) {
int blob_max = dfs(g.head, g.head->color); // current blob size
if(blob_max > max_blob) max_blob = blob_max;
}
return max_blob;
}
int main() {
int n = 6, m = 8;
int a[] = { // note that I am using linear array to hold grid!
1, 4, 4, 4, 4, 3, 3, 1,
2, 1, 1, 4, 3, 3, 1, 1,
3, 2, 1, 1, 2, 3, 2, 1,
3, 3, 2, 1, 1, 2, 2, 2,
3, 1, 3, 1, 1, 2, 4, 4,
1, 1, 3, 1, 1, 4, 4, 4
};
setup(a,n,m);
std::cout << "max: " << max_color() << std::endl;
}

Hope you enjoyed this. -- Professor Jenkins

Check out my youtube channel https://www.youtube.com/user/gjenkinslbcc