Bubble-Sort Revisited

This is a continuation of a series on the Bubble Sort.

In the Beginning

In the previous bubble sort post I implemented the basic bubble sort, but there is a variation of it where you short-circuit the search if the collection is already sorted which is what I'll look at here. The short-circuiting means that the bubble sort can sometimes do better than other methods, but only when the data is mostly sorted so it's still not a practical algorithm. Why do this, then? Eh, why not?

Some Setup

Imports

# python
from functools import partial

# pypi
# from bokeh.models import HoverTool
from expects import be_below, contain_exactly, equal, expect
from joblib import Parallel, delayed
from numba import njit
from numba.typed import List
from numpy.random import default_rng

import altair
import numpy
import pandas

# this project
from graeae.visualization.altair_helpers import output_path, save_chart
from bowling.sort.bubble import bubble, BubbleOutput

# my other stuff
from graeae import Timer

Stuff For Later

TIMER = Timer()
SLUG = "double-bubble-sort"
OUTPUT_PATH = output_path(SLUG)
random = default_rng(2021)

The Sorters

Bubba Sort

This will be the short-circuiting version of the bubble sort.

Bubba

This is the short-circuiter. It works pretty much the same way with just an extra if-statement to break out of the loop if nothing was swapped.

def bubba(elements) -> BubbleOutput:
    """Sorts the list in place

    Args:
     elements: list of (in-place) sortable elements

    Returns:
     number of elements, count of comparisons
    """
    all_but_one = len(elements) - 1
    comparisons = swaps = 0
    for items_sorted in range(all_but_one):
        swapped_at_least_once = False
        for in_front_of_us in range(all_but_one - items_sorted):
            comparisons += 1
            to_the_right = in_front_of_us + 1
            if elements[to_the_right] < elements[in_front_of_us]:
                (elements[in_front_of_us],
                 elements[to_the_right]) = (elements[to_the_right],
                                          elements[in_front_of_us])
                swapped_at_least_once = True
                swaps += 1
        if not swapped_at_least_once:
            break
    return BubbleOutput(len(elements), comparisons, swaps, elements)

Bubble

I showed how I implement the bubble sort in the previous bubble sort post so, since I imported it up above, I'm just going to run it.

Test It Out

Once again, let's just make sure everything works.

inputs = [1, 2, 3, 4, 7, 6, 5]

expected = list(sorted(inputs))

test_1, test_2 = inputs.copy(), inputs.copy()

bubba_output = bubba(test_1)
original_output = bubble(test_2)

expect(test_1).to(contain_exactly(*expected))
expect(test_2).to(contain_exactly(*expected))

n = len(inputs)
worst = (n * (n - 1))/2
expect(bubba_output.comparisons).to(be_below(worst))
expect(bubba_output.comparisons).to(equal(15))

expect(bubba_output.swaps).to(equal(original_output.swaps))
expect(original_output.comparisons).to(equal(worst))

# try a bigger input
inputs = random.choice(list(range(100)), size=100)
expected = list(sorted(inputs))

test_1, test_2 = inputs.copy(), inputs.copy()

bubba_output = bubba(test_1)
original_output = bubble(test_2)

expect(list(test_1)).to(contain_exactly(*expected))
expect(list(test_2)).to(contain_exactly(*expected))

n = len(inputs)
worst = (n * (n - 1))/2
expect(bubba_output.comparisons).to(be_below(worst))
expect(original_output.comparisons).to(equal(worst))

expect(bubba_output.swaps).to(equal(original_output.swaps))
expect(original_output.comparisons).to(equal(worst))

Counting Comparisons

Since the actual method of sorting is the same the swaps should be the same in either case (the new version doesn't quit until there's no more swaps to be done) so I'll just look at the comparisons and see if it made any real difference.

Set Up Numba

bubba = njit(bubba, nogil=True)
bubble = njit(bubble, nogil=True)

Run The Bubble Counter

things_to_sort = [random.choice(list(range(count)), size=count)
                  for count in range(1, 10**5+ 1, 1000)]

with TIMER:
    bubbles_counts_and_comparisons = Parallel(n_jobs=-1)(
        delayed(bubble)(thing_to_sort)
        for thing_to_sort in things_to_sort)
Started: 2022-01-12 23:40:18.730737
Ended: 2022-01-12 23:41:51.143272
Elapsed: 0:01:32.412535

Run the Bubba Counter

with TIMER:
    bubbas_counts_and_comparisons = Parallel(n_jobs=-2)(
        delayed(bubba)(thing_to_sort)
        for thing_to_sort in things_to_sort)
Started: 2022-01-12 23:41:58.766180
Ended: 2022-01-12 23:43:38.475355
Elapsed: 0:01:39.709175

Time-wise it seems to have done about the same as the original bubble sort.

SIZE, COMPARISONS = 0, 1
bubble_unzipped = list(zip(*bubbles_counts_and_comparisons))
bubba_unzipped = list(zip(*bubbas_counts_and_comparisons))
bubba_frame = pandas.DataFrame({"Elements": bubble_unzipped[SIZE],
                                "Bubble": bubble_unzipped[COMPARISONS],
                                "Bubba": bubba_unzipped[COMPARISONS]})

melted = bubba_frame.melt(id_vars=["Elements"],
                          var_name="Sorter",
                          value_name="Comparisons")
chart = altair.Chart(melted).mark_point().encode(
    x="Elements",
    y="Comparisons",
    color="Sorter",
    tooltip=[altair.Tooltip("Elements", format=","), "Sorter",
             altair.Tooltip("Comparisons", format=",")]
).properties(
    title="Bubble Sort Comparisons (with and without short-circuiting)",
    width=900,
).interactive()

save_chart(chart, "bubble_comparisons", output_path=OUTPUT_PATH, height=400)

Figure Missing

At first glance it looks like there's only one grey set of points, but if you zoom in (using the mouse's scroll wheel) you'll see that the grey points are actually (sometimes) created by adjacent points with the short-circuiting version's point (the blue point) slightly lower than the non-short-ciruiting orange point. So there is a difference, but it's small enough that it isn't easy to see.

Let's see if it becomes more obvious using the difference between the points instead.

bubba_frame["Difference"] = (bubba_frame["Bubble"] -
                             bubba_frame["Bubba"])
chart = altair.Chart(bubba_frame).mark_trail().encode(
    x="Elements:O",
    y="Difference:Q",
    size="Difference:Q",
    tooltip=[altair.Tooltip("Elements", format=","),
             altair.Tooltip("Difference", format=","),
             altair.Tooltip("Bubble", format=","),
             altair.Tooltip("Bubba", format=",")]
).properties(title="Difference In Comparisons Between the Bubble Sorts",
             height=500,
             width=900).interactive()

save_chart(chart, "bubble_differences", output_path=OUTPUT_PATH, height=600)

Figure Missing

So, it looks like there are sometimes significant differences between the two sorts, but how do they appear given the total number of comparisons made?

bubba_frame["Difference/Comparisons"] = bubba_frame.Difference/bubba_frame.Bubba
sub_bubba = bubba_frame[["Elements", "Difference/Comparisons"]]
chart = altair.Chart(sub_bubba).mark_trail().encode(
    x="Elements:O",
    y="Difference/Comparisons:Q",
    size="Difference/Comparisons:Q",
    tooltip=[altair.Tooltip("Elements", format=","),
             altair.Tooltip("Difference/Comparisons", format=".4f")]
).properties(title="Proportion of Difference Between the Bubble Sorts",
             height=500,
             width=900).interactive()

save_chart(chart, "bubble_difference_proportions", output_path=OUTPUT_PATH,
           height=600)

Figure Missing

It looks like as the amount of sorting that's needed goes up, the difference made by the short-circuiting becomes smaller in comparison to the total number of comparisons made, but even the biggest effect proportional to the amount of comparisons made by the short-circuiting sorter only comes to the difference being 0.2% of the total comparisons.

Best Case, Worst Case

Now that we know that the short-circuiting isn't that big a deal I'll add some plotting showing how our random cases compare to the best-case and worst case sorting inputs.

best_things = [numpy.arange(count, dtype=int) for count in range(1, 10**5+ 1, 1000)]
worst_things = [numpy.flip(things) for things in best_things]

with TIMER:
    worst_output = Parallel(n_jobs=-1)(
        delayed(bubba)(thing) for thing in worst_things
)
Started: 2022-01-13 00:01:27.932132
Ended: 2022-01-13 00:02:08.108278
Elapsed: 0:00:40.176146
with TIMER:
    best_output = Parallel(n_jobs=-1)(
        delayed(bubba)(thing) for thing in best_things
)
Started: 2022-01-13 00:08:59.231664
Ended: 2022-01-13 00:09:00.409103
Elapsed: 0:00:01.177439
worst_unzipped = list(zip(*worst_output))
best_unzipped = list(zip(*best_output))

del bubba_frame["Bubble"]
del bubba_frame["Difference"]
del bubba_frame["Difference/Comparisons"]
bubba_frame = bubba_frame.rename(columns={"Bubba": "Random"})
bubba_frame["Worst Case"] = worst_unzipped[COMPARISONS]
bubba_frame["Best Case"] = best_unzipped[COMPARISONS]

melted = bubba_frame.melt(id_vars=["Elements"],
                          var_name="Input",
                          value_name="Comparisons")
chart = altair.Chart(melted).mark_point().encode(
    x="Elements",
    y="Comparisons",
    color="Input"
).properties(
    width=800,
    height=550,
    title="Bubble Sort Best, Worst, and Random Input Comparisons"
).interactive()

save_chart(chart, "best-worst-random", output_path=OUTPUT_PATH, height=600)

Figure Missing

Once again we run into the problem/case that it looks like the Random input is missing, but if you zoom way in you'll see it does slightly better than the worst-case, but the difference is so small that when you zoom back out they look like they did exactly the same. The best-case does much better, since it only has to pass through the data once to see that it's already sorted and then quit (which isn't really reflected in the plot because our increment of comparisons only happens if the inner loop is entered), but except for this unusual case, random inputs don't do enough better than the worst-case examples to make short-circuiting a noteworthy improvement.

The End, The End

Well, I think I've beaten this dead horse enough for now. The overview for all the posts is here, if you didn't get enough of it. Up next: Selection Sort.