Implementing Bucket Sort for Floating Point Numbers in Python

Introduction to Bucket Sort for Floating-Point Numbers

Bucket sort is a distribution-based sorting algorithm that partitions input data into a finite number of “buckets” and then sorts the contents of each bucket individually. When applied to floating-point numbers that are uniformly distributed over a known interval — typically [0, 1) — bucket sort can achieve linear average-case time complexity, making it a strong candidate for high-performance sorting tasks.

The core idea is simple: instead of comparing every pair of elements (as in comparison sorts like quicksort or mergesort), bucket sort first distributes the elements across buckets based on their values. Each bucket naturally groups together a narrow range of values. After that, a simple sorting algorithm — often insertion sort or even a recursive call to bucket sort — finishes the work. Finally, the buckets are concatenated in order to produce the sorted array.

This article provides an in-depth look at implementing bucket sort for floating-point numbers in Python, covering its mechanics, complexity, strengths, pitfalls, and real-world applications.

How Bucket Sort Works

Bucket sort assumes that the input is uniformly distributed within a known range, typically [0, 1). The algorithm proceeds in three phases:

Initialization: Create an array of n empty buckets, where n is the number of elements.
Distribution: For each element arr[i], compute its bucket index int(arr[i] * n) (assuming values are in [0, 1)) and place the element into that bucket.
Sorting and Concatenation: Sort each bucket individually (using any stable or efficient internal sort), then concatenate the buckets in order to produce the final sorted array.

The key insight is that because the data is uniformly distributed, each bucket receives roughly n / n = 1 element on average. That keeps the cost of sorting individual buckets extremely low — often constant time per bucket.
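To make the distribution phase concrete, here is a small trace on five hand-picked values. The helper name distribute and the sample data are illustrative assumptions, not part of any library.

```python
def distribute(values):
    """Place each value in [0, 1) into one of len(values) buckets."""
    n = len(values)
    buckets = [[] for _ in range(n)]
    for v in values:
        buckets[int(v * n)].append(v)  # bucket i covers [i/n, (i+1)/n)
    return buckets

sample = [0.42, 0.07, 0.93, 0.51, 0.38]
print(distribute(sample))
# [[0.07], [0.38], [0.42, 0.51], [], [0.93]]
```

Most buckets hold zero or one element, and reading the buckets left to right already gives a nearly sorted sequence; only the two-element bucket needs any real sorting work.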

Handling Edge Cases

When a floating-point number exactly equals 1.0, the computed index would be int(1.0 * n) = n, which is out of bounds. A common fix is to clamp the index to n-1 for such values. In practice, if your data is strictly [0, 1), this edge case does not occur, but it’s wise to guard against it.
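A minimal sketch of such a guard follows. The helper name bucket_index is hypothetical, and the decision to reject out-of-range values with an exception (rather than clamp them) is an assumption made for illustration.

```python
def bucket_index(value, n):
    """Return the bucket index for value, assuming n buckets over [0, 1]."""
    if not 0.0 <= value <= 1.0:
        # int() truncates toward zero, so a negative index such as -1 would
        # silently address the last bucket; fail loudly instead.
        raise ValueError(f"value {value!r} is outside [0, 1]")
    return min(int(value * n), n - 1)  # 1.0 maps to n, so clamp to n - 1

print(int(1.0 * 4))           # 4, one past the last valid index
print(bucket_index(1.0, 4))   # 3
print(bucket_index(0.25, 4))  # 1
```

The range check also catches negative inputs, which Python would otherwise accept as valid negative list indices and file into the wrong bucket.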

Implementing Bucket Sort in Python

Below is a compact reference implementation of bucket sort for floating-point numbers in the range [0, 1). It does no input validation, so values outside that range must be rejected or rescaled by the caller.

def bucket_sort(arr):
    """Sort floats uniformly distributed in [0, 1); returns a new list."""
    n = len(arr)
    if n <= 1:
        return list(arr)  # copy, so the caller always gets a new list

    # Create one empty bucket per element
    buckets = [[] for _ in range(n)]

    # Distribute elements into buckets
    for num in arr:
        # Clamp so that num == 1.0 (index n) lands in the last bucket
        index = min(int(num * n), n - 1)
        buckets[index].append(num)

    # Sort each bucket and concatenate
    sorted_arr = []
    for bucket in buckets:
        sorted_arr.extend(sorted(bucket))  # Python's Timsort is efficient

    return sorted_arr

The function uses Python’s built-in sorted() to sort each bucket. For buckets that are small (typically 0–2 elements), this is very fast. Insertion sort is the textbook choice for tiny buckets, but in CPython sorted() runs in C, so a hand-written replacement is only worth adopting if a benchmark shows it helps.

Bucket Sort for Arbitrary Ranges

If your floating-point data spans a range other than [0, 1), you can normalize the values before distribution. The following variation maps any [min, max] range onto [0, 1]; the maximum normalizes to exactly 1.0, which is why the index is clamped to the last bucket:

def bucket_sort_scaled(arr, min_val=None, max_val=None):
    """Sort floats from an arbitrary range; returns a new list."""
    if not arr:
        return list(arr)
    if min_val is None:
        min_val = min(arr)
    if max_val is None:
        max_val = max(arr)

    # Guard against identical values (avoids division by zero below)
    if max_val == min_val:
        return sorted(arr)

    n = len(arr)
    buckets = [[] for _ in range(n)]

    for num in arr:
        # Normalize to [0, 1]; max_val maps to exactly 1.0
        normalized = (num - min_val) / (max_val - min_val)
        # Clamp both ends: values outside a caller-supplied range would
        # otherwise produce a negative or too-large index
        index = max(0, min(int(normalized * n), n - 1))
        buckets[index].append(num)

    sorted_arr = []
    for bucket in buckets:
        sorted_arr.extend(sorted(bucket))
    return sorted_arr

This version is more general but requires knowing or computing the range. It works well when the data distribution is approximately uniform within that range.
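As a worked example of the normalization step, the sketch below computes only the bucket index of each value. The function name scaled_indices and the four sample values are illustrative assumptions, and the code assumes the values are not all identical.

```python
def scaled_indices(values):
    """Return the bucket index of each value after min-max normalization."""
    lo, hi = min(values), max(values)
    n = len(values)
    indices = []
    for v in values:
        normalized = (v - lo) / (hi - lo)  # in [0, 1]; the maximum maps to 1.0
        indices.append(min(int(normalized * n), n - 1))
    return indices

print(scaled_indices([12.5, 3.0, 7.25, 20.0]))
# [2, 0, 1, 3]
```

The maximum, 20.0, normalizes to 1.0 and would land at index 4 without the clamp; with it, the value goes into the last bucket.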

Complexity Analysis

Understanding the computational cost of bucket sort is essential for deciding when to use it.

Time Complexity

Best case (uniformly distributed data): O(n + k), where k is the number of buckets (usually n). Distribution is O(n), and each bucket holds a constant number of elements on average, so the total is O(n).
Average case: O(n + n²/k + k) if using insertion sort for buckets. With k = n, this becomes O(n).
Worst case: O(n²) with insertion sort when all elements fall into the same bucket, which happens when the data is heavily clustered rather than uniform. With sorted() as the per-bucket sort, as in the code above, the worst case is O(n log n), because Timsort never exceeds that bound.
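The n²/k term in the average case follows from a short expectation argument. As a sketch, assume each of the n elements lands in each of the k buckets independently with probability 1/k, and that sorting a bucket of size n_i costs on the order of n_i² (insertion sort). The symbol n_i is introduced here for illustration.

```latex
\begin{aligned}
n_i &\sim \mathrm{Binomial}\!\left(n, \tfrac{1}{k}\right), \\
\mathbb{E}[n_i^2] &= \operatorname{Var}(n_i) + \mathbb{E}[n_i]^2
  = \frac{n}{k}\left(1 - \frac{1}{k}\right) + \frac{n^2}{k^2}, \\
\sum_{i=1}^{k} \mathbb{E}[n_i^2] &= n\left(1 - \frac{1}{k}\right) + \frac{n^2}{k}.
\end{aligned}
```

With k = n the sum equals 2n − 1, so the expected per-bucket sorting work is linear, and adding the O(n + k) distribution and concatenation passes keeps the total at O(n).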

Space Complexity

Bucket sort requires O(n + k) extra space for the buckets and their contents. With k = n, this is O(n). The space used is comparable to that of mergesort and higher than that of in-place sorts like quicksort.

Advantages and Use Cases

Bucket sort shines in specific scenarios where its assumptions hold:

Uniformly distributed floating-point data: for example, sensor readings, Monte Carlo simulation outputs, or normalized probabilities.
Large datasets: the O(n) average-case bound is attractive when sorting millions of floats, though in CPython the built-in sorted() has much lower constant factors than a pure-Python loop, so benchmark before switching.
External sorting: when data resides on disk, buckets can be processed independently and written to separate files, then concatenated.
Parallel and GPU computing: each bucket can be sorted independently, which exposes a high degree of parallelism.
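The independence of buckets can be sketched with the standard library’s ThreadPoolExecutor. The function name bucket_sort_parallel and the default bucket and worker counts are assumptions chosen for illustration. Under CPython’s global interpreter lock the threads mostly take turns, so this shows the structure of a parallel bucket sort rather than a guaranteed speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def bucket_sort_parallel(arr, num_buckets=8, max_workers=4):
    """Sort floats in [0, 1), handing each bucket to a worker thread."""
    if len(arr) <= 1:
        return list(arr)
    buckets = [[] for _ in range(num_buckets)]
    for x in arr:
        buckets[min(int(x * num_buckets), num_buckets - 1)].append(x)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in bucket order, so concatenation stays correct
        sorted_buckets = list(pool.map(sorted, buckets))
    return [x for bucket in sorted_buckets for x in bucket]

print(bucket_sort_parallel([0.42, 0.07, 0.93, 0.51, 0.38]))
# [0.07, 0.38, 0.42, 0.51, 0.93]
```

No bucket shares data with another, so the same layout carries over to process pools or GPU kernels, where the per-bucket sorts can genuinely run at the same time.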

One notable strength is that bucket sort is stable (if the per-bucket sort is stable), meaning the relative order of equal elements is preserved.
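Stability only becomes visible when elements carry more than their sort key. The sketch below sorts (label, value) records by value; the function name bucket_sort_by_key and the sample records are hypothetical.

```python
def bucket_sort_by_key(records, key):
    """Sort records by key(record), a float assumed to lie in [0, 1)."""
    n = len(records)
    if n <= 1:
        return list(records)
    buckets = [[] for _ in range(n)]
    for rec in records:
        # append() keeps records in input order within each bucket
        buckets[min(int(key(rec) * n), n - 1)].append(rec)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket, key=key))  # sorted() is stable
    return result

readings = [("a", 0.5), ("b", 0.2), ("c", 0.5), ("d", 0.2)]
print(bucket_sort_by_key(readings, key=lambda r: r[1]))
# [('b', 0.2), ('d', 0.2), ('a', 0.5), ('c', 0.5)]
```

Records with equal values keep their original relative order: "b" stays ahead of "d", and "a" stays ahead of "c".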

Limitations and Considerations

Despite its elegance, bucket sort has several limitations that can render it unsuitable for general-purpose sorting:

Sensitivity to input distribution: If the data is skewed (e.g., many values clustered together), most elements fall into a few buckets, and the per-bucket sorting cost climbs toward O(n²) with insertion sort or O(n log n) with sorted().
Requires prior knowledge of the range: Without knowing the minimum and maximum values, you cannot effectively create buckets. The scaled version above mitigates this, but computing the range adds an extra pass.
Memory overhead: Creating n Python lists can consume significant memory, especially for very large arrays. Linked lists or arrays of arrays can reduce overhead, but Python’s list of lists is straightforward.
Overhead of per-bucket sorting: Sorting many tiny buckets with Python’s sorted() incurs one function call per bucket, and those calls add up. For extremely small buckets, an explicit insertion sort might be faster, though that should be confirmed by measurement.
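The effect of skew is easy to observe by counting how many elements land in the fullest bucket. In this sketch the helper largest_bucket, the seed, and the choice of x⁴ as a skewing transform are all assumptions made for illustration.

```python
import random

def largest_bucket(values):
    """Return the size of the fullest bucket when using len(values) buckets."""
    n = len(values)
    counts = [0] * n
    for v in values:
        counts[min(int(v * n), n - 1)] += 1
    return max(counts)

rng = random.Random(0)
uniform = [rng.random() for _ in range(10_000)]
skewed = [x ** 4 for x in uniform]  # still in [0, 1), but crowded near 0

print(largest_bucket(uniform))  # a handful of elements
print(largest_bucket(skewed))   # roughly a tenth of the input, all in bucket 0
```

With the skewed input, every value whose original x was below 0.1 ends up in the first bucket, so one bucket does a large share of the sorting work.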

When Not to Use Bucket Sort

Avoid bucket sort when the data is not uniformly distributed, when a few outliers stretch the range so that most values crowd into a handful of buckets, or when memory is extremely constrained. In those cases, a comparison-based sort like quicksort or heapsort is a safer choice.

Comparison with Other Sorting Algorithms

Bucket sort occupies a unique niche among sorting algorithms. Here is how it compares to common alternatives:

Algorithm	Average Time	Space	Stable	Best For
Bucket Sort (with k = n)	O(n)	O(n)	Yes (if per-bucket sort is stable)	Uniform floats in known range
Quicksort	O(n log n)	O(log n)	No (typical)	General-purpose, in-place
Mergesort	O(n log n)	O(n)	Yes	Stable sorting, linked lists
Counting Sort	O(n + k)	O(k)	Yes	Integer data with limited range
Radix Sort	O(n × w)	O(n + b)	Yes (LSD)	Integers or strings of fixed length (w digits, radix b)

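For floating-point numbers, bucket sort is usually simpler to apply than radix sort (which must reinterpret the bit patterns of floats), and on uniform data its O(n) average cost is asymptotically better than the O(n log n) of comparison sorts. Whether that translates into shorter wall-clock time depends on constant factors and should be measured.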

Practical Python Tips and Optimizations

Choosing the Number of Buckets

Setting the number of buckets equal to the number of elements (k = n) is a standard rule of thumb. Fewer buckets increase the average bucket size and degrade performance; more buckets waste memory without improving speed.
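To experiment with this trade-off, the bucket count can be made a parameter. The following is a minimal sketch; the name bucket_sort_k and its signature are assumptions, and the input is assumed to lie in [0, 1).

```python
def bucket_sort_k(arr, k):
    """Sort floats in [0, 1) using k buckets (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    buckets = [[] for _ in range(k)]
    for num in arr:
        buckets[min(int(num * k), k - 1)].append(num)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result

data = [0.42, 0.07, 0.93, 0.51, 0.38]
print(bucket_sort_k(data, k=2))  # two large buckets, more work per bucket
print(bucket_sort_k(data, k=5))  # k = n, about one element per bucket
# both print [0.07, 0.38, 0.42, 0.51, 0.93]
```

The output is identical for any valid k; only the balance between per-bucket sorting work and the number of list objects changes.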

Using Insertion Sort for Small Buckets

If you want fine-grained control, replace sorted(bucket) with a custom insertion sort for buckets smaller than, say, 20 elements:

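def insertion_sort(arr):
    """Sort arr in place; efficient only for very short lists."""
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        # Shift larger elements one slot right to make room for key
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

SMALL_BUCKET = 20  # threshold from the text; tune by measurement

def bucket_sort_insertion(arr):
    n = len(arr)
    if n <= 1:
        return list(arr)
    buckets = [[] for _ in range(n)]
    for num in arr:
        # Clamp so that num == 1.0 lands in the last bucket
        index = min(int(num * n), n - 1)
        buckets[index].append(num)
    sorted_arr = []
    for bucket in buckets:
        if len(bucket) < SMALL_BUCKET:
            insertion_sort(bucket)
        else:
            bucket.sort()  # fall back to Timsort for large buckets
        sorted_arr.extend(bucket)
    return sorted_arr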

Whether this reduces overhead depends on the interpreter. In CPython, sorted() is implemented in C, so a pure-Python insertion sort can end up slower even on tiny lists. Benchmark both on representative data before committing to either.
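One way to check the overhead question on a given machine is a quick timeit comparison. The sketch below repeats the insertion sort so that it runs on its own; the function name time_small_sorts, the bucket size, and the repeat count are arbitrary assumptions. No expected timings are shown because they vary by interpreter and hardware.

```python
import random
import timeit

def insertion_sort(items):
    """In-place insertion sort, same logic as the version in the text."""
    for i in range(1, len(items)):
        key = items[i]
        j = i - 1
        while j >= 0 and items[j] > key:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = key

def time_small_sorts(bucket_size=3, repeats=20_000):
    """Return (seconds for sorted(), seconds for insertion sort)."""
    rng = random.Random(0)
    bucket = [rng.random() for _ in range(bucket_size)]
    t_builtin = timeit.timeit(lambda: sorted(bucket), number=repeats)
    # Copy the bucket each time so every run sorts unsorted input,
    # matching the copy that sorted() makes internally.
    t_insertion = timeit.timeit(lambda: insertion_sort(list(bucket)), number=repeats)
    return t_builtin, t_insertion

t_builtin, t_insertion = time_small_sorts()
print(f"sorted():       {t_builtin:.4f} s")
print(f"insertion sort: {t_insertion:.4f} s")
```

Running it with a few different bucket_size values gives a rough picture of where, if anywhere, the hand-written sort wins.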

Handling Non-Uniform Distributions

If you know the data distribution is not uniform but still want to use bucket sort, you can adapt the bucket boundaries. For example, if data follows a normal distribution, you can create buckets of unequal width to balance the load. However, this requires prior analysis of the data and is rarely done in practice.
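One way to get unequal-width buckets without modelling the distribution explicitly is to take the boundaries from a random sample of the data, which brings the approach close to sample sort. The sketch below is an illustration under that assumption; the function name bucket_sort_sampled and its parameters are hypothetical.

```python
import bisect
import random

def bucket_sort_sampled(arr, num_buckets=16, sample_size=256, seed=0):
    """Sort a list of floats using bucket boundaries taken from a sample."""
    if len(arr) <= 1:
        return list(arr)
    rng = random.Random(seed)
    sample = sorted(rng.sample(arr, min(sample_size, len(arr))))
    # Pick num_buckets - 1 splitters at evenly spaced ranks of the sample,
    # so dense regions of the data get narrower buckets.
    step = len(sample) / num_buckets
    splitters = [sample[int(i * step)] for i in range(1, num_buckets)]
    buckets = [[] for _ in range(num_buckets)]
    for x in arr:
        # Binary search finds the bucket whose value range contains x
        buckets[bisect.bisect_right(splitters, x)].append(x)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(1_000)]
print(bucket_sort_sampled(data) == sorted(data))  # True
```

The binary search makes distribution cost O(n log k) instead of O(n), which is the price paid for buckets that adapt to the data.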

Conclusion

Bucket sort is an elegant, efficient algorithm for sorting floating-point numbers, especially when the data is uniformly distributed and the range is known. Its linear average-case time complexity makes it a valuable tool in the data scientist’s or engineer’s toolkit. However, its sensitivity to input distribution and additional memory requirements mean it should not be used blindly. By understanding when and how to apply bucket sort, and by implementing it carefully in Python with proper edge-case handling, you can judge, with a benchmark in hand, whether it actually beats a general-purpose comparison sort on your data.

Whether you are sorting millions of sensor measurements or ordering output from a stochastic simulation, bucket sort offers a fast, stable, and parallelizable solution, as long as your data plays by the rules.