Implementing Median Algorithm with Two Binary Heaps

labuladong | Original | About 3469 words

This article will resolve

LeetCode	Difficulty
295. Find Median from Data Stream	🔴

If given an array and asked to find the median, it is straightforward. Sort the array; if the length is odd, the middle element is the median. If the length is even, the average of the two middle elements is the median.
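The sort-then-index procedure above can be sketched in a few lines. The function name `median_of` is a hypothetical helper chosen for illustration, and the sketch assumes a non-empty list of numbers.

```python
def median_of(nums):
    """Return the median of a non-empty list (hypothetical helper)."""
    ordered = sorted(nums)  # O(N log N); the input list is left unchanged
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle element is the median
        return float(ordered[mid])
    # Even length: average the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([2, 3, 4]))  # 3.0
print(median_of([2, 3]))     # 2.5
```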

For very large datasets, sorting may be impractical. A probabilistic approach can be used instead: randomly select a portion of the data, sort it, and find the median to approximate the median of the entire dataset.
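As a rough illustration of the sampling idea, the sketch below sorts only a random subset. The name `approximate_median` and its `sample_size` and `seed` parameters are assumptions made for this example. The result is only an estimate, and how close it gets depends on the sample size and the data distribution.

```python
import random


def approximate_median(data, sample_size, seed=None):
    """Estimate the median of `data` from a random sample (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(sample_size, len(data))
    # Draw k distinct positions without replacement, then sort only the sample
    sample = sorted(rng.sample(data, k))
    mid = k // 2
    if k % 2 == 1:
        return float(sample[mid])
    return (sample[mid - 1] + sample[mid]) / 2


data = list(range(100_000))
estimate = approximate_median(data, sample_size=1_000, seed=42)
print(estimate)  # an estimate near the true median 49999.5, not an exact value
```

When `sample_size` is at least the length of the data, the sample is the whole dataset and the estimate coincides with the exact median.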

The median algorithm discussed in this article is more challenging and sophisticated, as seen in LeetCode Problem 295: "Find Median from Data Stream".

295. Find Median from Data Stream | LeetCode | 🔴

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

For example, for arr = [2,3,4], the median is 3.
For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Implement the MedianFinder class:

MedianFinder() initializes the MedianFinder object.
void addNum(int num) adds the integer num from the data stream to the data structure.
double findMedian() returns the median of all elements so far. Answers within 10^-5 of the actual answer will be accepted.

Example 1:

Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]

Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr = [1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

-10⁵ <= num <= 10⁵
There will be at least one element in the data structure before calling findMedian.
At most 5 * 10⁴ calls will be made to addNum and findMedian.

Follow up:

If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

The problem is from LeetCode 295. Find Median from Data Stream.

java

// The problem asks you to design such a class
class MedianFinder {

    // Add a number
    public void addNum(int num) {}

    // Calculate the median of all the numbers currently added
    public double findMedian() { return 0.0; } // placeholder return so the stub compiles
}

cpp

// The problem asks you to design such a class
class MedianFinder {
public:
    // add a number
    void addNum(int num) {}

    // calculate the median of all the numbers added so far
    double findMedian() { return 0.0; } // placeholder return so the stub is well-formed
};

python

# The problem asks you to design such a class
class MedianFinder:
    # add a number
    def addNum(self, num: int) -> None:
        pass
    
    # calculate the median of all currently added numbers
    def findMedian(self) -> float:
        pass

go

// The problem asks you to design such a class
type MedianFinder struct {}

// Add a number
func (this *MedianFinder) AddNum(num int) {}

// Calculate the median of all numbers currently added
func (this *MedianFinder) FindMedian() float64 { return 0 } // placeholder return so the stub compiles

javascript

// The problem asks you to design such a class
var MedianFinder = function() {
    
    // Add a number
    this.addNum = function(num) {};
    
    // Calculate the median of all numbers currently added
    this.findMedian = function() {};
};

Actually, all algorithms related to "streams" are quite challenging. For instance, in my previous article Discussing Random Algorithms in Games, I wrote about how to randomly select an element from a data stream with equal probability. If you haven't encountered this problem before, it's difficult to come up with a solution.

This problem requires calculating the median from a data stream. Let's first consider the conventional approach.

Attempted Analysis

A straightforward solution is to use an array to record all numbers added by addNum, keeping the array sorted by placing each new number at its correct position, as insertion sort does. When the findMedian method is called, the median can be calculated directly from the array indices.

However, using an array as the underlying container has obvious issues. While addNum can use binary search to find the insertion position, the insertion operation requires shifting data, resulting in a worst-case time complexity of O(N).
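A minimal sketch of this sorted-array approach, assuming Python's standard `bisect` module for the binary search. The class name `SortedArrayMedianFinder` is hypothetical. `bisect.insort` finds the position in O(log N), but the underlying list insertion still shifts elements, so `addNum` remains O(N) in the worst case.

```python
import bisect


class SortedArrayMedianFinder:
    """Keeps every number in a sorted list (hypothetical illustration)."""

    def __init__(self):
        self.nums = []

    def addNum(self, num):
        # Binary search for the position is O(log N),
        # but inserting into the list shifts elements: O(N) worst case
        bisect.insort(self.nums, num)

    def findMedian(self):
        # O(1): index directly into the sorted list
        n = len(self.nums)
        mid = n // 2
        if n % 2 == 1:
            return float(self.nums[mid])
        return (self.nums[mid - 1] + self.nums[mid]) / 2


finder = SortedArrayMedianFinder()
finder.addNum(1)
finder.addNum(2)
print(finder.findMedian())  # 1.5
finder.addNum(3)
print(finder.findMedian())  # 2.0
```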

What about using a linked list? Inserting elements into a linked list is fast, but finding the insertion position requires linear traversal, with a worst-case time complexity still at O(N). Additionally, the findMedian method also needs to traverse to find the middle index, resulting in a worst-case time complexity of O(N).
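For comparison, here is a sketch of the linked-list variant, using hypothetical `Node` and `LinkedListMedianFinder` classes. Both methods walk the list node by node, which is where the O(N) cost comes from.

```python
class Node:
    """Singly linked list node (hypothetical illustration)."""

    def __init__(self, val, next=None):
        self.val = val
        self.next = next


class LinkedListMedianFinder:
    def __init__(self):
        self.head = Node(None)  # dummy head simplifies insertion at the front
        self.size = 0

    def addNum(self, num):
        # Linear scan for the insertion point: O(N)
        prev = self.head
        while prev.next is not None and prev.next.val < num:
            prev = prev.next
        # The splice itself is O(1)
        prev.next = Node(num, prev.next)
        self.size += 1

    def findMedian(self):
        # Walk to the lower middle node: O(N)
        node = self.head.next
        for _ in range((self.size - 1) // 2):
            node = node.next
        if self.size % 2 == 1:
            return float(node.val)
        return (node.val + node.next.val) / 2


finder = LinkedListMedianFinder()
for num in [3, 1, 2]:
    finder.addNum(num)
print(finder.findMedian())  # 2.0
```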

So, how about using a balanced binary tree, where the complexity for insertions, deletions, and searches is O(logN)? Would that work?

For example, using Java's TreeSet container, which is based on a Red-Black tree, addNum can insert directly, and findMedian can deduce the rank of the median element based on the current number of elements.

Unfortunately, this still doesn't work due to two issues:

First, TreeSet is a type of Set that does not allow duplicate elements, but our data stream might include duplicate data, and calculating the median requires considering these duplicates.

Second, TreeSet does not provide an API to quickly retrieve an element by rank. For instance, if I want to find the 5th largest element in a TreeSet, there is no ready-made method to achieve this.
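The duplicate issue is easy to see with a small experiment. Python has no built-in TreeSet, so this sketch uses the built-in `set` as a stand-in, only to show what happens when duplicates are discarded. The helper `median_of_sorted` is hypothetical.

```python
def median_of_sorted(values):
    """Median of an already sorted, non-empty list (hypothetical helper)."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return float(values[mid])
    return (values[mid - 1] + values[mid]) / 2


stream = [1, 1, 1, 2, 9]

# Keeping duplicates: sorted data is [1, 1, 1, 2, 9]
print(median_of_sorted(sorted(stream)))       # 1.0

# A set drops the repeated 1s: sorted data is [1, 2, 9]
print(median_of_sorted(sorted(set(stream))))  # 2.0
```

The two results differ, so any container that silently drops duplicates gives the wrong median for this stream.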

Info

If you were to implement a method select(int index) to calculate the corresponding element by rank in a binary search tree, how would you design it? Think about it, and I will post the answer in the comments section.
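One possible design, not necessarily the answer the author has in mind, is to store in every node the size of its subtree, so that `select` can decide at each step whether the target rank lies in the left subtree, at the current node, or in the right subtree. The sketch below uses a plain, unbalanced binary search tree for brevity. `TreeNode`, `insert`, and the 0-based `index` convention are assumptions of this example.

```python
class TreeNode:
    """BST node that also tracks its subtree size (hypothetical illustration)."""

    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here


def insert(root, val):
    """Insert val (duplicates go to the right) and return the subtree root."""
    if root is None:
        return TreeNode(val)
    if val < root.val:
        root.left = insert(root.left, val)
    else:
        root.right = insert(root.right, val)
    root.size += 1
    return root


def select(root, index):
    """Return the element with 0-based rank `index` in sorted order."""
    while root is not None:
        left_size = root.left.size if root.left else 0
        if index < left_size:
            root = root.left           # target is in the left subtree
        elif index == left_size:
            return root.val            # current node has exactly this rank
        else:
            index -= left_size + 1     # skip the left subtree and this node
            root = root.right
    raise IndexError("rank out of range")


root = None
for num in [5, 3, 8, 3, 1]:
    root = insert(root, num)
print(select(root, 2))  # 3, the middle of [1, 3, 3, 5, 8]
```

Each step of `select` descends one level, so its cost is proportional to the tree height. That is O(logN) only if the tree is kept balanced, which this sketch does not do.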

Aside from balanced binary trees, is there any commonly used data structure that is dynamically ordered? What about a priority queue (binary heap)?

It seems that won't work either, because a priority queue is a restricted data structure that only exposes its top element: you can add any element, but you can only read or remove the one at the top. Our addNum method can push elements into the heap, but the findMedian function needs to retrieve a value from the middle of the data, which a priority queue cannot provide.
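To make the limitation concrete, the sketch below uses Python's standard `heapq` module (a min-heap) and a hypothetical function `median_by_popping`. Since only the top of the heap is accessible, the only way to reach the middle is to pop about half of the elements first, which costs O(N log N) per query and consumes the heap in the process.

```python
import heapq


def median_by_popping(nums):
    """Find the median with a single min-heap by popping (hypothetical helper)."""
    heap = list(nums)       # work on a copy; popping consumes the heap
    heapq.heapify(heap)
    n = len(heap)
    # Only the smallest element is accessible, so discard
    # everything that ranks below the middle
    for _ in range((n - 1) // 2):
        heapq.heappop(heap)
    lower = heapq.heappop(heap)
    if n % 2 == 1:
        return float(lower)
    upper = heapq.heappop(heap)
    return (lower + upper) / 2


print(median_by_popping([5, 1, 4, 2, 3]))  # 3.0
print(median_by_popping([1, 2]))           # 1.5
```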

As we can see, finding a median is quite challenging. Despite our best efforts, we haven't found an efficient approach. Let's now look at the solution, which is quite ingenious.

Solution Approach