Implementing Median Algorithm with Two Binary Heaps
This article will resolve
LeetCode | Difficulty |
---|---|
295. Find Median from Data Stream | 🔴 |
Prerequisites
Before reading this article, you should first learn:
If you are given an array and asked to find the median, it's straightforward: sort the array. If the array length is odd, the middle element is the median. If the array length is even, the average of the two middle elements is the median.
If the data size is extremely large, sorting may not be practical. In such cases, you can use a probabilistic algorithm: randomly sample a portion of the data, sort it, and calculate the median as an approximation for the entire dataset.
The median algorithm discussed in this article is more challenging and sophisticated, corresponding to LeetCode problem 295, "Find Median from Data Stream":
295. Find Median from Data Stream | LeetCode | 🔴
The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.
- For example, for
arr = [2,3,4]
, the median is3
. - For example, for
arr = [2,3]
, the median is(2 + 3) / 2 = 2.5
.
Implement the MedianFinder class:
MedianFinder()
initializes theMedianFinder
object.void addNum(int num)
adds the integernum
from the data stream to the data structure.double findMedian()
returns the median of all elements so far. Answers within10-5
of the actual answer will be accepted.
Example 1:
Input ["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"] [[], [1], [2], [], [3], []] Output [null, null, null, 1.5, null, 2.0] Explanation MedianFinder medianFinder = new MedianFinder(); medianFinder.addNum(1); // arr = [1] medianFinder.addNum(2); // arr = [1, 2] medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2) medianFinder.addNum(3); // arr[1, 2, 3] medianFinder.findMedian(); // return 2.0
Constraints:
-105 <= num <= 105
- There will be at least one element in the data structure before calling
findMedian
. - At most
5 * 104
calls will be made toaddNum
andfindMedian
.
Follow up:
- If all integer numbers from the stream are in the range
[0, 100]
, how would you optimize your solution? - If
99%
of all integer numbers from the stream are in the range[0, 100]
, how would you optimize your solution?
// The problem asks you to design such a class
class MedianFinder {
// Add a number
public void addNum(int num) {}
// Calculate the median of all the numbers currently added
public double findMedian() {}
}
// The problem asks you to design such a class
class MedianFinder {
public:
// add a number
void addNum(int num) {}
// calculate the median of all the numbers added so far
double findMedian() {}
};
# The problem asks you to design such a class
class MedianFinder:
# add a number
def addNum(self, num: int) -> None:
pass
# calculate the median of all currently added numbers
def findMedian(self) -> float:
pass
// The problem asks you to design such a class
type MedianFinder struct {}
// Add a number
func (this *MedianFinder) AddNum(num int) {}
// Calculate the median of all numbers currently added
func (this *MedianFinder) FindMedian() float64 {}
// The problem asks you to design such a class
var MedianFinder = function() {
// Add a number
this.addNum = function(num) {};
// Calculate the median of all numbers currently added
this.findMedian = function() {};
};
Actually, all algorithms related to "streams" are quite challenging. For instance, in my previous article Discussing Random Algorithms in Games, I wrote about how to randomly select an element from a data stream with equal probability. If you haven't encountered this problem before, it's difficult to come up with a solution.
This problem requires calculating the average from a data stream. Let's first consider the conventional approach.
Attempted Analysis
A straightforward solution is to use an array to record all numbers added by addNum
, ensuring the elements in the array are sorted through insertion sort logic. When the findMedian
method is called, the median can be directly calculated using array indices.
However, using an array as the underlying container has obvious issues. While addNum
can use binary search to find the insertion position, the insertion operation requires shifting data, resulting in a worst-case time complexity of O(N).
What about using a linked list? Inserting elements into a linked list is fast, but finding the insertion position requires linear traversal, with a worst-case time complexity still at O(N). Additionally, the findMedian
method also needs to traverse to find the middle index, resulting in a worst-case time complexity of O(N).
So, how about using a balanced binary tree, where the complexity for insertions, deletions, and searches is O(logN)? Would that work?
For example, using Java's TreeSet
container, which is based on a Red-Black tree, addNum
can insert directly, and findMedian
can deduce the rank of the median element based on the current number of elements.
Unfortunately, this still doesn't work due to two issues:
First, TreeSet
is a type of Set
that does not allow duplicate elements, but our data stream might include duplicate data, and calculating the median requires considering these duplicates.
Second, TreeSet
does not provide an API to quickly calculate elements by rank. For instance, if I want to find the 5th largest element in a TreeSet
, there is no ready-made method to achieve this.
Info
If you were to implement a method select(int index)
to calculate the corresponding element by rank in a binary search tree, how would you design it? Think about it, and I will post the answer in the comments section.
Aside from balanced binary trees, is there any commonly used data structure that is dynamically ordered? What about a priority queue (binary heap)?
It seems that won't work either, because a priority queue is a restricted data structure that only allows adding/deleting elements from the top. Our addNum
method can insert elements from the top, but the findMedian
function needs to retrieve from the middle of the data, a feature that priority queues cannot provide.
As we can see, finding a median is quite challenging. Despite our best efforts, we haven't found an efficient approach. Let's now look at the solution, which is quite ingenious.