Implementing Median Algorithm with Two Binary Heaps
Note
Now all the plugins has supported English. I'm still improving the website...
This article will resolve
LeetCode | Difficulty |
---|---|
295. Find Median from Data Stream | 🔴 |
Prerequisites
Before reading this article, you should first learn:
If you are given an array and asked to find the median, it's straightforward. Just sort the array. If the array length is odd, the middle element is the median. If the array length is even, the average of the two middle elements is the median.
However, if the data set is extremely large, sorting becomes impractical. In such cases, you can use probabilistic algorithms. Randomly sample a portion of the data, sort it, and use the median of this sample as the median for the entire dataset.
The median algorithm discussed in this article is more challenging and intricate. It is related to LeetCode problem #295, "Find Median from Data Stream":
295. Find Median from Data Stream | LeetCode |
The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.
- For example, for
arr = [2,3,4]
, the median is3
. - For example, for
arr = [2,3]
, the median is(2 + 3) / 2 = 2.5
.
Implement the MedianFinder class:
MedianFinder()
initializes theMedianFinder
object.void addNum(int num)
adds the integernum
from the data stream to the data structure.double findMedian()
returns the median of all elements so far. Answers within10-5
of the actual answer will be accepted.
Example 1:
Input ["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"] [[], [1], [2], [], [3], []] Output [null, null, null, 1.5, null, 2.0] Explanation MedianFinder medianFinder = new MedianFinder(); medianFinder.addNum(1); // arr = [1] medianFinder.addNum(2); // arr = [1, 2] medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2) medianFinder.addNum(3); // arr[1, 2, 3] medianFinder.findMedian(); // return 2.0
Constraints:
-105 <= num <= 105
- There will be at least one element in the data structure before calling
findMedian
. - At most
5 * 104
calls will be made toaddNum
andfindMedian
.
Follow up:
- If all integer numbers from the stream are in the range
[0, 100]
, how would you optimize your solution? - If
99%
of all integer numbers from the stream are in the range[0, 100]
, how would you optimize your solution?
// The problem asks you to design such a class
class MedianFinder {
// Add a number
public void addNum(int num) {}
// Calculate the median of all the numbers currently added
public double findMedian() {}
}
// The problem asks you to design such a class
class MedianFinder {
public:
// add a number
void addNum(int num) {}
// calculate the median of all the numbers added so far
double findMedian() {}
};
# The problem asks you to design such a class
class MedianFinder:
# add a number
def addNum(self, num: int) -> None:
pass
# calculate the median of all currently added numbers
def findMedian(self) -> float:
pass
// The problem asks you to design such a class
type MedianFinder struct {}
// Add a number
func (this *MedianFinder) AddNum(num int) {}
// Calculate the median of all numbers currently added
func (this *MedianFinder) FindMedian() float64 {}
// The problem asks you to design such a class
var MedianFinder = function() {
// Add a number
this.addNum = function(num) {};
// Calculate the median of all numbers currently added
this.findMedian = function() {};
};
Actually, all algorithms related to "streams" are quite challenging. For example, in my previous article Discussing Random Algorithms in Games, I wrote about how to randomly select an element from a data stream with equal probability. If you haven't encountered this problem before, it's hard to think of a solution.
This problem requires calculating the average from a data stream. Let's first consider the conventional approach.
Attempted Analysis
A straightforward solution is to use an array to record all numbers added by addNum
, ensuring the elements in the array are sorted through insertion sort logic. When the findMedian
method is called, the median can be directly calculated using array indices.
However, using an array as the underlying container has obvious issues. While addNum
can use binary search to find the insertion position, the insertion operation requires shifting data, resulting in a worst-case time complexity of O(N).
What about using a linked list? Inserting elements in a linked list is fast, but finding the insertion position requires linear traversal, with a worst-case time complexity of O(N). Additionally, the findMedian
method also needs to traverse to find the middle index, again with a worst-case time complexity of O(N).
So, how about using a balanced binary tree, where the complexity for insertions, deletions, and searches is O(logN)? Would that work?
For instance, using Java's TreeSet
container, which is based on a Red-Black tree, addNum
can insert directly, and findMedian
can deduce the rank of the median element based on the current number of elements.
Unfortunately, this still doesn't work due to two issues:
First, TreeSet
is a type of Set
that does not allow duplicate elements, but our data stream might include duplicate data, and calculating the median requires considering these duplicates.
Second, TreeSet
does not provide an API to quickly calculate elements by rank. For example, if I want to find the 5th largest element in a TreeSet
, there is no ready-made method to achieve this.
Info
If you were to implement a method select(int index)
in a binary search tree to calculate the corresponding element by rank, how would you design it? Think about it, and I'll post the answer in the comments section.
Besides balanced binary trees, is there any other commonly used data structure that is dynamically ordered? How about a priority queue (binary heap)?
It seems that won't work either, because a priority queue is a restricted data structure that only allows adding/removing elements from the top. Our addNum
method can insert elements at the top, but the findMedian
function needs to extract from the middle of the data, which a priority queue cannot provide.
As you can see, finding a median is quite challenging. Despite our best efforts, we haven't found an efficient approach. Let's directly look at the solution, which is quite clever.