# 295. Find Median from Data Stream

Design a data structure that supports the following two operations:

void addNum(int num) - Add a integer number from the data stream to the data structure.
double findMedian() - Return the median of all elements so far.

Example:

addNum(1)
addNum(2)
findMedian() -> 1.5
addNum(3) 
findMedian() -> 2

Follow up:

If all integer numbers from the stream are between 0 and 100, how would you optimize it?
If 99% of all integer numbers from the stream are between 0 and 100, how would you optimize it?

# Solution

Approach 1: Sorting.

Approach 2: use 2 heaps.

Alternative approaches:

  • keep a sorted list: O(N) to add (copy smaller portion, add new num, copy larger portion), O(1) to find median, O(N) space
  • use a self balancing BST: O(logN) to add, O(1) to find median (the root), O(N) space
  • can use reservior sampling to estimate the median

Follow ups:

  • If all nums in range(0, 100), then we can bucket them and keep the cumulative frequencies. O(1) to add (increment count for every bucket >= current num), O(logN) to find median (binary search), O(1) space.
  • if 99% of all nums are in range(0, 100), we can have buckets plus 1 heap per side (smaller than 0 and larger than 100).

# Code (Python)

Approach 2:

class MedianFinder:
    def __init__(self):
        """
        initialize your data structure here.
        """
        # 2 heaps: min_heap stores the larger 1/2 of the data, max_heap stores the smaller 1/2
        self.min_heap, self.max_heap = [], [] 

    def addNum(self, num: int) -> None:
        # O(logN) time, O(N) space
        if self.min_heap and self.min_heap[0] <= num:
            heapq.heappush(self.min_heap, num)
        else:
            heapq.heappush(self.max_heap, -num)
        if len(self.min_heap) - len(self.max_heap) > 1:
            heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))
        elif len(self.max_heap) - len(self.min_heap) > 1:
            heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))

    def findMedian(self) -> float:
        # O(1) time
        if len(self.max_heap) == len(self.min_heap):
            return (self.min_heap[0] - self.max_heap[0]) / 2
        elif len(self.max_heap) > len(self.min_heap):
            return -self.max_heap[0]
        else:
            return self.min_heap[0]

# Code (C++)

Approach 1:

// Sorting.
class MedianFinder {
private:
    vector<int> buf;
public:
    // initialize your data structure here.
    MedianFinder() {}
    
    void addNum(int num) {
        int i = 0;
        for (; i < buf.size(); ++i)
        {
            if (buf[i] >= num)
                break;
        }
        buf.insert(buf.begin()+i, num);
    }
    
    double findMedian() {
        int mid = buf.size() / 2;
        if (buf.size() % 2 != 0)
            return buf[mid];
        else
            return (double)(buf[mid] + buf[mid-1]) / 2;
    }
};
/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder* obj = new MedianFinder();
 * obj->addNum(num);
 * double param_2 = obj->findMedian();
 */

Approach 2:

class MedianFinder {
private:
    priority_queue<int> lo; // max heap.
    priority_queue<int, vector<int>, greater<int>> hi; // min heap.
public:
    /** initialize your data structure here. */
    MedianFinder() {}
    
    void addNum(int num) {
        lo.push(num);
        hi.push(lo.top());
        lo.pop();
        if (lo.size() < hi.size())
        {
            lo.push(hi.top());
            hi.pop();
        }
    }
    
    double findMedian() {
        if (lo.size() > hi.size())
            return lo.top();
        else
            return (lo.top() + hi.top()) / 2.0;
    }
};