数据流 挖掘:
对数据流的两类Query:
滑动窗口:
当窗口的长度大到无法存储于主存内时,或者有太多不同数据流,内存不够存储多个窗口,则需要更有趣的方法。
例子:平均数
方法:
Python
import random # sliding window average: stream = [random.random() for i in range(99)] def slidingWindowAverage(stream, windowSize = 10): window = stream[:windowSize] avg = mean(window) index = windowSize while True: option = raw_input('n for new input, e for exit: ') if option == 'n': j = window.pop(0) i = stream[index] window.append(i) index += 1 avg += float(i-j)/windowSize print 'current average value is {0}'.format(avg) if index >= len(stream) or option == 'e': break return avg avg = slidingWindowAverage(stream) n for new input, e for exit: n current average value is 0.524461680244 n for new input, e for exit: n current average value is 0.530545388643 n for new input, e for exit: n current average value is 0.540916392055 n for new input, e for exit: n current average value is 0.502333068095 n for new input, e for exit: e
Counting Bits 问题:
DGIM 算法:
Query,最近
个数据中有多少个
:
Python
def dgimTimeStamp(stream, windowSize, t = 0, buckets = dict()): # input n to add a new bit # input e to exit program # input a number ot query sizes = [2**i for i in range(int(round(math.log(windowSize,2))))] while True: option = raw_input('e to exit; n to add 1 bit; any number to query ') if option == 'n': # remove too old buckets timestamp = t%windowSize for key in buckets.keys(): if key == timestamp: buckets.pop(key) # adding a new bit in if stream[t] == 1: buckets[timestamp] = 1 t += 1 # buckets cascading updating for size in sizes: if sum(np.array(buckets.values()) == size) == 3: keys = [b for b in buckets.keys() if buckets[b] == size] for i in range(len(keys)): if keys[i] <= timestamp: keys[i] += windowSize keys.sort() tbr = keys[0] if tbr < windowSize: buckets.pop(tbr) else: buckets.pop(tbr - windowSize) tbm = keys[1] if tbm < windowSize: buckets[tbm] *= 2 else: buckets[tbm-windowSize] *= 2 if t >= len(stream): break elif option == 'e': break else: k = int(option) keys = [i%windowSize for i in range(t-k,t)] initial = True s = 0 for key in keys: if key in buckets and initial == False: s += buckets[key] if key in buckets and initial == True: s += buckets[key]*0.5 initial =False print s, t print buckets return buckets, t stream = [1 for i in range(99)] b, t = dgimTimeStamp(stream, 5) e to exit; n to add 1 bit; any number to query n {0: 1} e to exit; n to add 1 bit; any number to query n {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query 1 0.5 2 {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query 2 1.5 2 {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query n {1: 2, 2: 1} e to exit; n to add 1 bit; any number to query n {1: 2, 2: 1, 3: 1} e to exit; n to add 1 bit; any number to query n {1: 2, 3: 2, 4: 1} e to exit; n to add 1 bit; any number to query n {0: 1, 1: 2, 3: 2, 4: 1} e to exit; n to add 1 bit; any number to query n {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query 2 2.0 7 {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query 4 4.0 7 {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query e