转载

挖掘数据流 Mining Data Stream

数据流挖掘：

输入元组通过一个或多个端口快速地进入
系统无法存储整个数据流
在有限内存的条件下对数据流做出一些计算

对数据流的两类Query：

Ad-hoc queries，例如当前数据流中的最大值是多少
standing queries，例如每当数据流中出现了新的最大值就汇报一次

滑动窗口：

Query时总是针对数据流中固定长度N的一个窗口（最新进入系统的N个数据）
或者：最近T时间内进入系统的数据

当窗口的长度大到无法存储于主存内时，或者有太多不同数据流，内存不够存储多个窗口，则需要更有趣的方法。

例子：平均数

数据流为一个个整数
窗口尺寸
standing query: 当前窗口中数字的平均值为多少？

方法：

对于前个数据，计算其平均值
对于每一个新来的数据，找到当前窗口中最旧的数据，更新平均值为

Python

import random # sliding window average: stream = [random.random() for i in range(99)] def slidingWindowAverage(stream, windowSize = 10):  window = stream[:windowSize]  avg = mean(window)  index = windowSize  while True:   option = raw_input('n for new input, e for exit: ')   if option == 'n':    j = window.pop(0)    i = stream[index]    window.append(i)    index += 1    avg += float(i-j)/windowSize    print 'current average value is {0}'.format(avg)   if index >= len(stream) or option == 'e':    break  return avg avg = slidingWindowAverage(stream) n for new input, e for exit: n current average value is 0.524461680244 n for new input, e for exit: n current average value is 0.530545388643 n for new input, e for exit: n current average value is 0.540916392055 n for new input, e for exit: n current average value is 0.502333068095 n for new input, e for exit: e

Counting Bits 问题：

给定一个数据为和的数据流，希望回答的问题是：最近k个数据中有多少个，
如果存储最近的个数据，很容易回答。
假设均很大，并且有非常多数据流同时流入并且需要计算，无法这样做。

DGIM 算法：

对每一个数据进行timestamp
对窗口内的数据进行summary为一个个bucket
每个bucket代表窗口内报括了个的一段数据流，记录为(该段的末尾位置, ）为该bucket的大小
如果同样大小的bucket出现了次，就进行合并，丢掉时间戳较老的，更新较新的那个味(该段的末尾位置, )

挖掘数据流 Mining Data Stream

Query，最近挖掘数据流 Mining Data Stream

个数据中有多少个

挖掘数据流 Mining Data Stream

：

只看结尾时间在最近k时间以内的bucket
将所有bucket的大小加起来，最旧的一个bucket的值只算一半
这样的误差，最大可能会有

Python

def dgimTimeStamp(stream, windowSize, t = 0, buckets = dict()):  # input n to add a new bit  # input e to exit program  # input a number ot query  sizes = [2**i for i in range(int(round(math.log(windowSize,2))))]  while True:   option = raw_input('e to exit; n to add 1 bit; any number to query  ')   if option == 'n':    # remove too old buckets    timestamp = t%windowSize    for key in buckets.keys():     if key == timestamp:      buckets.pop(key)    # adding a new bit in    if stream[t] == 1:     buckets[timestamp] = 1    t += 1    # buckets cascading updating    for size in sizes:     if sum(np.array(buckets.values()) == size) == 3:      keys = [b for b in buckets.keys() if buckets[b] == size]      for i in range(len(keys)):       if keys[i] <= timestamp:        keys[i] += windowSize      keys.sort()      tbr = keys[0]      if tbr < windowSize:       buckets.pop(tbr)      else:       buckets.pop(tbr - windowSize)      tbm = keys[1]      if tbm < windowSize:       buckets[tbm] *= 2      else:       buckets[tbm-windowSize] *= 2    if t >= len(stream):     break   elif option == 'e':    break   else:    k = int(option)    keys = [i%windowSize for i in range(t-k,t)]    initial = True    s = 0    for key in keys:     if key in buckets and initial == False:      s += buckets[key]     if key in buckets and initial == True:      s += buckets[key]*0.5      initial =False    print s, t   print buckets  return buckets, t stream = [1 for i in range(99)] b, t = dgimTimeStamp(stream, 5) e to exit; n to add 1 bit; any number to query  n {0: 1} e to exit; n to add 1 bit; any number to query  n {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query  1 0.5 2 {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query  2 1.5 2 {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query  n {1: 2, 2: 1} e to exit; n to add 1 bit; any number to query  n {1: 2, 2: 1, 3: 1} e to exit; n to add 1 bit; any number to query  n {1: 2, 3: 2, 4: 1} e to exit; n to add 1 bit; any number to query  n {0: 1, 1: 2, 3: 2, 4: 1} e to exit; n to add 1 bit; any number to query  n {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query  2 2.0 7 {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query  4 4.0 7 {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query  e

正文到此结束