Spark: sample

By | December 31, 2018

Copyright notice: this is an original post by the author; do not repost without permission. https://blog.csdn.net/hochoy/article/details/80599234

Source code:

/**
 * Return a sampled subset of this RDD.
 *
 * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
 * @param fraction expected size of the sample as a fraction of this RDD's size
 *  without replacement: probability that each element is chosen; fraction must be [0, 1]
 *  with replacement: expected number of times each element is chosen; fraction must be >= 0
 * @param seed seed for the random number generator
 */
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}
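The two samplers chosen above differ in how they treat each element: `BernoulliSampler` (without replacement) keeps each element independently with probability `fraction`, while `PoissonSampler` (with replacement) emits each element a Poisson(`fraction`)-distributed number of times. The following plain-Scala sketch of that per-element behavior makes the distinction concrete; the names `bernoulliSample` and `poissonSample` are my own illustration, not Spark APIs:

```scala
import scala.util.Random

object SamplerSketch {
  // Bernoulli (without replacement): keep each element
  // independently with probability `fraction`.
  def bernoulliSample[T](xs: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    xs.filter(_ => rng.nextDouble() < fraction)
  }

  // Poisson (with replacement): each element appears a
  // Poisson(fraction)-distributed number of times (Knuth's method).
  def poissonSample[T](xs: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    def poisson(lambda: Double): Int = {
      val l = math.exp(-lambda)
      var k = 0
      var p = rng.nextDouble()
      while (p > l) { k += 1; p *= rng.nextDouble() }
      k
    }
    xs.flatMap(x => Seq.fill(poisson(fraction))(x))
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 9
    // Without replacement: a subset, no duplicates.
    println(bernoulliSample(data, 0.6, 42L).mkString(","))
    // With replacement: duplicates are possible.
    println(poissonSample(data, 0.6, 42L).mkString(","))
  }
}
```

Spark applies the same per-element logic partition-wise via `PartitionwiseSampledRDD`, which is why `sample` returns an approximate, not exact, fraction of the data.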

Parameters:

withReplacement: whether an element is returned to the pool after being sampled.
true: sampling with replacement; the returned subset may contain duplicates, since an element can be drawn multiple times;
false: sampling without replacement; the returned subset contains no duplicates.

fraction: the expected size of the sample as a fraction of this RDD's size. Without replacement it is the probability that each element is chosen and must be in [0, 1]; with replacement it is the expected number of times each element is chosen and must be >= 0.

seed: seed for the random number generator.
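Because the seed fully determines the random draws, calling `sample` with a fixed seed is reproducible, while omitting it (the default is `Utils.random.nextLong`) gives a different sample each run. A plain-Scala sketch of the idea; `keepMask` is a hypothetical helper of mine, not a Spark API, mimicking the per-element keep/drop decision:

```scala
import scala.util.Random

object SeedDemo {
  // Per-element keep/drop decisions driven by a seeded RNG:
  // the same seed always yields the same mask, hence the same sample.
  def keepMask(n: Int, fraction: Double, seed: Long): Seq[Boolean] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    // Same seed -> identical decisions.
    println(keepMask(9, 0.6, 1L) == keepMask(9, 0.6, 1L))
    // Different seeds usually produce different decisions.
    println(keepMask(9, 0.6, 1L) == keepMask(9, 0.6, 2L))
  }
}
```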

Example:

import org.apache.spark.{SparkConf, SparkContext}

object SampleDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName(this.getClass.getName)
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
    // Without replacement: no duplicates. Output varies per run, e.g. 2,3,6,8,9
    println(rdd1.sample(false, 0.6).collect().mkString(","))
    // With replacement: duplicates possible. Output varies per run, e.g. 2,4,4,6
    println(rdd1.sample(true, 0.6).collect().mkString(","))
    sc.stop()
  }
}

