This page collects typical usage examples of the Python function pyspark.mllib.common.callMLlibFunc. If you have been wondering exactly how callMLlibFunc is used in practice, the curated examples below should help.
The following 20 code examples of callMLlibFunc are shown, sorted by popularity by default. You can upvote the examples you like or find useful; your votes help the system recommend better Python code samples.
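All of the examples below funnel through the same small bridge: callMLlibFunc looks up a method by name on the JVM-side PythonMLLibAPI object and converts arguments and results between Python and Java. The sketch below paraphrases the helper from pyspark/mllib/common.py; treat it as an approximation and check your Spark version's source for the exact definition.

# Approximate sketch of the bridge (paraphrased from pyspark/mllib/common.py).
from pyspark import SparkContext
from pyspark.mllib.common import _py2java, _java2py

def callMLlibFunc_sketch(name, *args):
    """Call the named method on the JVM-side PythonMLLibAPI."""
    sc = SparkContext.getOrCreate()
    # Look up the named method on the JVM helper object.
    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    # Convert each Python argument (RDDs, vectors, primitives) to its
    # Java counterpart, invoke the JVM method, and convert the result back.
    java_args = [_py2java(sc, a) for a in args]
    return _java2py(sc, api(*java_args))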
Example 1: corr

def corr(x, y=None, method=None):
    """
    Compute the correlation (matrix) for the input RDD(s) using the
    specified method.
    Methods currently supported: I{pearson (default), spearman}.

    If a single RDD of Vectors is passed in, a correlation matrix
    comparing the columns in the input RDD is returned. Use C{method=}
    to specify the method to be used for a single RDD input.
    If two RDDs of floats are passed in, a single float is returned.

    >>> x = sc.parallelize([1.0, 0.0, -2.0], 2)
    >>> y = sc.parallelize([4.0, 5.0, 3.0], 2)
    >>> zeros = sc.parallelize([0.0, 0.0, 0.0], 2)
    >>> abs(Statistics.corr(x, y) - 0.6546537) < 1e-7
    True
    >>> Statistics.corr(x, y) == Statistics.corr(x, y, "pearson")
    True
    >>> Statistics.corr(x, y, "spearman")
    0.5
    >>> from math import isnan
    >>> isnan(Statistics.corr(x, zeros))
    True
    >>> from pyspark.mllib.linalg import Vectors
    >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
    ...                       Vectors.dense([6, 7, 0, 8]), Vectors.dense([9, 0, 0, 1])])
    >>> pearsonCorr = Statistics.corr(rdd)
    >>> print(str(pearsonCorr).replace('nan', 'NaN'))
    [[ 1.          0.05564149         NaN  0.40047142]
     [ 0.05564149  1.                 NaN  0.91359586]
     [        NaN         NaN  1.                 NaN]
     [ 0.40047142  0.91359586         NaN  1.        ]]
    >>> spearmanCorr = Statistics.corr(rdd, method="spearman")
    >>> print(str(spearmanCorr).replace('nan', 'NaN'))
    [[ 1.          0.10540926         NaN  0.4       ]
     [ 0.10540926  1.                 NaN  0.9486833 ]
     [        NaN         NaN  1.                 NaN]
     [ 0.4         0.9486833          NaN  1.        ]]
    >>> try:
    ...     Statistics.corr(rdd, "spearman")
    ...     print("Method name as second argument without 'method=' shouldn't be allowed.")
    ... except TypeError:
    ...     pass
    """
    # Check inputs to determine whether a single value or a matrix is needed for output.
    # Since it's legal for users to use the method name as the second argument, we need to
    # check if y is used to specify the method name instead.
    if type(y) == str:
        raise TypeError("Use 'method=' to specify method name.")
    if not y:
        return callMLlibFunc("corr", x.map(_convert_to_vector), method).toArray()
    else:
        return callMLlibFunc("corr", x.map(float), y.map(float), method)

Developer: BViki, Project: spark, Lines: 54, Source: stat.py
Example 2: __init__

def __init__(self, predictionAndLabels):
    sc = predictionAndLabels.ctx
    sql_ctx = SQLContext.getOrCreate(sc)
    df = sql_ctx.createDataFrame(predictionAndLabels,
                                 schema=sql_ctx._inferSchema(predictionAndLabels))
    java_model = callMLlibFunc("newRankingMetrics", df._jdf)
    super(RankingMetrics, self).__init__(java_model)

Developer: advancedxy, Project: spark, Lines: 7, Source: evaluation.py
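To see where this constructor fits, here is a minimal usage sketch (assuming an active SparkContext named sc): each RDD element pairs a predicted ranking with the ground-truth relevant items.

# Minimal usage sketch; assumes an active SparkContext `sc`.
from pyspark.mllib.evaluation import RankingMetrics

predictionAndLabels = sc.parallelize([
    ([1, 6, 2, 7, 8], [1, 2, 3, 4, 5]),   # (predicted ranking, relevant items)
    ([4, 1, 5, 6, 2], [1, 2, 3]),
])
metrics = RankingMetrics(predictionAndLabels)  # triggers callMLlibFunc("newRankingMetrics", ...)
print(metrics.precisionAt(2))          # precision@2
print(metrics.meanAveragePrecision)    # mean average precision over all queries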
Example 3: fit

def fit(self, data):
    """
    Computes a [[PCAModel]] that contains the principal components of the
    input vectors.

    :param data: source vectors
    """
    jmodel = callMLlibFunc("fitPCA", self.k, data)
    return PCAModel(jmodel)

Developer: ChenZhongPu, Project: Simba, Lines: 7, Source: feature.py
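A minimal usage sketch of the surrounding PCA class, assuming an active SparkContext sc and a handful of dense vectors:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from pyspark.mllib.feature import PCA
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([
    Vectors.dense([1.0, 0.0, 7.0]),
    Vectors.dense([2.0, 0.0, 3.0]),
    Vectors.dense([4.0, 0.0, 1.0]),
])
model = PCA(k=2).fit(data)         # fit() calls callMLlibFunc("fitPCA", k, data)
projected = model.transform(data)  # project each vector onto the top-2 components
print(projected.collect())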
Example 4: train

@classmethod
def train(cls, rdd, k, convergenceTol=1e-3, maxIterations=100, seed=None):
    """Train a Gaussian Mixture clustering model."""
    weight, mu, sigma = callMLlibFunc("trainGaussianMixture",
                                      rdd.map(_convert_to_vector), k,
                                      convergenceTol, maxIterations, seed)
    mvg_obj = [MultivariateGaussian(mu[i], sigma[i]) for i in range(k)]
    return GaussianMixtureModel(weight, mvg_obj)

Developer: Checkroth, Project: apriori, Lines: 7, Source: clustering.py
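A minimal usage sketch, assuming an active SparkContext sc and two well-separated point clouds:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from numpy import array
from pyspark.mllib.clustering import GaussianMixture

data = sc.parallelize([
    array([-0.1, -0.05]), array([-0.01, -0.1]),
    array([0.9, 0.8]), array([0.75, 0.935]),
])
model = GaussianMixture.train(data, k=2, convergenceTol=1e-3, seed=10)
print(model.weights)                  # mixing weights of the two components
print(model.predict(data).collect())  # hard cluster assignment per point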
Example 5: train

@classmethod
def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False,
          seed=None):
    """
    Train a matrix factorization model given an RDD of ratings by users
    for a subset of products. The ratings matrix is approximated as the
    product of two lower-rank matrices of a given rank (number of
    features). To solve for these features, ALS is run iteratively with
    a configurable level of parallelism.

    :param ratings:
      RDD of `Rating` or (userID, productID, rating) tuples.
    :param rank:
      Rank of the feature matrices computed (number of features).
    :param iterations:
      Number of iterations of ALS.
      (default: 5)
    :param lambda_:
      Regularization parameter.
      (default: 0.01)
    :param blocks:
      Number of blocks used to parallelize the computation. A value
      of -1 will use an auto-configured number of blocks.
      (default: -1)
    :param nonnegative:
      A value of True will solve least-squares with nonnegativity
      constraints.
      (default: False)
    :param seed:
      Random seed for the initial matrix factorization model. A value
      of None will use system time as the seed.
      (default: None)
    """
    model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
                          lambda_, blocks, nonnegative, seed)
    return MatrixFactorizationModel(model)

Developer: 0xqq, Project: spark, Lines: 35, Source: recommendation.py
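A minimal usage sketch, assuming an active SparkContext sc and a tiny explicit-feedback ratings set:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize([
    Rating(1, 1, 5.0), Rating(1, 2, 1.0),
    Rating(2, 1, 2.0), Rating(2, 2, 5.0),
])
model = ALS.train(ratings, rank=10, iterations=5)  # callMLlibFunc("trainALSModel", ...)
print(model.predict(2, 2))               # predicted rating of product 2 by user 2
print(model.recommendProducts(1, 1))     # top-1 product recommendation for user 1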
Example 6: train

@classmethod
def train(cls, rdd, k, maxIterations=100, initMode="random"):
    """
    :param rdd:
      An RDD of (i, j, s\ :sub:`ij`\) tuples representing the
      affinity matrix, which is the matrix A in the PIC paper. The
      similarity s\ :sub:`ij`\ must be nonnegative. This is a symmetric
      matrix and hence s\ :sub:`ij`\ = s\ :sub:`ji`\. For any (i, j) with
      nonzero similarity, there should be either (i, j, s\ :sub:`ij`\) or
      (j, i, s\ :sub:`ji`\) in the input. Tuples with i = j are ignored,
      because it is assumed s\ :sub:`ij`\ = 0.0.
    :param k:
      Number of clusters.
    :param maxIterations:
      Maximum number of iterations of the PIC algorithm.
      (default: 100)
    :param initMode:
      Initialization mode. This can be either "random" to use
      a random vector as vertex properties, or "degree" to use
      normalized sum similarities.
      (default: "random")
    """
    model = callMLlibFunc(
        "trainPowerIterationClusteringModel", rdd.map(_convert_to_vector),
        int(k), int(maxIterations), initMode)
    return PowerIterationClusteringModel(model)

Developer: ChineseDr, Project: spark, Lines: 25, Source: clustering.py
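A minimal usage sketch, assuming an active SparkContext sc and a small symmetric affinity matrix given as (i, j, s_ij) tuples:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from pyspark.mllib.clustering import PowerIterationClustering

# (i, j, s_ij) entries of a symmetric, nonnegative affinity matrix.
similarities = sc.parallelize([
    (0, 1, 0.9), (0, 2, 0.1),
    (1, 2, 0.1), (2, 3, 0.9),
])
model = PowerIterationClustering.train(similarities, k=2, maxIterations=10)
for a in model.assignments().collect():
    print(a.id, a.cluster)   # vertex id and its assigned cluster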
Example 7: _train

@classmethod
def _train(cls, data, algo, categoricalFeaturesInfo,
           loss, numIterations, learningRate, maxDepth, maxBins):
    first = data.first()
    assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
    model = callMLlibFunc("trainGradientBoostedTreesModel", data, algo, categoricalFeaturesInfo,
                          loss, numIterations, learningRate, maxDepth, maxBins)
    return GradientBoostedTreesModel(model)

Developer: 0xqq, Project: spark, Lines: 7, Source: tree.py
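_train is private; user code reaches it through the public GradientBoostedTrees.trainClassifier or trainRegressor wrappers. A minimal sketch of the classifier path, assuming an active SparkContext sc:

# Minimal usage sketch via the public wrapper; assumes an active SparkContext `sc`.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees

data = sc.parallelize([
    LabeledPoint(0.0, [0.0]), LabeledPoint(0.0, [1.0]),
    LabeledPoint(1.0, [2.0]), LabeledPoint(1.0, [3.0]),
])
model = GradientBoostedTrees.trainClassifier(
    data, categoricalFeaturesInfo={}, numIterations=10)
print(model.predict([2.5]))   # predict the label for a single feature vector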
Example 8: logNormalRDD

def logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None):
    """
    Generates an RDD comprised of i.i.d. samples from the log normal
    distribution with the input mean and standard deviation.

    :param sc: SparkContext used to create the RDD.
    :param mean: mean for the log normal distribution
    :param std: standard deviation for the log normal distribution
    :param size: Size of the RDD.
    :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
    :param seed: Random seed (default: a random long integer).
    :return: RDD of float comprised of i.i.d. samples ~ log N(mean, std).

    >>> from math import sqrt, exp
    >>> mean = 0.0
    >>> std = 1.0
    >>> expMean = exp(mean + 0.5 * std * std)
    >>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
    >>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> abs(stats.mean() - expMean) < 0.5
    True
    >>> abs(stats.stdev() - expStd) < 0.5
    True
    """
    return callMLlibFunc("logNormalRDD", sc._jsc, float(mean), float(std),
                         size, numPartitions, seed)

Developer: 01azzerghhuy, Project: spark, Lines: 30, Source: random.py
Example 9: update

def update(self, data, decayFactor, timeUnit):
    """Update the centroids according to the given data.

    :param data:
      RDD with new data for the model update.
    :param decayFactor:
      Forgetfulness of the previous centroids.
    :param timeUnit:
      Can be "batches" or "points". If "points", the decay factor
      is raised to the power of the number of new points; if
      "batches", the decay factor is used as is.
    """
    if not isinstance(data, RDD):
        raise TypeError("data should be an RDD, got %s." % type(data))
    data = data.map(_convert_to_vector)
    decayFactor = float(decayFactor)
    if timeUnit not in ["batches", "points"]:
        raise ValueError(
            "timeUnit should be 'batches' or 'points', got %s." % timeUnit)
    vectorCenters = [_convert_to_vector(center) for center in self.centers]
    updatedModel = callMLlibFunc(
        "updateStreamingKMeansModel", vectorCenters, self._clusterWeights,
        data, decayFactor, timeUnit)
    self.centers = array(updatedModel[0])
    self._clusterWeights = list(updatedModel[1])
    return self

Developer: 11wzy001, Project: spark, Lines: 26, Source: clustering.py
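A minimal usage sketch, assuming an active SparkContext sc, two initial centers, and one small batch of new points:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from pyspark.mllib.clustering import StreamingKMeansModel

initCenters = [[0.0, 0.0], [1.0, 1.0]]
initWeights = [1.0, 1.0]
model = StreamingKMeansModel(initCenters, initWeights)
batch = sc.parallelize([[-0.1, -0.1], [1.1, 0.9]])
model = model.update(batch, decayFactor=1.0, timeUnit="batches")
print(model.centers)   # centroids nudged toward the new batch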
Example 10: _train

@classmethod
def _train(cls, data, algo, numClasses, categoricalFeaturesInfo, numTrees,
           featureSubsetStrategy, impurity, maxDepth, maxBins, seed):
    first = data.first()
    assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
    if featureSubsetStrategy not in cls.supportedFeatureSubsetStrategies:
        raise ValueError("unsupported featureSubsetStrategy: %s" % featureSubsetStrategy)
    if seed is None:
        seed = random.randint(0, 1 << 30)
    model = callMLlibFunc("trainRandomForestModel", data, algo, numClasses,
                          categoricalFeaturesInfo, numTrees, featureSubsetStrategy,
                          impurity, maxDepth, maxBins, seed)
    return RandomForestModel(model)

Developer: CatKyo, Project: SparkInfoSystem, Lines: 33, Source: tree.py
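As with the gradient-boosted trees example, _train is reached through the public RandomForest.trainClassifier / trainRegressor wrappers. A minimal sketch, assuming an active SparkContext sc:

# Minimal usage sketch via the public wrapper; assumes an active SparkContext `sc`.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

data = sc.parallelize([
    LabeledPoint(0.0, [0.0]), LabeledPoint(0.0, [1.0]),
    LabeledPoint(1.0, [2.0]), LabeledPoint(1.0, [3.0]),
])
model = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={}, numTrees=3, seed=42)
print(model.numTrees())       # number of trees in the ensemble
print(model.predict([2.5]))   # majority-vote prediction for one vector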
Example 11: train

@classmethod
def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000):
    """
    Finds the complete set of frequent sequential patterns in the
    input sequences of itemsets.

    :param data:
      The input data set; each element contains a sequence of
      itemsets.
    :param minSupport:
      The minimal support level of the sequential pattern. Any
      pattern that appears more than (minSupport *
      size-of-the-dataset) times will be output.
      (default: 0.1)
    :param maxPatternLength:
      The maximal length of the sequential pattern. Any pattern
      longer than maxPatternLength will not be output.
      (default: 10)
    :param maxLocalProjDBSize:
      The maximum number of items (including delimiters used in the
      internal storage format) allowed in a projected database before
      local processing. If a projected database exceeds this size,
      another iteration of distributed prefix growth is run.
      (default: 32000000)
    """
    model = callMLlibFunc("trainPrefixSpanModel",
                          data, minSupport, maxPatternLength, maxLocalProjDBSize)
    return PrefixSpanModel(model)

Developer: 0xqq, Project: spark, Lines: 27, Source: fpm.py
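A minimal usage sketch, assuming an active SparkContext sc; each element of the dataset is a sequence of itemsets:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from pyspark.mllib.fpm import PrefixSpan

# Each element is a sequence of itemsets.
data = [
    [["a", "b"], ["c"]],
    [["a"], ["c", "b"], ["a", "b"]],
    [["a", "b"], ["e"]],
    [["f"]],
]
rdd = sc.parallelize(data, 2)
model = PrefixSpan.train(rdd)
print(sorted(model.freqSequences().collect()))   # (sequence, freq) pairs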
Example 12: loadLabeledPoints

def loadLabeledPoints(sc, path, minPartitions=None):
    """
    Load labeled points saved using RDD.saveAsTextFile.

    :param sc: Spark context
    :param path: file or directory path in any Hadoop-supported file
                 system URI
    :param minPartitions: min number of partitions
    :return: labeled data stored as an RDD of LabeledPoint

    >>> from tempfile import NamedTemporaryFile
    >>> from pyspark.mllib.util import MLUtils
    >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])), \
                    LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
    >>> tempFile = NamedTemporaryFile(delete=True)
    >>> tempFile.close()
    >>> sc.parallelize(examples, 1).saveAsTextFile(tempFile.name)
    >>> loaded = MLUtils.loadLabeledPoints(sc, tempFile.name).collect()
    >>> type(loaded[0]) == LabeledPoint
    True
    >>> print(examples[0])
    (1.1,(3,[0,2],[-1.23,4.56e-07]))
    >>> type(examples[1]) == LabeledPoint
    True
    >>> print(examples[1])
    (0.0,[1.01,2.02,3.03])
    """
    minPartitions = minPartitions or min(sc.defaultParallelism, 2)
    return callMLlibFunc("loadLabeledPoints", sc, path, minPartitions)

Developer: BViki, Project: spark, Lines: 29, Source: util.py
Example 13: train

@classmethod
def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||",
          seed=None, initializationSteps=5, epsilon=1e-4, initialModel=None):
    """Train a k-means clustering model."""
    clusterInitialModel = []
    if initialModel is not None:
        if not isinstance(initialModel, KMeansModel):
            raise Exception("initialModel is of " + str(type(initialModel)) +
                            ". It needs to be of <type 'KMeansModel'>")
        clusterInitialModel = [_convert_to_vector(c) for c in initialModel.clusterCenters]
    model = callMLlibFunc("trainKMeansModel", rdd.map(_convert_to_vector), k,
                          maxIterations, runs, initializationMode, seed,
                          initializationSteps, epsilon, clusterInitialModel)
    centers = callJavaFunc(rdd.context, model.clusterCenters)
    return KMeansModel([c.toArray() for c in centers])

Developer: BeforeRain, Project: spark, Lines: 34, Source: clustering.py
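A minimal usage sketch, assuming an active SparkContext sc and four 2-D points:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from numpy import array
from pyspark.mllib.clustering import KMeans

data = sc.parallelize(
    array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2))
model = KMeans.train(data, 2, maxIterations=10, initializationMode="random")
print(model.clusterCenters)        # learned centroids
print(model.predict([0.0, 0.0]))   # cluster index for a single point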
Example 14: gammaRDD

def gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None):
    """
    Generates an RDD comprised of i.i.d. samples from the Gamma
    distribution with the input shape and scale.

    :param sc: SparkContext used to create the RDD.
    :param shape: shape (> 0) parameter for the Gamma distribution
    :param scale: scale (> 0) parameter for the Gamma distribution
    :param size: Size of the RDD.
    :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
    :param seed: Random seed (default: a random long integer).
    :return: RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).

    >>> from math import sqrt
    >>> shape = 1.0
    >>> scale = 2.0
    >>> expMean = shape * scale
    >>> expStd = sqrt(shape * scale * scale)
    >>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> abs(stats.mean() - expMean) < 0.5
    True
    >>> abs(stats.stdev() - expStd) < 0.5
    True
    """
    return callMLlibFunc("gammaRDD", sc._jsc, float(shape),
                         float(scale), size, numPartitions, seed)

Developer: 01azzerghhuy, Project: spark, Lines: 29, Source: random.py
Example 15: train

@classmethod
def train(cls, rdd, k=4, maxIterations=20, minDivisibleClusterSize=1.0, seed=-1888008604):
    """
    Runs the bisecting k-means algorithm and returns the model.

    :param rdd:
      Training points as an `RDD` of `Vector` or convertible
      sequence types.
    :param k:
      The desired number of leaf clusters. The actual number could
      be smaller if there are no divisible leaf clusters.
      (default: 4)
    :param maxIterations:
      Maximum number of iterations allowed to split clusters.
      (default: 20)
    :param minDivisibleClusterSize:
      Minimum number of points (if >= 1.0) or the minimum proportion
      of points (if < 1.0) of a divisible cluster.
      (default: 1)
    :param seed:
      Random seed value for cluster initialization.
      (default: -1888008604, from classOf[BisectingKMeans].getName.##)
    """
    java_model = callMLlibFunc(
        "trainBisectingKMeans", rdd.map(_convert_to_vector),
        k, maxIterations, minDivisibleClusterSize, seed)
    return BisectingKMeansModel(java_model)

Developer: 11wzy001, Project: spark, Lines: 26, Source: clustering.py
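A minimal usage sketch, assuming an active SparkContext sc; the same four 2-D points are split into two leaf clusters:

# Minimal usage sketch; assumes an active SparkContext `sc`.
from numpy import array
from pyspark.mllib.clustering import BisectingKMeans

data = sc.parallelize(
    array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2), 2)
model = BisectingKMeans.train(data, k=2)
print(model.predict([0.0, 0.0]))   # cluster index for a single point
print(model.computeCost(data))     # sum of squared distances to nearest centers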
Example 16: uniformRDD

def uniformRDD(sc, size, numPartitions=None, seed=None):
    """
    Generates an RDD comprised of i.i.d. samples from the
    uniform distribution U(0.0, 1.0).

    To transform the distribution in the generated RDD from U(0.0, 1.0)
    to U(a, b), use
    C{RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)}

    :param sc: SparkContext used to create the RDD.
    :param size: Size of the RDD.
    :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
    :param seed: Random seed (default: a random long integer).
    :return: RDD of float comprised of i.i.d. samples ~ `U(0.0, 1.0)`.

    >>> x = RandomRDDs.uniformRDD(sc, 100).collect()
    >>> len(x)
    100
    >>> max(x) <= 1.0 and min(x) >= 0.0
    True
    >>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
    4
    >>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
    >>> parts == sc.defaultParallelism
    True
    """
    return callMLlibFunc("uniformRDD", sc._jsc, size, numPartitions, seed)

Developer: 01azzerghhuy, Project: spark, Lines: 28, Source: random.py
Example 17: __init__

def __init__(self, rows, numRows=0, numCols=0):
    """
    Note: This docstring is not shown publicly.

    Create a wrapper over a Java RowMatrix.

    Publicly, we require that `rows` be an RDD. However, for
    internal usage, `rows` can also be a Java RowMatrix
    object, in which case we can wrap it directly. This
    assists in clean matrix conversions.

    >>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6]])
    >>> mat = RowMatrix(rows)
    >>> mat_diff = RowMatrix(rows)
    >>> (mat_diff._java_matrix_wrapper._java_model ==
    ...  mat._java_matrix_wrapper._java_model)
    False
    >>> mat_same = RowMatrix(mat._java_matrix_wrapper._java_model)
    >>> (mat_same._java_matrix_wrapper._java_model ==
    ...  mat._java_matrix_wrapper._java_model)
    True
    """
    if isinstance(rows, RDD):
        rows = rows.map(_convert_to_vector)
        java_matrix = callMLlibFunc("createRowMatrix", rows, long(numRows), int(numCols))
    elif (isinstance(rows, JavaObject)
          and rows.getClass().getSimpleName() == "RowMatrix"):
        java_matrix = rows
    else:
        raise TypeError("rows should be an RDD of vectors, got %s" % type(rows))
    self._java_matrix_wrapper = JavaModelWrapper(java_matrix)

Developer: 1574359445, Project: spark, Lines: 34, Source: distributed.py
Example 18: normalRDD

def normalRDD(sc, size, numPartitions=None, seed=None):
    """
    Generates an RDD comprised of i.i.d. samples from the standard normal
    distribution.

    To transform the distribution in the generated RDD from standard normal
    to some other normal N(mean, sigma^2), use
    C{RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)}

    :param sc: SparkContext used to create the RDD.
    :param size: Size of the RDD.
    :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
    :param seed: Random seed (default: a random long integer).
    :return: RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).

    >>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
    >>> stats = x.stats()
    >>> stats.count()
    1000
    >>> abs(stats.mean() - 0.0) < 0.1
    True
    >>> abs(stats.stdev() - 1.0) < 0.1
    True
    """
    return callMLlibFunc("normalRDD", sc._jsc, size, numPartitions, seed)

Developer: 01azzerghhuy, Project: spark, Lines: 26, Source: random.py
Example 19: exponentialVectorRDD

def exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None):
    """
    Generates an RDD comprised of vectors containing i.i.d. samples drawn
    from the Exponential distribution with the input mean.

    :param sc: SparkContext used to create the RDD.
    :param mean: Mean, or 1 / lambda, for the Exponential distribution.
    :param numRows: Number of Vectors in the RDD.
    :param numCols: Number of elements in each Vector.
    :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
    :param seed: Random seed (default: a random long integer).
    :return: RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).

    >>> import numpy as np
    >>> mean = 0.5
    >>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
    >>> mat = np.mat(rdd.collect())
    >>> mat.shape
    (100, 100)
    >>> abs(mat.mean() - mean) < 0.5
    True
    >>> from math import sqrt
    >>> abs(mat.std() - sqrt(mean)) < 0.5
    True
    """
    return callMLlibFunc("exponentialVectorRDD", sc._jsc, float(mean), numRows, numCols,
                         numPartitions, seed)

Developer: 01azzerghhuy, Project: spark, Lines: 27, Source: random.py
Example 20: gammaVectorRDD

def gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None):
    """
    Generates an RDD comprised of vectors containing i.i.d. samples drawn
    from the Gamma distribution.

    :param sc: SparkContext used to create the RDD.
    :param shape: Shape (> 0) of the Gamma distribution
    :param scale: Scale (> 0) of the Gamma distribution
    :param numRows: Number of Vectors in the RDD.
    :param numCols: Number of elements in each Vector.
    :param numPartitions: Number of partitions in the RDD (default: `sc.defaultParallelism`).
    :param seed: Random seed (default: a random long integer).
    :return: RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).

    >>> import numpy as np
    >>> from math import sqrt
    >>> shape = 1.0
    >>> scale = 2.0
    >>> expMean = shape * scale
    >>> expStd = sqrt(shape * scale * scale)
    >>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
    >>> mat.shape
    (100, 100)
    >>> abs(mat.mean() - expMean) < 0.1
    True
    >>> abs(mat.std() - expStd) < 0.1
    True
    """
    return callMLlibFunc("gammaVectorRDD", sc._jsc, float(shape), float(scale),
                         numRows, numCols, numPartitions, seed)

Developer: 01azzerghhuy, Project: spark, Lines: 30, Source: random.py
Note: The pyspark.mllib.common.callMLlibFunc examples above were compiled by 纯净天空 from source-code and documentation hosting platforms such as GitHub and MSDocs. The snippets were selected from open-source projects contributed by various developers; copyright remains with the original authors. Consult the corresponding project's License before distributing or reusing the code. Do not reproduce without permission.