madeleineudell/LowRankModels.jl: LowRankModels.jl is a julia package for modelin ...

原作者: [db:作者] 来自: 网络收藏邀请

开源软件名称：

madeleineudell/LowRankModels.jl

开源软件地址：

https://github.com/madeleineudell/LowRankModels.jl

开源编程语言：

Julia 100.0%

开源软件介绍：

LowRankModels.jl

LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models (GLRMs). GLRMs model a data array by a low rank matrix, and include many well known models in data analysis, such as principal components analysis (PCA), matrix completion, robust PCA, nonnegative matrix factorization, k-means, and many more.

For more information on GLRMs, see our paper. There is a python interface to this package, and a GLRM implementation in the H2O machine learning platform with interfaces in a variety of languages.

LowRankModels.jl makes it easy to mix and match loss functions and regularizers to construct a model suitable for a particular data set. In particular, it supports

using different loss functions for different columns of the data array, which is useful when data types are heterogeneous (e.g., real, boolean, and ordinal columns);
fitting the model to only some of the entries in the table, which is useful for data tables with many missing (unobserved) entries; and
adding offsets and scalings to the model without destroying sparsity, which is useful when the data is poorly scaled.

Installation

To install, just call

Pkg.add("LowRankModels")

at the Julia prompt.

Generalized Low Rank Models

GLRMs form a low rank model for tabular data A with m rows and n columns, which can be input as an array or any array-like object (for example, a data frame). It is fine if only some of the entries have been observed (i.e., the others are missing); the GLRM will only be fit on the !ismissing entries. The desired model is specified by choosing a rank k for the model, an array of loss functions losses, and two regularizers, rx and ry. The data is modeled as X'*Y, where X is a kxm matrix and Y is a kxn matrix. X and Y are found by solving the optimization problem

minimize sum_{(i,j) in obs} losses[j]((X'*Y)[i,j], A[i,j]) + sum_i rx(X[:,i]) + sum_j ry(Y[:,j])

The basic type used by LowRankModels.jl is the GLRM. To form a GLRM, the user specifies

the data A (any AbstractArray, such as an array, a sparse matrix, or a data frame)
the array of loss functions losses
the regularizers rx and ry
the rank k

The user may also specify

the observed entries obs
starting matrices X₀ and Y₀

obs is a list of tuples of the indices of the observed entries in the matrix, and may be omitted if all the entries in the matrix have been observed. If A is a sparse matrix, implicit zeros are interpreted as missing entries by default; see the discussion of sparse matrices below for more details. X₀ and Y₀ are initialization matrices that represent a starting guess for the optimization.

Losses and regularizers must be of type Loss and Regularizer, respectively, and may be chosen from a list of supported losses and regularizers, which include

Losses:

quadratic loss QuadLoss
hinge loss HingeLoss
logistic loss LogisticLoss
Poisson loss PoissonLoss
weighted hinge loss WeightedHingeLoss
l1 loss L1Loss
ordinal hinge loss OrdinalHingeLoss
periodic loss PeriodicLoss
multinomial categorical loss MultinomialLoss
multinomial ordinal (aka ordered logit) loss OrderedMultinomialLoss
bigger-vs-smaller loss BvSLoss (for ordinal data)
one-vs all-loss OvALoss (for categorical data)

The constructors for all the ordinal and categorical losses take as an argument the maximum (or both minimum and maximum) value the variable may take. Using the one-vs-all loss is equivalent to transforming a categorical value to a one-hot vector and using a binary loss on each entry in that vector. Using the bigger-vs-smaller loss is equivalent to transforming the ordinal value to a Boolean vector and using a binary loss on each entry in that vector. By default, the binary loss used is the logistic loss.

Regularizers:

quadratic regularization QuadReg
constrained squared euclidean norm QuadConstraint
l1 regularization OneReg
no regularization ZeroReg
nonnegative constraint NonNegConstraint (e.g., for nonnegative matrix factorization)
1-sparse constraint OneSparseConstraint (e.g., for orthogonal NNMF)
unit 1-sparse constraint UnitOneSparseConstraint (e.g., for k-means)
simplex constraint SimplexConstraint
l1 regularization, combined with nonnegative constraint NonNegOneReg
fix features at values y0 FixedLatentFeaturesConstraint(y0)

Each of these losses and regularizers can be scaled (for example, to increase the importance of the loss relative to the regularizer) by calling mul!(loss, newscale). Users may also implement their own losses and regularizers, or adjust internal parameters of the losses and regularizers; see losses.jl and regularizers.jl for more details.

Example

For example, the following code forms a k-means model with k=5 on the 100x100 matrix A:

using LowRankModels
m, n, k = 100, 100, 5
losses = QuadLoss() # minimize squared distance to cluster centroids
rx = UnitOneSparseConstraint() # each row is assigned to exactly one cluster
ry = ZeroReg() # no regularization on the cluster centroids
glrm = GLRM(A, losses, rx, ry, k)

To fit the model, call

X, Y, ch = fit!(glrm)

which runs an alternating directions proximal gradient method on glrm to find the X and Y minimizing the objective function. (ch gives the convergence history; see Technical details below for more information.)

The losses argument can also be an array of loss functions, with one for each column (in order). For example, for a data set with 3 columns, you could use

losses = Loss[QuadLoss(), LogisticLoss(), HingeLoss()]

Similiarly, the ry argument can be an array of regularizers, with one for each column (in order). For example, for a data set with 3 columns, you could use

ry = Regularizer[QuadReg(1), QuadReg(10), FixedLatentFeaturesConstraint([1.,2.,3.])]

This regularizes the first to columns of Y with ||Y[:,1]||^2 + 10||Y[:,2]||^2 and constrains the third (and last) column of Y to be equal to [1,2,3].

More examples here.

Missing data

If not all entries are present in your data table, just tell the GLRM which observations to fit the model to by listing tuples of their indices in obs, e.g., if obs=[(1,2), (5,3)], exactly two entries have been observed. Then initialize the model using

GLRM(A, losses, rx, ry, k, obs=obs)

If A is a DataFrame and you just want the model to ignore any entry that is missing, you can use

obs = observations(A)

Standard low rank models

Low rank models can easily be used to fit standard models such as PCA, k-means, and nonnegative matrix factorization. The following functions are available:

pca: principal components analysis
qpca: quadratically regularized principal components analysis
rpca: robust principal components analysis
nnmf: nonnegative matrix factorization
k-means: k-means

See the code for usage. Any keyword argument valid for a GLRM object, such as an initial value for X or Y or a list of observations, can also be used with these standard low rank models.

Scaling and offsets

If you choose, LowRankModels.jl can add an offset to your model and scale the loss functions and regularizers so all columns have the same pull in the model. Simply call

glrm = GLRM(A, losses, rx, ry, k, offset=true, scale=true)

This transformation generalizes standardization, a common preprocessing technique applied before PCA. (For more about offsets and scaling, see the code or the paper.)

You can also add offsets and scalings to previously unscaled models:

Add an offset to the model (by applying no regularization to the last row of the matrix Y, and enforcing that the last column of X be all 1s) using

add_offset!(glrm)

Scale the loss functions and regularizers by calling

equilibrate_variance!(glrm)

Scale only the columns associated to QuadLoss or HuberLoss loss functions.

prob_scale!(glrm)

Fitting DataFrames

Perhaps all this sounds like too much work. Perhaps you happen to have a DataFrame df lying around that you'd like a low rank (e.g., k=2) model for. For example,

import RDatasets
df = RDatasets.dataset("psych", "msq")

Never fear! Just call

glrm, labels = GLRM(df, k)
X, Y, ch = fit!(glrm)

This will fit a GLRM with rank k to your data, using a QuadLoss loss for real valued columns, HingeLoss loss for boolean columns, and ordinal HingeLoss loss for integer columns, a small amount of QuadLoss regularization, and scaling and adding an offset to the model as described here. It returns the column labels for the columns it fit, along with the model. Right now, all other data types are ignored. NaN values are treated as missing values (missings) and ignored in the fit.

The full call signature is

function GLRM(df::DataFrame, k::Int;
              losses = Loss[], rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false,
              prob_scale = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

Or to specify a map from data types to losses, define a new loss_map from datatypes to losses (like probabilistic_losses, below):

probabilistic_losses = Dict{Symbol, Any}(
    :real        => QuadLoss,
    :bool        => LogisticLoss,
    :ord         => MultinomialOrdinalLoss,
    :cat         => MultinomialLoss
)

and input an array of datatypes (one for each column of your data frame: GLRM(A, k, datatypes; loss_map = loss_map). The full call signature is

function GLRM(df::DataFrame, k::Int, datatypes::Array{Symbol,1};
              loss_map = probabilistic_losses,
              rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false, prob_scale = true,
              transform_data_to_numbers = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

To fit a data frame with categorical values, you can use the function expand_categoricals! to turn categorical columns into a Boolean column for each level of the categorical variable. For example, expand_categoricals!(df, [:gender]) will replace the gender column with a column corresponding to gender=male, a column corresponding to gender=female, and other columns corresponding to labels outside the gender binary, if they appear in the data set.

You can use the model to get some intuition for the data set. For example, try plotting the columns of Y with the labels; you might see that similar features are close to each other!

Fitting Sparse Matrices

If you have a very large, sparsely observed dataset, then you may want to encode your data as a sparse matrix. By default, LowRankModels interprets the sparse entries of a sparse matrix as missing entries (i.e. NA values). There is no need to pass the indices of observed entries (obs) -- this is done automatically when GLRM(A::SparseMatrixCSC,...) is called. In addition, calling fit!(glrm) when glrm.A is a sparse matrix will use the sparse variant of the proximal gradient descent algorithm, fit!(glrm, SparseProxGradParams(); kwargs...).

If, instead, you'd like to interpret the sparse entries as zeros, rather than missing or NA entries, use:

glrm = GLRM(...; sparse_na=false)

In this case, the dataset is dense in terms of observations, but sparse in terms of nonzero values. Thus, it may make more sense to fit the model with the vanilla proximal gradient descent algorithm, fit!(glrm, ProxGradParams(); kwargs...).

Parallel fitting (experimental)

LowRankModels makes use of Julia v0.5's new multithreading functionality to fit models in parallel. To fit a LowRankModel in parallel using multithreading, simply set the number of threads from the command line before starting Julia: e.g.,

export JULIA_NUM_THREADS=4

Technical details

Optimization

The function fit! uses an alternating directions proximal gradient method to minimize the objective. This method is not guaranteed to converge to the optimum, or even to a local minimum. If your code is not converging or is converging to a model you dislike, there are a number of parameters you can tweak.

Warm start

The algorithm starts with glrm.X and glrm.Y as the initial estimates for X and Y. If these are not given explicitly, they will be initialized randomly. If you have a good guess for a model, try setting them explicitly. If you think that you're getting stuck in a local minimum, try reinitializing your GLRM (so as to construct a new initial random point) and see if the model you obtain improves.

The function fit! sets the fields glrm.X and glrm.Y after fitting the model. This is particularly useful if you want to use the model you generate as a warm start for further iterations. If you prefer to preserve the original glrm.X and glrm.Y (e.g., for cross validation), you should call the function fit, which does not mutate its arguments.

You can even start with an easy-to-optimize loss function, run fit!, change the loss function (glrm.losses = newlosses), and keep going from your warm start by calling fit! again to fit the new loss functions.

Initialization

If you don't have a good guess at a warm start for your model, you might try one of the initializations provided in LowRankModels.

init_svd! initializes the model as the truncated SVD of the matrix of observed entries, with unobserved entries filled in with zeros. This initialization is known to result in provably good solutions for a number of "PCA-like" problems. See our paper for details.
init_kmeanspp! initializes the model using a modification of the kmeans++ algorithm for data sets with missing entries; see our paper for details. This works well for fitting clustering models, and may help in achieving better fits for nonnegative matrix factorization problems as well.
init_nndsvd! initializes the model using a modification of the NNDSVD algorithm as implemented by the NMF package. This modification handles data sets with missing entries by replacing missing entries with zeros. Optionally, by setting the argument max_iters=n with n>0, it will iteratively replace missing entries by their values as imputed by the NNDSVD, and call NNDSVD again on the new matrix. (This procedure is similar to the soft impute method of Mazumder, Hastie and Tibshirani for matrix completion.)

Parameters

As mentioned earlier, LowRankModels uses alternating proximal gradient descent to derive estimates of X and Y. This can be done by two slightly different procedures: (A) compute the full reconstruction, X' * Y, to compute the gradient and objective function; (B) only compute the model estimate for entries of A that are observed. The first method is likely preferred when there are few missing entries for A because of hardware level optimizations (e.g. chunking the operations so they just fit in various caches). The second method is likely preferred when there are many missing entries of A.

To fit with the first (dense) method:

fit!(glrm, ProxGradParams(); kwargs...)

To fit with the second (sparse) method:

fit!(glrm, SparseProxGradParams(); kwargs...)

The first method is used by default if glrm.A is a standard matrix/array. The second method is used by default if glrm.A is a SparseMatrixCSC.

ProxGradParams() and SparseProxGradParams() run these respective methods with the default parameters:

stepsize: The step size controls the speed of convergence. Small step sizes will slow convergence, while large ones will cause divergence. stepsize should be of order 1.
abs_tol: The algorithm stops when the decrease in the objective per iteration is less than abs_tol*length(obs).
rel_tol: The algorithm stops when the decrease in the objective per iteration is less than rel_tol.
max_iter: The algorithm also stops if maximum number of rounds max_iter has been reached.
min_stepsize: The algorithm also stops if stepsize decreases below this limit.
inner_iter: specifies how many proximal gradient steps to take on X before moving on to Y (and vice versa).

The default parameters are: ProxGradParams(stepsize=1.0;max_iter=100,inner_iter=1,abs_tol=0.00001,rel_tol=0.0001,min_stepsize=0.01*stepsize)

Convergence

ch gives the convergence history so that the success of the optimization can be monitored; ch.objective stores the objective values, and ch.times captures the times these objective values were achieved. Try plotting this to see if you just need to increase max_iter to converge to a better model.

Imputation

After fitting a GLRM, you can use it to impute values of A in four different ways:

impute(glrm) gives the maximum likelihood estimates for each entry
impute_missing(glrm) imputes missing entries and leaves observed entries unchanged
sample(glrm) gives a draw from the posterior distribution, conditioned on the fit values of X and Y, for each entry
sample_missing(glrm) samples missing entries and leaves observed entries unchanged

Cross validation

A number of useful functions are available to help you check whether a given low rank model overfits to the test data set. These functions should help you choose adequate regularization for your model.

Cross validation

cross_validate(glrm::GLRM, nfolds=5, params=Params(); verbose=false, use_folds=None, error_fn=objective, init=None): performs n-fold cross validation and returns average loss among all folds. More specifically, splits observations in glrm into nfolds groups, and builds new GLRMs, each with one group of observations left out. Fits each GLRM to the training set (the observations revealed to each GLRM) and returns the average loss on the test sets (the observations left out of each GLRM).

Optional arguments:
- use_folds: build use_folds new GLRMs instead of n_folds new GLRMs, each with 1/nfolds of the entries left out. (use_folds defaults to nfolds.)
- error_fn: use a custom error function to evaluate the fit, rather than the objective. For example, one might use the imputation error by setting error_fn = error_metric.
- init: initialize the fit using a particular procedure. For example, consider init=init_svd!. See Initialization for more options.
cv_by_iter(glrm::GLRM, holdout_proportion=.1, params=Params(1,1,.01,.01), niters=30; verbose=true): computes the test error and train error of the GLRM as it is trained. Splits the observations into a training set (1-holdout_proportion of the original observations) and a test set (holdout_proportion of the original observations). Performs params.maxiter iterations of the fitting algorithm on the training set niters times, and returns the test and train error as a function of iteration.

Regularization paths

regularization_path(glrm::GLRM; params=Params(), reg_params=exp10.(range(2,stop=-2,length=5)), holdout_proportion=.1, verbose=true, ch::ConvergenceHistory=ConvergenceHistory("reg_path")): computes the train and test error for GLRMs varying the scaling of the regularization through any scaling factor in the array reg_params.

Utilities

get_train_and_test(obs, m, n, holdout_proportion=.1): splits observations obs into a train and test set. m and n must be at least as large as the maximal value of the first or second elements of the tuples in observations, respectively. Returns observed_features and observed_examples for both train and test sets.

ScikitLearn

This library implements the ScikitLearn.jl interface. These models are available: SkGLRM, PCA, QPCA, NNMF, KMeans, RPCA. See their docstrings for more information (e.g. ?QPCA). All models support the ScikitLearnBase.fit! and ScikitLearnBase.transform interface. Examples:

## Apply PCA to the iris dataset
using LowRankModels
import ScikitLearnBase
using RDatasets    # may require Pkg.add("RDatasets")

A = convert(Matrix, dataset("datasets", "iris")[[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
ScikitLearnBase.fit_transform!(PCA(k=3, max_iter=500), A)

## Fit K-Means to a fake dataset of two Gaussians
using LowRankModels
import ScikitLearnBase

# Generate two disjoint Gaussians with 100 and 50 points
gaussian1 = randn(100, 2) + 5
gaussian2 = randn(50, 2) - 10
# Merge them into a single dataset
A = vcat(gaussian1, gaussian2)

model = ScikitLearnBase.fit!(LowRankModels.KMeans(), A)
# Count how many points are assigned to each Gaussians (should be 100 and 50)
Set(sum(ScikitLearnBase.transform(model, A), 1))

These models can be used inside a ScikitLearn pipeline, and every hyperparameter can be tuned with GridSearchCV.

Citing this package

If you use LowRankModels for published work, we encourage you to cite the software.

Use the following BibTeX citation:

@article{glrm,
    title = {Generalized Low Rank Models},
    author ={Madeleine Udell and Horn, Corinne and Zadeh, Reza and Boyd, Stephen},
    doi = {10.1561/2200000055},
    year = {2016},
    archivePrefix = "arXiv",
    eprint = {1410.0342},
    primaryClass = "stat-ml",
    journal = {Foundations and Trends in Machine Learning},
    number = {1},
    volume = {9},
    issn = {1935-8237},
    url = {http://dx.doi.org/10.1561/2200000055},
}

鲜花

握手

雷人

路过

鸡蛋

该文章已有0人参与评论

请发表评论

全部评论

专题导读

More+

10-27 六六分期app的软件客服如何联系？(六六分期

11-06 可心卡盟:win10系统火狐flash插件崩溃怎么

11-06 亲亲特价:怎么删除回收站图标

11-06 济南大学虚拟社区:鲁大师节能降温的具体办

11-06 xlueops.exe:无线网络安装向导

11-06 女斗合众国:win7系统cf与主机连接不稳定怎

11-06 0xc000022-[cf烟雾头]cf怎么调烟雾头

11-06 qizideyouhuo:应用程序无法正常启动0xc0000

11-06 ipz-185:win7系统vcf文件怎么打开

11-06 傻哥蹦迪:win10系统s4怎么打开usb调试

11-06 八神浩树gtaste:回收站清空了怎么恢复

11-06 妖尾之黑色守护:win10系统电脑没有1440x900

11-06 校园至尊魔王小说:win7系统浏览网页时字体

11-06 女斗合众国:win10系统访问共享文件夹提示请

11-06 tokyo hot n0654:恢复win7系统默认字体一招

11-06 雨酷仙境:设置win7系统转移临时文件夹腾出

11-06 阿穆纳伊之杖:win7系统开始菜单在右边还原

11-06 tunespotting:win10系统火狐flash插件总是

11-06 甘尔葛分析师：计谋网站seo关键词暴涨有什

11-06 蔡贵霖: 计谋网站seo关键词暴涨有什么秘密

11-06 博益网首页:ao3网页版进入不了解决方法

11-06 漏斗子专栏: 网站数据分析小白易懂精华篇

11-06 见证双虹怎么做:win7系统开启telnet命令的

11-06 颾狐蝶蜋:系统资源不足无法完成请求的服务

11-06 国光中学校歌:提交网站到alexa查询详细步骤

11-06 西安有情天:静态网页和动态网页的区别

11-06 红木雅尚斋:外部链接构造对网站的好处

11-06 前官礼遇：防止域名劫持–增强域安全性的10

11-06 密传二转答案: 中文分词算法有哪些

11-06 金泉家园邮编:百度快照劫持的表现及应对方

doobwa/MCMC.jl: MCMC tools for Julia发布时间：2022-07-09

vinhphunguyen/MPM-Julia: Julia implementation of the material point method (MPM)发布时间：2022-07-09

剪的笔顺,诠释剪的笔画,认识剪的部首

1 六六分期app的软件客服如何联系？(六六分期

六六分期app的软件客服如何联系？不知道吗？加qq群【895510560】即可！标题：六六分期

阅读：18314|2023-10-27

2 可心卡盟:win10系统火狐flash插件崩溃怎么

今天小编告诉大家如何处理win10系统火狐flash插件总是崩溃的问题，可能很多用户都不知

阅读：9694|2022-11-06

3 亲亲特价:怎么删除回收站图标

今天小编告诉大家如何对win10系统删除桌面回收站图标进行设置，可能很多用户都不知道

阅读：8191|2022-11-06

4 济南大学虚拟社区:鲁大师节能降温的具体办

今天小编告诉大家如何对win10系统电脑设置节能降温的设置方法，想必大家都遇到过需要

阅读：8557|2022-11-06

5 xlueops.exe:无线网络安装向导

我们在使用xp系统的过程中,经常需要对xp系统无线网络安装向导设置进行设置，可能很多

阅读：8467|2022-11-06

6 女斗合众国:win7系统cf与主机连接不稳定怎

今天小编告诉大家如何处理win7系统玩cf老是与主机连接不稳定的问题，可能很多用户都不

阅读：9408|2022-11-06

7 0xc000022-[cf烟雾头]cf怎么调烟雾头

电脑对日常生活的重要性小编就不多说了，可是一旦碰到win7系统设置cf烟雾头的问题，很

阅读：8441|2022-11-06

8 qizideyouhuo:应用程序无法正常启动0xc0000

我们在日常使用电脑的时候，有的小伙伴们可能在打开应用的时候会遇见提示应用程序无法

阅读：7873|2022-11-06

9 ipz-185:win7系统vcf文件怎么打开

今天小编告诉大家如何对win7系统打开vcf文件进行设置，可能很多用户都不知道怎么对win

阅读：8425|2022-11-06

10 傻哥蹦迪:win10系统s4怎么打开usb调试

今天小编告诉大家如何对win10系统s4开启USB调试模式进行设置，可能很多用户都不知道怎

阅读：7402|2022-11-06

客服电话

电子邮件

madeleineudell/LowRankModels.jl: LowRankModels.jl is a julia package for modelin ...

开源软件名称：

开源软件地址：

开源编程语言：

开源软件介绍：

LowRankModels.jl

Installation

Generalized Low Rank Models

Example

Missing data

Standard low rank models

Scaling and offsets

Fitting DataFrames

Fitting Sparse Matrices

Parallel fitting (experimental)

Technical details

Optimization

Warm start

Initialization

Parameters

Convergence

Imputation

Cross validation

Cross validation

Regularization paths

Utilities

ScikitLearn

Citing this package

请发表评论

全部评论

上一篇：

下一篇：

ctf0/Laravel-Media-Manager: A &quot;Vuej

bradtraversy/iweather: Ionic 3 mobile we

joaomh/curso-de-matlab

断牙刷新位置时间（断牙属性及刷新位置介绍

rugk/mastodon-simplified-federation: Sim

剪的笔顺,诠释剪的笔画,认识剪的部首

六六分期app的软件客服如何联系？(六六分期

florent37/ViewAnimator: A fluent Android

florent37/Shrine-MaterialDesign2: implem

CVE-2020-36276

SimpleSoftwareIO/simple-sms: Send and re

关于我们

产品与服务

解决方案

139-2527-9053

ctf0/Laravel-Media-Manager: A "Vuej