I'm currently able to solve bivariate analysis problems with Apache Spark and, for example, calculate a correlation coefficient, provided the variables involved are quantitative,
in which case I use `dataset.stat().corr("variable_1", "variable_2", "pearson")`,
or ordinal, in which case I use this small snippet:
```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.linalg.*;
import org.apache.spark.ml.stat.Correlation;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

List<Row> data = Arrays.asList( /* one Row per individual, each vector component: a variable */
    RowFactory.create(Vectors.dense(8.3, 7.9)),
    ...
);
StructType schema = new StructType(new StructField[]{
    new StructField("features", new VectorUDT(), false, Metadata.empty()),
});
Dataset<Row> df = this.session.createDataFrame(data, schema);
Row r = Correlation.corr(df, "features", "spearman").head();
/* the Spearman correlation matrix is in r.get(0) */
```
But I don't know how to proceed when the variables are categorical (qualitative).
Currently I know the manual way, where you have, let's say, an I x C table of observed counts:
| | P | D | I
| --- | --- | --- | ---
| F | 21 | 15 | 9
| H | 39 | 13 | 3
and you compare it with the expected theoretical counts:
| | P | D | I
| --- | --- | --- | ---
| F | 27 | 12.6 | 5.4
| H | 33 | 15.4 | 6.6
and you measure the gaps by computing a D² statistic:

D² = 7.61
and you can see each cell's contribution to it:

(gap² / theoretical count) / D²
| | P | D | I
| --- | ----- | ----- | -----
| F | 0.175 | 0.060 | 0.315
| H | 0.143 | 0.049 | 0.258
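For reference, the manual computation above (expected counts, the D² statistic, and each cell's contribution) can be sketched in plain Java, outside Spark; the class and method names here are illustrative, not part of any Spark API:

```java
// Plain-Java sketch of the manual chi-square computation described above.
public class ChiSquareSketch {

    /** Expected counts: rowTotal * colTotal / grandTotal for each cell. */
    static double[][] expected(double[][] observed) {
        int rows = observed.length, cols = observed[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                rowSum[i] += observed[i][j];
                colSum[j] += observed[i][j];
                total += observed[i][j];
            }
        double[][] e = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                e[i][j] = rowSum[i] * colSum[j] / total;
        return e;
    }

    /** Chi-square statistic D²: sum over cells of (observed - expected)² / expected. */
    static double d2(double[][] observed) {
        double[][] e = expected(observed);
        double d2 = 0.0;
        for (int i = 0; i < observed.length; i++)
            for (int j = 0; j < observed[i].length; j++) {
                double gap = observed[i][j] - e[i][j];
                d2 += gap * gap / e[i][j];
            }
        return d2;
    }

    /** Contribution of cell (i, j): ((observed - expected)² / expected) / D². */
    static double contribution(double[][] observed, int i, int j) {
        double e = expected(observed)[i][j];
        double gap = observed[i][j] - e;
        return (gap * gap / e) / d2(observed);
    }

    public static void main(String[] args) {
        double[][] observed = { {21, 15, 9}, {39, 13, 3} }; // the F/H x P/D/I table above
        System.out.println("D2 = " + d2(observed));                      // about 7.62
        System.out.println("cell (F, I): " + contribution(observed, 0, 2)); // about 0.315
    }
}
```

On the example table this yields D² ≈ 7.62 (the 7.61 above is a rounded figure) and the same contribution matrix.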
What is the way to achieve this with Apache Spark?
Asked by Marc Le Bihan, translated from Stack Overflow.