I'm currently able to solve bivariate analysis problems with Apache Spark and, for example, calculate a correlation coefficient, provided the variables involved are quantitative,
in which case I use `dataset.stat().corr("variable_1", "variable_2", "pearson")`,
or ordinal, in which case I use this small snippet:
```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.linalg.*;
import org.apache.spark.ml.stat.Correlation;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

List<Row> data = Arrays.asList( /* one Row per individual, each vector component: a variable */
    RowFactory.create(Vectors.dense(8.3, 7.9)),
    ...
);
StructType schema = new StructType(new StructField[]{
    new StructField("features", new VectorUDT(), false, Metadata.empty()),
});
Dataset<Row> df = this.session.createDataFrame(data, schema);
Row r = Correlation.corr(df, "features", "spearman").head();
/* the Spearman correlation matrix is in r.get(0) */
```
But I don't know how to proceed when the variables are categorical (qualitative).
Currently I know the manual way, where you have, let's say, an I x C table of observed counts:
| | P | D | I
| --- | --- | --- | ---
| F | 21 | 15 | 9
| H | 39 | 13 | 3
and you compare it with the expected theoretical counts:
| | P | D | I
| --- | --- | --- | ---
| F | 27 | 12.6 | 5.4
| H | 33 | 15.4 | 6.6
and you measure the gaps by computing a D² statistic:

D² = 7.61
and you can see each cell's contribution to it:

(gap² / theoretical count) / D²
| | P | D | I
| --- | ----- | ----- | -----
| F | 0.175 | 0.060 | 0.315
| H | 0.143 | 0.049 | 0.258
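For reference, the manual computation above (expected counts, the D² statistic, and each cell's contribution) can be sketched in plain Java, outside Spark; the class and method names here are illustrative, not part of any Spark API:

```java
// Plain-Java sketch of the manual chi-square computation described above.
public class ChiSquareSketch {

    /** Expected counts: rowTotal * colTotal / grandTotal for each cell. */
    static double[][] expected(double[][] observed) {
        int rows = observed.length, cols = observed[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                rowSum[i] += observed[i][j];
                colSum[j] += observed[i][j];
                total += observed[i][j];
            }
        double[][] e = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                e[i][j] = rowSum[i] * colSum[j] / total;
        return e;
    }

    /** Chi-square statistic D²: sum over cells of (observed - expected)² / expected. */
    static double d2(double[][] observed) {
        double[][] e = expected(observed);
        double d2 = 0.0;
        for (int i = 0; i < observed.length; i++)
            for (int j = 0; j < observed[i].length; j++) {
                double gap = observed[i][j] - e[i][j];
                d2 += gap * gap / e[i][j];
            }
        return d2;
    }

    /** Contribution of cell (i, j): ((observed - expected)² / expected) / D². */
    static double contribution(double[][] observed, int i, int j) {
        double e = expected(observed)[i][j];
        double gap = observed[i][j] - e;
        return (gap * gap / e) / d2(observed);
    }

    public static void main(String[] args) {
        double[][] observed = { {21, 15, 9}, {39, 13, 3} }; // the F/H x P/D/I table above
        System.out.println("D2 = " + d2(observed));                      // about 7.62
        System.out.println("cell (F, I): " + contribution(observed, 0, 2)); // about 0.315
    }
}
```

On the example table this yields D² ≈ 7.62 (the 7.61 above is a rounded figure) and the same contribution matrix.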
What is the way to achieve this with Apache Spark?
Asked by Marc Le Bihan, translated from Stack Overflow.