Welcome to OGeek Q&A Community for programmers and developers - Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
322 views
in Technique by (71.8m points)

java - How to do bivariate statistics with categorical variables in Spark?

I'm currently able to solve bivariate analysis problems with Apache Spark, and calculate for example a correlation coefficient, provided the variables involved are quantitative,

then I use dataset.stat().corr("variable_1", "variable_2", "pearson")

or ordinal, and then I use this small code:

List<Row> data = Arrays.asList( /* One Row per individual, each column: a variable */
   RowFactory.create(Vectors.dense(8.3, 7.9)),
   ...
);

StructType schema = new StructType(new StructField[]{
  new StructField("features", new VectorUDT(), false, Metadata.empty()),
});

Dataset<Row> df = this.session.createDataFrame(data, schema);

Row r = Correlation.corr(df, "features", "spearman").head();
/* the Spearman correlation matrix is in r.get(0) */

But I don't know how to proceed when the variables are categorical (qualitative).

Currently I know the manual way, where you have, let's say, an I x C contingency table of observed counts:

|     |  P  |  D  |  I   
| --- | --- | --- | ---   
|  F  | 21  | 15  |  9   
|  H  | 39  | 13  |  3

and you compare it with the expected theoretical counts:

|     |  P  |   D  |   I   
| --- | --- | ---  | ---   
|  F  | 27  | 12.6 | 5.4   
|  H  | 33  | 15.4 | 6.6

and you measure the gaps, calculating a D2 (the chi-squared statistic),

D2 = 7.61

and you can see the contribution of each cell:

(gap2 / theoretical count) / D2

|     |   P   |   D   |   I   
| --- | ----- | ----- | -----   
|  F  | 0.175 | 0.060 | 0.315   
|  H  | 0.143 | 0.049 | 0.258
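The arithmetic above can be checked with a small plain-Java sketch (no Spark needed; it uses the F/H x P/D/I counts from the tables — the D2 comes out as 7.619, which the question rounds down to 7.61):

```java
public class ChiSquaredByHand {

    /* Expected count of each cell: rowTotal * colTotal / grandTotal */
    static double[][] expected(double[][] obs) {
        int rows = obs.length, cols = obs[0].length;
        double[] rowTot = new double[rows], colTot = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                rowTot[i] += obs[i][j];
                colTot[j] += obs[i][j];
                total += obs[i][j];
            }
        double[][] exp = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                exp[i][j] = rowTot[i] * colTot[j] / total;
        return exp;
    }

    /* D2 (chi-squared) statistic: sum of gap2 / expected over all cells */
    static double d2(double[][] obs) {
        double[][] exp = expected(obs);
        double d2 = 0;
        for (int i = 0; i < obs.length; i++)
            for (int j = 0; j < obs[0].length; j++)
                d2 += Math.pow(obs[i][j] - exp[i][j], 2) / exp[i][j];
        return d2;
    }

    public static void main(String[] args) {
        double[][] observed = { {21, 15, 9}, {39, 13, 3} }; // rows F, H; columns P, D, I

        double stat = d2(observed);
        System.out.printf("D2 = %.3f%n", stat);             // D2 = 7.619

        // Contribution of each cell: (gap2 / expected count) / D2
        double[][] exp = expected(observed);
        for (int i = 0; i < observed.length; i++) {
            for (int j = 0; j < observed[0].length; j++)
                System.out.printf("%.3f ",
                        Math.pow(observed[i][j] - exp[i][j], 2) / exp[i][j] / stat);
            System.out.println();                           // 0.175 0.060 0.315 / 0.143 0.049 0.258
        }
    }
}
```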

What is the way to achieve this with Apache Spark?

asked by Marc Le Bihan, translated from Stack Overflow


1 Reply

0 votes
by (71.8m points)
Awaiting an answer from an expert.
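For what it's worth, Spark itself ships a Pearson chi-squared independence test: org.apache.spark.ml.stat.ChiSquareTest (Spark 2.2+). A minimal sketch, assuming the spark-mllib module is on the classpath and that the categories have already been index-encoded as doubles (the F/H and P/D/I encodings below are hypothetical — in practice a StringIndexer would produce them):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.stat.ChiSquareTest;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ChiSquareExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("chi2").master("local[*]").getOrCreate();

        /* One Row per individual: label = sex (F=0.0, H=1.0),
           feature = party (P=0.0, D=1.0, I=2.0) -- hypothetical encoding */
        List<Row> data = Arrays.asList(
                RowFactory.create(0.0, Vectors.dense(0.0)),
                RowFactory.create(1.0, Vectors.dense(2.0))
                /* ... one Row per observed individual ... */
        );

        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())
        });
        Dataset<Row> df = session.createDataFrame(data, schema);

        /* Pearson chi-squared independence test of every feature against the label */
        Row r = ChiSquareTest.test(df, "features", "label").head();
        System.out.println("p-values:           " + r.get(0));
        System.out.println("degrees of freedom: " + r.getList(1));
        System.out.println("statistics (D2):    " + r.get(2));

        session.stop();
    }
}
```

If the I x C table of observed counts is already available, the older RDD-based API accepts it directly: org.apache.spark.mllib.stat.Statistics.chiSqTest(Matrices.dense(2, 3, new double[]{21, 39, 15, 13, 9, 3})) (values in column-major order) returns the statistic, the degrees of freedom and the p-value in one call.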
