Project: Statfactory/JuML.jl
Repository: https://github.com/Statfactory/JuML.jl
Language: Julia 100.0%

JuML

JuML is a machine learning package written in pure Julia. It is still very much a work in progress, so the package is not registered yet and you will need to clone this repo to try it.

At the moment JuML contains a custom-built DataFrame with associated types (Factor, Covariate, etc.) and an independent XGBoost implementation (logistic only). The XGBoost part is around 600 lines of Julia code and has speed similar to the original C++ implementation, with a smaller memory footprint.

Example usage: Airline dataset with 1M obs

The datasets can be downloaded from here:

https://s3.amazonaws.com/benchm-ml--main/train-1m.csv
https://s3.amazonaws.com/benchm-ml--main/test.csv

Let's rename the datasets to airlinetrain and airlinetest. First we have to import the CSV datasets into a special binary format. We will import both datasets into one airlinetraintest dataframe, with the test data stacked under the train data:
The data will be converted into a special binary format and saved in a new folder named airlinetraintest. Each data column is stored in a separate binary file. We can now load the dataset into a JuML DataFrame:
You should see a summary of the dataframe in your REPL. A JuML DataFrame is just a collection of Factors and Covariates. Categorical data is stored in Factors and numeric data in Covariates.
We can access each Factor or Covariate by name:
We can see a quick stat summary:
JuML XGBoost expects label to be a Covariate and all features to be Factors. Our label is dep_delayed_15min, which is a Factor, and there are 2 Covariates in the data: Distance and DepTime. Fortunately we can easily convert between factors and covariates in JuML. Let's create a Covariate which is equal to 1 when dep_delayed_15min is Y and 0 otherwise:
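To illustrate the mapping described above, here is a minimal sketch in plain Julia. It does not use the JuML API: `levels_col` is a hypothetical stand-in for the level of dep_delayed_15min in each row, and the result is an ordinary vector rather than a JuML Covariate.

```julia
# Hypothetical stand-in for the dep_delayed_15min levels (not a JuML Factor).
levels_col = ["N", "Y", "N", "N", "Y"]

# Numeric label: 1.0 where the level is "Y", 0.0 otherwise.
label = [lvl == "Y" ? 1.0 : 0.0 for lvl in levels_col]
```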
Covariates can be converted into Factors by binning. We can bin on every possible value with the factor function:
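Binning on every possible value means that each distinct number becomes its own level. The sketch below shows that idea in plain Julia under stated assumptions: `distance` is a made-up sample column, and `bin_values`, `lookup` and `codes` are illustrative names, not part of JuML.

```julia
# Hypothetical numeric column standing in for Distance.
distance = [732.0, 834.0, 416.0, 834.0, 732.0]

# Every distinct value becomes one bin (level), ordered by value.
bin_values = sort(unique(distance))

# Map each observation to the index of its bin.
lookup = Dict(v => i for (i, v) in enumerate(bin_values))
codes = [lookup[v] for v in distance]
```

Here `bin_values` plays the role of the factor levels and `codes` records the level of each row.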
We have stacked train and test data in 1 dataframe. We will need to define selectors for each part:
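One way to think about such selectors is as Boolean masks over the stacked rows. The sketch below assumes that representation and uses small made-up sizes; `n_train`, `n_test`, `trainsel` and `testsel` are illustrative names, and whether JuML accepts masks in exactly this form is an assumption.

```julia
# Assumed sizes for illustration: train rows come first, test rows are stacked below.
n_train = 4
n_test = 2
n = n_train + n_test

# true for the first n_train rows, false for the rest.
trainsel = (1:n) .<= n_train

# The test part is the complement of the train part.
testsel = .!trainsel
```

For the airline data described above, `n_train` would be 1,000,000.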
The last thing to do before we can run XGBoost is to create a vector of factors to use as features. We need to add the deptime and distance factors to the factors already in the dataframe (dep_delayed_15min will be excluded from the model features automatically):
We are now ready to run XGBoost:
We can now calculate AUC and logloss for both the train and validation parts:
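For reference, the two metrics can be sketched in plain Julia as follows. This is not the JuML implementation: the function names `logloss` and `auc`, the sample vectors and the mask `sel` are all assumptions made for illustration. The AUC here is the simple pairwise definition, which is easy to read but quadratic in cost.

```julia
# Mean negative log-likelihood of 0/1 labels y under predicted probabilities p.
function logloss(y, p; tol = 1e-15)
    total = 0.0
    for (yi, pj) in zip(y, p)
        q = clamp(pj, tol, 1 - tol)          # keep log() away from 0
        total -= yi * log(q) + (1 - yi) * log(1 - q)
    end
    return total / length(y)
end

# AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2.
function auc(y, p)
    pos = [p[i] for i in eachindex(y) if y[i] == 1]
    neg = [p[i] for i in eachindex(y) if y[i] == 0]
    wins = 0.0
    for a in pos, b in neg
        wins += a > b ? 1.0 : (a == b ? 0.5 : 0.0)
    end
    return wins / (length(pos) * length(neg))
end

# Made-up labels and predictions; `sel` is a Boolean mask picking one part of the data.
y = [1.0, 0.0, 1.0, 0.0]
p = [0.9, 0.2, 0.4, 0.6]
sel = [true, true, true, false]

auc_all = auc(y, p)
auc_sel = auc(y[sel], p[sel])
ll_all = logloss(y, p)
```

With 1M observations the pairwise loop would be far too slow; a rank-based formulation computes the same quantity after a single sort.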