Open source project name: JuliaIO/Parquet.jl
Open source project URL: https://github.com/JuliaIO/Parquet.jl
Programming language: Julia 100.0%
Project introduction:

Parquet

Reader

A parquet file or dataset can be loaded using the read_parquet function.
Options:
The returned object is a Tables.jl compatible Table and can be converted to other forms, e.g. a DataFrame:

using Parquet, DataFrames
df = DataFrame(read_parquet(path))

Partitions in a parquet file or dataset can also be iterated over using the iterator returned by Tables.partitions:

using Parquet, DataFrames
for partition in Tables.partitions(read_parquet(path))
df = DataFrame(partition)
...
end

Lower Level Reader

Load a parquet file. Only metadata is read initially; data is loaded in chunks on demand. (Note: ParquetFiles.jl also provides load support for Parquet files under the FileIO.jl package.)

julia> using Parquet
julia> filename = "customer.impala.parquet";
julia> parquetfile = Parquet.File(filename)
Parquet file: customer.impala.parquet
version: 1
nrows: 150000
created by: impala version 1.2-INTERNAL (build a462ec42e550c75fccbff98c720f37f3ee9d55a3)
cached: 0 column chunks

Examine the schema.

julia> nrows(parquetfile)
150000
julia> ncols(parquetfile)
8
julia> colnames(parquetfile)
8-element Array{Array{String,1},1}:
["c_custkey"]
["c_name"]
["c_address"]
["c_nationkey"]
["c_phone"]
["c_acctbal"]
["c_mktsegment"]
["c_comment"]
julia> schema(parquetfile)
Schema:
schema {
optional INT64 c_custkey
optional BYTE_ARRAY c_name
optional BYTE_ARRAY c_address
optional INT32 c_nationkey
optional BYTE_ARRAY c_phone
optional DOUBLE c_acctbal
optional BYTE_ARRAY c_mktsegment
optional BYTE_ARRAY c_comment
}

The reader performs logical type conversions automatically for String (from byte arrays), decimals (from fixed-length byte arrays) and DateTime (from Int96). It relies on the converted type being populated correctly in the file metadata to detect such conversions. To handle files where that metadata is not populated, an optional map_logical_types keyword argument can be provided when opening the parquet file:

julia> mapping = Dict(["column_name"] => (String, Parquet.logical_string));
julia> parquetfile = Parquet.File("filename"; map_logical_types=mapping);

The reader will interpret logical types based on the mapping provided. Variants of the conversion methods, such as Parquet.logical_string, or custom methods can also be applied by the caller.

BatchedColumnsCursor

Create a cursor to iterate over batches of column values. Each iteration returns a named tuple of column names with a batch of column values. Files with nested schemas cannot be read with this cursor.

BatchedColumnsCursor(parquetfile::Parquet.File; kwargs...)

Cursor options:

Example:

julia> typemap = Dict(["c_name"]=>(String,Parquet.logical_string), ["c_address"]=>(String,Parquet.logical_string));
julia> parquetfile = Parquet.File("customer.impala.parquet"; map_logical_types=typemap);
julia> cc = BatchedColumnsCursor(parquetfile)
Batched Columns Cursor on customer.impala.parquet
rows: 1:150000
batches: 1
cols: c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
julia> batchvals, state = iterate(cc);
julia> propertynames(batchvals)
(:c_custkey, :c_name, :c_address, :c_nationkey, :c_phone, :c_acctbal, :c_mktsegment, :c_comment)
julia> length(batchvals.c_name)
150000
julia> batchvals.c_name[1:5]
5-element Array{Union{Missing, String},1}:
"Customer#000000001"
"Customer#000000002"
"Customer#000000003"
"Customer#000000004"
"Customer#000000005"

RecordCursor

Create a cursor to iterate over records. In parallel mode, multiple remote cursors can be created and iterated on in parallel.

RecordCursor(parquetfile::Parquet.File; kwargs...)

Cursor options:

Example:

julia> typemap = Dict(["c_name"]=>(String,Parquet.logical_string), ["c_address"]=>(String,Parquet.logical_string));
julia> parquetfile = Parquet.File("customer.impala.parquet"; map_logical_types=typemap);
julia> rc = RecordCursor(parquetfile)
Record Cursor on customer.impala.parquet
rows: 1:150000
cols: c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
julia> records = collect(rc);
julia> length(records)
150000
julia> first_record = first(records);
julia> isa(first_record, NamedTuple)
true
julia> propertynames(first_record)
(:c_custkey, :c_name, :c_address, :c_nationkey, :c_phone, :c_acctbal, :c_mktsegment, :c_comment)
julia> first_record.c_custkey
1
julia> first_record.c_name
"Customer#000000001"
julia> first_record.c_address
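"IVhzIApeRb ot,c,E"

Writer

You can write any Tables.jl column-accessible table that contains columns of the supported types and their union with Missing. The example below uses Int32, Int64, Float32, Float64, Bool and String columns.

Writer Example

using Parquet
using Random  # provides randstring

tbl = (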
int32 = Int32.(1:1000),
int64 = Int64.(1:1000),
float32 = Float32.(1:1000),
float64 = Float64.(1:1000),
bool = rand(Bool, 1000),
string = [randstring(8) for i in 1:1000],
int32m = rand([missing, 1:100...], 1000),
int64m = rand([missing, 1:100...], 1000),
float32m = rand([missing, Float32.(1:100)...], 1000),
float64m = rand([missing, Float64.(1:100)...], 1000),
boolm = rand([missing, true, false], 1000),
stringm = rand([missing, "abc", "def", "ghi"], 1000)
)
file = tempname()*".parquet"
write_parquet(file, tbl)
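
To connect the writer with the reader described earlier, here is a minimal round-trip sketch. It uses only write_parquet, read_parquet and Tables.partitions as they appear above. The table contents and the helper name count_partition_rows are hypothetical, chosen purely for illustration. It also assumes that Tables.columntable from Tables.jl can materialize the returned table and its partitions, which should hold for any Tables.jl-compatible source but is not confirmed by the text above.

```julia
using Parquet
using Tables

# Hypothetical sample data; the column names are illustrative only.
sample = (
    id    = Int64[1, 2, 3],
    name  = ["a", "b", "c"],
    score = Union{Missing, Float64}[1.5, missing, 3.0],
)

file = tempname() * ".parquet"
write_parquet(file, sample)

# read_parquet returns a Tables.jl-compatible table; materialize it as a
# NamedTuple of column vectors (assumes Tables.columntable accepts it).
cols = Tables.columntable(read_parquet(file))

# Hypothetical helper: walk the partitions and add up the rows seen in each.
function count_partition_rows(path)
    total = 0
    for partition in Tables.partitions(read_parquet(path))
        part_cols = Tables.columntable(partition)
        total += length(part_cols.id)
    end
    return total
end

total_rows = count_partition_rows(file)
```

Materializing with Tables.columntable keeps the sketch independent of DataFrames; DataFrame(read_parquet(file)) would serve equally well when DataFrames is available.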