Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Using read_excel with converters for reading Excel file into Pandas DataFrame results in a numeric column of object type

I am reading the United Nations Energy Indicators Excel file with the following code:

import pandas as pd

def convert_energy(energy):
    if isinstance(energy, float):
        return energy*1000000
    else:
        return energy

def energy_df():
    # skipfooter was spelled skip_footer in older pandas releases
    return pd.read_excel(
        "Energy Indicators.xls", skiprows=17, skipfooter=38, usecols=[2,3,4,5],
        na_values=['...'],
        names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'],
        converters={1: convert_energy},
    ).set_index('Country')

This leaves the Energy Supply column with dtype object instead of float64. Why is that the case?

energy = energy_df()
print(energy.dtypes)

Energy Supply                object
Energy Supply per Capita    float64
% Renewable                 float64


1 Reply


Let's remove the converters argument for a moment:

import pandas as pd

c = ['Energy Supply', 'Energy Supply per Capita', '% Renewable']
df = pd.read_excel("Energy Indicators.xls",
                   skiprows=17,
                   skipfooter=38,  # spelled skip_footer in older pandas releases
                   usecols=[2,3,4,5],
                   na_values=['...'],
                   names=c,
                   index_col=[0])

df.index.name = 'Country'
df.head()    
                Energy Supply  Energy Supply per Capita  % Renewable
Country                                                             
Afghanistan             321.0                      10.0    78.669280
Albania                 102.0                      35.0   100.000000
Algeria                1959.0                      51.0     0.551010
American Samoa            NaN                       NaN     0.641026
Andorra                   9.0                     121.0    88.695650

df.dtypes

Energy Supply               float64
Energy Supply per Capita    float64
% Renewable                 float64
dtype: object

Your data loads correctly without a converter. The reason comes down to how pandas handles type conversion.

By default, pandas reads in the column and tries to infer a suitable type for your data. By specifying your own converter, you override pandas' conversion for that column, so this inference does not happen.

pandas passes integer and string values to convert_energy, so isinstance(energy, float) never evaluates to True. Instead, the else branch runs and the values are returned as is, so the resulting column is a mixture of strings and integers. If you put a print(type(energy)) inside your function, this becomes obvious.
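A minimal sketch of that behavior in plain Python. The sample cell values are hypothetical stand-ins for what gets handed to the converter (they are not read from the actual file). The point is only that non-float inputs fall through the else branch unchanged:

```python
def convert_energy(energy):
    # Same logic as in the question: only floats are scaled.
    if isinstance(energy, float):
        return energy * 1000000
    else:
        return energy

# Hypothetical raw cell values: integers plus the '...' placeholder string.
raw_cells = [321, 102, "...", 9]

converted = [convert_energy(value) for value in raw_cells]
result_types = {type(value).__name__ for value in converted}

print(converted)      # the values come back untouched
print(result_types)   # a mix of int and str, no float anywhere
```

Nothing in the list is scaled, and the result still contains both integers and a string.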

Since you have mixtures of types, the resultant type is object. However, if you do not use a converter, pandas will attempt to interpret your data, and will successfully parse it to numeric.
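The dtype consequence can be seen without the Excel file at all. This is a small sketch assuming a column that contains integers and one placeholder string. The values are made up for illustration, and pd.to_numeric is used here only to mimic a numeric parse; it is not necessarily what read_excel does internally:

```python
import pandas as pd

# Hypothetical column contents: integers mixed with a placeholder string.
mixed = pd.Series([321, "...", 9])
print(mixed.dtype)  # object, because the values are of mixed types

# Coercing to numeric turns the unparseable string into NaN,
# which gives a float64 column that can be scaled directly.
numeric = pd.to_numeric(mixed, errors="coerce")
scaled = numeric * 1000000
print(scaled.dtype)  # float64
```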

So simply scaling the column after loading it:

df['Energy Supply'] *= 1000000

is more than enough.
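If a converter is still wanted for some reason, one possible variant is to make it return a float for every input, so the column never ends up with mixed types. This is an assumption on my part rather than something the question requires, and the name convert_energy_safe is hypothetical:

```python
import math

def convert_energy_safe(energy):
    # Hypothetical variant: always return a float, whatever comes in.
    try:
        return float(energy) * 1000000
    except (TypeError, ValueError):
        # Placeholders such as '...' or empty cells become NaN.
        return math.nan

# Quick check on made-up inputs of the kinds discussed above.
samples = [321, 9.0, "102", "...", None]
results = [convert_energy_safe(value) for value in samples]
print(results)
```

Because every returned value is a float, the resulting column should be inferred as a numeric dtype rather than object. Multiplying the column after loading remains the simpler option.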

