Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Pandas read_csv dtype leading zeros

I'm reading in a station-codes CSV file from NOAA that looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"

The first two columns contain codes for weather stations, and some of them have leading zeros. When pandas imports them without a specified dtype, they turn into integers and the zeros are lost. It's not that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i since they are always six digits, but you know... that's the lazy man's way.
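For reference, the padding workaround mentioned above amounts to something like the following sketch. It uses plain Python on the two USAF codes from the sample rows; the variable names are illustrative, and it only works because the codes are assumed to have a fixed width of six digits.

```python
# Codes as they look once parsed as integers (leading zeros are lost).
parsed_codes = [int("006852"), int("007005")]

# Re-pad each code to six digits with the "%06d" format.
padded_codes = ["%06d" % code for code in parsed_codes]

print(parsed_codes)
print(padded_codes)
```

This restores the text form after the fact, which is why it breaks down as soon as a column mixes codes of different lengths.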

The csv is obtained using this code:

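import urllib  # Python 2 module; the traceback below is from Python 2.7

# Download the station list and save a local copy
file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()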

which is all well and good, but when I try to read it using this:

import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv", dtype={'USAF': np.str, 'WBAN': np.str})

or

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})

I get a nasty error message:

  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood

It's a pretty big csv (31k rows) so maybe that has something to do with it?



1 Reply


This is a pandas dtype inference issue.

read_csv sees values made up only of digits and guesses that you want numbers, so the leading zeros are dropped.

To leave pandas no room to guess, set the dtype you want explicitly. Here that is object, which keeps the raw text of the column:

pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})

That will do the trick.
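As a quick check, here is a minimal sketch that contrasts the default inference with an explicit dtype. It inlines the two sample rows from the question (trimmed to the first four columns) through io.StringIO so that no file is needed; the variable names are illustrative only.

```python
import io

import pandas as pd

# Sample rows from the question, trimmed to the first four columns.
csv_text = (
    '"USAF","WBAN","STATION NAME","CTRY"\n'
    '"006852","99999","SENT","SW"\n'
    '"007005","99999","CWOS 07005",""\n'
)

# Default inference: all-digit columns are parsed as integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: the code columns keep their raw text.
kept = pd.read_csv(io.StringIO(csv_text), dtype={"USAF": object, "WBAN": object})

print(inferred["USAF"].tolist())
print(kept["USAF"].tolist())
```

Note that the quotes around the values in the file do not stop pandas from inferring a numeric column; only the dtype argument does.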

Update, since it has helped others:

To read all columns as str, you can do this (from a comment):

pd.read_csv('sample.csv', dtype = str)

To read only selected columns as str, you can do this:

# list of column names that need to be strings
lst_str_cols = ['prefix', 'serial']
# use a dict comprehension to build the dict of dtypes
dict_dtypes = {x: 'str' for x in lst_str_cols}
# pass the dict as dtype
pd.read_csv('sample.csv', dtype=dict_dtypes)
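To see the difference between the two variants, here is a small sketch on made-up data. The column names prefix and serial come from the snippet above, while the count column and all of the values are assumptions added purely for illustration.

```python
import io

import pandas as pd

# Hypothetical data: 'prefix' and 'serial' carry leading zeros, 'count' is numeric.
csv_text = "prefix,serial,count\n007,000123,5\n042,004567,12\n"

# Selected columns as str; everything else is still inferred.
str_cols = ["prefix", "serial"]
selective = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in str_cols})

# Every column as str.
all_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(selective["prefix"].tolist(), selective["count"].tolist())
print(all_str["count"].tolist())
```

The dict form is usually the safer choice when the file also contains columns you want to do arithmetic on, since dtype=str turns those into text as well.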
