python - Pandas read_csv dtype leading zeros

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

So I'm reading in a station codes csv file from NOAA which looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"

The first two columns contain weather station codes, which sometimes have leading zeros. When pandas imports them without a specified dtype, they are converted to integers and the zeros are lost. It's not really that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i since they are always six digits, but you know... that's the lazy man's way.
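
For illustration, the padding workaround described above can be sketched in one vectorized step rather than a loop. This sketch uses `Series.str.zfill` in place of the "%06d" formatting; the inline `csv_text` sample is trimmed from the rows shown above and stands in for the real file, and a recent pandas on Python 3 is assumed.

```python
import io

import pandas as pd

# Trimmed stand-in for the NOAA file (assumption: only three columns kept).
csv_text = '''"USAF","WBAN","STATION NAME"
"006852","99999","SENT"
"007005","99999","CWOS 07005"
'''

# Default inference: the quoted USAF codes are still parsed as integers.
df = pd.read_csv(io.StringIO(csv_text))

# Convert back to text and left-pad with zeros to the fixed width of six.
df["USAF"] = df["USAF"].astype(str).str.zfill(6)
restored = df["USAF"].tolist()
print(restored)
```

This only works because the USAF codes have a known fixed width; it cannot recover zeros for codes of varying length.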

The csv is obtained using this code:

import urllib  # Python 2; in Python 3 the equivalent is urllib.request.urlopen

file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()

That is all well and good, but when I try to read the file using this:

import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': np.str, 'WBAN': np.str})

or

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})

I get a nasty error message:

  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood

It's a pretty big csv (31k rows) so maybe that has something to do with it?

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:04:17+0000

This is a matter of pandas dtype inference.

Pandas sees values that look like numbers and assumes you want them parsed as numbers, which drops the leading zeros.

To remove the guesswork, set the dtype you want explicitly: object

pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})

That will do the trick.
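
To make the difference concrete, here is a minimal sketch that reads the same data twice, once with default inference and once with an explicit dtype. The inline `csv_text` sample is trimmed from the rows in the question and stands in for the real file; a recent pandas on Python 3 is assumed.

```python
import io

import pandas as pd

# Trimmed stand-in for the NOAA file (assumption: only three columns kept).
csv_text = '''"USAF","WBAN","STATION NAME"
"006852","99999","SENT"
"007005","99999","CWOS 07005"
'''

# Default inference: the codes become integers and lose their leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: the codes are kept as the text found in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"USAF": object, "WBAN": object})

usaf_inferred = inferred["USAF"].tolist()
usaf_text = as_text["USAF"].tolist()
print(usaf_inferred)
print(usaf_text)
```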

Update, as it helps others:

To read all columns as str, one can do this (from the comments):

pd.read_csv('sample.csv', dtype = str)

To read only selected columns as str, one can do this:

import pandas as pd

# list of column names that need to be read as strings
lst_str_cols = ['prefix', 'serial']
# use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x: 'str' for x in lst_str_cols}
# pass the dict as the dtype argument
pd.read_csv('sample.csv', dtype=dict_dtypes)
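
As a rough sketch of how the two variants above behave on the station file from the question: with `dtype=str` every column stays text, while the dict leaves unlisted columns to normal inference. The inline `csv_text` sample and the choice of columns are assumptions made for illustration; a recent pandas on Python 3 is assumed.

```python
import io

import pandas as pd

# Trimmed stand-in for the NOAA file (assumption: only three columns kept).
csv_text = '''"USAF","WBAN","ELEV(.1M)"
"006852","99999","+14200"
"007005","99999","-99999"
'''

# Variant 1: every column is read as text.
all_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Variant 2: only the code columns are read as text; ELEV(.1M) is still inferred.
code_cols = ["USAF", "WBAN"]
selective = pd.read_csv(io.StringIO(csv_text), dtype={c: str for c in code_cols})

elev_all_text = all_text["ELEV(.1M)"].tolist()
elev_selective = selective["ELEV(.1M)"].tolist()
usaf_selective = selective["USAF"].tolist()
print(selective.dtypes)
```

The selective form is usually the better fit when the other columns are needed for arithmetic, since they keep their numeric dtypes.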
