Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


apache spark - Unable to infer schema for Parquet. It must be specified manually

I am running all the code from within EMR Notebooks.

spark.version

'3.0.1-amzn-0'

temp_df.printSchema()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- weekday: integer (nullable = true)

temp_df.show(2)

+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|        dt|AverageTemperature|AverageTemperatureUncertainty|State|Country|year|month|day|weekday|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|1855-05-01|            25.544|                        1.171| Acre| Brazil|1855|    5|  1|      3|
|1855-06-01|            24.228|                        1.103| Acre| Brazil|1855|    6|  1|      6|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
only showing top 2 rows

temp_df.write.parquet(path='s3://project7878/clean_data/temperatures.parquet', mode='overwrite', partitionBy=['year'])

spark.read.parquet(path='s3://project7878/clean_data/temperatures.parquet').show(2)

An error was encountered:
Unable to infer schema for Parquet. It must be specified manually.;
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

I have looked at other Stack Overflow posts, but the cause described there (empty files being written) does not apply in my case.

Please help me out. Thank you!



1 Reply


Don't pass path as a keyword argument in the read.parquet call; pass the location positionally:

>>> spark.read.parquet(path='a.parquet')
21/01/02 22:53:38 WARN DataSource: All paths were ignored:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marti/bin/spark/python/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/home/marti/bin/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/home/marti/bin/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
>>> spark.read.parquet('a.parquet')
DataFrame[_2: string, _1: double]

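This is because read.parquet has no path parameter. In PySpark 3.0 its signature is parquet(*paths, **options), so path='a.parquet' is collected into the options and no path reaches the reader, which is why the log warns that all paths were ignored.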
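
A minimal sketch of why the keyword form fails, assuming the method collects positional arguments as paths and keyword arguments as options (the shape of the PySpark 3.0 signature parquet(*paths, **options)). The function below is a hypothetical stand-in, not PySpark code, and needs no Spark session:

```python
def parquet(*paths, **options):
    """Hypothetical stand-in with the same (*paths, **options) shape.

    It only reports where each argument ends up; it reads no files.
    """
    return {"paths": list(paths), "options": options}


# Keyword form: 'path' is swallowed by **options, so the list of paths is empty.
keyword_call = parquet(path="a.parquet")

# Positional form: the location lands in *paths, as intended.
positional_call = parquet("a.parquet")

print(keyword_call)
print(positional_call)
```

With an empty list of paths there is nothing to read a schema from, which matches the "Unable to infer schema" error in the question.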

The path keyword is valid if you use load, which does declare a path parameter:

>>> spark.read.load(path='a', format='parquet')
DataFrame[_1: string, _2: string]
