Why does my Athena query fail with the error "HIVE_BAD_DATA: Error parsing field value for field X: For input string: "12312845691""?


When I query data in Amazon Athena, I get an error similar to either "HIVE_BAD_DATA: Error parsing field value for field X: For input string: "12312845691"" or "HIVE_BAD_DATA: Error parsing column '0': target scale must be larger than source scale".

Short description

There are several versions of the HIVE_BAD_DATA error. If the error message specifies a null or empty input string, such as "For input string: """, then see Why do I get the error "HIVE_BAD_DATA: Error parsing field value for field X for input string:" when I query JSON data in Amazon Athena?

Errors that specify an input string with a value occur in one of the following scenarios:

  • The data type that's defined in the table definition doesn't match the actual source data.
  • A single field contains different types of data, such as an integer value for one record and a decimal value for another record.

Resolution

It's a best practice to use only one data type in a column. Otherwise, the query might fail. To resolve errors, be sure that each column contains values of the same data type and that the values are in the allowed ranges.

If you still get errors, then change the column's data type to a compatible data type that has a higher range. If the change to the data type doesn't resolve the issue, then try the solutions in the following examples. 
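If you're not sure which rows cause the error, you can temporarily define the problem column as STRING so that every value loads, and then filter for values that don't parse as the intended type. The following query is a sketch that assumes an illustrative table named mytable with an id column defined as STRING:

SELECT id FROM mytable
WHERE id IS NOT NULL
  AND TRY_CAST(id AS INTEGER) IS NULL

Rows that this query returns contain values that can't be read as INTEGER, which points you to the records or the data type definition that you need to fix.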

Example 1

In the following example, the source format is JSON. The issue is with the last record because the id key value is the DECIMAL data type (0.54). However, the id key value for the other records is set to INT.

Source data:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
{ "id" : 0.54, "name":"Jill" }

Data Definition Language (DDL) statement:

CREATE EXTERNAL TABLE jsontest_error_hive_bad_data (
    id INT,
    name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/jsontest_error_hive_bad_data/';

Data Manipulation Language (DML) statement:

SELECT * FROM jsontest_error_hive_bad_data

Error:

"Your query has the following error(s):HIVE_BAD_DATA: Error parsing field value '0.54' for field 0: For input string: "0.54" This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: bd50793b-94fc-42a7-b131-b7c81da273b2."

To resolve this issue, redefine the id column as string. The STRING data type can correctly represent all values in this dataset.

Example:

CREATE EXTERNAL TABLE jsontest_error_hive_bad_data_correct_id_data_type (
    id STRING,
    name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/jsontest_error_hive_bad_data/';

DML statement:

SELECT * FROM jsontest_error_hive_bad_data_correct_id_data_type

You can also cast values to your preferred data type, for example, from a string to an integer. However, depending on the data types that you cast from and to, the cast might return null or inaccurate results. TRY_CAST returns null for values that it can't convert instead of failing the query, so those values are effectively discarded. For example, if you cast the string value 0.54 to INT, then null is returned for that row:

SELECT TRY_CAST(id AS INTEGER) FROM jsontest_error_hive_bad_data_correct_id_data_type

Example output:

Results    _col0
1    50
2    51
3    53
4    (null)

The output shows that the 0.54 value was discarded because a string that contains a decimal point can't be cast directly to an integer. To resolve this issue, combine COALESCE with TRY_CAST so that each value is returned by the first cast that succeeds. You can then run aggregate functions on the column.

Example:

SELECT COALESCE(TRY_CAST(id AS INTEGER), TRY_CAST(id AS DECIMAL(10,2))) FROM jsontest_error_hive_bad_data_correct_id_data_type

Output:

Results     _col0
1    50.00
2    51.00
3    53.00
4    0.54

Run aggregate functions:

SELECT SUM(COALESCE(TRY_CAST(id AS INTEGER), TRY_CAST(id AS DECIMAL(10,2)))) FROM jsontest_error_hive_bad_data_correct_id_data_type

Output:

Results    _col0
1    154.54

Example 2

In the following example, the source format is JSON. The id column is defined as INT. Athena can't parse "49612833315" because the range for INT values in Presto is -2147483648 to 2147483647.

Source data:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
{ "id" : 49612833315, "name":"Jill" }

DDL statement:

CREATE EXTERNAL TABLE jsontest_error_hive_bad_data_sample_2 (
    id INT,
    name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/jsontest_error_hive_bad_data_2/';

DML statement:

SELECT * FROM jsontest_error_hive_bad_data_sample_2

Error:

"Your query has the following error(s):HIVE_BAD_DATA: Error parsing field value '49612833315' for field 0: For input string: "49612833315" This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 05b55fb3-481a-4012-8c0d-c27ef1ee746f."

To resolve this issue, define the id column as BIGINT. The BIGINT data type stores values from -9223372036854775808 to 9223372036854775807, so it can read the value "49612833315". For more information, see Integer types.

Modified DDL statement:

CREATE EXTERNAL TABLE jsontest_error_hive_bad_data_sample_2_corrected (
    id BIGINT,
    name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/jsontest_error_hive_bad_data_2/';
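
To see why the original column type fails, you can compare the two casts directly in Athena. The following statement uses an inline value, so it doesn't require the example table:

SELECT TRY_CAST('49612833315' AS INTEGER) AS as_int,
       TRY_CAST('49612833315' AS BIGINT)  AS as_bigint

The INTEGER cast returns null because the value is outside the INT range, while the BIGINT cast returns 49612833315.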

Example 3

In the following example, the source format is JSON. The input data is decimal, and the column is defined as DECIMAL in the table definition. However, the scale is defined as 2, which allows only two digits after the decimal point, so the "0.000054" value can't be represented. For more information, see DECIMAL or NUMERIC type.

Source data:

{ "id" : 0.50, "name":"John" }
{ "id" : 0.51, "name":"Jane" }
{ "id" : 0.53, "name":"Jill" }
{ "id" : 0.000054, "name":"Jill" }

DDL statement:

CREATE EXTERNAL TABLE jsontest_error_hive_bad_data_sample_3(
    id DECIMAL(10,2),
    name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/jsontest_error_hive_bad_data_3/';

DML statement:

SELECT * FROM jsontest_error_hive_bad_data_sample_3

Error:

"Your query has the following error(s):HIVE_BAD_DATA: Error parsing column '0': target scale must be larger than source scale This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 1c3c7278-7012-48bb-8642-983852aff999."

To resolve this issue, redefine the column with a scale that captures all input values. The value 0.000054 requires a scale of at least 6, so instead of (10,2), use (10,7), for example.

CREATE EXTERNAL TABLE jsontest_error_hive_bad_data_sample_3_corrected(
    id DECIMAL(10,7),
    name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/jsontest_error_hive_bad_data_3/';
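
Before you recreate the table, you can verify that the new scale represents the smallest value. The following check uses an inline value, so it doesn't require the example table:

SELECT CAST('0.000054' AS DECIMAL(10,7)) AS id_with_new_scale

The cast succeeds because the scale of 7 is large enough to hold all six decimal digits of 0.000054.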

Example 4

In the following example, the source format is CSV. The attribute2 values in the source data are strings in date format, such as "2018-01-01", but the table schema defines the column as DATE. The OpenCSVSerDe expects DATE values to be represented as the number of days elapsed since January 1, 1970.

Source data:

"attribute1","attribute2","attribute3","attribute4"
"1","2018-01-01","10.01","Hello!"
"2","2018-01-02","20.02","Hi!"
"3","2018-01-03","30.03","How are you"

DDL statement:

CREATE EXTERNAL TABLE test (attribute1 int, attribute2 date, attribute3  float, attribute4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://awsexamplebucket/csvtest_error_hive_bad_data/'
TBLPROPERTIES ("skip.header.line.count"="1");

Error:

"HIVE_BAD_DATA: Error reading field value: Cannot convert value 2018-01-01 of type String to a LONG value"

To resolve this issue, keep the DATE column type, but represent the date values in the source data as the number of days elapsed since January 1, 1970. For example, 2018-01-01 becomes 17532.

Modified source data:

"attribute1","attribute2","attribute3","attribute4"
"1","17532","10.01","Hello!"
"2","17533","20.02","Hi!"
"3","17534","30.03","How are you"

Modified DDL statement:

CREATE EXTERNAL TABLE test (attribute1 int, attribute2 date, attribute3  float, attribute4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://awsexamplebucket/csvtest_error_hive_bad_data2/'
TBLPROPERTIES ("skip.header.line.count"="1");
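
If you need to compute the epoch-day representation for existing date values (for example, to rewrite the source file), you can use Athena's date_diff function. The following statement uses inline values, so it doesn't require the example table:

SELECT date_diff('day', DATE '1970-01-01', DATE '2018-01-01') AS epoch_days

The result is 17532, which is the value used for 2018-01-01 in the modified source data.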

Related information

Data types in Amazon Athena

Troubleshooting in Athena

AWS OFFICIAL