fix: improve CSV header validation and error messages #692

Open
Prathamesh9284 wants to merge 3 commits into apache:main from Prathamesh9284:fix/csv-header-validation

Conversation

@Prathamesh9284

Summary

Closes #690

Improves CSV error handling in the SQL API filesystem source (JavaCSVTableSource) to provide clear, actionable error messages when CSV files are malformed or misconfigured.

Changes

  • Improved error message in parseLine: Shows expected vs actual column count, the separator used, the offending line, and a hint about the required Calcite header format (name:type)
  • Added validateHeaderLine method: Validates the CSV header before data parsing begins — checks that the comma-separated column count matches the table schema and that each column follows the name:type format
  • Added empty file detection in streamLines: Throws a clear error if the CSV file has no lines at all
  • Removed static from streamLines: Required to access instance fields (fieldTypes, sourcePath) for header validation
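
A minimal sketch of the kind of header check described in the list above. It assumes a standalone helper: `HeaderValidationSketch` and `validateHeader` are hypothetical names, not the PR's actual `validateHeaderLine` signature, and the messages mirror the ones quoted in this description.

```java
/** Hypothetical sketch; not the actual JavaCSVTableSource code from this PR. */
final class HeaderValidationSketch {

    /**
     * Checks a Calcite-style header such as "id:int,name:string".
     * Throws IllegalArgumentException with an actionable message on failure.
     */
    static void validateHeader(String path, String header, int expectedColumns) {
        // A null header stands for a file with no lines at all.
        if (header == null) {
            throw new IllegalArgumentException("CSV file '" + path + "' is empty.");
        }
        // The header is always comma-separated, regardless of the data separator.
        String[] columns = header.split(",", -1);
        if (columns.length != expectedColumns) {
            throw new IllegalArgumentException(String.format(
                    "CSV file '%s': header has %d comma-separated columns but table schema expects %d.",
                    path, columns.length, expectedColumns));
        }
        for (String column : columns) {
            String trimmed = column.trim();
            int colon = trimmed.indexOf(':');
            // Require a non-empty name before the colon and a non-empty type after it.
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(String.format(
                        "CSV file '%s': header column '%s' missing required type. Expected 'name:type' format.",
                        path, trimmed));
            }
        }
    }
}
```

Calling `validateHeader("customers.csv", "NAMEA", 1)` would fail on the missing type, while a header like `id:int,name:string` with two expected columns passes silently.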

Context

Calcite's CSV adapter requires a typed header row (e.g., id:int,name:string,email:string) using commas, while data rows use Wayang's configurable separator (default ;). Without a proper header, the previous error was:

Error while parsing CSV file ... at line ..., using separator ;

This gave no indication of what was actually wrong. The new errors clearly explain the issue:

  • CSV file '...': header has 1 comma-separated columns but table schema expects 4.
  • CSV file '...': header column 'NAMEA' missing required type. Expected 'name:type' format.
  • Column count mismatch in CSV file '...': expected 4 columns but found 1 (separator ';').
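
For illustration, the data-row check behind the last message could look roughly like the sketch below. `ParseLineSketch.splitRow` is a hypothetical helper rather than the PR's `parseLine`, and it assumes a naive split that ignores quoted fields.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of a data-row column count check; not the PR's parseLine. */
final class ParseLineSketch {

    /**
     * Splits one data row on the configured separator and verifies the column count.
     * Assumption: fields never contain the separator (no quoting support).
     */
    static String[] splitRow(String path, String line, char separator, int expectedColumns) {
        // Quote the separator so characters such as '|' are not treated as regex syntax.
        String[] fields = line.split(Pattern.quote(String.valueOf(separator)), -1);
        if (fields.length != expectedColumns) {
            throw new IllegalStateException(String.format(
                    "Column count mismatch in CSV file '%s': expected %d columns but found %d "
                            + "(separator '%s'). Line: '%s'.",
                    path, expectedColumns, fields.length, separator, line));
        }
        return fields;
    }
}
```

A row written with commas but read with `;` as the separator yields a single field, which is how a "found 1" report arises for a four-column table.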

@mspruc
Contributor

mspruc commented Feb 18, 2026

Thanks for your contribution! I left some comments on your PR. I'd prefer that we keep the accessors as-is so we don't accidentally deprecate anything people might be using.

@Prathamesh9284
Author

Hi @mspruc @zkaoudi

I’ve updated the PR based on your feedback to keep the existing methods and signatures the same. I moved the validateHeaderLine logic to createStream instead of changing streamLines.

Please let me know if anything else needs to be improved.

mspruc
mspruc previously approved these changes Feb 18, 2026
Contributor

@mspruc mspruc left a comment

@zkaoudi can you verify this with your example?

@zkaoudi
Contributor

zkaoudi commented Feb 18, 2026

I checked with a correctly formatted file and it works.

I also checked with an incorrect file that has this header:
id:int;name:string;email:string;country:string

And got this error message: "Column count mismatch in CSV file 'file:///Users/zoi/Work/WAYANG/wayang-examples/src/main/resources/input/customers.csv': expected 1 columns but found 4 (separator ';'). Line: '1;Alice Johnson;alice@example.com;USA'. Ensure the header uses 'name:type' format with commas and data rows use ';' as delimiter."

The line it prints is the second line of my file, not the header line. Can we print the header line?

@Prathamesh9284
Author

Prathamesh9284 commented Feb 18, 2026

Hi @zkaoudi @mspruc,

I've updated the PR to handle this case. With the CSV file you shared using the header id:int;name:string;email:string;country:string, the error now correctly identifies the header issue:

CSV file 'file:///Users/zoi/Work/WAYANG/wayang-examples/src/main/resources/input/customers.csv': header uses ';' as separator, but Calcite requires commas. Header: 'id:int;name:string;email:string;country:string'. Expected format: 'id:int,name:string,email:string,country:string'.

It now prints the header line instead of the data line and clearly tells the user what to fix.
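
A rough sketch of how such a check might be expressed, assuming the data separator is known at validation time. `HeaderSeparatorSketch.checkHeaderSeparator` is a hypothetical name and the real implementation may differ.

```java
/** Hypothetical sketch: detect a header written with the data separator instead of commas. */
final class HeaderSeparatorSketch {

    static void checkHeaderSeparator(String path, String header, char separator) {
        // Only flag headers that contain the data separator and no commas at all.
        boolean usesDataSeparator = separator != ','
                && header.indexOf(',') < 0
                && header.indexOf(separator) >= 0;
        if (usesDataSeparator) {
            // Suggest the corrected header by swapping the separator for commas.
            String suggested = header.replace(separator, ',');
            throw new IllegalArgumentException(String.format(
                    "CSV file '%s': header uses '%s' as separator, but Calcite requires commas. "
                            + "Header: '%s'. Expected format: '%s'.",
                    path, separator, header, suggested));
        }
    }
}
```

Running the check first means the error can quote the header itself rather than the first data row.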

     * @param path the filesystem path to the CSV file
     */
    private void validateHeaderLine(final String path) {
        final FileSystem fileSystem = FileSystems.getFileSystem(path).orElseThrow(
Contributor

Would it be possible to do the check directly while we are reading the file? We now open the file twice, which could be costly.

Author

@Prathamesh9284 Prathamesh9284 Feb 22, 2026

@zkaoudi since streamLines() is static and is the place where file opening and iterator creation are already defined, I considered it the appropriate location to perform header validation. However, because it is static, it cannot access the instance-level separator, and we cannot modify its signature or behavior to pass the separator or expose the header.

Given these constraints, performing header validation within the same file-open operation would require changing streamLines(), which @mspruc wanted to avoid to preserve the existing definition. As a result, the file is currently opened twice.

Is there something we can do here to avoid the double file open while keeping the existing structure intact? I’d appreciate your guidance.
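
One direction that might avoid the second open is sketched below with plain `java.io` rather than Wayang's `FileSystem` API. `SinglePassHeaderSketch.dataLines` is a hypothetical helper, and whether something like it fits next to the existing static `streamLines()` without changing that method is an assumption, not something verified against the codebase.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;
import java.util.stream.Stream;

/** Hypothetical sketch: validate the header and stream data rows from one open reader. */
final class SinglePassHeaderSketch {

    /**
     * Reads the first line, hands it to the validator, then returns the remaining
     * lines as a lazy stream backed by the same reader (the input is opened once).
     */
    static Stream<String> dataLines(Reader source, Consumer<String> headerValidator) {
        BufferedReader reader = new BufferedReader(source);
        try {
            String header = reader.readLine();
            if (header == null) {
                throw new IllegalStateException("CSV input has no lines.");
            }
            headerValidator.accept(header);
        } catch (IOException e) {
            closeQuietly(reader);
            throw new UncheckedIOException(e);
        } catch (RuntimeException e) {
            // Validation failed: release the reader before propagating the error.
            closeQuietly(reader);
            throw e;
        }
        // lines() continues from the current position, so the header is not repeated.
        return reader.lines().onClose(() -> closeQuietly(reader));
    }

    private static void closeQuietly(BufferedReader reader) {
        try {
            reader.close();
        } catch (IOException ignored) {
            // Nothing useful to do if closing fails in this sketch.
        }
    }
}
```

The caller supplies the validator, so instance-level state such as the separator can be captured in a lambda without the streaming helper itself needing access to it.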

Development

Successfully merging this pull request may close these issues.

parsing error when reading from filesystem with the SQL api

3 participants