Databricks | No Need To Skip Rows Before Header Row while reading a CSV File
Man! The past couple of weeks have been really tough: hardcore development on Azure Data Factory and Azure Databricks, as we are up against a tight deadline (again :-) ).
Loads of different scenarios and loads of new learnings. I'm sharing one below; keep reading.
We are receiving a source file (let's call it Test.csv) that has a blank row before the header row:
1
2 "colname1", "colname2"
3 "value1","value2"
We are using spark.read.format to load this into a DataFrame.
Looking at the file contents, one would assume you need to somehow skip the first blank row.
So I began researching it, and found that spark.read.format does not provide any such option.
After spending a couple of hours with no major breakthrough, I decided to test the code as-is:
val rawdataframe = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("escape", "\"")
  .option("encoding", "UTF-8")
  .option("multiLine", "true")
  .load("Test.csv")
display(rawdataframe)
And to my relief, it worked!
I then tested by adding more blank rows, and it still worked:
1
2
3
4 "colname1", "colname2"
5 "value1","value2"
Impressed with the "smart" capabilities of Spark.
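What seems to be happening (my assumption; I haven't confirmed this in the Spark source or docs) is that the CSV parser Spark uses simply discards lines that are entirely blank before it ever looks for the header. A plain-Scala sketch of that behaviour, using the Test.csv contents from above:

```scala
// Sketch of the assumed behaviour, NOT Spark's actual implementation:
// drop fully blank lines first, then treat the first remaining line
// as the header and the rest as data.
val rawLines = Seq(
  "",
  "",
  "",
  "\"colname1\", \"colname2\"",
  "\"value1\",\"value2\""
)

val nonBlank = rawLines.filterNot(_.trim.isEmpty)
val header   = nonBlank.head   // the column-name row
val dataRows = nonBlank.tail   // everything after the header

println(s"header = $header")
println(s"data rows = ${dataRows.size}")
```

If this is right, it would explain why adding more blank rows before the header made no difference.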
Maybe I am just too excited and should actually check the row count.
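A quick sanity check would look something like this (assuming the rawdataframe from the snippet above, loaded from the two-line Test.csv):

```scala
// Sanity check: with the header consumed and the blank lines skipped,
// Test.csv should yield exactly one data row and two columns.
val rowCount = rawdataframe.count()
val colCount = rawdataframe.columns.length
println(s"rows = $rowCount, cols = $colCount")
assert(rowCount == 1, "blank rows before the header should not become data rows")
```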
Has anyone come across this scenario?
Are my findings accurate?
Till the next learning post!