Databricks | No Need To Skip Rows Before Header Row while reading a CSV File
Man! The past couple of weeks have been really tough: hardcore development on Azure Data Factory and Azure Databricks, as we are up against a tight deadline (again :-) ).
Loads of different scenarios and loads of new learnings. I'm sharing one below; keep reading.
We are receiving a source file (let's call it Test.csv) that has a blank row before the header row:
1
2 "colname1", "colname2"
3 "value1","value2"
We are using spark.read.format to load this into a DataFrame.
Looking at the file contents, one would assume you need to somehow skip the first blank row.
So I began researching it, and found that spark.read.format does not provide any such option.
After spending a couple of hours with no major breakthrough, I decided to test the code as-is:
val rawdataframe = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("escape", "\"")
  .option("encoding", "UTF-8")
  .option("multiLine", "true")
  .load("Test.csv")
display(rawdataframe)
And to my relief, it worked!
I then tested by adding more blank rows, and it still worked:
1
2
3
4 "colname1", "colname2"
5 "value1","value2"
Impressed with the "smart" capabilities of Spark.
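What seems to be happening (my assumption; I haven't confirmed this in the Spark source or docs) is that the CSV parser Spark uses simply discards lines that are entirely blank before it ever looks for the header. A plain-Scala sketch of that behaviour, using the Test.csv contents from above:

```scala
// Sketch of the assumed behaviour, NOT Spark's actual implementation:
// drop fully blank lines first, then treat the first remaining line
// as the header and the rest as data.
val rawLines = Seq(
  "",
  "",
  "",
  "\"colname1\", \"colname2\"",
  "\"value1\",\"value2\""
)

val nonBlank = rawLines.filterNot(_.trim.isEmpty)
val header   = nonBlank.head   // the column-name row
val dataRows = nonBlank.tail   // everything after the header

println(s"header = $header")
println(s"data rows = ${dataRows.size}")
```

If this is right, it would explain why adding more blank rows before the header made no difference.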
Maybe I am just too excited and should actually check the row count.
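A quick sanity check would look something like this (assuming the rawdataframe from the snippet above, loaded from the two-line Test.csv):

```scala
// Sanity check: with the header consumed and the blank lines skipped,
// Test.csv should yield exactly one data row and two columns.
val rowCount = rawdataframe.count()
val colCount = rawdataframe.columns.length
println(s"rows = $rowCount, cols = $colCount")
assert(rowCount == 1, "blank rows before the header should not become data rows")
```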
Has anyone come across this scenario?
Are my findings accurate?
Till the next learning post!