Databricks | No Need To Skip Rows Before Header Row while reading a CSV File

Man! The past couple of weeks have been really tough: hardcore development on Azure Data Factory and Azure Databricks, as we are up against a tight deadline (again :-) ).

Loads of different scenarios and loads of new learnings. Sharing one below; keep reading.

We are receiving a source file (let's call it Test.csv) that has a blank row before the header row:

1
2 "colname1","colname2"
3 "value1","value2"

We are using spark.read.format to load this into a DataFrame.

Looking at the file contents, one would assume the first blank row needs to be skipped somehow.

So I began researching, and found that spark.read.format does not provide any such option.
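The workaround I was bracing for would have been to pre-filter the blank lines myself, something along these lines. This is only an illustrative sketch, not what I ended up needing; it relies on the DataFrameReader.csv(Dataset[String]) overload available since Spark 2.2:

import org.apache.spark.sql.Dataset

// Hypothetical workaround: read the file as plain text first,
// drop the blank lines, then parse what is left as CSV.
val rawLines: Dataset[String] = spark.read.textFile("Test.csv")
val nonBlank = rawLines.filter(line => line.trim.nonEmpty)

val cleaned = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(nonBlank) // DataFrameReader.csv accepts a Dataset[String]

display(cleaned)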

After spending a couple of hours with no major breakthrough, I decided to just test the code as-is:

val rawdataframe = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("escape", "\"")
  .option("encoding", "UTF-8")
  .option("multiLine", "true")
  .load("Test.csv")

display(rawdataframe)

And to my relief, it worked! Spark loaded the file with the correct header, presumably because its CSV parser skips blank lines by default.

I then tested by adding more blank rows, and it still worked (a self-contained way to reproduce this is sketched after the listing below).

1
2
3
4 "colname1","colname2"
5 "value1","value2"


I was impressed with the "smart" capabilities of Spark.

Maybe I am too excited, though, and should actually verify the row count.
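A quick sanity check could look something like this; it is a sketch against the rawdataframe loaded above, and the expected count of 1 assumes the sample Test.csv shown earlier:

// Verify that the blank rows were really dropped: the sample
// Test.csv should yield exactly one data row and two columns.
val rowCount = rawdataframe.count()
println(s"Row count: $rowCount") // expect 1 for the sample file

rawdataframe.printSchema() // the header row should have become column names
assert(rowCount == 1, s"Unexpected row count: $rowCount")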

Has anyone come across this scenario?
Are my findings accurate?

Till the next learning post!


