Highlight
Excel files are one of the most commonly used file formats on the market. The popularity of Excel among business users, business analysts and data engineers is driven by its flexibility, ease of use, powerful integration features and low price.
Intro
This is why every data engineer should understand the advantages and disadvantages of this format, the variety of internal formats such as XLS, XLSX, XLSB and XLSM, and which tools to use to process those files effectively in the cloud.
Today I bring you a quick introduction to building ETL solutions with Excel files in Azure using the Data Factory and Databricks services.
Code samples: https://github.com/MarczakIO/azure4everyone-samples/tree/master/azure-excel-file-processing-with-data-factory-and-databricks
Agenda
- 00:00 Introduction
- 00:25 Excel Business Justification
- 01:22 Excel Challenges
- 02:20 Supported Services
- 04:30 Data Factory Introduction
- 05:35 Demo Setup
- 07:13 Demo using Data Factory
- 13:36 Databricks Introduction
- 14:44 Databricks Setup
- 18:14 Databricks Demo - Reading Excels
- 20:55 Databricks Demo - Reading Excels using References
- 25:56 Databricks Demo - Workbook Metadata
- 28:05 Databricks Demo - Defining Schema
- 30:03 Databricks Demo - Defining Schema
- 32:53 Additional Options
Video
Next steps for you after watching the video
- Excel format in Data Factory
- Spark Excel by Crealytics documentation