Querying on CSV file using Spark SQL
Apache Spark SQL is a Spark module to simplify working with structured data using DataFrame and DataSet abstractions in Python, Java, and Scala. These abstractions are the distributed collection of data organized into named columns. It provides a good optimization technique. Using Spark SQL we can query data, both from inside a Spark program and from external tools that connect through standard database connectors (JDBC/ODBC) to Spark SQL.
So, let’s start Spark SQL tutorial.
The following example describe how to process csv structural data with the help of spark sql.
1)Load customer.csv file from local server. in this example we are considering customer.csv file is available on Windows Server.
Download customer.csv file from following link.
2) Create view as explained below . CreateOrReplaceTempView will create a temporary view of the table on memory it is not persistent at this moment but you can run sql query on top of that . if you want to save it you can either persist or use saveAsTable to save.
3)Call spark.sql function to write sql query. here we are finding all the employees whose employee number is greater than 7788.
4)Show method will display the data
following is the consolidated code for the above explanation.
scala> spark.read.format(“csv”).option(“header”,”true”).load(“F:/Hadoop Youtube/customer.csv”)
res6: org.apache.spark.sql.DataFrame = [empno: string, ename: string … 5 more fields]
scala> .show
+—–+——+———–+——-+———-+—-+——+
|empno| ename|designation|manager| hire_date| sal|deptno|
+—–+——+———–+——-+———-+—-+——+
| 7369| SMITH| CLERK| 7902|12/17/1980| 800| 20|
| 7499| ALLEN| SALESMAN| 7698| 2/20/1981|1600| 30|
| 7521| WARD| SALESMAN| 7698| 2/22/1981|1250| 30|
| 7566|TURNER| MANAGER| 7839| 4/2/1981|2975| 20|
| 7654|MARTIN| SALESMAN| 7698| 9/28/1981|1250| 30|
| 7698|MILLER| MANAGER| 7839| 5/1/1981|2850| 30|
| 7782| CLARK| MANAGER| 7839| 6/9/1981|2450| 10|
| 7788| SCOTT| ANALYST| 7566| 12/9/1982|3000| 20|
| 7844|TURNER| SALESMAN| 7698| 9/8/1981|1500| 30|
| 7876| ADAMS| CLERK| 7788| 1/12/1983|1100| 20|
| 7900| JAMES| CLERK| 7698| 12/3/1981| 950| 30|
| 7902| FORD| ANALYST| 7566| 12/3/1981|3000| 20|
| 7934|MILLER| CLERK| 7782| 1/23/1982|1300| 10|
+—–+——+———–+——-+———-+—-+——+
scala>
scala> spark.read.json(“F:/Hadoop Youtube/employee.json”)
res2: org.apache.spark.sql.DataFrame = [age: string, id: string … 1 more field]
scala> .show
+—+—-+——-+
|age| id| name|
+—+—-+——-+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 23|1205| prudvi|
+—+—-+——-+
Check out the video with practical session. Please subscribe to the channel for future update.