Hadoop
Connection from Spark to Oracle Database: download the ojdbc JAR connector, code to connect Oracle to Spark, spark-submit with the JAR file. MySQL database with INSERT statements, Spark code to connect to the MySQL database, write a Spark DataFrame to the MySQL database.
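The steps above boil down to a JDBC read from Oracle followed by a JDBC write to MySQL. Below is a minimal Scala sketch of that flow; the hostnames, service names, credentials and table names (oracle-host, ORCLPDB1, HR.EMPLOYEES, mysql-host, testdb, employees_copy) are placeholders, and both driver JARs must be supplied to spark-submit with --jars.

```scala
import org.apache.spark.sql.SparkSession

object OracleToMysql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oracle-to-mysql")
      .getOrCreate()

    // Read a table from Oracle over JDBC (ojdbc JAR must be on the classpath)
    val oracleDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1") // placeholder host/service
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("dbtable", "HR.EMPLOYEES")                              // placeholder table
      .option("user", "hr")
      .option("password", "hr_password")
      .load()

    // Write the DataFrame to a MySQL table (MySQL Connector/J must be on the classpath)
    oracleDF.write
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/testdb")           // placeholder host/database
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "employees_copy")
      .option("user", "root")
      .option("password", "mysql_password")
      .mode("append")
      .save()

    spark.stop()
  }
}
```

A typical launch would pass both connector JARs on the command line, e.g. `spark-submit --jars ojdbc8.jar,mysql-connector-java.jar --class OracleToMysql app.jar`.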
--define or --hivevar options, --database option, -S and -e options, environment variables and redirecting output to a file, connecting to a remote Hive server, running queries from a file, Hive batch-mode commands.
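Besides the CLI flags, the "connecting to a remote Hive server" step can also be exercised programmatically through the HiveServer2 JDBC driver. The sketch below assumes a hypothetical host hive-host with HiveServer2 on its default port 10000 and an unsecured cluster (empty password); it is roughly equivalent to `beeline -u jdbc:hive2://hive-host:10000/default`.

```scala
import java.sql.DriverManager

object RemoteHiveQuery {
  def main(args: Array[String]): Unit = {
    // Load the HiveServer2 JDBC driver (hive-jdbc must be on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Placeholder host; 10000 is the default HiveServer2 port
    val url  = "jdbc:hive2://hive-host:10000/default"
    val conn = DriverManager.getConnection(url, "hiveuser", "")

    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("SHOW TABLES")
    while (rs.next()) {
      println(rs.getString(1))
    }

    rs.close(); stmt.close(); conn.close()
  }
}
```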
Spark repartition() vs coalesce() – repartition() is used to increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, whereas coalesce() is used only to decrease the number of partitions and does so more efficiently, because it avoids a full shuffle. Example 2: RDD coalesce(), RDD repartition().
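A small self-contained sketch of the difference, using an RDD created locally (the partition counts 6, 10 and 2 are only illustrative):

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-vs-coalesce")
      .master("local[4]")          // local run just for illustration
      .getOrCreate()

    // An RDD with 6 initial partitions
    val rdd = spark.sparkContext.parallelize(1 to 20, 6)
    println(s"initial partitions: ${rdd.getNumPartitions}")       // 6

    // repartition() can go up or down, but always performs a full shuffle
    val up = rdd.repartition(10)
    println(s"after repartition(10): ${up.getNumPartitions}")     // 10

    // coalesce() only reduces partitions; by default it avoids a shuffle
    // by merging existing partitions that live on the same executors
    val down = rdd.coalesce(2)
    println(s"after coalesce(2): ${down.getNumPartitions}")       // 2

    spark.stop()
  }
}
```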
Mapper program, Reducer program, Driver program, run the MapReduce program.
Mapper code, Reducer code, Driver code, add external JAR files, input file, run the MapReduce program, MapReduce WordCount output.
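The mapper, reducer and driver listed above follow the classic Hadoop WordCount pattern. Below is a compact sketch of all three against the MapReduce API, written in Scala rather than Java; the class names and the input/output paths taken from args are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (word, 1) for every token in each input line
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w.toLowerCase)
      ctx.write(word, one)
    }
  }
}

// Reducer: sum the counts emitted for each word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var total = 0
    val it = values.iterator()
    while (it.hasNext) total += it.next().get()
    ctx.write(key, new IntWritable(total))
  }
}

// Driver: wire the mapper, reducer and I/O paths together
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "wordcount")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

Packaged into a JAR, it would be launched roughly as `hadoop jar wordcount.jar WordCount /input /output`, with any external JARs added via the generic -libjars option.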
Sample data, analytical data sample, load the sample data into a Hive table, RANK() function, DENSE_RANK() function, ROW_NUMBER() function.
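The three ranking functions behave the same whether issued from the Hive shell or through Spark. The sketch below uses a small made-up salary dataset to show how they differ: RANK() leaves gaps after ties, DENSE_RANK() does not, and ROW_NUMBER() is always unique within a partition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank, row_number}

object RankingFunctions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ranking-functions")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: employee salaries per department
    val emp = Seq(
      ("sales", "anna", 5000), ("sales", "bob", 5000), ("sales", "carl", 4000),
      ("hr",    "dave", 3000), ("hr",   "eve",  3500)
    ).toDF("dept", "name", "salary")

    // Rank within each department by descending salary
    val w = Window.partitionBy("dept").orderBy(col("salary").desc)

    emp.withColumn("rank",       rank().over(w))
       .withColumn("dense_rank", dense_rank().over(w))
       .withColumn("row_number", row_number().over(w))
       .show()

    spark.stop()
  }
}
```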
Download sample data, sample data, create a Hive table and see that the data contains the header row, load data into the Hive table and see that the header is loaded, create the table with the header-skip table property, load data into the Hive table; this time the header row will not show up in the Hive table.
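The "header property" referred to above is Hive's skip.header.line.count table property. Below is a minimal sketch issuing the same DDL through a Hive-enabled SparkSession; the table name, columns and HDFS path are hypothetical. Hive itself always honors the property, while older Spark versions may still return the header row when reading the table back.

```scala
import org.apache.spark.sql.SparkSession

object SkipHeaderTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skip-header")
      .enableHiveSupport()
      .getOrCreate()

    // Create a text table whose first line (the CSV header) is skipped at query time.
    // Table name, columns and file path are placeholders.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_csv (
        id INT, product STRING, amount DOUBLE
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      TBLPROPERTIES ('skip.header.line.count'='1')
    """)

    // Load the raw file (header line included) into the table;
    // the table property makes Hive ignore that first line when querying.
    spark.sql("LOAD DATA INPATH '/data/sales.csv' INTO TABLE sales_csv")

    spark.sql("SELECT * FROM sales_csv LIMIT 5").show()

    spark.stop()
  }
}
```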
Hive Bucketing. Click on the link below to download the dataset: Bucketing Dataset. Create a bucketed table with a partition column, create a non-bucketed table as a temporary staging table, set the Hive bucketing properties, load data from the non-bucketed table into the bucketed table, verify the data with a SELECT query, verify the bucketed data at the HDFS location, bucketing in Hive: querying from a particular bucket.
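The post does this with Hive DDL (CLUSTERED BY ... INTO N BUCKETS plus SET hive.enforce.bucketing=true, and TABLESAMPLE(BUCKET x OUT OF y) to read a single bucket). As a rough analogue, the sketch below uses Spark's own bucketBy writer instead, since inserting into Hive-bucketed tables directly from Spark is not reliably supported; the dataset path, partition column and bucket column are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object BucketingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketing-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Staging data (in the post this is the non-bucketed temporary table);
    // the path to the downloaded dataset is a placeholder
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/bucketing_dataset.csv")

    // Write a table partitioned by one column and bucketed by another.
    // Spark hashes the bucket column into 4 files per partition, mirroring
    // what Hive does with CLUSTERED BY (customer_id) INTO 4 BUCKETS.
    raw.write
      .partitionBy("country")        // placeholder partition column
      .bucketBy(4, "customer_id")    // placeholder bucket column
      .sortBy("customer_id")
      .format("parquet")
      .saveAsTable("customers_bucketed")

    // Verify: the metastore table is queryable like any other, and the bucket
    // files are visible under the table's warehouse/HDFS location.
    spark.table("customers_bucketed").show(5)

    spark.stop()
  }
}
```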
Putty setting [assuming the Ubuntu installation is completed]. The settings below will help you take a remote session of the Hadoop cluster using PuTTY: https://learnomate.org/settings-to-connect-to-putty-with-remove-oracle-database-server/ Change the hostname in Ubuntu: open the /etc/hosts file and change the hostname, then run hostname and ensure the hostname has changed. Install the packages that let you SSH in and enable copy-paste from VMware. Switch to the root user […]