- ANKUSH THAVALI
- 03 Mar, 2023
PySpark and Amazon S3 Integration
# Create SparkSession from builder
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[6]") \
    .appName('SparkByExamples.com') \
    .getOrCreate()
sc = spark.sparkContext

# AWS credentials -- substitute your own keys; never hardcode real
# credentials in source code or publish them
accessKeyId = 'YOUR_ACCESS_KEY_ID'
secretAccessKey = 'YOUR_SECRET_ACCESS_KEY'

# Pass the credentials to the underlying Hadoop S3A connector
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)

# Read a CSV file directly from the S3 bucket
s3_df = spark.read.csv('s3a://learnomate.spark/partition_zipcodes20.csv',
                       header=True, inferSchema=True)
s3_df.show(5)
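Equivalently, the S3A credentials can be supplied while the session is being built, using the spark.hadoop.* config prefix; this avoids reaching into the private _jsc handle. A minimal sketch, with the same placeholder keys:

from pyspark.sql import SparkSession

# Sketch: any option prefixed with spark.hadoop. is copied into the
# Hadoop configuration that the S3A connector reads
spark = SparkSession.builder \
    .master("local[6]") \
    .appName('SparkByExamples.com') \
    .config('spark.hadoop.fs.s3a.access.key', 'YOUR_ACCESS_KEY_ID') \
    .config('spark.hadoop.fs.s3a.secret.key', 'YOUR_SECRET_ACCESS_KEY') \
    .getOrCreate()

s3_df = spark.read.csv('s3a://learnomate.spark/partition_zipcodes20.csv',
                       header=True, inferSchema=True)
s3_df.show(5)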
Spark submit
Ship the AWS SDK and hadoop-aws connector jars with the job using --jars; the jar list is a single comma-separated argument with no spaces, followed by the script path (shown with Windows ^ line continuations):

spark-submit --jars ^
  C:\Users\Acer\Downloads\aws-java-sdk-1.7.4.jar,C:\Users\Acer\Downloads\hadoop-aws-2.7.7.jar ^
  C:\Users\Acer\PycharmProjects\DecHadoopBatch\readS3.py
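As an alternative to downloading the jars by hand, spark-submit can resolve the connector from Maven Central with --packages, which pulls the matching aws-java-sdk dependency transitively. A sketch, assuming the machine has internet access and that the hadoop-aws version matches your local Hadoop build:

spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.7 ^
  C:\Users\Acer\PycharmProjects\DecHadoopBatch\readS3.py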