[SparkByExamples] Pyspark Tutorial

PySpark Tutorial For Beginners | Python Examples

Spark with Python (PySpark) Tutorial For Beginners In this PySpark Tutorial (Spark with Python) with examples, you will learn what is PySpark? it's features, advantages, modules, packages, and how to use RDD & DataFrame with sample examples in Python code.

sparkbyexamples.com

pyspark 사용 시 구글링할 때 자주 나오는 예제 사이트이다. 유용했던 사이트라, 스파크 스터디원들과 함께 pyspark tutorial을 쭉 정독하고 새롭게 알게 된 것들을 공유하려고 한다.

PySpark Tutorial For Beginners

What is PySpark?

Pyspark는 Apache Spark의 Python API
- Apache Spark는 대규모 데이터를 분산 처리하고 머신러닝 응용을 위한 분석 처리 엔진
- Spark는 기본적으로 스칼라로 작성되었고 후에 Py4J를 이용하여 Pyspark 출시
  - Py4J는 python이 jvm과 동적으로 의사소통하게끔 만들어주는 자바 라이브러리
  - 그래서 Pyspark 사용 시 자바도 깔아야 함!

PySpark Architecture

Cluster Manager Types

Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos – Mesons is a Cluster manager that can also run Hadoop MapReduce and PySpark applications.
Hadoop YARN – the resource manager in Hadoop 2. This is mostly used, cluster manager.
Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.

우리는 Hadoop YARN 사용 중

PySpark Modules & Packages

Spark third-party libraries 사용 시 참고

Spark Packages

spark-packages.org

PySpark RDD – Resilient Distributed Dataset

Spark session은 내부적으로 sparkContext 변수를 생성해냄
SparkSession은 여러 개 만들 수 있지만 JVM당 SparkContext는 무조건 하나
따라서 새로운 sparkContext를 만드려면 기존 sparkContext를 멈춰야만 함

PySpark SQL Tutorial

createOrReplaceTempView() 함수를 사용하고 나면 sql로 접근 가능

 df.createOrReplaceTempView("PERSON_DATA")
df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()

PySpark GraphFrames

GraphX와 GraphFrame의 차이

GraphX : RDD에서 동작
GraphFrames : DataFrame에서 동작

PySpark – repartition() vs coalesce()

repartition() vs coalesce()

repartition() is used to increase or decrease the RDD/DataFrame partitions.
coalesce() is used to only decrease the number of partitions in an efficient way.

PySpark – Broadcast Variables

브로드캐스트 변수(Broadcast Variables)

클러스터에 있는 모든 노드에서 이용가능하고 캐시되는 읽기 전용 공유 변수
모든 노드에서 이용 가능한 전역변수 같은 느낌? 노드는 컴퓨터 한 대라고 볼 수 있으니, 같은 클러스터를 사용하고 있는 사람들은 모두 사용 가능한 전역 변수

PySpark – Accumulator

Accumulator

모든 executor에 의해 정보를 업데이트하고 더하는 쓰기 전용 공유 변수
드라이버에서만 값에 접근 가능하고 워커는 업데이트하고 더하는 태스크만 수행
sparkContext.accumulator() is used to define accumulator variables.
add() function is used to add/update a value in accumulator
value property on the accumulator variable is used to retrieve the value from the accumulator.

저작자표시 변경금지

'Spark & Hadoop' 카테고리의 다른 글

[PySpark Documentation/API Reference] Spark SQL (0)	2022.10.23

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

뢀뢀이의 기술 블로그

[SparkByExamples] Pyspark Tutorial

PySpark Tutorial For Beginners

What is PySpark?

PySpark Architecture

Cluster Manager Types

PySpark Modules & Packages

PySpark RDD – Resilient Distributed Dataset

PySpark SQL Tutorial

PySpark GraphFrames

PySpark – repartition() vs coalesce()

PySpark – Broadcast Variables

PySpark – Accumulator

'Spark & Hadoop' 카테고리의 다른 글

티스토리툴바

개인정보

단축키

내 블로그

블로그 게시글

모든 영역

	df.createOrReplaceTempView("PERSON_DATA")
	df2 = spark.sql("SELECT * from PERSON_DATA")
	df2.printSchema()
	df2.show()

[SparkByExamples] Pyspark Tutorial

PySpark Tutorial For Beginners

What is PySpark?

PySpark Architecture

Cluster Manager Types

PySpark Modules & Packages

PySpark RDD – Resilient Distributed Dataset

PySpark SQL Tutorial

PySpark GraphFrames

PySpark – repartition() vs coalesce()

PySpark – Broadcast Variables

PySpark – Accumulator

'Spark & Hadoop' 카테고리의 다른 글

'Spark & Hadoop' Related Articles

티스토리툴바

개인정보

단축키

내 블로그

블로그 게시글

모든 영역