Python and Spark for Big Data (PySpark) Training Course
Python is a high-level programming language known for its clear syntax and code readability. Spark is a data processing engine for querying, analyzing, and transforming big data. PySpark is the Python API for Spark, allowing users to drive Spark from Python.
Duration: 25 hours
Course Content:
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
- Distributing Data Using the Resilient Distributed Dataset (RDD) Framework
- Distributing Computation Using Spark API Operators
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
- Getting Started with Python
- Using the Jupyter Notebook
- Using Variables and Simple Data Types
- Working with Lists
- Using if Statements
- Using User Inputs
- Working with while Loops
- Implementing Functions
- Working with Classes
- Working with Files and Exceptions
- Working with Projects, Data, and APIs
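A short plain-Python illustration tying together several of the topics listed above (variables, lists, functions, classes, and exception handling); the class and function names are my own, chosen only for the example.

```python
class Course:
    """A simple class with attributes and a method."""
    def __init__(self, title, hours):
        self.title = title
        self.hours = hours

    def summary(self):
        return f"{self.title} ({self.hours} hrs)"


def average_hours(courses):
    """Return the mean duration, guarding against an empty list."""
    try:
        return sum(c.hours for c in courses) / len(courses)
    except ZeroDivisionError:
        return 0.0


catalog = [Course("PySpark", 25), Course("Python Basics", 15)]
print(catalog[0].summary())    # PySpark (25 hrs)
print(average_hours(catalog))  # 20.0
```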
Working with Spark DataFrames
- Getting Started with Spark DataFrames
- Implementing Basic Operations with Spark
- Using Groupby and Aggregate Operations
- Working with Timestamps and Dates
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
- Learning Linear Regression Theory
- Implementing Regression Evaluation Code
- Working on a Sample Linear Regression Exercise
- Learning Logistic Regression Theory
- Implementing Logistic Regression Code
- Working on a Sample Logistic Regression Exercise
Understanding Tree Methods
- Learning Tree Methods Theory
- Implementing Decision Tree and Random Forest Code
- Working on a Sample Random Forest Classification Exercise
Understanding Clustering
- Learning K-means Clustering Theory
- Implementing K-means Clustering Code
- Working on a Sample Clustering Exercise
Implementing Natural Language Processing
- Understanding Natural Language Processing (NLP)
- Overview of NLP Tools
- Working on a Sample NLP Exercise
Understanding Spark Streaming
- Overview of Streaming with Spark
- Working on a Sample Spark Streaming Exercise