Python and Spark for Big Data (PySpark) Certificate for Piotr Patrzałek
Certificate ID:
607443
Authentication Code:
2feb7
Certified Person Name:
Piotr Patrzałek
Trainer Name:
Tomasz Zając
Duration Days:
4
Duration Hours:
28
Course Name:
Python and Spark for Big Data (PySpark)
Course Date:
28 January 2020 09:00 to 31 January 2020 16:00
Venue:
Gdańsk
Course Outline:
Training plan
Part I – 2 days: Machine learning in Python and Spark
- Machine Learning introduction
- Types of Machine learning – supervised vs unsupervised learning
- From Statistical learning to Machine learning
- The Data Mining workflow:
- Business understanding
- Data Understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
- Machine learning algorithms
- Choosing an appropriate algorithm for the problem
- Overfitting and bias-variance tradeoff in ML
- Testing ML algorithms
- Generalization and overfitting
- Avoiding overfitting
- Holdout method
- Cross-Validation
- Bootstrapping
- Evaluating numerical predictions
- Measures of accuracy: ME, MSE, RMSE, MAPE
- Parameter and prediction stability
- Evaluating classification algorithms
- Accuracy and its problems
- The confusion matrix
- Unbalanced classes problem
- Visualizing model performance
- Profit curve
- ROC curve
- Lift curve
- Model selection
- Model tuning – grid search strategies
- Examples in Python
- Intro to Spark
- Spark structured API
- RDD - Resilient Distributed Dataset
- DataFrame
- Dataset
- Data preparation – common tasks with data
- Data transformations – Data wrangling
- Exploratory analysis
- Missing observations – detection and solutions
- Outliers – detection and strategies
- Standardization, normalization, binarization
- Qualitative data recoding
- Examples in Spark (and Python for comparison)
- Machine Learning Workflow in Python Scikit-learn and Spark
- Feature transformations
- ML Pipeline
- Model tuning and hyperparameters
- Model persistence
- Solving classification problems with Spark and Scikit-learn
- Binary vs multiclass classification
- Classification via mathematical functions
- Linear discriminant functions
- Quadratic discriminant functions
- Logistic regression and probability approach
- k-nearest neighbors
- Naïve Bayes
- Decision trees
- CART
- Random Forests
- XGBoost
- Support Vector Machines and kernels
- Maximal Margin Classifier
- Support Vector Machine
- Ensemble learning
- Examples in Spark (and Python for comparison)
- Unsupervised learning
- Dimensionality reduction
- Principal component analysis and factor analysis
- Singular value decomposition
- Multidimensional Scaling
- Examples in Python
Part II – 2 days: Deep learning with TensorFlow
1. Python frameworks for AI applications:
- TensorFlow and Keras
- TensorFlow on Spark
2. Neural network architectures – Multilayer Perceptron and its extensions
- Overview of the methods
- Components of model (connection and weights, propagation function, etc.)
- Hyperparameters (learning rate, cost function, backpropagation, etc.)
- Implementing ANN
- overfitting
- bias/variance trade-off
- biases in observational data
- Working ANN Exercise in Python / Spark
3. Recurrent Neural Networks
- Sequential data types
- Design – Recurrent NN, LSTM, GRU
- Hyperparameters
- Training and optimization methods
- overfitting
- bias/variance trade-off
- biases in observational data
- neural network poisoning
- Good practices
- Implementing RNN/LSTM in Spark
- Working RNN/LSTM Exercise in Spark
4. Machine Learning in production
- Keeping the model built by the data scientist consistent with the one in production
- Keeping the model consistent across different machines (moving the model file, seed, etc.)
- Adding new features to the model – comparing the new model with the one in production
- Other aspects
Course Name Eng:
PySpark ML and DL