Certificate Authentication

Python and Spark for Big Data (PySpark) Certificate for Piotr Patrzałek

Certificate ID: 
607443
Authentication Code: 
2feb7
Certified Person Name: 
Piotr Patrzałek
Trainer Name: 
Tomasz Zając
Duration Days: 
4
Duration Hours: 
28
Course Name: 
Python and Spark for Big Data (PySpark)
Course Date: 
28 January 2020 09:00 to 31 January 2020 16:00
Venue: 
Gdańsk
Course Outline: 

Training plan

Part I – 2 days: Machine Learning in Python and Spark

  1. Machine Learning introduction
    • Types of Machine learning – supervised vs unsupervised learning
    • From Statistical learning to Machine learning
    • The Data Mining workflow:
      • Business understanding
      • Data Understanding
      • Data preparation
      • Modelling
      • Evaluation
      • Deployment
    • Machine learning algorithms
    • Choosing an appropriate algorithm for the problem
    • Overfitting and bias-variance tradeoff in ML
  2. Testing ML algorithms
    • Generalization and overfitting
    • Avoiding overfitting
      • Holdout method
      • Cross-Validation
      • Bootstrapping
    • Evaluating numerical predictions
      • Measures of accuracy: ME, MSE, RMSE, MAPE
      • Parameter and prediction stability
    • Evaluating classification algorithms
      • Accuracy and its problems
      • The confusion matrix
      • Unbalanced classes problem
    • Visualizing model performance
      • Profit curve
      • ROC curve
      • Lift curve
    • Model selection
    • Model tuning – grid search strategies
    • Examples in Python
  3. Intro to Spark
    • Spark structured API
      1. RDD - Resilient Distributed Dataset
      2. DataFrame
      3. Dataset
  4. Data preparation – common tasks with data
    • Data transformations – Data wrangling
    • Exploratory analysis
    • Missing observations – detection and solutions
    • Outliers – detection and strategies
    • Standardization, normalization, binarization
    • Qualitative data recoding
    • Examples in Spark (and Python for comparison)
  5. Machine Learning Workflow in Python Scikit-learn and Spark
    • Feature transformations
    • ML Pipeline
    • Model tuning and hyperparameters
    • Model persistence
  6. Solving classification problems with Spark and Scikit-learn
    • Binary vs multiclass classification
    • Classification via mathematical functions
      • Linear discriminant functions
      • Quadratic discriminant functions
    • Logistic regression and probability approach
    • k-nearest neighbors
    • Naïve Bayes
    • Decision trees
      • CART
      • Random Forests
      • XGBoost
    • Support Vector Machines and kernels
      • Maximal Margin Classifier
      • Support Vector Machine
    • Ensemble learning
    • Examples in Spark (and Python for comparison)
  7. Unsupervised learning
    • Dimensionality reduction
      • Principal component analysis and factor analysis
      • Singular value decomposition
    • Multidimensional Scaling
    • Examples in Python

Part II – 2 days: Deep Learning with TensorFlow

1. Python frameworks for AI applications:

  • TensorFlow and Keras
  • TensorFlow on Spark

2. Neural network architectures – Multilayer Perceptron and its extensions

  1. Overview of the methods
  2. Components of the model (connections and weights, propagation function, etc.)
  3. Hyperparameters (learning rate, cost function, backpropagation, etc.)
  4. Implementing ANN
  5. Overfitting
  6. Bias/variance trade-off
  7. Biases in observational data
  8. Working ANN Exercise in Python / Spark

3. Recurrent Neural Networks

  1. Sequential data types
  2. Design – Recurrent NN, LSTM, GRU
  3. Hyperparameters
  4. Training and optimization methods
  5. Overfitting
  6. Bias/variance trade-off
  7. Biases in observational data
  8. Neural network poisoning
  9. Good practices
  10. Implementing RNN/LSTM in Spark
  11. Working RNN/LSTM Exercise in Spark

4. Machine Learning in production

  • Keeping the model built by the data scientist consistent with the one in production
  • Keeping the model consistent across different machines (moving the model file, seed, etc.)
  • Adding new features to the model – comparing the new model with the production one
  • Other aspects

Course Name Eng: 
PySpark ML and DL