DP-100: Designing and Implementing a Data Science Solution on Azure

Notes

Define and prepare the development environment (15-20%)

Select development environment

  • assess the deployment environment constraints
  • analyze and recommend tools that meet system requirements
  • select the development environment

Set up development environment

  • create an Azure data science environment
  • configure data science work environments
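
A minimal sketch of setting up the workspace with the Azure Machine Learning Python SDK (v1, azureml-core); the subscription ID, resource group, workspace name, and region below are placeholders.

  from azureml.core import Workspace

  # Create the workspace (or attach to it if it already exists).
  ws = Workspace.create(
      name="my-workspace",                  # placeholder workspace name
      subscription_id="<subscription-id>",  # placeholder subscription
      resource_group="my-resource-group",   # placeholder resource group
      location="eastus",
      exist_ok=True,
  )

  # Save config.json locally so later sessions can reconnect with one call.
  ws.write_config()
  ws = Workspace.from_config()
  print(ws.name, ws.location, ws.resource_group)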

Quantify the business problem

  • define technical success metrics
  • quantify risks

Prepare data for modeling (25-30%)

Transform data into usable datasets

  • develop data structures
  • design a data sampling strategy
  • design the data preparation flow
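
A minimal sketch of a sampling strategy with pandas: a simple random sample and a stratified sample that preserves group proportions. The file name and the "segment" column are hypothetical.

  import pandas as pd

  df = pd.read_csv("transactions.csv")

  # Simple random sample: 10% of rows, reproducible via random_state.
  random_sample = df.sample(frac=0.10, random_state=42)

  # Stratified sample: take 10% within each segment so the sample keeps the
  # same segment proportions as the full dataset.
  stratified_sample = (
      df.groupby("segment", group_keys=False)
        .apply(lambda g: g.sample(frac=0.10, random_state=42))
  )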

Perform Exploratory Data Analysis (EDA)

  • review visual analytics data to discover patterns and determine next steps
  • identify anomalies, outliers, and other data inconsistencies
  • create descriptive statistics for a dataset
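
A minimal EDA sketch with pandas: descriptive statistics, a missing-value count, and the 1.5 * IQR rule for flagging outliers. The file and the "amount" column are hypothetical.

  import pandas as pd

  df = pd.read_csv("transactions.csv")

  # Descriptive statistics for every numeric column.
  print(df.describe())

  # Missing values per column often surface data inconsistencies early.
  print(df.isna().sum())

  # Flag potential outliers in one column with the 1.5 * IQR rule.
  q1, q3 = df["amount"].quantile([0.25, 0.75])
  iqr = q3 - q1
  outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
  print(f"{len(outliers)} potential outliers in 'amount'")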

Cleanse and transform data

  • resolve anomalies, outliers, and other data inconsistencies
  • standardize data formats
  • set the granularity for data
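
A minimal cleansing sketch with pandas: standardize date and text formats, drop rows that cannot be repaired, and set the granularity by rolling transactions up to daily totals. The column names are hypothetical.

  import pandas as pd

  df = pd.read_csv("transactions.csv")

  # Standardize formats: parse dates, normalize text categories.
  df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
  df["country"] = df["country"].str.strip().str.upper()

  # Resolve remaining inconsistencies: drop rows whose date could not be parsed.
  df = df.dropna(subset=["order_date"])

  # Set granularity: aggregate transaction-level rows to one row per day.
  daily = (
      df.set_index("order_date")
        .resample("D")["amount"]
        .sum()
        .reset_index()
  )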

Perform feature engineering (15-20%)

Perform feature extraction

  • perform feature extraction algorithms on numerical data
  • perform feature extraction algorithms on non-numerical data
  • scale features
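
A minimal feature-extraction sketch with scikit-learn: scale the numeric columns and extract principal components from them, and one-hot encode a non-numerical column. The file and column names are hypothetical.

  import pandas as pd
  from sklearn.compose import ColumnTransformer
  from sklearn.decomposition import PCA
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import OneHotEncoder, StandardScaler

  numeric_cols = ["age", "income", "tenure"]
  categorical_cols = ["country"]

  numeric_pipeline = Pipeline([
      ("scale", StandardScaler()),   # scale features to zero mean, unit variance
      ("pca", PCA(n_components=2)),  # extract 2 components from the numeric block
  ])

  preprocess = ColumnTransformer([
      ("num", numeric_pipeline, numeric_cols),
      ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
  ])

  df = pd.read_csv("customers.csv")
  features = preprocess.fit_transform(df)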

Perform feature selection

  • define the optimality criteria
  • apply feature selection algorithms
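
A minimal feature-selection sketch: univariate selection with SelectKBest, using the ANOVA F-statistic as the optimality criterion. The built-in breast-cancer dataset is just an example.

  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import SelectKBest, f_classif

  X, y = load_breast_cancer(return_X_y=True, as_frame=True)

  # Keep the 10 features that score highest against the criterion.
  selector = SelectKBest(score_func=f_classif, k=10)
  X_selected = selector.fit_transform(X, y)

  print(X.columns[selector.get_support()].tolist())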

Develop models (40-45%)

Select an algorithmic approach

  • determine appropriate performance metrics
  • implement appropriate algorithms
  • consider data preparation steps that are specific to the selected algorithms
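
A minimal sketch of pairing an algorithm with an appropriate metric: a baseline logistic regression for binary classification scored with ROC AUC. The dataset and metric choice are illustrative, and scaling inside the pipeline is an example of a preparation step specific to the chosen algorithm.

  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import roc_auc_score
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

  # Logistic regression converges faster on standardized inputs, so scaling is
  # bundled into the pipeline as an algorithm-specific preparation step.
  model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
  model.fit(X_train, y_train)

  print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))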

Split datasets

  • determine ideal split based on the nature of the data
  • determine number of splits
  • determine relative size of splits
  • ensure splits are balanced
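
A minimal sketch of a stratified 60/20/20 train/validation/test split with scikit-learn; the proportions and dataset are illustrative.

  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split

  X, y = load_breast_cancer(return_X_y=True)

  # Carve out the test set first, then split the remainder into train/validation.
  # stratify keeps the class proportions the same in every split (balanced splits).
  X_temp, X_test, y_temp, y_test = train_test_split(
      X, y, test_size=0.20, stratify=y, random_state=42)
  X_train, X_val, y_train, y_val = train_test_split(
      X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)  # 0.25 of 80% = 20%

  print(len(X_train), len(X_val), len(X_test))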

Identify data imbalances

  • resample a dataset to impose balance
  • adjust performance metric to resolve imbalances
  • implement penalization
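
A minimal sketch of handling an imbalanced dataset: upsample the minority class, penalize errors on it with class weights, and score with F1 rather than accuracy. The synthetic 95/5 dataset is illustrative.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import f1_score
  from sklearn.model_selection import train_test_split
  from sklearn.utils import resample

  X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

  # Resampling: upsample minority-class rows until both classes are equally frequent.
  X_maj, X_min = X_train[y_train == 0], X_train[y_train == 1]
  X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
  X_bal = np.vstack([X_maj, X_min_up])
  y_bal = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min_up))])

  # Penalization: class_weight="balanced" weights mistakes on the rare class more heavily.
  model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

  # Adjusted metric: F1 is more informative than accuracy when classes are imbalanced.
  print("F1:", f1_score(y_test, model.predict(X_test)))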

Train the model

  • select early stopping criteria
  • tune hyperparameters
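
A minimal Azure ML HyperDrive sketch (SDK v1) covering both objectives: random hyperparameter sampling plus a Bandit early-termination policy. The training script, compute target, environment name, and logged metric name are placeholders.

  from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
  from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                        PrimaryMetricGoal, RandomParameterSampling,
                                        choice, uniform)

  ws = Workspace.from_config()

  src = ScriptRunConfig(
      source_directory=".",
      script="train.py",                    # placeholder training script
      compute_target="cpu-cluster",         # placeholder compute target
      environment=Environment.get(ws, "my-sklearn-env"),  # placeholder environment
  )

  hd_config = HyperDriveConfig(
      run_config=src,
      hyperparameter_sampling=RandomParameterSampling({
          "--learning-rate": uniform(0.001, 0.1),   # continuous search space
          "--n-estimators": choice(100, 200, 400),  # discrete search space
      }),
      policy=BanditPolicy(evaluation_interval=1, slack_factor=0.1),  # early stopping
      primary_metric_name="AUC",            # must match a metric logged by train.py
      primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
      max_total_runs=20,
      max_concurrent_runs=4,
  )

  run = Experiment(ws, "tune-model").submit(hd_config)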

Evaluate model performance

  • score models against evaluation metrics
  • implement cross-validation
  • identify and address overfitting
  • identify root cause of performance results
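
A minimal evaluation sketch: k-fold cross-validation for a stable score estimate and a train-vs-test comparison as a quick overfitting check. The dataset and model are illustrative.

  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score, train_test_split

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

  model = RandomForestClassifier(random_state=42)

  # 5-fold cross-validation gives a more stable estimate than a single hold-out score.
  scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
  print("CV ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))

  # A large gap between the training score and the test score points to overfitting.
  model.fit(X_train, y_train)
  print("Train accuracy:", model.score(X_train, y_train))
  print("Test accuracy:", model.score(X_test, y_test))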

Terminology

  • Azure Machine Learning Workspace
  • Azure Machine Learning Studio
  • Azure Machine Learning Designer
  • Azure Machine Learning SDK
  • Estimator
  • Log Metrics
  • Hyperdrive, Hyperparameters
  • Model Interpreter
  • Models
    • Trained Model
    • Model History
  • Model as a Service
  • Designer Pipeline as a Web Service

Blogs

Training

Videos

References