DP-100: Designing and Implementing a Data Science Solution on Azure


Question 61

You are creating a machine learning model. You have a dataset that contains null rows.

You need to use the Clean Missing Data module in Azure Machine Learning Studio to identify and resolve the null and missing data in the dataset.

Which parameter should you use?
Replace with mean
Remove entire column
Remove entire row
Hot Deck
Custom substitution value
Replace with mode




Answer is Remove entire row

Completely removes any row in the dataset that has one or more missing values.
This is useful if the missing value can be considered randomly missing.
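
As a rough illustration of what the "Remove entire row" option does, the sketch below drops every row that has at least one missing value. It is plain Python, not the Studio module itself; the sample rows and the helper name drop_rows_with_missing are made up for this example.

```python
# Hypothetical sample data: None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def drop_rows_with_missing(records):
    """Keep only the records in which every field has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


clean_rows = drop_rows_with_missing(rows)
print(len(rows), "rows before,", len(clean_rows), "rows after")
```

Note that the number of columns is unchanged; only the row count shrinks.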

References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

Question 62

You are a data scientist using Azure Machine Learning Studio.
You need to normalize values to produce an output column into bins to predict a target column.

Solution: Apply an Equal Width with Custom Start and Stop binning mode.

Does the solution meet the goal?
Yes
No




Answer is No

Use the Entropy MDL binning mode which has a target column.
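
To show why a target column matters, here is a simplified sketch that contrasts equal-width binning with a single entropy-based cut. It is not the Studio implementation: the full Entropy MDL method splits recursively and uses an MDL stopping rule, both omitted here. The data and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_entropy_split(values, labels):
    """Return the cut point that minimizes the weighted entropy of the two bins."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


def equal_width_edges(values, bins):
    """Interior bin edges for equal-width binning; the target is never consulted."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + width * k for k in range(1, bins)]


values = [1, 2, 3, 4, 5, 20]  # hypothetical feature
labels = [0, 0, 1, 1, 1, 1]   # hypothetical target
print("equal-width edge:", equal_width_edges(values, 2))          # [10.5]
print("entropy-based cut:", best_entropy_split(values, labels))   # 2.5
```

The equal-width edge depends only on the range of the feature, while the entropy-based cut lands where the classes separate.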

References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/group-data-into-bins

Question 63

You are analyzing a numerical dataset which contains missing values in several columns.
You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.
You need to analyze a full dataset to include all values.

Solution: Remove the entire column that contains the missing data point.

Does the solution meet the goal?
Yes
No




Answer is No.

Use the Multiple Imputation by Chained Equations (MICE) method.

References:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/clean-missing-data

Question 64

You are creating a new experiment in Azure Machine Learning Studio.
One class has a much smaller number of observations than the other classes in the training set.
You need to select an appropriate data sampling strategy to compensate for the class imbalance.

Solution: You use the Scale and Reduce sampling mode.

Does the solution meet the goal?
Yes
No




Answer is No.

Instead, use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode. Note: SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.

References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote

Question 65

You are creating a new experiment in Azure Machine Learning Studio.
One class has a much smaller number of observations than the other classes in the training set.
You need to select an appropriate data sampling strategy to compensate for the class imbalance.

Solution: You use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode.

Does the solution meet the goal?
Yes
No




Answer is Yes

SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.
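
The core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors. The code below is a minimal illustration under that assumption, not the Studio module; the function name smote_sketch, the parameter defaults, and the data points are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (2.5, 2.5)]  # hypothetical rare class
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic minority samples")
```

Because the new points are interpolated rather than copied, they add variety to the rare class instead of repeating existing rows.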

References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote

Question 66

You are evaluating a completed binary classification machine learning model.
You need to use precision as the evaluation metric.

Which visualization should you use?
Violin plot
Gradient descent
Box plot
Binary classification confusion matrix




Answer is Binary classification confusion matrix
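
Precision can be read directly from the cells of a binary confusion matrix as TP / (TP + FP). The following sketch shows the calculation on made-up labels; the function names and data are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


actual = [1, 0, 1, 1, 0, 0, 1, 0]     # hypothetical ground truth
predicted = [1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical model output
tp, fp, fn, tn = confusion_counts(actual, predicted)
print("TP, FP, FN, TN =", (tp, fp, fn, tn))
print("precision =", precision(tp, fp))  # 0.75
```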

References:
https://machinelearningknowledge.ai/confusion-matrix-and-performance-metrics-machine-learning/

Question 67

You plan to deliver a hands-on workshop to several students. The workshop will focus on creating data visualizations using Python. Each student will use a device that has internet access.
Student devices are not configured for Python development. Students do not have administrator access to install software on their devices. Azure subscriptions are not available for students.
You need to ensure that students can run Python-based data visualization code.

Which Azure tool should you use?
Anaconda Data Science Platform
Azure Batch AI
Azure Notebooks
Azure Machine Learning Service




Answer is Azure Notebooks

References:
https://notebooks.azure.com/

Question 68

You are a data scientist using Azure Machine Learning Studio.
You need to normalize values to produce an output column into bins to predict a target column.

Solution: Apply a Quantiles binning mode with a PQuantile normalization.

Does the solution meet the goal?
Yes
No




Answer is No

Use the Entropy MDL binning mode which has a target column.

References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/group-data-into-bins

Question 69

You are analyzing a numerical dataset which contains missing values in several columns.
You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.
You need to analyze a full dataset to include all values.

Solution: Use the Last Observation Carried Forward (LOCF) method to impute the missing data points.

Does the solution meet the goal?
Yes
No




Answer is No

Instead use the Multiple Imputation by Chained Equations (MICE) method.

Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as "Multivariate Imputation using Chained Equations" or "Multiple Imputation by Chained Equations". With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.
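
The "model each variable conditionally on the others" step can be sketched with two columns and simple linear regression. This is a deliberately reduced illustration: it runs one deterministic chain and produces a single completed dataset, whereas real MICE adds random draws and produces multiple imputations. The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def chained_impute(rows, iterations=10):
    """Fill None entries in two-column rows by regressing each column on the other."""
    n_cols = 2
    missing = [[r[c] is None for c in range(n_cols)] for r in rows]
    filled = [list(r) for r in rows]
    # Start from column means so every regression has complete inputs.
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        mean_c = sum(observed) / len(observed)
        for r in filled:
            if r[c] is None:
                r[c] = mean_c
    for _ in range(iterations):
        for target in range(n_cols):
            other = 1 - target
            # Train only on rows where the target column was actually observed.
            train = [(r[other], r[target]) for r, m in zip(filled, missing) if not m[target]]
            slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
            for r, m in zip(filled, missing):
                if m[target]:
                    r[target] = slope * r[other] + intercept
    return filled


rows = [(1, 2), (2, 4), (None, 6), (4, None), (5, 10)]  # hypothetical data, y = 2x
completed = chained_impute(rows)
print(completed[2], completed[3])
```

Every row and every column survives, so the dimensionality of the feature set is unchanged and the full dataset stays available for analysis.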

Note: Last observation carried forward (LOCF) is a method of imputing missing data in longitudinal studies. If a person drops out of a study before it ends, then his or her last observed score on the dependent variable is used for all subsequent (i.e., missing) observation points. LOCF is used to maintain the sample size and to reduce the bias caused by the attrition of participants in a study.
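
For comparison, LOCF is simple enough to sketch directly: each missing observation is replaced by the most recent observed one. The function name locf and the sample series are hypothetical.

```python
def locf(series):
    """Replace each None with the most recent observed value, if there is one."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


scores = [12, 15, None, None, 18, None]  # hypothetical repeated measurements
print(locf(scores))  # [12, 15, 15, 15, 18, 18]
```

A missing value at the very start of the series has nothing to carry forward, so this sketch leaves it as None.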

References:
https://methods.sagepub.com/reference/encyc-of-research-design/n211.xml

Question 70

You are analyzing a dataset containing historical data from a local taxi company. You are developing a regression model.
You must predict the fare of a taxi trip.
You need to select performance metrics to correctly evaluate the regression model.

Which two metrics can you use? Each correct answer presents a complete solution.
A Root Mean Square Error value that is low
An R-Squared value close to 0
An F1 score that is low
An R-Squared value close to 1
An F1 score that is high
A Root Mean Square Error value that is high




Answers are "A Root Mean Square Error value that is low" and "An R-Squared value close to 1"

RMSE and R² are both metrics for regression models.

A: Root mean squared error (RMSE) creates a single value that summarizes the error in the model. By squaring the difference, the metric disregards the difference between over-prediction and under-prediction.

D: Coefficient of determination, often referred to as R², represents the predictive power of the model as a value between 0 and 1. Zero means the model is random (explains nothing); 1 means there is a perfect fit. However, caution should be used in interpreting R² values, as low values can be entirely normal and high values can be suspect.
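
Both metrics are straightforward to compute by hand. The sketch below uses the standard definitions on made-up taxi fares; the function names and numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


fares = [10.0, 12.0, 15.0, 20.0, 23.0]      # hypothetical actual fares
predicted = [11.0, 12.0, 14.0, 21.0, 22.0]  # hypothetical model predictions
print("RMSE:", round(rmse(fares, predicted), 4))
print("R-squared:", round(r_squared(fares, predicted), 4))
```

For these numbers the RMSE is below one currency unit and R-squared is close to 1, which is the pattern the correct answers describe.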

Incorrect Answers:
C, E: F-score is used for classification models, not for regression models.

References:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/evaluate-model
