Databricks-Machine-Learning-Associate Databricks Certified Machine Learning Associate Exam Questions and Answers

Questions 4

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime increases drastically.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

Options:

PySpark DataFrame API

pandas API on Spark

Spark SQL

Feature Store

Questions 5

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

Options:

Run each notebook interactively

Review the matrix view in the Job's runs

Migrate the Job to a Delta Live Tables pipeline

Change each Task’s setting to use a dedicated cluster

Questions 6

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

spark_df.to_sql()

import pandas as pd

df = pd.DataFrame(spark_df)

spark_df.to_pandas()

Questions 7

A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

spark_df.summary()

spark_df.stats()

spark_df.describe().head()

spark_df.printSchema()

spark_df.toPandas()

Questions 8

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.

Which of the following describes why?

Options:

Gradient boosting is not a linear algebra-based algorithm which is required for parallelization

Gradient boosting requires access to all data at once which cannot happen during parallelization.

Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.

Answer:

Explanation:

Gradient boosting is fundamentally an iterative algorithm where each new tree is built based on the errors of the previous ones. This sequential dependency makes it difficult to parallelize the training of trees in gradient boosting, as each step relies on the results from the preceding step. Parallelization in this context would undermine the core methodology of the algorithm, which depends on sequentially improving the model's performance with each iteration.

References:

Machine Learning Algorithms (Challenges with Parallelizing Gradient Boosting).

Gradient boosting is an ensemble learning technique that builds models in a sequential manner. Each new model corrects the errors made by the previous ones. This sequential dependency means that each iteration requires the results of the previous iteration to make corrections. Here is a step-by-step explanation of why this makes parallelization challenging:

Sequential Nature: Gradient boosting builds one tree at a time. Each tree is trained to correct the residual errors of the previous trees. This requires the model to complete one iteration before starting the next.
Dependence on Previous Iterations: The gradient calculation at each step depends on the predictions made by the previous models. Therefore, the model must wait until the previous tree has been fully trained and evaluated before starting to train the next tree.
Difficulty in Parallelization: Because of this dependency, the trees cannot be trained at the same time. Unlike algorithms whose trees are built independently of one another (e.g., random forests), gradient boosting cannot distribute whole trees across multiple processors or cores for simultaneous training; any parallelism is limited to the work inside a single tree, such as evaluating candidate splits.

This iterative and dependent nature of the gradient boosting process makes it difficult to parallelize effectively.
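
The sequential dependency described above can be seen in a minimal sketch of gradient boosting for squared error. This is an illustration under stated assumptions, not a library implementation: the weak learner is a one-split "stump", and the names fit_stump, predict_stump, and boost, the toy data, and the learning rate are all hypothetical.

```python
# Minimal gradient boosting for squared error with one-split stumps.
# All function names and the toy data are illustrative assumptions.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave points on both sides
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv


def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def boost(xs, ys, n_trees=5, learning_rate=0.5):
    preds = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(n_trees):
        # The residuals depend on the predictions of every earlier stump,
        # so stump k cannot be fit until stump k-1 has been trained.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, losses


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
stumps, losses = boost(xs, ys)
print(losses)  # mean squared error after each boosting round
```

The loop body is the part that resists parallelization: each round reads preds, which the previous round wrote. The search over thresholds inside fit_stump has no such dependency, which is where tree-level implementations usually find their parallelism.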

Questions 9

Which of the following machine learning algorithms typically uses bagging?

Options:

Gradient boosted trees

K-means

Random forest

Linear regression

Decision tree

Questions 10

Which statement describes a Spark ML transformer?

Options:

A transformer is an algorithm which can transform one DataFrame into another DataFrame

A transformer is a hyperparameter grid that can be used to train a model

A transformer chains multiple algorithms together to transform an ML workflow

A transformer is a learning algorithm that can use a DataFrame to train a model

Questions 11

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.

Which of the following possible explanations for this difference is invalid?

Options:

The second model is much more accurate than the first model

The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

The first model is much more accurate than the second model

The RMSE is an invalid evaluation metric for regression problems

Questions 12

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

Options:

MLflow Experiment Tracking

Spark ML

Autoscaling clusters

Delta Lake

Answer:

Explanation:

Spark ML (part of Apache Spark's MLlib) is designed to handle machine learning tasks across multiple nodes in a cluster, effectively parallelizing tasks like hyperparameter tuning. It supports various machine learning algorithms that can be optimized over a Spark cluster, making it suitable for parallelizing hyperparameter tuning for single-node machine learning models when they are adapted to run on Spark.

References

Apache Spark MLlib Guide: https://spark.apache.org/docs/latest/ml-guide.html

Spark ML is a library within Apache Spark designed for scalable machine learning. It provides tools to handle large-scale machine learning tasks, including parallelizing the hyperparameter tuning process for single-node machine learning models using a Spark cluster. Here’s a detailed explanation of how Spark ML can be used:

Hyperparameter Tuning with CrossValidator: Spark ML includes the CrossValidator and TrainValidationSplit classes, which are used for hyperparameter tuning. These classes can evaluate multiple sets of hyperparameters in parallel on a Spark cluster when their parallelism parameter is set above its default of 1.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the model (placeholder: any Spark ML estimator)
model = ...

# Create a parameter grid
# (hyperparam1, hyperparam2, and value1..value4 are placeholders)
paramGrid = ParamGridBuilder() \
    .addGrid(model.hyperparam1, [value1, value2]) \
    .addGrid(model.hyperparam2, [value3, value4]) \
    .build()

# Define the evaluator
evaluator = BinaryClassificationEvaluator()

# Define the CrossValidator
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)

Parallel Execution: Spark distributes the tasks of training models with different hyperparameters across the cluster’s nodes. Each node processes a subset of the parameter grid, which allows multiple models to be trained simultaneously.
Scalability: Spark ML leverages the distributed computing capabilities of Spark. This allows for efficient processing of large datasets and training of models across many nodes, which speeds up the hyperparameter tuning process significantly compared to single-node computations.
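
The idea behind parallel tuning, that every (hyperparameter set, fold) pair is an independent training task, can be sketched in plain Python. This is only a stand-in under stated assumptions: it uses a local thread pool instead of a Spark cluster, a closed-form one-dimensional ridge regression instead of a Spark ML estimator, and the names fit_ridge_1d, evaluate, and grid_search are hypothetical. The parallelism argument is named after the CrossValidator parameter but is otherwise unrelated to Spark.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def rmse(w, xs, ys):
    return (sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def evaluate(task):
    """Train on all folds but one, then score on the held-out fold."""
    lam, fold, xs, ys, num_folds = task
    train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % num_folds != fold]
    held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % num_folds == fold]
    train_x, train_y = zip(*train)
    test_x, test_y = zip(*held_out)
    w = fit_ridge_1d(train_x, train_y, lam)
    return lam, rmse(w, test_x, test_y)


def grid_search(xs, ys, grid, num_folds=3, parallelism=4):
    # Each (lam, fold) pair is independent of the others, so the tasks
    # can be handed to workers in any order and run at the same time.
    tasks = [(lam, fold, xs, ys, num_folds)
             for lam, fold in product(grid, range(num_folds))]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(evaluate, tasks))
    scores = {lam: 0.0 for lam in grid}
    for lam, score in results:
        scores[lam] += score / num_folds  # average RMSE across folds
    best = min(scores, key=scores.get)
    return best, scores


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
best_lam, scores = grid_search(xs, ys, grid=[0.0, 1.0, 100.0])
print(best_lam, scores)
```

Because no task reads another task's result, the same task list could in principle be distributed across cluster nodes rather than threads; only the final averaging step needs all results together.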

Questions 13

A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.

Which of the following classification metrics should be used to evaluate the model?

Options:

RMSE

Precision

Area under the residual operating curve

Accuracy

Recall

Questions 14

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

Options:

Open the MLmodel artifact in the MLflow run page

Click the "Models" link in the row corresponding to the run in the MLflow experiment page

Click the "Source" link in the row corresponding to the run in the MLflow experiment page

Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page

Questions 15

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

Options:

Option A

Option B

Option C

Option D

Option E

Questions 16

A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.

They are using the following code block to evaluate the model:

regression_evaluator.setMetricName("rmse").evaluate(preds_df)

Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

Options:

They should exponentiate the computed RMSE value

They should take the log of the predictions before computing the RMSE

They should evaluate the MSE of the log predictions to compute the RMSE

They should exponentiate the predictions before computing the RMSE

Questions 17

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.

Which of the following lines of code will return the metadata description?

Options:

There is no way to return the metadata description programmatically.

fs.create_training_set("new_table")

fs.get_table("new_table").description

fs.get_table("new_table").load_df()

fs.get_table("new_table")

Questions 18

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?

Options:

R-squared

MAE

MSE

Questions 19

Which of the following machine learning algorithms typically uses bagging?

Options:

Gradient boosted trees

K-means

Random forest

Decision tree

Questions 20

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.

They have developed this code block to accomplish this task:

The code block is returning an error.

Which of the following adjustments does the data scientist need to make to accomplish this task?

Options:

They need to specify the method parameter to the OneHotEncoder.

They need to remove the line with the fit operation.

They need to use StringIndexer prior to one-hot encoding the features.

They need to use VectorAssembler prior to one-hot encoding the features.

Questions 21

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

Options:

Option A

Option B

Option C

Option D

Questions 22

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

Options:

Leave-one-out encoding

Target encoding

One-hot encoding

Categorical

String indexing

Exam Code: Databricks-Machine-Learning-Associate

Exam Name: Databricks Certified Machine Learning Associate Exam

Last Update: Jun 29, 2025

Questions: 74
