
[Apr-2025] Get 100% Real Professional-Data-Engineer Exam Questions, Accurate & Verified Test4Sure Dumps in the Real Exam!
Pass Your Google Cloud Certified Exams Fast. All Top Professional-Data-Engineer Exam Questions Are Covered.
NEW QUESTION # 132
Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?
- A. Store each data entry as the primary key in a separate database and apply an index.
- B. Compute the hash value of each data entry, and compare it with all historical data.
- C. Maintain a database table to store the hash value and other metadata for each data entry.
- D. Assign global unique identifiers (GUID) to each data entry.
Answer: C
Explanation:
Hash values make duplicates easy to detect: duplicate data entries produce the same hash, so a re-transmitted entry can be found by looking up its hash in the table and rejected.
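A minimal sketch of option C follows. It assumes each entry is a dictionary with payload fields plus a `timestamp` field, and that a re-transmission carries a new timestamp, so only the payload is hashed. The field names, the `payload_hash` and `ingest` helpers, and the in-memory dictionary standing in for the database table are illustrative assumptions, not part of the question.

```python
import hashlib
import json


def payload_hash(entry: dict) -> str:
    """Hash the payload fields only; the transmission timestamp is excluded (assumption)."""
    payload = {k: v for k, v in entry.items() if k != "timestamp"}
    # Sort keys so that field order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(entry: dict, seen: dict) -> bool:
    """Return True if the entry is new. `seen` stands in for the hash table in a database."""
    digest = payload_hash(entry)
    if digest in seen:
        return False  # duplicate: same payload was already stored
    # Store the hash together with some metadata about the first transmission.
    seen[digest] = {"first_seen": entry.get("timestamp")}
    return True
```

Looking up one hash per incoming entry avoids comparing the entry with all historical data, which is what makes option C cheaper than option B.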
NEW QUESTION # 133
You want to schedule a number of sequential load and transformation jobs. Data files will be added to a Cloud Storage bucket by an upstream process. There is no fixed schedule for when the new data arrives. Next, a Dataproc job is triggered to perform some transformations and write the data to BigQuery. You then need to run additional transformation jobs in BigQuery. The transformation jobs are different for every table. These jobs might take hours to complete. You need to determine the most efficient and maintainable workflow to process hundreds of tables and provide the freshest data to your end users. What should you do?
- A. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Cloud Storage, Dataproc, and BigQuery operators.
2. Use a single shared DAG for all tables that need to go through the pipeline.
3. Schedule the DAG to run hourly.
- B. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators.
2. Use a single shared DAG for all tables that need to go through the pipeline.
3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.
- C. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Cloud Storage, Dataproc, and BigQuery operators.
2. Create a separate DAG for each table that needs to go through the pipeline.
3. Schedule the DAGs to run hourly.
- D. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators.
2. Create a separate DAG for each table that needs to go through the pipeline.
3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.
Answer: D
Explanation:
Option D is the most efficient and maintainable workflow for this use case, because it processes each table independently and triggers a DAG only when new data arrives in the Cloud Storage bucket. The Dataproc and BigQuery operators orchestrate the load and transformation jobs for each table and leverage the scalability and performance of these services [1][2]. A separate DAG for each table lets you customize the transformation logic and parameters per table and avoids the complexity and overhead of a single shared DAG [3]. A Cloud Storage object trigger launches a Cloud Function that triggers the DAG for the corresponding table, so data is processed as soon as it arrives, without the idle time and cost of running DAGs on a fixed schedule [4].
Option A is not efficient, as it runs the DAG hourly regardless of data arrival, and it uses a single shared DAG for all tables, which is harder to maintain and debug. Option C is also not efficient, as it runs the DAGs hourly and does not use the Cloud Storage object trigger. Option B is not maintainable, as it uses a single shared DAG for all tables even though the transformation jobs differ for every table.
References:
[1]: Dataproc Operator | Cloud Composer | Google Cloud
[2]: BigQuery Operator | Cloud Composer | Google Cloud
[3]: Choose Workflows or Cloud Composer for service orchestration | Workflows | Google Cloud
[4]: Cloud Storage Object Trigger | Cloud Functions Documentation | Google Cloud
[5]: Triggering DAGs | Cloud Composer | Google Cloud
[6]: Cloud Storage Operator | Cloud Composer | Google Cloud
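The routing step inside the Cloud Function can be sketched as below. This is only an illustration under stated assumptions: the `raw/<table>/<file>` object layout, the `load_<table>` DAG naming scheme, and the `dag_run_request` helper are all hypothetical, and the actual call that triggers the DAG in Cloud Composer is deliberately left out.

```python
def dag_run_request(object_name: str) -> dict:
    """Map a newly created Cloud Storage object to the DAG that should process it.

    Assumes objects are named raw/<table>/<file> and that each table has its
    own DAG called load_<table> (both are hypothetical conventions).
    """
    parts = object_name.split("/")
    if len(parts) < 3 or parts[0] != "raw" or not parts[1]:
        raise ValueError(f"unexpected object name: {object_name!r}")
    table = parts[1]
    return {
        "dag_id": f"load_{table}",
        # Passed to the DAG run so the tasks know which file to process.
        "conf": {"table": table, "object": object_name},
    }
```

Because every table resolves to its own DAG, a file for one table starts only that table's Dataproc and BigQuery tasks, and the other DAGs stay idle until their own data arrives.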
NEW QUESTION # 134
You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?
- A. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
- B. Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
- C. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
- D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
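Answer: A
Explanation:
Detecting damage on your own packages needs a custom image model, which AutoML can train from your corpus of images, and serving it behind an API lets the package tracking applications flag packages in real time. Option C analyzes packages in batches, which does not meet the real-time requirement.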
NEW QUESTION # 135
You're using Bigtable for a real-time application, and you have a heavy load that is a mix of reads and writes. You've recently identified an additional use case and need to run an hourly analytical job to calculate certain statistics across the whole database. You need to ensure the reliability of both your production application and the analytical workload.
What should you do?
- A. Add a second cluster to an existing instance with a single-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.
- B. Add a second cluster to an existing instance with a multi-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.
- C. Double the size of your existing cluster and execute your analytics workload on the resized cluster.
- D. Export Bigtable dump to GCS and run your analytical job on top of the exported files.
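Answer: A
Explanation:
With single-cluster routing, each app profile sends its traffic to one specific cluster, so the batch analytics job runs on its own cluster and does not compete with live traffic for resources. Multi-cluster routing would spread both workloads across both clusters and would not isolate them.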
NEW QUESTION # 137
What are the minimum permissions needed for a service account used with Google Dataproc?
- A. Write to Google Cloud Storage; read to Google Cloud Logging
- B. Read and write to Google Cloud Storage; write to Google Cloud Logging
- C. Execute to Google Cloud Storage; execute to Google Cloud Logging
- D. Execute to Google Cloud Storage; write to Google Cloud Logging
Answer: B
Explanation:
Service accounts authenticate applications running on your virtual machine instances to other Google Cloud Platform services. For example, if you write an application that reads and writes files on Google Cloud Storage, it must first authenticate to the Google Cloud Storage API. At a minimum, service accounts used with Cloud Dataproc need permissions to read and write to Google Cloud Storage, and to write to Google Cloud Logging.
NEW QUESTION # 138
You receive data files in CSV format monthly from a third party. You need to cleanse this data, but every third month the schema of the files changes. Your requirements for implementing these transformations include:
Executing the transformations on a schedule
Enabling non-developer analysts to modify transformations
Providing a graphical tool for designing transformations
What should you do?
- A. Help the analysts write a Cloud Dataflow pipeline in Python to perform the transformation. The Python code should be stored in a revision control system and modified as the incoming data's schema changes
- B. Load each month's CSV data into BigQuery, and write a SQL query to transform the data to a standard schema. Merge the transformed tables together with a SQL query
- C. Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis
- D. Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file before creating a Dataframe. Then implement the transformations in Spark SQL before writing the data out to Cloud Storage and loading into BigQuery
Answer: C
Explanation:
You can use Cloud Dataprep when the target schema changes continuously.
In general, a target consists of the set of information required to define the expected data in a dataset. Often referred to as a "schema," this target schema information can include:
Names of columns
Order of columns
Column data types
Data type format
Example rows of data
A dataset associated with a target is expected to conform to the requirements of the schema. Where there are differences between target schema and dataset schema, a validation indicator (or schema tag) is displayed.
https://cloud.google.com/dataprep/docs/html/Overview-of-RapidTarget_136155049
NEW QUESTION # 139
You are designing the architecture of your application to store data in Cloud Storage. Your application consists of pipelines that read data from a Cloud Storage bucket that contains raw data, and write the data to a second bucket after processing. You want to design an architecture with Cloud Storage resources that are capable of being resilient if a Google Cloud regional failure occurs. You want to minimize the recovery point objective (RPO) if a failure occurs, with no impact on applications that use the stored data. What should you do?
- A. Adopt a dual-region Cloud Storage bucket, and enable turbo replication in your architecture.
- B. Adopt multi-regional Cloud Storage buckets in your architecture.
- C. Adopt two regional Cloud Storage buckets, and create a daily task to copy from one bucket to the other.
- D. Adopt two regional Cloud Storage buckets, and update your application to write the output on both buckets.
Answer: A
Explanation:
To ensure resilience and minimize the recovery point objective (RPO) with no impact on applications, using a dual-region bucket with turbo replication is the best approach. Here's why option A is the best choice:
Dual-Region Buckets:
Dual-region buckets store data redundantly across two distinct geographic regions, providing high availability and durability.
This setup ensures that data remains available even if one region experiences a failure.
Turbo Replication:
Turbo replication ensures that data is replicated between the two regions within 15 minutes, aligning with the requirement to minimize the recovery point objective (RPO).
This feature provides near real-time replication, significantly reducing the risk of data loss.
No Impact on Applications:
Applications continue to access the dual-region bucket without any changes, ensuring seamless operation even during a regional failure.
The dual-region setup transparently handles failover, providing uninterrupted access to data.
Steps to Implement:
Create a Dual-Region Bucket:
Create a dual-region Cloud Storage bucket in the Google Cloud Console, selecting appropriate regions (e.g., us-central1 and us-east1).
Enable Turbo Replication:
Enable turbo replication to ensure rapid data replication between the selected regions.
Configure Applications:
Ensure that applications read and write to the dual-region bucket, benefiting from its high availability and durability.
Test Failover:
Simulate a regional failure to verify that the dual-region bucket and turbo replication meet the required RPO and ensure data resilience.
Reference:
Google Cloud Storage Dual-Region
Turbo Replication in Google Cloud Storage
NEW QUESTION # 140
You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action. Company security policies require that Cloud Dataproc nodes do not have access to the Internet, so public initialization actions cannot fetch resources. What should you do?
- A. Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter
- B. Use an SSH tunnel to give the Cloud Dataproc cluster access to the Internet
- C. Deploy the Cloud SQL Proxy on the Cloud Dataproc master
- D. Use Resource Manager to add the service account used by the Cloud Dataproc cluster to the Network User role
Answer: A
NEW QUESTION # 141
Cloud Bigtable is a recommended option for storing very large amounts of ____________________________?
- A. multi-keyed data with very low latency
- B. single-keyed data with very low latency
- C. multi-keyed data with very high latency
- D. single-keyed data with very high latency
Answer: B
Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key. Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.
NEW QUESTION # 142
The Dataflow SDKs have been recently transitioned into which Apache service?
- A. Apache Hadoop
- B. Apache Kafka
- C. Apache Beam
- D. Apache Spark
Answer: C
Explanation:
The Dataflow SDKs were transitioned to Apache Beam, and Dataflow pipelines are now written with the Apache Beam SDKs.
NEW QUESTION # 143
Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the events data via an ODBC connection. You need to ensure the applications can connect. Which two actions should you take? (Choose two.)
- A. Create a service account for the ODBC connection to use for authentication
- B. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared "events"
- C. Create a new view over events using standard SQL
- D. Create a new view over events_partitioned using standard SQL
- E. Create a new partitioned table using a standard SQL query
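Answer: A,D
Explanation:
A service account gives the ODBC connection credentials to authenticate with. A view defined in legacy SQL cannot be queried with standard SQL, so the new standard SQL view has to be created over events_partitioned rather than over the legacy events view.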
NEW QUESTION # 144
You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?
- A. The message body for the sensor event is too large.
- B. Your custom endpoint has an out-of-date SSL certificate.
- C. The Cloud Pub/Sub topic has too many messages published to it.
- D. Your custom endpoint is not acknowledging messages within the acknowledgement deadline.
Answer: D
Explanation:
Pub/Sub redelivers any message that is not acknowledged within the acknowledgement deadline. If the endpoint takes too long to respond, each such message is sent again, which shows up as duplicates.
https://cloud.google.com/pubsub/docs/troubleshooting#dupes
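One way to avoid the problem is to acknowledge quickly and make the handler idempotent. The sketch below assumes the standard push envelope (a JSON body with a `message` object that carries `messageId` and base64-encoded `data`); the `handle_push` function, the in-memory set, and the queue are hypothetical stand-ins for a real web handler, a durable store, and a background worker.

```python
import base64
import json
from collections import deque

processed_ids = set()  # stand-in for a durable store of handled message IDs
work_queue = deque()   # stand-in for a background worker queue


def handle_push(body: bytes) -> int:
    """Handle one push delivery and return the HTTP status code to respond with.

    Returning a success status acknowledges the message, so the slow work is
    queued instead of being done before the response is sent.
    """
    envelope = json.loads(body)
    message = envelope["message"]
    message_id = message["messageId"]
    if message_id in processed_ids:
        return 204  # redelivery: acknowledge again, but do no extra work
    processed_ids.add(message_id)
    work_queue.append(base64.b64decode(message.get("data", "")))
    return 204
```

Responding before the long-running work starts keeps the endpoint inside the acknowledgement deadline, and the ID check keeps any redelivery that still occurs from being processed twice.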
NEW QUESTION # 145
If you're running a performance test that depends upon Cloud Bigtable, all the choices except one below are recommended steps. Which is NOT a recommended step to follow?
- A. Use at least 300 GB of data.
- B. Before you test, run a heavy pre-test for several minutes.
- C. Run your test for at least 10 minutes.
- D. Do not use a production instance.
Answer: D
Explanation:
If you're running a performance test that depends upon Cloud Bigtable, be sure to follow these steps as you plan and execute your test:
Use a production instance. A development instance will not give you an accurate sense of how a production instance performs under load.
Use at least 300 GB of data. Cloud Bigtable performs best with 1 TB or more of data. However, 300 GB of data is enough to provide reasonable results in a performance test on a 3-node cluster. On larger clusters, use 100 GB of data per node.
Before you test, run a heavy pre-test for several minutes. This step gives Cloud Bigtable a chance to balance data across your nodes based on the access patterns it observes.
Run your test for at least 10 minutes. This step lets Cloud Bigtable further optimize your data, and it helps ensure that you will test reads from disk as well as cached reads from memory.
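The data-size guidance above reduces to simple arithmetic, sketched here as a small helper. The function name `min_test_data_gb` is hypothetical; the figures (at least 300 GB, and 100 GB per node on larger clusters) come from the explanation.

```python
def min_test_data_gb(nodes: int) -> int:
    """Minimum test data size in GB for a cluster with the given node count."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # At least 300 GB overall; on larger clusters, 100 GB per node.
    return max(300, 100 * nodes)
```

For a 3-node cluster this gives 300 GB, and for a 10-node cluster it gives 1000 GB.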
NEW QUESTION # 146
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
- A. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
- B. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
- C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
- D. Create a Google Cloud Dataflow job to process the data.
- E. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
Answer: B
NEW QUESTION # 147
Your team is building a data lake platform on Google Cloud. As a part of the data foundation design, you are planning to store all the raw data in Cloud Storage. You are expecting to ingest approximately 25 GB of data a day and your billing department is worried about the increasing cost of storing old data. The current business requirements are:
* The old data can be deleted anytime
* You plan to use the visualization layer for current and historical reporting
* The old data should be available instantly when accessed
* There should not be any charges for data retrieval.
What should you do to optimize for cost?
- A. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 45 days to coldline, and 60 days to archive storage class. Delete old data as needed.
- B. Create the bucket with the Autoclass storage class feature.
- C. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 90 days to coldline, and 365 days to archive storage class. Delete old data as needed.
- D. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to coldline, 90 days to nearline, and 365 days to archive storage class. Delete old data as needed.
Answer: B
Explanation:
- Autoclass automatically moves objects between storage classes without impacting performance or availability and without incurring retrieval costs.
- It continuously optimizes storage costs based on access patterns without the need to set specific lifecycle management policies.