> For the complete documentation index, see [llms.txt](https://docs.pentaho.com/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.pentaho.com/install/10.2-install/pentaho-configuration/tasks-to-be-performed-by-an-it-administrator/set-up-the-pentaho-server-to-connect-to-a-hadoop-cluster/additional-configurations-for-specific-distributions-connecting-to-a-cluster/advanced-settings-for-connecting-to-a-amazon-emr-cluster/before-you-begin-connect-to-emr-cluster.md).

# Before you begin

Before you begin setting up Pentaho to connect to an Amazon EMR cluster, perform the following tasks.

1. Check the [Components Reference](/install/10.2-install/components-reference.md) to verify that your Pentaho version supports your version of the Amazon EMR cluster.
2. Prepare your Amazon EMR cluster by performing the following tasks:
   1. Configure an Amazon EC2 cluster.

      View the [Amazon's documentation](https://aws.amazon.com/documentation/elastic-mapreduce/) if you need help.
   2. Install any required services and service client tools.
   3. Test the cluster.
3. Install PDI on an Amazon EC2 instance that is within the same Amazon Virtual Private Cloud (VPC) as the Amazon EMR cluster.

   **Note:** As a best practice, you should install PDI on your Amazon EC2 instance. Otherwise, you might not be able to write or read files to or from the cluster. To resolve this issue, see [Unable to read or write files to HDFS on the Amazon EMR cluster](/install/10.2-install/use-hadoop-with-pentaho/big-data-issues/unable-to-read-or-write-files-to-hdfs-on-amazon-emr-cluster.md).
4. Get the connection information for the cluster and services that you intend to use from your Hadoop administrator. Some of this information may be available from a cluster management tool. You also need to [supply some of this information to users](/install/10.2-install/use-hadoop-with-pentaho/get-started-with-hadoop-and-pdi/before-you-begin-get-started-with-hadoop-and-pdi/hadoop-connection-and-access-information-list.md) after you are finished.
5. Add the YARN user on the cluster to the group defined by **dfs.permissions.superusergroup** property. The **dfs.permissions.superusergroup** property can be found in `hdfs-site.xml` file on your cluster or in the cluster management application.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://docs.pentaho.com/install/10.2-install/pentaho-configuration/tasks-to-be-performed-by-an-it-administrator/set-up-the-pentaho-server-to-connect-to-a-hadoop-cluster/additional-configurations-for-specific-distributions-connecting-to-a-cluster/advanced-settings-for-connecting-to-a-amazon-emr-cluster/before-you-begin-connect-to-emr-cluster.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.